{"id":1659,"date":"2019-01-31T18:00:52","date_gmt":"2019-01-31T18:00:52","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/31\/how-to-develop-deep-learning-neural-networks-with-greedy-layer-wise-pretraining\/"},"modified":"2019-01-31T18:00:52","modified_gmt":"2019-01-31T18:00:52","slug":"how-to-develop-deep-learning-neural-networks-with-greedy-layer-wise-pretraining","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/31\/how-to-develop-deep-learning-neural-networks-with-greedy-layer-wise-pretraining\/","title":{"rendered":"How to Develop Deep Learning Neural Networks With Greedy Layer-Wise Pretraining"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Training deep neural networks was traditionally challenging as the vanishing gradient problem meant that weights in layers close to the input layer were not updated in response to errors calculated on the training dataset.<\/p>\n<p>An innovation and important milestone in the field of deep learning was greedy layer-wise pretraining, which allowed very deep neural networks to be successfully trained, achieving then state-of-the-art performance.<\/p>\n<p>In this tutorial, you will discover greedy layer-wise pretraining as a technique for developing deep multi-layered neural network models.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Greedy layer-wise pretraining provides a way to develop deep multi-layered neural networks whilst only ever training shallow networks.<\/li>\n<li>Pretraining can be used to iteratively deepen a supervised model or an unsupervised model that can be repurposed as a supervised model.<\/li>\n<li>Pretraining may be useful for problems with small amounts of labeled data and large amounts of unlabeled data.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6936\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full 
wp-image-6936\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/02\/How-to-Develop-Deep-Neural-Networks-With-Greedy-Layer-Wise-Pretraining.jpg\" alt=\"How to Develop Deep Neural Networks With Greedy Layer-Wise Pretraining\" width=\"640\" height=\"427\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/02\/How-to-Develop-Deep-Neural-Networks-With-Greedy-Layer-Wise-Pretraining.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/02\/How-to-Develop-Deep-Neural-Networks-With-Greedy-Layer-Wise-Pretraining-300x200.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Develop Deep Neural Networks With Greedy Layer-Wise Pretraining<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/30478819@N08\/42342676291\/\">Marco Verch<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into four parts; they are:<\/p>\n<ol>\n<li>Greedy Layer-Wise Pretraining<\/li>\n<li>Multi-Class Classification Problem<\/li>\n<li>Supervised Greedy Layer-Wise Pretraining<\/li>\n<li>Unsupervised Greedy Layer-Wise Pretraining<\/li>\n<\/ol>\n<h2>Greedy Layer-Wise Pretraining<\/h2>\n<p>Traditionally, training deep neural networks with many layers was challenging.<\/p>\n<p>As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced. This means that weights in hidden layers close to the output layer are updated normally, whereas weights in hidden layers close to the input layer are updated minimally or not at all. 
Generally, this problem prevented the training of very deep neural networks and was referred to as the <em>vanishing gradient problem<\/em>.<\/p>\n<p>An important milestone in the resurgence of neural networks that initially allowed the development of deeper neural network models was the technique of greedy layer-wise pretraining, often simply referred to as \u201c<em>pretraining<\/em>.\u201d<\/p>\n<blockquote>\n<p>The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures.<\/p>\n<\/blockquote>\n<p>\u2014 Page 528, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Pretraining involves successively adding a new hidden layer to a model and refitting, allowing the newly added layer to learn from the outputs of the existing hidden layers, often while keeping the weights for the existing hidden layers fixed. This gives the technique the name \u201c<em>layer-wise<\/em>\u201d as the model is trained one layer at a time.<\/p>\n<p>The technique is referred to as \u201c<em>greedy<\/em>\u201d because of the piecewise or layer-wise approach to solving the harder problem of training a deep network. As an optimization process, dividing the training process into a succession of layer-wise training processes is seen as a greedy shortcut that likely leads to an aggregate of locally optimal solutions, a shortcut to a good enough global solution.<\/p>\n<blockquote>\n<p>Greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation. 
Unfortunately, combining the individually optimal components is not guaranteed to yield an optimal complete solution.<\/p>\n<\/blockquote>\n<p>\u2014 Page 323, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Pretraining is based on the assumption that it is easier to train a shallow network than a deep network, and it contrives a layer-wise training process in which we are only ever fitting a shallow model.<\/p>\n<blockquote>\n<p>\u2026 builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts.<\/p>\n<\/blockquote>\n<p>\u2014 Page 529, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>The key benefits of pretraining are:<\/p>\n<ul>\n<li>Simplified training process.<\/li>\n<li>Facilitates the development of deeper networks.<\/li>\n<li>Useful as a weight initialization scheme.<\/li>\n<li>Perhaps lower generalization error.<\/li>\n<\/ul>\n<blockquote>\n<p>In general, pretraining may help both in terms of optimization and in terms of generalization.<\/p>\n<\/blockquote>\n<p>\u2014 Page 325, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>There are two main approaches to pretraining; they are:<\/p>\n<ul>\n<li>Supervised greedy layer-wise pretraining.<\/li>\n<li>Unsupervised greedy layer-wise pretraining.<\/li>\n<\/ul>\n<p>Broadly, supervised pretraining involves successively adding hidden layers to a model trained on a supervised learning task. Unsupervised pretraining involves using the greedy layer-wise process to build up an unsupervised autoencoder model, to which a supervised output layer is later added.<\/p>\n<blockquote>\n<p>It is common to use the word \u201cpretraining\u201d to refer not only to the pretraining stage itself but to the entire two phase protocol that combines the pretraining phase and a supervised learning phase. 
The supervised learning phase may involve training a simple classifier on top of the features learned in the pretraining phase, or it may involve supervised fine-tuning of the entire network learned in the pretraining phase.<\/p>\n<\/blockquote>\n<p>\u2014 Page 529, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Unsupervised pretraining may be appropriate when you have a significantly larger number of unlabeled examples that can be used to initialize a model prior to using a much smaller number of examples to fine tune the model weights for a supervised task.<\/p>\n<blockquote>\n<p>\u2026. we can expect unsupervised pretraining to be most helpful when the number of labeled examples is very small. Because the source of information added by unsupervised pretraining is the unlabeled data, we may also expect unsupervised pretraining to perform best when the number of unlabeled examples is very large.<\/p>\n<\/blockquote>\n<p>\u2014 Page 532, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Although the weights in prior layers are held constant, it is common to fine tune all weights in the network at the end after the addition of the final layer. As such, this allows pretraining to be considered a type of weight initialization method.<\/p>\n<blockquote>\n<p>\u2026 it makes use of the idea that the choice of initial parameters for a deep neural network can have a significant regularizing effect on the model (and, to a lesser extent, that it can improve optimization).<\/p>\n<\/blockquote>\n<p>\u2014 Page 530-531, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Greedy layer-wise pretraining is an important milestone in the history of deep learning, that allowed the early development of networks with more hidden layers than was previously possible. 
The approach can be useful on some problems; for example, it is common practice to use unsupervised pretraining for text data in order to provide a richer distributed representation of words and their interrelationships via <a href=\"https:\/\/machinelearningmastery.com\/what-are-word-embeddings\/\">word2vec<\/a>.<\/p>\n<blockquote>\n<p>Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing [\u2026] the advantage of pretraining is that one can pretrain once on a huge unlabeled set (for example with a corpus containing billions of words), learn a good representation (typically of words, but also of sentences), and then use this representation or fine-tune it for a supervised task for which the training set contains substantially fewer examples.<\/p>\n<\/blockquote>\n<p>\u2014 Page 535, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Nevertheless, it is likely that better performance may be achieved using modern methods such as better activation functions, weight initialization, variants of gradient descent, and regularization methods.<\/p>\n<blockquote>\n<p>Today, we now know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but the unsupervised pretraining approach was the first method to succeed.<\/p>\n<\/blockquote>\n<p>\u2014 Page 528, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<h2>Multi-Class Classification Problem<\/h2>\n<p>We will use a small multi-class classification problem as the basis to demonstrate the effect of greedy layer-wise pretraining on model performance.<\/p>\n<p>The scikit-learn library provides the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() function<\/a> that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.<\/p>\n<p>The problem will be configured with two input variables (to represent the <em>x<\/em> and <em>y<\/em> coordinates of the points) and a standard deviation of 2.0 for points within each group. 
We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)<\/pre>\n<p>The results are the input and output elements of a dataset that we can model.<\/p>\n<p>In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of blobs dataset\r\nfrom sklearn.datasets import make_blobs\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# scatter plot for each class value\r\nfor class_value in range(3):\r\n\t# select indices of points with the class label\r\n\trow_ix = where(y == class_value)\r\n\t# scatter plot for points with a different color\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1])\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot of the entire dataset. 
We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line), causing many ambiguous points.<\/p>\n<p>This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different \u201cgood enough\u201d candidate solutions.<\/p>\n<div id=\"attachment_6933\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6933\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-2.png\" alt=\"Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-2.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-2-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-2-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-2-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value<\/p>\n<\/div>\n<h2>Supervised Greedy Layer-Wise Pretraining<\/h2>\n<p>In this section, we will use greedy layer-wise supervised learning to build up a deep Multilayer Perceptron (MLP) model for the blobs supervised learning multi-class classification 
problem.<\/p>\n<p>Pretraining is not required to address this simple predictive modeling problem. Instead, this is a demonstration of how to perform supervised greedy layer-wise pretraining that can be used as a template for larger and more challenging supervised learning problems.<\/p>\n<p>As a first step, we can develop a function to create 1,000 samples from the problem and split them evenly into train and test datasets. The <em>prepare_data()<\/em> function below implements this and returns the train and test sets in terms of the input and output components.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare the dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, testX, trainy, testy<\/pre>\n<p>We can call this function to prepare the data.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare data\r\ntrainX, testX, trainy, testy = prepare_data()<\/pre>\n<p>Next, we can train and fit a base model.<\/p>\n<p>This will be an MLP that expects two inputs for the two input variables in the dataset and has one hidden layer with 10 nodes and uses the rectified linear activation function. The output layer has three nodes in order to predict the probability for each of the three classes and uses the softmax activation function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))<\/pre>\n<p>The model is fit using stochastic gradient descent with the sensible default learning rate of 0.01 and a high momentum value of 0.9. 
The model is optimized using cross entropy loss.<\/p>\n<pre class=\"crayon-plain-tag\"># compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The model is then fit on the training dataset for 100 epochs with a default batch size of 32 examples.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nmodel.fit(trainX, trainy, epochs=100, verbose=0)<\/pre>\n<p>The <em>get_base_model()<\/em> function below ties these elements together, taking the training dataset as arguments and returning a fit baseline model.<\/p>\n<pre class=\"crayon-plain-tag\"># define and fit the base model\r\ndef get_base_model(trainX, trainy):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, momentum=0.9)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\tmodel.fit(trainX, trainy, epochs=100, verbose=0)\r\n\treturn model<\/pre>\n<p>We can call this function to prepare the base model to which we can later add layers one at a time.<\/p>\n<pre class=\"crayon-plain-tag\"># get the base model\r\nmodel = get_base_model(trainX, trainy)<\/pre>\n<p>We need to be able to easily evaluate the performance of a model on the train and test sets.<\/p>\n<p>The <em>evaluate_model()<\/em> function below takes the train and test sets as arguments as well as a model and returns the accuracy on both datasets.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a fit model\r\ndef evaluate_model(model, trainX, testX, trainy, testy):\r\n\t_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\treturn train_acc, test_acc<\/pre>\n<p>We can call this function to calculate and report the accuracy of the base model and store the scores 
away in a dictionary against the number of layers in the model (currently two, one hidden and one output layer) so we can plot the relationship between layers and accuracy later.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the base model\r\nscores = dict()\r\ntrain_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)\r\nprint('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\n# store scores for plotting\r\nscores[len(model.layers)] = (train_acc, test_acc)<\/pre>\n<p>We can now outline the process of greedy layer-wise pretraining.<\/p>\n<p>A function is required that can add a new hidden layer and retrain the model but only update the weights in the newly added layer and in the output layer.<\/p>\n<p>This requires first storing the current output layer, including its configuration and current set of weights.<\/p>\n<pre class=\"crayon-plain-tag\"># remember the current output layer\r\noutput_layer = model.layers[-1]<\/pre>\n<p>Then, we remove the output layer from the stack of layers in the model.<\/p>\n<pre class=\"crayon-plain-tag\"># remove the output layer\r\nmodel.pop()<\/pre>\n<p>All of the remaining layers in the model can then be marked as non-trainable, meaning that their weights cannot be updated when the <em>fit()<\/em> function is called again.<\/p>\n<pre class=\"crayon-plain-tag\"># mark all remaining layers as non-trainable\r\nfor layer in model.layers:\r\n\tlayer.trainable = False<\/pre>\n<p>We can then add a new hidden layer, in this case with the same configuration as the first hidden layer added in the base model.<\/p>\n<pre class=\"crayon-plain-tag\"># add a new hidden layer\r\nmodel.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))<\/pre>\n<p>Finally, the output layer can be added back and the model can be refit on the training dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># re-add the output layer\r\nmodel.add(output_layer)\r\n# fit model\r\nmodel.fit(trainX, trainy, epochs=100, verbose=0)<\/pre>\n<p>We can tie all of these elements into a function named 
<em>add_layer()<\/em> that takes the model and the training dataset as arguments.<\/p>\n<pre class=\"crayon-plain-tag\"># add one new layer and re-train only the new layer\r\ndef add_layer(model, trainX, trainy):\r\n\t# remember the current output layer\r\n\toutput_layer = model.layers[-1]\r\n\t# remove the output layer\r\n\tmodel.pop()\r\n\t# mark all remaining layers as non-trainable\r\n\tfor layer in model.layers:\r\n\t\tlayer.trainable = False\r\n\t# add a new hidden layer\r\n\tmodel.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))\r\n\t# re-add the output layer\r\n\tmodel.add(output_layer)\r\n\t# fit model\r\n\tmodel.fit(trainX, trainy, epochs=100, verbose=0)<\/pre>\n<p>This function can then be called repeatedly based on the number of layers we wish to add to the model.<\/p>\n<p>In this case, we will add 10 layers, one at a time, and evaluate the performance of the model after each additional layer is added to get an idea of how it is impacting performance.<\/p>\n<p>Train and test accuracy scores are stored in the dictionary against the number of layers in the model.<\/p>\n<pre class=\"crayon-plain-tag\"># add layers and evaluate the updated model\r\nn_layers = 10\r\nfor i in range(n_layers):\r\n\t# add layer\r\n\tadd_layer(model, trainX, trainy)\r\n\t# evaluate model\r\n\ttrain_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)\r\n\tprint('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\n\t# store scores for plotting\r\n\tscores[len(model.layers)] = (train_acc, test_acc)<\/pre>\n<p>At the end of the run, a line plot is created showing the number of layers in the model (x-axis) compared to the model accuracy on the train and test datasets (y-axis).<\/p>\n<p>We would expect the addition of layers to improve the performance of the model on the training dataset and perhaps even on the test dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># plot number of added layers vs 
accuracy\r\npyplot.plot(scores.keys(), [scores[k][0] for k in scores.keys()], label='train', marker='.')\r\npyplot.plot(scores.keys(), [scores[k][1] for k in scores.keys()], label='test', marker='.')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying all of these elements together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># supervised greedy layer-wise pretraining for blobs classification problem\r\nfrom sklearn.datasets import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare the dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, testX, trainy, testy\r\n\r\n# define and fit the base model\r\ndef get_base_model(trainX, trainy):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, momentum=0.9)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\tmodel.fit(trainX, trainy, epochs=100, verbose=0)\r\n\treturn model\r\n\r\n# evaluate a fit model\r\ndef evaluate_model(model, trainX, testX, trainy, testy):\r\n\t_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\treturn train_acc, test_acc\r\n\r\n# add one new layer and re-train only the new layer\r\ndef add_layer(model, trainX, trainy):\r\n\t# remember the current 
output layer\r\n\toutput_layer = model.layers[-1]\r\n\t# remove the output layer\r\n\tmodel.pop()\r\n\t# mark all remaining layers as non-trainable\r\n\tfor layer in model.layers:\r\n\t\tlayer.trainable = False\r\n\t# add a new hidden layer\r\n\tmodel.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))\r\n\t# re-add the output layer\r\n\tmodel.add(output_layer)\r\n\t# fit model\r\n\tmodel.fit(trainX, trainy, epochs=100, verbose=0)\r\n\r\n# prepare data\r\ntrainX, testX, trainy, testy = prepare_data()\r\n# get the base model\r\nmodel = get_base_model(trainX, trainy)\r\n# evaluate the base model\r\nscores = dict()\r\ntrain_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)\r\nprint('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\nscores[len(model.layers)] = (train_acc, test_acc)\r\n# add layers and evaluate the updated model\r\nn_layers = 10\r\nfor i in range(n_layers):\r\n\t# add layer\r\n\tadd_layer(model, trainX, trainy)\r\n\t# evaluate model\r\n\ttrain_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)\r\n\tprint('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\n\t# store scores for plotting\r\n\tscores[len(model.layers)] = (train_acc, test_acc)\r\n# plot number of added layers vs accuracy\r\npyplot.plot(scores.keys(), [scores[k][0] for k in scores.keys()], label='train', marker='.')\r\npyplot.plot(scores.keys(), [scores[k][1] for k in scores.keys()], label='test', marker='.')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example reports the classification accuracy on the train and test sets for the base model (two layers), then after each additional layer is added (from three to 12 layers).<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.<\/p>\n<p>In this case, we can see that the baseline model does reasonably well on this problem. 
As the layers are increased, we can roughly see an increase in accuracy for the model on the training dataset, likely as it is beginning to overfit the data. We see a rough drop in classification accuracy on the test dataset, likely because of the overfitting.<\/p>\n<pre class=\"crayon-plain-tag\">> layers=2, train=0.816, test=0.830\r\n> layers=3, train=0.834, test=0.830\r\n> layers=4, train=0.836, test=0.824\r\n> layers=5, train=0.830, test=0.824\r\n> layers=6, train=0.848, test=0.820\r\n> layers=7, train=0.830, test=0.826\r\n> layers=8, train=0.850, test=0.824\r\n> layers=9, train=0.840, test=0.838\r\n> layers=10, train=0.842, test=0.830\r\n> layers=11, train=0.850, test=0.830\r\n> layers=12, train=0.850, test=0.826<\/pre>\n<p>A line plot is also created showing the train (blue) and test set (orange) accuracy as each additional layer is added to the model.<\/p>\n<p>In this case, the plot suggests a slight overfitting of the training dataset, but perhaps better test set performance after seven added layers.<\/p>\n<div id=\"attachment_6934\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6934\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Supervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem.png\" alt=\"Line Plot for Supervised Greedy Layer-Wise Pretraining Showing Model Layers vs Train and Test Set Classification Accuracy on the Blobs Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Supervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Supervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Supervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Supervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot for Supervised Greedy Layer-Wise Pretraining Showing Model Layers vs Train and Test Set Classification Accuracy on the Blobs Classification Problem<\/p>\n<\/div>\n<p>An interesting extension to this example would be to allow all weights in the model to be fine tuned with a small learning rate for a large number of training epochs to see if this can further reduce generalization error.<\/p>\n<h2>Unsupervised Greedy Layer-Wise Pretraining<\/h2>\n<p>In this section, we will explore using greedy layer-wise pretraining with an unsupervised model.<\/p>\n<p>Specifically, we will develop an autoencoder model that will be trained to reconstruct input data. In order to use this unsupervised model for classification, we will remove the output layer, add and fit a new output layer for classification.<\/p>\n<p>This is slightly more complex than the previous supervised greedy layer-wise pretraining, but we can reuse many of the same ideas and code from the previous section.<\/p>\n<p>The first step is to define, fit, and evaluate an autoencoder model. 
We will use the same two-layer base model as we did in the previous section, except modify it to predict the input as the output and use mean squared error to evaluate how good the model is at reconstructing a given input sample.<\/p>\n<p>The <em>base_autoencoder()<\/em> function below implements this, taking the train and test sets as arguments, then defines, fits, and evaluates the base unsupervised autoencoder model, printing the reconstruction error on the train and test sets and returning the model.<\/p>\n<pre class=\"crayon-plain-tag\"># define, fit and evaluate the base autoencoder\r\ndef base_autoencoder(trainX, testX):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(2, activation='linear'))\r\n\t# compile model\r\n\tmodel.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9))\r\n\t# fit model\r\n\tmodel.fit(trainX, trainX, epochs=100, verbose=0)\r\n\t# evaluate reconstruction loss\r\n\ttrain_mse = model.evaluate(trainX, trainX, verbose=0)\r\n\ttest_mse = model.evaluate(testX, testX, verbose=0)\r\n\tprint('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))\r\n\treturn model<\/pre>\n<p>We can call this function in order to prepare our base autoencoder to which we can add and greedily train layers.<\/p>\n<pre class=\"crayon-plain-tag\"># get the base autoencoder\r\nmodel = base_autoencoder(trainX, testX)<\/pre>\n<p>Evaluating an autoencoder model on the blobs multi-class classification problem requires a few steps.<\/p>\n<p>The hidden layers will be used as the basis of a classifier with a new output layer that must be trained then used to make predictions before adding back the original output layer so that we can continue to add layers to the autoencoder.<\/p>\n<p>The first step is to reference, then remove the output layer of the autoencoder model.<\/p>\n<pre class=\"crayon-plain-tag\"># remember the current output 
layer\r\noutput_layer = model.layers[-1]\r\n# remove the output layer\r\nmodel.pop()<\/pre>\n<p>All of the remaining hidden layers in the autoencoder must be marked as non-trainable so that the weights are not changed when we train the new output layer.<\/p>\n<pre class=\"crayon-plain-tag\"># mark all remaining layers as non-trainable\r\nfor layer in model.layers:\r\n\tlayer.trainable = False<\/pre>\n<p>We can now add a new output layer that predicts the probability of an example belonging to each of the three classes. The model must also be re-compiled using a new loss function suitable for multi-class classification.<\/p>\n<pre class=\"crayon-plain-tag\"># add new output layer\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nmodel.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['acc'])<\/pre>\n<p>The model can then be re-fit on the training dataset, specifically training the output layer on how to make class predictions using the learned features from the autoencoder as input.<\/p>\n<p>The classification accuracy of the fit model can then be evaluated on the train and test datasets.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nmodel.fit(trainX, trainy, epochs=100, verbose=0)\r\n# evaluate model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)<\/pre>\n<p>Finally, we can put the autoencoder back together by removing the classification output layer, adding back the original autoencoder output layer, and recompiling the model with an appropriate loss function for reconstruction.<\/p>\n<pre class=\"crayon-plain-tag\"># put the model back together\r\nmodel.pop()\r\nmodel.add(output_layer)\r\nmodel.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9))<\/pre>\n<p>We can tie this together into an <em>evaluate_autoencoder_as_classifier()<\/em> function that takes the model as well as the train and test sets, then returns the train and 
test set classification accuracy.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the autoencoder as a classifier\r\ndef evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy):\r\n\t# remember the current output layer\r\n\toutput_layer = model.layers[-1]\r\n\t# remove the output layer\r\n\tmodel.pop()\r\n\t# mark all remaining layers as non-trainable\r\n\tfor layer in model.layers:\r\n\t\tlayer.trainable = False\r\n\t# add new output layer\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['acc'])\r\n\t# fit model\r\n\tmodel.fit(trainX, trainy, epochs=100, verbose=0)\r\n\t# evaluate model\r\n\t_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\t# put the model back together\r\n\tmodel.pop()\r\n\tmodel.add(output_layer)\r\n\tmodel.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9))\r\n\treturn train_acc, test_acc<\/pre>\n<p>This function can be called to evaluate the baseline autoencoder model and then store the accuracy scores in a dictionary against the number of layers in the model (in this case two).<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the base model\r\nscores = dict()\r\ntrain_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)\r\nprint('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\nscores[len(model.layers)] = (train_acc, test_acc)<\/pre>\n<p>We are now ready to define the process for adding and pretraining layers to the model.<\/p>\n<p>The process for adding layers is much the same as the supervised case in the previous section, except we are optimizing reconstruction loss rather than classification accuracy for the new layer.<\/p>\n<p>The <em>add_layer_to_autoencoder()<\/em> function below adds a new hidden layer to the autoencoder model, updates the 
weights for the new layer and the output layer, then reports the reconstruction error on the train and test input data. The function re-marks all prior layers as non-trainable, which is redundant because we already did this in the <em>evaluate_autoencoder_as_classifier()<\/em> function, but I have left it in, in case you decide to reuse this function in your own project.<\/p>\n<pre class=\"crayon-plain-tag\"># add one new layer and re-train only the new layer\r\ndef add_layer_to_autoencoder(model, trainX, testX):\r\n\t# remember the current output layer\r\n\toutput_layer = model.layers[-1]\r\n\t# remove the output layer\r\n\tmodel.pop()\r\n\t# mark all remaining layers as non-trainable\r\n\tfor layer in model.layers:\r\n\t\tlayer.trainable = False\r\n\t# add a new hidden layer\r\n\tmodel.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))\r\n\t# re-add the output layer\r\n\tmodel.add(output_layer)\r\n\t# fit model\r\n\tmodel.fit(trainX, trainX, epochs=100, verbose=0)\r\n\t# evaluate reconstruction loss\r\n\ttrain_mse = model.evaluate(trainX, trainX, verbose=0)\r\n\ttest_mse = model.evaluate(testX, testX, verbose=0)\r\n\tprint('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))<\/pre>\n<p>We can now repeatedly call this function, adding layers, and evaluating the effect by using the autoencoder as the basis for evaluating a new classifier.<\/p>\n<pre class=\"crayon-plain-tag\"># add layers and evaluate the updated model\r\nn_layers = 5\r\nfor _ in range(n_layers):\r\n\t# add layer\r\n\tadd_layer_to_autoencoder(model, trainX, testX)\r\n\t# evaluate model\r\n\ttrain_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)\r\n\tprint('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\n\t# store scores for plotting\r\n\tscores[len(model.layers)] = (train_acc, test_acc)<\/pre>\n<p>As before, all accuracy scores are collected and we can use them 
to create a line graph of the number of model layers vs train and test set accuracy.<\/p>\n<pre class=\"crayon-plain-tag\"># plot number of added layers vs accuracy\r\nkeys = list(scores.keys())\r\npyplot.plot(keys, [scores[k][0] for k in keys], label='train', marker='.')\r\npyplot.plot(keys, [scores[k][1] for k in keys], label='test', marker='.')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying all of this together, the complete example of unsupervised greedy layer-wise pretraining for the blobs multi-class classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># unsupervised greedy layer-wise pretraining for blobs classification problem\r\nfrom sklearn.datasets import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare the dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, testX, trainy, testy\r\n\r\n# define, fit and evaluate the base autoencoder\r\ndef base_autoencoder(trainX, testX):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(2, activation='linear'))\r\n\t# compile model\r\n\tmodel.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9))\r\n\t# fit model\r\n\tmodel.fit(trainX, trainX, epochs=100, verbose=0)\r\n\t# evaluate reconstruction loss\r\n\ttrain_mse = model.evaluate(trainX, trainX, verbose=0)\r\n\ttest_mse = model.evaluate(testX, testX, verbose=0)\r\n\tprint('> reconstruction error train=%.3f, 
test=%.3f' % (train_mse, test_mse))\r\n\treturn model\r\n\r\n# evaluate the autoencoder as a classifier\r\ndef evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy):\r\n\t# remember the current output layer\r\n\toutput_layer = model.layers[-1]\r\n\t# remove the output layer\r\n\tmodel.pop()\r\n\t# mark all remaining layers as non-trainable\r\n\tfor layer in model.layers:\r\n\t\tlayer.trainable = False\r\n\t# add new output layer\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['acc'])\r\n\t# fit model\r\n\tmodel.fit(trainX, trainy, epochs=100, verbose=0)\r\n\t# evaluate model\r\n\t_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\t# put the model back together\r\n\tmodel.pop()\r\n\tmodel.add(output_layer)\r\n\tmodel.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9))\r\n\treturn train_acc, test_acc\r\n\r\n# add one new layer and re-train only the new layer\r\ndef add_layer_to_autoencoder(model, trainX, testX):\r\n\t# remember the current output layer\r\n\toutput_layer = model.layers[-1]\r\n\t# remove the output layer\r\n\tmodel.pop()\r\n\t# mark all remaining layers as non-trainable\r\n\tfor layer in model.layers:\r\n\t\tlayer.trainable = False\r\n\t# add a new hidden layer\r\n\tmodel.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))\r\n\t# re-add the output layer\r\n\tmodel.add(output_layer)\r\n\t# fit model\r\n\tmodel.fit(trainX, trainX, epochs=100, verbose=0)\r\n\t# evaluate reconstruction loss\r\n\ttrain_mse = model.evaluate(trainX, trainX, verbose=0)\r\n\ttest_mse = model.evaluate(testX, testX, verbose=0)\r\n\tprint('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))\r\n\r\n# prepare data\r\ntrainX, testX, trainy, testy = prepare_data()\r\n# get the base autoencoder\r\nmodel = base_autoencoder(trainX, testX)\r\n# 
evaluate the base model\r\nscores = dict()\r\ntrain_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)\r\nprint('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\nscores[len(model.layers)] = (train_acc, test_acc)\r\n# add layers and evaluate the updated model\r\nn_layers = 5\r\nfor _ in range(n_layers):\r\n\t# add layer\r\n\tadd_layer_to_autoencoder(model, trainX, testX)\r\n\t# evaluate model\r\n\ttrain_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)\r\n\tprint('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))\r\n\t# store scores for plotting\r\n\tscores[len(model.layers)] = (train_acc, test_acc)\r\n# plot number of added layers vs accuracy\r\nkeys = list(scores.keys())\r\npyplot.plot(keys, [scores[k][0] for k in keys], label='train', marker='.')\r\npyplot.plot(keys, [scores[k][1] for k in keys], label='test', marker='.')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example reports both reconstruction error and classification accuracy on the train and test sets for the base model (two layers), then after each additional layer is added (from three to seven layers).<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.<\/p>\n<p>In this case, we can see that reconstruction error starts low, in fact near-perfect, then slowly increases as layers are added. 
Accuracy on the training dataset seems to decrease as layers are added to the encoder, although accuracy on the test set seems to improve as layers are added, at least until the model has five layers, after which performance appears to crash.<\/p>\n<pre class=\"crayon-plain-tag\">> reconstruction error train=0.000, test=0.000\r\n> classifier accuracy layers=2, train=0.830, test=0.832\r\n> reconstruction error train=0.001, test=0.002\r\n> classifier accuracy layers=3, train=0.826, test=0.842\r\n> reconstruction error train=0.002, test=0.002\r\n> classifier accuracy layers=4, train=0.820, test=0.838\r\n> reconstruction error train=0.016, test=0.028\r\n> classifier accuracy layers=5, train=0.828, test=0.834\r\n> reconstruction error train=2.311, test=2.694\r\n> classifier accuracy layers=6, train=0.764, test=0.762\r\n> reconstruction error train=2.192, test=2.526\r\n> classifier accuracy layers=7, train=0.764, test=0.760<\/pre>\n<p>A line plot is also created showing the train (blue) and test set (orange) accuracy as each additional layer is added to the model.<\/p>\n<p>In this case, the plot suggests there may be some minor benefits in the unsupervised greedy layer-wise pretraining, but perhaps beyond five layers the model becomes unstable.<\/p>\n<div id=\"attachment_6935\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6935\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Unsupervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem.png\" alt=\"Line Plot for Unsupervised Greedy Layer-Wise Pretraining Showing Model Layers vs Train and Test Set Classification Accuracy on the Blobs Classification Problem\" width=\"1280\" height=\"960\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Unsupervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Unsupervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Unsupervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-for-Unsupervised-Greedy-Layer-Wise-Pretraining-Showing-Model-Layers-vs-Train-and-Test-Set-Classification-Accuracy-on-the-Blobs-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot for Unsupervised Greedy Layer-Wise Pretraining Showing Model Layers vs Train and Test Set Classification Accuracy on the Blobs Classification Problem<\/p>\n<\/div>\n<p>An interesting extension would be to explore whether fine tuning all weights in the model, either before or after fitting a classifier output layer, improves performance.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/papers.nips.cc\/paper\/3048-greedy-layer-wise-training-of-deep-networks\">Greedy Layer-Wise Training of Deep Networks<\/a>, 2007.<\/li>\n<li><a href=\"http:\/\/www.jmlr.org\/papers\/v11\/erhan10a.html\">Why Does Unsupervised Pre-training Help Deep Learning?<\/a>, 
2010.<\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li>Section 8.7.4 Supervised Pretraining, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/li>\n<li>Section 15.1 Greedy Layer-Wise Unsupervised Pretraining, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered greedy layer-wise pretraining as a technique for developing deep multi-layered neural network models.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Greedy layer-wise pretraining provides a way to develop deep multi-layered neural networks whilst only ever training shallow networks.<\/li>\n<li>Pretraining can be used to iteratively deepen a supervised model or an unsupervised model that can be repurposed as a supervised model.<\/li>\n<li>Pretraining may be useful for problems with small amounts of labeled data and large amounts of unlabeled data.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/greedy-layer-wise-pretraining-tutorial\/\">How to Develop Deep Learning Neural Networks With Greedy Layer-Wise Pretraining<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/greedy-layer-wise-pretraining-tutorial\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Training deep neural networks was traditionally challenging as the vanishing gradient meant that weights in layers close to the input layer were [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/31\/how-to-develop-deep-learning-neural-networks-with-greedy-layer-wise-pretraining\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1660,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1659"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1659"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1659\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1660"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1659"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1659"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1659"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}