{"id":1538,"date":"2019-01-06T18:00:20","date_gmt":"2019-01-06T18:00:20","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/06\/how-to-create-an-equally-linearly-and-exponentially-weighted-average-of-neural-network-model-weights-in-keras\/"},"modified":"2019-01-06T18:00:20","modified_gmt":"2019-01-06T18:00:20","slug":"how-to-create-an-equally-linearly-and-exponentially-weighted-average-of-neural-network-model-weights-in-keras","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/06\/how-to-create-an-equally-linearly-and-exponentially-weighted-average-of-neural-network-model-weights-in-keras\/","title":{"rendered":"How to Create an Equally, Linearly, and Exponentially Weighted Average of Neural Network Model Weights in Keras"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>The training process of neural networks is a challenging optimization process that can often fail to converge.<\/p>\n<p>This can mean that the model at the end of training may not be a stable or best-performing set of weights to use as a final model.<\/p>\n<p>One approach to address this problem is to use an average of the weights from multiple models seen toward the end of the training run. This is called Polyak-Ruppert averaging and can be further improved by using a linearly or exponentially decreasing weighted average of the model weights. 
In addition to resulting in a more stable model, the averaged model weights can also yield better performance.<\/p>\n<p>In this tutorial, you will discover how to combine the weights from multiple different models into a single model for making predictions.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The stochastic and challenging nature of training neural networks can mean that the optimization process does not converge.<\/li>\n<li>Creating a model with the average of the weights from models observed towards the end of a training run can result in a more stable and sometimes better-performing solution.<\/li>\n<li>How to develop final models created with the equal, linearly, and exponentially weighted average of model parameters from multiple saved models.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6783\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6783\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/01\/How-to-Create-an-Equally-Linearly-and-Exponentially-Weighted-Average-of-Neural-Network-Model-Weights-in-Keras.jpg\" alt=\"How to Create an Equally, Linearly, and Exponentially Weighted Average of Neural Network Model Weights in Keras\" width=\"640\" height=\"427\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Create-an-Equally-Linearly-and-Exponentially-Weighted-Average-of-Neural-Network-Model-Weights-in-Keras.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Create-an-Equally-Linearly-and-Exponentially-Weighted-Average-of-Neural-Network-Model-Weights-in-Keras-300x200.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Create an Equally, Linearly, and Exponentially Weighted Average of Neural Network Model Weights in 
Keras<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/kamanomargano\/14009292812\/\">netselesoobrazno<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into seven parts; they are:<\/p>\n<ol>\n<li>Average Model Weight Ensemble<\/li>\n<li>Multi-Class Classification Problem<\/li>\n<li>Multilayer Perceptron Model<\/li>\n<li>Save Multiple Models to File<\/li>\n<li>New Model With Average Model Weights<\/li>\n<li>Predicting With an Average Model Weight Ensemble<\/li>\n<li>Linearly and Exponentially Decreasing Weighted Average<\/li>\n<\/ol>\n<h2>Average Model Weight Ensemble<\/h2>\n<p>Learning the weights for a deep neural network model requires solving a high-dimensional non-convex optimization problem.<\/p>\n<p>A challenge with solving this optimization problem is that there are many \u201c<em>good<\/em>\u201d solutions and it is possible for the learning algorithm to bounce around and fail to settle on one. In the area of stochastic optimization, this is referred to as a problem with the convergence of the optimization algorithm on a solution, where a solution is defined by a set of specific weight values.<\/p>\n<p>A symptom you may see if you have a problem with the convergence of your model is a train and\/or test loss that shows higher-than-expected variance, e.g. it thrashes or bounces up and down over training epochs.<\/p>\n<p>One approach to address this problem is to combine the weights collected towards the end of the training process. 
Generally, this might be referred to as temporal averaging and is known as Polyak Averaging or Polyak-Ruppert averaging, named for the original developers of the method.<\/p>\n<blockquote>\n<p>Polyak averaging consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm.<\/p>\n<\/blockquote>\n<p>\u2014 Page 322, <a href=\"https:\/\/amzn.to\/2A1vOOd\">Deep Learning<\/a>, 2016.<\/p>\n<p>Averaging multiple noisy sets of weights during the learning process may paradoxically sound less desirable than tuning the optimization process itself, but can prove an effective solution, especially for very large neural networks that may take days, weeks, or even months to train.<\/p>\n<blockquote>\n<p>The essential advancement was reached on the basis of the paradoxical idea: a slow algorithm having less than optimal convergence rate must be averaged.<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/epubs.siam.org\/doi\/abs\/10.1137\/0330046\">Acceleration of Stochastic Approximation by Averaging<\/a>, 1992.<\/p>\n<p>Averaging the weights of multiple models from a single training run has the effect of smoothing out an optimization process that may be noisy because of the choice of learning hyperparameters (e.g. learning rate) or the shape of the mapping function that is being learned. The result is a final model or set of weights that may offer a more stable, and perhaps more accurate, result.<\/p>\n<blockquote>\n<p>The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near the bottom of the valley. 
The average of all of the locations on either side should be close to the bottom of the valley though.<\/p>\n<\/blockquote>\n<p>\u2014 Page 322, <a href=\"https:\/\/amzn.to\/2A1vOOd\">Deep Learning<\/a>, 2016.<\/p>\n<p>The simplest implementation of Polyak-Ruppert averaging involves calculating the average of the weights of the models over the last few training epochs.<\/p>\n<p>This can be improved by calculating a weighted average, where more weight is applied to more recent models, which is linearly decreased through prior epochs. An alternative and more widely used approach is to use an exponential decay in the weighted average.<\/p>\n<blockquote>\n<p>Polyak-Ruppert averaging has been shown to improve the convergence of standard SGD [\u2026] . Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter value.<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">Adam: A Method for Stochastic Optimization<\/a>, 2014.<\/p>\n<p>Using an average or weighted average of model weights in the final model is a common technique in practice for ensuring the very best results are achieved from the training run. 
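To make the three weighting schemes concrete, the sketch below (illustrative only, not code from this tutorial) builds the weight vector for each scheme over the last n saved models; the decay rate alpha is an assumed value, and each vector is normalized to sum to one so it can later be passed as the weights argument of numpy's average() function:

```python
# sketch of the three Polyak-Ruppert weighting schemes over the last
# n saved models; index 0 is the most recently saved model
from numpy import arange, exp, full

n = 10                      # number of saved models to average

# equal weights: a plain arithmetic mean
equal = full(n, 1.0 / n)

# linearly decreasing weights: the newest model weighted highest
linear = arange(n, 0, -1, dtype=float)
linear = linear / linear.sum()

# exponentially decreasing weights with an assumed decay rate alpha
alpha = 2.0
expw = exp(-arange(n, dtype=float) / alpha)
expw = expw / expw.sum()
```

Because each vector sums to one, swapping one scheme for another only changes how strongly the most recent models dominate the final averaged weights.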
The approach is one of many \u201c<em>tricks<\/em>\u201d used in the Google Inception V2 and V3 deep convolutional neural network models for photo classification, a milestone in the field of deep learning.<\/p>\n<blockquote>\n<p>Model evaluations are performed using a running average of the parameters computed over time.<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1512.00567\">Rethinking the Inception Architecture for Computer Vision<\/a>, 2015.<\/p>\n<h2>Multi-Class Classification Problem<\/h2>\n<p>We will use a small multi-class classification problem as the basis to demonstrate the model weight ensemble.<\/p>\n<p>The scikit-learn library provides the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() 
function<\/a> that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.<\/p>\n<p>The problem has two input variables (to represent the <em>x<\/em> and <em>y<\/em> coordinates of the points) and a standard deviation of 2.0 for points within each group. We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)<\/pre>\n<p>The results are the input and output elements of a dataset that we can model.<\/p>\n<p>In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of blobs dataset\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# scatter plot for each class value\r\nfor class_value in range(3):\r\n\t# select indices of points with the class label\r\n\trow_ix = where(y == class_value)\r\n\t# scatter plot for points with a different color\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1])\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot of the entire dataset. 
We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line) causing many ambiguous points.<\/p>\n<p>This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different \u201cgood enough\u201d candidate solutions resulting in a high variance.<\/p>\n<div id=\"attachment_6778\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6778\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-6.png\" alt=\"Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-6.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-6-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-6-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-6-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value<\/p>\n<\/div>\n<h2>Multilayer Perceptron Model<\/h2>\n<p>Before we define a model, we need to contrive a problem that is appropriate for the ensemble.<\/p>\n<p>In our problem, the training dataset is relatively small. 
Specifically, there is a 1:10 ratio of examples in the training dataset to the holdout dataset (100 training examples to 1,000 holdout examples). This mimics a situation where we may have a vast number of unlabeled examples and a small number of labeled examples with which to train a model.<\/p>\n<p>We will create 1,100 data points from the blobs problem. The model will be trained on the first 100 points and the remaining 1,000 will be held back in a test dataset, unavailable to the model.<\/p>\n<p>The problem is a multi-class classification problem, and we will model it using a softmax activation function on the output layer. This means that the model will predict a vector with three elements with the probability that the sample belongs to each of the three classes. Therefore, we must one hot encode the class values before we split the rows into the train and test datasets. We can do this using the Keras <em>to_categorical()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 100\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>Next, we can define and compile the model.<\/p>\n<p>The model will expect samples with two input variables. 
The model has a single hidden layer with 25 nodes and a rectified linear activation function, followed by an output layer with three nodes and a softmax activation function to predict the probability of each of the three classes.<\/p>\n<p>Because the problem is multi-class, we will use the categorical cross entropy loss function and optimize the model with <a href=\"https:\/\/keras.io\/optimizers\/#sgd\">stochastic gradient descent<\/a> using a small learning rate and momentum.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=2, activation='relu'))\r\nmodel.add(Dense(3, activation='softmax'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The model is fit for 500 training epochs and we will evaluate the model each epoch on the test set, using the test set as a validation set.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0)<\/pre>\n<p>At the end of the run, we will evaluate the performance of the model on the train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))<\/pre>\n<p>Finally, we will plot learning curves of the model accuracy over each training epoch on both the training and validation datasets.<\/p>\n<pre class=\"crayon-plain-tag\"># learning curves of model accuracy\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying all of this together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># develop an mlp for blobs dataset\r\nfrom sklearn.datasets.samples_generator import 
make_blobs\r\nfrom keras.utils import to_categorical\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 100\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=2, activation='relu'))\r\nmodel.add(Dense(3, activation='softmax'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# learning curves of model accuracy\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example prints the performance of the final model on the train and test datasets.<\/p>\n<p>Your specific results will vary (by design!) 
given the high variance nature of the model.<\/p>\n<p>In this case, we can see that the model achieved about 86% accuracy on the training dataset, which we know is optimistic, and about 81% on the test dataset, which we would expect to be more realistic.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.860, Test: 0.812<\/pre>\n<p>A line plot is also created showing the learning curves for the model accuracy on the train and test sets over each training epoch.<\/p>\n<p>We can see that training accuracy is more optimistic over most of the run, as we also noted with the final scores. Importantly, we do see a reasonable amount of variance in the accuracy during training on both the train and test datasets, potentially providing a good basis for using model weight averaging.<\/p>\n<div id=\"attachment_6779\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6779\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plot-Learning-Curves-of-Model-Accuracy-on-Train-and-Test-Dataset-over-Each-Training-Epoch-6.png\" alt=\"Line Plot Learning Curves of Model Accuracy on Train and Test Dataset over Each Training Epoch\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-Learning-Curves-of-Model-Accuracy-on-Train-and-Test-Dataset-over-Each-Training-Epoch-6.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-Learning-Curves-of-Model-Accuracy-on-Train-and-Test-Dataset-over-Each-Training-Epoch-6-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-Learning-Curves-of-Model-Accuracy-on-Train-and-Test-Dataset-over-Each-Training-Epoch-6-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-Learning-Curves-of-Model-Accuracy-on-Train-and-Test-Dataset-over-Each-Training-Epoch-6-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot Learning Curves of Model Accuracy on Train and Test Dataset over Each Training Epoch<\/p>\n<\/div>\n<h2>Save Multiple Models to File<\/h2>\n<p>One approach to the model weight ensemble is to keep a running average of model weights in memory.<\/p>\n<p>There are three downsides to this approach:<\/p>\n<ul>\n<li>It requires that you know beforehand the way in which the model weights will be combined; perhaps you want to experiment with different approaches.<\/li>\n<li>It requires that you know the number of epochs to use for training; maybe you want to use early stopping.<\/li>\n<li>It requires that you keep at least one copy of the entire network in memory; this could be very expensive for large models and fragile if the training process crashes or is killed.<\/li>\n<\/ul>\n<p>An alternative is to save model weights to file during training as a first step, and later combine the weights from the saved models in order to make a final model.<\/p>\n<p>Perhaps the simplest way to implement this is to manually drive the training process, one epoch at a time, then save the model at the end of each epoch once the epoch number passes a chosen threshold.<\/p>\n<p>For example, with our test problem, we will train the model for 500 epochs and perhaps save models from epoch 490 onwards (e.g. 
between and including epochs 490 and 499).<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nn_epochs, n_save_after = 500, 490\r\nfor i in range(n_epochs):\r\n\t# fit model for a single epoch\r\n\tmodel.fit(trainX, trainy, epochs=1, verbose=0)\r\n\t# check if we should save the model\r\n\tif i >= n_save_after:\r\n\t\tmodel.save('model_' + str(i) + '.h5')<\/pre>\n<p>Models can be saved to file using the <em>save()<\/em> function on the model and specifying a filename that includes the epoch number.<\/p>\n<p>Note, saving and loading neural network models in Keras requires that you have the h5py library installed. You can install this library using pip as follows:<\/p>\n<pre class=\"crayon-plain-tag\">pip install h5py<\/pre>\n<p>Tying all of this together, the complete example of fitting the model on the training dataset and saving all models from the last 10 epochs is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># save models to file toward the end of a training run\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.utils import to_categorical\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 100\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(25, input_dim=2, activation='relu'))\r\nmodel.add(Dense(3, activation='softmax'))\r\nmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# fit model\r\nn_epochs, n_save_after = 500, 490\r\nfor i in range(n_epochs):\r\n\t# fit model for a single epoch\r\n\tmodel.fit(trainX, trainy, epochs=1, verbose=0)\r\n\t# check if we should save the model\r\n\tif i >= n_save_after:\r\n\t\tmodel.save('model_' + 
str(i) + '.h5')<\/pre>\n<p>Running the example saves 10 models into the current working directory.<\/p>\n<h2>New Model With Average Model Weights<\/h2>\n<p>We can create a new model from multiple existing models with the same architecture.<\/p>\n<p>First, we need to load the models into memory. This is reasonable as the models are small. If you are working with very large models, it might be easier to load models one at a time and average the weights in memory.<\/p>\n<p>The <em>load_model()<\/em> Keras function can be used to load a saved model from file. The function <em>load_all_models()<\/em> below will load models from the current working directory. It takes the start and end epochs as arguments so that you can experiment with different groups of models saved over contiguous epochs.<\/p>\n<pre class=\"crayon-plain-tag\"># load models from file\r\ndef load_all_models(n_start, n_end):\r\n\tall_models = list()\r\n\tfor epoch in range(n_start, n_end):\r\n\t\t# define filename for this ensemble\r\n\t\tfilename = 'model_' + str(epoch) + '.h5'\r\n\t\t# load model from file\r\n\t\tmodel = load_model(filename)\r\n\t\t# add to list of members\r\n\t\tall_models.append(model)\r\n\t\tprint('>loaded %s' % filename)\r\n\treturn all_models<\/pre>\n<p>We can call the function to load all of the models.<\/p>\n<pre class=\"crayon-plain-tag\"># load models in order\r\nmembers = load_all_models(490, 500)\r\nprint('Loaded %d models' % len(members))<\/pre>\n<p>Once loaded, we can create a new model with the weighted average of the model weights.<\/p>\n<p>Each model has a <em>get_weights()<\/em> function that returns a list of arrays, one for each layer in the model. We can enumerate each layer in the model, retrieve the same layer from each model, and calculate the weighted average. 
This will give us a set of weights.<\/p>\n<p>We can then use the <em>clone_model()<\/em> Keras function to create a clone of the architecture and call the <em>set_weights()<\/em> function to use the average weights we have prepared. The <em>model_weight_ensemble()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># create a model from the weights of multiple models\r\ndef model_weight_ensemble(members, weights):\r\n\t# determine how many layers need to be averaged\r\n\tn_layers = len(members[0].get_weights())\r\n\t# create a set of average model weights\r\n\tavg_model_weights = list()\r\n\tfor layer in range(n_layers):\r\n\t\t# collect this layer from each model\r\n\t\tlayer_weights = array([model.get_weights()[layer] for model in members])\r\n\t\t# weighted average of weights for this layer\r\n\t\tavg_layer_weights = average(layer_weights, axis=0, weights=weights)\r\n\t\t# store average layer weights\r\n\t\tavg_model_weights.append(avg_layer_weights)\r\n\t# create a new model with the same structure\r\n\tmodel = clone_model(members[0])\r\n\t# set the weights in the new model\r\n\tmodel.set_weights(avg_model_weights)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n\treturn model<\/pre>\n<p>Tying these elements together, we can load the 10 models and calculate the equally weighted average (arithmetic average) of the model weights. 
The complete listing is provided below.<\/p>\n<pre class=\"crayon-plain-tag\"># average the weights of multiple loaded models\r\nfrom keras.models import load_model\r\nfrom keras.models import clone_model\r\nfrom numpy import average\r\nfrom numpy import array\r\n\r\n# load models from file\r\ndef load_all_models(n_start, n_end):\r\n\tall_models = list()\r\n\tfor epoch in range(n_start, n_end):\r\n\t\t# define filename for this ensemble\r\n\t\tfilename = 'model_' + str(epoch) + '.h5'\r\n\t\t# load model from file\r\n\t\tmodel = load_model(filename)\r\n\t\t# add to list of members\r\n\t\tall_models.append(model)\r\n\t\tprint('>loaded %s' % filename)\r\n\treturn all_models\r\n\r\n# create a model from the weights of multiple models\r\ndef model_weight_ensemble(members, weights):\r\n\t# determine how many layers need to be averaged\r\n\tn_layers = len(members[0].get_weights())\r\n\t# create a set of average model weights\r\n\tavg_model_weights = list()\r\n\tfor layer in range(n_layers):\r\n\t\t# collect this layer from each model\r\n\t\tlayer_weights = array([model.get_weights()[layer] for model in members])\r\n\t\t# weighted average of weights for this layer\r\n\t\tavg_layer_weights = average(layer_weights, axis=0, weights=weights)\r\n\t\t# store average layer weights\r\n\t\tavg_model_weights.append(avg_layer_weights)\r\n\t# create a new model with the same structure\r\n\tmodel = clone_model(members[0])\r\n\t# set the weights in the new model\r\n\tmodel.set_weights(avg_model_weights)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n\treturn model\r\n\r\n# load all models into memory\r\nmembers = load_all_models(490, 500)\r\nprint('Loaded %d models' % len(members))\r\n# prepare an array of equal weights\r\nn_models = len(members)\r\nweights = [1.0\/n_models for i in range(1, n_models+1)]\r\n# create a new model with the weighted average of all model weights\r\nmodel = model_weight_ensemble(members, weights)\r\n# summarize the 
created model\r\nmodel.summary()<\/pre>\n<p>Running the example first loads the 10 models from file.<\/p>\n<pre class=\"crayon-plain-tag\">>loaded model_490.h5\r\n>loaded model_491.h5\r\n>loaded model_492.h5\r\n>loaded model_493.h5\r\n>loaded model_494.h5\r\n>loaded model_495.h5\r\n>loaded model_496.h5\r\n>loaded model_497.h5\r\n>loaded model_498.h5\r\n>loaded model_499.h5\r\nLoaded 10 models<\/pre>\n<p>A model weight ensemble is created from these 10 models, giving equal weight to each model, and a summary of the model structure is reported.<\/p>\n<pre class=\"crayon-plain-tag\">_________________________________________________________________\r\nLayer (type)                 Output Shape              Param #\r\n=================================================================\r\ndense_1 (Dense)              (None, 25)                75\r\n_________________________________________________________________\r\ndense_2 (Dense)              (None, 3)                 78\r\n=================================================================\r\nTotal params: 153\r\nTrainable params: 153\r\nNon-trainable params: 0\r\n_________________________________________________________________<\/pre>\n<h2>Predicting With an Average Model Weight Ensemble<\/h2>\n<p>Now that we know how to calculate a weighted average of model weights, we can evaluate predictions with the resulting model.<\/p>\n<p>One issue is that we don\u2019t know how many models are appropriate to combine in order to achieve good performance. We can address this by evaluating model weight averaging ensembles built from the last <em>n<\/em> models, varying <em>n<\/em> to see how many models result in good performance.<\/p>\n<p>The <em>evaluate_n_members()<\/em> function below will create a new model from a given number of loaded models. 
Each model is given an equal weight in contributing to the final model, then the <em>model_weight_ensemble()<\/em> function is called to create the final model that is then evaluated on the test dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a specific number of members in an ensemble\r\ndef evaluate_n_members(members, n_members, testX, testy):\r\n\t# reverse loaded models so we build the ensemble with the last models first\r\n\tmembers = list(reversed(members))\r\n\t# select a subset of members\r\n\tsubset = members[:n_members]\r\n\t# prepare an array of equal weights\r\n\tweights = [1.0\/n_members for i in range(1, n_members+1)]\r\n\t# create a new model with the weighted average of all model weights\r\n\tmodel = model_weight_ensemble(subset, weights)\r\n\t# make predictions and evaluate accuracy\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\treturn test_acc<\/pre>\n<p>Importantly, the list of loaded models is reversed first to ensure that the last <em>n<\/em> models in the training run are used, which we would assume might have better performance on average.<\/p>\n<pre class=\"crayon-plain-tag\"># reverse loaded models so we build the ensemble with the last models first\r\nmembers = list(reversed(members))<\/pre>\n<p>We can then evaluate ensembles created from the last <em>n<\/em> models saved during the training run, for values of <em>n<\/em> from 1 to 10. 
In addition to evaluating the combined final model, we can also evaluate each saved standalone model on the test dataset to compare performance.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate different numbers of ensembles on hold out set\r\nsingle_scores, ensemble_scores = list(), list()\r\nfor i in range(1, len(members)+1):\r\n\t# evaluate model with i members\r\n\tensemble_score = evaluate_n_members(members, i, testX, testy)\r\n\t# evaluate the i'th model standalone\r\n\t_, single_score = members[i-1].evaluate(testX, testy, verbose=0)\r\n\t# summarize this step\r\n\tprint('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))\r\n\tensemble_scores.append(ensemble_score)\r\n\tsingle_scores.append(single_score)<\/pre>\n<p>The collected scores can be plotted, with blue dots for the accuracy of the single saved models and the orange line for the test accuracy of the model that combines the weights of the last <em>n<\/em> models.<\/p>\n<pre class=\"crayon-plain-tag\"># plot score vs number of ensemble members\r\nx_axis = [i for i in range(1, len(members)+1)]\r\npyplot.plot(x_axis, single_scores, marker='o', linestyle='None')\r\npyplot.plot(x_axis, ensemble_scores, marker='o')\r\npyplot.show()<\/pre>\n<p>Tying all of this together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># average of model weights on blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom sklearn.metrics import accuracy_score\r\nfrom keras.utils import to_categorical\r\nfrom keras.models import load_model\r\nfrom keras.models import clone_model\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom matplotlib import pyplot\r\nfrom numpy import average\r\nfrom numpy import array\r\n\r\n# load models from file\r\ndef load_all_models(n_start, n_end):\r\n\tall_models = list()\r\n\tfor epoch in range(n_start, n_end):\r\n\t\t# define filename for this ensemble\r\n\t\tfilename = 'model_' + str(epoch) + 
'.h5'\r\n\t\t# load model from file\r\n\t\tmodel = load_model(filename)\r\n\t\t# add to list of members\r\n\t\tall_models.append(model)\r\n\t\tprint('>loaded %s' % filename)\r\n\treturn all_models\r\n\r\n# create a model from the weights of multiple models\r\ndef model_weight_ensemble(members, weights):\r\n\t# determine how many layers need to be averaged\r\n\tn_layers = len(members[0].get_weights())\r\n\t# create a set of average model weights\r\n\tavg_model_weights = list()\r\n\tfor layer in range(n_layers):\r\n\t\t# collect this layer from each model\r\n\t\tlayer_weights = array([model.get_weights()[layer] for model in members])\r\n\t\t# weighted average of weights for this layer\r\n\t\tavg_layer_weights = average(layer_weights, axis=0, weights=weights)\r\n\t\t# store average layer weights\r\n\t\tavg_model_weights.append(avg_layer_weights)\r\n\t# create a new model with the same structure\r\n\tmodel = clone_model(members[0])\r\n\t# set the weights in the new model\r\n\tmodel.set_weights(avg_model_weights)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n\treturn model\r\n\r\n# evaluate a specific number of members in an ensemble\r\ndef evaluate_n_members(members, n_members, testX, testy):\r\n\t# select a subset of members\r\n\tsubset = members[:n_members]\r\n\t# prepare an array of equal weights\r\n\tweights = [1.0\/n_members for i in range(1, n_members+1)]\r\n\t# create a new model with the weighted average of all model weights\r\n\tmodel = model_weight_ensemble(subset, weights)\r\n\t# make predictions and evaluate accuracy\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\treturn test_acc\r\n\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 100\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = 
y[:n_train], y[n_train:]\r\n# load models in order\r\nmembers = load_all_models(490, 500)\r\nprint('Loaded %d models' % len(members))\r\n# reverse loaded models so we build the ensemble with the last models first\r\nmembers = list(reversed(members))\r\n# evaluate different numbers of ensembles on hold out set\r\nsingle_scores, ensemble_scores = list(), list()\r\nfor i in range(1, len(members)+1):\r\n\t# evaluate model with i members\r\n\tensemble_score = evaluate_n_members(members, i, testX, testy)\r\n\t# evaluate the i'th model standalone\r\n\t_, single_score = members[i-1].evaluate(testX, testy, verbose=0)\r\n\t# summarize this step\r\n\tprint('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))\r\n\tensemble_scores.append(ensemble_score)\r\n\tsingle_scores.append(single_score)\r\n# plot score vs number of ensemble members\r\nx_axis = [i for i in range(1, len(members)+1)]\r\npyplot.plot(x_axis, single_scores, marker='o', linestyle='None')\r\npyplot.plot(x_axis, ensemble_scores, marker='o')\r\npyplot.show()<\/pre>\n<p>Running the example first loads the 10 saved models.<\/p>\n<pre class=\"crayon-plain-tag\">>loaded model_490.h5\r\n>loaded model_491.h5\r\n>loaded model_492.h5\r\n>loaded model_493.h5\r\n>loaded model_494.h5\r\n>loaded model_495.h5\r\n>loaded model_496.h5\r\n>loaded model_497.h5\r\n>loaded model_498.h5\r\n>loaded model_499.h5\r\nLoaded 10 models<\/pre>\n<p>The performance of each individually saved model is reported as well as an ensemble model with weights averaged from all models up to and including each model, working backward from the end of the training run.<\/p>\n<p>The results show that the best test accuracy was about 81.4% achieved by the last two models. 
We can also see that the test accuracy of the model weight ensemble levels out the performance, performing about as well as the best standalone models.<\/p>\n<p>Your specific results will vary based on the models saved during the previous section.<\/p>\n<pre class=\"crayon-plain-tag\">> 1: single=0.814, ensemble=0.814\r\n> 2: single=0.814, ensemble=0.814\r\n> 3: single=0.811, ensemble=0.813\r\n> 4: single=0.805, ensemble=0.813\r\n> 5: single=0.807, ensemble=0.811\r\n> 6: single=0.805, ensemble=0.807\r\n> 7: single=0.802, ensemble=0.809\r\n> 8: single=0.805, ensemble=0.808\r\n> 9: single=0.805, ensemble=0.808\r\n> 10: single=0.810, ensemble=0.807<\/pre>\n<p>A line plot is also created showing the test accuracy of each single model (blue dots) and the performance of the model weight ensemble (orange line).<\/p>\n<p>We can see that averaging the model weights does level out the performance and that the combined model performs at least as well as the final model of the run.<\/p>\n<div id=\"attachment_6780\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6780\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line.png\" alt=\"Line Plot of Single Model Test Performance (blue dots) and Model Weight Ensemble Test Performance (orange line)\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Single Model Test Performance (blue dots) and Model Weight Ensemble Test Performance (orange line)<\/p>\n<\/div>\n<h2>Linearly and Exponentially Decreasing Weighted Average<\/h2>\n<p>We can update the example and evaluate a linearly decreasing weighting of the model weights in the ensemble.<\/p>\n<p>The weights can be calculated as follows:<\/p>\n<pre class=\"crayon-plain-tag\"># prepare an array of linearly decreasing weights\r\nweights = [i\/n_members for i in range(n_members, 0, -1)]<\/pre>\n<p>This can be used instead of the equal weights in the <em>evaluate_n_members()<\/em> function.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># linearly decreasing weighted average of models on blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom sklearn.metrics import accuracy_score\r\nfrom keras.utils import to_categorical\r\nfrom keras.models import load_model\r\nfrom keras.models import clone_model\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom matplotlib import pyplot\r\nfrom numpy import average\r\nfrom numpy import array\r\n\r\n# load models from file\r\ndef load_all_models(n_start, n_end):\r\n\tall_models = list()\r\n\tfor epoch in range(n_start, n_end):\r\n\t\t# define filename for this ensemble\r\n\t\tfilename = 'model_' + str(epoch) + '.h5'\r\n\t\t# load model from file\r\n\t\tmodel = load_model(filename)\r\n\t\t# add to list of 
members\r\n\t\tall_models.append(model)\r\n\t\tprint('>loaded %s' % filename)\r\n\treturn all_models\r\n\r\n# create a model from the weights of multiple models\r\ndef model_weight_ensemble(members, weights):\r\n\t# determine how many layers need to be averaged\r\n\tn_layers = len(members[0].get_weights())\r\n\t# create a set of average model weights\r\n\tavg_model_weights = list()\r\n\tfor layer in range(n_layers):\r\n\t\t# collect this layer from each model\r\n\t\tlayer_weights = array([model.get_weights()[layer] for model in members])\r\n\t\t# weighted average of weights for this layer\r\n\t\tavg_layer_weights = average(layer_weights, axis=0, weights=weights)\r\n\t\t# store average layer weights\r\n\t\tavg_model_weights.append(avg_layer_weights)\r\n\t# create a new model with the same structure\r\n\tmodel = clone_model(members[0])\r\n\t# set the weights in the new model\r\n\tmodel.set_weights(avg_model_weights)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n\treturn model\r\n\r\n# evaluate a specific number of members in an ensemble\r\ndef evaluate_n_members(members, n_members, testX, testy):\r\n\t# select a subset of members\r\n\tsubset = members[:n_members]\r\n\t# prepare an array of linearly decreasing weights\r\n\tweights = [i\/n_members for i in range(n_members, 0, -1)]\r\n\t# create a new model with the weighted average of all model weights\r\n\tmodel = model_weight_ensemble(subset, weights)\r\n\t# make predictions and evaluate accuracy\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\treturn test_acc\r\n\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 100\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# load models in order\r\nmembers = load_all_models(490, 
500)\r\nprint('Loaded %d models' % len(members))\r\n# reverse loaded models so we build the ensemble with the last models first\r\nmembers = list(reversed(members))\r\n# evaluate different numbers of ensembles on hold out set\r\nsingle_scores, ensemble_scores = list(), list()\r\nfor i in range(1, len(members)+1):\r\n\t# evaluate model with i members\r\n\tensemble_score = evaluate_n_members(members, i, testX, testy)\r\n\t# evaluate the i'th model standalone\r\n\t_, single_score = members[i-1].evaluate(testX, testy, verbose=0)\r\n\t# summarize this step\r\n\tprint('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))\r\n\tensemble_scores.append(ensemble_score)\r\n\tsingle_scores.append(single_score)\r\n# plot score vs number of ensemble members\r\nx_axis = [i for i in range(1, len(members)+1)]\r\npyplot.plot(x_axis, single_scores, marker='o', linestyle='None')\r\npyplot.plot(x_axis, ensemble_scores, marker='o')\r\npyplot.show()<\/pre>\n<p>Running the example reports the performance of each single model again, and this time the test accuracy of each average model weight ensemble with a linearly decreasing contribution of models.<\/p>\n<p>We can see that, at least in this case, the ensemble achieves a small bump in performance above any standalone model to about 81.5% accuracy.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n> 1: single=0.814, ensemble=0.814\r\n> 2: single=0.814, ensemble=0.815\r\n> 3: single=0.811, ensemble=0.814\r\n> 4: single=0.805, ensemble=0.813\r\n> 5: single=0.807, ensemble=0.813\r\n> 6: single=0.805, ensemble=0.813\r\n> 7: single=0.802, ensemble=0.811\r\n> 8: single=0.805, ensemble=0.810\r\n> 9: single=0.805, ensemble=0.809\r\n> 10: single=0.810, ensemble=0.809<\/pre>\n<p>The line plot shows the bump in performance and a more stable test accuracy over the different-sized ensembles created, as compared to the use of an evenly weighted ensemble.<\/p>\n<div id=\"attachment_6781\" style=\"width: 1290px\" 
class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6781\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Linear-Decay.png\" alt=\"Line Plot of Single Model Test Performance (blue dots) and Model Weight Ensemble Test Performance (orange line) With a Linear Decay\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Linear-Decay.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Linear-Decay-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Linear-Decay-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Linear-Decay-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Single Model Test Performance (blue dots) and Model Weight Ensemble Test Performance (orange line) With a Linear Decay<\/p>\n<\/div>\n<p>We can also experiment with an exponential decay of the contribution of models. This requires that a decay rate (alpha) is specified. 
The example below creates weights for an exponential decay with a decay rate (alpha) of 2.0.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare an array of exponentially decreasing weights\r\nalpha = 2.0\r\nweights = [exp(-i\/alpha) for i in range(1, n_members+1)]<\/pre>\n<p>The complete example with an exponential decay for the contribution of models to the average weights in the ensemble model is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># exponentially decreasing weighted average of models on blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom sklearn.metrics import accuracy_score\r\nfrom keras.utils import to_categorical\r\nfrom keras.models import load_model\r\nfrom keras.models import clone_model\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom matplotlib import pyplot\r\nfrom numpy import average\r\nfrom numpy import array\r\nfrom math import exp\r\n\r\n# load models from file\r\ndef load_all_models(n_start, n_end):\r\n\tall_models = list()\r\n\tfor epoch in range(n_start, n_end):\r\n\t\t# define filename for this ensemble\r\n\t\tfilename = 'model_' + str(epoch) + '.h5'\r\n\t\t# load model from file\r\n\t\tmodel = load_model(filename)\r\n\t\t# add to list of members\r\n\t\tall_models.append(model)\r\n\t\tprint('>loaded %s' % filename)\r\n\treturn all_models\r\n\r\n# create a model from the weights of multiple models\r\ndef model_weight_ensemble(members, weights):\r\n\t# determine how many layers need to be averaged\r\n\tn_layers = len(members[0].get_weights())\r\n\t# create a set of average model weights\r\n\tavg_model_weights = list()\r\n\tfor layer in range(n_layers):\r\n\t\t# collect this layer from each model\r\n\t\tlayer_weights = array([model.get_weights()[layer] for model in members])\r\n\t\t# weighted average of weights for this layer\r\n\t\tavg_layer_weights = average(layer_weights, axis=0, weights=weights)\r\n\t\t# store average layer 
weights\r\n\t\tavg_model_weights.append(avg_layer_weights)\r\n\t# create a new model with the same structure\r\n\tmodel = clone_model(members[0])\r\n\t# set the weights in the new model\r\n\tmodel.set_weights(avg_model_weights)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n\treturn model\r\n\r\n# evaluate a specific number of members in an ensemble\r\ndef evaluate_n_members(members, n_members, testX, testy):\r\n\t# select a subset of members\r\n\tsubset = members[:n_members]\r\n\t# prepare an array of exponentially decreasing weights\r\n\talpha = 2.0\r\n\tweights = [exp(-i\/alpha) for i in range(1, n_members+1)]\r\n\t# create a new model with the weighted average of all model weights\r\n\tmodel = model_weight_ensemble(subset, weights)\r\n\t# make predictions and evaluate accuracy\r\n\t_, test_acc = model.evaluate(testX, testy, verbose=0)\r\n\treturn test_acc\r\n\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 100\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# load models in order\r\nmembers = load_all_models(490, 500)\r\nprint('Loaded %d models' % len(members))\r\n# reverse loaded models so we build the ensemble with the last models first\r\nmembers = list(reversed(members))\r\n# evaluate different numbers of ensembles on hold out set\r\nsingle_scores, ensemble_scores = list(), list()\r\nfor i in range(1, len(members)+1):\r\n\t# evaluate model with i members\r\n\tensemble_score = evaluate_n_members(members, i, testX, testy)\r\n\t# evaluate the i'th model standalone\r\n\t_, single_score = members[i-1].evaluate(testX, testy, verbose=0)\r\n\t# summarize this step\r\n\tprint('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, 
ensemble_score))\r\n\tensemble_scores.append(ensemble_score)\r\n\tsingle_scores.append(single_score)\r\n# plot score vs number of ensemble members\r\nx_axis = [i for i in range(1, len(members)+1)]\r\npyplot.plot(x_axis, single_scores, marker='o', linestyle='None')\r\npyplot.plot(x_axis, ensemble_scores, marker='o')\r\npyplot.show()<\/pre>\n<p>Running the example shows a small improvement in performance much like the use of a linear decay in the weighted average of the saved models.<\/p>\n<pre class=\"crayon-plain-tag\">> 1: single=0.814, ensemble=0.814\r\n> 2: single=0.814, ensemble=0.815\r\n> 3: single=0.811, ensemble=0.814\r\n> 4: single=0.805, ensemble=0.814\r\n> 5: single=0.807, ensemble=0.813\r\n> 6: single=0.805, ensemble=0.813\r\n> 7: single=0.802, ensemble=0.813\r\n> 8: single=0.805, ensemble=0.813\r\n> 9: single=0.805, ensemble=0.813\r\n> 10: single=0.810, ensemble=0.813<\/pre>\n<p>The line plot of the test accuracy scores shows the stronger stabilizing effect of using the exponential decay instead of the linear or equal weighting of models.<\/p>\n<div id=\"attachment_6782\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6782\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Exponential-Decay.png\" alt=\"Line Plot of Single Model Test Performance (blue dots) and Model Weight Ensemble Test Performance (orange line) With an Exponential Decay\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Exponential-Decay.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Exponential-Decay-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Exponential-Decay-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Single-Model-Test-Performance-blue-dots-and-Model-Weight-Ensemble-Test-Performance-orange-line-with-a-Exponential-Decay-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Single Model Test Performance (blue dots) and Model Weight Ensemble Test Performance (orange line) With an Exponential Decay<\/p>\n<\/div>\n<h2>Extensions<\/h2>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li><strong>Number of Models<\/strong>. Evaluate the effect of many more models contributing their weights to the final model.<\/li>\n<li><strong>Decay Rate<\/strong>. 
Evaluate the effect on test performance of using different decay rates for an exponentially weighted average.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Books<\/h3>\n<ul>\n<li>Section 8.7.3 Polyak Averaging, <a href=\"https:\/\/amzn.to\/2A1vOOd\">Deep Learning<\/a>, 2016.<\/li>\n<\/ul>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/epubs.siam.org\/doi\/abs\/10.1137\/0330046\">Acceleration of Stochastic Approximation by Averaging<\/a>, 1992.<\/li>\n<li><a href=\"https:\/\/ecommons.cornell.edu\/handle\/1813\/8664\">Efficient Estimations From a Slowly Convergent Robbins-Monro Process<\/a>, 1988.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/keras.io\/getting-started\/sequential-model-guide\/\">Getting started with the Keras Sequential model<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/core\/\">Keras Core Layers API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">sklearn.datasets.make_blobs API<\/a><\/li>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/numpy\/reference\/generated\/numpy.average.html\">numpy.average API<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/stackoverflow.com\/questions\/48212110\/average-weights-in-keras-models\">Average weights in keras models, StackOverflow.<\/a><\/li>\n<li><a href=\"http:\/\/cs231n.github.io\/neural-networks-3\/#ensemble\">Model Ensembles, CS231n Convolutional Neural Networks for Visual Recognition<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/keras-team\/keras\/issues\/3696\">Exponential Moving Average, Keras Issue.<\/a><\/li>\n<li><a href=\"https:\/\/gist.github.com\/soheilb\/c5bf0ba7197caa095acfcb69744df756\">ExponentialMovingAverage Implementation<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to combine the 
weights from multiple different models into a single model for making predictions.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The stochastic and challenging nature of training neural networks can mean that the optimization process does not converge.<\/li>\n<li>Creating a model with the average of the weights from models observed towards the end of a training run can result in a more stable and sometimes better-performing solution.<\/li>\n<li>How to develop final models created with the equal, linearly, and exponentially weighted average of model parameters from multiple saved models.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/polyak-neural-network-model-weight-ensemble\/\">How to Create an Equally, Linearly, and Exponentially Weighted Average of Neural Network Model Weights in Keras<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/polyak-neural-network-model-weight-ensemble\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee The training process of neural networks is a challenging optimization process that can often fail to converge. 
This can mean that the [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/06\/how-to-create-an-equally-linearly-and-exponentially-weighted-average-of-neural-network-model-weights-in-keras\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1539,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1538"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1538"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1538\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1539"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1538"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1538"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1538"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}