{"id":1374,"date":"2018-12-09T18:00:02","date_gmt":"2018-12-09T18:00:02","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2018\/12\/09\/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping\/"},"modified":"2018-12-09T18:00:02","modified_gmt":"2018-12-09T18:00:02","slug":"how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2018\/12\/09\/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping\/","title":{"rendered":"How to Stop Training Deep Neural Networks At the Right Time Using Early Stopping"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>A problem with training neural networks is the choice of the number of training epochs to use.<\/p>\n<p>Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model performance stops improving on a held-out validation dataset.<\/p>\n<p>In this tutorial, you will discover the Keras API for adding early stopping to overfit deep learning neural network models.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to monitor the performance of a model during training using the Keras API.<\/li>\n<li>How to create and configure early stopping and model checkpoint callbacks using the Keras API.<\/li>\n<li>How to reduce overfitting by adding early stopping to an existing model.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6623\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6623\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/12\/How-to-Stop-Training-Deep-Neural-Networks-At-the-Right-Time-With-Using-Early-Stopping.jpg\" alt=\"How to Stop Training Deep Neural Networks At the Right Time With Using Early Stopping\" width=\"640\" height=\"360\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/12\/How-to-Stop-Training-Deep-Neural-Networks-At-the-Right-Time-With-Using-Early-Stopping.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/12\/How-to-Stop-Training-Deep-Neural-Networks-At-the-Right-Time-With-Using-Early-Stopping-300x169.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Stop Training Deep Neural Networks At the Right Time With Using Early Stopping<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/ian-arlett\/29166578334\/\">Ian D. Keating<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into six parts; they are:<\/p>\n<ol>\n<li>Using Callbacks in Keras<\/li>\n<li>Evaluating a Validation Dataset<\/li>\n<li>Monitoring Model Performance<\/li>\n<li>Early Stopping in Keras<\/li>\n<li>Checkpointing in Keras<\/li>\n<li>Early Stopping Case Study<\/li>\n<\/ol>\n<h2>Using Callbacks in Keras<\/h2>\n<p>Callbacks provide a way to execute code and interact with the model training process automatically.<\/p>\n<p>Callbacks can be provided to the <em>fit()<\/em> function via the \u201c<em>callbacks<\/em>\u201d argument.<\/p>\n<p>First, callbacks must be instantiated.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\ncb = Callback(...)<\/pre>\n<p>Then, one or more callbacks that you intend to use must be added to a Python list.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\ncb_list = [cb, ...]<\/pre>\n<p>Finally, the list of callbacks is provided to the \u201c<em>callbacks<\/em>\u201d argument when fitting the model.<\/p>\n<pre 
class=\"crayon-plain-tag\">...\r\nmodel.fit(..., callbacks=cb_list)<\/pre>\n<\/p>\n<h2>Evaluating a Validation Dataset in Keras<\/h2>\n<p>Early stopping requires that a validation dataset is evaluated during training.<\/p>\n<p>This can be achieved by specifying the validation dataset to the fit() function when training your model.<\/p>\n<p>There are two ways of doing this.<\/p>\n<p>The first involves you manually splitting your training data into a train and validation dataset and specifying the validation dataset to the <em>fit()<\/em> function via the <em>validation_data<\/em> argument. For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.fit(train_X, train_y, validation_data=(val_x, val_y))<\/pre>\n<p>Alternately, the <em>fit()<\/em> function can automatically split your training dataset into train and validation sets based on a percentage split specified via the <em>validation_split<\/em> argument.<\/p>\n<p>The <em>validation_split<\/em> is a value between 0 and 1 and defines the percentage amount of the training dataset to use for the validation dataset. For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.fit(train_X, train_y, validation_split=0.3)<\/pre>\n<p>In both cases, the model is not trained on the validation dataset. 
Instead, the model is evaluated on the validation dataset at the end of each training epoch.<\/p>\n<h2>Monitoring Model Performance<\/h2>\n<p>The loss function chosen to be optimized for your model is calculated at the end of each epoch.<\/p>\n<p>To callbacks, this is made available via the name \u201c<em>loss<\/em>.\u201d<\/p>\n<p>If a validation dataset is specified to the <em>fit()<\/em> function via the <em>validation_data<\/em> or <em>validation_split<\/em> arguments, then the loss on the validation dataset will be made available via the name \u201c<em>val_loss<\/em>.\u201d<\/p>\n<p>Additional metrics can be monitored during the training of the model.<\/p>\n<p>They can be specified when compiling the model via the \u201c<em>metrics<\/em>\u201d argument to the compile function. This argument takes a Python list of known metric functions, such as \u2018<em>mse<\/em>\u2018 for mean squared error and \u2018<em>acc<\/em>\u2018 for accuracy. For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.compile(..., metrics=['acc'])<\/pre>\n<p>If additional metrics are monitored during training, they are also available to the callbacks via the same name, such as \u2018<em>acc<\/em>\u2018 for accuracy on the training dataset and \u2018<em>val_acc<\/em>\u2018 for the accuracy on the validation dataset. Or, \u2018<em>mse<\/em>\u2018 for mean squared error on the training dataset and \u2018<em>val_mse<\/em>\u2018 on the validation dataset.<\/p>\n<h2>Early Stopping in Keras<\/h2>\n<p>Keras supports the early stopping of training via a callback called <em>EarlyStopping<\/em>.<\/p>\n<p>This callback allows you to specify the performance measure to monitor, the trigger, and once triggered, it will stop the training process.<\/p>\n<p>The <em>EarlyStopping<\/em> callback is configured when instantiated via arguments.<\/p>\n<p>The \u201c<em>monitor<\/em>\u201d allows you to specify the performance measure to monitor in order to end training. 
Recall from the previous section that the calculation of measures on the validation dataset will have the \u2018<em>val_<\/em>\u2018 prefix, such as \u2018<em>val_loss<\/em>\u2018 for the loss on the validation dataset.<\/p>\n<pre class=\"crayon-plain-tag\">es = EarlyStopping(monitor='val_loss')<\/pre>\n<p>Based on the choice of performance measure, the \u201c<em>mode<\/em>\u201d argument will need to be specified as whether the objective of the chosen metric is to increase (maximize or \u2018<em>max<\/em>\u2018) or to decrease (minimize or \u2018<em>min<\/em>\u2018).<\/p>\n<p>For example, we would seek a minimum for validation loss and a minimum for validation mean squared error, whereas we would seek a maximum for validation accuracy.<\/p>\n<pre class=\"crayon-plain-tag\">es = EarlyStopping(monitor='val_loss', mode='min')<\/pre>\n<p>By default, mode is set to \u2018<em>auto<\/em>\u2018 and knows that you want to minimize loss or maximize accuracy.<\/p>\n<p>That is all that is needed for the simplest form of early stopping. Training will stop when the chosen performance measure stops improving. To discover the training epoch on which training was stopped, the \u201c<em>verbose<\/em>\u201d argument can be set to 1. Once stopped, the callback will print the epoch number.<\/p>\n<pre class=\"crayon-plain-tag\">es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)<\/pre>\n<p>Often, the first sign of no further improvement may not be the best time to stop training. This is because the model may coast into a plateau of no improvement or even get slightly worse before getting much better.<\/p>\n<p>We can account for this by adding a delay to the trigger in terms of the number of epochs on which we would like to see no improvement. 
This can be done by setting the \u201c<em>patience<\/em>\u201d argument.<\/p>\n<pre class=\"crayon-plain-tag\">es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50)<\/pre>\n<p>The exact amount of patience will vary between models and problems. Reviewing plots of your performance measure can be very useful to get an idea of how noisy the optimization process for your model on your data may be.<\/p>\n<p>By default, any improvement in the performance measure, no matter how small, will be counted as an improvement. You may want to count only an improvement of at least a specific increment, such as 1 unit for mean squared error or 1% for accuracy. This can be specified via the \u201c<em>min_delta<\/em>\u201d argument. Note that Keras reports accuracy as a fraction between 0 and 1, so an improvement of 1% corresponds to a value of 0.01.<\/p>\n<pre class=\"crayon-plain-tag\">es = EarlyStopping(monitor='val_acc', mode='max', min_delta=0.01)<\/pre>\n<p>Finally, it may be desirable to only stop training if performance stays above or below a given threshold or baseline. For example, you may be familiar with the training of the model (e.g. from learning curves) and know that once a validation loss of a given value is achieved, there is no point in continuing training. This can be specified by setting the \u201c<em>baseline<\/em>\u201d argument.<\/p>\n<p>This might be more useful when fine-tuning a model, after the initial wild fluctuations in the performance measure seen in the early stages of training a new model are past.<\/p>\n<pre class=\"crayon-plain-tag\">es = EarlyStopping(monitor='val_loss', mode='min', baseline=0.4)<\/pre>\n<\/p>\n<h2>Checkpointing in Keras<\/h2>\n<p>The <em>EarlyStopping<\/em> callback will stop training once triggered, but the model at the end of training may not be the model with the best performance on the validation dataset.<\/p>\n<p>An additional callback is required that will save the best model observed during training for later use. 
This is the <em>ModelCheckpoint<\/em> callback.<\/p>\n<p>The <em>ModelCheckpoint<\/em> callback is flexible in the way it can be used, but in this case we will use it only to save the best model observed during training as defined by a chosen performance measure on the validation dataset.<\/p>\n<p>Saving and loading models requires that HDF5 support has been installed on your workstation. For example, using the <em>pip<\/em> Python installer, this can be achieved as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install h5py<\/pre>\n<p>You can learn more from the <a href=\"http:\/\/docs.h5py.org\/en\/latest\/build.html\">h5py Installation documentation<\/a>.<\/p>\n<p>The callback will save the model to file, which requires that a path and filename be specified via the first argument.<\/p>\n<pre class=\"crayon-plain-tag\">mc = ModelCheckpoint('best_model.h5')<\/pre>\n<p>The preferred loss function to be monitored can be specified via the monitor argument, in the same way as the <em>EarlyStopping<\/em> callback. For example, loss on the validation dataset (the default).<\/p>\n<pre class=\"crayon-plain-tag\">mc = ModelCheckpoint('best_model.h5', monitor='val_loss')<\/pre>\n<p>Also, as with the <em>EarlyStopping<\/em> callback, we must specify the \u201c<em>mode<\/em>\u201d as either minimizing or maximizing the performance measure. Again, the default is \u2018<em>auto<\/em>,\u2019 which is aware of the standard performance measures.<\/p>\n<pre class=\"crayon-plain-tag\">mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min')<\/pre>\n<p>Finally, we are interested in only the very best model observed during training, rather than the best compared to the previous epoch, which might not be the best overall if training is noisy. 
This can be achieved by setting the \u201c<em>save_best_only<\/em>\u201d argument to <em>True<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\">mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True)<\/pre>\n<p>That is all that is needed to ensure the model with the best performance is saved when using early stopping, or in general.<\/p>\n<p>It may be interesting to know the value of the performance measure and at what epoch the model was saved. This can be printed by the callback by setting the \u201c<em>verbose<\/em>\u201d argument to \u201c<em>1<\/em>\u201c.<\/p>\n<pre class=\"crayon-plain-tag\">mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', verbose=1)<\/pre>\n<p>The saved model can then be loaded and evaluated any time by calling the <em>load_model()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\"># load a saved model\r\nfrom keras.models import load_model\r\nsaved_model = load_model('best_model.h5')<\/pre>\n<p>Now that we know how to use the early stopping and model checkpoint APIs, let\u2019s look at a worked example.<\/p>\n<h2>Early Stopping Case Study<\/h2>\n<p>In this section, we will demonstrate how to use early stopping to reduce overfitting of an MLP on a simple binary classification problem.<\/p>\n<p>This example provides a template for applying early stopping to your own neural network for classification and regression problems.<\/p>\n<h3>Binary Classification Problem<\/h3>\n<p>We will use a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class.<\/p>\n<p>Each observation has two input variables with the same scale and a class output value of either 0 or 1. 
This dataset is called the \u201c<em>moons<\/em>\u201d dataset because of the shape of the observations in each class when plotted.<\/p>\n<p>We can use the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_moons.html\">make_moons() function<\/a> to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)<\/pre>\n<p>We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.<\/p>\n<p>The complete example of generating the dataset and plotting it is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># generate two moons dataset\r\nfrom sklearn.datasets import make_moons\r\nfrom matplotlib import pyplot\r\nfrom pandas import DataFrame\r\n# generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)\r\n# scatter plot, dots colored by class value\r\ndf = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))\r\ncolors = {0:'red', 1:'blue'}\r\nfig, ax = pyplot.subplots()\r\ngrouped = df.groupby('label')\r\nfor key, group in grouped:\r\n    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. 
We can see the noise in the dispersal of the points, making the moons less obvious.<\/p>\n<div id=\"attachment_6548\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6548\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/09\/Scatter-Plot-of-Moons-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample.png\" alt=\"Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/09\/Scatter-Plot-of-Moons-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/09\/Scatter-Plot-of-Moons-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/09\/Scatter-Plot-of-Moons-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/09\/Scatter-Plot-of-Moons-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample<\/p>\n<\/div>\n<p>This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.<\/p>\n<p>We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. 
Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don\u2019t generalize.<\/p>\n<h3>Overfit Multilayer Perceptron<\/h3>\n<p>We can develop an MLP model to address this binary classification problem.<\/p>\n<p>The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.<\/p>\n<p>Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model\u2019s performance.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)\r\n# split into train and test\r\nn_train = 30\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>Next, we can define the model.<\/p>\n<p>The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1. The model is optimized using the binary cross entropy loss function, suitable for binary classification problems and the efficient <a href=\"https:\/\/machinelearningmastery.com\/adam-optimization-algorithm-for-deep-learning\/\">Adam version of gradient descent<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(500, input_dim=2, activation='relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])<\/pre>\n<p>The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.<\/p>\n<p>We will also use the test dataset as a validation dataset. This is just a simplification for this example. 
In practice, you would split the training set into train and validation sets and also hold back a test set for final model evaluation.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)<\/pre>\n<p>We can evaluate the performance of the model on the train and test datasets and report the result.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))<\/pre>\n<p>Finally, we will plot the loss of the model on both the train and test sets at each epoch.<\/p>\n<p>If the model does indeed overfit the training dataset, we would expect the line plot of loss on the training set to continue to decrease and the loss on the test set to fall and then rise again as the model learns statistical noise in the training dataset. Accuracy would show the mirror image: it continues to rise on the training set, and rises and then falls on the test set.<\/p>\n<pre class=\"crayon-plain-tag\"># plot training history\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>We can tie all of these pieces together; the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp overfit on the moons dataset\r\nfrom sklearn.datasets import make_moons\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)\r\n# split into train and test\r\nn_train = 30\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(500, input_dim=2, activation='relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# fit model\r\nhistory 
= model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training history\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example reports the model performance on the train and test datasets.<\/p>\n<p>We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.<\/p>\n<p>Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 1.000, Test: 0.914<\/pre>\n<p>A figure is created showing line plots of the model loss on the train and test sets.<\/p>\n<p>We can see the expected shape of an overfit model, where test loss decreases to a point and then begins to increase again.<\/p>\n<p>Reviewing the figure, we can also see flat spots and ups and downs in the validation loss. Any early stopping will have to account for these behaviors. 
We would also expect that a good time to stop training might be around epoch 800.<\/p>\n<div id=\"attachment_6620\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6620\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plots-of-Loss-on-Train-and-Test-Datasets-While-Training-Showing-an-Overfit-Model.png\" alt=\"Line Plots of Loss on Train and Test Datasets While Training Showing an Overfit Model\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plots-of-Loss-on-Train-and-Test-Datasets-While-Training-Showing-an-Overfit-Model.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plots-of-Loss-on-Train-and-Test-Datasets-While-Training-Showing-an-Overfit-Model-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plots-of-Loss-on-Train-and-Test-Datasets-While-Training-Showing-an-Overfit-Model-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plots-of-Loss-on-Train-and-Test-Datasets-While-Training-Showing-an-Overfit-Model-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Loss on Train and Test Datasets While Training Showing an Overfit Model<\/p>\n<\/div>\n<h3>Overfit MLP With Early Stopping<\/h3>\n<p>We can update the example and add very simple early stopping.<\/p>\n<p>As soon as the loss of the model begins to increase on the test dataset, we will stop training.<\/p>\n<p>First, we can define the early stopping callback.<\/p>\n<pre class=\"crayon-plain-tag\"># simple early stopping\r\nes = EarlyStopping(monitor='val_loss', mode='min', verbose=1)<\/pre>\n<p>We can then update the call to the <em>fit()<\/em> function and specify a 
list of callbacks via the \u201c<em>callbacks<\/em>\u201d argument.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])<\/pre>\n<p>The complete example with the addition of simple early stopping is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp overfit on the moons dataset with simple early stopping\r\nfrom sklearn.datasets import make_moons\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.callbacks import EarlyStopping\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)\r\n# split into train and test\r\nn_train = 30\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(500, input_dim=2, activation='relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# simple early stopping\r\nes = EarlyStopping(monitor='val_loss', mode='min', verbose=1)\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training history\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example reports the model performance on the train and test datasets.<\/p>\n<p>We can also see that the callback stopped training at around epoch 200. This is too early, as we would expect an early stop to be around epoch 800. 
This is also highlighted by the classification accuracy on both the train and test sets, which is worse than no early stopping.<\/p>\n<pre class=\"crayon-plain-tag\">Epoch 00219: early stopping\r\nTrain: 0.967, Test: 0.814<\/pre>\n<p>Reviewing the line plot of train and test loss, we can indeed see that training was stopped at the point when validation loss began to plateau for the first time.<\/p>\n<div id=\"attachment_6621\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6621\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Simple-Early-Stopping.png\" alt=\"Line Plot of Train and Test Loss During Training With Simple Early Stopping\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Simple-Early-Stopping.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Simple-Early-Stopping-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Simple-Early-Stopping-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Simple-Early-Stopping-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Train and Test Loss During Training With Simple Early Stopping<\/p>\n<\/div>\n<p>We can improve the trigger for early stopping by waiting a while before stopping.<\/p>\n<p>This can be achieved by setting the \u201c<em>patience<\/em>\u201d argument.<\/p>\n<p>In this case, we will wait 200 epochs before training is 
stopped. Specifically, this means that we will allow training to continue for up to an additional 200 epochs after the point that validation loss started to degrade, giving the training process an opportunity to get across flat spots or find some additional improvement.<\/p>\n<pre class=\"crayon-plain-tag\"># patient early stopping\r\nes = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)<\/pre>\n<p>The complete example with this change is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp overfit on the moons dataset with patient early stopping\r\nfrom sklearn.datasets import make_moons\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.callbacks import EarlyStopping\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)\r\n# split into train and test\r\nn_train = 30\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(500, input_dim=2, activation='relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# patient early stopping\r\nes = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training history\r\npyplot.plot(history.history['loss'], label='train')\r\npyplot.plot(history.history['val_loss'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example, we can see that training was stopped much later, in this case after epoch 1,000. 
Your specific results may differ given the stochastic nature of training neural networks.<\/p>\n<p>We can also see that the performance on the test dataset is better than not using any early stopping.<\/p>\n<pre class=\"crayon-plain-tag\">Epoch 01033: early stopping\r\nTrain: 1.000, Test: 0.943<\/pre>\n<p>Reviewing the line plot of loss during training, we can see that the patience allowed the training to progress past some small flat and bad spots.<\/p>\n<div id=\"attachment_6622\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6622\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Patient-Early-Stopping.png\" alt=\"Line Plot of Train and Test Loss During Training With Patient Early Stopping\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Patient-Early-Stopping.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Patient-Early-Stopping-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Patient-Early-Stopping-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/10\/Line-Plot-of-Train-and-Test-Loss-During-Training-With-Patient-Early-Stopping-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Train and Test Loss During Training With Patient Early Stopping<\/p>\n<\/div>\n<p>We can also see that test loss started to increase again in the last approximately 100 epochs.<\/p>\n<p>This means that although the performance of the model has improved, we 
may not have the best performing or most stable model at the end of training. We can address this by using a <em>ModelCheckpoint<\/em> callback.<\/p>\n<p>In this case, we are interested in saving the model with the best accuracy on the test dataset. We could also seek the model with the best loss on the test dataset, but this may or may not correspond to the model with the best accuracy.<\/p>\n<p>This highlights an important concept in model selection. The notion of the \u201c<em>best<\/em>\u201d model during training may differ depending on which performance measure is used to evaluate it. Try to choose models based on the metric by which they will be evaluated and presented in the domain. In a balanced binary classification problem, this will most likely be classification accuracy. Therefore, we will use accuracy on the validation dataset in the <em>ModelCheckpoint<\/em> callback to save the best model observed during training.<\/p>\n<pre class=\"crayon-plain-tag\">mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)<\/pre>\n<p>During training, the entire model will be saved to the file \u201c<em>best_model.h5<\/em>\u201d only when accuracy on the validation dataset improves overall across the entire training process. A verbose output will also inform us as to the epoch and accuracy value each time the model is saved to the same file (i.e. 
overwritten).<\/p>\n<p>This new additional callback can be added to the list of callbacks when calling the <em>fit()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\">history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es, mc])<\/pre>\n<p>We are no longer interested in the line plot of loss during training; it will be much the same as the previous run.<\/p>\n<p>Instead, we want to load the saved model from file and evaluate its performance on the test dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># load the saved model\r\nsaved_model = load_model('best_model.h5')\r\n# evaluate the model\r\n_, train_acc = saved_model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = saved_model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))<\/pre>\n<p>The complete example with these changes is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp overfit on the moons dataset with patient early stopping and model checkpointing\r\nfrom sklearn.datasets import make_moons\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.callbacks import EarlyStopping\r\nfrom keras.callbacks import ModelCheckpoint\r\nfrom matplotlib import pyplot\r\nfrom keras.models import load_model\r\n# generate 2d classification dataset\r\nX, y = make_moons(n_samples=100, noise=0.2, random_state=1)\r\n# split into train and test\r\nn_train = 30\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(500, input_dim=2, activation='relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# simple early stopping\r\nes = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)\r\nmc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)\r\n# 
fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es, mc])\r\n# load the saved model\r\nsaved_model = load_model('best_model.h5')\r\n# evaluate the model\r\n_, train_acc = saved_model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = saved_model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))<\/pre>\n<p>Running the example, we can see the verbose output from the <em>ModelCheckpoint<\/em> callback both when a new best model is saved and when no improvement was observed.<\/p>\n<p>We can see that the best model was observed at epoch 879 during this run. Your specific results may vary given the stochastic nature of training neural networks.<\/p>\n<p>Again, we can see that early stopping continued patiently until after epoch 1,000. Note that epoch 879 + a patience of 200 is not epoch 1044. Recall that early stopping is monitoring loss on the validation dataset and that the model checkpoint is saving models based on accuracy. As such, the patience of early stopping started at an epoch other than 879.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nEpoch 00878: val_acc did not improve from 0.92857\r\nEpoch 00879: val_acc improved from 0.92857 to 0.94286, saving model to best_model.h5\r\nEpoch 00880: val_acc did not improve from 0.94286\r\n...\r\nEpoch 01042: val_acc did not improve from 0.94286\r\nEpoch 01043: val_acc did not improve from 0.94286\r\nEpoch 01044: val_acc did not improve from 0.94286\r\nEpoch 01044: early stopping\r\nTrain: 1.000, Test: 0.943<\/pre>\n<p>In this case, we don\u2019t see any further improvement in model accuracy on the test dataset. Nevertheless, we have followed a good practice.<\/p>\n<p>Why not monitor validation accuracy for early stopping?<\/p>\n<p>This is a good question. 
The main reason is that accuracy is a coarse measure of model performance during training and that loss provides more nuance when using early stopping with classification problems. The same measure may be used for early stopping and model checkpointing in the case of regression, such as mean squared error.<\/p>\n<h2>Extensions<\/h2>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li><strong>Use Accuracy<\/strong>. Update the example to monitor accuracy on the test dataset rather than loss, and plot learning curves showing accuracy.<\/li>\n<li><strong>Use True Validation Set<\/strong>. Update the example to split the training set into train and validation sets, then evaluate the model on the test dataset.<\/li>\n<li><strong>Regression Example<\/strong>. Create a new example of using early stopping to address overfitting on a simple regression problem and monitoring mean squared error.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Posts<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/avoid-overfitting-by-early-stopping-with-xgboost-in-python\/\">Avoid Overfitting by Early Stopping With XGBoost in Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/check-point-deep-learning-models-keras\/\">How to Check-Point Deep Learning Models in Keras<\/a><\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"http:\/\/docs.h5py.org\/en\/latest\/build.html\">H5Py Installation Documentation<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/regularizers\/\">Keras Regularizers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/core\/\">Keras Core Layers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/convolutional\/\">Keras Convolutional Layers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/recurrent\/\">Keras 
Recurrent Layers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/callbacks\/\">Keras Callbacks API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_moons.html\">sklearn.datasets.make_moons API<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered the Keras API for adding early stopping to overfit deep learning neural network models.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to monitor the performance of a model during training using the Keras API.<\/li>\n<li>How to create and configure early stopping and model checkpoint callbacks using the Keras API.<\/li>\n<li>How to reduce overfitting by adding early stopping to an existing model.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping\/\">How to Stop Training Deep Neural Networks At the Right Time Using Early Stopping<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee A problem with training neural networks is in the choice of the number of training epochs to use. 
Too many epochs can [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2018\/12\/09\/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1375,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1374"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1374"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1374\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1375"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}