{"id":1623,"date":"2019-01-24T18:00:25","date_gmt":"2019-01-24T18:00:25","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/24\/understand-the-impact-of-learning-rate-on-model-performance-with-deep-learning-neural-networks\/"},"modified":"2019-01-24T18:00:25","modified_gmt":"2019-01-24T18:00:25","slug":"understand-the-impact-of-learning-rate-on-model-performance-with-deep-learning-neural-networks","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/24\/understand-the-impact-of-learning-rate-on-model-performance-with-deep-learning-neural-networks\/","title":{"rendered":"Understand the Impact of Learning Rate on Model Performance With Deep Learning Neural Networks"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm.<\/p>\n<p>The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.<\/p>\n<p>The learning rate may be the most important hyperparameter when configuring your neural network. 
Therefore it is vital to know how to investigate the effects of the learning rate on model performance and to build an intuition about the dynamics of the learning rate on model behavior.<\/p>\n<p>In this tutorial, you will discover the effects of the learning rate, learning rate schedules, and adaptive learning rates on model performance.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How large learning rates result in unstable training and tiny rates result in a failure to train.<\/li>\n<li>Momentum can accelerate training and learning rate schedules can help to converge the optimization process.<\/li>\n<li>Adaptive learning rates can accelerate training and alleviate some of the pressure of choosing a learning rate and learning rate schedule.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6898\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-6898 size-full\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/01\/Understand-the-Dynamics-of-Learning-Rate-on-Model-Performance-With-Deep-Learning-Neural-Networks.jpg\" alt=\"Understand the Dynamics of Learning Rate on Model Performance With Deep Learning Neural Networks\" width=\"640\" height=\"377\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/Understand-the-Dynamics-of-Learning-Rate-on-Model-Performance-With-Deep-Learning-Neural-Networks.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/Understand-the-Dynamics-of-Learning-Rate-on-Model-Performance-With-Deep-Learning-Neural-Networks-300x177.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">Understand the Dynamics of Learning Rate on Model Performance With Deep Learning Neural Networks<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/yunir\/7383264302\/\">Abdul Rahman<\/a> some 
rights reserved<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into six parts; they are:<\/p>\n<ol>\n<li>Learning Rate and Gradient Descent<\/li>\n<li>Configure the Learning Rate in Keras<\/li>\n<li>Multi-Class Classification Problem<\/li>\n<li>Effect of Learning Rate and Momentum<\/li>\n<li>Effect of Learning Rate Schedules<\/li>\n<li>Effect of Adaptive Learning Rates<\/li>\n<\/ol>\n<h2>Learning Rate and Gradient Descent<\/h2>\n<p>Deep learning neural networks are trained using the stochastic gradient descent algorithm.<\/p>\n<p>Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.<\/p>\n<p>The amount that the weights are updated during training is referred to as the step size or the \u201c<em>learning rate<\/em>.\u201d<\/p>\n<p>Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.<\/p>\n<p>The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.<\/p>\n<p>A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.<\/p>\n<p>The challenge of training deep learning neural networks involves carefully selecting the learning rate. It may be the most important hyperparameter for the model.<\/p>\n<blockquote>\n<p>The learning rate is perhaps the most important hyperparameter. 
If you have time to tune only one hyperparameter, tune the learning rate.<\/p>\n<\/blockquote>\n<p>\u2014 Page 429, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Now that we are familiar with what the learning rate is, let\u2019s look at how we can configure the learning rate for neural networks.<\/p>\n<h2>Configure the Learning Rate in Keras<\/h2>\n<p>The Keras deep learning library allows you to easily configure the learning rate for a number of different variations of the stochastic gradient descent optimization algorithm.<\/p>\n<h3>Stochastic Gradient Descent<\/h3>\n<p>Keras provides the SGD class that implements the stochastic gradient descent optimizer with a learning rate and momentum.<\/p>\n<p>First, an instance of the class must be created and configured, 
then specified to the \u201c<em>optimizer<\/em>\u201d argument when calling the <em>compile()<\/em> function on the model.<\/p>\n<p>The default learning rate is 0.01 and no momentum is used by default.<\/p>\n<pre class=\"crayon-plain-tag\">from keras.optimizers import SGD\r\n...\r\nopt = SGD()\r\nmodel.compile(..., optimizer=opt)<\/pre>\n<p>The learning rate can be specified via the \u201c<em>lr<\/em>\u201d argument and the momentum can be specified via the \u201c<em>momentum<\/em>\u201d argument.<\/p>\n<pre class=\"crayon-plain-tag\">from keras.optimizers import SGD\r\n...\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(..., optimizer=opt)<\/pre>\n<p>The class also supports learning rate decay via the \u201c<em>decay<\/em>\u201d argument.<\/p>\n<p>With learning rate decay, the learning rate is calculated each update (e.g. end of each mini-batch) as follows:<\/p>\n<pre class=\"crayon-plain-tag\">lrate = initial_lrate * (1 \/ (1 + decay * iteration))<\/pre>\n<p>Where <em>lrate<\/em> is the learning rate for the current update, <em>initial_lrate<\/em> is the learning rate specified as an argument to SGD, <em>decay<\/em> is the decay rate, which is greater than zero, and <em>iteration<\/em> is the current update number.<\/p>\n<pre class=\"crayon-plain-tag\">from keras.optimizers import SGD\r\n...\r\nopt = SGD(lr=0.01, momentum=0.9, decay=0.01)\r\nmodel.compile(..., optimizer=opt)<\/pre>\n<\/p>\n<h3>Learning Rate Schedule<\/h3>\n<p>Keras supports learning rate schedules via callbacks.<\/p>\n<p>The callbacks operate separately from the optimization algorithm, although they adjust the learning rate used by the optimization algorithm. 
It is recommended to use SGD when using a learning rate schedule callback.<\/p>\n<p>Callbacks are instantiated and configured, then specified in a list to the \u201c<em>callbacks<\/em>\u201d argument of the fit() function when training the model.<\/p>\n<p>Keras provides the <a href=\"https:\/\/keras.io\/callbacks\/#reducelronplateau\">ReduceLROnPlateau<\/a> callback, which will adjust the learning rate when a plateau in model performance is detected, e.g. no change for a given number of training epochs. This callback is designed to reduce the learning rate after the model stops improving with the hope of fine-tuning model weights.<\/p>\n<p>The <em>ReduceLROnPlateau<\/em> callback requires you to specify the metric to monitor during training via the \u201c<em>monitor<\/em>\u201d argument, the value that the learning rate will be multiplied by via the \u201c<em>factor<\/em>\u201d argument, and the number of training epochs to wait before triggering the change in learning rate via the \u201c<em>patience<\/em>\u201d argument.<\/p>\n<p>For example, we can monitor the validation loss and reduce the learning rate by an order of magnitude if validation loss does not improve for 100 epochs:<\/p>\n<pre class=\"crayon-plain-tag\"># snippet of using the ReduceLROnPlateau callback\r\nfrom keras.optimizers import SGD\r\nfrom keras.callbacks import ReduceLROnPlateau\r\n...\r\nrlrop = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=100)\r\nopt = SGD()\r\nmodel.compile(..., optimizer=opt)\r\n# callbacks are passed to fit(), not compile()\r\nmodel.fit(..., callbacks=[rlrop])<\/pre>\n<p>Keras also provides the <a href=\"https:\/\/keras.io\/callbacks\/#learningratescheduler\">LearningRateScheduler<\/a> callback that allows you to specify a function that is called each epoch in order to adjust the learning rate.<\/p>\n<p>You can define your own Python function that takes two arguments (epoch and current learning rate) and returns the new learning rate.<\/p>\n<pre class=\"crayon-plain-tag\"># snippet of using the LearningRateScheduler callback\r\nfrom 
keras.optimizers import SGD\r\nfrom keras.callbacks import LearningRateScheduler\r\n...\r\n\r\ndef my_learning_rate(epoch, lrate):\r\n\treturn lrate\r\n\r\nlrs = LearningRateScheduler(my_learning_rate)\r\nopt = SGD()\r\nmodel.compile(..., optimizer=opt)\r\n# callbacks are passed to fit(), not compile()\r\nmodel.fit(..., callbacks=[lrs])<\/pre>\n<\/p>\n<h3>Adaptive Learning Rate Gradient Descent<\/h3>\n<p>Keras also provides a suite of extensions of simple stochastic gradient descent that support adaptive learning rates.<\/p>\n<p>Because each method adapts the learning rate, often one learning rate per model weight, little configuration is often required.<\/p>\n<p>Three commonly used adaptive learning rate methods include:<\/p>\n<h4>RMSProp Optimizer<\/h4>\n<\/p>\n<pre class=\"crayon-plain-tag\">from keras.optimizers import RMSprop\r\n...\r\nopt = RMSprop()\r\nmodel.compile(..., optimizer=opt)<\/pre>\n<\/p>\n<h4>Adagrad Optimizer<\/h4>\n<\/p>\n<pre class=\"crayon-plain-tag\">from keras.optimizers import Adagrad\r\n...\r\nopt = Adagrad()\r\nmodel.compile(..., optimizer=opt)<\/pre>\n<\/p>\n<h4>Adam Optimizer<\/h4>\n<\/p>\n<pre class=\"crayon-plain-tag\">from keras.optimizers import Adam\r\n...\r\nopt = Adam()\r\nmodel.compile(..., optimizer=opt)<\/pre>\n<\/p>\n<h2>Multi-Class Classification Problem<\/h2>\n<p>We will use a small multi-class classification problem as the basis to demonstrate the effect of learning rate on model performance.<\/p>\n<p>The scikit-learn library provides the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() function<\/a> that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.<\/p>\n<p>The problem has two input variables (to represent the <em>x<\/em> and <em>y<\/em> coordinates of the points) and a standard deviation of 2.0 for points within each group. 
We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)<\/pre>\n<p>The results are the input and output elements of a dataset that we can model.<\/p>\n<p>In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of blobs dataset\r\nfrom sklearn.datasets import make_blobs\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# scatter plot for each class value\r\nfor class_value in range(3):\r\n\t# select indices of points with the class label\r\n\trow_ix = where(y == class_value)\r\n\t# scatter plot for points with a different color\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1])\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot of the entire dataset. 
We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line), causing many ambiguous points.<\/p>\n<p>This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different \u201cgood enough\u201d candidate solutions.<\/p>\n<div id=\"attachment_6888\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6888\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-1.png\" alt=\"Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-1.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-1-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-1-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-1-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value<\/p>\n<\/div>\n<h2>Effect of Learning Rate and Momentum<\/h2>\n<p>In this section, we will develop a Multilayer Perceptron (MLP) model to address the blobs classification problem and investigate the effect of different learning rates and momentum.<\/p>\n<h3>Learning 
Rate Dynamics<\/h3>\n<p>The first step is to develop a function that will create the samples from the problem and split them into train and test datasets.<\/p>\n<p>Additionally, we must also one hot encode the target variable so that we can develop a model that predicts the probability of an example belonging to each class.<\/p>\n<p>The <em>prepare_data()<\/em> function below implements this behavior, returning train and test sets split into input and output elements.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy<\/pre>\n<p>Next, we can develop a function to fit and evaluate an MLP model.<\/p>\n<p>First, we will define a simple MLP model that expects two input variables from the blobs problem, has a single hidden layer with 50 nodes, and an output layer with three nodes to predict the probability for each of the three classes. Nodes in the hidden layer will use the rectified linear activation function, whereas nodes in the output layer will use the softmax activation function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))<\/pre>\n<p>We will use the stochastic gradient descent optimizer and require that the learning rate be specified so that we can evaluate different rates. 
The model will be trained to minimize cross entropy.<\/p>\n<pre class=\"crayon-plain-tag\"># compile model\r\nopt = SGD(lr=lrate)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The model will be fit for 200 training epochs, found with a little trial and error, and the test set will be used as the validation dataset so we can get an idea of the generalization error of the model during training.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)<\/pre>\n<p>Once fit, we will plot the accuracy of the model on the train and test sets over the training epochs.<\/p>\n<pre class=\"crayon-plain-tag\"># plot learning curves\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.title('lrate='+str(lrate), pad=-50)<\/pre>\n<p>The <em>fit_model()<\/em> function below ties together these elements and will fit a model and plot its performance given the train and test datasets as well as a specific learning rate to evaluate.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, lrate):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=lrate)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('lrate='+str(lrate), pad=-50)<\/pre>\n<p>We can now investigate the dynamics of different learning rates on the train and test accuracy of the 
model.<\/p>\n<p>In this example, we will evaluate learning rates on a logarithmic scale from 1E-0 (1.0) to 1E-7 and create line plots for each learning rate by calling the <em>fit_model()<\/em> function.<\/p>\n<pre class=\"crayon-plain-tag\"># create learning curves for different learning rates\r\nlearning_rates = [1E-0, 1E-1, 1E-2, 1E-3, 1E-4, 1E-5, 1E-6, 1E-7]\r\nfor i in range(len(learning_rates)):\r\n\t# determine the plot number\r\n\tplot_no = 420 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a learning rate\r\n\tfit_model(trainX, trainy, testX, testy, learning_rates[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Tying all of this together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># study of learning rate on accuracy for blobs problem\r\nfrom sklearn.datasets import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy\r\n\r\n# fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, lrate):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=lrate)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = 
model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('lrate='+str(lrate), pad=-50)\r\n\r\n# prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different learning rates\r\nlearning_rates = [1E-0, 1E-1, 1E-2, 1E-3, 1E-4, 1E-5, 1E-6, 1E-7]\r\nfor i in range(len(learning_rates)):\r\n\t# determine the plot number\r\n\tplot_no = 420 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a learning rate\r\n\tfit_model(trainX, trainy, testX, testy, learning_rates[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Running the example creates a single figure that contains eight line plots for the eight different evaluated learning rates. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.<\/p>\n<p>The plots show oscillations in behavior for the too-large learning rate of 1.0 and the inability of the model to learn anything with the too-small learning rates of 1E-6 and 1E-7.<\/p>\n<p>We can see that the model was able to learn the problem well with the learning rates 1E-1, 1E-2 and 1E-3, although successively slower as the learning rate was decreased. 
With the chosen model configuration, the results suggest that a moderate learning rate of 0.1 results in good model performance on the train and test sets.<\/p>\n<div id=\"attachment_6889\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6889\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Learning-Rates-on-the-Blobs-Classification-Problem.png\" alt=\"Line Plots of Train and Test Accuracy for a Suite of Learning Rates on the Blobs Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Learning-Rates-on-the-Blobs-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Learning-Rates-on-the-Blobs-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Learning-Rates-on-the-Blobs-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Learning-Rates-on-the-Blobs-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Train and Test Accuracy for a Suite of Learning Rates on the Blobs Classification Problem<\/p>\n<\/div>\n<h3>Momentum Dynamics<\/h3>\n<p>Momentum can smooth the progression of the learning algorithm, which, in turn, can accelerate the training process.<\/p>\n<p>We can adapt the example from the previous section to evaluate the effect of momentum with a fixed learning rate. 
In this case, we will choose the learning rate of 0.01, which in the previous section converged to a reasonable solution but required more epochs than the learning rate of 0.1.<\/p>\n<p>The <em>fit_model()<\/em> function can be updated to take a \u201c<em>momentum<\/em>\u201d argument instead of a learning rate argument, which can be used in the configuration of the SGD class and reported on the resulting plot.<\/p>\n<p>The updated version of this function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, momentum):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, momentum=momentum)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('momentum='+str(momentum), pad=-80)<\/pre>\n<p>It is common to use momentum values close to 1.0, such as 0.9 and 0.99.<\/p>\n<p>In this example, we will demonstrate the dynamics of the model without momentum compared to the model with a momentum value of 0.5 and with the higher momentum values of 0.9 and 0.99.<\/p>\n<pre class=\"crayon-plain-tag\"># create learning curves for different momentums\r\nmomentums = [0.0, 0.5, 0.9, 0.99]\r\nfor i in range(len(momentums)):\r\n\t# determine the plot number\r\n\tplot_no = 220 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a momentum\r\n\tfit_model(trainX, trainy, testX, testy, momentums[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Tying all of this together, the complete example is listed 
below.<\/p>\n<pre class=\"crayon-plain-tag\"># study of momentum on accuracy for blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy\r\n\r\n# fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, momentum):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, momentum=momentum)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('momentum='+str(momentum), pad=-80)\r\n\r\n# prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different momentums\r\nmomentums = [0.0, 0.5, 0.9, 0.99]\r\nfor i in range(len(momentums)):\r\n\t# determine the plot number\r\n\tplot_no = 220 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a momentum\r\n\tfit_model(trainX, trainy, testX, testy, momentums[i])\r\n# show learning 
curves\r\npyplot.show()<\/pre>\n<p>Running the example creates a single figure that contains four line plots for the different evaluated momentum values. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.<\/p>\n<p>We can see that the addition of momentum does accelerate the training of the model. Specifically, momentum values of 0.9 and 0.99 achieve reasonable train and test accuracy within about 50 training epochs as opposed to 200 training epochs when momentum is not used.<\/p>\n<p>In all cases where momentum is used, the accuracy of the model on the holdout test dataset appears to be more stable, showing less volatility over the training epochs.<\/p>\n<div id=\"attachment_6890\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6890\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Momentums-on-the-Blobs-Classification-Problem.png\" alt=\"Line Plots of Train and Test Accuracy for a Suite of Momentums on the Blobs Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Momentums-on-the-Blobs-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Momentums-on-the-Blobs-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Momentums-on-the-Blobs-Classification-Problem-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Momentums-on-the-Blobs-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Train and Test Accuracy for a Suite of Momentums on the Blobs Classification Problem<\/p>\n<\/div>\n<h2>Effect of Learning Rate Schedules<\/h2>\n<p>We will look at two learning rate schedules in this section.<\/p>\n<p>The first is the decay built into the SGD class and the second is the <em>ReduceLROnPlateau<\/em> callback.<\/p>\n<h3>Learning Rate Decay<\/h3>\n<p>The <em>SGD<\/em> class provides the \u201c<em>decay<\/em>\u201d argument that specifies the learning rate decay.<\/p>\n<p>It may not be clear from the equation or the code as to the effect that this decay has on the learning rate over updates. We can make this clearer with a worked example.<\/p>\n<p>The function below implements the learning rate decay as implemented in the <a href=\"https:\/\/github.com\/keras-team\/keras\/blob\/master\/keras\/optimizers.py\">SGD class<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\"># learning rate decay\r\ndef decay_lrate(initial_lrate, decay, iteration):\r\n\treturn initial_lrate * (1.0 \/ (1.0 + decay * iteration))<\/pre>\n<p>We can use this function to calculate the learning rate over multiple updates with different decay values.<\/p>\n<p>We will compare a range of decay values [1E-1, 1E-2, 1E-3, 1E-4] with an initial learning rate of 0.01 and 200 weight updates.<\/p>\n<pre class=\"crayon-plain-tag\">decays = [1E-1, 1E-2, 1E-3, 1E-4]\r\nlrate = 0.01\r\nn_updates = 200\r\nfor decay in decays:\r\n\t# calculate learning rates for updates\r\n\tlrates = [decay_lrate(lrate, decay, i) for i in range(n_updates)]\r\n\t# plot result\r\n\tpyplot.plot(lrates, label=str(decay))<\/pre>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># demonstrate the 
effect of decay on the learning rate\r\nfrom matplotlib import pyplot\r\n\r\n# learning rate decay\r\ndef decay_lrate(initial_lrate, decay, iteration):\r\n\treturn initial_lrate * (1.0 \/ (1.0 + decay * iteration))\r\n\r\ndecays = [1E-1, 1E-2, 1E-3, 1E-4]\r\nlrate = 0.01\r\nn_updates = 200\r\nfor decay in decays:\r\n\t# calculate learning rates for updates\r\n\tlrates = [decay_lrate(lrate, decay, i) for i in range(n_updates)]\r\n\t# plot result\r\n\tpyplot.plot(lrates, label=str(decay))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example creates a line plot showing learning rates over updates for different decay values.<\/p>\n<p>We can see that in all cases, the learning rate starts at the initial value of 0.01. We can see that a small decay value of 1E-4 (red) has almost no effect, whereas a large decay value of 1E-1 (blue) has a dramatic effect, reducing the learning rate to below 0.002 within 50 updates (roughly one-sixth of the initial value) and arriving at a final value of about 0.0005 (roughly 20 times smaller than the initial value).<\/p>\n<p>We can see that the change to the learning rate is not linear. Because the decay is applied per weight update, and an update is performed after each batch, the change to the learning rate over an epoch also depends on the batch size. 
In the example from the previous section, a default batch size of 32 across 500 examples results in 16 updates per epoch and 3,200 updates across the 200 epochs.<\/p>\n<p>Using a decay of 0.1 and an initial learning rate of 0.01, we can calculate the final learning rate to be a tiny value of about 3.1E-05.<\/p>\n<div id=\"attachment_6891\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6891\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-the-Effect-of-Decay-on-Learning-Rate-over-Multiple-Weight-Updates.png\" alt=\"Line Plot of the Effect of Decay on Learning Rate Over Multiple Weight Updates\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-the-Effect-of-Decay-on-Learning-Rate-over-Multiple-Weight-Updates.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-the-Effect-of-Decay-on-Learning-Rate-over-Multiple-Weight-Updates-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-the-Effect-of-Decay-on-Learning-Rate-over-Multiple-Weight-Updates-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-the-Effect-of-Decay-on-Learning-Rate-over-Multiple-Weight-Updates-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of the Effect of Decay on Learning Rate Over Multiple Weight Updates<\/p>\n<\/div>\n<p>We can update the example from the previous section to evaluate the dynamics of different learning rate decay values.<\/p>\n<p>Fixing the learning rate at 0.01 and not using momentum, we would expect that a very small learning rate decay would be preferred, as a large learning rate decay would rapidly 
result in a learning rate that is too small for the model to learn effectively.<\/p>\n<p>The <em>fit_model()<\/em> function can be updated to take a \u201c<em>decay<\/em>\u201d argument that can be used to configure decay for the SGD class.<\/p>\n<p>The updated version of the function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, decay):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, decay=decay)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('decay='+str(decay), pad=-80)<\/pre>\n<p>We can evaluate the same four decay values of [1E-1, 1E-2, 1E-3, 1E-4] and their effect on model accuracy.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># study of decay rate on accuracy for blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn 
trainX, trainy, testX, testy\r\n\r\n# fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, decay):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, decay=decay)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('decay='+str(decay), pad=-80)\r\n\r\n# prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different decay rates\r\ndecay_rates = [1E-1, 1E-2, 1E-3, 1E-4]\r\nfor i in range(len(decay_rates)):\r\n\t# determine the plot number\r\n\tplot_no = 220 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a decay rate\r\n\tfit_model(trainX, trainy, testX, testy, decay_rates[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Running the example creates a single figure that contains four line plots for the different evaluated learning rate decay values. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.<\/p>\n<p>We can see that the large decay values of 1E-1 and 1E-2 indeed decay the learning rate too rapidly for this model on this problem and result in poor performance. The smaller decay values do result in better performance, with the value of 1E-4 perhaps producing a result similar to not using decay at all. 
In fact, we can calculate the final learning rate with a decay of 1E-4 to be about 0.0075, only a little bit smaller than the initial value of 0.01.<\/p>\n<div id=\"attachment_6892\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6892\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Decay-Rates-on-the-Blobs-Classification-Problem.png\" alt=\"Line Plots of Train and Test Accuracy for a Suite of Decay Rates on the Blobs Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Decay-Rates-on-the-Blobs-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Decay-Rates-on-the-Blobs-Classification-Problem-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Decay-Rates-on-the-Blobs-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Decay-Rates-on-the-Blobs-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Train and Test Accuracy for a Suite of Decay Rates on the Blobs Classification Problem<\/p>\n<\/div>\n<h3>Drop Learning Rate on Plateau<\/h3>\n<p>The <em>ReduceLROnPlateau<\/em> will drop the learning rate by a factor after no change in a monitored metric for a given number of epochs.<\/p>\n<p>We can explore the effect of different \u201c<em>patience<\/em>\u201d values, which is the number of epochs to wait for a 
change before dropping the learning rate. We will use the default learning rate of 0.01 and drop the learning rate by an order of magnitude by setting the \u201c<em>factor<\/em>\u201d argument to 0.1.<\/p>\n<pre class=\"crayon-plain-tag\">rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=patience, min_delta=1E-7)<\/pre>\n<p>It will be interesting to review the effect on the learning rate over the training epochs. We can do that by creating a new Keras Callback that is responsible for recording the learning rate at the end of each training epoch. We can then retrieve the recorded learning rates and create a line plot to see how the learning rate was affected by drops.<\/p>\n<p>We can create a custom <em>Callback<\/em> called <em>LearningRateMonitor<\/em>. The <em>on_train_begin()<\/em> function is called at the start of training, and in it we can define an empty list of learning rates. The <em>on_epoch_end()<\/em> function is called at the end of each training epoch and in it we can retrieve the optimizer and the current learning rate from the optimizer and store it in the list. 
The complete <em>LearningRateMonitor<\/em> callback is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># monitor the learning rate\r\nclass LearningRateMonitor(Callback):\r\n\t# start of training\r\n\tdef on_train_begin(self, logs={}):\r\n\t\tself.lrates = list()\r\n\r\n\t# end of each training epoch\r\n\tdef on_epoch_end(self, epoch, logs={}):\r\n\t\t# get and store the learning rate\r\n\t\toptimizer = self.model.optimizer\r\n\t\tlrate = float(backend.get_value(optimizer.lr))\r\n\t\tself.lrates.append(lrate)<\/pre>\n<p>The <em>fit_model()<\/em> function developed in the previous sections can be updated to create and configure the <em>ReduceLROnPlateau<\/em> callback and our new <em>LearningRateMonitor<\/em> callback and register them with the model in the call to fit.<\/p>\n<p>The function will also take \u201c<em>patience<\/em>\u201d as an argument so that we can evaluate different values.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nrlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=patience, min_delta=1E-7)\r\nlrm = LearningRateMonitor()\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, callbacks=[rlrp, lrm])<\/pre>\n<p>We will want to create a few plots in this example, so instead of creating subplots directly, the <em>fit_model()<\/em> function will return the list of learning rates as well as the loss and accuracy on the training dataset for each training epoch.<\/p>\n<p>The function with these updates is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and return learning rates, loss and accuracy\r\ndef fit_model(trainX, trainy, testX, testy, patience):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit 
model\r\n\trlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=patience, min_delta=1E-7)\r\n\tlrm = LearningRateMonitor()\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, callbacks=[rlrp, lrm])\r\n\treturn lrm.lrates, history.history['loss'], history.history['acc']<\/pre>\n<p>The patience in the <em>ReduceLROnPlateau<\/em> controls how often the learning rate will be dropped.<\/p>\n<p>We will test a few different patience values suited for this model on the blobs problem and keep track of the learning rate, loss, and accuracy series from each run.<\/p>\n<pre class=\"crayon-plain-tag\"># create learning curves for different patiences\r\npatiences = [2, 5, 10, 15]\r\nlr_list, loss_list, acc_list = list(), list(), list()\r\nfor i in range(len(patiences)):\r\n\t# fit model and plot learning curves for a patience\r\n\tlr, loss, acc = fit_model(trainX, trainy, testX, testy, patiences[i])\r\n\tlr_list.append(lr)\r\n\tloss_list.append(loss)\r\n\tacc_list.append(acc)<\/pre>\n<p>At the end of the run, we will create figures with line plots of the learning rates, training loss, and training accuracy for each patience value.<\/p>\n<p>We can create a helper function to easily create a figure with subplots for each series that we have recorded.<\/p>\n<pre class=\"crayon-plain-tag\"># create line plots for a series\r\ndef line_plots(patiences, series):\r\n\tfor i in range(len(patiences)):\r\n\t\tpyplot.subplot(220 + (i+1))\r\n\t\tpyplot.plot(series[i])\r\n\t\tpyplot.title('patience='+str(patiences[i]), pad=-80)\r\n\tpyplot.show()<\/pre>\n<p>Tying these elements together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># study of patience for the learning rate drop schedule on the blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import 
SGD\r\nfrom keras.utils import to_categorical\r\nfrom keras.callbacks import Callback\r\nfrom keras.callbacks import ReduceLROnPlateau\r\nfrom keras import backend\r\nfrom matplotlib import pyplot\r\n\r\n# monitor the learning rate\r\nclass LearningRateMonitor(Callback):\r\n\t# start of training\r\n\tdef on_train_begin(self, logs={}):\r\n\t\tself.lrates = list()\r\n\r\n\t# end of each training epoch\r\n\tdef on_epoch_end(self, epoch, logs={}):\r\n\t\t# get and store the learning rate\r\n\t\toptimizer = self.model.optimizer\r\n\t\tlrate = float(backend.get_value(optimizer.lr))\r\n\t\tself.lrates.append(lrate)\r\n\r\n# prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy\r\n\r\n# fit a model and return learning rates, loss and accuracy\r\ndef fit_model(trainX, trainy, testX, testy, patience):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\trlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=patience, min_delta=1E-7)\r\n\tlrm = LearningRateMonitor()\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, callbacks=[rlrp, lrm])\r\n\treturn lrm.lrates, history.history['loss'], history.history['acc']\r\n\r\n# create line plots for a series\r\ndef line_plots(patiences, series):\r\n\tfor i in range(len(patiences)):\r\n\t\tpyplot.subplot(220 + 
(i+1))\r\n\t\tpyplot.plot(series[i])\r\n\t\tpyplot.title('patience='+str(patiences[i]), pad=-80)\r\n\tpyplot.show()\r\n\r\n# prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different patiences\r\npatiences = [2, 5, 10, 15]\r\nlr_list, loss_list, acc_list = list(), list(), list()\r\nfor i in range(len(patiences)):\r\n\t# fit model and plot learning curves for a patience\r\n\tlr, loss, acc = fit_model(trainX, trainy, testX, testy, patiences[i])\r\n\tlr_list.append(lr)\r\n\tloss_list.append(loss)\r\n\tacc_list.append(acc)\r\n# plot learning rates\r\nline_plots(patiences, lr_list)\r\n# plot loss\r\nline_plots(patiences, loss_list)\r\n# plot accuracy\r\nline_plots(patiences, acc_list)<\/pre>\n<p>Running the example creates three figures, each containing a line plot for each of the patience values.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.<\/p>\n<p>The first figure shows line plots of the learning rate over the training epochs for each of the evaluated patience values. 
We can see that the smallest patience value of two rapidly drops the learning rate to a minimum value within 25 epochs, whereas the largest patience of 15 results in only one drop in the learning rate.<\/p>\n<p>From these plots, we would expect the patience values of 5 and 10 for this model on this problem to result in better performance as they allow the larger learning rate to be used for some time before dropping the rate to refine the weights.<\/p>\n<div id=\"attachment_6893\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6893\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Learning-Rate-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule.png\" alt=\"Line Plots of Learning Rate Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Learning-Rate-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Learning-Rate-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Learning-Rate-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Learning-Rate-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Learning Rate Over Epochs for Different Patience Values Used in the 
ReduceLROnPlateau Schedule<\/p>\n<\/div>\n<p>The next figure shows the loss on the training dataset for each of the patience values.<\/p>\n<p>The plot shows that the patience values of 2 and 5 result in a rapid convergence of the model, perhaps to a sub-optimal loss value. In the case of patience values of 10 and 15, the loss decreases steadily until the learning rate is dropped to a level at which large changes to the loss are no longer seen. This occurs about halfway through the run for a patience of 10 and near the end of the run for a patience of 15.<\/p>\n<div id=\"attachment_6895\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6895\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Loss-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule.png\" alt=\"Line Plots of Training Loss Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Loss-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Loss-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Loss-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Loss-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line 
Plots of Training Loss Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule<\/p>\n<\/div>\n<p>The final figure shows the training set accuracy over training epochs for each patience value.<\/p>\n<p>We can see that the small patience values of 2 and 5 epochs indeed result in premature convergence to a less-than-optimal model, at around 65% and less than 75% accuracy, respectively. The larger patience values result in better-performing models, with the patience of 10 showing convergence just before 150 epochs, whereas the patience of 15 continues to show volatile accuracy given the nearly unchanged learning rate.<\/p>\n<p>These plots show how a learning rate that is decreased in a sensible way for the problem and chosen model configuration can result in both a skillful and converged stable set of final weights, a desirable property in a final model at the end of a training run.<\/p>\n<div id=\"attachment_6896\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6896\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Accuracy-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule.png\" alt=\"Line Plots of Training Accuracy Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Accuracy-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Accuracy-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Accuracy-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Training-Accuracy-Over-Epochs-for-Different-Patience-Values-Used-in-the-ReduceLROnPlateau-Schedule-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Training Accuracy Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule<\/p>\n<\/div>\n<h2>Effect of Adaptive Learning Rates<\/h2>\n<p>Learning rates and learning rate schedules are both challenging to configure and critical to the performance of a deep learning neural network model.<\/p>\n<p>Keras provides a number of different popular variations of stochastic gradient descent with adaptive learning rates, such as:<\/p>\n<ul>\n<li>Adaptive Gradient Algorithm (AdaGrad).<\/li>\n<li>Root Mean Square Propagation (RMSprop).<\/li>\n<li>Adaptive Moment Estimation (Adam).<\/li>\n<\/ul>\n<p>Each provides a different methodology for adapting learning rates for each weight in the network.<\/p>\n<p>There is no single best algorithm, and the results of racing optimization algorithms on one problem are unlikely to be transferable to new problems.<\/p>\n<p>We can study the dynamics of different adaptive learning rate methods on the blobs problem. The <em>fit_model()<\/em> function can be updated to take the name of an optimization algorithm to evaluate, which can be specified to the \u201c<em>optimizer<\/em>\u201d argument when the MLP model is compiled. The default parameters for each method will then be used. 
The updated version of the function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, optimizer):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('opt='+optimizer, pad=-80)<\/pre>\n<p>We can explore the three popular methods of RMSprop, AdaGrad and Adam and compare their behavior to simple stochastic gradient descent with a static learning rate.<\/p>\n<p>We would expect the adaptive learning rate versions of the algorithm to perform similarly or better, perhaps adapting to the problem in fewer training epochs, but importantly, to result in a more stable model.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different optimizers\r\noptimizers = ['sgd', 'rmsprop', 'adagrad', 'adam']\r\nfor i in range(len(optimizers)):\r\n\t# determine the plot number\r\n\tplot_no = 220 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for an optimizer\r\n\tfit_model(trainX, trainy, testX, testy, optimizers[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Tying these elements together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># study of sgd with adaptive learning rates in the blobs problem\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import 
Sequential\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy\r\n\r\n# fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, optimizer):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('opt='+optimizer, pad=-80)\r\n\r\n# prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different optimizers\r\noptimizers = ['sgd', 'rmsprop', 'adagrad', 'adam']\r\nfor i in range(len(optimizers)):\r\n\t# determine the plot number\r\n\tplot_no = 220 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for an optimizer\r\n\tfit_model(trainX, trainy, testX, testy, optimizers[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Running the example creates a single figure that contains four line plots for the different evaluated optimization algorithms. 
Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.<\/p>\n<p>Again, we can see that SGD with a default learning rate of 0.01 and no momentum does learn the problem, but requires nearly all 200 epochs and results in volatile accuracy on the training data and much more so on the test dataset. The plots show that all three adaptive learning rate methods learn the problem faster and with dramatically less volatility in train and test set accuracy.<\/p>\n<p>Both RMSprop and Adam demonstrate similar performance, effectively learning the problem within 50 training epochs and spending the remaining training time making very minor weight updates, but not converging as we saw with the learning rate schedules in the previous section.<\/p>\n<div id=\"attachment_6897\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6897\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Adaptive-Learning-Rate-Methods-on-the-Blobs-Classification-Problem.png\" alt=\"Line Plots of Train and Test Accuracy for a Suite of Adaptive Learning Rate Methods on the Blobs Classification Problem\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Adaptive-Learning-Rate-Methods-on-the-Blobs-Classification-Problem.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Adaptive-Learning-Rate-Methods-on-the-Blobs-Classification-Problem-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Adaptive-Learning-Rate-Methods-on-the-Blobs-Classification-Problem-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Train-and-Test-Accuracy-for-a-Suite-of-Adaptive-Learning-Rate-Methods-on-the-Blobs-Classification-Problem-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Train and Test Accuracy for a Suite of Adaptive Learning Rate Methods on the Blobs Classification Problem<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1206.5533\">Practical recommendations for gradient-based training of deep architectures<\/a>, 2012.<\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li>Chapter 8: Optimization for Training Deep Models, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/li>\n<li>Chapter 6: Learning Rate and Momentum, <a href=\"https:\/\/amzn.to\/2S8qRdI\">Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks<\/a>, 1999.<\/li>\n<li>Section 5.7: Gradient descent, <a href=\"https:\/\/amzn.to\/2S8qdwt\">Neural Networks for Pattern Recognition<\/a>, 1995.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/keras.io\/optimizers\/\">Keras Optimizers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/callbacks\/\">Keras Callbacks API<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/keras-team\/keras\/blob\/master\/keras\/optimizers.py\">optimizers.py Keras Source Code<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Stochastic_gradient_descent\">Stochastic gradient descent, Wikipedia<\/a>.<\/li>\n<li><a 
href=\"ftp:\/\/ftp.sas.com\/pub\/neural\/FAQ2.html#A_learn_rate\">What learning rate should be used for backprop?, Neural Network FAQ<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered the effects of the learning rate, learning rate schedules, and adaptive learning rates on model performance.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How large learning rates result in unstable training and tiny rates result in a failure to train.<\/li>\n<li>Momentum can accelerate training and learning rate schedules can help to converge the optimization process.<\/li>\n<li>Adaptive learning rates can accelerate training and alleviate some of the pressure of choosing a learning rate and learning rate schedule.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks\/\">Understand the Impact of Learning Rate on Model Performance With Deep Learning Neural Networks<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm. 
The learning rate is a hyperparameter that controls how [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/24\/understand-the-impact-of-learning-rate-on-model-performance-with-deep-learning-neural-networks\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1624,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1623"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1623"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1623\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1624"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1623"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1623"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1623"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}