{"id":1605,"date":"2019-01-20T18:00:40","date_gmt":"2019-01-20T18:00:40","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/20\/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size\/"},"modified":"2019-01-20T18:00:40","modified_gmt":"2019-01-20T18:00:40","slug":"how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/20\/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size\/","title":{"rendered":"How to Control the Speed and Stability of Training Neural Networks With Gradient Descent Batch Size"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Neural networks are trained using gradient descent where the estimate of the error used to update the weights is calculated based on a subset of the training dataset.<\/p>\n<p>The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.<\/p>\n<p>It is important to explore the dynamics of your model to ensure that you\u2019re getting the most out of it.<\/p>\n<p>In this tutorial, you will discover three different flavors of gradient descent and how to explore and diagnose the effect of batch size on the learning process.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Batch size controls the accuracy of the estimate of the error gradient when training neural networks.<\/li>\n<li>Batch, Stochastic, and Minibatch gradient descent are the three main flavors of the learning algorithm.<\/li>\n<li>There is a tension between batch size and the speed and stability of the learning process.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6876\" style=\"width: 650px\" class=\"wp-caption 
aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6876\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/01\/How-to-Control-the-Speed-and-Stability-of-Training-Neural-Networks-With-Gradient-Descent-Batch-Size.jpg\" alt=\"How to Control the Speed and Stability of Training Neural Networks With Gradient Descent Batch Size\" width=\"640\" height=\"427\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Control-the-Speed-and-Stability-of-Training-Neural-Networks-With-Gradient-Descent-Batch-Size.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Control-the-Speed-and-Stability-of-Training-Neural-Networks-With-Gradient-Descent-Batch-Size-300x200.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Control the Speed and Stability of Training Neural Networks With Gradient Descent Batch Size<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/chodhound\/34643497066\/\">Adrian Scottow<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into seven parts; they are:<\/p>\n<ol>\n<li>Batch Size and Gradient Descent<\/li>\n<li>Stochastic, Batch, and Minibatch Gradient Descent in Keras<\/li>\n<li>Multi-Class Classification Problem<\/li>\n<li>MLP Fit With Batch Gradient Descent<\/li>\n<li>MLP Fit With Stochastic Gradient Descent<\/li>\n<li>MLP Fit With Minibatch Gradient Descent<\/li>\n<li>Effect of Batch Size on Model Behavior<\/li>\n<\/ol>\n<h2>Batch Size and Gradient Descent<\/h2>\n<p>Neural networks are trained using the stochastic gradient descent optimization algorithm.<\/p>\n<p>This involves using the current state of the model to make a prediction, comparing the prediction to the expected values, and using the difference as an estimate of the error gradient. 
This error gradient is then used to update the model weights and the process is repeated.<\/p>\n<p>The error gradient is a statistical estimate. The more training examples used in the estimate, the more accurate this estimate will be and the more likely that the weights of the network will be adjusted in a way that will improve the performance of the model. The improved estimate of the error gradient comes at the cost of having to use the model to make many more predictions before the estimate can be calculated, and in turn, the weights updated.<\/p>\n<blockquote>\n<p>Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch.<\/p>\n<\/blockquote>\n<p>\u2014 Page 278, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>Alternately, using fewer examples results in a less accurate estimate of the error gradient that is highly dependent on the specific training examples used.<\/p>\n<p>This results in a noisy estimate that, in turn, results in noisy updates to the model weights, e.g. many updates with perhaps quite different estimates of the error gradient. Nevertheless, these noisy updates can result in faster learning and sometimes a more robust model.<\/p>\n<blockquote>\n<p>Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. 
The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.<\/p>\n<\/blockquote>\n<p>\u2014 Page 278, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/p>\n<p>The number of training examples used in the estimate of the error gradient is a hyperparameter for the learning algorithm called the \u201c<em>batch size<\/em>,\u201d or simply the \u201c<em>batch<\/em>.\u201d<\/p>\n<p>A batch size of 32 means that 32 samples from the training dataset will be used to estimate the error gradient before the model weights are updated. One training epoch means that the learning algorithm has made one pass through the training dataset, where examples were separated into randomly selected \u201c<em>batch size<\/em>\u201d groups.<\/p>\n<p>Historically, a training algorithm where the batch size is set to the total number of training examples is called \u201c<em>batch gradient descent<\/em>\u201d and a training algorithm where the batch size is set to 1 training example is called \u201c<em>stochastic gradient descent<\/em>\u201d or \u201c<em>online gradient descent<\/em>.\u201d<\/p>\n<p>A configuration of the batch size anywhere in between (e.g. more than 1 example and less than the number of examples in the training dataset) is called \u201c<em>minibatch gradient descent<\/em>.\u201d<\/p>\n<ul>\n<li><strong>Batch Gradient Descent<\/strong>. Batch size is set to the total number of examples in the training dataset.<\/li>\n<li><strong>Stochastic Gradient Descent<\/strong>. Batch size is set to one.<\/li>\n<li><strong>Minibatch Gradient Descent<\/strong>. Batch size is set to more than one and less than the total number of examples in the training dataset.<\/li>\n<\/ul>\n<p>For shorthand, the algorithm is often referred to as stochastic gradient descent regardless of the batch size. 
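The three definitions above can be sketched in a few lines of plain Python. This is only an illustrative helper (the function names are mine, not from any library): given a batch size and a training set size, it names the flavor of gradient descent and counts the weight updates performed per epoch.

```python
import math

def gd_flavor(batch_size, n_train):
    """Name the gradient descent flavor implied by a batch size."""
    if batch_size == n_train:
        return "batch"
    if batch_size == 1:
        return "stochastic"
    return "minibatch"

def updates_per_epoch(batch_size, n_train):
    """One weight update per batch; a partial final batch still counts."""
    return math.ceil(n_train / batch_size)

n = 1000
for b in (1, 32, n):
    print(b, gd_flavor(b, n), updates_per_epoch(b, n))
# 1 stochastic 1000
# 32 minibatch 32
# 1000 batch 1
```

Note the trade-off visible in the output: one pass over the same 1,000 examples yields anywhere from 1 to 1,000 weight updates depending on the batch size.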
Given that very large datasets are often used to train deep learning neural networks, the batch size is rarely set to the size of the training dataset.<\/p>\n<p>Smaller batch sizes are used for two main reasons:<\/p>\n<ul>\n<li>Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error.<\/li>\n<li>Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU).<\/li>\n<\/ul>\n<p>A third reason is that the batch size is often set at something small, such as 32 examples, and is not tuned by the practitioner. Small batch sizes such as 32 do work well generally.<\/p>\n<blockquote>\n<p>\u2026 [batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1206.5533\">Practical recommendations for gradient-based training of deep architectures<\/a>, 2012.<\/p>\n<blockquote>\n<p>The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4.<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1804.07612\">Revisiting Small Batch Training for Deep Neural Networks<\/a>, 2018.<\/p>\n<p>Nevertheless, the batch size impacts how quickly a model learns and the stability of the learning process. 
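The noise argument can be checked numerically: the spread of a mean estimated from b samples shrinks roughly as 1/sqrt(b), so smaller batches give noisier gradient estimates. A stdlib-only sketch with synthetic per-example "gradients" (no neural network involved; the setup is purely illustrative):

```python
import random
from statistics import pstdev

random.seed(1)
# synthetic per-example "gradients": true value 0.5 plus unit Gaussian noise
grads = [0.5 + random.gauss(0.0, 1.0) for _ in range(20000)]

def estimate_noise(batch_size, n_batches=1000):
    """Spread of the mean-gradient estimate across random batches."""
    means = []
    for _ in range(n_batches):
        batch = random.sample(grads, batch_size)
        means.append(sum(batch) / batch_size)
    return pstdev(means)

small, large = estimate_noise(4), estimate_noise(256)
print(small > large)  # the batch-size-4 estimate is far noisier
```

The batch-size-4 estimates scatter around the true gradient roughly eight times more widely than the batch-size-256 estimates, which is the noise that acts as a regularizer in training.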
It is an important hyperparameter that should be well understood and tuned by the deep learning practitioner.<\/p>\n<h2>Stochastic, Batch, and Minibatch Gradient Descent in Keras<\/h2>\n<p>Keras allows you to train your model using stochastic, batch, or minibatch gradient descent.<\/p>\n<p>This can be achieved by setting the batch_size argument on the call to the <em>fit()<\/em> function when training your model.<\/p>\n<p>Let\u2019s take a look at each approach in turn.<\/p>\n<h3>Stochastic Gradient Descent in Keras<\/h3>\n<p>The example below sets the batch_size argument to 1 for stochastic gradient descent.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.fit(trainX, trainy, batch_size=1)<\/pre>\n<h3>Batch Gradient Descent in Keras<\/h3>\n<p>The example below sets the batch_size
argument to the number of samples in the training dataset for batch gradient descent.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.fit(trainX, trainy, batch_size=len(trainX))<\/pre>\n<h3>Minibatch Gradient Descent in Keras<\/h3>\n<p>The example below uses the default batch size of 32 for the <em>batch_size<\/em> argument, which is more than 1 (as used in stochastic gradient descent) and less than the size of your training dataset (as used in batch gradient descent).<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.fit(trainX, trainy)<\/pre>\n<p>Alternately, the batch_size can be set to something other than 1 or the number of samples in the training dataset, such as 64.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel.fit(trainX, trainy, batch_size=64)<\/pre>\n<h2>Multi-Class Classification Problem<\/h2>\n<p>We will use a small multi-class classification problem as the basis to demonstrate the effect of batch size on learning.<\/p>\n<p>The scikit-learn library provides the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() function<\/a> that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.<\/p>\n<p>The problem can be configured to have two input variables (to represent the <em>x<\/em> and <em>y<\/em> coordinates of the points) and a standard deviation of 2.0 for points within each group. 
We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)<\/pre>\n<p>The results are the input and output elements of a dataset that we can model.<\/p>\n<p>In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of blobs dataset\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# scatter plot for each class value\r\nfor class_value in range(3):\r\n\t# select indices of points with the class label\r\n\trow_ix = where(y == class_value)\r\n\t# scatter plot for points with a different color\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1])\r\n# show plot\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot of the entire dataset. 
We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line) causing many ambiguous points.<\/p>\n<p>This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different \u201c<em>good enough<\/em>\u201d candidate solutions.<\/p>\n<div id=\"attachment_6870\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6870\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value.png\" alt=\"Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Blobs-Dataset-with-Three-Classes-and-Points-Colored-by-Class-Value-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value<\/p>\n<\/div>\n<h2>MLP Fit With Batch Gradient Descent<\/h2>\n<p>We can develop a Multilayer Perceptron model (MLP) to address the multi-class classification problem described in the previous section and train it using batch gradient descent.<\/p>\n<p>Firstly, we need 
to one hot encode the target variable, transforming the integer class values into binary vectors. This will allow the model to predict the probability of each example belonging to each of the three classes, providing more nuance in the predictions and context when training the model.<\/p>\n<pre class=\"crayon-plain-tag\"># one hot encode output variable\r\ny = to_categorical(y)<\/pre>\n<p>Next, we will split the training dataset of 1,000 examples into a train and test dataset with 500 examples each.<\/p>\n<p>This even split will allow us to evaluate and compare the performance of different configurations of the batch size on the model and its performance.<\/p>\n<pre class=\"crayon-plain-tag\"># split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>We will define an MLP model with an input layer that expects two input variables, for the two variables in the dataset.<\/p>\n<p>The model will have a single hidden layer with 50 nodes and a rectified linear activation function and He random weight initialization. Finally, the output layer has 3 nodes in order to make predictions for the three classes and a softmax activation function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))<\/pre>\n<p>We will optimize the model with stochastic gradient descent and use categorical cross entropy to calculate the error of the model during training.<\/p>\n<p>In this example, we will use \u201c<em>batch gradient descent<\/em>\u201c, meaning that the batch size will be set to the size of the training dataset. 
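As an aside on the one hot encoding step above: to_categorical maps each integer class label to a binary vector with a 1 at the index of the class. A plain-Python sketch of that mapping (an illustrative stand-in, not the Keras implementation):

```python
def one_hot(labels, num_classes):
    """Map integer class labels to binary vectors, e.g. 2 -> [0, 0, 1]."""
    vectors = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        vectors.append(row)
    return vectors

print(one_hot([0, 2, 1], 3))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```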
The model will be fit for 200 training epochs and the test dataset will be used as the validation set in order to monitor the performance of the model on a holdout set during training.<\/p>\n<p>The effect will be more time between weight updates and we would expect faster training than other batch sizes, and more stable estimates of the gradient, which should result in a more stable performance of the model during training.<\/p>\n<pre class=\"crayon-plain-tag\"># compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=len(trainX))<\/pre>\n<p>Once the model is fit, the performance is evaluated and reported on the train and test datasets.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))<\/pre>\n<p>A line plot is created showing the train and test set accuracy of the model for each training epoch.<\/p>\n<p>These learning curves provide an indication of three things: how quickly the model learns the problem, how well it has learned the problem, and how noisy the updates were to the model during training.<\/p>\n<pre class=\"crayon-plain-tag\"># plot training history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying these elements together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs problem with batch gradient descent\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import 
to_categorical\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=len(trainX))\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first reports the performance of the model on the train and test datasets.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.<\/p>\n<p>In this case, we can see that performance was similar between the train and test sets with 81% and 83% respectively.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.816, Test: 0.830<\/pre>\n<p>A line plot of model classification accuracy on the train (blue) and test (orange) dataset is created. 
We can see that the model is relatively slow to learn this problem, converging on a solution after about 100 epochs after which changes in model performance are minor.<\/p>\n<div id=\"attachment_6871\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6871\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Batch-Gradient-Descent-.png\" alt=\"Line Plot of Classification Accuracy on Train and Tests Sets of an MLP Fit With Batch Gradient Descent\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Batch-Gradient-Descent-.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Batch-Gradient-Descent--300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Batch-Gradient-Descent--768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Batch-Gradient-Descent--1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Classification Accuracy on Train and Tests Sets of an MLP Fit With Batch Gradient Descent<\/p>\n<\/div>\n<h2>MLP Fit With Stochastic Gradient Descent<\/h2>\n<p>The example of batch gradient descent from the previous section can be updated to instead use stochastic gradient descent.<\/p>\n<p>This requires changing the batch size from the size of the training dataset to 1.<\/p>\n<pre 
class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=1)<\/pre>\n<p>Stochastic gradient descent requires that the model make a prediction and have the weights updated for each training example. This has the effect of dramatically slowing down the training process as compared to batch gradient descent.<\/p>\n<p>The expectation of this change is that the model learns faster and that changes to the model are noisy, resulting, in turn, in noisy performance over training epochs.<\/p>\n<p>The complete example with this change is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs problem with stochastic gradient descent\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=1)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training 
history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first reports the performance of the model on the train and test datasets.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.<\/p>\n<p>In this case, we can see that performance was similar between the train and test sets, around 60% accuracy, but was dramatically worse (about 20 percentage points) than using batch gradient descent.<\/p>\n<p>At least for this problem and the chosen model and model configuration, stochastic (online) gradient descent is not appropriate.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.612, Test: 0.606<\/pre>\n<p>A line plot of model classification accuracy on the train (blue) and test (orange) dataset is created.<\/p>\n<p>The plot shows the unstable nature of the training process with the chosen configuration. 
The poor performance and violent changes to the model suggest that the learning rate used to update weights after each training example may be too large and that a smaller learning rate may make the learning process more stable.<\/p>\n<div id=\"attachment_6872\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6872\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent.png\" alt=\"Line Plot of Classification Accuracy on Train and Tests Sets of an MLP Fit With Stochastic Gradient Descent\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Classification Accuracy on Train and Tests Sets of an MLP Fit With Stochastic Gradient Descent<\/p>\n<\/div>\n<p>We can test this by re-running the model fit with stochastic gradient descent and a smaller learning rate. 
For example, we can drop the learning rate by an order of magnitude from 0.01 to 0.001.<\/p>\n<pre class=\"crayon-plain-tag\"># compile model\r\nopt = SGD(lr=0.001, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The full code listing with this change is provided below for completeness.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs problem with stochastic gradient descent\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.001, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=1)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running this example tells a very different story.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning 
algorithm; consider running the example a few times.<\/p>\n<p>The reported performance is greatly improved, achieving classification accuracy on the train and test sets on par with the fit using batch gradient descent.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.816, Test: 0.824<\/pre>\n<p>The line plot shows the expected behavior. Namely, the model rapidly learns the problem as compared to batch gradient descent, leaping up to about 80% accuracy in about 25 epochs rather than the 100 epochs seen when using batch gradient descent. We could have stopped training at epoch 50 instead of epoch 200 due to the faster training.<\/p>\n<p>This is not surprising. With batch gradient descent, 100 epochs involved 100 estimates of error and 100 weight updates. In stochastic gradient descent, 25 epochs involved (500 * 25) or 12,500 weight updates, providing 125 times more feedback, albeit noisier feedback, about how to improve the model.<\/p>\n<p>The line plot also shows that train and test performance remain comparable during training, as compared to the dynamics with batch gradient descent where the performance on the test set was slightly better and remained so throughout training.<\/p>\n<p>Unlike batch gradient descent, we can see that the noisy updates result in noisy performance throughout training. 
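The update-count arithmetic above can be checked directly in plain Python, using the epoch counts and training set size from the text:

```python
# feedback (weight updates) received by each configuration in the text
n_train = 500
batch_gd_updates = 100 * 1     # 100 epochs x 1 update per epoch (batch_size=500)
sgd_updates = 25 * n_train     # 25 epochs x 500 updates per epoch (batch_size=1)
print(batch_gd_updates, sgd_updates, sgd_updates // batch_gd_updates)
# 100 12500 125
```

So by the time stochastic gradient descent reaches 80% accuracy at epoch 25, it has already applied 125 times as many (noisy) updates as batch gradient descent applied over its full 100-epoch climb.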
This variance in the model means that it may be challenging to choose which model to use as the final model, as opposed to batch gradient descent where performance is stabilized because the model has converged.<\/p>\n<div id=\"attachment_6873\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6873\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-and-Smaller-Learning-Rate.png\" alt=\"Line Plot of Classification Accuracy on Train and Tests Sets of an MLP Fit With Stochastic Gradient Descent and Smaller Learning Rate\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-and-Smaller-Learning-Rate.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-and-Smaller-Learning-Rate-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-and-Smaller-Learning-Rate-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Stochastic-Gradient-Descent-and-Smaller-Learning-Rate-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Classification Accuracy on Train and Tests Sets of an MLP Fit With Stochastic Gradient Descent and Smaller Learning Rate<\/p>\n<\/div>\n<p>This example highlights 
the important relationship between batch size and the learning rate. Namely, noisier updates to the model require a smaller learning rate, whereas less noisy, more accurate estimates of the error gradient may be applied to the model more aggressively, with a larger learning rate. We can summarize this as follows:<\/p>\n<ul>\n<li><strong>Batch Gradient Descent<\/strong>: Use a relatively larger learning rate and more training epochs.<\/li>\n<li><strong>Stochastic Gradient Descent<\/strong>: Use a relatively smaller learning rate and fewer training epochs.<\/li>\n<\/ul>\n<p>Mini-batch gradient descent provides an alternative approach.<\/p>\n<h2>MLP Fit With Minibatch Gradient Descent<\/h2>\n<p>An alternative to using stochastic gradient descent and tuning the learning rate is to hold the learning rate constant and change the batch size.<\/p>\n<p>In effect, we fix the rate of learning, or amount of change to apply to the weights each time the error gradient is estimated, and instead vary the accuracy of the gradient estimate by changing the number of samples used to compute it.<\/p>\n<p>Holding the learning rate at 0.01 as we did with batch gradient descent, we can set the batch size to 32, a widely adopted default.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=32)<\/pre>\n<p>We would expect this to give some of the benefits of stochastic gradient descent while retaining the larger learning rate.<\/p>\n<p>The complete example with this modification is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs problem with minibatch gradient descent\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=1000, centers=3, 
n_features=2, cluster_std=2, random_state=2)\r\n# one hot encode output variable\r\ny = to_categorical(y)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(3, activation='softmax'))\r\n# compile model\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=32)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot training history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example reports similar performance on both train and test sets, comparable with batch gradient descent and stochastic gradient descent after we reduced the learning rate.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.832, Test: 0.812<\/pre>\n<p>The line plot shows the dynamics of both stochastic and batch gradient descent. 
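<\/p>\n<p>To see why a batch size of 32 sits between the two extremes, it helps to count the weight updates performed per epoch. Gradient descent makes one update per batch, and the final smaller batch is still used by default in Keras, so the number of updates per epoch is the ceiling of the training set size divided by the batch size. The short sketch below (plain Python, separate from the tutorial code) computes this for the 500-sample training set used above.<\/p>

```python
# count weight updates per epoch for a 500-sample training set,
# assuming the final partial batch is still used (the Keras default)
import math

n_train = 500  # training samples, as in the examples above

for batch_size in [1, 32, 500]:
    updates_per_epoch = math.ceil(n_train / batch_size)
    print('batch_size=%3d -> %3d updates per epoch' % (batch_size, updates_per_epoch))
```

<p>Stochastic gradient descent (batch size 1) makes 500 noisy updates per epoch, batch gradient descent (batch size 500) makes a single accurate one, and a batch size of 32 makes 16 updates per epoch, which helps explain the blended dynamics in the line plot.<\/p>\n<p>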
Specifically, the model learns quickly with noisy updates, but stabilizes towards the end of the run more than stochastic gradient descent does.<\/p>\n<p>Holding the learning rate constant and varying the batch size allows you to dial in the best of both approaches.<\/p>\n<div id=\"attachment_6874\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6874\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Minibatch-Gradient-Descent.png\" alt=\"Line Plot of Classification Accuracy on Train and Test Sets of an MLP Fit With Minibatch Gradient Descent\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Minibatch-Gradient-Descent.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Minibatch-Gradient-Descent-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Minibatch-Gradient-Descent-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-Classification-Accuracy-on-Train-and-Tests-Sets-of-a-MLP-Fit-with-Minibatch-Gradient-Descent-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of Classification Accuracy on Train and Test Sets of an MLP Fit With Minibatch Gradient Descent<\/p>\n<\/div>\n<h2>Effect of Batch Size on Model Behavior<\/h2>\n<p>We can refit the model with different batch sizes and review the impact the change in batch size has 
on the speed of learning, stability during learning, and on the final result.<\/p>\n<p>First, we can clean up the code and create a function to prepare the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy<\/pre>\n<p>Next, we can create a function to fit a model on the problem with a given batch size and plot the learning curves of classification accuracy on the train and test datasets.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, n_batch):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, momentum=0.9)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=n_batch)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], label='test')\r\n\tpyplot.title('batch='+str(n_batch), pad=-40)<\/pre>\n<p>Finally, we can evaluate the model behavior with a suite of different batch sizes while holding everything else about the model constant, including the learning rate.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different batch sizes\r\nbatch_sizes = [4, 8, 16, 32, 64, 
128, 256, 450]\r\nfor i in range(len(batch_sizes)):\r\n\t# determine the plot number\r\n\tplot_no = 420 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a batch size\r\n\tfit_model(trainX, trainy, testX, testy, batch_sizes[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>The result will be a figure with eight plots of model behavior with eight different batch sizes.<\/p>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the blobs problem with minibatch gradient descent with varied batch size\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom keras.layers import Dense\r\nfrom keras.models import Sequential\r\nfrom keras.optimizers import SGD\r\nfrom keras.utils import to_categorical\r\nfrom matplotlib import pyplot\r\n\r\n# prepare train and test dataset\r\ndef prepare_data():\r\n\t# generate 2d classification dataset\r\n\tX, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)\r\n\t# one hot encode output variable\r\n\ty = to_categorical(y)\r\n\t# split into train and test\r\n\tn_train = 500\r\n\ttrainX, testX = X[:n_train, :], X[n_train:, :]\r\n\ttrainy, testy = y[:n_train], y[n_train:]\r\n\treturn trainX, trainy, testX, testy\r\n\r\n# fit a model and plot learning curve\r\ndef fit_model(trainX, trainy, testX, testy, n_batch):\r\n\t# define model\r\n\tmodel = Sequential()\r\n\tmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\n\tmodel.add(Dense(3, activation='softmax'))\r\n\t# compile model\r\n\topt = SGD(lr=0.01, momentum=0.9)\r\n\tmodel.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n\t# fit model\r\n\thistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0, batch_size=n_batch)\r\n\t# plot learning curves\r\n\tpyplot.plot(history.history['acc'], label='train')\r\n\tpyplot.plot(history.history['val_acc'], 
label='test')\r\n\tpyplot.title('batch='+str(n_batch), pad=-40)\r\n\r\n# prepare dataset\r\ntrainX, trainy, testX, testy = prepare_data()\r\n# create learning curves for different batch sizes\r\nbatch_sizes = [4, 8, 16, 32, 64, 128, 256, 450]\r\nfor i in range(len(batch_sizes)):\r\n\t# determine the plot number\r\n\tplot_no = 420 + (i+1)\r\n\tpyplot.subplot(plot_no)\r\n\t# fit model and plot learning curves for a batch size\r\n\tfit_model(trainX, trainy, testX, testy, batch_sizes[i])\r\n# show learning curves\r\npyplot.show()<\/pre>\n<p>Running the example creates a figure with eight line plots showing the classification accuracy on the train and test sets of models with different batch sizes when using mini-batch gradient descent.<\/p>\n<p>The plots show that smaller batch sizes generally result in rapid learning but a volatile learning process, with higher variance in classification accuracy. Larger batch sizes slow down learning, but the final epochs converge to a more stable model, exemplified by lower variance in classification accuracy.<\/p>\n<div id=\"attachment_6875\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6875\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Classification-Accuracy-on-Train-and-Test-Datasets-With-Different-Batch-Sizes.png\" alt=\"Line Plots of Classification Accuracy on Train and Test Datasets With Different Batch Sizes\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Classification-Accuracy-on-Train-and-Test-Datasets-With-Different-Batch-Sizes.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Classification-Accuracy-on-Train-and-Test-Datasets-With-Different-Batch-Sizes-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Classification-Accuracy-on-Train-and-Test-Datasets-With-Different-Batch-Sizes-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plots-of-Classification-Accuracy-on-Train-and-Test-Datasets-With-Different-Batch-Sizes-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plots of Classification Accuracy on Train and Test Datasets With Different Batch Sizes<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Posts<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-mini-batch-gradient-descent-configure-batch-size\/\">A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size<\/a><\/li>\n<\/ul>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1804.07612\">Revisiting Small Batch Training for Deep Neural Networks<\/a>, 2018.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1206.5533\">Practical recommendations for gradient-based training of deep architectures<\/a>, 2012.<\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li>8.1.3 Batch and Minibatch Algorithms, <a href=\"https:\/\/amzn.to\/2NJW3gE\">Deep Learning<\/a>, 2016.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Stochastic_gradient_descent\">Stochastic gradient descent, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered three different flavors of gradient descent and how to explore and diagnose the effect of batch size on the learning process.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Batch size controls the accuracy of the estimate of the error gradient when training neural networks.<\/li>\n<li>Batch, Stochastic, and Minibatch gradient descent are the three main 
flavors of the learning algorithm.<\/li>\n<li>There is a tension between batch size and the speed and stability of the learning process.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size\/\">How to Control the Speed and Stability of Training Neural Networks With Gradient Descent Batch Size<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Neural networks are trained using gradient descent where the estimate of the error used to update the weights is calculated based on [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/20\/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1606,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1605"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1605"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1605\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1606"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1605"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1605"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1605"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}