{"id":1594,"date":"2019-01-17T18:00:10","date_gmt":"2019-01-17T18:00:10","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/17\/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization\/"},"modified":"2019-01-17T18:00:10","modified_gmt":"2019-01-17T18:00:10","slug":"how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/17\/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization\/","title":{"rendered":"How to Accelerate Learning of Deep Neural Networks With Batch Normalization"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Batch normalization is a technique designed to automatically standardize the inputs to a layer in a deep learning neural network.<\/p>\n<p>Once implemented, batch normalization has the effect of dramatically accelerating the training process of a neural network, and in some cases improves the performance of the model via a modest regularization effect.<\/p>\n<p>In this tutorial, you will discover how to use batch normalization to accelerate the training of deep learning neural networks in Python with Keras.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to create and configure a BatchNormalization layer using the Keras API.<\/li>\n<li>How to add the BatchNormalization layer to deep learning neural network models.<\/li>\n<li>How to update an MLP model to use batch normalization to accelerate training on a binary classification problem.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_6863\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6863\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/01\/How-to-Accelerate-Learning-of-Deep-Neural-Networks-With-Batch-Normalization.jpg\" alt=\"How to Accelerate Learning of 
Deep Neural Networks With Batch Normalization\" width=\"640\" height=\"320\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Accelerate-Learning-of-Deep-Neural-Networks-With-Batch-Normalization.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/01\/How-to-Accelerate-Learning-of-Deep-Neural-Networks-With-Batch-Normalization-300x150.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p class=\"wp-caption-text\">How to Accelerate Learning of Deep Neural Networks With Batch Normalization<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/150568953@N07\/33743679584\/in\/photolist-TpPbk3-5G9sdW-22jaB3h-eDpu7f-UuLQaX-UBXo3K-ez18sY-4XZNoe-iK4Dcb-QhtB5y-2b9cJbJ-21uwe5a-7TRKo5-dRZUkF-55NmoW-g7KEUA-KnKkuy-igJZXh-4VpkoP-2aknhQ5-amec93-2apvqR3-nEt1TA-arT9ZL-qF4NCe-pvqkJQ-27RCmY5-9sVNKs-4XZNkp-UDnFvV-7vVcmn-5pRw4L-e6gvpz-JxyUUt-LHjxD4-9Y986P-igKn6G-29AHGJB-7vVfJ2-jUwzcR-N8P8A7-28AhQbY-iJBZmG-L19pGt-anDpWe-9debvF-ihzZtU-ih63LL-22HBL6u-YwhNrb\">Angela and Andrew<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>BatchNormalization in Keras<\/li>\n<li>BatchNormalization in Models<\/li>\n<li>BatchNormalization Case Study<\/li>\n<\/ol>\n<h2>BatchNormalization in Keras<\/h2>\n<p>Keras provides support for batch normalization via the BatchNormalization layer.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">bn = BatchNormalization()<\/pre>\n<p>The layer will transform inputs so that they are standardized, meaning that they will have a mean of zero and a standard deviation of one.<\/p>\n<p>During training, the layer will keep track of statistics for each input variable and use them to standardize the data.<\/p>\n<p>Further, the standardized output can be scaled using the learned parameters of <em>Beta<\/em> and <em>Gamma<\/em> that define the new mean and 
standard deviation for the output of the transform. The layer can be configured to control whether these additional parameters will be used or not via the \u201c<em>center<\/em>\u201d and \u201c<em>scale<\/em>\u201d arguments, respectively. By default, they are enabled.<\/p>\n<p>The statistics used to perform the standardization, e.g. the mean and standard deviation of each variable, are updated for each mini-batch and a running average is maintained.<\/p>\n<p>A \u201c<em>momentum<\/em>\u201d argument allows you to control how much of the statistics from the previous mini-batch to include when the update is calculated. By default, this is kept high with a value of 0.99. This can be set to 0.0 to only use statistics from the current mini-batch, as described in the original paper.<\/p>\n<pre class=\"crayon-plain-tag\">bn = BatchNormalization(momentum=0.0)<\/pre>\n<p>At the end of training, the mean and standard deviation statistics in the layer at that time will be used to standardize inputs when the model is used to make a prediction.<\/p>\n<p>The default configuration, which estimates the mean and standard deviation across all mini-batches, is probably sensible.<\/p>\n
<h2>BatchNormalization in Models<\/h2>\n<p>Batch normalization can be used at most points in a model and with most types of deep learning neural networks.<\/p>\n<h3>Input and Hidden Layer Inputs<\/h3>\n<p>The BatchNormalization layer can be added to your model to standardize raw input variables or the outputs of a hidden layer.<\/p>\n<p>Batch normalization is not recommended as an alternative to proper data preparation for your model.<\/p>\n<p>Nevertheless, when used to standardize the raw input variables, the layer must specify the <em>input_shape<\/em> argument; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel = Sequential()\r\nmodel.add(BatchNormalization(input_shape=(2,)))\r\n...<\/pre>\n<p>When used to standardize the outputs of a hidden layer, the layer can be added to the model just like any other layer.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel = Sequential()\r\n...\r\nmodel.add(BatchNormalization())\r\n...<\/pre>\n<h3>Use Before or After the Activation Function<\/h3>\n<p>The BatchNormalization layer can be used to standardize inputs before or after the activation function of the previous layer.<\/p>\n<p>The <a href=\"https:\/\/arxiv.org\/abs\/1502.03167\">original paper<\/a> that introduced the method suggests adding batch normalization before the activation function of the previous layer, for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel = Sequential()\r\nmodel.add(Dense(32))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Activation('relu'))\r\n...<\/pre>\n<p><a 
href=\"https:\/\/github.com\/ducha-aiki\/caffenet-benchmark\/blob\/master\/batchnorm.md\">Some reported experiments suggest<\/a> better performance when adding the batch normalization layer after the activation function of the previous layer; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nmodel = Sequential()\r\nmodel.add(Dense(32, activation='relu'))\r\nmodel.add(BatchNormalization())\r\n...<\/pre>\n<p>If time and resources permit, it may be worth testing both approaches on your model and using the approach that results in the best performance.<\/p>\n<p>Let\u2019s take a look at how batch normalization can be used with some common network types.<\/p>\n<h3>MLP Batch Normalization<\/h3>\n<p>The example below adds batch normalization after the activation function between Dense hidden layers.<\/p>\n<pre class=\"crayon-plain-tag\"># example of batch normalization for an mlp\r\nfrom keras.layers import Dense\r\nfrom keras.layers import BatchNormalization\r\n...\r\nmodel.add(Dense(32, activation='relu'))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Dense(1))\r\n...<\/pre>\n<h3>CNN Batch Normalization<\/h3>\n<p>The example below adds batch normalization after the activation function between the convolutional and max pooling layers.<\/p>\n<pre class=\"crayon-plain-tag\"># example of batch normalization for a cnn\r\nfrom keras.layers import Dense\r\nfrom keras.layers import Conv2D\r\nfrom keras.layers import MaxPooling2D\r\nfrom keras.layers import BatchNormalization\r\n...\r\nmodel.add(Conv2D(32, (3,3), activation='relu'))\r\nmodel.add(Conv2D(32, (3,3), activation='relu'))\r\nmodel.add(BatchNormalization())\r\nmodel.add(MaxPooling2D())\r\nmodel.add(Dense(1))\r\n...<\/pre>\n<h3>RNN Batch Normalization<\/h3>\n<p>The example below adds batch normalization after the activation function between an LSTM and Dense hidden layers.<\/p>\n<pre class=\"crayon-plain-tag\"># example of batch normalization for an lstm\r\nfrom keras.layers import Dense\r\nfrom 
keras.layers import LSTM\r\nfrom keras.layers import BatchNormalization\r\n...\r\nmodel.add(LSTM(32))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Dense(1))\r\n...<\/pre>\n<\/p>\n<h2>BatchNormalization Case Study<\/h2>\n<p>In this section, we will demonstrate how to use batch normalization to accelerate the training of an MLP on a simple binary classification problem.<\/p>\n<p>This example provides a template for applying batch normalization to your own neural network for classification and regression problems.<\/p>\n<h3>Binary Classification Problem<\/h3>\n<p>We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.<\/p>\n<p>Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the \u201ccircles\u201d dataset because of the shape of the observations in each class when plotted.<\/p>\n<p>We can use the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_circles.html\">make_circles() function<\/a> to generate observations from this problem. 
We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.<\/p>\n<pre class=\"crayon-plain-tag\"># generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)<\/pre>\n<p>We can plot the dataset where the two variables are taken as <em>x<\/em> and <em>y<\/em> coordinates on a graph and the class value is taken as the color of the observation.<\/p>\n<p>The complete example of generating the dataset and plotting it is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of the circles dataset with points colored by class\r\nfrom sklearn.datasets import make_circles\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\n# generate circles\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# select indices of points with each class label\r\nfor i in range(2):\r\n\tsamples_ix = where(y == i)\r\n\tpyplot.scatter(X[samples_ix, 0], X[samples_ix, 1], label=str(i))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example creates a scatter plot showing the concentric circles shape of the observations in each class.<\/p>\n<p>We can see the noise in the dispersal of the points making the circles less obvious.<\/p>\n<div id=\"attachment_6859\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6859\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Circles-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample.png\" alt=\"Scatter Plot of Circles Dataset With Color Showing the Class Value of Each Sample\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Circles-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Circles-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Circles-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Scatter-Plot-of-Circles-Dataset-with-Color-Showing-the-Class-Value-of-Each-Sample-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Scatter Plot of Circles Dataset With Color Showing the Class Value of Each Sample<\/p>\n<\/div>\n<p>This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.<\/p>\n<h3>Multilayer Perceptron Model<\/h3>\n<p>We can develop a Multilayer Perceptron model, or MLP, as a baseline for this problem.<\/p>\n<p>First, we will split the 1,000 generated samples into train and test datasets, with 500 examples in each. This will provide a sufficiently large sample for the model to learn from and an equally sized (fair) evaluation of its performance.<\/p>\n<pre class=\"crayon-plain-tag\"># split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]<\/pre>\n<p>We will define a simple MLP model. The network must have two inputs in the visible layer for the two variables in the dataset.<\/p>\n<p>The model will have a single hidden layer with 50 nodes, chosen arbitrarily, and use the rectified linear activation function and the He random weight initialization method. 
The output layer will be a single node with the sigmoid activation function, capable of predicting a 0 for the outer circle and a 1 for the inner circle of the problem.<\/p>\n<p>The model will be trained using stochastic gradient descent with a modest learning rate of 0.01 and a large momentum of 0.9, and the optimization will be directed using the binary cross entropy loss function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>Once defined, the model can be fit on the training dataset.<\/p>\n<p>We will use the holdout test dataset as a validation dataset and evaluate its performance at the end of each training epoch. The model will be fit for 100 epochs, chosen after a little trial and error.<\/p>\n<pre class=\"crayon-plain-tag\"># fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)<\/pre>\n<p>At the end of the run, the model is evaluated on the train and test dataset and the accuracy is reported.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))<\/pre>\n<p>Finally, line plots are created showing model accuracy on the train and test sets at the end of each training epoch providing learning curves.<\/p>\n<p>This plot of learning curves is useful as it gives an idea of how quickly and how well the model has learned the problem.<\/p>\n<pre class=\"crayon-plain-tag\"># plot history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying 
these elements together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the two circles problem\r\nfrom sklearn.datasets import make_circles\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example fits the model and evaluates it on the train and test sets.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. 
Consider re-running the example a number of times.<\/p>\n<p>In this case, we can see that the model achieved an accuracy of about 84% on the holdout dataset and achieved comparable performance on both the train and test sets, given the same size and similar composition of both datasets.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.838, Test: 0.846<\/pre>\n<p>A graph is created showing line plots of the classification accuracy on the train (blue) and test (orange) datasets.<\/p>\n<p>The plot shows comparable performance of the model on both datasets during the training process. We can see that performance leaps up over the first 30-to-40 epochs to above 80% accuracy then is slowly refined.<\/p>\n<div id=\"attachment_6860\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6860\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-MLP-Classification-Accuracy-on-Train-and-Test-Datasets-over-Training-Epochs.png\" alt=\"Line Plot of MLP Classification Accuracy on Train and Test Datasets Over Training Epochs\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-MLP-Classification-Accuracy-on-Train-and-Test-Datasets-over-Training-Epochs.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-MLP-Classification-Accuracy-on-Train-and-Test-Datasets-over-Training-Epochs-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-MLP-Classification-Accuracy-on-Train-and-Test-Datasets-over-Training-Epochs-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-of-MLP-Classification-Accuracy-on-Train-and-Test-Datasets-over-Training-Epochs-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 
100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot of MLP Classification Accuracy on Train and Test Datasets Over Training Epochs<\/p>\n<\/div>\n<p>This result, and specifically the dynamics of the model during training, provide a baseline that can be compared to the same model with the addition of batch normalization.<\/p>\n<h2>MLP With Batch Normalization<\/h2>\n<p>The model introduced in the previous section can be updated to add batch normalization.<\/p>\n<p>The expectation is that the addition of batch normalization would accelerate the training process, offering similar or better classification accuracy of the model in fewer training epochs. Batch normalization is also reported as providing a modest form of regularization, meaning that it may also offer a small reduction in generalization error demonstrated by a small increase in classification accuracy on the holdout test dataset.<\/p>\n<p>A new BatchNormalization layer can be added to the model after the hidden layer before the output layer. 
Specifically, after the activation function of the prior hidden layer.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The complete example with this modification is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the two circles problem with batchnorm after activation function\r\nfrom sklearn.datasets import make_circles\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.layers import BatchNormalization\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy of the model on the 
train and test datasets.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider re-running the example a number of times.<\/p>\n<p>In this case, we can see comparable performance of the model on both the train and test sets of about 85% accuracy, very similar to what we saw in the previous section, if not a little bit better.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.846, Test: 0.848<\/pre>\n<p>A graph of the learning curves is also created showing classification accuracy on both the train and test sets for each training epoch.<\/p>\n<p>In this case, we can see that the model has learned the problem faster than the model in the previous section without batch normalization. Specifically, we can see that classification accuracy on the train and test datasets leaps above 80% within the first 20 epochs, as opposed to 30-to-40 epochs in the model without batch normalization.<\/p>\n<p>The plot also shows the effect of batch normalization during training. We can see lower performance on the training dataset than on the test dataset during training: the training scores recorded during the run are lower than the score the model achieves on the training dataset at the end of the run. 
This is likely an effect of the statistics being collected and updated each mini-batch.<\/p>\n<div id=\"attachment_6861\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6861\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-After-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs.png\" alt=\"Line Plot Classification Accuracy of MLP With Batch Normalization After Activation Function on Train and Test Datasets Over Training Epochs\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-After-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-After-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-After-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-After-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot Classification Accuracy of MLP With Batch Normalization After Activation Function on Train and Test Datasets Over Training Epochs<\/p>\n<\/div>\n<p>We can also try a variation of the model where batch normalization is applied prior to the activation 
function of the hidden layer, instead of after the activation function.<\/p>\n<pre class=\"crayon-plain-tag\"># define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, kernel_initializer='he_uniform'))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Activation('relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])<\/pre>\n<p>The complete code listing with this change to the model is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># mlp for the two circles problem with batchnorm before activation function\r\nfrom sklearn.datasets import make_circles\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\nfrom keras.layers import Activation\r\nfrom keras.layers import BatchNormalization\r\nfrom keras.optimizers import SGD\r\nfrom matplotlib import pyplot\r\n# generate 2d classification dataset\r\nX, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\r\n# split into train and test\r\nn_train = 500\r\ntrainX, testX = X[:n_train, :], X[n_train:, :]\r\ntrainy, testy = y[:n_train], y[n_train:]\r\n# define model\r\nmodel = Sequential()\r\nmodel.add(Dense(50, input_dim=2, kernel_initializer='he_uniform'))\r\nmodel.add(BatchNormalization())\r\nmodel.add(Activation('relu'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\nopt = SGD(lr=0.01, momentum=0.9)\r\nmodel.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])\r\n# fit model\r\nhistory = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)\r\n# evaluate the model\r\n_, train_acc = model.evaluate(trainX, trainy, verbose=0)\r\n_, test_acc = model.evaluate(testX, testy, verbose=0)\r\nprint('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\r\n# plot history\r\npyplot.plot(history.history['acc'], label='train')\r\npyplot.plot(history.history['val_acc'], 
label='test')\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first prints the classification accuracy of the model on the train and test dataset.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm. Consider re-running the example a number of times.<\/p>\n<p>In this case, we can see comparable performance of the model on the train and test datasets, but slightly worse than the model without batch normalization.<\/p>\n<pre class=\"crayon-plain-tag\">Train: 0.826, Test: 0.830<\/pre>\n<p>The line plot of the learning curves on the train and test sets also tells a different story.<\/p>\n<p>The plot shows the model learning perhaps at the same pace as the model without batch normalization, but the performance of the model on the training dataset is much worse, hovering around 70% to 75% accuracy, again likely an effect of the statistics collected and used over each mini-batch.<\/p>\n<p>At least for this model configuration on this specific dataset, it appears that batch normalization is more effective after the rectified linear activation function.<\/p>\n<div id=\"attachment_6862\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-6862\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-Before-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs.png\" alt=\"Line Plot Classification Accuracy of MLP With Batch Normalization Before Activation Function on Train and Test Datasets Over Training Epochs\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-Before-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-Before-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-Before-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2018\/11\/Line-Plot-Classification-Accuracy-of-MLP-with-Batch-Normalization-Before-Activation-Function-on-Train-and-Test-Datasets-over-Training-Epochs-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p class=\"wp-caption-text\">Line Plot Classification Accuracy of MLP With Batch Normalization Before Activation Function on Train and Test Datasets Over Training Epochs<\/p>\n<\/div>\n<h2>Extensions<\/h2>\n<p>This section lists some ideas for extending the tutorial that you may wish to explore.<\/p>\n<ul>\n<li><strong>Without Beta and Gamma<\/strong>. Update the example to not use the beta and gamma parameters in the batch normalization layer and compare results.<\/li>\n<li><strong>Without Momentum<\/strong>. Update the example to not use momentum in the batch normalization layer during training and compare results.<\/li>\n<li><strong>Input Layer<\/strong>. 
Update the example to use batch normalization after the input to the model and compare results.<\/li>\n<\/ul>\n<p>If you explore any of these extensions, I\u2019d love to know.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1502.03167\">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift<\/a>, 2015.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/keras.io\/regularizers\/\">Keras Regularizers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/core\/\">Keras Core Layers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/convolutional\/\">Keras Convolutional Layers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/recurrent\/\">Keras Recurrent Layers API<\/a><\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/normalization\/\">BatchNormalization Keras API<\/a><\/li>\n<li><a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_circles.html\">sklearn.datasets.make_circles<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"http:\/\/blog.datumbox.com\/the-batch-normalization-layer-of-keras-is-broken\/\">The Batch Normalization layer of Keras is broken, Vasilis Vryniotis<\/a>, 2018.<\/li>\n<li><a href=\"https:\/\/www.reddit.com\/r\/MachineLearning\/comments\/67gonq\/d_batch_normalization_before_or_after_relu\/\">Batch Normalization before or after ReLU?, Reddit<\/a>.<\/li>\n<li><a href=\"https:\/\/github.com\/ducha-aiki\/caffenet-benchmark\/blob\/master\/batchnorm.md\">Studies of Batch Normalization Before and After Activation Function<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use batch normalization to accelerate the training of deep learning neural networks in Python with Keras.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to create and configure a BatchNormalization 
layer using the Keras API.<\/li>\n<li>How to add the BatchNormalization layer to deep learning neural network models.<\/li>\n<li>How to update an MLP model to use batch normalization to accelerate training on a binary classification problem.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization\/\">How to Accelerate Learning of Deep Neural Networks With Batch Normalization<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Batch normalization is a technique designed to automatically standardize the inputs to a layer in a deep learning neural network. 
Once implemented, [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/01\/17\/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1595,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1594"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1594"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1594\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1595"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=1594"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1594"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1594"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}