{"id":5690,"date":"2022-06-14T15:51:07","date_gmt":"2022-06-14T15:51:07","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2022\/06\/14\/using-normalization-layers-to-improve-deep-learning-models\/"},"modified":"2022-06-14T15:51:07","modified_gmt":"2022-06-14T15:51:07","slug":"using-normalization-layers-to-improve-deep-learning-models","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2022\/06\/14\/using-normalization-layers-to-improve-deep-learning-models\/","title":{"rendered":"Using Normalization Layers to Improve Deep Learning Models"},"content":{"rendered":"<p>Author: Zhe Ming Chng<\/p>\n<div>\n<p>You\u2019ve probably been told to standardize or normalize inputs to your model to improve performance. But what exactly is normalization, and how can we implement it easily in our deep learning models? Normalizing our inputs aims to create a set of features that are on the same scale as each other, which we\u2019ll explore more in this article.<\/p>\n<p>In neural networks, the output of each layer serves as the input to the next layer, so a natural question to ask is: if normalizing the inputs to the model helps improve performance, does standardizing the inputs to each layer help too?<\/p>\n<p>The answer is, most of the time, yes! However, unlike normalizing the inputs to the model as a whole, it is slightly more complicated to normalize the inputs to intermediate layers because the activations are constantly changing. As such, it is infeasible, or at least computationally expensive, to recompute statistics over the entire training set at every training step. 
In this article, we\u2019ll be exploring normalization layers to normalize your inputs to your model as well as batch normalization, a technique to standardize the inputs into each layer across batches.<\/p>\n<p>Let\u2019s get started!<\/p>\n<div id=\"attachment_13669\" style=\"width: 810px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-13669\" class=\"size-full wp-image-13669\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-scaled.jpg\" alt=\"\" width=\"800\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-scaled.jpg 2560w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-300x200.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-1024x683.jpg 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-768x512.jpg 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-1536x1024.jpg 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/06\/pexels-matej-712501-2048x1365.jpg 2048w\" sizes=\"(max-width: 2560px) 100vw, 2560px\"><\/p>\n<p id=\"caption-attachment-13669\" class=\"wp-caption-text\">Using Normalization Layers to Improve Deep Learning Models<br \/><a href=\"https:\/\/www.pexels.com\/photo\/plenty-of-nuts-712501\/\">Photo by Matej<\/a>. 
Some rights reserved.<\/p>\n<\/div>\n<h2>Overview<\/h2>\n<p>This tutorial is split into six parts; they are:<\/p>\n<ul>\n<li>What is normalization and why is it helpful?<\/li>\n<li>Using the Normalization layer in TensorFlow<\/li>\n<li>What is batch normalization and why should we use it?<\/li>\n<li>Batch normalization: Under the hood<\/li>\n<li>Implementing batch normalization in TensorFlow<\/li>\n<li>Normalization and Batch Normalization in Action<\/li>\n<\/ul>\n<h2>What is Normalization and Why is It Helpful?<\/h2>\n<p>Normalizing a set of data transforms it so that the features are on a similar scale. For machine learning models, our goal is usually to recenter and rescale the data such that it lies between 0 and 1, or between -1 and 1, depending on the data itself. One common way to accomplish this is to calculate the mean and the standard deviation of the data and transform each sample by subtracting the mean and dividing by the standard deviation. If we assume that the data follows a normal distribution, this standardizes it to a standard normal distribution.<\/p>\n<p>Normalization helps the training of neural networks because the different features end up on a similar scale, which stabilizes the gradient descent step, allowing us to use larger learning rates or helping models converge faster for a given learning rate.<\/p>\n<h2>Using the Normalization Layer in TensorFlow<\/h2>\n<p>To normalize inputs in TensorFlow, we can use the Normalization layer in Keras. 
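Before turning to the Keras layer, the standardization described above can be sketched in a few lines of plain NumPy (a minimal illustration; the sample values here are made up):

```python
import numpy as np

# made-up feature values on a large scale
data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# standardize: subtract the mean, divide by the standard deviation
standardized = (data - data.mean()) / data.std()

print(standardized)         # values centered around 0
print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1
```

The Normalization layer below performs exactly this subtract-and-divide computation, with the statistics learned from the training data.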
First, let\u2019s define some sample data,<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import numpy as np\r\nimport tensorflow as tf\r\nfrom tensorflow.keras.layers import Normalization\r\n\r\nsample1 = np.array([\r\n    [1, 1, 1],\r\n    [1, 1, 1],\r\n    [1, 1, 1]\r\n], dtype=np.float32)\r\n\r\nsample2 = np.array([\r\n    [2, 2, 2],\r\n    [2, 2, 2],\r\n    [2, 2, 2]\r\n], dtype=np.float32)\r\n\r\nsample3 = np.array([\r\n    [3, 3, 3],\r\n    [3, 3, 3],\r\n    [3, 3, 3]\r\n], dtype=np.float32)<\/pre>\n<p>Then we initialize our Normalization layer.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">normalization_layer = Normalization()<\/pre>\n<p>Then, to compute the mean and variance of the dataset and set our Normalization layer to use those parameters, we can call the Normalization.adapt() method on our data.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">combined_batch = tf.constant(np.expand_dims(np.stack([sample1, sample2, sample3]), axis=-1), dtype=tf.float32)\r\n\r\nnormalization_layer.adapt(combined_batch)<\/pre>\n<p>For this case, we used <code>expand_dims<\/code> to add an extra dimension because the Normalization layer normalizes along the last dimension by default (each index in the last dimension gets its own mean and variance computed on the train set), as that dimension is assumed to be the feature dimension, which for RGB images is usually just the color channels.<\/p>\n<p>Then, to normalize our data, we call the normalization layer on it:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">normalization_layer(sample1)<\/pre>\n<p>which gives the output<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">&lt;tf.Tensor: shape=(1, 1, 3, 3), dtype=float32, numpy=\r\narray([[[[-1.2247449, -1.2247449, -1.2247449],\r\n         [-1.2247449, -1.2247449, -1.2247449],\r\n         [-1.2247449, -1.2247449, -1.2247449]]]], dtype=float32)&gt;<\/pre>\n<p>And we can verify that this is the expected behavior by running <code>np.mean<\/code> and 
<code>np.std<\/code> on our original data, which give us a mean of 2.0 and a standard deviation of 0.8164966, so (1 - 2.0) \/ 0.8164966 = -1.2247449.<\/p>\n<p>Now that we\u2019ve seen how to normalize our inputs, let\u2019s take a look at another normalization method: batch normalization.<\/p>\n<h2>What is batch normalization and why should we use it?<\/h2>\n<div id=\"attachment_13594\" style=\"width: 271px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/Batch-Normalization-Cube.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-13594\" loading=\"lazy\" class=\"wp-image-13594\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/Batch-Normalization-Cube.png\" alt=\"\" width=\"261\" height=\"284\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/Batch-Normalization-Cube.png 448w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/Batch-Normalization-Cube-275x300.png 275w\" sizes=\"(max-width: 261px) 100vw, 261px\"><\/a><\/p>\n<p id=\"caption-attachment-13594\" class=\"wp-caption-text\">Source: https:\/\/arxiv.org\/pdf\/1803.08494.pdf<\/p>\n<\/div>\n<p>From the name, you can probably guess that batch normalization must have something to do with batches during training. Simply put, batch normalization standardizes the input of a layer across a single batch.<\/p>\n<p>You might be thinking: why can\u2019t we just calculate the mean and variance at a given layer and normalize with those? The problem arises during training: as the parameters change, the activations in the intermediate layers are constantly changing, so calculating the mean and variance across the entire training set at each iteration would be time consuming and potentially pointless, since the activations would change again at the next iteration anyway. 
That\u2019s where batch normalization comes in.<\/p>\n<p>Introduced in \u201cBatch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\u201d by Ioffe and Szegedy at Google, batch normalization standardizes the inputs to a layer in order to reduce the problem of internal covariate shift. In the paper, internal covariate shift is defined as the phenomenon whereby \u201cthe distribution of each layer\u2019s inputs changes during training, as the parameters of the previous layers change.\u201d<\/p>\n<p>The idea of batch normalization fixing the problem of internal covariate shift has been disputed, notably in \u201cHow Does Batch Normalization Help Optimization?\u201d by Santurkar et al., where it was proposed that batch normalization instead helps to smooth the loss function over the parameter space. While it might not always be clear exactly how batch normalization helps, it has achieved good empirical results on many different problems and models.<\/p>\n<p>There is also some evidence that batch normalization can contribute significantly to addressing the vanishing gradient problem common with deep learning models. In the original ResNet paper, He et al. mention in their analysis of ResNet vs plain networks that \u201cbackward propagated gradients exhibit healthy norms with BN (batch normalization)\u201d even in plain networks.<\/p>\n<p>It has also been suggested that batch normalization has other benefits, such as allowing us to use higher learning rates, as it can help to stabilize parameter growth. It can also help to regularize the model. 
From the original batch normalization paper,<\/p>\n<blockquote>\n<p>\u201cWhen training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network.\u201d<\/p>\n<\/blockquote>\n<h2>Batch Normalization: Under the Hood<\/h2>\n<p>So, what does batch normalization actually do?<\/p>\n<p>First, we need to calculate the batch statistics, in particular the mean and variance, for each of the different activations across a batch. Since each layer\u2019s output serves as an input into the next layer in a neural network, by standardizing the output of the layers we are also standardizing the inputs to the next layer in our model (in practice, the original paper suggested applying batch normalization before the activation function, though there is some debate over this).<\/p>\n<p>So, we calculate<\/p>\n<div id=\"attachment_13593\" style=\"width: 314px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/sample_mean_variance.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-13593\" loading=\"lazy\" class=\"wp-image-13593\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/sample_mean_variance.png\" alt=\"\" width=\"304\" height=\"148\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/sample_mean_variance.png 398w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/sample_mean_variance-300x146.png 300w\" sizes=\"(max-width: 304px) 100vw, 304px\"><\/a><\/p>\n<p id=\"caption-attachment-13593\" class=\"wp-caption-text\">Sample mean and variance on batch<\/p>\n<\/div>\n<p>Then, for each of the activation maps, we normalize each value using the respective 
statistics<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/normalized_x.png\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-13592\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/normalized_x.png\" alt=\"\" width=\"243\" height=\"63\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/normalized_x.png 308w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/normalized_x-300x78.png 300w\" sizes=\"(max-width: 243px) 100vw, 243px\"><\/a><\/p>\n<p>For Convolutional Neural Networks (CNNs) in particular, we calculate these statistics over all locations of the same channel. From the original batch normalization paper,<\/p>\n<blockquote>\n<p>\u201cFor convolutional layers, we additionally want the normalization to obey the convolutional property \u2013 so that different elements of the same feature map, at different locations, are normalized in the same way\u201d<\/p>\n<\/blockquote>\n<p>Now that we\u2019ve seen how to calculate the normalized activation maps, let\u2019s explore how this can be implemented using Numpy arrays.<\/p>\n<p>Suppose we have these activation maps, all of them representing the same channel,<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import numpy as np\r\n\r\nactivation_map_sample1 = np.array([\r\n    [1, 1, 1],\r\n    [1, 1, 1],\r\n    [1, 1, 1]\r\n], dtype=np.float32)\r\n\r\nactivation_map_sample2 = np.array([\r\n    [1, 2, 3],\r\n    [4, 5, 6],\r\n    [7, 8, 9]\r\n], dtype=np.float32)\r\n\r\nactivation_map_sample3 = np.array([\r\n    [9, 8, 7],\r\n    [6, 5, 4],\r\n    [3, 2, 1]\r\n], dtype=np.float32)<\/pre>\n<p>Then, we want to standardize each element in the activation map across all locations and across the different samples. 
To standardize, we compute their mean and standard deviation using<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">#get mean across the different samples and all locations in the channel\r\nactivation_mean_bn = np.mean([activation_map_sample1, activation_map_sample2, activation_map_sample3])\r\n\r\n#get standard deviation across the different samples and all locations in the channel\r\nactivation_std_bn = np.std([activation_map_sample1, activation_map_sample2, activation_map_sample3])\r\n\r\nprint (activation_mean_bn)\r\nprint (activation_std_bn)<\/pre>\n<p>which outputs<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">3.6666667\r\n2.8284268<\/pre>\n<p>Then, we can standardize an activation map by doing<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">#get batch normalized activation map for sample 1\r\nactivation_map_sample1_bn = (activation_map_sample1 - activation_mean_bn) \/ activation_std_bn<\/pre>\n<p>and doing the same for each sample gives the outputs<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">activation_map_sample1_bn:\r\n[[-0.94280916 -0.94280916 -0.94280916]\r\n [-0.94280916 -0.94280916 -0.94280916]\r\n [-0.94280916 -0.94280916 -0.94280916]]\r\n\r\nactivation_map_sample2_bn:\r\n[[-0.94280916 -0.58925575 -0.2357023 ]\r\n [ 0.11785112  0.47140455  0.82495797]\r\n [ 1.1785114   1.5320647   1.8856182 ]]\r\n\r\nactivation_map_sample3_bn:\r\n[[ 1.8856182   1.5320647   1.1785114 ]\r\n [ 0.82495797  0.47140455  0.11785112]\r\n [-0.2357023  -0.58925575 -0.94280916]]<\/pre>\n<p>But we hit a snag at inference time: we may not have a batch of examples to compute statistics from, and even if we did, it would be preferable for the output to be computed deterministically from the input. So, we need to calculate a fixed set of parameters to be used at inference time. 
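One common way to obtain such fixed parameters is an exponential moving average of the batch statistics, which can be sketched in plain NumPy as follows (a minimal illustration; the `momentum` value and the sample batches are made-up examples, and Keras\u2019 BatchNormalization maintains equivalent moving averages for you):

```python
import numpy as np

def update_moving_stats(moving_mean, moving_var, batch, momentum=0.9):
    # blend the running statistics with the current batch's statistics
    batch_mean = np.mean(batch)
    batch_var = np.var(batch)
    moving_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    moving_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return moving_mean, moving_var

# start from the conventional initialization (mean 0, variance 1)
moving_mean, moving_var = 0.0, 1.0
for batch in [np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])]:
    moving_mean, moving_var = update_moving_stats(moving_mean, moving_var, batch)

# at inference time, normalize with the stored statistics instead of batch ones
x = np.array([2.0, 2.0])
x_hat = (x - moving_mean) / np.sqrt(moving_var + 1e-3)  # small epsilon for stability
```

Because the stored statistics are fixed once training ends, the same input always produces the same normalized output at inference time.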
For this purpose, we instead maintain a moving average of the means and variances during training, which we use at inference time to compute the outputs of the layers.<\/p>\n<p>However, simply standardizing the inputs to a layer in this way also changes the representational ability of the layer. One example brought up in the batch normalization paper was the sigmoid nonlinearity, where normalizing the inputs would constrain them to the linear regime of the sigmoid function. To address this, another linear layer is added to scale and recenter the values, with two trainable parameters that learn the appropriate scale and center to use.<\/p>\n<h2>Implementing Batch Normalization in TensorFlow<\/h2>\n<p>Now that we understand what goes on with batch normalization under the hood, let\u2019s see how we can use Keras\u2019 batch normalization layer as part of our deep learning models.<\/p>\n<p>To implement batch normalization as part of our deep learning models in TensorFlow, we can use the <code>keras.layers.BatchNormalization<\/code> layer. Using the Numpy arrays from our previous example, we can apply BatchNormalization to them.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import tensorflow as tf\r\nimport tensorflow.keras as keras\r\nfrom tensorflow.keras.layers import BatchNormalization\r\nimport numpy as np\r\n\r\n#stack the samples and add a leading dimension so the whole stack is treated as one channel (axis=0)\r\nactivation_maps = tf.constant(np.expand_dims(np.stack([activation_map_sample1, activation_map_sample2, activation_map_sample3]), axis=0), dtype=tf.float32)\r\n\r\nprint (f\"activation_maps: \\n{activation_maps}\\n\")\r\n\r\nprint (BatchNormalization(axis=0)(activation_maps, training=True))<\/pre>\n<p>which gives us the output<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">activation_maps: \r\n[[[[1. 1. 1.]\r\n   [1. 1. 1.]\r\n   [1. 1. 1.]]\r\n\r\n  [[1. 2. 3.]\r\n   [4. 5. 6.]\r\n   [7. 8. 9.]]\r\n\r\n  [[9. 8. 7.]\r\n   [6. 5. 4.]\r\n   [3. 2. 
1.]]]]\r\n\r\ntf.Tensor(\r\n[[[[-0.9427501  -0.9427501  -0.9427501 ]\r\n   [-0.9427501  -0.9427501  -0.9427501 ]\r\n   [-0.9427501  -0.9427501  -0.9427501 ]]\r\n\r\n  [[-0.9427501  -0.5892188  -0.2356875 ]\r\n   [ 0.11784375  0.471375    0.82490635]\r\n   [ 1.1784375   1.5319688   1.8855002 ]]\r\n\r\n  [[ 1.8855002   1.5319688   1.1784375 ]\r\n   [ 0.82490635  0.471375    0.11784375]\r\n   [-0.2356875  -0.5892188  -0.9427501 ]]]], shape=(1, 3, 3, 3), dtype=float32)<\/pre>\n<p>By default, the BatchNormalization layer uses a scale of 1 and a center of 0 for the linear layer, so these values are close to the ones we computed earlier using Numpy functions; the small difference comes from the epsilon added to the variance for numerical stability.<\/p>\n<h2>Normalization and Batch Normalization in Action<\/h2>\n<p>Now that we\u2019ve seen how to implement the normalization and batch normalization layers in TensorFlow, let\u2019s explore a LeNet-5 model that uses them, and compare it to a model that does not use either of these layers.<\/p>\n<p>First, let\u2019s get our dataset; we\u2019ll use CIFAR-10 for this example.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()<\/pre>\n<p>Using a LeNet-5 model with ReLU activation,<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, MaxPool2D\r\nfrom tensorflow.keras.models import Model\r\nimport tensorflow as tf\r\n\r\nclass LeNet5(tf.keras.Model):\r\n  def __init__(self):\r\n    super(LeNet5, self).__init__()\r\n  def call(self, input_tensor):\r\n    self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding=\"same\", activation=\"relu\")(input_tensor)\r\n    self.maxpool1 = MaxPool2D(pool_size=(2,2))(self.conv1)\r\n    self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding=\"same\", activation=\"relu\")(self.maxpool1)\r\n    self.maxpool2 = MaxPool2D(pool_size=(2, 2))(self.conv2)\r\n    self.flatten = 
Flatten()(self.maxpool2)\r\n    self.fc1 = Dense(units=120, activation=\"relu\")(self.flatten)\r\n    self.fc2 = Dense(units=84, activation=\"relu\")(self.fc1)\r\n    self.fc3 = Dense(units=10, activation=\"softmax\")(self.fc2)\r\n    return self.fc3\r\n\r\ninput_layer = Input(shape=(32,32,3,))\r\nx = LeNet5()(input_layer)\r\n\r\nmodel = Model(inputs=input_layer, outputs=x)\r\n\r\nmodel.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=\"acc\")\r\n\r\nhistory = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))<\/pre>\n<p>Training the model gives us the output,<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Epoch 1\/10\r\n196\/196 [==============================] - 14s 15ms\/step - loss: 3.8905 - acc: 0.2172 - val_loss: 1.9656 - val_acc: 0.2853\r\nEpoch 2\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.8402 - acc: 0.3375 - val_loss: 1.7654 - val_acc: 0.3678\r\nEpoch 3\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.6778 - acc: 0.3986 - val_loss: 1.6484 - val_acc: 0.4039\r\nEpoch 4\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.5663 - acc: 0.4355 - val_loss: 1.5644 - val_acc: 0.4380\r\nEpoch 5\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.4815 - acc: 0.4712 - val_loss: 1.5357 - val_acc: 0.4472\r\nEpoch 6\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.4053 - acc: 0.4975 - val_loss: 1.4883 - val_acc: 0.4675\r\nEpoch 7\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.3300 - acc: 0.5262 - val_loss: 1.4643 - val_acc: 0.4805\r\nEpoch 8\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.2595 - acc: 0.5531 - val_loss: 1.4685 - val_acc: 0.4866\r\nEpoch 9\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.1999 - acc: 0.5752 - val_loss: 1.4302 - val_acc: 0.5026\r\nEpoch 
10\/10\r\n196\/196 [==============================] - 2s 12ms\/step - loss: 1.1370 - acc: 0.5979 - val_loss: 1.4441 - val_acc: 0.5009<\/pre>\n<p>Next, let\u2019s take a look at what happens if we add a normalization layer on the inputs and batch normalization layers before each activation. Amending our LeNet-5 model,<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class LeNet5_Norm(tf.keras.Model):\r\n  def __init__(self, norm_layer, *args, **kwargs):\r\n    super(LeNet5_Norm, self).__init__()\r\n    self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding=\"same\")\r\n    self.norm1 = norm_layer(*args, **kwargs)\r\n    self.relu = tf.keras.layers.ReLU()\r\n    self.max_pool2x2 = MaxPool2D(pool_size=(2,2))\r\n    self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding=\"same\")\r\n    self.norm2 = norm_layer(*args, **kwargs)\r\n    self.flatten = Flatten()\r\n    self.fc1 = Dense(units=120)\r\n    self.norm3 = norm_layer(*args, **kwargs)\r\n    self.fc2 = Dense(units=84)\r\n    self.norm4 = norm_layer(*args, **kwargs)\r\n    self.fc3 = Dense(units=10, activation=\"softmax\")\r\n  def call(self, input_tensor):\r\n    conv1 = self.conv1(input_tensor)\r\n    conv1 = self.norm1(conv1)\r\n    conv1 = self.relu(conv1)\r\n    maxpool1 = self.max_pool2x2(conv1)\r\n    conv2 = self.conv2(maxpool1)\r\n    conv2 = self.norm2(conv2)\r\n    conv2 = self.relu(conv2)\r\n    maxpool2 = self.max_pool2x2(conv2)\r\n    flatten = self.flatten(maxpool2)\r\n    fc1 = self.fc1(flatten)\r\n    fc1 = self.norm3(fc1)\r\n    fc1 = self.relu(fc1)\r\n    fc2 = self.fc2(fc1)\r\n    fc2 = self.norm4(fc2)\r\n    fc2 = self.relu(fc2)\r\n    fc3 = self.fc3(fc2)\r\n    return fc3<\/pre>\n<p>And running the training again, this time with the normalization layers added.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">normalization_layer = Normalization()\r\nnormalization_layer.adapt(trainX)\r\n\r\ninput_layer = Input(shape=(32,32,3,))\r\nx = 
LeNet5_Norm(BatchNormalization)(normalization_layer(input_layer))\r\n\r\nmodel = Model(inputs=input_layer, outputs=x)\r\n\r\nmodel.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=\"acc\")\r\n\r\nhistory = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))<\/pre>\n<p>And we see that the model converges faster and gets a higher validation accuracy.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Epoch 1\/10\r\n196\/196 [==============================] - 5s 17ms\/step - loss: 1.4643 - acc: 0.4791 - val_loss: 1.3837 - val_acc: 0.5054\r\nEpoch 2\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 1.1171 - acc: 0.6041 - val_loss: 1.2150 - val_acc: 0.5683\r\nEpoch 3\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 0.9627 - acc: 0.6606 - val_loss: 1.1038 - val_acc: 0.6086\r\nEpoch 4\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 0.8560 - acc: 0.7003 - val_loss: 1.0976 - val_acc: 0.6229\r\nEpoch 5\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 0.7644 - acc: 0.7325 - val_loss: 1.1073 - val_acc: 0.6153\r\nEpoch 6\/10\r\n196\/196 [==============================] - 3s 15ms\/step - loss: 0.6872 - acc: 0.7617 - val_loss: 1.1484 - val_acc: 0.6128\r\nEpoch 7\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 0.6229 - acc: 0.7850 - val_loss: 1.1469 - val_acc: 0.6346\r\nEpoch 8\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 0.5583 - acc: 0.8067 - val_loss: 1.2041 - val_acc: 0.6206\r\nEpoch 9\/10\r\n196\/196 [==============================] - 3s 15ms\/step - loss: 0.4998 - acc: 0.8300 - val_loss: 1.3095 - val_acc: 0.6071\r\nEpoch 10\/10\r\n196\/196 [==============================] - 3s 14ms\/step - loss: 0.4474 - acc: 0.8471 - val_loss: 1.2649 - val_acc: 0.6177<\/pre>\n<p>Plotting the train and validation accuracies of both models,<\/p>\n<div 
id=\"attachment_13591\" style=\"width: 399px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_without_normalization.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-13591\" loading=\"lazy\" class=\"wp-image-13591 size-full\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_without_normalization.png\" alt=\"\" width=\"389\" height=\"278\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_without_normalization.png 389w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_without_normalization-300x214.png 300w\" sizes=\"(max-width: 389px) 100vw, 389px\"><\/a><\/p>\n<p id=\"caption-attachment-13591\" class=\"wp-caption-text\">Train and validation accuracy of LeNet-5<\/p>\n<\/div>\n<div id=\"attachment_13590\" style=\"width: 396px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_with_normalization.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-13590\" loading=\"lazy\" class=\"wp-image-13590 size-full\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_with_normalization.png\" alt=\"\" width=\"386\" height=\"278\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_with_normalization.png 386w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2022\/05\/lenet_5_with_normalization-300x216.png 300w\" sizes=\"(max-width: 386px) 100vw, 386px\"><\/a><\/p>\n<p id=\"caption-attachment-13590\" class=\"wp-caption-text\">Train and validation accuracy of LeNet-5 with normalization and batch normalization added<\/p>\n<\/div>\n<p>A word of caution when using batch normalization: it\u2019s generally not advised to use batch normalization together with dropout, as batch normalization already has a regularizing effect. 
Also, very small batch sizes can be an issue for batch normalization, as the quality of the calculated statistics (mean and variance) degrades with batch size; in the extreme case of a batch containing a single sample, a fully connected layer\u2019s activations would all be normalized to 0. Consider using layer normalization (see the further reading section below) if you need to train with small batch sizes.<\/p>\n<p>Here\u2019s the complete code for the model with both normalization layers.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, BatchNormalization, MaxPool2D, Normalization\r\nfrom tensorflow.keras.models import Model\r\nimport tensorflow as tf\r\nimport tensorflow.keras as keras\r\n\r\nclass LeNet5_Norm(tf.keras.Model):\r\n  def __init__(self, norm_layer, *args, **kwargs):\r\n    super(LeNet5_Norm, self).__init__()\r\n    self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding=\"same\")\r\n    self.norm1 = norm_layer(*args, **kwargs)\r\n    self.relu = tf.keras.layers.ReLU()\r\n    self.max_pool2x2 = MaxPool2D(pool_size=(2,2))\r\n    self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding=\"same\")\r\n    self.norm2 = norm_layer(*args, **kwargs)\r\n    self.flatten = Flatten()\r\n    self.fc1 = Dense(units=120)\r\n    self.norm3 = norm_layer(*args, **kwargs)\r\n    self.fc2 = Dense(units=84)\r\n    self.norm4 = norm_layer(*args, **kwargs)\r\n    self.fc3 = Dense(units=10, activation=\"softmax\")\r\n  def call(self, input_tensor):\r\n    conv1 = self.conv1(input_tensor)\r\n    conv1 = self.norm1(conv1)\r\n    conv1 = self.relu(conv1)\r\n    maxpool1 = self.max_pool2x2(conv1)\r\n    conv2 = self.conv2(maxpool1)\r\n    conv2 = self.norm2(conv2)\r\n    conv2 = self.relu(conv2)\r\n    maxpool2 = self.max_pool2x2(conv2)\r\n    flatten = self.flatten(maxpool2)\r\n    fc1 = self.fc1(flatten)\r\n    fc1 = self.norm3(fc1)\r\n    fc1 = self.relu(fc1)\r\n    fc2 = 
self.fc2(fc1)\r\n    fc2 = self.norm4(fc2)\r\n    fc2 = self.relu(fc2)\r\n    fc3 = self.fc3(fc2)\r\n    return fc3\r\n\r\n# load dataset, using cifar10 to show greater improvement in accuracy\r\n(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()\r\n\r\nnormalization_layer = Normalization()\r\nnormalization_layer.adapt(trainX)\r\n\r\ninput_layer = Input(shape=(32,32,3,))\r\nx = LeNet5_Norm(BatchNormalization)(normalization_layer(input_layer))\r\n\r\nmodel = Model(inputs=input_layer, outputs=x)\r\n\r\nmodel.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=\"acc\")\r\n\r\nhistory = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))<\/pre>\n<h2>Further Reading<\/h2>\n<p>Here are some of the different types of normalization that you can implement in your model:<\/p>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1607.06450\">Layer normalization<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1803.08494\">Group normalization<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1607.08022\">Instance Normalization: The Missing Ingredient for Fast Stylization<\/a><\/li>\n<\/ul>\n<p>TensorFlow layers:<\/p>\n<ul>\n<li>TensorFlow addons (Layer, Instance, Group normalization): <a href=\"https:\/\/github.com\/tensorflow\/addons\/blob\/master\/docs\/tutorials\/layers_normalizations.ipynb\">https:\/\/github.com\/tensorflow\/addons\/blob\/master\/docs\/tutorials\/layers_normalizations.ipynb<\/a>\n<\/li>\n<li><a href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/layers\/BatchNormalization\">Batch normalization<\/a><\/li>\n<li><a href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/layers\/Normalization\">Normalization<\/a><\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>In this post, you\u2019ve discovered how normalization and batch normalization work, as well as how to implement them in TensorFlow. 
You have also seen how using these layers can help to significantly improve the performance of your machine learning models.<\/p>\n<p>Specifically, you\u2019ve learned:<\/p>\n<ul>\n<li>What normalization and batch normalization do<\/li>\n<li>How to use normalization and batch normalization in TensorFlow<\/li>\n<li>Some tips when using batch normalization in your machine learning model<\/li>\n<\/ul>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/using-normalization-layers-to-improve-deep-learning-models\/\">Using Normalization Layers to Improve Deep Learning Models<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/using-normalization-layers-to-improve-deep-learning-models\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Zhe Ming Chng You\u2019ve probably been told to standardize or normalize inputs to your model to improve performance. 
But what is normalization and how [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2022\/06\/14\/using-normalization-layers-to-improve-deep-learning-models\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":5691,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5690"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5690"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5690\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/5691"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5690"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5690"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5690"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}