{"id":5246,"date":"2021-12-01T06:28:48","date_gmt":"2021-12-01T06:28:48","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/12\/01\/application-of-differentiations-in-neural-networks\/"},"modified":"2021-12-01T06:28:48","modified_gmt":"2021-12-01T06:28:48","slug":"application-of-differentiations-in-neural-networks","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/12\/01\/application-of-differentiations-in-neural-networks\/","title":{"rendered":"Application of differentiations in neural networks"},"content":{"rendered":"<p>Author: Adrian Tam<\/p>\n<div>\n<p>Differential calculus is an important tool in machine learning algorithms. In neural networks in particular, the gradient descent algorithm depends on the gradient, a quantity computed by differentiation.<\/p>\n<p>In this tutorial, we will see how the back-propagation technique is used to find the gradients in neural networks.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>What a total differential and a total derivative are<\/li>\n<li>How to compute the total derivatives in neural networks<\/li>\n<li>How back-propagation helps in computing the total derivatives<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_4963\" style=\"width: 650px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-4963\" loading=\"lazy\" class=\"size-full wp-image-4963\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/11\/freeman-zhou-plX7xeNb3Yo-unsplash.jpg\" alt=\"Application of differentiations in neural networks\" width=\"640\" height=\"480\"><\/p>\n<p id=\"caption-attachment-4963\" class=\"wp-caption-text\">Application of differentiations in neural networks<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/plX7xeNb3Yo\">Freeman Zhou<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial overview<\/h2>\n<p>This tutorial is divided into 5 parts; they 
are:<\/p>\n<ol>\n<li>Total differential and total derivatives<\/li>\n<li>Algebraic representation of a multilayer perceptron model<\/li>\n<li>Finding the gradient by back-propagation<\/li>\n<li>Matrix form of gradient equations<\/li>\n<li>Implementing back-propagation<\/li>\n<\/ol>\n<h2>Total differential and total derivatives<\/h2>\n<p>For a function such as $f(x)$, we denote its derivative as $f'(x)$ or $\\frac{df}{dx}$. But for a multivariate function, such as $f(u,v)$, we have a partial derivative of $f$ with respect to $u$ denoted as $\\frac{\\partial f}{\\partial u}$, or sometimes written as $f_u$. A partial derivative is obtained by differentiation of $f$ with respect to $u$ while assuming the other variable $v$ is a constant. Therefore, we use $\\partial$ instead of $d$ as the symbol for differentiation to signify the difference.<\/p>\n<p>However, what if the $u$ and $v$ in $f(u,v)$ are both functions of $x$? In other words, we can write $u(x)$ and $v(x)$ and $f(u(x), v(x))$. So $x$ determines the value of $u$ and $v$ and, in turn, determines $f(u,v)$. In this case, it is perfectly fine to ask what $\\frac{df}{dx}$ is, as $f$ is eventually determined by $x$.<\/p>\n<p>This is the concept of total derivatives. In fact, for a multivariate function $f(t,u,v)=f(t(x),u(x),v(x))$, we always have<br \/>\n$$<br \/>\n\\frac{df}{dx} = \\frac{\\partial f}{\\partial t}\\frac{dt}{dx} + \\frac{\\partial f}{\\partial u}\\frac{du}{dx} + \\frac{\\partial f}{\\partial v}\\frac{dv}{dx}<br \/>\n$$<br \/>\nThe above notation is called the total derivative because it is the sum of the contributions from all the partial derivatives. 
In essence, it applies the chain rule to find the derivative.<\/p>\n<p>If we take away the $dx$ part in the above equation, what we get is an approximate change in $f$ with respect to $x$, i.e.,<br \/>\n$$<br \/>\ndf = \\frac{\\partial f}{\\partial t}dt + \\frac{\\partial f}{\\partial u}du + \\frac{\\partial f}{\\partial v}dv<br \/>\n$$<br \/>\nWe call this notation the total differential.<\/p>\n<h2>Algebraic representation of a multilayer perceptron model<\/h2>\n<p>Consider the network:<\/p>\n<div id=\"attachment_4963\" style=\"width: 650px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-4963\" loading=\"lazy\" class=\"size-full wp-image-4963\" src=\"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/3\/30\/Multilayer_Neural_Network.png\" width=\"640\" height=\"480\"><\/p>\n<p id=\"caption-attachment-4963\" class=\"wp-caption-text\">An example of a neural network. Source: <a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:Multilayer_Neural_Network.png\">https:\/\/commons.wikimedia.org\/wiki\/File:Multilayer_Neural_Network.png<\/a><\/p>\n<\/div>\n<p>This is a simple, fully-connected, 4-layer neural network. Let\u2019s call the input layer layer 0, the two hidden layers layers 1 and 2, and the output layer layer 3. In this picture, we see that we have $n_0=3$ input units, $n_1=4$ units in the first hidden layer, and $n_2=2$ units in the second hidden layer. There are $n_3=2$ output units.<\/p>\n<p>Denote the input to the network as $x_i$ where $i=1,\\cdots,n_0$ and the network\u2019s output as $\\hat{y}_i$ where $i=1,\\cdots,n_3$. 
Then we can write<\/p>\n<p>$$<br \/>\n\\begin{aligned}<br \/>\nh_{1i} &amp;= f_1(\\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i) &amp; \\text{for } i &amp;= 1,\\cdots,n_1\\\\<br \/>\nh_{2i} &amp;= f_2(\\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) &amp; i &amp;= 1,\\cdots,n_2\\\\<br \/>\n\\hat{y}_i &amp;= f_3(\\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i) &amp; i &amp;= 1,\\cdots,n_3<br \/>\n\\end{aligned}<br \/>\n$$<\/p>\n<p>Here the activation function at layer $k$ is denoted as $f_k$. The outputs of the first hidden layer are denoted as $h_{1i}$ for the $i$-th unit. Similarly, the outputs of the second hidden layer are denoted as $h_{2i}$. The weights and bias of unit $i$ in layer $k$ are denoted as $w^{(k)}_{ij}$ and $b^{(k)}_i$ respectively.<\/p>\n<p>In the above, we can see that the output of layer $k-1$ feeds into layer $k$. Therefore, while $\\hat{y}_i$ is expressed as a function of $h_{2j}$, $h_{2i}$ is itself a function of $h_{1j}$ and, in turn, a function of $x_j$.<\/p>\n<p>The above describes the construction of a neural network in terms of algebraic equations. Training a neural network also requires a <em>loss function<\/em>, which we minimize in the training loop. Depending on the application, we commonly use cross entropy for classification problems or mean squared error for regression problems. With the target variables as $y_i$, the squared error loss function (the mean squared error without the constant factor $1\/n_3$) is specified as<br \/>\n$$<br \/>\nL = \\sum_{i=1}^{n_3} (y_i-\\hat{y}_i)^2<br \/>\n$$<\/p>\n<h2>Finding the gradient by back-propagation<\/h2>\n<p>In the above construct, $x_i$ and $y_i$ are from the dataset. The parameters of the neural network are $w$ and $b$. The activation functions $f_i$ are fixed by design, while the outputs at each layer, $h_{1i}$, $h_{2i}$, and $\\hat{y}_i$, are dependent variables. 
In training the neural network, our goal is to update $w$ and $b$ in each iteration, namely, by the gradient descent update rule:<br \/>\n$$<br \/>\n\\begin{aligned}<br \/>\nw^{(k)}_{ij} &amp;= w^{(k)}_{ij} - \\eta \\frac{\\partial L}{\\partial w^{(k)}_{ij}} \\\\<br \/>\nb^{(k)}_{i} &amp;= b^{(k)}_{i} - \\eta \\frac{\\partial L}{\\partial b^{(k)}_{i}}<br \/>\n\\end{aligned}<br \/>\n$$<br \/>\nwhere $\\eta$ is the learning rate parameter of gradient descent.<\/p>\n<p>From the equation of $L$ we know that $L$ does not depend directly on $w^{(k)}_{ij}$ or $b^{(k)}_i$ but on $\\hat{y}_i$. However, $\\hat{y}_i$ can eventually be written as a function of $w^{(k)}_{ij}$ and $b^{(k)}_i$. Let\u2019s see, one layer at a time, how the weights and bias at layer $k$ are connected to $\\hat{y}_i$ at the output layer.<\/p>\n<p>We begin with the loss metric. If we consider the loss of a single data point, we have<br \/>\n$$<br \/>\n\\begin{aligned}<br \/>\nL &amp;= \\sum_{i=1}^{n_3} (y_i-\\hat{y}_i)^2\\\\<br \/>\n\\frac{\\partial L}{\\partial \\hat{y}_i} &amp;= -2(y_i - \\hat{y}_i) &amp; \\text{for } i &amp;= 1,\\cdots,n_3<br \/>\n\\end{aligned}<br \/>\n$$<br \/>\nHere we see that the loss function depends on all outputs $\\hat{y}_i$ and therefore we can find a partial derivative $\\frac{\\partial L}{\\partial \\hat{y}_i}$ for each of them.<\/p>\n<p>Now let\u2019s look at the output layer:<br \/>\n$$<br \/>\n\\begin{aligned}<br \/>\n\\hat{y}_i &amp;= f_3(\\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i) &amp; \\text{for }i &amp;= 1,\\cdots,n_3 \\\\<br \/>\n\\frac{\\partial L}{\\partial w^{(3)}_{ij}} &amp;= \\frac{\\partial L}{\\partial \\hat{y}_i}\\frac{\\partial \\hat{y}_i}{\\partial w^{(3)}_{ij}} &amp; i &amp;= 1,\\cdots,n_3; j=1,\\cdots,n_2 \\\\<br \/>\n&amp;= \\frac{\\partial L}{\\partial \\hat{y}_i} f'_3(\\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)h_{2j} \\\\<br \/>\n\\frac{\\partial L}{\\partial b^{(3)}_i} &amp;= \\frac{\\partial L}{\\partial \\hat{y}_i}\\frac{\\partial \\hat{y}_i}{\\partial b^{(3)}_i} &amp; i &amp;= 1,\\cdots,n_3 \\\\<br \/>\n&amp;= \\frac{\\partial L}{\\partial \\hat{y}_i}f'_3(\\sum_{j=1}^{n_2} w^{(3)}_{ij} 
h_{2j} + b^{(3)}_i)<br \/>\n\\end{aligned}<br \/>\n$$<br \/>\nThe weight $w^{(3)}_{ij}$ at layer 3 applies to input $h_{2j}$ and affects output $\\hat{y}_i$ only. Hence we can write the derivative $\\frac{\\partial L}{\\partial w^{(3)}_{ij}}$ as the product of two derivatives $\\frac{\\partial L}{\\partial \\hat{y}_i}\\frac{\\partial \\hat{y}_i}{\\partial w^{(3)}_{ij}}$. The same holds for the bias $b^{(3)}_i$. In the above, we make use of $\\frac{\\partial L}{\\partial \\hat{y}_i}$, which we derived previously.<\/p>\n<p>But in fact, we can also write the partial derivative of $L$ with respect to the output of the second layer, $h_{2j}$. It is not used for the update of weights and bias on layer 3 but we will see its importance later:<br \/>\n$$<br \/>\n\\begin{aligned}<br \/>\n\\frac{\\partial L}{\\partial h_{2j}} &amp;= \\sum_{i=1}^{n_3}\\frac{\\partial L}{\\partial \\hat{y}_i}\\frac{\\partial \\hat{y}_i}{\\partial h_{2j}} &amp; \\text{for }j &amp;= 1,\\cdots,n_2 \\\\<br \/>\n&amp;= \\sum_{i=1}^{n_3}\\frac{\\partial L}{\\partial \\hat{y}_i}f'_3(\\sum_{j=1}^{n_2} w^{(3)}_{ij} h_{2j} + b^{(3)}_i)w^{(3)}_{ij}<br \/>\n\\end{aligned}<br \/>\n$$<br \/>\nThis one is the interesting one and differs from the previous partial derivatives. Note that $h_{2j}$ is an output of layer 2. Each output of layer 2 affects every output $\\hat{y}_i$ in layer 3. Therefore, to find $\\frac{\\partial L}{\\partial h_{2j}}$ we need to add up the contributions through every output at layer 3. Hence the summation sign in the equation above. 
We can consider $\\frac{\\partial L}{\\partial h_{2j}}$ as the total derivative, in which we apply the chain rule $\\frac{\\partial L}{\\partial \\hat{y}_i}\\frac{\\partial \\hat{y}_i}{\\partial h_{2j}}$ for every output $i$ and then sum the results.<\/p>\n<p>If we move back to layer 2, we can derive the derivatives similarly:<br \/>\n$$<br \/>\n\\begin{aligned}<br \/>\nh_{2i} &amp;= f_2(\\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) &amp; \\text{for }i &amp;= 1,\\cdots,n_2\\\\<br \/>\n\\frac{\\partial L}{\\partial w^{(2)}_{ij}} &amp;= \\frac{\\partial L}{\\partial h_{2i}}\\frac{\\partial h_{2i}}{\\partial w^{(2)}_{ij}} &amp; i&amp;=1,\\cdots,n_2; j=1,\\cdots,n_1 \\\\<br \/>\n&amp;= \\frac{\\partial L}{\\partial h_{2i}}f'_2(\\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i)h_{1j} \\\\<br \/>\n\\frac{\\partial L}{\\partial b^{(2)}_i} &amp;= \\frac{\\partial L}{\\partial h_{2i}}\\frac{\\partial h_{2i}}{\\partial b^{(2)}_i} &amp; i &amp;= 1,\\cdots,n_2 \\\\<br \/>\n&amp;= \\frac{\\partial L}{\\partial h_{2i}}f'_2(\\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) \\\\<br \/>\n\\frac{\\partial L}{\\partial h_{1j}} &amp;= \\sum_{i=1}^{n_2}\\frac{\\partial L}{\\partial h_{2i}}\\frac{\\partial h_{2i}}{\\partial h_{1j}} &amp; j&amp;= 1,\\cdots,n_1 \\\\<br \/>\n&amp;= \\sum_{i=1}^{n_2}\\frac{\\partial L}{\\partial h_{2i}}f'_2(\\sum_{j=1}^{n_1} w^{(2)}_{ij} h_{1j} + b^{(2)}_i) w^{(2)}_{ij}<br \/>\n\\end{aligned}<br \/>\n$$<\/p>\n<p>In the equations above, we are reusing $\\frac{\\partial L}{\\partial h_{2i}}$ that we derived earlier. Again, this derivative is computed as a sum of several products from the chain rule. As before, we also derived $\\frac{\\partial L}{\\partial h_{1j}}$. It is not used to train $w^{(2)}_{ij}$ or $b^{(2)}_i$ but will be used for the preceding layer. 
So for layer 1, we have<\/p>\n<p>$$<br \/>\n\\begin{aligned}<br \/>\nh_{1i} &amp;= f_1(\\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i) &amp; \\text{for } i &amp;= 1,\\cdots,n_1\\\\<br \/>\n\\frac{\\partial L}{\\partial w^{(1)}_{ij}} &amp;= \\frac{\\partial L}{\\partial h_{1i}}\\frac{\\partial h_{1i}}{\\partial w^{(1)}_{ij}} &amp; i&amp;=1,\\cdots,n_1; j=1,\\cdots,n_0 \\\\<br \/>\n&amp;= \\frac{\\partial L}{\\partial h_{1i}}f'_1(\\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i)x_j \\\\<br \/>\n\\frac{\\partial L}{\\partial b^{(1)}_i} &amp;= \\frac{\\partial L}{\\partial h_{1i}}\\frac{\\partial h_{1i}}{\\partial b^{(1)}_i} &amp; i&amp;=1,\\cdots,n_1 \\\\<br \/>\n&amp;= \\frac{\\partial L}{\\partial h_{1i}}f'_1(\\sum_{j=1}^{n_0} w^{(1)}_{ij} x_j + b^{(1)}_i)<br \/>\n\\end{aligned}<br \/>\n$$<\/p>\n<p>and this completes all the derivatives needed for training the neural network using the gradient descent algorithm.<\/p>\n<p>Recall how we derived the above: We first start from the loss function $L$ and find the derivatives one by one in the reverse order of the layers. We write down the derivatives on layer $k$ and reuse them for the derivatives on layer $k-1$. While computing the output $\\hat{y}_i$ from input $x_i$ starts from layer 0 and moves forward, computing the gradients proceeds in the reverse order. Hence the name \u201cback-propagation\u201d.<\/p>\n<h2>Matrix form of gradient equations<\/h2>\n<p>While we did not use it above, it is cleaner to write the equations in vectors and matrices. We can rewrite the layers and the outputs as:<br \/>\n$$<br \/>\n\\mathbf{a}_k = f_k(\\mathbf{z}_k) = f_k(\\mathbf{W}_k\\mathbf{a}_{k-1}+\\mathbf{b}_k)<br \/>\n$$<br \/>\nwhere $\\mathbf{a}_k$ is a vector of outputs of layer $k$, and assume $\\mathbf{a}_0=\\mathbf{x}$ is the input vector and $\\mathbf{a}_3=\\hat{\\mathbf{y}}$ is the output vector. 
Also denote $\\mathbf{z}_k = \\mathbf{W}_k\\mathbf{a}_{k-1}+\\mathbf{b}_k$ for convenience of notation.<\/p>\n<p>Under such notation, with all vectors taken as column vectors, we can represent $\\frac{\\partial L}{\\partial\\mathbf{a}_k}$ as a vector (and likewise for $\\mathbf{z}_k$ and $\\mathbf{b}_k$) and $\\frac{\\partial L}{\\partial\\mathbf{W}_k}$ as a matrix. Then, if $\\frac{\\partial L}{\\partial\\mathbf{a}_k}$ is known, we have<br \/>\n$$<br \/>\n\\begin{aligned}<br \/>\n\\frac{\\partial L}{\\partial\\mathbf{z}_k} &amp;= \\frac{\\partial L}{\\partial\\mathbf{a}_k}\\odot f_k'(\\mathbf{z}_k) \\\\<br \/>\n\\frac{\\partial L}{\\partial\\mathbf{W}_k} &amp;= \\frac{\\partial L}{\\partial\\mathbf{z}_k} \\cdot \\mathbf{a}_{k-1}^\\top \\\\<br \/>\n\\frac{\\partial L}{\\partial\\mathbf{b}_k} &amp;= \\frac{\\partial L}{\\partial\\mathbf{z}_k} \\\\<br \/>\n\\frac{\\partial L}{\\partial\\mathbf{a}_{k-1}} &amp;= \\left(\\frac{\\partial\\mathbf{z}_k}{\\partial\\mathbf{a}_{k-1}}\\right)^\\top\\cdot\\frac{\\partial L}{\\partial\\mathbf{z}_k} = \\mathbf{W}_k^\\top\\cdot\\frac{\\partial L}{\\partial\\mathbf{z}_k}<br \/>\n\\end{aligned}<br \/>\n$$<br \/>\nwhere $\\frac{\\partial\\mathbf{z}_k}{\\partial\\mathbf{a}_{k-1}}$ is a Jacobian matrix as both $\\mathbf{z}_k$ and $\\mathbf{a}_{k-1}$ are vectors, and this Jacobian matrix happens to be $\\mathbf{W}_k$.<\/p>\n<h2>Implementing back-propagation<\/h2>\n<p>We need the matrix form of the equations because it makes our code simpler and avoids a lot of loops. Let\u2019s see how we can convert these equations into code and make a multilayer perceptron model for classification from scratch using numpy.<\/p>\n<p>The first thing we need to implement is the activation function and the loss function. Both need to be differentiable (at least almost everywhere, as ReLU is not differentiable at zero), or otherwise our gradient descent procedure would not work. Nowadays, it is common to use ReLU activation in the hidden layers and sigmoid activation in the output layer. 
We define them as functions (which assume the input is a numpy array), along with their derivatives:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">import numpy as np\r\n\r\n# Find a small float to avoid division by zero\r\nepsilon = np.finfo(float).eps\r\n\r\n# Sigmoid function and its differentiation\r\ndef sigmoid(z):\r\n    return 1\/(1+np.exp(-z.clip(-500, 500)))\r\ndef dsigmoid(z):\r\n    s = sigmoid(z)\r\n    # derivative of the sigmoid: s'(z) = s(z) * (1 - s(z))\r\n    return s * (1-s)\r\n\r\n# ReLU function and its differentiation\r\ndef relu(z):\r\n    return np.maximum(0, z)\r\ndef drelu(z):\r\n    return (z &gt; 0).astype(float)<\/pre>\n<p>We deliberately clip the input of the sigmoid function to the range from -500 to +500 to avoid overflow. Otherwise, these functions are trivial. Then for classification, we care about accuracy but the accuracy function is not differentiable. Therefore, we use the cross entropy function as loss for training:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># Loss function L(y, yhat) and its differentiation\r\ndef cross_entropy(y, yhat):\r\n    \"\"\"Binary cross entropy function\r\n        L = - y log yhat - (1-y) log (1-yhat)\r\n\r\n    Args:\r\n        y, yhat (np.array): nx1 matrices, where n is the number of data instances\r\n    Returns:\r\n        cross entropy value of shape 1x1, summed over the n instances\r\n        and divided by the number of output columns (1 here)\r\n    \"\"\"\r\n    return -(y.T @ np.log(yhat.clip(epsilon)) + (1-y.T) @ np.log((1-yhat).clip(epsilon))) \/ y.shape[1]\r\n\r\ndef d_cross_entropy(y, yhat):\r\n    \"\"\" dL\/dyhat \"\"\"\r\n    return - np.divide(y, yhat.clip(epsilon)) + np.divide(1-y, (1-yhat).clip(epsilon))<\/pre>\n<p>In the above, we assume the output and the target variables are column matrices of shape (n, 1) in numpy. Hence we use the matrix multiplication operator <code>@<\/code> to compute the sum over the samples and divide by the number of output columns. 
Note that, as written, this design computes the cross entropy <strong>summed<\/strong> over a <strong>batch<\/strong> of samples; divide by the number of samples, <code>y.shape[0]<\/code>, if you want the average.<\/p>\n<p>Then we can implement our multilayer perceptron model. To make it easier to read, we want to create the model by providing the number of neurons at each layer as well as the activation function at each layer. At the same time, we also need the derivatives of the activation functions and the derivative of the loss function for training. The loss function itself, however, is not required, but it is useful for tracking the progress. We create a class to encapsulate the entire model, and define each layer $k$ according to the formula:<br \/>\n$$<br \/>\n\\mathbf{a}_k = f_k(\\mathbf{z}_k) = f_k(\\mathbf{a}_{k-1}\\mathbf{W}_k+\\mathbf{b}_k)<br \/>\n$$<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class mlp:\r\n    '''Multilayer perceptron using numpy\r\n    '''\r\n    def __init__(self, layersizes, activations, derivatives, lossderiv):\r\n        \"\"\"remember config, then initialize array to hold NN parameters without init\"\"\"\r\n        # hold NN config\r\n        self.layersizes = layersizes\r\n        self.activations = activations\r\n        self.derivatives = derivatives\r\n        self.lossderiv = lossderiv\r\n        # parameters, each is a 2D numpy array\r\n        L = len(self.layersizes)\r\n        self.z = [None] * L\r\n        self.W = [None] * L\r\n        self.b = [None] * L\r\n        self.a = [None] * L\r\n        self.dz = [None] * L\r\n        self.dW = [None] * L\r\n        self.db = [None] * L\r\n        self.da = [None] * L\r\n\r\n    def initialize(self, seed=42):\r\n        np.random.seed(seed)\r\n        sigma = 0.1\r\n        for l, (insize, outsize) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1):\r\n            self.W[l] = np.random.randn(insize, outsize) * sigma\r\n            self.b[l] = np.random.randn(1, outsize) * sigma\r\n\r\n    def forward(self, x):\r\n        
self.a[0] = x\r\n        for l, func in enumerate(self.activations, 1):\r\n            # z = a W + b, with `a` as output from previous layer\r\n            # `a` is of size nxs with n the number of data instances, `W` the size sxr, `z` the size nxr\r\n            # `b` is 1xr and broadcast to each row of `z`\r\n            self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l]\r\n            # a = g(z), with `a` as output of this layer, of size nxr\r\n            self.a[l] = func(self.z[l])\r\n        return self.a[-1]<\/pre>\n<p>The variables in this class <code>z<\/code>, <code>W<\/code>, <code>b<\/code>, and <code>a<\/code> are for the forward pass and the variables <code>dz<\/code>, <code>dW<\/code>, <code>db<\/code>, and <code>da<\/code> are their respective gradients, to be computed in the back-propagation. All these variables are represented as numpy arrays.<\/p>\n<p>As we will see later, we are going to test our model using data generated by scikit-learn. Hence our data is a numpy array of shape \u201c(number of samples, number of features)\u201d. Therefore, each sample is a row of a matrix, and in the function <code>forward()<\/code>, the input <code>a<\/code> to each layer is multiplied on the right by the weight matrix. While the activation function and dimension of each layer can be different, the process is the same. Thus we transform the neural network\u2019s input <code>x<\/code> to its output by a loop in the <code>forward()<\/code> function. The network\u2019s output is simply the output of the last layer.<\/p>\n<p>To train the network, we need to run the back-propagation after each forward pass. The back-propagation computes the gradient of the weight and bias of each layer, starting from the output layer and moving back to the input layer. 
With the equations we derived above, the back-propagation function is implemented as:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">class mlp:\r\n    ...\r\n    \r\n    def backward(self, y, yhat):\r\n        # first `da`, at the output\r\n        self.da[-1] = self.lossderiv(y, yhat)\r\n        for l, func in reversed(list(enumerate(self.derivatives, 1))):\r\n            # compute the differentials at this layer\r\n            self.dz[l] = self.da[l] * func(self.z[l])\r\n            self.dW[l] = self.a[l-1].T @ self.dz[l]\r\n            self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True)\r\n            self.da[l-1] = self.dz[l] @ self.W[l].T\r\n\r\n    def update(self, eta):\r\n        for l in range(1, len(self.W)):\r\n            self.W[l] -= eta * self.dW[l]\r\n            self.b[l] -= eta * self.db[l]<\/pre>\n<p>The only difference here is that we compute the gradients not for one training sample, but for the entire batch. Because each sample is a row, the matrix products are transposed relative to the column-vector equations above: <code>dW<\/code> accumulates the contributions of all samples through the matrix product, while <code>db<\/code> is averaged across the samples.<\/p>\n<p>With this, our model is complete. The <code>update()<\/code> function simply applies the gradients found by the back-propagation to the parameters <code>W<\/code> and <code>b<\/code> using the gradient descent update rule.<\/p>\n<p>To test out our model, we make use of scikit-learn to generate a classification dataset:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_circles\r\nfrom sklearn.metrics import accuracy_score\r\n\r\n# Make data: Two circles on x-y plane as a classification problem\r\nX, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)\r\ny = y.reshape(-1,1) # our model expects a 2D array of (n_sample, n_dim)<\/pre>\n<p>and then we build our model: the input is two-dimensional and the output is one-dimensional (logistic regression). 
We make two hidden layers of 4 and 3 neurons respectively:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-large wp-image-13102\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/11\/neuralnetwork.png\" alt=\"\" width=\"800\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/11\/neuralnetwork.png 1474w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/11\/neuralnetwork-300x228.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/11\/neuralnetwork-1024x777.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/11\/neuralnetwork-768x583.png 768w\" sizes=\"(max-width: 1474px) 100vw, 1474px\"><\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># Build a model\r\nmodel = mlp(layersizes=[2, 4, 3, 1],\r\n            activations=[relu, relu, sigmoid],\r\n            derivatives=[drelu, drelu, dsigmoid],\r\n            lossderiv=d_cross_entropy)\r\nmodel.initialize()\r\nyhat = model.forward(X)\r\nloss = cross_entropy(y, yhat)\r\nprint(\"Before training - loss value {} accuracy {}\".format(loss, accuracy_score(y, (yhat &gt; 0.5))))<\/pre>\n<p>We see that, under random weight, the accuracy is 50%:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Before training - loss value [[693.62972747]] accuracy 0.5<\/pre>\n<p>Now we train our network. 
To make things simple, we perform full-batch gradient descent with a fixed learning rate:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># train for each epoch\r\nn_epochs = 150\r\nlearning_rate = 0.005\r\nfor n in range(n_epochs):\r\n    model.forward(X)\r\n    yhat = model.a[-1]\r\n    model.backward(y, yhat)\r\n    model.update(learning_rate)\r\n    loss = cross_entropy(y, yhat)\r\n    print(\"Iteration {} - loss value {} accuracy {}\".format(n, loss, accuracy_score(y, (yhat &gt; 0.5))))<\/pre>\n<p>and the output of one run is (exact values may differ from run to run):<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Iteration 0 - loss value [[693.62972747]] accuracy 0.5\r\nIteration 1 - loss value [[693.62166655]] accuracy 0.5\r\nIteration 2 - loss value [[693.61534159]] accuracy 0.5\r\nIteration 3 - loss value [[693.60994018]] accuracy 0.5\r\n...\r\nIteration 145 - loss value [[664.60120828]] accuracy 0.818\r\nIteration 146 - loss value [[697.97739669]] accuracy 0.58\r\nIteration 147 - loss value [[681.08653776]] accuracy 0.642\r\nIteration 148 - loss value [[665.06165774]] accuracy 0.71\r\nIteration 149 - loss value [[683.6170298]] accuracy 0.614<\/pre>\n<p>Although not perfect, we see the improvement from training. At least in the example above, the accuracy rose above 80% at iteration 145, but then the model diverged. That can be improved by reducing the learning rate as training progresses, which we didn\u2019t implement above. 
Nonetheless, this shows how we compute the gradients by back-propagation and the chain rule.<\/p>\n<p>The complete code is as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">from sklearn.datasets import make_circles\r\nfrom sklearn.metrics import accuracy_score\r\nimport numpy as np\r\nnp.random.seed(0)\r\n\r\n# Find a small float to avoid division by zero\r\nepsilon = np.finfo(float).eps\r\n\r\n# Sigmoid function and its differentiation\r\ndef sigmoid(z):\r\n    return 1\/(1+np.exp(-z.clip(-500, 500)))\r\ndef dsigmoid(z):\r\n    s = sigmoid(z)\r\n    # derivative of the sigmoid: s'(z) = s(z) * (1 - s(z))\r\n    return s * (1-s)\r\n\r\n# ReLU function and its differentiation\r\ndef relu(z):\r\n    return np.maximum(0, z)\r\ndef drelu(z):\r\n    return (z &gt; 0).astype(float)\r\n\r\n# Loss function L(y, yhat) and its differentiation\r\ndef cross_entropy(y, yhat):\r\n    \"\"\"Binary cross entropy function\r\n        L = - y log yhat - (1-y) log (1-yhat)\r\n\r\n    Args:\r\n        y, yhat (np.array): nx1 matrices, where n is the number of data instances\r\n    Returns:\r\n        cross entropy value of shape 1x1, summed over the n instances\r\n        and divided by the number of output columns (1 here)\r\n    \"\"\"\r\n    return -(y.T @ np.log(yhat.clip(epsilon)) + (1-y.T) @ np.log((1-yhat).clip(epsilon))) \/ y.shape[1]\r\n\r\ndef d_cross_entropy(y, yhat):\r\n    \"\"\" dL\/dyhat \"\"\"\r\n    return - np.divide(y, yhat.clip(epsilon)) + np.divide(1-y, (1-yhat).clip(epsilon))\r\n\r\nclass mlp:\r\n    '''Multilayer perceptron using numpy\r\n    '''\r\n    def __init__(self, layersizes, activations, derivatives, lossderiv):\r\n        \"\"\"remember config, then initialize array to hold NN parameters without init\"\"\"\r\n        # hold NN config\r\n        self.layersizes = tuple(layersizes)\r\n        self.activations = tuple(activations)\r\n        self.derivatives = tuple(derivatives)\r\n        self.lossderiv = lossderiv\r\n        assert len(self.layersizes)-1 == len(self.activations), \\\r\n            \"number of layers and the number of activation 
functions do not match\"\r\n        assert len(self.activations) == len(self.derivatives), \\\r\n            \"number of activation functions and number of derivatives do not match\"\r\n        assert all(isinstance(n, int) and n &gt;= 1 for n in layersizes), \\\r\n            \"Only positive integral number of perceptrons is allowed in each layer\"\r\n        # parameters, each is a 2D numpy array\r\n        L = len(self.layersizes)\r\n        self.z = [None] * L\r\n        self.W = [None] * L\r\n        self.b = [None] * L\r\n        self.a = [None] * L\r\n        self.dz = [None] * L\r\n        self.dW = [None] * L\r\n        self.db = [None] * L\r\n        self.da = [None] * L\r\n\r\n    def initialize(self, seed=42):\r\n        \"\"\"initialize the value of weight matrices and bias vectors with small random numbers.\"\"\"\r\n        np.random.seed(seed)\r\n        sigma = 0.1\r\n        for l, (insize, outsize) in enumerate(zip(self.layersizes, self.layersizes[1:]), 1):\r\n            self.W[l] = np.random.randn(insize, outsize) * sigma\r\n            self.b[l] = np.random.randn(1, outsize) * sigma\r\n\r\n    def forward(self, x):\r\n        \"\"\"Feed forward using existing `W` and `b`, and overwrite the result variables `a` and `z`\r\n\r\n        Args:\r\n            x (numpy.ndarray): Input data to feed forward\r\n        \"\"\"\r\n        self.a[0] = x\r\n        for l, func in enumerate(self.activations, 1):\r\n            # z = a W + b, with `a` as output from previous layer\r\n            # `a` is of size nxs with n the number of data instances, `W` the size sxr, `z` the size nxr\r\n            # `b` is 1xr and broadcast to each row of `z`\r\n            self.z[l] = (self.a[l-1] @ self.W[l]) + self.b[l]\r\n            # a = g(z), with `a` as output of this layer, of size nxr\r\n            self.a[l] = func(self.z[l])\r\n        return self.a[-1]\r\n\r\n    def backward(self, y, yhat):\r\n        \"\"\"back propagation using NN output yhat and the 
reference output y, generates dW, dz, db,\r\n        da\r\n        \"\"\"\r\n        assert y.shape[1] == self.layersizes[-1], \"Output size doesn't match network output size\"\r\n        assert y.shape == yhat.shape, \"Output size doesn't match reference\"\r\n        # first `da`, at the output\r\n        self.da[-1] = self.lossderiv(y, yhat)\r\n        for l, func in reversed(list(enumerate(self.derivatives, 1))):\r\n            # compute the differentials at this layer\r\n            self.dz[l] = self.da[l] * func(self.z[l])\r\n            self.dW[l] = self.a[l-1].T @ self.dz[l]\r\n            self.db[l] = np.mean(self.dz[l], axis=0, keepdims=True)\r\n            self.da[l-1] = self.dz[l] @ self.W[l].T\r\n            assert self.z[l].shape == self.dz[l].shape\r\n            assert self.W[l].shape == self.dW[l].shape\r\n            assert self.b[l].shape == self.db[l].shape\r\n            assert self.a[l].shape == self.da[l].shape\r\n\r\n    def update(self, eta):\r\n        \"\"\"Updates W and b\r\n\r\n        Args:\r\n            eta (float): Learning rate\r\n        \"\"\"\r\n        for l in range(1, len(self.W)):\r\n            self.W[l] -= eta * self.dW[l]\r\n            self.b[l] -= eta * self.db[l]\r\n\r\n# Make data: Two circles on x-y plane as a classification problem\r\nX, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)\r\ny = y.reshape(-1,1) # our model expects a 2D array of (n_sample, n_dim)\r\nprint(X.shape)\r\nprint(y.shape)\r\n\r\n# Build a model\r\nmodel = mlp(layersizes=[2, 4, 3, 1],\r\n            activations=[relu, relu, sigmoid],\r\n            derivatives=[drelu, drelu, dsigmoid],\r\n            lossderiv=d_cross_entropy)\r\nmodel.initialize()\r\nyhat = model.forward(X)\r\nloss = cross_entropy(y, yhat)\r\nprint(\"Before training - loss value {} accuracy {}\".format(loss, accuracy_score(y, (yhat &gt; 0.5))))\r\n\r\n# train for each epoch\r\nn_epochs = 150\r\nlearning_rate = 0.005\r\nfor n in range(n_epochs):\r\n    
model.forward(X)\r\n    yhat = model.a[-1]\r\n    model.backward(y, yhat)\r\n    model.update(learning_rate)\r\n    loss = cross_entropy(y, yhat)\r\n    print(\"Iteration {} - loss value {} accuracy {}\".format(n, loss, accuracy_score(y, (yhat &gt; 0.5))))<\/pre>\n<\/p>\n<h2>Further readings<\/h2>\n<p>The back-propagation algorithm is at the center of all neural network training, regardless of which variation of the gradient descent algorithm you use. Textbooks such as this one cover it:<\/p>\n<ul>\n<li>\n<em>Deep Learning<\/em>, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016.<br \/>\n(<a href=\"https:\/\/www.amazon.com\/dp\/0262035618\">https:\/\/www.amazon.com\/dp\/0262035618<\/a>)<\/li>\n<\/ul>\n<p>An earlier post also implemented a neural network from scratch without discussing the math; it explains the steps in greater detail:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/implement-backpropagation-algorithm-scratch-python\/\">How to Code a Neural Network with Backpropagation In Python (from scratch)<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you learned how differentiation is applied to training a neural network.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>What a total differential is and how it is expressed as a sum of partial differentials<\/li>\n<li>How to express a neural network as equations and derive the gradients by differentiation<\/li>\n<li>How back-propagation helps us to express the gradients of each layer in the neural network<\/li>\n<li>How to convert the gradients into code to make a neural network model<\/li>\n<\/ul>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/application-of-differentiations-in-neural-networks\/\">Application of differentiations in neural networks<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a 
href=\"https:\/\/machinelearningmastery.com\/application-of-differentiations-in-neural-networks\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Adrian Tam Differential calculus is an important tool in machine learning algorithms. Neural networks in particular, the gradient descent algorithm depends on the gradient, [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/12\/01\/application-of-differentiations-in-neural-networks\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":5247,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5246"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5246"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5246\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/5247"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5246"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5246"},{"taxonomy":"post
_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5246"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}