{"id":2086,"date":"2019-05-02T06:36:19","date_gmt":"2019-05-02T06:36:19","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/05\/02\/the-mathematics-of-forward-and-back-propagation\/"},"modified":"2019-05-02T06:36:19","modified_gmt":"2019-05-02T06:36:19","slug":"the-mathematics-of-forward-and-back-propagation","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/05\/02\/the-mathematics-of-forward-and-back-propagation\/","title":{"rendered":"The Mathematics of Forward and Back Propagation"},"content":{"rendered":"<p>Author: ajit jaokar<\/p>\n<div>\n<p>Understanding the maths behind forward and back propagation is not easy.<\/p>\n<p>There are some very good, but also very technical, explanations.<\/p>\n<p>For example, <a href=\"https:\/\/arxiv.org\/abs\/1802.01528\">The Matrix Calculus You Need For Deep Learning<\/a> by Terence Parr and Jeremy Howard is an excellent resource, but it is still too complex for beginners.<\/p>\n<p>I found a much simpler explanation <a href=\"https:\/\/ml-cheatsheet.readthedocs.io\/en\/latest\/\">in the ml cheatsheet<\/a>.<\/p>\n<p>The section below is based on that source, which I have tried to simplify further. All diagrams and equations are also based on <a href=\"https:\/\/ml-cheatsheet.readthedocs.io\/en\/latest\/\">the ml cheatsheet<\/a>.<\/p>\n<h2>Forward propagation<\/h2>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228236054?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228236054?profile=RESIZE_710x\" class=\"align-full\" width=\"560\" height=\"145\"><\/a><\/p>\n<p>Let\u2019s start with forward propagation.<\/p>\n<p>Here, input data is \u201cforward propagated\u201d through the network layer by layer to the 
final layer, which outputs a prediction. The simple network can be seen as a series of nested functions.<\/p>\n<p>For the neural network above, a single pass of forward propagation translates mathematically to:<\/p>\n<p><strong>A ( A( X Wh) Wo )<\/strong><\/p>\n<p>where<\/p>\n<p>A\u00a0is an activation function such as\u00a0<a href=\"https:\/\/ml-cheatsheet.readthedocs.io\/en\/latest\/activation_functions.html#activation-relu\">ReLU<\/a>,<\/p>\n<p>X\u00a0is the input, and<\/p>\n<p>Wh and\u00a0Wo are the weights for the hidden layer and the output layer respectively.<\/p>\n<p>A more complex network is shown below.<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228237146?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228237146?profile=RESIZE_710x\" class=\"align-full\" width=\"467\" height=\"235\"><\/a><\/p>\n<p>INPUT_LAYER_SIZE = 1<\/p>\n<p>HIDDEN_LAYER_SIZE = 2<\/p>\n<p>OUTPUT_LAYER_SIZE = 2<\/p>\n<p>In matrix form, this is represented as:<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228238483?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228238483?profile=RESIZE_710x\" class=\"align-full\" width=\"557\" height=\"343\"><\/a><\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p>Var<\/p>\n<\/td>\n<td>\n<p>Name<\/p>\n<\/td>\n<td>\n<p>Dimensions<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>X<\/p>\n<\/td>\n<td>\n<p>Input<\/p>\n<\/td>\n<td>\n<p>(3, 1)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Wh<\/p>\n<\/td>\n<td>\n<p>Hidden weights<\/p>\n<\/td>\n<td>\n<p>(1, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Bh<\/p>\n<\/td>\n<td>\n<p>Hidden 
bias<\/p>\n<\/td>\n<td>\n<p>(1, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Zh<\/p>\n<\/td>\n<td>\n<p>Hidden weighted input<\/p>\n<\/td>\n<td>\n<p>(3, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>H<\/p>\n<\/td>\n<td>\n<p>Hidden activations<\/p>\n<\/td>\n<td>\n<p>(3, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Wo<\/p>\n<\/td>\n<td>\n<p>Output weights<\/p>\n<\/td>\n<td>\n<p>(2, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Bo<\/p>\n<\/td>\n<td>\n<p>Output bias<\/p>\n<\/td>\n<td>\n<p>(1, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Zo<\/p>\n<\/td>\n<td>\n<p>Output weighted input<\/p>\n<\/td>\n<td>\n<p>(3, 2)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>O<\/p>\n<\/td>\n<td>\n<p>Output activations<\/p>\n<\/td>\n<td>\n<p>(3, 2)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Source: <a href=\"https:\/\/ml-cheatsheet.readthedocs.io\/en\/latest\/forwardpropagation.html\">https:\/\/ml-cheatsheet.readthedocs.io\/en\/latest\/forwardpropagation.html<\/a><\/p>\n<h2>Backpropagation<\/h2>\n<p>Now, let us explore backpropagation. In backpropagation, we adjust each weight in the network in proportion to how much it contributes to the overall error, iteratively reducing each weight\u2019s contribution. Eventually, when the error is minimised, we have a set of weights that produce good predictions.<\/p>\n<p>As we have seen before, forward propagation can be viewed as a series of nested functions. Hence, backpropagation can be seen as the application of the chain rule to find the derivative of the cost with respect to any weight in the network. This derivative tells us how much each weight contributes to the overall error, and the direction in which to update each weight to reduce the error. The error is then reduced through gradient descent. 
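<\/p>
<p>Before turning to the backpropagation equations, it may help to see the forward pass described above as code. Below is a minimal NumPy sketch (not part of the original article): the variable names and dimensions follow the table above, and the random values are purely illustrative.<\/p>

```python
import numpy as np

np.random.seed(0)

# Dimensions follow the table above: 3 samples, 1 input feature,
# 2 hidden units, 2 output units.
X = np.random.randn(3, 1)    # input           (3, 1)
Wh = np.random.randn(1, 2)   # hidden weights  (1, 2)
Bh = np.zeros((1, 2))        # hidden bias     (1, 2)
Wo = np.random.randn(2, 2)   # output weights  (2, 2)
Bo = np.zeros((1, 2))        # output bias     (1, 2)

def relu(Z):
    # ReLU activation, applied element-wise
    return np.maximum(0, Z)

# A single pass of forward propagation: O = A( A(X Wh + Bh) Wo + Bo )
Zh = X @ Wh + Bh             # hidden weighted input (3, 2)
H = relu(Zh)                 # hidden activations    (3, 2)
Zo = H @ Wo + Bo             # output weighted input (3, 2)
O = relu(Zo)                 # output activations    (3, 2)
```
<p>Each row of O is the prediction for the corresponding input sample.<\/p>
<p>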
The equations needed to make a prediction and to calculate the total error, or cost, are:<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228240096?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228240096?profile=RESIZE_710x\" class=\"align-full\" width=\"586\" height=\"246\"><\/a><\/p>\n<p>To use these equations:<\/p>\n<ol>\n<li>We first calculate the output layer error.<\/li>\n<li>We pass this error back to the hidden layer before it.<\/li>\n<li>At that hidden layer, we calculate the error (the hidden layer error) and pass the result to the hidden layer before it, and so on.<\/li>\n<li>At every layer, we calculate the derivative of cost with respect to that layer\u2019s weights.<\/li>\n<li>The resulting derivative tells us in which direction to adjust our weights to reduce the overall cost. This step is performed using the gradient descent algorithm.<\/li>\n<\/ol>\n<p>Hence, we first calculate the derivative of cost with respect to the output layer input, Zo. This gives us the impact of the final layer\u2019s weights on the overall error in the network. The derivative is:<\/p>\n<p><strong>C\u2032(Zo)=(y^\u2212y)<\/strong><strong>\u22c5<\/strong><strong>R\u2032(Zo)<\/strong><\/p>\n<p>Here,<\/p>\n<p>(y^\u2212y) is the prediction error (the derivative of the cost with respect to the output), and<\/p>\n<p>R\u2032(Zo) represents the derivative of the ReLU activation for the output layer.<\/p>\n<p>This error is represented by Eo, where<\/p>\n<p><strong>Eo=(y^\u2212y)<\/strong><strong>\u22c5<\/strong><strong>R\u2032(Zo)<\/strong><\/p>\n<p>Now, to calculate the hidden layer error, we need to find the derivative of cost with respect to the hidden layer input, Zh. 
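<\/p>
<p>As a concrete illustration (a NumPy sketch, not part of the original article), one full backpropagation step for the network above can be written as follows. It computes the output error Eo from the formula above, the hidden error Eh derived just below, and the weight updates described in the rest of this section; the transpose calls simply make the matrix shapes line up, and all values are illustrative.<\/p>

```python
import numpy as np

np.random.seed(0)
lr = 0.01                            # learning rate (illustrative value)

# Same shapes as in the forward-propagation table
X = np.random.randn(3, 1)            # input
y = np.random.randn(3, 2)            # targets, same shape as the output
Wh, Bh = np.random.randn(1, 2), np.zeros((1, 2))
Wo, Bo = np.random.randn(2, 2), np.zeros((1, 2))

def relu(Z):
    return np.maximum(0, Z)

def relu_prime(Z):
    # derivative of ReLU: 1 where Z > 0, else 0
    return (Z > 0).astype(float)

# Forward pass
Zh = X @ Wh + Bh
H = relu(Zh)
Zo = H @ Wo + Bo
yhat = relu(Zo)

# Backward pass
Eo = (yhat - y) * relu_prime(Zo)     # output layer error
Eh = (Eo @ Wo.T) * relu_prime(Zh)    # hidden layer error

# Derivative of cost w.r.t. each weight: current layer error x layer input
dWo = H.T @ Eo                       # same shape as Wo: (2, 2)
dWh = X.T @ Eh                       # same shape as Wh: (1, 2)

# Gradient descent update
Wo -= lr * dWo
Wh -= lr * dWh
```
<p>Repeating the forward and backward pass in a loop drives the cost down; a complete implementation would also update the biases Bo and Bh in the same way.<\/p>
<p>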
Following the same logic, this can be represented as<\/p>\n<p><strong>Eh=Eo<\/strong><strong>\u22c5<\/strong><strong>Wo<\/strong><strong>\u22c5<\/strong><strong>R\u2032(Zh)<\/strong><\/p>\n<p>where R\u2032(Zh) represents the derivative of the ReLU activation for the hidden layer.<\/p>\n<p>This formula is at the core of backpropagation:<\/p>\n<ol>\n<li>We calculate the current layer\u2019s error.<\/li>\n<li>We pass the weighted error back to the previous layer.<\/li>\n<li>We continue the process through the hidden layers.<\/li>\n<li>Along the way, we update the weights using the derivative of cost with respect to each weight.<\/li>\n<\/ol>\n<p>The derivative of cost with respect to\u00a0any weight is represented as<\/p>\n<p><strong>C\u2032(w)=CurrentLayerError<\/strong><strong>\u22c5<\/strong><strong>CurrentLayerInput<\/strong>, where Input\u00a0refers to the activation from the previous\u00a0layer, not the weighted input, Z.<\/p>\n<p>Hence, the three equations that together form the foundation of backpropagation are:<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228241609?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228241609?profile=RESIZE_710x\" class=\"align-full\"><\/a><\/p>\n<p>The process can be visualised as below:<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228242438?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/2228242438?profile=RESIZE_710x\" class=\"align-full\" width=\"574\" height=\"169\"><\/a><\/p>\n<p>These equations are not easy to understand, and I hope you find this simplified explanation useful.<\/p>\n<p>I keep trying to improve my own understanding and to explain them better.<\/p>\n<p>I welcome your 
comments.<\/p>\n<p>Source: adapted and simplified from\u00a0<a href=\"https:\/\/ml-cheatsheet.readthedocs.io\/en\/latest\/\">the ml cheatsheet<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:822155\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: ajit jaokar Understanding the maths behind forward and back propagation is not very easy. There are some very good \u2013 but also very technical [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/05\/02\/the-mathematics-of-forward-and-back-propagation\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":465,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2086"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2086"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2086\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/456"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2086"}],"wp:te
rm":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2086"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2086"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}