{"id":4758,"date":"2021-06-22T06:25:31","date_gmt":"2021-06-22T06:25:31","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/06\/22\/calculus-in-machine-learning-why-it-works\/"},"modified":"2021-06-22T06:25:31","modified_gmt":"2021-06-22T06:25:31","slug":"calculus-in-machine-learning-why-it-works","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/06\/22\/calculus-in-machine-learning-why-it-works\/","title":{"rendered":"Calculus in Machine Learning: Why it Works"},"content":{"rendered":"<p>Author: Stefania Cristina<\/p>\n<div>\n<p>Calculus is one of the core mathematical concepts in machine learning that permits us to understand the internal workings of different machine learning algorithms.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>One of the important applications of calculus in machine learning is the gradient descent algorithm, which, in tandem with backpropagation, allows us to train a neural network model.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>In this tutorial, you will discover the integral role of calculus in machine learning.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Calculus plays an integral role in understanding the internal workings of machine learning algorithms, such as the gradient descent algorithm for minimizing an error function.<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<li>Calculus provides us with the necessary tools to optimise complex objective functions as well as functions with multidimensional inputs, which are representative of different machine learning applications. 
<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<\/ul>\n<p>Let\u2019s get started.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"attachment_12447\" style=\"width: 1034px\" class=\"wp-caption alignnone\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-scaled.jpg\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12447\" loading=\"lazy\" class=\"wp-image-12447 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-1024x768.jpg\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-1024x768.jpg 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-300x225.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-768x576.jpg 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-1536x1152.jpg 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_cover-2048x1536.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/a><\/p>\n<p id=\"caption-attachment-12447\" class=\"wp-caption-text\">Calculus in Machine Learning: Why it Works<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/N9OQ2ZHNwCs\">Hasmik Ghazaryan Olson<\/a>, some rights reserved.<\/p>\n<\/div>\n<p>\u00a0<\/p>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into two parts; they are:<\/p>\n<ul>\n<li>Calculus in Machine Learning<\/li>\n<li>Why Calculus in Machine Learning Works<span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 
\u00a0 \u00a0<\/span>\n<\/li>\n<\/ul>\n<h2><b>Calculus in Machine Learning<\/b><\/h2>\n<p>A neural network model, whether shallow or deep, implements a function that maps a set of inputs to expected outputs.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>The function implemented by the neural network is learned through a training process, which iteratively searches for a set of weights that best enable the neural network to model the variations in the training data.<\/p>\n<blockquote>\n<p><i>A very simple type of function is a linear mapping from a single input to a single output.<span class=\"Apple-converted-space\">\u00a0<\/span><\/i><\/p>\n<p><i><\/i>Page 187, Deep Learning, 2019.<\/p>\n<\/blockquote>\n<p>Such a linear function can be represented by the equation of a line having a slope, <i>m<\/i>, and a y-intercept, <i>c<\/i>:<\/p>\n<p style=\"text-align: center;\"><i>y<\/i> = <i>mx<\/i> + <i>c<\/i><\/p>\n<p>Varying each of the parameters, <i>m<\/i> and <i>c<\/i>, produces different linear models that define different input-output mappings.<\/p>\n<p>\u00a0<\/p>\n<div id=\"attachment_12443\" style=\"width: 313px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12443\" loading=\"lazy\" class=\"wp-image-12443\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1-1024x899.png\" alt=\"\" width=\"303\" height=\"266\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1-1024x899.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1-300x263.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1-768x674.png 768w, 
https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1-1536x1348.png 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_1.png 1846w\" sizes=\"(max-width: 303px) 100vw, 303px\"><\/a><\/p>\n<p id=\"caption-attachment-12443\" class=\"wp-caption-text\">Line Plot of Different Line Models Produced by Varying the Slope and Intercept<br \/>Taken from Deep Learning<\/p>\n<\/div>\n<p>\u00a0<\/p>\n<p>The process of learning the mapping function, therefore, involves the approximation of these model parameters, or <i>weights<\/i>, that result in the minimum error between the predicted and target outputs. This error is calculated by means of a loss function (also referred to as a cost function or error function; the terms are often used interchangeably), and the process of minimizing the loss is referred to as <i>function optimization<\/i>.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>We can apply differential calculus to the process of function optimization. <span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>To better understand how differential calculus can be applied to function optimization, let us return to our specific example of a linear mapping function.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Say that we have some dataset of single input features, <i>x<\/i>, and their corresponding target outputs, <i>y<\/i>. 
In order to measure the error on the dataset, we shall be taking the sum of squared errors (SSE), computed between the predicted and target outputs, as our loss function.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Carrying out a parameter sweep across different values for the model weights, <em>w<sub>0<\/sub><\/em> = <i>m<\/i> and <em>w<sub>1<\/sub><\/em> = <i>c<\/i>, generates individual error profiles that are convex in shape.<\/p>\n<p>\u00a0<\/p>\n<div id=\"attachment_12444\" style=\"width: 313px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12444\" loading=\"lazy\" class=\"wp-image-12444\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2-625x1024.png\" alt=\"\" width=\"303\" height=\"496\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2-625x1024.png 625w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2-183x300.png 183w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2-768x1259.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2-937x1536.png 937w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_2.png 992w\" sizes=\"(max-width: 303px) 100vw, 303px\"><\/a><\/p>\n<p id=\"caption-attachment-12444\" class=\"wp-caption-text\">Line Plots of Error (SSE) Profiles Generated When Sweeping Across a Range of Values for the Slope and Intercept<br \/>Taken from Deep Learning<\/p>\n<\/div>\n<p>\u00a0<\/p>\n<p>Combining the individual error profiles generates a three-dimensional error surface that is also convex in shape. 
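As a concrete illustration of this parameter sweep, the following sketch computes the SSE over a grid of slope and intercept values for a small made-up dataset. The data, the swept ranges, and the helper names are illustrative assumptions, not taken from the tutorial:

```python
# Sketch: sweep the slope w0 (= m) and intercept w1 (= c) of y = m*x + c
# over a grid of values, recording the sum of squared errors (SSE) for each pair.
import numpy as np

# A small, made-up dataset: targets generated from y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.shape)

def sse(m, c):
    """Sum of squared errors between the predictions m*x + c and the targets y."""
    return np.sum((m * x + c - y) ** 2)

# Sweep each weight across a range of values. Each one-dimensional error
# profile is convex, and the combined two-dimensional surface is a convex bowl.
ms = np.linspace(0.0, 4.0, 81)
cs = np.linspace(-1.0, 3.0, 81)
surface = np.array([[sse(m, c) for c in cs] for m in ms])

# The lowest point on the surface identifies the best-fitting line.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
print(f"best m ~ {ms[i]:.2f}, best c ~ {cs[j]:.2f}")
```

The grid point with the lowest SSE lands close to the slope and intercept that generated the data, which is exactly the search that gradient descent automates.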
This error surface is contained within a weight space, which is defined by the swept ranges of values for the model weights, <em>w<sub>0<\/sub><\/em> and <em>w<sub>1<\/sub><\/em>.<\/p>\n<p>\u00a0<\/p>\n<div id=\"attachment_12445\" style=\"width: 313px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_3.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12445\" loading=\"lazy\" class=\"wp-image-12445\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_3-1024x851.png\" alt=\"\" width=\"303\" height=\"252\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_3-1024x851.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_3-300x249.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_3-768x638.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_3.png 1352w\" sizes=\"(max-width: 303px) 100vw, 303px\"><\/a><\/p>\n<p id=\"caption-attachment-12445\" class=\"wp-caption-text\">Three-Dimensional Plot of the Error (SSE) Surface Generated When Both Slope and Intercept are Varied<br \/>Taken from Deep Learning<\/p>\n<\/div>\n<p>\u00a0<\/p>\n<p>Moving across this weight space is equivalent to moving between different linear models. Our objective is to identify the model that best fits the data among all possible alternatives. 
The best model is characterised by the lowest error on the dataset, which corresponds to the lowest point on the error surface.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>A convex or bowl-shaped error surface is incredibly useful for learning a linear function to model a dataset because it means that the learning process can be framed as a search for the lowest point on the error surface. The standard algorithm used to find this lowest point is known as gradient descent.<span class=\"Apple-converted-space\">\u00a0<\/span><\/i><\/p>\n<p><i><\/i>Page 194, Deep Learning, 2019.<\/p>\n<\/blockquote>\n<p>The gradient descent algorithm, acting as the optimization algorithm, seeks to reach the lowest point on the error surface by following its gradient downhill. This descent is based upon the computation of the gradient, or slope, of the error surface.<\/p>\n<p>This is where differential calculus comes into the picture.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>Calculus, and in particular differentiation, is the field of mathematics that deals with rates of change.<\/i><b><span class=\"Apple-converted-space\">\u00a0<\/span><\/b><\/p>\n<p><i><\/i>Page 198, Deep Learning, 2019.<\/p>\n<\/blockquote>\n<p>More formally, let us denote the function that we would like to optimize by:<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p style=\"text-align: center;\"><i>error = <\/i>f(<i>weights<\/i>)<\/p>\n<p>By computing the rate of change, or the slope, of the error with respect to the weights, the gradient descent algorithm can decide how to change the weights in order to keep reducing the error.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<h2><b>Why Calculus in Machine Learning Works<\/b><\/h2>\n<p>The error function that we have considered so far is relatively simple, because it is convex and characterised by a single global minimum.<span 
class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Nonetheless, in the context of machine learning, we often need to optimize more complex functions that can make the optimization task very challenging. Optimization can become even more challenging if the input to the function is also multidimensional.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Calculus provides us with the necessary tools to address both challenges.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Suppose that we have a more generic function that we wish to minimize, and which takes a real input, <i>x<\/i>, to produce a real output, <i>y<\/i>:<\/p>\n<p style=\"text-align: center;\"><i>y<\/i> = f(<i>x<\/i>)<\/p>\n<p>Computing the rate of change at different values of <i>x<\/i> is useful because it gives us an indication of the changes that we need to apply to <i>x<\/i>, in order to obtain the corresponding changes in <i>y<\/i>.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Since we are minimizing the function, our goal is to reach a point that obtains as low a value of f(<i>x<\/i>) as possible that is also characterised by zero rate of change; hence, a global minimum. 
Depending on the complexity of the function, this may not always be possible, since there may be many local minima or saddle points in which the optimisation algorithm may become trapped.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions.<span class=\"Apple-converted-space\">\u00a0<\/span><\/i><\/p>\n<p><i><\/i>Page 84, Deep Learning, 2017.<\/p>\n<\/blockquote>\n<p>Hence, within the context of deep learning, we often accept a suboptimal solution that may not necessarily correspond to a global minimum, so long as it corresponds to a very low value of f(<i>x<\/i>).<\/p>\n<p>\u00a0<\/p>\n<div id=\"attachment_12446\" style=\"width: 410px\" class=\"wp-caption aligncenter\">\n<a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12446\" loading=\"lazy\" class=\"wp-image-12446\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4-1024x576.png\" alt=\"\" width=\"400\" height=\"225\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4-1024x576.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4-300x169.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4-768x432.png 768w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4-1536x865.png 1536w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/06\/calculus_in_machine_learning_4.png 1972w\" sizes=\"(max-width: 400px) 100vw, 400px\"><\/a><\/p>\n<p id=\"caption-attachment-12446\" class=\"wp-caption-text\">Line Plot of Cost 
Function to Minimize Displaying Local and Global Minima<br \/>Taken from Deep Learning<\/p>\n<\/div>\n<p><i><span class=\"Apple-converted-space\">\u00a0\u00a0<\/span><\/i><\/p>\n<p>If the function we are working with takes multiple inputs, calculus also provides us with the concept of <i>partial derivatives<\/i>; or in simpler terms, a method to calculate the rate of change of <i>y<\/i> with respect to changes in each one of the inputs, <i>x<\/i><i><sub>i<\/sub><\/i>, while holding the remaining inputs constant.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>This is why each of the weights is updated independently in the gradient descent algorithm: the weight update rule is dependent on the partial derivative of the SSE for each weight, and because there is a different partial derivative for each weight, there is a separate weight update rule for each weight.<span class=\"Apple-converted-space\">\u00a0<\/span><\/i><\/p>\n<p><i><\/i>Page 200, Deep Learning, 2019.<\/p>\n<\/blockquote>\n<p>Hence, if we consider again the minimization of an error function, calculating the partial derivative for the error with respect to each specific weight permits each weight to be updated independently of the others.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>This also means that the gradient descent algorithm may not follow a straight path down the error surface. Rather, each weight will be updated in proportion to the local gradient of the error curve. Hence, one weight may be updated by a larger amount than another, as needed for the gradient descent algorithm to reach the function minimum. 
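The per-weight updates described above can be sketched for the linear SSE model as follows. The dataset, learning rate, and iteration count are illustrative assumptions, not values from the tutorial:

```python
# Sketch: separate update rule per weight for the linear model y = w0*x + w1,
# each driven by the partial derivative of the SSE with respect to that weight.
# The dataset, learning rate, and iteration count are illustrative assumptions.
import numpy as np

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = 2.0 * x + 1.0                          # targets from y = 2x + 1

w0, w1 = 0.0, 0.0                          # slope and intercept, both start at zero
learning_rate = 0.1

for _ in range(2000):
    residual = (w0 * x + w1) - y           # per-sample prediction error
    grad_w0 = np.sum(2.0 * residual * x)   # partial derivative of SSE w.r.t. w0
    grad_w1 = np.sum(2.0 * residual)       # partial derivative of SSE w.r.t. w1
    # Each weight moves in proportion to its own partial derivative, so the
    # path taken down the error surface need not be a straight line.
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

print(round(w0, 3), round(w1, 3))          # approaches the true slope 2 and intercept 1
```

Note that the two gradients differ in magnitude at each step, which is why one weight can move further than the other on the way down the surface.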
<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<h2><b>Further Reading<\/b><\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3><b>Books<\/b><\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/www.amazon.com\/Deep-Learning-Press-Essential-Knowledge\/dp\/0262537559\/ref=sr_1_4?dchild=1&amp;keywords=deep+learning&amp;qid=1622968138&amp;sr=8-4\">Deep Learning<\/a>, 2019.<\/li>\n<li>\n<a href=\"https:\/\/www.amazon.com\/Deep-Learning-Adaptive-Computation-Machine\/dp\/0262035618\/ref=sr_1_1?dchild=1&amp;keywords=deep+learning&amp;qid=1622968138&amp;sr=8-1\">Deep Learning<\/a>, 2017.<\/li>\n<\/ul>\n<h2><b>Summary<\/b><\/h2>\n<p>In this tutorial, you discovered the integral role of calculus in machine learning.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Calculus plays an integral role in understanding the internal workings of machine learning algorithms, such as the gradient descent algorithm that minimizes an error function based on the computation of the rate of change.<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<li>The concept of the rate of change in calculus can also be exploited to minimise more complex objective functions that are not necessarily convex in shape.<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<li>The calculation of the partial derivative, another important concept in calculus, permits us to work with functions that take multiple inputs.<span class=\"Apple-converted-space\">\u00a0<\/span>\n<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/calculus-in-machine-learning-why-it-works\/\">Calculus in Machine Learning: Why it Works<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a 
href=\"https:\/\/machinelearningmastery.com\/calculus-in-machine-learning-why-it-works\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Stefania Cristina Calculus is one of the core mathematical concepts in machine learning that permits us to understand the internal workings of different machine [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/06\/22\/calculus-in-machine-learning-why-it-works\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":4759,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4758"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4758"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4758\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/4759"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4758"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4758"},{"taxonomy":"post_tag","embeddable":true,"href"
:"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4758"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}