{"id":4717,"date":"2021-06-08T06:28:52","date_gmt":"2021-06-08T06:28:52","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/06\/08\/gradient-descent-optimization-with-adamax-from-scratch\/"},"modified":"2021-06-08T06:28:52","modified_gmt":"2021-06-08T06:28:52","slug":"gradient-descent-optimization-with-adamax-from-scratch","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/06\/08\/gradient-descent-optimization-with-adamax-from-scratch\/","title":{"rendered":"Gradient Descent Optimization With AdaMax From Scratch"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.<\/p>\n<p>A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent, like the Adaptive Movement Estimation (Adam) algorithm, use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.<\/p>\n<p><strong>AdaMax<\/strong> is an extension to the Adam version of gradient descent that generalizes the approach to the infinite norm (max) and may result in a more effective optimization on some problems.<\/p>\n<p>In this tutorial, you will discover how to develop gradient descent optimization with AdaMax from scratch.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.<\/li>\n<li>AdaMax is an extension of the Adam version of gradient descent designed to accelerate the optimization process.<\/li>\n<li>How to implement the AdaMax optimization algorithm from scratch and apply it to an objective function and evaluate the results.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_12177\" style=\"width: 810px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-12177\" loading=\"lazy\" class=\"size-full wp-image-12177\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/05\/Gradient-Descent-Optimization-With-AdaMax-From-Scratch.jpg\" alt=\"Gradient Descent Optimization With AdaMax From Scratch\" width=\"800\" height=\"511\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/05\/Gradient-Descent-Optimization-With-AdaMax-From-Scratch.jpg 800w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/05\/Gradient-Descent-Optimization-With-AdaMax-From-Scratch-300x192.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/05\/Gradient-Descent-Optimization-With-AdaMax-From-Scratch-768x491.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-12177\" class=\"wp-caption-text\">Gradient Descent Optimization With AdaMax From Scratch<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/23155134@N06\/25264818302\/\">Don Graham<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Gradient Descent<\/li>\n<li>AdaMax Optimization Algorithm<\/li>\n<li>Gradient Descent With AdaMax\n<ol>\n<li>Two-Dimensional Test Problem<\/li>\n<li>Gradient Descent Optimization With AdaMax<\/li>\n<li>Visualization of AdaMax Optimization<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Gradient Descent<\/h2>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Gradient_descent\">Gradient<\/a> descent is an optimization algorithm.<\/p>\n<p>It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.<\/p>\n<blockquote>\n<p>First-order methods rely on gradient information to help direct the search for a minimum \u2026<\/p>\n<\/blockquote>\n<p>\u2014 Page 69, <a href=\"https:\/\/amzn.to\/39KZSQn\">Algorithms for Optimization<\/a>, 2019.<\/p>\n<p>The first-order derivative, or simply the \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Derivative\">derivative<\/a>,\u201d is the rate of change or slope of the target function at a specific point, e.g. for a specific input.<\/p>\n<p>If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gradient\">gradient<\/a>.<\/p>\n<ul>\n<li>\n<strong>Gradient<\/strong>: First-order derivative for a multivariate objective function.<\/li>\n<\/ul>\n<p>The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.<\/p>\n<p>Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.<\/p>\n<p>The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.<\/p>\n<p>The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.<\/p>\n<p>The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.<\/p>\n<p>A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">x(t) = x(t-1) \u2013 step_size * f'(x(t))<\/span><\/li>\n<\/ul>\n<p>The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.<\/p>\n<ul>\n<li>\n<strong>Step Size<\/strong>: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.<\/li>\n<\/ul>\n<p>If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.<\/p>\n<p>Now that we are familiar with the gradient descent optimization algorithm, let\u2019s take a look at the AdaMax algorithm.<\/p>\n<h2>AdaMax Optimization Algorithm<\/h2>\n<p>AdaMax algorithm is an extension to the Adaptive Movement Estimation (Adam) Optimization algorithm. More broadly, is an extension to the Gradient Descent Optimization algorithm.<\/p>\n<p>The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled \u201c<a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">Adam: A Method for Stochastic Optimization<\/a>.\u201d<\/p>\n<p>Adam can be understood as updating weights inversely proportional to the scaled L2 norm (squared) of past gradients. AdaMax extends this to the so-called infinite norm (max) of past gradients.<\/p>\n<blockquote>\n<p>In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L^2 norm of their individual current and past gradients<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">Adam: A Method for Stochastic Optimization<\/a>, 2014.<\/p>\n<p>Generally, AdaMax automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.<\/p>\n<p>Let\u2019s step through each element of the algorithm.<\/p>\n<p>First, we must maintain a moment vector and exponentially weighted infinity norm for each parameter being optimized as part of the search, referred to as <em>m<\/em> and <em>u<\/em> respectively.<\/p>\n<p>They are initialized to 0.0 at the start of the search.<\/p>\n<ul>\n<li>m = 0<\/li>\n<li>u = 0<\/li>\n<\/ul>\n<p>The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from <em>x(t-1)<\/em> to <em>x(t)<\/em>.<\/p>\n<p>It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.<\/p>\n<p>First, the gradient (partial derivatives) are calculated for the current time step.<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">g(t) = f'(x(t-1))<\/span><\/li>\n<\/ul>\n<p>Next, the moment vector is updated using the gradient and a hyperparameter <em>beta1<\/em>.<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">m(t) = beta1 * m(t-1) + (1 \u2013 beta1) * g(t)<\/span><\/li>\n<\/ul>\n<p>The exponentially weighted infinity norm is updated using the <em>beta2<\/em> hyperparameter.<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">u(t) = max(beta2 * u(t-1), abs(g(t)))<\/span><\/li>\n<\/ul>\n<p>Where <em>max()<\/em> selects the maximum of the parameters and <em>abs()<\/em> calculates the absolute value.<\/p>\n<p>We can then update the parameter value. This can be broken down into three pieces; the first calculates the step size parameter, the second the gradient, and the third uses the step size and gradient to calculate the new parameter value.<\/p>\n<p>Let\u2019s start with calculating the step size for the parameter using an initial step size hyperparameter called <em>alpha<\/em> and a version of <em>beta1<\/em> that is decaying over time with a specific value for this time step <em>beta1(t)<\/em>:<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">step_size(t) = alpha \/ (1 \u2013 beta1(t))<\/span><\/li>\n<\/ul>\n<p>The gradient used for updating the parameter is calculated as follows:<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">delta(t) = m(t) \/ u(t)<\/span><\/li>\n<\/ul>\n<p>Finally, we can calculate the value for the parameter for this iteration.<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">x(t) = x(t-1) \u2013 step_size(t) * delta(t)<\/span><\/li>\n<\/ul>\n<p>Or the complete update equation can be stated as:<\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">x(t) = x(t-1) \u2013 (alpha \/ (1 \u2013 beta1(t))) * m(t) \/ u(t)<\/span><\/li>\n<\/ul>\n<p>To review, there are three hyperparameters for the algorithm; they are:<\/p>\n<ul>\n<li>\n<strong>alpha<\/strong>: Initial step size (learning rate), a typical value is 0.002.<\/li>\n<li>\n<strong>beta1<\/strong>: Decay factor for first momentum, a typical value is 0.9.<\/li>\n<li>\n<strong>beta2<\/strong>: Decay factor for infinity norm, a typical value is 0.999.<\/li>\n<\/ul>\n<p>The decay schedule for beta1(t) suggested in the paper is calculated using the initial beta1 value raised to the power t, although other decay schedules could be used such as holding the value constant or decaying more aggressively.<\/p>\n<ul>\n<li>beta1(t) = beta1^t<\/li>\n<\/ul>\n<p>And that\u2019s it.<\/p>\n<p>For the full derivation of the AdaMax algorithm in the context of the Adam algorithm, I recommend reading the paper:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">Adam: A Method for Stochastic Optimization<\/a>, 2014.<\/li>\n<\/ul>\n<p>Next, let\u2019s look at how we might implement the algorithm from scratch in Python.<\/p>\n<h2>Gradient Descent With AdaMax<\/h2>\n<p>In this section, we will explore how to implement the gradient descent optimization algorithm with AdaMax Momentum.<\/p>\n<h3>Two-Dimensional Test Problem<\/h3>\n<p>First, let\u2019s define an optimization function.<\/p>\n<p>We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.<\/p>\n<p>The <em>objective()<\/em> function below implements this.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># objective function\r\ndef objective(x, y):\r\n\treturn x**2.0 + y**2.0<\/pre>\n<p>We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.<\/p>\n<p>The complete example of plotting the objective function is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># 3d plot of the test function\r\nfrom numpy import arange\r\nfrom numpy import meshgrid\r\nfrom matplotlib import pyplot\r\n\r\n# objective function\r\ndef objective(x, y):\r\n\treturn x**2.0 + y**2.0\r\n\r\n# define range for input\r\nr_min, r_max = -1.0, 1.0\r\n# sample input range uniformly at 0.1 increments\r\nxaxis = arange(r_min, r_max, 0.1)\r\nyaxis = arange(r_min, r_max, 0.1)\r\n# create a mesh from the axis\r\nx, y = meshgrid(xaxis, yaxis)\r\n# compute targets\r\nresults = objective(x, y)\r\n# create a surface plot with the jet color scheme\r\nfigure = pyplot.figure()\r\naxis = figure.gca(projection='3d')\r\naxis.plot_surface(x, y, results, cmap='jet')\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example creates a three-dimensional surface plot of the objective function.<\/p>\n<p>We can see the familiar bowl shape with the global minima at f(0, 0) = 0.<\/p>\n<div id=\"attachment_12105\" style=\"width: 1290px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-12105\" loading=\"lazy\" class=\"size-full wp-image-12105\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Three-Dimensional-Plot-of-the-Test-Objective-Function.png\" alt=\"Three-Dimensional Plot of the Test Objective Function\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Three-Dimensional-Plot-of-the-Test-Objective-Function.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Three-Dimensional-Plot-of-the-Test-Objective-Function-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Three-Dimensional-Plot-of-the-Test-Objective-Function-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Three-Dimensional-Plot-of-the-Test-Objective-Function-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-12105\" class=\"wp-caption-text\">Three-Dimensional Plot of the Test Objective Function<\/p>\n<\/div>\n<p>We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.<\/p>\n<p>The example below creates a contour plot of the objective function.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># contour plot of the test function\r\nfrom numpy import asarray\r\nfrom numpy import arange\r\nfrom numpy import meshgrid\r\nfrom matplotlib import pyplot\r\n\r\n# objective function\r\ndef objective(x, y):\r\n\treturn x**2.0 + y**2.0\r\n\r\n# define range for input\r\nbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])\r\n# sample input range uniformly at 0.1 increments\r\nxaxis = arange(bounds[0,0], bounds[0,1], 0.1)\r\nyaxis = arange(bounds[1,0], bounds[1,1], 0.1)\r\n# create a mesh from the axis\r\nx, y = meshgrid(xaxis, yaxis)\r\n# compute targets\r\nresults = objective(x, y)\r\n# create a filled contour plot with 50 levels and jet color scheme\r\npyplot.contourf(x, y, results, levels=50, cmap='jet')\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example creates a two-dimensional contour plot of the objective function.<\/p>\n<p>We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.<\/p>\n<div id=\"attachment_12106\" style=\"width: 1290px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-12106\" loading=\"lazy\" class=\"size-full wp-image-12106\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Two-Dimensional-Contour-Plot-of-the-Test-Objective-Function.png\" alt=\"Two-Dimensional Contour Plot of the Test Objective Function\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Two-Dimensional-Contour-Plot-of-the-Test-Objective-Function.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Two-Dimensional-Contour-Plot-of-the-Test-Objective-Function-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Two-Dimensional-Contour-Plot-of-the-Test-Objective-Function-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/12\/Two-Dimensional-Contour-Plot-of-the-Test-Objective-Function-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-12106\" class=\"wp-caption-text\">Two-Dimensional Contour Plot of the Test Objective Function<\/p>\n<\/div>\n<p>Now that we have a test objective function, let\u2019s look at how we might implement the AdaMax optimization algorithm.<\/p>\n<h3>Gradient Descent Optimization With AdaMax<\/h3>\n<p>We can apply the gradient descent with AdaMax to the test problem.<\/p>\n<p>First, we need a function that calculates the derivative for this function.<\/p>\n<p>The derivative of x^2 is x * 2 in each dimension.<\/p>\n<ul>\n<li>f(x) = x^2<\/li>\n<li>f'(x) = x * 2<\/li>\n<\/ul>\n<p>The derivative() function implements this below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># derivative of objective function\r\ndef derivative(x, y):\r\n\treturn asarray([x * 2.0, y * 2.0])<\/pre>\n<p>Next, we can implement gradient descent optimization with AdaMax.<\/p>\n<p>First, we can select a random point in the bounds of the problem as a starting point for the search.<\/p>\n<p>This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# generate an initial point\r\nx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])<\/pre>\n<p>Next, we need to initialize the moment vector and exponentially weighted infinity norm.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# initialize moment vector and weighted infinity norm\r\nm = [0.0 for _ in range(bounds.shape[0])]\r\nu = [0.0 for _ in range(bounds.shape[0])]<\/pre>\n<p>We then run a fixed number of iterations of the algorithm defined by the \u201c<em>n_iter<\/em>\u201d hyperparameter.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# run iterations of gradient descent\r\nfor t in range(n_iter):\r\n\t...<\/pre>\n<p>The first step is to calculate the derivative for the current set of parameters.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# calculate gradient g(t)\r\ng = derivative(x[0], x[1])<\/pre>\n<p>Next, we need to perform the AdaMax update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.<\/p>\n<p>In practice, I recommend using NumPy vector operations for efficiency.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# build a solution one variable at a time\r\nfor i in range(x.shape[0]):\r\n\t...<\/pre>\n<p>First, we need to calculate the moment vector.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)\r\nm[i] = beta1 * m[i] + (1.0 - beta1) * g[i]<\/pre>\n<p>Next, we need to calculate the exponentially weighted infinity norm.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# u(t) = max(beta2 * u(t-1), abs(g(t)))\r\nu[i] = max(beta2 * u[i], abs(g[i]))<\/pre>\n<p>Then the step size used in the update.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# step_size(t) = alpha \/ (1 - beta1(t))\r\nstep_size = alpha \/ (1.0 - beta1**(t+1))<\/pre>\n<p>And the change in variable.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# delta(t) = m(t) \/ u(t)\r\ndelta = m[i] \/ u[i]<\/pre>\n<p>Finally, we can calculate the new value for the variable.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# x(t) = x(t-1) - step_size(t) * delta(t)\r\nx[i] = x[i] - step_size * delta<\/pre>\n<p>This is then repeated for each parameter that is being optimized.<\/p>\n<p>At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# evaluate candidate point\r\nscore = objective(x[0], x[1])\r\n# report progress\r\nprint('&gt;%d f(%s) = %.5f' % (t, x, score))<\/pre>\n<p>We can tie all of this together into a function named <em>adamax()<\/em> that takes the names of the objective and derivative functions as well as the algorithm hyperparameters and returns the best solution found at the end of the search and its evaluation.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># gradient descent algorithm with adamax\r\ndef adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):\r\n\t# generate an initial point\r\n\tx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])\r\n\t# initialize moment vector and weighted infinity norm\r\n\tm = [0.0 for _ in range(bounds.shape[0])]\r\n\tu = [0.0 for _ in range(bounds.shape[0])]\r\n\t# run iterations of gradient descent\r\n\tfor t in range(n_iter):\r\n\t\t# calculate gradient g(t)\r\n\t\tg = derivative(x[0], x[1])\r\n\t\t# build a solution one variable at a time\r\n\t\tfor i in range(x.shape[0]):\r\n\t\t\t# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)\r\n\t\t\tm[i] = beta1 * m[i] + (1.0 - beta1) * g[i]\r\n\t\t\t# u(t) = max(beta2 * u(t-1), abs(g(t)))\r\n\t\t\tu[i] = max(beta2 * u[i], abs(g[i]))\r\n\t\t\t# step_size(t) = alpha \/ (1 - beta1(t))\r\n\t\t\tstep_size = alpha \/ (1.0 - beta1**(t+1))\r\n\t\t\t# delta(t) = m(t) \/ u(t)\r\n\t\t\tdelta = m[i] \/ u[i]\r\n\t\t\t# x(t) = x(t-1) - step_size(t) * delta(t)\r\n\t\t\tx[i] = x[i] - step_size * delta\r\n\t\t# evaluate candidate point\r\n\t\tscore = objective(x[0], x[1])\r\n\t\t# report progress\r\n\t\tprint('&gt;%d f(%s) = %.5f' % (t, x, score))\r\n\treturn [x, score]<\/pre>\n<p>We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.<\/p>\n<p>In this case, we will run the algorithm for 60 iterations with an initial learning rate of 0.02, beta of 0.8, and a beta2 of 0.99, found after a little trial and error.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# seed the pseudo random number generator\r\nseed(1)\r\n# define range for input\r\nbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])\r\n# define the total iterations\r\nn_iter = 60\r\n# steps size\r\nalpha = 0.02\r\n# factor for average gradient\r\nbeta1 = 0.8\r\n# factor for average squared gradient\r\nbeta2 = 0.99\r\n# perform the gradient descent search with adamax\r\nbest, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)<\/pre>\n<p>At the end of the run, we will report the best solution found.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# summarize the result\r\nprint('Done!')\r\nprint('f(%s) = %f' % (best, score))<\/pre>\n<p>Tying all of this together, the complete example of AdaMax gradient descent applied to our test problem is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># gradient descent optimization with adamax for a two-dimensional test function\r\nfrom numpy import asarray\r\nfrom numpy.random import rand\r\nfrom numpy.random import seed\r\n\r\n# objective function\r\ndef objective(x, y):\r\n\treturn x**2.0 + y**2.0\r\n\r\n# derivative of objective function\r\ndef derivative(x, y):\r\n\treturn asarray([x * 2.0, y * 2.0])\r\n\r\n# gradient descent algorithm with adamax\r\ndef adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):\r\n\t# generate an initial point\r\n\tx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])\r\n\t# initialize moment vector and weighted infinity norm\r\n\tm = [0.0 for _ in range(bounds.shape[0])]\r\n\tu = [0.0 for _ in range(bounds.shape[0])]\r\n\t# run iterations of gradient descent\r\n\tfor t in range(n_iter):\r\n\t\t# calculate gradient g(t)\r\n\t\tg = derivative(x[0], x[1])\r\n\t\t# build a solution one variable at a time\r\n\t\tfor i in range(x.shape[0]):\r\n\t\t\t# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)\r\n\t\t\tm[i] = beta1 * m[i] + (1.0 - beta1) * g[i]\r\n\t\t\t# u(t) = max(beta2 * u(t-1), abs(g(t)))\r\n\t\t\tu[i] = max(beta2 * u[i], abs(g[i]))\r\n\t\t\t# step_size(t) = alpha \/ (1 - beta1(t))\r\n\t\t\tstep_size = alpha \/ (1.0 - beta1**(t+1))\r\n\t\t\t# delta(t) = m(t) \/ u(t)\r\n\t\t\tdelta = m[i] \/ u[i]\r\n\t\t\t# x(t) = x(t-1) - step_size(t) * delta(t)\r\n\t\t\tx[i] = x[i] - step_size * delta\r\n\t\t# evaluate candidate point\r\n\t\tscore = objective(x[0], x[1])\r\n\t\t# report progress\r\n\t\tprint('&gt;%d f(%s) = %.5f' % (t, x, score))\r\n\treturn [x, score]\r\n\r\n# seed the pseudo random number generator\r\nseed(1)\r\n# define range for input\r\nbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])\r\n# define the total iterations\r\nn_iter = 60\r\n# steps size\r\nalpha = 0.02\r\n# factor for average gradient\r\nbeta1 = 0.8\r\n# factor for average squared gradient\r\nbeta2 = 0.99\r\n# perform the gradient descent search with adamax\r\nbest, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)\r\n# summarize the result\r\nprint('Done!')\r\nprint('f(%s) = %f' % (best, score))<\/pre>\n<p>Running the example applies the optimization algorithm with AdaMax to our test problem and reports the performance of the search for each iteration of the algorithm.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n&gt;33 f([-0.00122185 0.00427944]) = 0.00002\r\n&gt;34 f([-0.00045147 0.00289913]) = 0.00001\r\n&gt;35 f([0.00022176 0.00165754]) = 0.00000\r\n&gt;36 f([0.00073314 0.00058534]) = 0.00000\r\n&gt;37 f([ 0.00105092 -0.00030082]) = 0.00000\r\n&gt;38 f([ 0.00117382 -0.00099624]) = 0.00000\r\n&gt;39 f([ 0.00112512 -0.00150609]) = 0.00000\r\n&gt;40 f([ 0.00094497 -0.00184321]) = 0.00000\r\n&gt;41 f([ 0.00068206 -0.002026 ]) = 0.00000\r\n&gt;42 f([ 0.00038579 -0.00207647]) = 0.00000\r\n&gt;43 f([ 9.99977780e-05 -2.01849176e-03]) = 0.00000\r\n&gt;44 f([-0.00014145 -0.00187632]) = 0.00000\r\n&gt;45 f([-0.00031698 -0.00167338]) = 0.00000\r\n&gt;46 f([-0.00041753 -0.00143134]) = 0.00000\r\n&gt;47 f([-0.00044531 -0.00116942]) = 0.00000\r\n&gt;48 f([-0.00041125 -0.00090399]) = 0.00000\r\n&gt;49 f([-0.00033193 -0.00064834]) = 0.00000\r\nDone!\r\nf([-0.00033193 -0.00064834]) = 0.000001<\/pre>\n<\/p>\n<h3>Visualization of AdaMax Optimization<\/h3>\n<p>We can plot the progress of the AdaMax search on a contour plot of the domain.<\/p>\n<p>This can provide an intuition for the progress of the search over the iterations of the algorithm.<\/p>\n<p>We must update the adamax() function to maintain a list of all solutions found during the search, then return this list at the end of the search.<\/p>\n<p>The updated version of the function with these changes is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># gradient descent algorithm with adamax\r\ndef adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):\r\n\tsolutions = list()\r\n\t# generate an initial point\r\n\tx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])\r\n\t# initialize moment vector and weighted infinity norm\r\n\tm = [0.0 for _ in range(bounds.shape[0])]\r\n\tu = [0.0 for _ in range(bounds.shape[0])]\r\n\t# run iterations of gradient descent\r\n\tfor t in range(n_iter):\r\n\t\t# calculate gradient g(t)\r\n\t\tg = derivative(x[0], x[1])\r\n\t\t# build a solution one variable at a time\r\n\t\tfor i in range(x.shape[0]):\r\n\t\t\t# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)\r\n\t\t\tm[i] = beta1 * m[i] + (1.0 - beta1) * g[i]\r\n\t\t\t# u(t) = max(beta2 * u(t-1), abs(g(t)))\r\n\t\t\tu[i] = max(beta2 * u[i], abs(g[i]))\r\n\t\t\t# step_size(t) = alpha \/ (1 - beta1(t))\r\n\t\t\tstep_size = alpha \/ (1.0 - beta1**(t+1))\r\n\t\t\t# delta(t) = m(t) \/ u(t)\r\n\t\t\tdelta = m[i] \/ u[i]\r\n\t\t\t# x(t) = x(t-1) - step_size(t) * delta(t)\r\n\t\t\tx[i] = x[i] - step_size * delta\r\n\t\t# evaluate candidate point\r\n\t\tscore = objective(x[0], x[1])\r\n\t\tsolutions.append(x.copy())\r\n\t\t# report progress\r\n\t\tprint('&gt;%d f(%s) = %.5f' % (t, x, score))\r\n\treturn solutions<\/pre>\n<p>We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# seed the pseudo random number generator\r\nseed(1)\r\n# define range for input\r\nbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])\r\n# define the total iterations\r\nn_iter = 60\r\n# steps size\r\nalpha = 0.02\r\n# factor for average gradient\r\nbeta1 = 0.8\r\n# factor for average squared gradient\r\nbeta2 = 0.99\r\n# perform the gradient descent search with adamax\r\nsolutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)<\/pre>\n<p>We can then create a contour plot of the objective function, as before.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# sample input range uniformly at 0.1 increments\r\nxaxis = arange(bounds[0,0], bounds[0,1], 0.1)\r\nyaxis = arange(bounds[1,0], bounds[1,1], 0.1)\r\n# create a mesh from the axis\r\nx, y = meshgrid(xaxis, yaxis)\r\n# compute targets\r\nresults = objective(x, y)\r\n# create a filled contour plot with 50 levels and jet color scheme\r\npyplot.contourf(x, y, results, levels=50, cmap='jet')<\/pre>\n<p>Finally, we can plot each solution found during the search as a white dot connected by a line.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# plot the sample as black circles\r\nsolutions = asarray(solutions)\r\npyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')<\/pre>\n<p>Tying this all together, the complete example of performing the AdaMax optimization on the test problem and plotting the results on a contour plot is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># example of plotting the adamax search on a contour plot of the test function\r\nfrom numpy import asarray\r\nfrom numpy import arange\r\nfrom numpy.random import rand\r\nfrom numpy.random import seed\r\nfrom numpy import meshgrid\r\nfrom matplotlib import pyplot\r\nfrom mpl_toolkits.mplot3d import Axes3D\r\n\r\n# objective function\r\ndef objective(x, y):\r\n\treturn x**2.0 + y**2.0\r\n\r\n# derivative of objective function\r\ndef derivative(x, y):\r\n\treturn asarray([x * 2.0, y * 2.0])\r\n\r\n# gradient descent algorithm with adamax\r\ndef adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):\r\n\tsolutions = list()\r\n\t# generate an initial point\r\n\tx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])\r\n\t# initialize moment vector and weighted infinity norm\r\n\tm = [0.0 for _ in range(bounds.shape[0])]\r\n\tu = [0.0 for _ in range(bounds.shape[0])]\r\n\t# run iterations of gradient descent\r\n\tfor t in range(n_iter):\r\n\t\t# calculate gradient g(t)\r\n\t\tg = derivative(x[0], x[1])\r\n\t\t# build a solution one variable at a time\r\n\t\tfor i in range(x.shape[0]):\r\n\t\t\t# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)\r\n\t\t\tm[i] = beta1 * m[i] + (1.0 - beta1) * g[i]\r\n\t\t\t# u(t) = max(beta2 * u(t-1), abs(g(t)))\r\n\t\t\tu[i] = max(beta2 * u[i], abs(g[i]))\r\n\t\t\t# step_size(t) = alpha \/ (1 - beta1(t))\r\n\t\t\tstep_size = alpha \/ (1.0 - beta1**(t+1))\r\n\t\t\t# delta(t) = m(t) \/ u(t)\r\n\t\t\tdelta = m[i] \/ u[i]\r\n\t\t\t# x(t) = x(t-1) - step_size(t) * delta(t)\r\n\t\t\tx[i] = x[i] - step_size * delta\r\n\t\t# evaluate candidate point\r\n\t\tscore = objective(x[0], x[1])\r\n\t\tsolutions.append(x.copy())\r\n\t\t# report progress\r\n\t\tprint('&gt;%d f(%s) = %.5f' % (t, x, score))\r\n\treturn solutions\r\n\r\n# seed the pseudo random number generator\r\nseed(1)\r\n# define range for input\r\nbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])\r\n# define the total iterations\r\nn_iter = 60\r\n# steps size\r\nalpha = 0.02\r\n# factor for average gradient\r\nbeta1 = 0.8\r\n# factor for average squared gradient\r\nbeta2 = 0.99\r\n# perform the gradient descent search with adamax\r\nsolutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)\r\n# sample input range uniformly at 0.1 increments\r\nxaxis = arange(bounds[0,0], bounds[0,1], 0.1)\r\nyaxis = arange(bounds[1,0], bounds[1,1], 0.1)\r\n# create a mesh from the axis\r\nx, y = meshgrid(xaxis, yaxis)\r\n# compute targets\r\nresults = objective(x, y)\r\n# create a filled contour plot with 50 levels and jet color scheme\r\npyplot.contourf(x, y, results, levels=50, cmap='jet')\r\n# plot the sample as black circles\r\nsolutions = asarray(solutions)\r\npyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example performs the search as before, except in this case, the contour plot of the objective function is created.<\/p>\n<p>In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.<\/p>\n<div id=\"attachment_12176\" style=\"width: 1290px\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-12176\" loading=\"lazy\" class=\"size-full wp-image-12176\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/01\/Contour-Plot-of-the-Test-Objective-Function-With-AdaMax-Search-Results-Shown.png\" alt=\"Contour Plot of the Test Objective Function With AdaMax Search Results Shown\" width=\"1280\" height=\"960\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/01\/Contour-Plot-of-the-Test-Objective-Function-With-AdaMax-Search-Results-Shown.png 1280w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/01\/Contour-Plot-of-the-Test-Objective-Function-With-AdaMax-Search-Results-Shown-300x225.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/01\/Contour-Plot-of-the-Test-Objective-Function-With-AdaMax-Search-Results-Shown-1024x768.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/01\/Contour-Plot-of-the-Test-Objective-Function-With-AdaMax-Search-Results-Shown-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-12176\" class=\"wp-caption-text\">Contour Plot of the Test Objective Function With AdaMax Search Results Shown<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">Adam: A Method for Stochastic Optimization<\/a>, 2014.<\/li>\n<li>\n<a href=\"https:\/\/arxiv.org\/abs\/1609.04747\">An Overview Of Gradient Descent Optimization Algorithms<\/a>, 2016.<\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/amzn.to\/39KZSQn\">Algorithms for Optimization<\/a>, 2019.<\/li>\n<li>\n<a href=\"https:\/\/amzn.to\/3qSk3C2\">Deep Learning<\/a>, 2016.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/numpy.org\/doc\/stable\/reference\/random\/generated\/numpy.random.rand.html\">numpy.random.rand API<\/a>.<\/li>\n<li>\n<a href=\"https:\/\/numpy.org\/doc\/stable\/reference\/generated\/numpy.asarray.html\">numpy.asarray API<\/a>.<\/li>\n<li>\n<a href=\"https:\/\/matplotlib.org\/api\/pyplot_api.html\">Matplotlib API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Gradient_descent\">Gradient descent, Wikipedia<\/a>.<\/li>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Stochastic_gradient_descent\">Stochastic gradient descent, Wikipedia<\/a>.<\/li>\n<li>\n<a href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/index.html\">An overview of gradient descent optimization algorithms<\/a>, 2016.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop the gradient descent optimization with AdaMax from scratch.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.<\/li>\n<li>AdaMax is an extension of the Adam version of gradient descent designed to accelerate the optimization process.<\/li>\n<li>How to implement the AdaMax optimization algorithm from scratch and apply it to an objective function and evaluate the results.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/gradient-descent-optimization-with-adamax-from-scratch\/\">Gradient Descent Optimization With AdaMax From Scratch<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/gradient-descent-optimization-with-adamax-from-scratch\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/06\/08\/gradient-descent-optimization-with-adamax-from-scratch\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":4718,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4717"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4717"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4717\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/4718"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4717"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4717"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4717"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}