{"id":2671,"date":"2019-10-08T18:00:13","date_gmt":"2019-10-08T18:00:13","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/08\/how-to-implement-bayesian-optimization-from-scratch-in-python\/"},"modified":"2019-10-08T18:00:13","modified_gmt":"2019-10-08T18:00:13","slug":"how-to-implement-bayesian-optimization-from-scratch-in-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/08\/how-to-implement-bayesian-optimization-from-scratch-in-python\/","title":{"rendered":"How to Implement Bayesian Optimization from Scratch in Python"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<h4 style=\"text-align: center;\">Discover a Gentle Introduction to Bayesian Optimization.<\/h4>\n<p>Global optimization is a challenging problem of finding an input that results in the minimum or maximum cost of a given objective function.<\/p>\n<p>Typically, the form of the objective function is complex and intractable to analyze and is often non-convex, nonlinear, high dimension, noisy, and computationally expensive to evaluate.<\/p>\n<p>Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective. It works by building a probabilistic model of the objective function, called the surrogate function, that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.<\/p>\n<p>Bayesian Optimization is often used in applied machine learning to tune the hyperparameters of a given well-performing model on a validation dataset.<\/p>\n<p>In this tutorial, you will discover Bayesian Optimization for directed search of complex optimization problems.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Global optimization is a challenging problem that involves black box and often non-convex, non-linear, noisy, and computationally expensive objective functions.<\/li>\n<li>Bayesian Optimization provides a probabilistically principled method for global optimization.<\/li>\n<li>How to implement Bayesian Optimization from scratch and how to use open-source implementations.<\/li>\n<\/ul>\n<p>Discover bayes opimization, naive bayes, maximum likelihood, distributions, cross entropy, and much more <a href=\"https:\/\/machinelearningmastery.com\/probability-for-machine-learning\/\" rel=\"nofollow\">in my new book<\/a>, with 28 step-by-step tutorials and full Python source code.<\/p>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_8827\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-8827\" class=\"size-full wp-image-8827\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/A-Gentle-Introduction-to-Bayesian-Optimization.jpg\" alt=\"A Gentle Introduction to Bayesian Optimization\" width=\"640\" height=\"424\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/A-Gentle-Introduction-to-Bayesian-Optimization.jpg 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/A-Gentle-Introduction-to-Bayesian-Optimization-300x199.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p id=\"caption-attachment-8827\" class=\"wp-caption-text\">A Gentle Introduction to Bayesian Optimization<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/banjipark\/19636375808\/\">Beni Arnold<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into four parts; they are:<\/p>\n<ol>\n<li>Challenge of Function Optimization<\/li>\n<li>What Is Bayesian Optimization<\/li>\n<li>How to Perform Bayesian Optimization<\/li>\n<li>Hyperparameter Tuning With Bayesian Optimization<\/li>\n<\/ol>\n<h2>Challenge of Function Optimization<\/h2>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Global_optimization\">Global function optimization<\/a>, or function optimization for short, involves finding the minimum or maximum of an objective function.<\/p>\n<p>Samples are drawn from the domain and evaluated by the objective function to give a score or cost.<\/p>\n<p>Let\u2019s define some common terms:<\/p>\n<ul>\n<li><strong>Samples<\/strong>. One example from the domain, represented as a vector.<\/li>\n<li><strong>Search Space<\/strong>: Extent of the domain from which samples can be drawn.<\/li>\n<li><strong>Objective Function<\/strong>. Function that takes a sample and returns a cost.<\/li>\n<li><strong>Cost<\/strong>. Numeric score for a sample calculated via the objective function.<\/li>\n<\/ul>\n<p>Samples are comprised of one or more variables generally easy to devise or create. One sample is often defined as a vector of variables with a predefined range in an n-dimensional space. This space must be sampled and explored in order to find the specific combination of variable values that result in the best cost.<\/p>\n<p>The cost often has units that are specific to a given domain. Optimization is often described in terms of minimizing cost, as a maximization problem can easily be transformed into a minimization problem by inverting the calculated cost. Together, the minimum and maximum of a function are referred to as the extreme of the function (or the plural extrema).<\/p>\n<p>The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time. The form of the objective function is unknown and is often highly nonlinear, and highly multi-dimensional defined by the number of input variables. The function is also probably non-convex. This means that local extrema may or may not be the global extrema (e.g. could be misleading and result in premature convergence), hence the name of the task as global rather than local optimization.<\/p>\n<p>Although little is known about the objective function, (it is known whether the minimum or the maximum cost from the function is sought), and as such, it is often referred to as a black box function and the search process as black box optimization. Further, the objective function is sometimes called an oracle given the ability to only give answers.<\/p>\n<p>Function optimization is a fundamental part of machine learning. Most machine learning algorithms involve the optimization of parameters (weights, coefficients, etc.) in response to training data. Optimization also refers to the process of finding the best set of hyperparameters that configure the training of a machine learning algorithm. Taking one step higher again, the selection of training data, data preparation, and machine learning algorithms themselves is also a problem of function optimization.<\/p>\n<p>Summary of optimization in machine learning:<\/p>\n<ul>\n<li><strong>Algorithm Training<\/strong>. Optimization of model parameters.<\/li>\n<li><strong>Algorithm Tuning<\/strong>. Optimization of model hyperparameters.<\/li>\n<li><strong>Predictive Modeling<\/strong>. Optimization of data, data preparation, and algorithm selection.<\/li>\n<\/ul>\n<p>Many methods exist for function optimization, such as randomly sampling the variable search space, called random search, or systematically evaluating samples in a grid across the search space, called grid search.<\/p>\n<p>More principled methods are able to learn from sampling the space so that future samples are directed toward the parts of the search space that are most likely to contain the extrema.<\/p>\n<p>A directed approach to global optimization that uses probability is called Bayesian Optimization.<\/p>\n<div class=\"woo-sc-hr\"><\/div>\n<p><center><\/p>\n<h3>Want to Learn Probability for Machine Learning<\/h3>\n<p>Take my free 7-day email crash course now (with sample code).<\/p>\n<p>Click to sign-up and also get a free PDF Ebook version of the course.<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.lpages.co\/leadbox\/16cf92561172a2%3A164f8be4f346dc\/4623731828588544\/\" target=\"_blank\" style=\"background: rgb(255, 206, 10); color: rgb(255, 255, 255); text-decoration: none; font-family: Helvetica, Arial, sans-serif; font-weight: bold; font-size: 16px; line-height: 20px; padding: 10px; display: inline-block; max-width: 300px; border-radius: 5px; text-shadow: rgba(0, 0, 0, 0.25) 0px -1px 1px; box-shadow: rgba(255, 255, 255, 0.5) 0px 1px 3px inset, rgba(0, 0, 0, 0.5) 0px 1px 3px;\" rel=\"noopener noreferrer\">Download Your FREE Mini-Course<\/a><script data-leadbox=\"16cf92561172a2:164f8be4f346dc\" data-url=\"https:\/\/machinelearningmastery.lpages.co\/leadbox\/16cf92561172a2%3A164f8be4f346dc\/4623731828588544\/\" data-config=\"%7B%7D\" type=\"text\/javascript\" src=\"https:\/\/machinelearningmastery.lpages.co\/leadbox-1568216021.js\"><\/script><\/p>\n<p><\/center><\/p>\n<div class=\"woo-sc-hr\"><\/div>\n<h2>What Is Bayesian Optimization<\/h2>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayesian_optimization\">Bayesian Optimization<\/a> is an approach that uses <a href=\"https:\/\/machinelearningmastery.com\/bayes-theorem-for-machine-learning\/\">Bayes Theorem<\/a> to direct the search in order to find the minimum or maximum of an objective function.<\/p>\n<p>It is an approach that is most useful for objective functions that are complex, noisy, and\/or expensive to evaluate.<\/p>\n<blockquote>\n<p>Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. [\u2026] It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand is non-convex.<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1012.2599\">A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning<\/a>, 2010.<\/p>\n<p>Recall that Bayes Theorem is an approach for calculating the conditional probability of an event:<\/p>\n<ul>\n<li>P(A|B) = P(B|A) * P(A) \/ P(B)<\/li>\n<\/ul>\n<p>We can simplify this calculation by removing the normalizing value of <em>P(B)<\/em> and describe the conditional probability as a proportional quantity. This is useful as we are not interested in calculating a specific conditional probability, but instead in optimizing a quantity.<\/p>\n<ul>\n<li>P(A|B) = P(B|A) * P(A)<\/li>\n<\/ul>\n<p>The conditional probability that we are calculating is referred to generally as the <em>posterior<\/em> probability; the reverse conditional probability is sometimes referred to as the likelihood, and the marginal probability is referred to as the <em>prior <\/em>probability; for example:<\/p>\n<ul>\n<li>posterior = likelihood * prior<\/li>\n<\/ul>\n<p>This provides a framework that can be used to quantify the beliefs about an unknown objective function given samples from the domain and their evaluation via the objective function.<\/p>\n<p>We can devise specific samples (<em>x1, x2, \u2026, xn<\/em>) and evaluate them using the objective function <em>f(xi)<\/em> that returns the cost or outcome for the sample xi. Samples and their outcome are collected sequentially and define our data <em>D<\/em>, e.g. <em>D = {xi, f(xi), \u2026 xn, f(xn)}<\/em> and is used to define the prior. The likelihood function is defined as the probability of observing the data given the function <em>P(D | f)<\/em>. This likelihood function will change as more observations are collected.<\/p>\n<ul>\n<li>P(f|D) = P(D|f) * P(f)<\/li>\n<\/ul>\n<p>The posterior represents everything we know about the objective function. It is an approximation of the objective function and can be used to estimate the cost of different candidate samples that we may want to evaluate.<\/p>\n<p>In this way, the posterior probability is a surrogate objective function.<\/p>\n<blockquote>\n<p>The posterior captures the updated beliefs about the unknown objective function. One may also interpret this step of Bayesian optimization as estimating the objective function with a surrogate function (also called a response surface).<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1012.2599\">A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning<\/a>, 2010.<\/p>\n<ul>\n<li><strong>Surrogate Function<\/strong>: Bayesian approximation of the objective function that can be sampled efficiently.<\/li>\n<\/ul>\n<p>The surrogate function gives us an estimate of the objective function, which can be used to direct future sampling. Sampling involves careful use of the posterior in a function known as the \u201c<em>acquisition<\/em>\u201d function, e.g. for acquiring more samples. We want to use our belief about the objective function to sample the area of the search space that is most likely to pay off, therefore the acquisition will optimize the conditional probability of locations in the search to generate the next sample.<\/p>\n<ul>\n<li><strong>Acquisition Function<\/strong>: Technique by which the posterior is used to select the next sample from the search space.<\/li>\n<\/ul>\n<p>Once additional samples and their evaluation via the objective function <em>f()<\/em> have been collected, they are added to data <em>D<\/em> and the posterior is then updated.<\/p>\n<p>This process is repeated until the extrema of the objective function is located, a good enough result is located, or resources are exhausted.<\/p>\n<p>The Bayesian Optimization algorithm can be summarized as follows:<\/p>\n<ul>\n<li>1. Select a Sample by Optimizing the Acquisition Function.<\/li>\n<li>2. Evaluate the Sample With the Objective Function.<\/li>\n<li>3. Update the Data and, in turn, the Surrogate Function.<\/li>\n<li>4. Go To 1.<\/li>\n<\/ul>\n<h2>How to Perform Bayesian Optimization<\/h2>\n<p>In this section, we will explore how Bayesian Optimization works by developing an implementation from scratch for a simple one-dimensional test function.<\/p>\n<p>First, we will define the test problem, then how to model the mapping of inputs to outputs with a surrogate function. Next, we will see how the surrogate function can be searched efficiently with an acquisition function before tying all of these elements together into the Bayesian Optimization procedure.<\/p>\n<h3>Test Problem<\/h3>\n<p>The first step is to define a test problem.<\/p>\n<p>We will use a multimodal problem with five peaks, calculated as:<\/p>\n<ul>\n<li>y = x^2 * sin(5 * PI * x)^6<\/li>\n<\/ul>\n<p>Where <em>x<\/em> is a real value in the range [0,1] and <em>PI<\/em> is the value of pi.<\/p>\n<p>We will augment this function by adding Gaussian noise with a mean of zero and a standard deviation of 0.1. This will mean that the real evaluation will have a positive or negative random value added to it, making the function challenging to optimize.<\/p>\n<p>The <em>objective()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># objective function\r\ndef objective(x, noise=0.1):\r\n\tnoise = normal(loc=0, scale=noise)\r\n\treturn (x**2 * sin(5 * pi * x)**6.0) + noise<\/pre>\n<p>We can test this function by first defining a grid-based sample of inputs from 0 to 1 with a step size of 0.01 across the domain.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# grid-based sample of the domain [0,1]\r\nX = arange(0, 1, 0.01)<\/pre>\n<p>We can then evaluate these samples using the target function without any noise to see what the real objective function looks like.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# sample the domain without noise\r\ny = [objective(x, 0) for x in X]<\/pre>\n<p>We can then evaluate these same points with noise to see what the objective function will look like when we are optimizing it.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# sample the domain with noise\r\nynoise = [objective(x) for x in X]<\/pre>\n<p>We can look at all of the non-noisy objective function values to find the input that resulted in the best score and report it. This will be the optima, in this case, maxima, as we are maximizing the output of the objective function.<\/p>\n<p>We would not know this in practice, but for out test problem, it is good to know the real best input and output of the function to see if the Bayesian Optimization algorithm can locate it.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# find best result\r\nix = argmax(y)\r\nprint('Optima: x=%.3f, y=%.3f' % (X[ix], y[ix]))<\/pre>\n<p>Finally, we can create a plot, first showing the noisy evaluation as a scatter plot with input on the x-axis and score on the y-axis, then a line plot of the scores without any noise.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the points with noise\r\npyplot.scatter(X, ynoise)\r\n# plot the points without noise\r\npyplot.plot(X, y)\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>The complete example of reviewing the test function that we wish to optimize is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of the test problem\r\nfrom math import sin\r\nfrom math import pi\r\nfrom numpy import arange\r\nfrom numpy import argmax\r\nfrom numpy.random import normal\r\nfrom matplotlib import pyplot\r\n\r\n# objective function\r\ndef objective(x, noise=0.1):\r\n\tnoise = normal(loc=0, scale=noise)\r\n\treturn (x**2 * sin(5 * pi * x)**6.0) + noise\r\n\r\n# grid-based sample of the domain [0,1]\r\nX = arange(0, 1, 0.01)\r\n# sample the domain without noise\r\ny = [objective(x, 0) for x in X]\r\n# sample the domain with noise\r\nynoise = [objective(x) for x in X]\r\n# find best result\r\nix = argmax(y)\r\nprint('Optima: x=%.3f, y=%.3f' % (X[ix], y[ix]))\r\n# plot the points with noise\r\npyplot.scatter(X, ynoise)\r\n# plot the points without noise\r\npyplot.plot(X, y)\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example first reports the global optima as an input with the value 0.9 that gives the score 0.81.<\/p>\n<pre class=\"crayon-plain-tag\">Optima: x=0.900, y=0.810<\/pre>\n<p>A plot is then created showing the noisy evaluation of the samples (dots) and the non-noisy and true shape of the objective function (line).<\/p>\n<p>Your specific dots will differ given the stochastic nature of the noisy objective function.<\/p>\n<div id=\"attachment_8823\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-8823\" class=\"size-full wp-image-8823\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/08\/Plot-of-The-Input-Samples-Evaluated-with-a-Noisy-dots-and-Non-Noisy-Line-Objective-Function.png\" alt=\"Plot of The Input Samples Evaluated with a Noisy (dots) and Non-Noisy (Line) Objective Function\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-The-Input-Samples-Evaluated-with-a-Noisy-dots-and-Non-Noisy-Line-Objective-Function.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-The-Input-Samples-Evaluated-with-a-Noisy-dots-and-Non-Noisy-Line-Objective-Function-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-The-Input-Samples-Evaluated-with-a-Noisy-dots-and-Non-Noisy-Line-Objective-Function-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-The-Input-Samples-Evaluated-with-a-Noisy-dots-and-Non-Noisy-Line-Objective-Function-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-8823\" class=\"wp-caption-text\">Plot of The Input Samples Evaluated with a Noisy (dots) and Non-Noisy (Line) Objective Function<\/p>\n<\/div>\n<p>Now that we have a test problem, let\u2019s review how to train a surrogate function.<\/p>\n<h3>Surrogate Function<\/h3>\n<p>The surrogate function is a technique used to best approximate the mapping of input examples to an output score.<\/p>\n<p>Probabilistically, it summarizes the conditional probability of an objective function (<em>f<\/em>), given the available data (<em>D<\/em>) or <em>P(f|D)<\/em>.<\/p>\n<p>A number of techniques can be used for this, although the most popular is to treat the problem as a regression predictive modeling problem with the data representing the input and the score representing the output to the model. This is often best modeled using a random forest or a Gaussian Process.<\/p>\n<p>A <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gaussian_process\">Gaussian Process<\/a>, or GP, is a model that constructs a joint probability distribution over the variables, assuming a multivariate Gaussian distribution. As such, it is capable of efficient and effective summarization of a large number of functions and smooth transition as more observations are made available to the model.<\/p>\n<p>This smooth structure and smooth transition to new functions based on data are desirable properties as we sample the domain, and the multivariate Gaussian basis to the model means that an estimate from the model will be a mean of a distribution with a standard deviation; that will be helpful later in the acquisition function.<\/p>\n<p>As such, using a GP regression model is often preferred.<\/p>\n<p>We can fit a GP regression model using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.gaussian_process.GaussianProcessRegressor.html\">GaussianProcessRegressor<\/a> scikit-learn implementation from a sample of inputs (<em>X<\/em>) and noisy evaluations from the objective function (<em>y<\/em>).<\/p>\n<p>First, the model must be defined. An important aspect in defining the GP model is the kernel. This controls the shape of the function at specific points based on distance measures between actual data observations. Many different kernel functions can be used, and some may offer better performance for specific datasets.<\/p>\n<p>By default, a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Radial_basis_function\">Radial Basis Function<\/a>, or RBF, is used that can work well.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the model\r\nmodel = GaussianProcessRegressor()<\/pre>\n<p>Once defined, the model can be fit on the training dataset directly by calling the <em>fit()<\/em> function.<\/p>\n<p>The defined model can be fit again at any time with updated data concatenated to the existing data by another call to <em>fit()<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit the model\r\nmodel.fit(X, y)<\/pre>\n<p>The model will estimate the cost for one or more samples provided to it.<\/p>\n<p>The model is used by calling the <em>predict()<\/em> function. The result for a given sample will be a mean of the distribution at that point. We can also get the standard deviation of the distribution at that point in the function by specifying the argument <em>return_std=True<\/em>; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nyhat = model.predict(X, return_std=True)<\/pre>\n<p>This function can result in warnings if the distribution is thin at a given point we are interested in sampling.<\/p>\n<p>Therefore, we can silence all of the warnings when making a prediction. The <em>surrogate()<\/em> function below takes the fit model and one or more samples and returns the mean and standard deviation estimated costs whilst not printing any warnings.<\/p>\n<pre class=\"crayon-plain-tag\"># surrogate or approximation for the objective function\r\ndef surrogate(model, X):\r\n\t# catch any warning generated when making a prediction\r\n\twith catch_warnings():\r\n\t\t# ignore generated warnings\r\n\t\tsimplefilter(\"ignore\")\r\n\t\treturn model.predict(X, return_std=True)<\/pre>\n<p>We can call this function any time to estimate the cost of one or more samples, such as when we want to optimize the acquisition function in the next section.<\/p>\n<p>For now, it is interesting to see what the surrogate function looks like across the domain after it is trained on a random sample.<\/p>\n<p>We can achieve this by first fitting the GP model on a random sample of 100 data points and their real objective function values with noise. We can then plot a scatter plot of these points. Next, we can perform a grid-based sample across the input domain and estimate the cost at each point using the surrogate function and plot the result as a line.<\/p>\n<p>We would expect the surrogate function to have a crude approximation of the true non-noisy objective function.<\/p>\n<p>The <em>plot()<\/em> function below creates this plot, given the random data sample of the real noisy objective function and the fit model.<\/p>\n<pre class=\"crayon-plain-tag\"># plot real observations vs surrogate function\r\ndef plot(X, y, model):\r\n\t# scatter plot of inputs and real objective function\r\n\tpyplot.scatter(X, y)\r\n\t# line plot of surrogate function across domain\r\n\tXsamples = asarray(arange(0, 1, 0.001))\r\n\tXsamples = Xsamples.reshape(len(Xsamples), 1)\r\n\tysamples, _ = surrogate(model, Xsamples)\r\n\tpyplot.plot(Xsamples, ysamples)\r\n\t# show the plot\r\n\tpyplot.show()<\/pre>\n<p>Tying this together, the complete example of fitting a Gaussian Process regression model on noisy samples and plotting the sample vs. the surrogate function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of a gaussian process surrogate function\r\nfrom math import sin\r\nfrom math import pi\r\nfrom numpy import arange\r\nfrom numpy import asarray\r\nfrom numpy.random import normal\r\nfrom numpy.random import random\r\nfrom matplotlib import pyplot\r\nfrom warnings import catch_warnings\r\nfrom warnings import simplefilter\r\nfrom sklearn.gaussian_process import GaussianProcessRegressor\r\n\r\n# objective function\r\ndef objective(x, noise=0.1):\r\n\tnoise = normal(loc=0, scale=noise)\r\n\treturn (x**2 * sin(5 * pi * x)**6.0) + noise\r\n\r\n# surrogate or approximation for the objective function\r\ndef surrogate(model, X):\r\n\t# catch any warning generated when making a prediction\r\n\twith catch_warnings():\r\n\t\t# ignore generated warnings\r\n\t\tsimplefilter(\"ignore\")\r\n\t\treturn model.predict(X, return_std=True)\r\n\r\n# plot real observations vs surrogate function\r\ndef plot(X, y, model):\r\n\t# scatter plot of inputs and real objective function\r\n\tpyplot.scatter(X, y)\r\n\t# line plot of surrogate function across domain\r\n\tXsamples = asarray(arange(0, 1, 0.001))\r\n\tXsamples = Xsamples.reshape(len(Xsamples), 1)\r\n\tysamples, _ = surrogate(model, Xsamples)\r\n\tpyplot.plot(Xsamples, ysamples)\r\n\t# show the plot\r\n\tpyplot.show()\r\n\r\n# sample the domain sparsely with noise\r\nX = random(100)\r\ny = asarray([objective(x) for x in X])\r\n# reshape into rows and cols\r\nX = X.reshape(len(X), 1)\r\ny = y.reshape(len(y), 1)\r\n# define the model\r\nmodel = GaussianProcessRegressor()\r\n# fit the model\r\nmodel.fit(X, y)\r\n# plot the surrogate function\r\nplot(X, y, model)<\/pre>\n<p>Running the example first draws the random sample, evaluates it with the noisy objective function, then fits the GP model.<\/p>\n<p>The data sample and a grid of points across the domain evaluated via the surrogate function are then plotted as dots and a line respectively.<\/p>\n<p>Your specific results will vary given the stochastic nature of the data sample. Consider running the example a few times.<\/p>\n<p>In this case, as we expected, the plot resembles a crude version of the underlying non-noisy objective function, importantly with a peak around 0.9 where we know the true maxima is located.<\/p>\n<div id=\"attachment_8824\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-8824\" class=\"size-full wp-image-8824\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/08\/Plot-Showing-Random-Sample-with-Noisy-Evaluation-dots-and-Surrogate-Function-Across-the-Domain-line.png\" alt=\"Plot Showing Random Sample With Noisy Evaluation (dots) and Surrogate Function Across the Domain (line).\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-Showing-Random-Sample-with-Noisy-Evaluation-dots-and-Surrogate-Function-Across-the-Domain-line.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-Showing-Random-Sample-with-Noisy-Evaluation-dots-and-Surrogate-Function-Across-the-Domain-line-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-Showing-Random-Sample-with-Noisy-Evaluation-dots-and-Surrogate-Function-Across-the-Domain-line-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-Showing-Random-Sample-with-Noisy-Evaluation-dots-and-Surrogate-Function-Across-the-Domain-line-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-8824\" class=\"wp-caption-text\">Plot Showing Random Sample With Noisy Evaluation (dots) and Surrogate Function Across the Domain (line).<\/p>\n<\/div>\n<p>Next, we must define a strategy for sampling the surrogate function.<\/p>\n<h3>Acquisition Function<\/h3>\n<p>The surrogate function is used to test a range of candidate samples in the domain.<\/p>\n<p>From these results, one or more candidates can be selected and evaluated with the real, and in normal practice, computationally expensive cost function.<\/p>\n<p>This involves two pieces: the search strategy used to navigate the domain in response to the surrogate function and the acquisition function that is used to interpret and score the response from the surrogate function.<\/p>\n<p>A simple search strategy, such as a random sample or grid-based sample, can be used, although it is more common to use a local search strategy, such as the popular <a href=\"https:\/\/en.wikipedia.org\/wiki\/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm\">BFGS algorithm<\/a>. In this case, we will use a random search or random sample of the domain in order to keep the example simple.<\/p>\n<p>This involves first drawing a random sample of candidate samples from the domain, evaluating them with the acquisition function, then maximizing the acquisition function or choosing the candidate sample that gives the best score. The <em>opt_acquisition()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># optimize the acquisition function\r\ndef opt_acquisition(X, y, model):\r\n\t# random search, generate random samples\r\n\tXsamples = random(100)\r\n\tXsamples = Xsamples.reshape(len(Xsamples), 1)\r\n\t# calculate the acquisition function for each sample\r\n\tscores = acquisition(X, Xsamples, model)\r\n\t# locate the index of the largest scores\r\n\tix = argmax(scores)\r\n\treturn Xsamples[ix, 0]<\/pre>\n<p>The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.<\/p>\n<p>We could just use the surrogate score directly. Alternately, given that we have chosen a Gaussian Process model as the surrogate function, we can use the probabilistic information from this model in the acquisition function to calculate the probability that a given sample is worth evaluating.<\/p>\n<p>There are many different types of probabilistic acquisition functions that can be used, each providing a different trade-off for how exploitative (greedy) and explorative they are.<\/p>\n<p>Three common examples include:<\/p>\n<ul>\n<li>Probability of Improvement (PI).<\/li>\n<li>Expected Improvement (EI).<\/li>\n<li>Lower Confidence Bound (LCB).<\/li>\n<\/ul>\n<p>The Probability of Improvement method is the simplest, whereas the Expected Improvement method is the most commonly used.<\/p>\n<p>In this case, we will use the simpler Probability of Improvement method, which is calculated as the normal cumulative probability of the normalized expected improvement, calculated as follows:<\/p>\n<ul>\n<li>PI = cdf((mu \u2013 best_mu) \/ stdev)<\/li>\n<\/ul>\n<p>Where PI is the probability of improvement, <em>cdf()<\/em> is the normal cumulative distribution function, <em>mu<\/em> is the mean of the surrogate function for a given sample <em>x<\/em>, <em>stdev<\/em> is the standard deviation of the surrogate function for a given sample <em>x<\/em>, and <em>best_mu<\/em> is the mean of the surrogate function for the best sample found so far.<\/p>\n<p>We can add a very small number to the standard deviation to avoid a divide by zero error.<\/p>\n<p>The <em>acquisition()<\/em> function below implements this given the current training dataset of input samples, an array of new candidate samples, and the fit GP model.<\/p>\n<pre class=\"crayon-plain-tag\"># probability of improvement acquisition function\r\ndef acquisition(X, Xsamples, model):\r\n\t# calculate the best surrogate score found so far\r\n\tyhat, _ = surrogate(model, X)\r\n\tbest = max(yhat)\r\n\t# calculate mean and stdev via surrogate function\r\n\tmu, std = surrogate(model, Xsamples)\r\n\tmu = mu[:, 0]\r\n\t# calculate the probability of improvement\r\n\tprobs = norm.cdf((mu - best) \/ (std+1E-9))\r\n\treturn probs<\/pre>\n<\/p>\n<h3>Complete Bayesian Optimization Algorithm<\/h3>\n<p>We can tie all of this together into the Bayesian Optimization algorithm.<\/p>\n<p>The main algorithm involves cycles of selecting candidate samples, evaluating them with the objective function, then updating the GP model.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform the optimization process\r\nfor i in range(100):\r\n\t# select the next point to sample\r\n\tx = opt_acquisition(X, y, model)\r\n\t# sample the point\r\n\tactual = objective(x)\r\n\t# summarize the finding for our own reporting\r\n\test, _ = surrogate(model, [[x]])\r\n\tprint('>x=%.3f, f()=%3f, actual=%.3f' % (x, est, actual))\r\n\t# add the data to the dataset\r\n\tX = vstack((X, [[x]]))\r\n\ty = vstack((y, [[actual]]))\r\n\t# update the model\r\n\tmodel.fit(X, y)<\/pre>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of bayesian optimization for a 1d function from scratch\r\nfrom math import sin\r\nfrom math import pi\r\nfrom numpy import arange\r\nfrom numpy import vstack\r\nfrom numpy import argmax\r\nfrom numpy import asarray\r\nfrom numpy.random import normal\r\nfrom numpy.random import random\r\nfrom scipy.stats import norm\r\nfrom sklearn.gaussian_process import GaussianProcessRegressor\r\nfrom warnings import catch_warnings\r\nfrom warnings import simplefilter\r\nfrom matplotlib import pyplot\r\n\r\n# objective function\r\ndef objective(x, noise=0.1):\r\n\tnoise = normal(loc=0, scale=noise)\r\n\treturn (x**2 * sin(5 * pi * x)**6.0) + noise\r\n\r\n# surrogate or approximation for the objective function\r\ndef surrogate(model, X):\r\n\t# catch any warning generated when making a prediction\r\n\twith catch_warnings():\r\n\t\t# ignore generated warnings\r\n\t\tsimplefilter(\"ignore\")\r\n\t\treturn model.predict(X, return_std=True)\r\n\r\n# probability of improvement acquisition function\r\ndef acquisition(X, Xsamples, model):\r\n\t# calculate the best surrogate score found so far\r\n\tyhat, _ = surrogate(model, X)\r\n\tbest = max(yhat)\r\n\t# calculate mean and stdev via surrogate function\r\n\tmu, std = surrogate(model, Xsamples)\r\n\tmu = mu[:, 0]\r\n\t# calculate the probability of improvement\r\n\tprobs = norm.cdf((mu - best) \/ (std+1E-9))\r\n\treturn probs\r\n\r\n# optimize the acquisition function\r\ndef opt_acquisition(X, y, model):\r\n\t# random search, generate random samples\r\n\tXsamples = random(100)\r\n\tXsamples = Xsamples.reshape(len(Xsamples), 1)\r\n\t# calculate the acquisition function for each sample\r\n\tscores = acquisition(X, Xsamples, model)\r\n\t# locate the index of the largest scores\r\n\tix = argmax(scores)\r\n\treturn Xsamples[ix, 0]\r\n\r\n# plot real observations vs surrogate function\r\ndef plot(X, y, model):\r\n\t# scatter plot of inputs and real objective function\r\n\tpyplot.scatter(X, y)\r\n\t# line plot of surrogate function across domain\r\n\tXsamples = asarray(arange(0, 1, 0.001))\r\n\tXsamples = Xsamples.reshape(len(Xsamples), 1)\r\n\tysamples, _ = surrogate(model, Xsamples)\r\n\tpyplot.plot(Xsamples, ysamples)\r\n\t# show the plot\r\n\tpyplot.show()\r\n\r\n# sample the domain sparsely with noise\r\nX = random(100)\r\ny = asarray([objective(x) for x in X])\r\n# reshape into rows and cols\r\nX = X.reshape(len(X), 1)\r\ny = y.reshape(len(y), 1)\r\n# define the model\r\nmodel = GaussianProcessRegressor()\r\n# fit the model\r\nmodel.fit(X, y)\r\n# plot before hand\r\nplot(X, y, model)\r\n# perform the optimization process\r\nfor i in range(100):\r\n\t# select the next point to sample\r\n\tx = opt_acquisition(X, y, model)\r\n\t# sample the point\r\n\tactual = objective(x)\r\n\t# summarize the finding\r\n\test, _ = surrogate(model, [[x]])\r\n\tprint('>x=%.3f, f()=%3f, actual=%.3f' % (x, est, actual))\r\n\t# add the data to the dataset\r\n\tX = vstack((X, [[x]]))\r\n\ty = vstack((y, [[actual]]))\r\n\t# update the model\r\n\tmodel.fit(X, y)\r\n\r\n# plot all samples and the final surrogate function\r\nplot(X, y, model)\r\n# best result\r\nix = argmax(y)\r\nprint('Best Result: x=%.3f, y=%.3f' % (X[ix], y[ix]))<\/pre>\n<p>Running the example first creates an initial random sample of the search space and evaluation of the results. Then a GP model is fit on this data.<\/p>\n<p>Your specific results will vary given the stochastic nature of the sampling of the domain. Try running the example a few times.<\/p>\n<p>A plot is created showing the raw observations as dots and the surrogate function across the entire domain. In this case, the initial sample has a good spread across the domain and the surrogate function has a bias towards the part of the domain where we know the optima is located.<\/p>\n<div id=\"attachment_8825\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-8825\" class=\"size-full wp-image-8825\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/08\/Plot-of-Initial-Sample-dots-and-Surrogate-Function-Across-the-Domain-line.png\" alt=\"Plot of Initial Sample (dots) and Surrogate Function Across the Domain (line).\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-Initial-Sample-dots-and-Surrogate-Function-Across-the-Domain-line.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-Initial-Sample-dots-and-Surrogate-Function-Across-the-Domain-line-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-Initial-Sample-dots-and-Surrogate-Function-Across-the-Domain-line-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-Initial-Sample-dots-and-Surrogate-Function-Across-the-Domain-line-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-8825\" class=\"wp-caption-text\">Plot of Initial Sample (dots) and Surrogate Function Across the Domain (line).<\/p>\n<\/div>\n<p>The algorithm then iterates for 100 cycles, selecting samples, evaluating them, and adding them to the dataset to update the surrogate function, and over again.<\/p>\n<p>Each cycle reports the selected input value, the estimated score from the surrogate function, and the actual score. Ideally, these scores would get closer and closer as the algorithm converges on one area of the search space.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n>x=0.922, f()=0.661501, actual=0.682\r\n>x=0.895, f()=0.661668, actual=0.905\r\n>x=0.928, f()=0.648008, actual=0.403\r\n>x=0.908, f()=0.674864, actual=0.750\r\n>x=0.436, f()=0.071377, actual=-0.115<\/pre>\n<p>Next, a final plot is created with the same form as the prior plot.<\/p>\n<p>This time, all 200 samples evaluated during the optimization task are plotted. We would expect an overabundance of sampling around the known optima, and this is what we see, with may dots around 0.9. We also see that the surrogate function has a stronger representation of the underlying target domain.<\/p>\n<div id=\"attachment_8826\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-8826\" class=\"size-full wp-image-8826\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/08\/Plot-of-All-Samples-dots-and-Surrogate-Function-Across-the-Domain-line-after-Bayesian-Optimization.png\" alt=\"Plot of All Samples (dots) and Surrogate Function Across the Domain (line) after Bayesian Optimization.\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-All-Samples-dots-and-Surrogate-Function-Across-the-Domain-line-after-Bayesian-Optimization.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-All-Samples-dots-and-Surrogate-Function-Across-the-Domain-line-after-Bayesian-Optimization-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-All-Samples-dots-and-Surrogate-Function-Across-the-Domain-line-after-Bayesian-Optimization-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/08\/Plot-of-All-Samples-dots-and-Surrogate-Function-Across-the-Domain-line-after-Bayesian-Optimization-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-8826\" class=\"wp-caption-text\">Plot of All Samples (dots) and Surrogate Function Across the Domain (line) after Bayesian Optimization.<\/p>\n<\/div>\n<p>Finally, the best input and its objective function score are reported.<\/p>\n<p>We know the optima has an input of 0.9 and an output of 0.810 if there was no sampling noise.<\/p>\n<p>Given the sampling noise, the optimization algorithm gets close in this case, suggesting an input of 0.905.<\/p>\n<pre class=\"crayon-plain-tag\">Best Result: x=0.905, y=1.150<\/pre>\n<\/p>\n<h2>Hyperparameter Tuning With Bayesian Optimization<\/h2>\n<p>It can be a useful exercise to implement Bayesian Optimization to learn how it works.<\/p>\n<p>In practice, when using Bayesian Optimization on a project, it is a good idea to use a standard implementation provided in an open-source library. This is to both avoid bugs and to leverage a wider range of configuration options and speed improvements.<\/p>\n<p>Two popular libraries for Bayesian Optimization include <a href=\"https:\/\/github.com\/scikit-optimize\/scikit-optimize\">Scikit-Optimize<\/a> and <a href=\"https:\/\/github.com\/hyperopt\/hyperopt\">HyperOpt<\/a>. In machine learning, these libraries are often used to tune the hyperparameters of algorithms.<\/p>\n<p>Hyperparameter tuning is a good fit for Bayesian Optimization because the evaluation function is computationally expensive (e.g. training models for each set of hyperparameters) and noisy (e.g. noise in training data and stochastic learning algorithms).<\/p>\n<p>In this section, we will take a brief look at how to use the Scikit-Optimize library to optimize the hyperparameters of a k-nearest neighbor classifier for a simple test classification problem. This will provide a useful template that you can use on your own projects.<\/p>\n<p>The <a href=\"https:\/\/github.com\/scikit-optimize\/scikit-optimize\">Scikit-Optimize project<\/a> is designed to provide access to Bayesian Optimization for applications that use SciPy and NumPy, or applications that use scikit-learn machine learning algorithms.<\/p>\n<p>First, the library must be installed, which can be achieved easily using pip; for example:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install scikit-optimize<\/pre>\n<p>It is also assumed that you have <a href=\"https:\/\/scikit-learn.org\/stable\/index.html\">scikit-learn<\/a> installed for this example.<\/p>\n<p>Once installed, there are two ways that scikit-optimize can be used to optimize the hyperparameters of a scikit-learn algorithm. The first is to perform the optimization directly on a search space, and the second is to use the BayesSearchCV class, a sibling of the scikit-learn native classes for random and grid searching.<\/p>\n<p>In this example, will use the simpler approach of optimizing the hyperparameters directly.<\/p>\n<p>The first step is to prepare the data and define the model. We will use a simple test classification problem via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() function<\/a> with 500 examples, each with two features and three class labels. We will then use a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.neighbors.KNeighborsClassifier.html\">KNeighborsClassifier algorithm<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=500, centers=3, n_features=2)\r\n# define the model\r\nmodel = KNeighborsClassifier()<\/pre>\n<p>Next, we must define the search space.<\/p>\n<p>In this case, we will tune the number of neighbors (<em>n_neighbors<\/em>) and the shape of the neighborhood function (<em>p<\/em>). This requires ranges be defined for a given data type. In this case, they are Integers, defined with the min, max, and the name of the parameter to the scikit-learn model. For your algorithm, you can just as easily optimize <em>Real()<\/em> and <em>Categorical()<\/em> data types.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the space of hyperparameters to search\r\nsearch_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]<\/pre>\n<p>Next, we need to define a function that will be used to evaluate a given set of hyperparameters. We want to minimize this function, therefore smaller values returned must indicate a better performing model.<\/p>\n<p>We can use the <em>use_named_args()<\/em> decorator from the scikit-optimize project on the function definition that allows the function to be called directly with a specific set of parameters from the search space.<\/p>\n<p>As such, our custom function will take the hyperparameter values as arguments, which can be provided to the model directly in order to configure it. We can define these arguments generically in python using the <em>**params<\/em> argument to the function, then pass them to the model via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.set_params\">set_params(**) function<\/a>.<\/p>\n<p>Now that the model is configured, we can evaluate it. In this case, we will use 5-fold cross-validation on our dataset and evaluate the accuracy for each fold. We can then report the performance of the model as one minus the mean accuracy across these folds. This means that a perfect model with an accuracy of 1.0 will return a value of 0.0 (1.0 \u2013 mean accuracy).<\/p>\n<p>This function is defined after we have loaded the dataset and defined the model so that both the dataset and model are in scope and can be used directly.<\/p>\n<pre class=\"crayon-plain-tag\"># define the function used to evaluate a given configuration\r\n@use_named_args(search_space)\r\ndef evaluate_model(**params):\r\n\t# something\r\n\tmodel.set_params(**params)\r\n\t# calculate 5-fold cross validation\r\n\tresult = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='accuracy')\r\n\t# calculate the mean of the scores\r\n\testimate = mean(result)\r\n\treturn 1.0 - estimate<\/pre>\n<p>Next, we can perform the optimization.<\/p>\n<p>This is achieved by calling the <a href=\"https:\/\/scikit-optimize.github.io\/#skopt.gp_minimize\">gp_minimize() function<\/a> with the name of the objective function and the defined search space.<\/p>\n<p>By default, this function will use a \u2018<em>gp_hedge<\/em>\u2018 acquisition function that tries to figure out the best strategy, but this can be configured via the <em>acq_func<\/em>\u00a0argument. The optimization will also run for 100 iterations by default, but this can be controlled via the <em>n_calls<\/em> argument.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform optimization\r\nresult = gp_minimize(evaluate_model, search_space)<\/pre>\n<p>Once run, we can access the best score via the \u201cfun\u201d property and the best set of hyperparameters via the \u201c<em>x<\/em>\u201d array property.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarizing finding:\r\nprint('Best Accuracy: %.3f' % (1.0 - result.fun))\r\nprint('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))<\/pre>\n<p>Tying this all together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of bayesian optimization with scikit-optimize\r\nfrom numpy import mean\r\nfrom sklearn.datasets.samples_generator import make_blobs\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom skopt.space import Integer\r\nfrom skopt.utils import use_named_args\r\nfrom skopt import gp_minimize\r\n\r\n# generate 2d classification dataset\r\nX, y = make_blobs(n_samples=500, centers=3, n_features=2)\r\n# define the model\r\nmodel = KNeighborsClassifier()\r\n# define the space of hyperparameters to search\r\nsearch_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]\r\n\r\n# define the function used to evaluate a given configuration\r\n@use_named_args(search_space)\r\ndef evaluate_model(**params):\r\n\t# something\r\n\tmodel.set_params(**params)\r\n\t# calculate 5-fold cross validation\r\n\tresult = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='accuracy')\r\n\t# calculate the mean of the scores\r\n\testimate = mean(result)\r\n\treturn 1.0 - estimate\r\n\r\n# perform optimization\r\nresult = gp_minimize(evaluate_model, search_space)\r\n# summarizing finding:\r\nprint('Best Accuracy: %.3f' % (1.0 - result.fun))\r\nprint('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))<\/pre>\n<p>Running the example executes the hyperparameter tuning using Bayesian Optimization.<\/p>\n<p>The code may report many warning messages, such as:<\/p>\n<pre class=\"crayon-plain-tag\">UserWarning: The objective has been evaluated at this point before.<\/pre>\n<p>This is to be expected and is caused by the same hyperparameter configuration being evaluated more than once.<\/p>\n<p>Your specific results will vary given the stochastic nature of the test problem. Try running the example a few times.<\/p>\n<p>In this case, the model achieved about 97% accuracy via mean 5-fold cross-validation with 3 neighbors and a p-value of 2.<\/p>\n<pre class=\"crayon-plain-tag\">Best Accuracy: 0.976\r\nBest Parameters: n_neighbors=3, p=2<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1012.2599\">A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning<\/a>, 2010.<\/li>\n<li><a href=\"http:\/\/papers.nips.cc\/paper\/4522-practical-bayesian-optimization\">Practical Bayesian Optimization of Machine Learning Algorithms<\/a>, 2012.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1807.02811\">A Tutorial on Bayesian Optimization<\/a>, 2018.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/gaussian_process.html\">Gaussian Processes, Scikit-Learn API<\/a>.<\/li>\n<li><a href=\"https:\/\/github.com\/hyperopt\/hyperopt\">Hyperopt: Distributed Asynchronous Hyper-parameter Optimization<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/scikit-optimize\/scikit-optimize\">Scikit-Optimize Project.<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/scikit-optimize\/scikit-optimize\/blob\/master\/examples\/hyperparameter-optimization.ipynb\">Tuning a scikit-learn estimator with skopt<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Global_optimization\">Global optimization, Wikipedia<\/a>.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayesian_optimization\">Bayesian optimization, Wikipedia<\/a>.<\/li>\n<li><a href=\"http:\/\/krasserm.github.io\/2018\/03\/21\/bayesian-optimization\/\">Bayesian optimization<\/a>, 2018.<\/li>\n<li><a href=\"https:\/\/www.quora.com\/How-does-Bayesian-optimization-work\">How does Bayesian optimization work?, Quora<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered Bayesian Optimization for directed search of complex optimization problems.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Global optimization is a challenging problem that involves black box and often non-convex, non-linear, noisy, and computationally expensive objective functions.<\/li>\n<li>Bayesian Optimization provides a probabilistically principled method for global optimization.<\/li>\n<li>How to implement Bayesian Optimization from scratch and how to use open-source implementations.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/what-is-bayesian-optimization\/\">How to Implement Bayesian Optimization from Scratch in Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/what-is-bayesian-optimization\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Discover a Gentle Introduction to Bayesian Optimization. Global optimization is a challenging problem of finding an input that results in the minimum [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/08\/how-to-implement-bayesian-optimization-from-scratch-in-python\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2672,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2671"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2671"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2671\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2672"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}