{"id":4232,"date":"2020-12-24T18:00:59","date_gmt":"2020-12-24T18:00:59","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/12\/24\/feature-selection-with-stochastic-optimization-algorithms\/"},"modified":"2020-12-24T18:00:59","modified_gmt":"2020-12-24T18:00:59","slug":"feature-selection-with-stochastic-optimization-algorithms","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/12\/24\/feature-selection-with-stochastic-optimization-algorithms\/","title":{"rendered":"Feature Selection with Stochastic Optimization Algorithms"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Typically, a simpler and better-performing machine learning model can be developed by removing input features (columns) from the training dataset.<\/p>\n<p>This is called feature selection and there are many different types of algorithms that can be used.<\/p>\n<p>It is possible to frame the problem of feature selection as an optimization problem. In the case that there are few input features, all possible combinations of input features can be evaluated and the best subset found definitively. 
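<\/p>\n<p>To get a sense of scale, we can count the candidates directly: with <em>n<\/em> input features there are 2^n - 1 non-empty subsets, so five features give just 31 candidates to evaluate, while 500 features give a number far too large to ever enumerate. The short snippet below (a standalone illustration) prints both counts.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># count the non-empty feature subsets for a small and a large dataset\r\nfor n_features in [5, 500]:\r\n\tn_subsets = 2**n_features - 1\r\n\tprint('%d features: %d non-empty subsets' % (n_features, n_subsets))<\/pre>\n<p>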
In the case of a vast number of input features, a stochastic optimization algorithm can be used to explore the search space and find an effective subset of features.<\/p>\n<p>In this tutorial, you will discover how to use optimization algorithms for feature selection in machine learning.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The problem of feature selection can be broadly defined as an optimization problem.<\/li>\n<li>How to enumerate all possible subsets of input features for a dataset.<\/li>\n<li>How to apply stochastic optimization to select an optimal subset of input features.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_11937\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-11937\" loading=\"lazy\" class=\"size-full wp-image-11937\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/03\/How-to-Use-Optimization-for-Feature-Selection.jpg\" alt=\"How to Use Optimization for Feature Selection\" width=\"800\" height=\"330\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/03\/How-to-Use-Optimization-for-Feature-Selection.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/03\/How-to-Use-Optimization-for-Feature-Selection-300x124.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2021\/03\/How-to-Use-Optimization-for-Feature-Selection-768x317.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-11937\" class=\"wp-caption-text\">How to Use Optimization for Feature Selection<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/slobirdr\/22613339576\/\">Gregory &ldquo;Slobirdr&rdquo; Smith<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Optimization for 
Feature Selection<\/li>\n<li>Enumerate All Feature Subsets<\/li>\n<li>Optimize Feature Subsets<\/li>\n<\/ol>\n<h2>Optimization for Feature Selection<\/h2>\n<p>Feature selection is the process of reducing the number of input variables when developing a predictive model.<\/p>\n<p>It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. There are many different types of feature selection algorithms, although they can broadly be grouped into two main types: wrapper and filter methods.<\/p>\n<p>Wrapper feature selection methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.<\/p>\n<p>Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.<\/p>\n<ul>\n<li><strong>Wrapper Feature Selection<\/strong>: Search for well-performing subsets of features.<\/li>\n<li><strong>Filter Feature Selection<\/strong>: Select subsets of features based on their relationship with the target.<\/li>\n<\/ul>\n<p>For more on choosing feature selection algorithms, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/feature-selection-with-real-and-categorical-data\/\">How to Choose a Feature Selection Method For Machine Learning<\/a><\/li>\n<\/ul>\n<p>A popular wrapper method is the Recursive Feature Elimination, or RFE, algorithm.<\/p>\n<p>RFE works by searching for a subset of features by starting with all features in the training dataset and successively removing features until the desired 
number remains.<\/p>\n<p>This is achieved by fitting the machine learning algorithm used at the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.<\/p>\n<p>For more on RFE, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/rfe-feature-selection-in-python\/\">Recursive Feature Elimination (RFE) for Feature Selection in Python<\/a><\/li>\n<\/ul>\n<p>The problem of wrapper feature selection can be framed as an optimization problem. That is, find a subset of input features that results in the best model performance.<\/p>\n<p>RFE is one approach to solving this problem systematically, although it may become impractical when the number of features is large.<\/p>\n<p>An alternative approach would be to use a stochastic optimization algorithm, such as a stochastic hill climbing algorithm, when the number of features is very large. When the number of features is relatively small, it may be possible to enumerate all possible subsets of features.<\/p>\n<ul>\n<li><strong>Few Input Variables<\/strong>: Enumerate all possible subsets of features.<\/li>\n<li><strong>Many Input Features<\/strong>: Stochastic optimization algorithm to find good subsets of features.<\/li>\n<\/ul>\n<p>Now that we are familiar with the idea that feature selection may be explored as an optimization problem, let&rsquo;s look at how we might enumerate all possible feature subsets.<\/p>\n<h2>Enumerate All Feature Subsets<\/h2>\n<p>When the number of input variables is relatively small and the model evaluation is relatively fast, then it may be possible to enumerate all possible subsets of input variables.<\/p>\n<p>This means evaluating the performance of a model using a test harness given every possible unique group of input variables.<\/p>\n<p>We will explore how to do this with a worked example.<\/p>\n<p>First, let&rsquo;s define a small binary 
classification dataset with few input features. We can use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> to define a dataset with five input variables, two of which are informative, and 1,000 rows.<\/p>\n<p>The example below defines the dataset and summarizes its shape.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># define a small classification dataset\r\nfrom sklearn.datasets import make_classification\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)\r\n# summarize the shape of the dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example creates the dataset and confirms that it has the desired shape.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">(1000, 5) (1000,)<\/pre>\n<p>Next, we can establish a baseline in performance using a model evaluated on the entire dataset.<\/p>\n<p>We will use a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">DecisionTreeClassifier<\/a> as the model because its performance is quite sensitive to the choice of input variables.<\/p>\n<p>We will evaluate the model using good practices, such as <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified k-fold cross-validation<\/a> with three repeats and 10 folds.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a decision tree on the entire small dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report result\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that the model achieved an accuracy of about 80.5 percent.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Mean Accuracy: 0.805 (0.030)<\/pre>\n<p>Next, we can try to improve model performance by using a subset of the input features.<\/p>\n<p>First, we must choose a representation to enumerate.<\/p>\n<p>In this case, we will enumerate a list of boolean values, with one value for each input feature: <em>True<\/em> if the feature is to be used and <em>False<\/em> if the feature is not to be used as input.<\/p>\n<p>For example, with the five input features the sequence [<em>True, True, True, True, True<\/em>] would use all input features, and [<em>True, False, False, False, False<\/em>] would only use the first input feature as input.<\/p>\n<p>We can enumerate all sequences of boolean values with the <em>length=5<\/em> using the <a href=\"https:\/\/docs.python.org\/3\/library\/itertools.html#itertools.product\">product() Python function<\/a>. 
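<\/p>\n<p>As a small standalone demonstration, with just two input features <em>product()<\/em> yields all four inclusion\/exclusion patterns:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># demonstrate enumerating boolean sequences with product()\r\nfrom itertools import product\r\nfor subset in product([True, False], repeat=2):\r\n\tprint(subset)<\/pre>\n<p>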
We must specify the valid values [<em>True, False<\/em>] and the number of steps in the sequence, which is equal to the number of input variables.<\/p>\n<p>The function returns an iterable that we can enumerate directly for each sequence.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# determine the number of columns\r\nn_cols = X.shape[1]\r\nbest_subset, best_score = None, 0.0\r\n# enumerate all combinations of input features\r\nfor subset in product([True, False], repeat=n_cols):\r\n\t...<\/pre>\n<p>For a given sequence of boolean values, we can enumerate it and transform it into a sequence of column indexes for each <em>True<\/em> in the sequence.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# convert into column indexes\r\nix = [i for i, x in enumerate(subset) if x]<\/pre>\n<p>If the sequence has no column indexes (in the case of all <em>False<\/em> values), then we can skip that sequence.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># check for no columns (all False)\r\nif len(ix) == 0:\r\n\tcontinue<\/pre>\n<p>We can then use the column indexes to choose the columns in the dataset.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# select columns\r\nX_new = X[:, ix]<\/pre>\n<p>And this subset of the dataset can then be evaluated as we did before.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize scores\r\nresult = mean(scores)<\/pre>\n<p>If the accuracy for the model is better than the best sequence found so far, we can store it.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# check if it is better than the best so far\r\nif best_score is None or result &gt;= best_score:\r\n\t# better 
result\r\n\tbest_subset, best_score = ix, result<\/pre>\n<p>And that&rsquo;s it.<\/p>\n<p>Tying this together, the complete example of feature selection by enumerating all possible feature subsets is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># feature selection by enumerating all possible subsets of features\r\nfrom itertools import product\r\nfrom numpy import mean\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\n# define dataset\r\nX, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)\r\n# determine the number of columns\r\nn_cols = X.shape[1]\r\nbest_subset, best_score = None, 0.0\r\n# enumerate all combinations of input features\r\nfor subset in product([True, False], repeat=n_cols):\r\n\t# convert into column indexes\r\n\tix = [i for i, x in enumerate(subset) if x]\r\n\t# check for no columns (all False)\r\n\tif len(ix) == 0:\r\n\t\tcontinue\r\n\t# select columns\r\n\tX_new = X[:, ix]\r\n\t# define model\r\n\tmodel = DecisionTreeClassifier()\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\t# summarize scores\r\n\tresult = mean(scores)\r\n\t# report progress\r\n\tprint('&gt;f(%s) = %f ' % (ix, result))\r\n\t# check if it is better than the best so far\r\n\tif best_score is None or result &gt;= best_score:\r\n\t\t# better result\r\n\t\tbest_subset, best_score = ix, result\r\n# report best\r\nprint('Done!')\r\nprint('f(%s) = %f' % (best_subset, best_score))<\/pre>\n<p>Running the example reports the mean classification accuracy of the model for each subset of features considered. 
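<\/p>\n<p>This search is a modest amount of computation: the loop scores 2^5 - 1 = 31 subsets, and each subset is evaluated with 30 cross-validation fits (10 folds by 3 repeats), for 930 model fits in total, as the quick calculation below confirms.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># count the model fits performed by the exhaustive search\r\nn_subsets = 2**5 - 1\r\n# each subset is scored with 10 folds and 3 repeats\r\nn_fits = n_subsets * 10 * 3\r\nprint('%d subsets, %d model fits' % (n_subsets, n_fits))<\/pre>\n<p>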
The best subset is then reported at the end of the run.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that the best subset of features involved features at indexes [0, 3, 4] that resulted in a mean classification accuracy of about 83.0 percent, which is better than the result reported previously using all input features.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">&gt;f([0, 1, 2, 3, 4]) = 0.813667\r\n&gt;f([0, 1, 2, 3]) = 0.827667\r\n&gt;f([0, 1, 2, 4]) = 0.815333\r\n&gt;f([0, 1, 2]) = 0.824000\r\n&gt;f([0, 1, 3, 4]) = 0.821333\r\n&gt;f([0, 1, 3]) = 0.825667\r\n&gt;f([0, 1, 4]) = 0.807333\r\n&gt;f([0, 1]) = 0.817667\r\n&gt;f([0, 2, 3, 4]) = 0.830333\r\n&gt;f([0, 2, 3]) = 0.819000\r\n&gt;f([0, 2, 4]) = 0.828000\r\n&gt;f([0, 2]) = 0.818333\r\n&gt;f([0, 3, 4]) = 0.830333\r\n&gt;f([0, 3]) = 0.821333\r\n&gt;f([0, 4]) = 0.816000\r\n&gt;f([0]) = 0.639333\r\n&gt;f([1, 2, 3, 4]) = 0.823667\r\n&gt;f([1, 2, 3]) = 0.821667\r\n&gt;f([1, 2, 4]) = 0.823333\r\n&gt;f([1, 2]) = 0.818667\r\n&gt;f([1, 3, 4]) = 0.818000\r\n&gt;f([1, 3]) = 0.820667\r\n&gt;f([1, 4]) = 0.809000\r\n&gt;f([1]) = 0.797000\r\n&gt;f([2, 3, 4]) = 0.827667\r\n&gt;f([2, 3]) = 0.755000\r\n&gt;f([2, 4]) = 0.827000\r\n&gt;f([2]) = 0.516667\r\n&gt;f([3, 4]) = 0.824000\r\n&gt;f([3]) = 0.514333\r\n&gt;f([4]) = 0.777667\r\nDone!\r\nf([0, 3, 4]) = 0.830333<\/pre>\n<p>Now that we know how to enumerate all possible feature subsets, let&rsquo;s look at how we might use a stochastic optimization algorithm to choose a subset of features.<\/p>\n<h2>Optimize Feature Subsets<\/h2>\n<p>We can apply a stochastic optimization algorithm to the search space of subsets of input 
features.<\/p>\n<p>First, let&rsquo;s define a larger problem that has many more features, making model evaluation too slow and the search space too large for enumerating all subsets.<\/p>\n<p>We will define a classification problem with 10,000 rows and 500 input features, 10 of which are relevant and the remaining 490 are redundant.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># define a large classification dataset\r\nfrom sklearn.datasets import make_classification\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)\r\n# summarize the shape of the dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example creates the dataset and confirms that it has the desired shape.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">(10000, 500) (10000,)<\/pre>\n<p>We can establish a baseline in performance by evaluating a model on the dataset with all input features.<\/p>\n<p>Because the dataset is large and the model is slow to evaluate, we will modify the evaluation of the model to use 3-fold cross-validation, i.e. 
fewer folds and no repeats.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a decision tree on the entire larger dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import StratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define evaluation procedure\r\ncv = StratifiedKFold(n_splits=3)\r\n# evaluate model\r\nscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# report result\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that the model achieved an accuracy of about 91.3 percent.<\/p>\n<p>This provides a baseline that we would expect to outperform using feature selection.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Mean Accuracy: 0.913 (0.001)<\/pre>\n<p>We will use a simple stochastic hill climbing algorithm as the optimization algorithm.<\/p>\n<p>First, we must define the objective function. It will take the dataset and a subset of features to use as input and return an estimated model accuracy from 0 (worst) to 1 (best). 
It is a maximizing optimization problem.<\/p>\n<p>This objective function is simply the decoding of the sequence and model evaluation step from the previous section.<\/p>\n<p>The <em>objective()<\/em> function below implements this and returns both the score and the decoded subset of columns used for helpful reporting.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># objective function\r\ndef objective(X, y, subset):\r\n\t# convert into column indexes\r\n\tix = [i for i, x in enumerate(subset) if x]\r\n\t# check for no columns (all False)\r\n\tif len(ix) == 0:\r\n\t\treturn 0.0, ix\r\n\t# select columns\r\n\tX_new = X[:, ix]\r\n\t# define model\r\n\tmodel = DecisionTreeClassifier()\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)\r\n\t# summarize scores\r\n\tresult = mean(scores)\r\n\treturn result, ix<\/pre>\n<p>We also need a function that can take a step in the search space.<\/p>\n<p>Given an existing solution, it must modify it and return a new solution in close proximity. 
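<\/p>\n<p>Flipping each of <em>n<\/em> positions independently with probability <em>p<\/em> changes about <em>n * p<\/em> positions on average, which is what keeps the new solution close to the old one. The small standalone simulation below confirms that with 500 positions and a flip probability of 10 \/ 500 we change about 10 positions per step.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># estimate the average number of flipped positions per mutation\r\nfrom numpy.random import rand\r\nfrom numpy.random import seed\r\nseed(1)\r\nn, p = 500, 10.0 \/ 500.0\r\n# count flips over many simulated mutations\r\nflips = [sum(1 for _ in range(n) if rand() &lt; p) for _ in range(200)]\r\nprint('average flips: %.1f' % (sum(flips) \/ len(flips)))<\/pre>\n<p>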
In this case, we will achieve this by randomly flipping the inclusion\/exclusion of columns in the sequence.<\/p>\n<p>Each position in the sequence will be considered independently and will be flipped probabilistically where the probability of flipping is a hyperparameter.<\/p>\n<p>The <em>mutate()<\/em> function below implements this given a candidate solution (sequence of booleans) and a mutation hyperparameter, creating and returning a modified solution (a step in the search space).<\/p>\n<p>The larger the <em>p_mutate<\/em> value (in the range 0 to 1), the larger the step in the search space.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># mutation operator\r\ndef mutate(solution, p_mutate):\r\n\t# make a copy\r\n\tchild = solution.copy()\r\n\tfor i in range(len(child)):\r\n\t\t# check for a mutation\r\n\t\tif rand() &lt; p_mutate:\r\n\t\t\t# flip the inclusion\r\n\t\t\tchild[i] = not child[i]\r\n\treturn child<\/pre>\n<p>We can now implement the hill climbing algorithm.<\/p>\n<p>The initial solution is a randomly generated sequence, which is then evaluated.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# generate an initial point\r\nsolution = choice([True, False], size=X.shape[1])\r\n# evaluate the initial point\r\nsolution_eval, ix = objective(X, y, solution)<\/pre>\n<p>We then loop for a fixed number of iterations, creating mutated versions of the current solution, evaluating them, and saving them if the score is better.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# run the hill climb\r\nfor i in range(n_iter):\r\n\t# take a step\r\n\tcandidate = mutate(solution, p_mutate)\r\n\t# evaluate candidate point\r\n\tcandidate_eval, ix = objective(X, y, candidate)\r\n\t# check if we should keep the new point\r\n\tif candidate_eval &gt;= solution_eval:\r\n\t\t# store the new point\r\n\t\tsolution, solution_eval = candidate, candidate_eval\r\n\t# report progress\r\n\tprint('&gt;%d f(%s) = %f' % (i+1, len(ix), 
solution_eval))<\/pre>\n<p>The <em>hillclimbing()<\/em> function below implements this, taking the dataset, objective function, and hyperparameters as arguments and returning the best subset of dataset columns and the estimated performance of the model.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># hill climbing local search algorithm\r\ndef hillclimbing(X, y, objective, n_iter, p_mutate):\r\n\t# generate an initial point\r\n\tsolution = choice([True, False], size=X.shape[1])\r\n\t# evaluate the initial point\r\n\tsolution_eval, ix = objective(X, y, solution)\r\n\t# run the hill climb\r\n\tfor i in range(n_iter):\r\n\t\t# take a step\r\n\t\tcandidate = mutate(solution, p_mutate)\r\n\t\t# evaluate candidate point\r\n\t\tcandidate_eval, ix = objective(X, y, candidate)\r\n\t\t# check if we should keep the new point\r\n\t\tif candidate_eval &gt;= solution_eval:\r\n\t\t\t# store the new point\r\n\t\t\tsolution, solution_eval = candidate, candidate_eval\r\n\t\t# report progress\r\n\t\tprint('&gt;%d f(%s) = %f' % (i+1, len(ix), solution_eval))\r\n\treturn solution, solution_eval<\/pre>\n<p>We can then call this function and pass in our synthetic dataset to perform optimization for feature selection.<\/p>\n<p>In this case, we will run the algorithm for 100 iterations and make about ten flips to the sequence for a given mutation, which is quite conservative.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)\r\n# define the total iterations\r\nn_iter = 100\r\n# probability of including\/excluding a column\r\np_mut = 10.0 \/ 500.0\r\n# perform the hill climbing search\r\nsubset, score = hillclimbing(X, y, objective, n_iter, p_mut)<\/pre>\n<p>At the end of the run, we will convert the boolean sequence into column indexes (so we could fit a final model if we wanted) and report the performance of the best 
subsequence.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# convert into column indexes\r\nix = [i for i, x in enumerate(subset) if x]\r\nprint('Done!')\r\nprint('Best: f(%d) = %f' % (len(ix), score))<\/pre>\n<p>Tying this all together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># stochastic optimization for feature selection\r\nfrom numpy import mean\r\nfrom numpy.random import rand\r\nfrom numpy.random import choice\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.tree import DecisionTreeClassifier\r\n\r\n# objective function\r\ndef objective(X, y, subset):\r\n\t# convert into column indexes\r\n\tix = [i for i, x in enumerate(subset) if x]\r\n\t# check for no columns (all False)\r\n\tif len(ix) == 0:\r\n\t\treturn 0.0, ix\r\n\t# select columns\r\n\tX_new = X[:, ix]\r\n\t# define model\r\n\tmodel = DecisionTreeClassifier()\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)\r\n\t# summarize scores\r\n\tresult = mean(scores)\r\n\treturn result, ix\r\n\r\n# mutation operator\r\ndef mutate(solution, p_mutate):\r\n\t# make a copy\r\n\tchild = solution.copy()\r\n\tfor i in range(len(child)):\r\n\t\t# check for a mutation\r\n\t\tif rand() &lt; p_mutate:\r\n\t\t\t# flip the inclusion\r\n\t\t\tchild[i] = not child[i]\r\n\treturn child\r\n\r\n# hill climbing local search algorithm\r\ndef hillclimbing(X, y, objective, n_iter, p_mutate):\r\n\t# generate an initial point\r\n\tsolution = choice([True, False], size=X.shape[1])\r\n\t# evaluate the initial point\r\n\tsolution_eval, ix = objective(X, y, solution)\r\n\t# run the hill climb\r\n\tfor i in range(n_iter):\r\n\t\t# take a step\r\n\t\tcandidate = mutate(solution, p_mutate)\r\n\t\t# evaluate candidate point\r\n\t\tcandidate_eval, ix = objective(X, y, candidate)\r\n\t\t# check if we should keep the new point\r\n\t\tif 
candidate_eval &gt;= solution_eval:\r\n\t\t\t# store the new point\r\n\t\t\tsolution, solution_eval = candidate, candidate_eval\r\n\t\t# report progress\r\n\t\tprint('&gt;%d f(%s) = %f' % (i+1, len(ix), solution_eval))\r\n\treturn solution, solution_eval\r\n\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)\r\n# define the total iterations\r\nn_iter = 100\r\n# probability of including\/excluding a column\r\np_mut = 10.0 \/ 500.0\r\n# perform the hill climbing search\r\nsubset, score = hillclimbing(X, y, objective, n_iter, p_mut)\r\n# convert into column indexes\r\nix = [i for i, x in enumerate(subset) if x]\r\nprint('Done!')\r\nprint('Best: f(%d) = %f' % (len(ix), score))<\/pre>\n<p>Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. 
Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that the best performance was achieved with a subset of 239 features and a classification accuracy of approximately 91.8 percent.<\/p>\n<p>This is better than a model evaluated on all input features.<\/p>\n<p>Although the result is better, we know we can do a lot better, perhaps with tuning of the hyperparameters of the optimization algorithm or perhaps by using an alternate optimization algorithm.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n&gt;80 f(240) = 0.918099\r\n&gt;81 f(236) = 0.918099\r\n&gt;82 f(238) = 0.918099\r\n&gt;83 f(236) = 0.918099\r\n&gt;84 f(239) = 0.918099\r\n&gt;85 f(240) = 0.918099\r\n&gt;86 f(239) = 0.918099\r\n&gt;87 f(245) = 0.918099\r\n&gt;88 f(241) = 0.918099\r\n&gt;89 f(239) = 0.918099\r\n&gt;90 f(239) = 0.918099\r\n&gt;91 f(241) = 0.918099\r\n&gt;92 f(243) = 0.918099\r\n&gt;93 f(245) = 0.918099\r\n&gt;94 f(239) = 0.918099\r\n&gt;95 f(245) = 0.918099\r\n&gt;96 f(244) = 0.918099\r\n&gt;97 f(242) = 0.918099\r\n&gt;98 f(238) = 0.918099\r\n&gt;99 f(248) = 0.918099\r\n&gt;100 f(238) = 0.918099\r\nDone!\r\nBest: f(239) = 0.918099<\/pre>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/rfe-feature-selection-in-python\/\">Recursive Feature Elimination (RFE) for Feature Selection in Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/feature-selection-with-real-and-categorical-data\/\">How to Choose a Feature Selection Method For Machine Learning<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">sklearn.datasets.make_classification API<\/a>.<\/li>\n<li><a 
href=\"https:\/\/docs.python.org\/3\/library\/itertools.html#itertools.product\">itertools.product API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use optimization algorithms for feature selection in machine learning.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The problem of feature selection can be broadly defined as an optimization problem.<\/li>\n<li>How to enumerate all possible subsets of input features for a dataset.<\/li>\n<li>How to apply stochastic optimization to select an optimal subset of input features.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/feature-selection-with-optimization\/\">Feature Selection with Stochastic Optimization Algorithms<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/feature-selection-with-optimization\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Typically, a simpler and better-performing machine learning model can be developed by removing input features (columns) from the training dataset. 
This is [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/12\/24\/feature-selection-with-stochastic-optimization-algorithms\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":4233,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4232"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4232"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4232\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/4233"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}