{"id":3663,"date":"2020-07-14T19:00:19","date_gmt":"2020-07-14T19:00:19","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/14\/how-to-grid-search-data-preparation-techniques\/"},"modified":"2020-07-14T19:00:19","modified_gmt":"2020-07-14T19:00:19","slug":"how-to-grid-search-data-preparation-techniques","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/14\/how-to-grid-search-data-preparation-techniques\/","title":{"rendered":"How to Grid Search Data Preparation Techniques"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling.<\/p>\n<p>The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithms, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.<\/p>\n<p>An alternative approach to data preparation is to grid search a suite of common and commonly useful data preparation techniques to the raw data. This is an alternative philosophy for data preparation that <strong>treats data transforms as another hyperparameter<\/strong> of the modeling pipeline to be searched and tuned.<\/p>\n<p>This approach requires less expertise than the traditional manual approach to data preparation, although it is computationally costly. 
The benefit is that it can aid in the discovery of non-intuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.<\/p>\n<p>In this tutorial, you will discover how to use the grid search approach for data preparation with tabular data.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.<\/li>\n<li>How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.<\/li>\n<li>How to grid search sequences of data preparation methods to further improve model performance.<\/li>\n<\/ul>\n<p>Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more <a href=\"https:\/\/machinelearningmastery.com\/data-preparation-for-machine-learning\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_11022\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11022\" class=\"size-full wp-image-11022\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/07\/How-to-Grid-Search-Data-Preparation-Techniques.jpg\" alt=\"How to Grid Search Data Preparation Techniques\" width=\"800\" height=\"449\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/How-to-Grid-Search-Data-Preparation-Techniques.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/How-to-Grid-Search-Data-Preparation-Techniques-300x168.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/How-to-Grid-Search-Data-Preparation-Techniques-768x431.jpg 768w\" 
sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-11022\" class=\"wp-caption-text\">How to Grid Search Data Preparation Techniques<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/wallboat\/37224599356\/\">Wall Boat<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Grid Search Technique for Data Preparation<\/li>\n<li>Dataset and Performance Baseline\n<ol>\n<li>Wine Classification Dataset<\/li>\n<li>Baseline Model Performance<\/li>\n<\/ol>\n<\/li>\n<li>Grid Search Approach to Data Preparation<\/li>\n<\/ol>\n<h2>Grid Search Technique for Data Preparation<\/h2>\n<p>Data preparation can be challenging.<\/p>\n<p>The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.<\/p>\n<p>This can be effective but is also slow and can require deep expertise with data analysis and machine learning algorithms.<\/p>\n<p>An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configurations.<\/p>\n<p>This might be a data transform that &ldquo;<em>should not work<\/em>&rdquo; or &ldquo;<em>should not be appropriate for the algorithm<\/em>&rdquo; yet results in good or great performance. Alternatively, it may be the absence of a data transform for an input variable that is deemed &ldquo;<em>absolutely required<\/em>&rdquo; yet results in good or great performance.<\/p>\n<p>This can be achieved by designing a <strong>grid search of data preparation techniques<\/strong> and\/or sequences of data preparation techniques in pipelines. 
This may involve evaluating each on a single chosen machine learning algorithm, or on a suite of machine learning algorithms.<\/p>\n<p>The benefit of this approach is that it always results in suggestions of modeling pipelines that give good relative results. Most importantly, it can unearth non-obvious and unintuitive solutions for practitioners without the need for deep expertise.<\/p>\n<p>We can explore this approach to data preparation with a worked example.<\/p>\n<p>Before we dive into a worked example, let&rsquo;s first select a standard dataset and develop a baseline in performance.<\/p>\n<div class=\"woo-sc-hr\"><\/div>\n<p><center><\/p>\n<h3>Want to Get Started With Data Preparation?<\/h3>\n<p>Take my free 7-day email crash course now (with sample code).<\/p>\n<p>Click to sign-up and also get a free PDF Ebook version of the course.<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.lpages.co\/leadbox\/1041bc0ec172a2%3A164f8be4f346dc\/4935938752774144\/\" target=\"_blank\" style=\"background: rgb(255, 206, 10); color: rgb(255, 255, 255); text-decoration: none; font-family: Helvetica, Arial, sans-serif; font-weight: bold; font-size: 16px; line-height: 20px; padding: 10px; display: inline-block; max-width: 300px; border-radius: 5px; text-shadow: rgba(0, 0, 0, 0.25) 0px -1px 1px; box-shadow: rgba(255, 255, 255, 0.5) 0px 1px 3px inset, rgba(0, 0, 0, 0.5) 0px 1px 3px;\" rel=\"noopener noreferrer\">Download Your FREE Mini-Course<\/a><script data-leadbox=\"1041bc0ec172a2:164f8be4f346dc\" data-url=\"https:\/\/machinelearningmastery.lpages.co\/leadbox\/1041bc0ec172a2%3A164f8be4f346dc\/4935938752774144\/\" data-config=\"%7B%7D\" type=\"text\/javascript\" src=\"https:\/\/machinelearningmastery.lpages.co\/leadbox-1589485176.js\"><\/script><\/p>\n<p><\/center><\/p>\n<div class=\"woo-sc-hr\"><\/div>\n<h2>Dataset and Performance Baseline<\/h2>\n<p>In this section, we will first select a standard machine learning dataset and establish a baseline in performance 
on this dataset. This will provide the context for exploring the grid search method of data preparation in the next section.<\/p>\n<h3>Wine Classification Dataset<\/h3>\n<p>We will use the wine classification dataset.<\/p>\n<p>This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.<\/p>\n<p>You can learn more about the dataset here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv\">Wine Dataset (wine.csv)<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.names\">Wine Dataset Description (wine.names)<\/a><\/li>\n<\/ul>\n<p>No need to download the dataset as we will download it automatically as part of our worked examples.<\/p>\n<p>Open the dataset and review the raw data. The first few rows of data are listed below.<\/p>\n<p>We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has different scales.<\/p>\n<pre class=\"crayon-plain-tag\">14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1\r\n13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1\r\n13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1\r\n14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1\r\n13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1\r\n...<\/pre>\n<p>The example below loads the dataset and splits it into the input and output columns, then summarizes the data arrays.<\/p>\n<pre class=\"crayon-plain-tag\"># example of loading and summarizing the wine dataset\r\nfrom pandas import read_csv\r\n# define the location of the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv'\r\n# load the dataset as a data frame\r\ndf = read_csv(url, header=None)\r\n# retrieve the numpy array\r\ndata = df.values\r\n# split the columns into input and 
output variables\r\nX, y = data[:, :-1], data[:, -1]\r\n# summarize the shape of the loaded data\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.<\/p>\n<pre class=\"crayon-plain-tag\">(178, 13) (178,)<\/pre>\n<p>Next, let&rsquo;s evaluate a model on this dataset and establish a baseline in performance.<\/p>\n<h3>Baseline Model Performance<\/h3>\n<p>We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.<\/p>\n<p>In this case, we will evaluate a logistic regression model.<\/p>\n<p>First, we can define a function to load the dataset and perform some minimal data preparation to ensure the inputs are numeric and the target is label encoded.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare the dataset\r\ndef load_dataset():\r\n\t# load the dataset\r\n\turl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv'\r\n\tdf = read_csv(url, header=None)\r\n\tdata = df.values\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# minimally prepare dataset\r\n\tX = X.astype('float')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y<\/pre>\n<p>We will evaluate the model using the gold standard of repeated stratified <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation<\/a> with 10 folds and three repeats.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define the cross-validation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores<\/pre>\n<p>We can then call the function to load the dataset, define our model, then evaluate it, reporting the mean and standard deviation 
accuracy.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# get the dataset\r\nX, y = load_dataset()\r\n# define the model\r\nmodel = LogisticRegression(solver='liblinear')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline model performance on the wine dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.linear_model import LogisticRegression\r\n\r\n# prepare the dataset\r\ndef load_dataset():\r\n\t# load the dataset\r\n\turl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv'\r\n\tdf = read_csv(url, header=None)\r\n\tdata = df.values\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# minimally prepare dataset\r\n\tX = X.astype('float')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define the cross-validation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# get the dataset\r\nX, y = load_dataset()\r\n# define the model\r\nmodel = LogisticRegression(solver='liblinear')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# report performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.<\/p>\n<p>Your results may vary 
given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.<\/p>\n<p>In this case, we can see that the logistic regression model fit on the raw input data achieved an average classification accuracy of about 95.3 percent, providing a baseline in performance.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.953 (0.048)<\/pre>\n<p>Next, let&rsquo;s explore whether we can improve the performance using the grid-search-based approach to data preparation.<\/p>\n<h2>Grid Search Approach to Data Preparation<\/h2>\n<p>In this section, we can explore whether we can improve performance using the grid search approach to data preparation.<\/p>\n<p>The first step is to define a series of modeling pipelines to evaluate, where each pipeline defines one (or more) data preparation techniques and ends with a model that takes the transformed data as input.<\/p>\n<p>We will define a function to create these pipelines as a list of tuples, where each tuple defines the short name for the pipeline and the pipeline itself. We will evaluate a range of different data scaling methods (e.g. 
<a href=\"https:\/\/machinelearningmastery.com\/standardscaler-and-minmaxscaler-transforms-in-python\/\">MinMaxScaler and StandardScaler<\/a>), distribution transforms (<a href=\"https:\/\/machinelearningmastery.com\/quantile-transforms-for-machine-learning\/\">QuantileTransformer<\/a> and <a href=\"https:\/\/machinelearningmastery.com\/discretization-transforms-for-machine-learning\/\">KBinsDiscretizer<\/a>), as well as dimensionality reduction transforms (<a href=\"https:\/\/machinelearningmastery.com\/principal-components-analysis-for-dimensionality-reduction-in-python\/\">PCA<\/a> and <a href=\"https:\/\/machinelearningmastery.com\/singular-value-decomposition-for-dimensionality-reduction-in-python\/\">SVD<\/a>).<\/p>\n<pre class=\"crayon-plain-tag\"># get modeling pipelines to evaluate\r\ndef get_pipelines(model):\r\n\tpipelines = list()\r\n\t# normalize\r\n\tp = Pipeline([('s',MinMaxScaler()), ('m',model)])\r\n\tpipelines.append(('norm', p))\r\n\t# standardize\r\n\tp = Pipeline([('s',StandardScaler()), ('m',model)])\r\n\tpipelines.append(('std', p))\r\n\t# quantile\r\n\tp = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)])\r\n\tpipelines.append(('quan', p))\r\n\t# discretize\r\n\tp = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)])\r\n\tpipelines.append(('kbins', p))\r\n\t# pca\r\n\tp = Pipeline([('s',PCA(n_components=7)), ('m',model)])\r\n\tpipelines.append(('pca', p))\r\n\t# svd\r\n\tp = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)])\r\n\tpipelines.append(('svd', p))\r\n\treturn pipelines<\/pre>\n<p>We can then call this function to get the list of transforms, then enumerate each, evaluating it and reporting the performance along the way.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# get the modeling pipelines\r\npipelines = get_pipelines(model)\r\n# evaluate each pipeline\r\nresults, names = list(), list()\r\nfor name, pipeline in pipelines:\r\n\t# 
evaluate\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\t# summarize\r\n\tprint('&gt;%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))\r\n\t# store\r\n\tresults.append(scores)\r\n\tnames.append(name)<\/pre>\n<p>At the end of the run, we can create a box and whisker plot for each set of scores and compare the distributions of results visually.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the result\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Tying this together, the complete example of grid searching data preparation techniques on the wine classification dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># compare data preparation methods for the wine classification dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom sklearn.decomposition import PCA\r\nfrom sklearn.decomposition import TruncatedSVD\r\nfrom matplotlib import pyplot\r\n\r\n# prepare the dataset\r\ndef load_dataset():\r\n\t# load the dataset\r\n\turl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv'\r\n\tdf = read_csv(url, header=None)\r\n\tdata = df.values\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# minimally prepare dataset\r\n\tX = X.astype('float')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define the cross-validation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, 
random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# get modeling pipelines to evaluate\r\ndef get_pipelines(model):\r\n\tpipelines = list()\r\n\t# normalize\r\n\tp = Pipeline([('s',MinMaxScaler()), ('m',model)])\r\n\tpipelines.append(('norm', p))\r\n\t# standardize\r\n\tp = Pipeline([('s',StandardScaler()), ('m',model)])\r\n\tpipelines.append(('std', p))\r\n\t# quantile\r\n\tp = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)])\r\n\tpipelines.append(('quan', p))\r\n\t# discretize\r\n\tp = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)])\r\n\tpipelines.append(('kbins', p))\r\n\t# pca\r\n\tp = Pipeline([('s',PCA(n_components=7)), ('m',model)])\r\n\tpipelines.append(('pca', p))\r\n\t# svd\r\n\tp = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)])\r\n\tpipelines.append(('svd', p))\r\n\treturn pipelines\r\n\r\n# get the dataset\r\nX, y = load_dataset()\r\n# define the model\r\nmodel = LogisticRegression(solver='liblinear')\r\n# get the modeling pipelines\r\npipelines = get_pipelines(model)\r\n# evaluate each pipeline\r\nresults, names = list(), list()\r\nfor name, pipeline in pipelines:\r\n\t# evaluate\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\t# summarize\r\n\tprint('&gt;%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))\r\n\t# store\r\n\tresults.append(scores)\r\n\tnames.append(name)\r\n# plot the result\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.<\/p>\n<p>Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. 
Try running the example a few times.<\/p>\n<p>In this case, we can see that standardizing the input variables and using a quantile transform both achieve the best result with a classification accuracy of about 98.7 percent, an improvement over the baseline with no data preparation that achieved a classification accuracy of 95.3 percent.<\/p>\n<p>You can add your own modeling pipelines to the <em>get_pipelines()<\/em> function and compare their results.<\/p>\n<p><strong>Can you get better results?<\/strong><br \/>\nLet me know in the comments below.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;norm: 0.976 (0.031)\r\n&gt;std: 0.987 (0.023)\r\n&gt;quan: 0.987 (0.023)\r\n&gt;kbins: 0.968 (0.045)\r\n&gt;pca: 0.963 (0.039)\r\n&gt;svd: 0.953 (0.048)<\/pre>\n<p>A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique. We can see that the distributions of scores for the standardization and quantile transforms are compact and very similar, each with an outlier. 
We can see that the spread of scores for the other transforms is larger and skewed toward lower accuracy.<\/p>\n<p>The results may suggest that standardizing the dataset is probably an important data preparation step, and that related transforms, such as the quantile transform, and perhaps even the power transform, may offer benefits when combined with standardization by making one or more input variables more Gaussian.<\/p>\n<div id=\"attachment_11020\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11020\" class=\"size-full wp-image-11020\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Data-Transforms-on-the-Wine-Classification-Dataset.png\" alt=\"Box and Whisker Plot of Classification Accuracy for Different Data Transforms on the Wine Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Data-Transforms-on-the-Wine-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Data-Transforms-on-the-Wine-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Data-Transforms-on-the-Wine-Classification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Data-Transforms-on-the-Wine-Classification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-11020\" class=\"wp-caption-text\">Box and Whisker Plot 
of Classification Accuracy for Different Data Transforms on the Wine Classification Dataset<\/p>\n<\/div>\n<p>We can also explore sequences of transforms to see if they can offer a lift in performance.<\/p>\n<p>For example, we might want to apply <a href=\"https:\/\/machinelearningmastery.com\/rfe-feature-selection-in-python\/\">RFE feature selection<\/a> after the standardization transform to see if the same or better results can be achieved with fewer input variables (e.g. less complexity).<\/p>\n<p>We might also want to see if a <a href=\"https:\/\/machinelearningmastery.com\/power-transforms-with-scikit-learn\/\">power transform<\/a> preceded by a data scaling transform can achieve good performance on the dataset, as we believe it could given the success of the quantile transform.<\/p>\n<p>The updated <em>get_pipelines()<\/em> function with sequences of transforms is provided below.<\/p>\n<pre class=\"crayon-plain-tag\"># get modeling pipelines to evaluate\r\ndef get_pipelines(model):\r\n\tpipelines = list()\r\n\t# standardize\r\n\tp = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)])\r\n\tpipelines.append(('std', p))\r\n\t# scale and power\r\n\tp = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)])\r\n\tpipelines.append(('power', p))\r\n\treturn pipelines<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># compare sequences of data preparation methods for the wine classification dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom 
sklearn.preprocessing import StandardScaler\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom sklearn.decomposition import PCA\r\nfrom sklearn.decomposition import TruncatedSVD\r\nfrom sklearn.feature_selection import RFE\r\nfrom matplotlib import pyplot\r\n\r\n# prepare the dataset\r\ndef load_dataset():\r\n\t# load the dataset\r\n\turl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/wine.csv'\r\n\tdf = read_csv(url, header=None)\r\n\tdata = df.values\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# minimally prepare dataset\r\n\tX = X.astype('float')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define the cross-validation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# get modeling pipelines to evaluate\r\ndef get_pipelines(model):\r\n\tpipelines = list()\r\n\t# standardize\r\n\tp = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)])\r\n\tpipelines.append(('std', p))\r\n\t# scale and power\r\n\tp = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)])\r\n\tpipelines.append(('power', p))\r\n\treturn pipelines\r\n\r\n# get the dataset\r\nX, y = load_dataset()\r\n# define the model\r\nmodel = LogisticRegression(solver='liblinear')\r\n# get the modeling pipelines\r\npipelines = get_pipelines(model)\r\n# evaluate each pipeline\r\nresults, names = list(), list()\r\nfor name, pipeline in pipelines:\r\n\t# evaluate\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\t# summarize\r\n\tprint('&gt;%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))\r\n\t# 
store\r\n\tresults.append(scores)\r\n\tnames.append(name)\r\n# plot the result\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.<\/p>\n<p>Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.<\/p>\n<p>In this case, we can see that the standardization with feature selection offers an additional lift in accuracy from 98.7 percent to 98.9 percent, although the data scaling and power transform do not offer any additional benefit over the quantile transform.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;std: 0.989 (0.022)\r\n&gt;power: 0.987 (0.023)<\/pre>\n<p>A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique.<\/p>\n<p>We can see that the distributions of results for both pipelines of transforms are compact, with very little spread aside from outliers.<\/p>\n<div id=\"attachment_11021\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11021\" class=\"size-full wp-image-11021\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Sequences-of-Data-Transforms-on-the-Wine-Classification-Dataset.png\" alt=\"Box and Whisker Plot of Classification Accuracy for Different Sequences of Data Transforms on the Wine Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Sequences-of-Data-Transforms-on-the-Wine-Classification-Dataset.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Sequences-of-Data-Transforms-on-the-Wine-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Sequences-of-Data-Transforms-on-the-Wine-Classification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Box-and-Whisker-Plot-of-Classification-Accuracy-for-Different-Sequences-of-Data-Transforms-on-the-Wine-Classification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-11021\" class=\"wp-caption-text\">Box and Whisker Plot of Classification Accuracy for Different Sequences of Data Transforms on the Wine Classification Dataset<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/3aydNGf\">Feature Engineering and Selection<\/a>, 2019.<\/li>\n<li><a href=\"https:\/\/amzn.to\/2XZJNR2\">Feature Engineering for Machine Learning<\/a>, 2018.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">sklearn.pipeline.Pipeline API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use a grid search approach for data preparation with tabular data.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.<\/li>\n<li>How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.<\/li>\n<li>How to 
grid search sequences of data preparation methods to further improve model performance.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/grid-search-data-preparation-techniques\/\">How to Grid Search Data Preparation Techniques<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/grid-search-data-preparation-techniques\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/14\/how-to-grid-search-data-preparation-techniques\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3664,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3663"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3663"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3663\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3664"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3663"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3663"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3663"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}