{"id":3688,"date":"2020-07-21T19:00:15","date_gmt":"2020-07-21T19:00:15","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/21\/how-to-selectively-scale-numerical-input-variables-for-machine-learning\/"},"modified":"2020-07-21T19:00:15","modified_gmt":"2020-07-21T19:00:15","slug":"how-to-selectively-scale-numerical-input-variables-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/21\/how-to-selectively-scale-numerical-input-variables-for-machine-learning\/","title":{"rendered":"How to Selectively Scale Numerical Input Variables for Machine Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Many machine learning models perform better when input variables are carefully transformed or scaled prior to modeling.<\/p>\n<p>It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. 
Nevertheless, better results may be achieved by carefully <strong>selecting which data transform to apply to each input variable<\/strong> prior to modeling.<\/p>\n<p>In this tutorial, you will discover how to apply selective scaling of numerical input variables.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to load and calculate a baseline predictive performance for the diabetes classification dataset.<\/li>\n<li>How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.<\/li>\n<li>How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.<\/li>\n<\/ul>\n<p>Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more <a href=\"https:\/\/machinelearningmastery.com\/data-preparation-for-machine-learning\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_11054\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11054\" class=\"size-full wp-image-11054\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/07\/How-to-Selectively-Scale-Numerical-Input-Variables-for-Machine-Learning.jpg\" alt=\"How to Selectively Scale Numerical Input Variables for Machine Learning\" width=\"800\" height=\"534\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/How-to-Selectively-Scale-Numerical-Input-Variables-for-Machine-Learning.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/How-to-Selectively-Scale-Numerical-Input-Variables-for-Machine-Learning-300x200.jpg 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/How-to-Selectively-Scale-Numerical-Input-Variables-for-Machine-Learning-768x513.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-11054\" class=\"wp-caption-text\">How to Selectively Scale Numerical Input Variables for Machine Learning<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/160866001@N07\/46824499581\/\">Marco Verch<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Diabetes Numerical Dataset<\/li>\n<li>Non-Selective Scaling of Numerical Inputs\n<ol>\n<li>Normalize All Input Variables<\/li>\n<li>Standardize All Input Variables<\/li>\n<\/ol>\n<\/li>\n<li>Selective Scaling of Numerical Inputs\n<ol>\n<li>Normalize Only Non-Gaussian Input Variables<\/li>\n<li>Standardize Only Gaussian-Like Input Variables<\/li>\n<li>Selectively Normalize and Standardize Input Variables<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Diabetes Numerical Dataset<\/h2>\n<p>As the basis of this tutorial, we will use the so-called &ldquo;diabetes&rdquo; dataset that has been widely studied as a machine learning dataset since the 1990s.<\/p>\n<p>The dataset classifies patients&rsquo; data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. 
It is a binary classification problem.<\/p>\n<p>You can learn more about the dataset here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\">Diabetes Dataset (pima-indians-diabetes.csv)<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.names\">Diabetes Dataset Description (pima-indians-diabetes.names)<\/a><\/li>\n<\/ul>\n<p>No need to download the dataset; we will download it automatically as part of the worked examples that follow.<\/p>\n<p>Looking at the data, we can see that all nine variables are numerical.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">6,148,72,35,0,33.6,0.627,50,1\r\n1,85,66,29,0,26.6,0.351,31,0\r\n8,183,64,0,0,23.3,0.672,32,1\r\n1,89,66,23,94,28.1,0.167,21,0\r\n0,137,40,35,168,43.1,2.288,33,1\r\n...<\/pre>\n<p>We can load this dataset into memory using the Pandas library.<\/p>\n<p>The example below downloads and summarizes the diabetes dataset.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># load and summarize the diabetes dataset\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\"\r\ndataset = read_csv(url, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataset.shape)\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example first downloads the dataset and loads it as a DataFrame.<\/p>\n<p>The shape of the dataset is printed, confirming 768 rows and nine variables: eight input variables and one target variable.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">(768, 9)<\/pre>\n<p>Finally, a plot is created showing a histogram for each variable in the dataset.<\/p>\n<p>This is useful as we can see that some variables have a Gaussian or 
Gaussian-like distribution (1, 2, 5) and others have an exponential-like distribution (0, 3, 4, 6, 7). This may suggest the need for different numerical data transforms for the different types of input variables.<\/p>\n<div id=\"attachment_11052\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11052\" class=\"size-full wp-image-11052\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Each-Variable-in-the-Diabetes-Classification-Dataset.png\" alt=\"Histogram of Each Variable in the Diabetes Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Each-Variable-in-the-Diabetes-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Each-Variable-in-the-Diabetes-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Each-Variable-in-the-Diabetes-Classification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Each-Variable-in-the-Diabetes-Classification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-11052\" class=\"wp-caption-text\">Histogram of Each Variable in the Diabetes Classification Dataset<\/p>\n<\/div>\n<p>Now that we are a little familiar with the dataset, let&rsquo;s try fitting and evaluating a model on the raw dataset.<\/p>\n<p>We will use a logistic regression model, as it is a robust and effective linear model for binary classification tasks. 
We will evaluate the model using repeated stratified k-fold cross-validation, a best practice, and use 10 folds and three repeats.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a logistic regression model on the raw diabetes dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.linear_model import LogisticRegression\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\n# separate into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# minimally prepare dataset\r\nX = X.astype('float')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the model\r\nmodel = LogisticRegression(solver='liblinear')\r\n# define the evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate the model\r\nm_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize the result\r\nprint('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example evaluates the model and reports the mean and standard deviation accuracy for fitting a logistic regression model on the raw dataset.<\/p>\n<p>Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. 
Try running the example a few times.<\/p>\n<p>In this case, we can see that the model achieved an accuracy of about 76.8 percent.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Accuracy: 0.768 (0.040)<\/pre>\n<p>Now that we have established a performance baseline on the dataset, let&rsquo;s see if we can improve the performance using data scaling.<\/p>\n<h2>Non-Selective Scaling of Numerical Inputs<\/h2>\n<p>Many algorithms prefer or require that input variables are scaled to a consistent range prior to fitting a model.<\/p>\n<p>This includes the logistic regression model, which assumes the input variables have a Gaussian probability distribution. Standardizing the input variables may also produce a more numerically stable model. 
Nevertheless, even when these expectations are violated, logistic regression can perform well, or even best, on a given dataset, as may be the case for the diabetes dataset.<\/p>\n<p>Two common techniques for scaling numerical input variables are normalization and standardization.<\/p>\n<p>Normalization scales each input variable to the range 0-1 and can be implemented using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">MinMaxScaler<\/a> class in scikit-learn. Standardization scales each input variable to have a mean of 0.0 and a standard deviation of 1.0 and can be implemented using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.StandardScaler.html\">StandardScaler<\/a> class in scikit-learn.<\/p>\n<p>To learn more about normalization, standardization, and how to use these methods in scikit-learn, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/standardscaler-and-minmaxscaler-transforms-in-python\/\">How to Use StandardScaler and MinMaxScaler Transforms in Python<\/a><\/li>\n<\/ul>\n<p>A naive approach to data scaling applies a single transform to all input variables, regardless of their scale or probability distribution. This is often effective.<\/p>\n<p>Let&rsquo;s try normalizing and standardizing all input variables directly and compare the performance to the baseline logistic regression model fit on the raw data.<\/p>\n<h3>Normalize All Input Variables<\/h3>\n<p>We can update the baseline code example to use a modeling pipeline where the first step is to apply a scaler and the final step is to fit the model.<\/p>\n<p>This ensures that the scaling operation is fit or prepared on the training set only and then applied to the train and test sets during the cross-validation process, avoiding data leakage. 
Data leakage can result in an optimistically biased estimate of model performance.<\/p>\n<p>This can be achieved using the Pipeline class, where each step in the pipeline is defined as a tuple with a name and the instance of the transform or model to use.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the modeling pipeline\r\nscaler = MinMaxScaler()\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',scaler),('m',model)])<\/pre>\n<p>Tying this together, the complete example of evaluating a logistic regression model on the diabetes dataset with all input variables normalized is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a logistic regression model on the normalized diabetes dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.preprocessing import MinMaxScaler\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\n# separate into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# minimally prepare dataset\r\nX = X.astype('float')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the modeling pipeline\r\nmodel = LogisticRegression(solver='liblinear')\r\nscaler = MinMaxScaler()\r\npipeline = Pipeline([('s',scaler),('m',model)])\r\n# define the evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate the model\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize the result\r\nprint('Accuracy: %.3f (%.3f)' % 
(mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the normalized dataset.<\/p>\n<p>Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.<\/p>\n<p>In this case, we can see that the normalization of the input variables has resulted in a drop in the mean classification accuracy from 76.8 percent with a model fit on the raw data to about 76.4 percent for the pipeline with normalization.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Accuracy: 0.764 (0.045)<\/pre>\n<p>Next, let&rsquo;s try standardizing all input variables.<\/p>\n<h3>Standardize All Input Variables<\/h3>\n<p>We can update the modeling pipeline to use standardization instead of normalization for all input variables prior to fitting and evaluating the logistic regression model.<\/p>\n<p>This might be an appropriate transform for those input variables with a Gaussian-like distribution, but perhaps not the other variables.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the modeling pipeline\r\nscaler = StandardScaler()\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',scaler),('m',model)])<\/pre>\n<p>Tying this together, the complete example of evaluating a logistic regression model on the diabetes dataset with all input variables standardized is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a logistic regression model on the standardized diabetes dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import 
RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.preprocessing import StandardScaler\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\n# separate into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# minimally prepare dataset\r\nX = X.astype('float')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the modeling pipeline\r\nscaler = StandardScaler()\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',scaler),('m',model)])\r\n# define the evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate the model\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize the result\r\nprint('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the standardized dataset.<\/p>\n<p>Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. 
Try running the example a few times.<\/p>\n<p>In this case, we can see that standardizing all numerical input variables has resulted in a lift in mean classification accuracy from 76.8 percent with a model evaluated on the raw dataset to about 77.2 percent for a model evaluated on the dataset with standardized input variables.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Accuracy: 0.772 (0.043)<\/pre>\n<p>So far, we have learned that normalizing all input variables does not help performance, but standardizing them does.<\/p>\n<p>Next, let&rsquo;s explore whether selectively applying scaling to the input variables can offer further improvement.<\/p>\n<h2>Selective Scaling of Numerical Inputs<\/h2>\n<p>Data transforms can be applied selectively to input variables using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">ColumnTransformer class in scikit-learn<\/a>.<\/p>\n<p>It allows you to specify the transform (or pipeline of transforms) to apply and the column indexes to apply them to. 
This can then be used as part of a modeling pipeline and evaluated using cross-validation.<\/p>\n<p>You can learn more about how to use the ColumnTransformer in the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/columntransformer-for-numerical-and-categorical-data\/\">How to Use the ColumnTransformer for Data Preparation<\/a><\/li>\n<\/ul>\n<p>We can explore using the ColumnTransformer to selectively apply normalization and standardization to the numerical input variables of the diabetes dataset in order to see if we can achieve further performance improvements.<\/p>\n<h3>Normalize Only Non-Gaussian Input Variables<\/h3>\n<p>First, let&rsquo;s try normalizing just those input variables that do not have a Gaussian-like probability distribution and leave the rest of the input variables in their raw state.<\/p>\n<p>We can define two groups of input variables using the column indexes, one for the variables with a Gaussian-like distribution, and one for the input variables with an exponential-like distribution.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define column indexes for the variables with \"normal\" and \"exponential\" distributions\r\nnorm_ix = [1, 2, 5]\r\nexp_ix = [0, 3, 4, 6, 7]<\/pre>\n<p>We can then selectively normalize the &ldquo;<em>exp_ix<\/em>&rdquo; group and let the other input variables pass through without any data preparation.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the selective transforms\r\nt = [('e', MinMaxScaler(), exp_ix)]\r\nselective = ColumnTransformer(transformers=t, remainder='passthrough')<\/pre>\n<p>The selective transform can then be used as part of our modeling pipeline.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the modeling pipeline\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',selective),('m',model)])<\/pre>\n<p>Tying this together, the complete example of evaluating a 
logistic regression model on data with selective normalization of some input variables is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a logistic regression model on the diabetes dataset with selective normalization\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import ColumnTransformer\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\n# separate into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# minimally prepare dataset\r\nX = X.astype('float')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define column indexes for the variables with \"normal\" and \"exponential\" distributions\r\nnorm_ix = [1, 2, 5]\r\nexp_ix = [0, 3, 4, 6, 7]\r\n# define the selective transforms\r\nt = [('e', MinMaxScaler(), exp_ix)]\r\nselective = ColumnTransformer(transformers=t, remainder='passthrough')\r\n# define the modeling pipeline\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',selective),('m',model)])\r\n# define the evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate the model\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize the result\r\nprint('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.<\/p>\n<p>Your specific results may differ given the 
stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.<\/p>\n<p>In this case, we can see slightly better performance, increasing mean accuracy from 76.8 percent for the baseline model fit on the raw dataset to about 76.9 percent with selective normalization of some input variables.<\/p>\n<p>The results are not as good as standardizing all input variables, though.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Accuracy: 0.769 (0.043)<\/pre>\n<h3>Standardize Only Gaussian-Like Input Variables<\/h3>\n<p>We can repeat the experiment from the previous section, although in this case we will selectively standardize those input variables that have a Gaussian-like distribution and leave the remaining input variables untouched.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the selective transforms\r\nt = [('n', StandardScaler(), norm_ix)]\r\nselective = ColumnTransformer(transformers=t, remainder='passthrough')<\/pre>\n<p>Tying this together, the complete example of evaluating a logistic regression model on data with selective standardization of some input variables is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a logistic regression model on the diabetes dataset with selective standardization\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.compose import ColumnTransformer\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = 
read_csv(url, header=None)\r\ndata = dataframe.values\r\n# separate into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# minimally prepare dataset\r\nX = X.astype('float')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define column indexes for the variables with \"normal\" and \"exponential\" distributions\r\nnorm_ix = [1, 2, 5]\r\nexp_ix = [0, 3, 4, 6, 7]\r\n# define the selective transforms\r\nt = [('n', StandardScaler(), norm_ix)]\r\nselective = ColumnTransformer(transformers=t, remainder='passthrough')\r\n# define the modeling pipeline\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',selective),('m',model)])\r\n# define the evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate the model\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize the result\r\nprint('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.<\/p>\n<p>Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.<\/p>\n<p>In this case, we can see that we achieved a lift in performance over both the baseline model fit on the raw dataset (76.8 percent) and the standardization of all input variables (77.2 percent). 
With selective standardization, we have achieved a mean accuracy of about 77.3 percent, a modest but measurable bump.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Accuracy: 0.773 (0.041)<\/pre>\n<h3>Selectively Normalize and Standardize Input Variables<\/h3>\n<p>The results so far raise the question of whether we can get a further lift by combining selective normalization and standardization on the dataset at the same time.<\/p>\n<p>This can be achieved by defining both transforms and their respective column indexes for the ColumnTransformer class, with no remaining variables being passed through.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the selective transforms\r\nt = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]\r\nselective = ColumnTransformer(transformers=t)<\/pre>\n<p>Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization and standardization of the input variables is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># evaluate a logistic regression model on the diabetes dataset with selective scaling\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.compose import ColumnTransformer\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\ndata = dataframe.values\r\n# separate into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# minimally prepare dataset\r\nX 
= X.astype('float')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define column indexes for the variables with \"normal\" and \"exponential\" distributions\r\nnorm_ix = [1, 2, 5]\r\nexp_ix = [0, 3, 4, 6, 7]\r\n# define the selective transforms\r\nt = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]\r\nselective = ColumnTransformer(transformers=t)\r\n# define the modeling pipeline\r\nmodel = LogisticRegression(solver='liblinear')\r\npipeline = Pipeline([('s',selective),('m',model)])\r\n# define the evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate the model\r\nm_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n# summarize the result\r\nprint('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))<\/pre>\n<p>Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.<\/p>\n<p>Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. 
Try running the example a few times.<\/p>\n<p>In this case, interestingly, we can see that we have achieved the same performance as standardizing all input variables, at 77.2 percent.<\/p>\n<p>Further, the results suggest that the chosen model performs better when the non-Gaussian-like variables are left as-is than when they are standardized or normalized.<\/p>\n<p>I would not have guessed at this finding, which highlights the importance of careful experimentation.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Accuracy: 0.772 (0.040)<\/pre>\n<p><strong>Can you do better?<\/strong><\/p>\n<p>Try other transforms or combinations of transforms and see if you can achieve better results.<br \/>\nShare your findings in the comments below.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Best Results for Standard Machine Learning Datasets<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/columntransformer-for-numerical-and-categorical-data\/\">How to Use the ColumnTransformer for Data Preparation<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/standardscaler-and-minmaxscaler-transforms-in-python\/\">How to Use StandardScaler and MinMaxScaler Transforms in Python<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">sklearn.compose.ColumnTransformer API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to apply selective scaling of numerical input variables.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to load and calculate a baseline predictive performance for the diabetes classification dataset.<\/li>\n<li>How to evaluate modeling pipelines with data transforms applied 
blindly to all numerical input variables.<\/li>\n<li>How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/selectively-scale-numerical-input-variables-for-machine-learning\/\">How to Selectively Scale Numerical Input Variables for Machine Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Many machine learning models perform better when input variables are carefully transformed or scaled prior to modeling. 
It is convenient, and therefore [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/07\/21\/how-to-selectively-scale-numerical-input-variables-for-machine-learning\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3689,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3688"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3688"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3688\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3689"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3688"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3688"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3688"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}