{"id":3499,"date":"2020-05-26T19:00:52","date_gmt":"2020-05-26T19:00:52","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/26\/how-to-scale-data-with-outliers-for-machine-learning\/"},"modified":"2020-05-26T19:00:52","modified_gmt":"2020-05-26T19:00:52","slug":"how-to-scale-data-with-outliers-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/26\/how-to-scale-data-with-outliers-for-machine-learning\/","title":{"rendered":"How to Scale Data With Outliers for Machine Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.<\/p>\n<p>This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.<\/p>\n<p>Standardizing is a popular scaling technique that subtracts the mean from values and divides by the standard deviation, transforming the probability distribution for an input variable to a standard Gaussian (zero mean and unit variance). 
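<\/p>\n<p>As a quick illustration (a minimal NumPy sketch, not part of the original post), a single outlier can shift the mean and inflate the standard deviation that standardization relies on:<\/p>\n<pre class=\"crayon-plain-tag\"># sketch: one outlier distorts the mean and standard deviation\r\nfrom numpy import array\r\nvalues = array([1.0, 2.0, 3.0, 4.0, 5.0])\r\nwith_outlier = array([1.0, 2.0, 3.0, 4.0, 100.0])\r\n# mean and standard deviation without the outlier\r\nprint(values.mean(), values.std())\r\n# the outlier drags the mean to 22.0 and the standard deviation to about 39\r\nprint(with_outlier.mean(), with_outlier.std())\r\n# standardized values are dominated by the outlier\r\nprint((with_outlier - with_outlier.mean()) \/ with_outlier.std())<\/pre>\n<p>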
Standardization can become skewed or biased if the input variable contains outlier values.<\/p>\n<p>To overcome this, the median and interquartile range can be used when standardizing numerical input variables, generally referred to as robust scaling.<\/p>\n<p>In this tutorial, you will discover how to use robust scaler transforms to standardize numerical input variables for classification and regression.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Many machine learning algorithms prefer or perform better when numerical input variables are scaled.<\/li>\n<li>Robust scaling techniques that use percentiles can be used to scale numerical input variables that contain outliers.<\/li>\n<li>How to use the RobustScaler to scale numerical input variables using the median and interquartile range.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_10364\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10364\" class=\"size-full wp-image-10364\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Robust-Scaler-Transforms-for-Machine-Learning.jpg\" alt=\"How to Use Robust Scaler Transforms for Machine Learning\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Robust-Scaler-Transforms-for-Machine-Learning.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Robust-Scaler-Transforms-for-Machine-Learning-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Robust-Scaler-Transforms-for-Machine-Learning-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-10364\" class=\"wp-caption-text\">How to Use Robust Scaler Transforms for Machine Learning<br \/>Photo 
by <a href=\"https:\/\/flickr.com\/photos\/rayinmanila\/43941746162\/\">Ray in Manila<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Scaling Data<\/li>\n<li>Robust Scaler Transforms<\/li>\n<li>Sonar Dataset<\/li>\n<li>IQR Robust Scaler Transform<\/li>\n<li>Explore Robust Scaler Range<\/li>\n<\/ol>\n<h2>Robust Scaling Data<\/h2>\n<p>It is common to scale data prior to fitting a machine learning model.<\/p>\n<p>This is because data often consists of many different input variables or features (columns) and each may have a different range of values or units of measure, such as feet, miles, kilograms, dollars, etc.<\/p>\n<p>If there are input variables that have very large values relative to the other input variables, these large values can dominate or skew some machine learning algorithms. The result is that the algorithms pay most of their attention to the large values and ignore the variables with smaller values.<\/p>\n<p>This includes algorithms that use a weighted sum of inputs like linear regression, logistic regression, and artificial neural networks, as well as algorithms that use distance measures between examples, such as k-nearest neighbors and support vector machines.<\/p>\n<p>As such, it is normal to scale input variables to a common range as a data preparation technique prior to fitting a model.<\/p>\n<p>One approach to data scaling involves calculating the mean and standard deviation of each variable and using these values to scale the values to have a mean of zero and a standard deviation of one, a so-called &ldquo;<em>standard normal<\/em>&rdquo; probability distribution. 
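<\/p>\n<p>As a concrete sketch (illustrative code, not from the original post), this mean-and-standard-deviation scaling can be computed directly or with scikit-learn&rsquo;s StandardScaler, which gives the same result:<\/p>\n<pre class=\"crayon-plain-tag\"># sketch: standardize a variable to zero mean and unit variance\r\nfrom numpy import array\r\nfrom sklearn.preprocessing import StandardScaler\r\nvalues = array([[10.0], [20.0], [30.0], [40.0], [50.0]])\r\n# manual standardization: (value - mean) \/ stdev\r\nmanual = (values - values.mean()) \/ values.std()\r\n# the same transform via scikit-learn\r\nscaled = StandardScaler().fit_transform(values)\r\nprint(manual.ravel())\r\nprint(scaled.ravel())<\/pre>\n<p>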
This process is called standardization and is most useful when input variables have a Gaussian probability distribution.<\/p>\n<p>Standardization is calculated by subtracting the mean value and dividing by the standard deviation.<\/p>\n<ul>\n<li>value = (value &ndash; mean) \/ stdev<\/li>\n<\/ul>\n<p>Sometimes an input variable may have <a href=\"https:\/\/machinelearningmastery.com\/how-to-use-statistics-to-identify-outliers-in-data\/\">outlier values<\/a>. These are values on the edge of the distribution that may have a low probability of occurrence, yet are overrepresented for some reason. Outliers can skew a probability distribution and make data scaling using standardization difficult as the calculated mean and standard deviation will be skewed by the presence of the outliers.<\/p>\n<p>One approach to standardizing input variables in the presence of outliers is to ignore the outliers from the calculation of the mean and standard deviation, then use the calculated values to scale the variable.<\/p>\n<p>This is called robust standardization or <strong>robust data scaling<\/strong>.<\/p>\n<p>This can be achieved by calculating the median (50th percentile) and the 25th and 75th percentiles. 
The values of each variable then have their median subtracted and are divided by the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Interquartile_range\">interquartile range<\/a> (IQR), which is the difference between the 75th and 25th percentiles.<\/p>\n<ul>\n<li>value = (value &ndash; median) \/ (p75 &ndash; p25)<\/li>\n<\/ul>\n<p>The resulting variable has a zero median and an interquartile range of one. It is not skewed by outliers, and the outliers are still present with the same relative relationships to other values.<\/p>\n<h2>Robust Scaler Transforms<\/h2>\n<p>The robust scaler transform is available in the scikit-learn Python machine learning library via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.RobustScaler.html\">RobustScaler class<\/a>.<\/p>\n<p>The &ldquo;<em>with_centering<\/em>&rdquo; argument controls whether the value is centered to zero (median is subtracted) and defaults to <em>True<\/em>.<\/p>\n<p>The &ldquo;<em>with_scaling<\/em>&rdquo; argument controls whether the value is scaled to the IQR (interquartile range set to one) or not and defaults to <em>True<\/em>.<\/p>\n<p>Interestingly, the definition of the scaling range can be specified via the &ldquo;<em>quantile_range<\/em>&rdquo; argument. It takes a tuple of two numbers between 0 and 100 and defaults to the percentiles that define the IQR, specifically (25, 75). Changing this will change the definition of outliers and the scope of the scaling.<\/p>\n<p>We will take a closer look at how to use the robust scaler transform on a real dataset.<\/p>\n<p>First, let&rsquo;s introduce a real dataset.<\/p>\n<h2>Sonar Dataset<\/h2>\n<p>The sonar dataset is a standard machine learning dataset for binary classification.<\/p>\n<p>It involves 60 real-valued inputs and a two-class target variable. 
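<\/p>\n<p>As a small aside (illustrative code, not from the original post), the robust scaling described above can be demonstrated on a toy array with one outlier; the manual median-and-IQR calculation matches scikit-learn&rsquo;s RobustScaler with its default quantile_range of (25, 75):<\/p>\n<pre class=\"crayon-plain-tag\"># sketch: robust scaling of a small array with an outlier\r\nfrom numpy import array, median, percentile\r\nfrom sklearn.preprocessing import RobustScaler\r\nvalues = array([[1.0], [2.0], [3.0], [4.0], [100.0]])\r\n# manual robust scaling: (value - median) \/ (p75 - p25)\r\niqr = percentile(values, 75) - percentile(values, 25)\r\nmanual = (values - median(values)) \/ iqr\r\n# the same transform via scikit-learn\r\nscaled = RobustScaler().fit_transform(values)\r\n# both give a zero median; the outlier is still visible at 48.5\r\nprint(manual.ravel())\r\nprint(scaled.ravel())<\/pre>\n<p>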
There are 208 examples in the dataset and the classes are reasonably balanced.<\/p>\n<p>A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. <a href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Top performance<\/a> on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.<\/p>\n<p>The dataset describes sonar returns from rocks or simulated mines.<\/p>\n<p>You can learn more about the dataset here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\">Sonar Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.names\">Sonar Dataset Description<\/a><\/li>\n<\/ul>\n<p>There is no need to download the dataset; it is downloaded automatically as part of the worked examples.<\/p>\n<p>First, let&rsquo;s load and summarize the dataset. The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the sonar dataset\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataset.shape)\r\n# summarize each variable\r\nprint(dataset.describe())\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the shape of the loaded dataset.<\/p>\n<p>This confirms the 60 input variables, one output variable, and 208 rows of data.<\/p>\n<p>A statistical summary of the input variables is provided, showing that values are numeric and range approximately from 0 to 1.<\/p>\n<pre class=\"crayon-plain-tag\">(208, 61)\r\n               0           1           2   ...    
      57          58          59\r\ncount  208.000000  208.000000  208.000000  ...  208.000000  208.000000  208.000000\r\nmean     0.029164    0.038437    0.043832  ...    0.007949    0.007941    0.006507\r\nstd      0.022991    0.032960    0.038428  ...    0.006470    0.006181    0.005031\r\nmin      0.001500    0.000600    0.001500  ...    0.000300    0.000100    0.000600\r\n25%      0.013350    0.016450    0.018950  ...    0.003600    0.003675    0.003100\r\n50%      0.022800    0.030800    0.034300  ...    0.005800    0.006400    0.005300\r\n75%      0.035550    0.047950    0.057950  ...    0.010350    0.010325    0.008525\r\nmax      0.137100    0.233900    0.305900  ...    0.044000    0.036400    0.043900\r\n\r\n[8 rows x 60 columns]<\/pre>\n<p>Finally, a histogram is created for each input variable.<\/p>\n<p>If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.<\/p>\n<p>The dataset provides a good candidate for using a robust scaler transform to standardize the data in the presence of skewed distributions and outliers.<\/p>\n<div id=\"attachment_10360\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10360\" class=\"size-full wp-image-10360\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-2.png\" alt=\"Histogram Plots of Input Variables for the Sonar Binary Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-2.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-2-300x225.png 
300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-2-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-2-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10360\" class=\"wp-caption-text\">Histogram Plots of Input Variables for the Sonar Binary Classification Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s fit and evaluate a machine learning model on the raw dataset.<\/p>\n<p>We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified k-fold cross-validation<\/a>. The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the raw sonar dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define and configure the model\r\nmodel = KNeighborsClassifier()\r\n# evaluate the model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, 
n_jobs=-1, error_score='raise')\r\n# report model performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates a KNN model on the raw sonar dataset.<\/p>\n<p>We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ballpark of good performance (88 percent).<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.797 (0.073)<\/pre>\n<p>Next, let&rsquo;s explore a robust scaling transform of the dataset.<\/p>\n<h2>IQR Robust Scaler Transform<\/h2>\n<p>We can apply the robust scaler to the Sonar dataset directly.<\/p>\n<p>We will use the default configuration and scale values to the IQR. First, a RobustScaler instance is defined with default hyperparameters. Once defined, we can call the <em>fit_transform()<\/em> function and pass it our dataset to create a robust-scaled version of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform a robust scaler transform of the dataset\r\ntrans = RobustScaler()\r\ndata = trans.fit_transform(data)<\/pre>\n<p>Let&rsquo;s try it on our sonar dataset.<\/p>\n<p>The complete example of creating a robust scaler transform of the sonar dataset and plotting histograms of the result is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># visualize a robust scaler transform of the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom sklearn.preprocessing import RobustScaler\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# retrieve just the numeric input values\r\ndata = dataset.values[:, :-1]\r\n# perform a robust scaler transform of the dataset\r\ntrans = RobustScaler()\r\ndata = trans.fit_transform(data)\r\n# convert the array back to a 
dataframe\r\ndataset = DataFrame(data)\r\n# summarize\r\nprint(dataset.describe())\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example first reports a summary of each input variable.<\/p>\n<p>We can see that the distributions have been adjusted. The median values are now zero and the standard deviation values are now close to 1.0.<\/p>\n<pre class=\"crayon-plain-tag\">0           1   ...            58          59\r\ncount  208.000000  208.000000  ...  2.080000e+02  208.000000\r\nmean     0.286664    0.242430  ...  2.317814e-01    0.222527\r\nstd      1.035627    1.046347  ...  9.295312e-01    0.927381\r\nmin     -0.959459   -0.958730  ... -9.473684e-01   -0.866359\r\n25%     -0.425676   -0.455556  ... -4.097744e-01   -0.405530\r\n50%      0.000000    0.000000  ...  6.591949e-17    0.000000\r\n75%      0.574324    0.544444  ...  5.902256e-01    0.594470\r\nmax      5.148649    6.447619  ...  4.511278e+00    7.115207\r\n\r\n[8 rows x 60 columns]<\/pre>\n<p>Histogram plots of the variables are created, although the distributions don&rsquo;t look much different from their original distributions seen in the previous section.<\/p>\n<div id=\"attachment_10362\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10362\" class=\"size-full wp-image-10362\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Robust-Scaler-Transformed-Input-Variables-for-the-Sonar-Dataset.png\" alt=\"Histogram Plots of Robust Scaler Transformed Input Variables for the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Robust-Scaler-Transformed-Input-Variables-for-the-Sonar-Dataset.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Robust-Scaler-Transformed-Input-Variables-for-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Robust-Scaler-Transformed-Input-Variables-for-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Robust-Scaler-Transformed-Input-Variables-for-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10362\" class=\"wp-caption-text\">Histogram Plots of Robust Scaler Transformed Input Variables for the Sonar Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s evaluate the same KNN model as the previous section, but in this case on a robust scaler transform of the dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the sonar dataset with robust scaler transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import RobustScaler\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the pipeline\r\ntrans = RobustScaler(with_centering=False, 
with_scaling=True)\r\nmodel = KNeighborsClassifier()\r\npipeline = Pipeline(steps=[('t', trans), ('m', model)])\r\n# evaluate the pipeline\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report pipeline performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example, we can see that the robust scaler transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.9 percent with the transform.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.819 (0.076)<\/pre>\n<p>Next, let&rsquo;s explore the effect of different scaling ranges.<\/p>\n<h2>Explore Robust Scaler Range<\/h2>\n<p>The range used to scale each variable is chosen by default as the IQR, which is bounded by the 25th and 75th percentiles.<\/p>\n<p>This is specified by the &ldquo;<em>quantile_range<\/em>&rdquo; argument as a tuple.<\/p>\n<p>Other values can be specified and might improve the performance of the model, such as a wider range, allowing fewer values to be considered outliers, or a narrower range, allowing more values to be considered outliers.<\/p>\n<p>The example below explores the effect of different definitions of the range, from the 1st to the 99th percentiles down to the 30th to the 70th percentiles.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># explore the scaling range of the robust scaler transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import RobustScaler\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n\r\n# get the dataset\r\ndef 
get_dataset():\r\n\t# load dataset\r\n\turl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\n\tdataset = read_csv(url, header=None)\r\n\tdata = dataset.values\r\n\t# separate into input and output columns\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# ensure inputs are floats and output is an integer label\r\n\tX = X.astype('float32')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y\r\n\r\n# get a list of models to evaluate\r\ndef get_models():\r\n\tmodels = dict()\r\n\tfor value in [1, 5, 10, 15, 20, 25, 30]:\r\n\t\t# define the pipeline\r\n\t\ttrans = RobustScaler(quantile_range=(value, 100-value))\r\n\t\tmodel = KNeighborsClassifier()\r\n\t\tmodels[str(value)] = Pipeline(steps=[('t', trans), ('m', model)])\r\n\treturn models\r\n\r\n# evaluate a given model using cross-validation\r\ndef evaluate_model(model):\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n\treturn scores\r\n\r\n# define dataset\r\nX, y = get_dataset()\r\n# get the models to evaluate\r\nmodels = get_models()\r\n# evaluate the models and store results\r\nresults, names = list(), list()\r\nfor name, model in models.items():\r\n\tscores = evaluate_model(model)\r\n\tresults.append(scores)\r\n\tnames.append(name)\r\n\tprint('&gt;%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\r\n# plot model performance for comparison\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example reports the mean classification accuracy for each configured quantile range.<\/p>\n<p>We can see that the default of the 25th to 75th percentiles achieves the best results, although the 1-99 and 30-70 ranges achieve results that are very similar.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;1 0.818 (0.069)\r\n&gt;5 0.813 (0.085)\r\n&gt;10 0.812 (0.076)\r\n&gt;15 0.811 (0.081)\r\n&gt;20 0.811 
(0.080)\r\n&gt;25 0.819 (0.076)\r\n&gt;30 0.816 (0.072)<\/pre>\n<p>Box and whisker plots are created to summarize the classification accuracy scores for each IQR range.<\/p>\n<p>We can see a slight lift in the distribution and mean accuracy with the narrower ranges of 25-75 and 30-70 percentiles.<\/p>\n<div id=\"attachment_10361\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10361\" class=\"size-full wp-image-10361\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Robust-Scaler-IQR-Range-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset.png\" alt=\"Box Plots of Robust Scaler IQR Range vs Classification Accuracy of KNN on the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Robust-Scaler-IQR-Range-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Robust-Scaler-IQR-Range-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Robust-Scaler-IQR-Range-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Robust-Scaler-IQR-Range-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10361\" class=\"wp-caption-text\">Box Plots of Robust Scaler IQR Range vs Classification Accuracy of KNN on the Sonar Dataset<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a 
href=\"https:\/\/machinelearningmastery.com\/how-to-use-statistics-to-identify-outliers-in-data\/\">How to Use Statistics to Identify Outliers in Data<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing-scaler\">Standardization, or mean removal and variance scaling, scikit-learn<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.RobustScaler.html\">sklearn.preprocessing.RobustScaler API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Interquartile_range\">Interquartile range, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use robust scaler transforms to standardize numerical input variables for classification and regression.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Many machine learning algorithms prefer or perform better when numerical input variables are scaled.<\/li>\n<li>Robust scaling techniques that use percentiles can be used to scale numerical input variables that contain outliers.<\/li>\n<li>How to use the RobustScaler to scale numerical input variables using the median and interquartile range.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/robust-scaler-transforms-for-machine-learning\/\">How to Scale Data With Outliers for Machine Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/robust-scaler-transforms-for-machine-learning\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Many machine learning algorithms perform better when numerical input 
variables are scaled to a standard range. This includes algorithms that use a [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/26\/how-to-scale-data-with-outliers-for-machine-learning\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3500,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3499"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3499"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3499\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3500"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3499"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3499"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3499"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}