{"id":3474,"date":"2020-05-19T19:00:26","date_gmt":"2020-05-19T19:00:26","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/19\/how-to-use-quantile-transforms-for-machine-learning\/"},"modified":"2020-05-19T19:00:26","modified_gmt":"2020-05-19T19:00:26","slug":"how-to-use-quantile-transforms-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/19\/how-to-use-quantile-transforms-for-machine-learning\/","title":{"rendered":"How to Use Quantile Transforms for Machine Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Numerical input variables may have a highly skewed or non-standard distribution.<\/p>\n<p>This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.<\/p>\n<p>Many machine learning algorithms prefer or perform better when numerical input variables and even output variables in the case of regression have a standard probability distribution, such as a Gaussian (normal) or a uniform distribution.<\/p>\n<p>The quantile transform provides an automatic way to transform a numeric input variable to have a different data distribution, which in turn, can be used as input to a predictive model.<\/p>\n<p>In this tutorial, you will discover how to use quantile transforms to change the distribution of numeric variables for machine learning.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.<\/li>\n<li>Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.<\/li>\n<li>How to use the QuantileTransformer to change the probability distribution of numeric variables to improve the performance of predictive models.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div 
id=\"attachment_10332\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10332\" class=\"size-full wp-image-10332\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Quantile-Transforms-for-Machine-Learning.jpg\" alt=\"How to Use Quantile Transforms for Machine Learning\" width=\"799\" height=\"453\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Quantile-Transforms-for-Machine-Learning.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Quantile-Transforms-for-Machine-Learning-300x170.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Quantile-Transforms-for-Machine-Learning-768x435.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-10332\" class=\"wp-caption-text\">How to Use Quantile Transforms for Machine Learning<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/volvob12b\/44164103752\/\">Bernard Spragg. NZ<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Change Data Distribution<\/li>\n<li>Quantile Transforms<\/li>\n<li>Sonar Dataset<\/li>\n<li>Normal Quantile Transform<\/li>\n<li>Uniform Quantile Transform<\/li>\n<\/ol>\n<h2>Change Data Distribution<\/h2>\n<p>Many machine learning algorithms perform better when the distribution of variables is Gaussian.<\/p>\n<p>Recall that the observations for each variable may be thought to be drawn from a probability distribution. The Gaussian is a common distribution with the familiar bell shape. 
It is so common that it is often referred to as the &ldquo;<em>normal<\/em>&rdquo; distribution.<\/p>\n<p>For more on the Gaussian probability distribution, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Continuous Probability Distributions for Machine Learning<\/a><\/li>\n<\/ul>\n<p>Some algorithms, like <a href=\"https:\/\/machinelearningmastery.com\/implement-linear-regression-stochastic-gradient-descent-scratch-python\/\">linear regression<\/a> and <a href=\"https:\/\/machinelearningmastery.com\/logistic-regression-for-machine-learning\/\">logistic regression<\/a>, explicitly assume the real-valued variables have a Gaussian distribution. Other nonlinear algorithms may not have this assumption, yet often perform better when variables have a Gaussian distribution.<\/p>\n<p>This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.<\/p>\n<p>Some input variables may have a highly skewed distribution, such as an exponential distribution where the most common observations are bunched together. 
Some input variables may have outliers that cause the distribution to be highly spread.<\/p>\n<p>These concerns and others, like non-standard distributions and multi-modal distributions, can make a dataset challenging to model with a range of machine learning models.<\/p>\n<p>As such, it is often desirable to transform each input variable to have a standard probability distribution, such as a Gaussian (normal) distribution or a uniform distribution.<\/p>\n<h2>Quantile Transforms<\/h2>\n<p>A quantile transform will map a variable&rsquo;s probability distribution to another probability distribution.<\/p>\n<p>Recall that a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Quantile_function\">quantile function<\/a>, also called a percent-point function (PPF), is the inverse of the cumulative distribution function (CDF). A CDF is a function that returns the probability of observing a value at or below a given value. The PPF is the inverse of this function and returns the value at or below which observations fall with a given probability.<\/p>\n<p>The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.<\/p>\n<p>The transformation can be applied to each numeric input variable in the training dataset and then provided as input to a machine learning model to learn a predictive modeling task.<\/p>\n<p>This quantile transform is available in the scikit-learn Python machine learning library via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.QuantileTransformer.html\">QuantileTransformer class<\/a>.<\/p>\n<p>The class has an &ldquo;<em>output_distribution<\/em>&rdquo; argument that can be set to &ldquo;<em>uniform<\/em>&rdquo; or &ldquo;<em>normal<\/em>&rdquo; and defaults to &ldquo;<em>uniform<\/em>&rdquo;.<\/p>\n<p>It also provides an &ldquo;<em>n_quantiles<\/em>&rdquo; argument that determines the resolution of the mapping or ranking of the observations in the dataset. 
This should be set to a value no greater than the number of observations in the dataset and defaults to 1,000.<\/p>\n<p>We can demonstrate the <em>QuantileTransformer<\/em>&nbsp;with a small worked example. We can generate a sample of&nbsp;<a href=\"https:\/\/machinelearningmastery.com\/how-to-generate-random-numbers-in-python\/\">random Gaussian numbers<\/a> and impose a skew on the distribution by taking the exponential of each value. The <em>QuantileTransformer<\/em> can then be used to transform the dataset to be another distribution, in this case back to a Gaussian distribution.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># demonstration of the quantile transform\r\nfrom numpy import exp\r\nfrom numpy.random import randn\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom matplotlib import pyplot\r\n# generate gaussian data sample\r\ndata = randn(1000)\r\n# add a skew to the data distribution\r\ndata = exp(data)\r\n# histogram of the raw data with a skew\r\npyplot.hist(data, bins=25)\r\npyplot.show()\r\n# reshape data to have rows and columns\r\ndata = data.reshape((len(data),1))\r\n# quantile transform the raw data\r\nquantile = QuantileTransformer(output_distribution='normal')\r\ndata_trans = quantile.fit_transform(data)\r\n# histogram of the transformed data\r\npyplot.hist(data_trans, bins=25)\r\npyplot.show()<\/pre>\n<p>Running the example first creates a sample of 1,000 random Gaussian values and adds a skew to the dataset.<\/p>\n<p>A histogram is created from the skewed dataset and clearly shows the distribution pushed to the far left.<\/p>\n<div id=\"attachment_10929\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10929\" class=\"size-full wp-image-10929\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Skewed-Gaussian-Distribution2.png\" alt=\"Histogram of Skewed Gaussian Distribution\" 
width=\"640\" height=\"480\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Skewed-Gaussian-Distribution2.png 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Skewed-Gaussian-Distribution2-300x225.png 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p id=\"caption-attachment-10929\" class=\"wp-caption-text\">Histogram of Skewed Gaussian Distribution<\/p>\n<\/div>\n<p>Then a <em>QuantileTransformer<\/em> is used to map the data distribution to a Gaussian and standardize the result, centering the values on a mean of 0 with a standard deviation of 1.0.<\/p>\n<p>A histogram of the transformed data is created, showing a Gaussian-shaped data distribution.<\/p>\n<div id=\"attachment_10930\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10930\" class=\"size-full wp-image-10930\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Skewed-Gaussian-Data-After-Quantile-Transform.png\" alt=\"Histogram of Skewed Gaussian Data After Quantile Transform\" width=\"640\" height=\"480\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Skewed-Gaussian-Data-After-Quantile-Transform.png 640w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Skewed-Gaussian-Data-After-Quantile-Transform-300x225.png 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p id=\"caption-attachment-10930\" class=\"wp-caption-text\">Histogram of Skewed Gaussian Data After Quantile Transform<\/p>\n<\/div>\n<p>In the following sections, we will take a closer look at how to use the quantile transform on a real dataset.<\/p>\n<p>Next, let&rsquo;s introduce the dataset.<\/p>\n<h2>Sonar Dataset<\/h2>\n<p>The sonar dataset is a standard 
machine learning dataset for binary classification.<\/p>\n<p>It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.<\/p>\n<p>A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. <a href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Top performance<\/a> on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.<\/p>\n<p>The dataset describes sonar returns of rocks or simulated mines.<\/p>\n<p>You can learn more about the dataset from here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\">Sonar Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.names\">Sonar Dataset Description<\/a><\/li>\n<\/ul>\n<p>No need to download the dataset; we will download it automatically as part of our worked examples.<\/p>\n<p>First, let&rsquo;s load and summarize the dataset. 
The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas.plotting import scatter_matrix\r\nfrom matplotlib import pyplot\r\n# Load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataset.shape)\r\n# summarize each variable\r\nprint(dataset.describe())\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the shape of the loaded dataset.<\/p>\n<p>This confirms the 60 input variables, one output variable, and 208 rows of data.<\/p>\n<p>A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.<\/p>\n<pre class=\"crayon-plain-tag\">(208, 61)\r\n               0           1           2   ...          57          58          59\r\ncount  208.000000  208.000000  208.000000  ...  208.000000  208.000000  208.000000\r\nmean     0.029164    0.038437    0.043832  ...    0.007949    0.007941    0.006507\r\nstd      0.022991    0.032960    0.038428  ...    0.006470    0.006181    0.005031\r\nmin      0.001500    0.000600    0.001500  ...    0.000300    0.000100    0.000600\r\n25%      0.013350    0.016450    0.018950  ...    0.003600    0.003675    0.003100\r\n50%      0.022800    0.030800    0.034300  ...    0.005800    0.006400    0.005300\r\n75%      0.035550    0.047950    0.057950  ...    0.010350    0.010325    0.008525\r\nmax      0.137100    0.233900    0.305900  ...    
0.044000    0.036400    0.043900\r\n\r\n[8 rows x 60 columns]<\/pre>\n<p>Finally, a histogram is created for each input variable.<\/p>\n<p>If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.<\/p>\n<p>The dataset provides a good candidate for using a quantile transform to make the variables more Gaussian.<\/p>\n<div id=\"attachment_10328\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10328\" class=\"size-full wp-image-10328\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset.png\" alt=\"Histogram Plots of Input Variables for the Sonar Binary Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10328\" class=\"wp-caption-text\">Histogram Plots of Input Variables for the Sonar Binary Classification Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s fit and evaluate a machine learning model on the raw dataset.<\/p>\n<p>We will use a k-nearest neighbor algorithm with default 
hyperparameters and evaluate it using <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified k-fold cross-validation<\/a>. The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the raw sonar dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define and configure the model\r\nmodel = KNeighborsClassifier()\r\n# evaluate the model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report model performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates a KNN model on the raw sonar dataset.<\/p>\n<p>We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ball-park of good performance (88 percent).<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.797 (0.073)<\/pre>\n<p>Next, let&rsquo;s explore a normal quantile transform of the dataset.<\/p>\n<h2>Normal Quantile Transform<\/h2>\n<p>It is often desirable to transform an input variable to have a normal probability distribution to improve the modeling performance.<\/p>\n<p>We 
can apply the quantile transform using the <em>QuantileTransformer<\/em> class and set the &ldquo;<em>output_distribution<\/em>&rdquo; argument to &ldquo;<em>normal<\/em>&rdquo;. We must also set the &ldquo;<em>n_quantiles<\/em>&rdquo; argument to a value no greater than the number of observations in the training dataset, in this case, 100.<\/p>\n<p>Once defined, we can call the <em>fit_transform()<\/em> function and pass it our dataset to create a quantile transformed version of our dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform a normal quantile transform of the dataset\r\ntrans = QuantileTransformer(n_quantiles=100, output_distribution='normal')\r\ndata = trans.fit_transform(data)<\/pre>\n<p>Let&rsquo;s try it on our sonar dataset.<\/p>\n<p>The complete example of creating a normal quantile transform of the sonar dataset and plotting histograms of the result is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># visualize a normal quantile transform of the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom pandas.plotting import scatter_matrix\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# retrieve just the numeric input values\r\ndata = dataset.values[:, :-1]\r\n# perform a normal quantile transform of the dataset\r\ntrans = QuantileTransformer(n_quantiles=100, output_distribution='normal')\r\ndata = trans.fit_transform(data)\r\n# convert the array back to a dataframe\r\ndataset = DataFrame(data)\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example transforms the dataset and plots histograms of each input variable.<\/p>\n<p>We can see that the shape of the histograms for each variable looks very Gaussian as compared to the raw data.<\/p>\n<div id=\"attachment_10329\" 
style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10329\" class=\"size-full wp-image-10329\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Normal-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset.png\" alt=\"Histogram Plots of Normal Quantile Transformed Input Variables for the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Normal-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Normal-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Normal-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Normal-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10329\" class=\"wp-caption-text\">Histogram Plots of Normal Quantile Transformed Input Variables for the Sonar Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s evaluate the same KNN model as the previous section, but in this case on a normal quantile transform of the dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the sonar dataset with normal quantile transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import 
KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the pipeline\r\ntrans = QuantileTransformer(n_quantiles=100, output_distribution='normal')\r\nmodel = KNeighborsClassifier()\r\npipeline = Pipeline(steps=[('t', trans), ('m', model)])\r\n# evaluate the pipeline\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report pipeline performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example, we can see that the normal quantile transform results in a lift in performance from 79.7% accuracy without the transform to about 81.7% with the transform.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.817 (0.087)<\/pre>\n<p>Next, let&rsquo;s take a closer look at the uniform quantile transform.<\/p>\n<h2>Uniform Quantile Transform<\/h2>\n<p>Sometimes it can be beneficial to transform a highly exponential or multi-modal distribution to have a uniform distribution.<\/p>\n<p>This is especially useful for data with a large and sparse range of values, e.g. 
outliers that are common rather than rare.<\/p>\n<p>We can apply the transform by creating a <em>QuantileTransformer<\/em> instance and setting the &ldquo;<em>output_distribution<\/em>&rdquo; argument to &ldquo;<em>uniform<\/em>&rdquo; (the default).<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform a uniform quantile transform of the dataset\r\ntrans = QuantileTransformer(n_quantiles=100, output_distribution='uniform')\r\ndata = trans.fit_transform(data)<\/pre>\n<p>The example below applies the uniform quantile transform and creates histogram plots of each of the transformed variables.<\/p>\n<pre class=\"crayon-plain-tag\"># visualize a uniform quantile transform of the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom pandas.plotting import scatter_matrix\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# retrieve just the numeric input values\r\ndata = dataset.values[:, :-1]\r\n# perform a uniform quantile transform of the dataset\r\ntrans = QuantileTransformer(n_quantiles=100, output_distribution='uniform')\r\ndata = trans.fit_transform(data)\r\n# convert the array back to a dataframe\r\ndataset = DataFrame(data)\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example transforms the dataset and plots histograms of each input variable.<\/p>\n<p>We can see that the shape of the histograms for each variable looks very uniform compared to the raw data.<\/p>\n<div id=\"attachment_10330\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10330\" class=\"size-full wp-image-10330\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Uniform-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset.png\" alt=\"Histogram Plots of Uniform Quantile Transformed Input Variables for the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Uniform-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Uniform-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Uniform-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-Plots-of-Uniform-Quantile-Transformed-Input-Variables-for-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10330\" class=\"wp-caption-text\">Histogram Plots of Uniform Quantile Transformed Input Variables for the Sonar Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s evaluate the same KNN model as the previous section, but in this case on a uniform quantile transform of the raw dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the sonar dataset with uniform quantile transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import QuantileTransformer\r\nfrom sklearn.pipeline import 
Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the pipeline\r\ntrans = QuantileTransformer(n_quantiles=100, output_distribution='uniform')\r\nmodel = KNeighborsClassifier()\r\npipeline = Pipeline(steps=[('t', trans), ('m', model)])\r\n# evaluate the pipeline\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report pipeline performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example, we can see that the uniform transform results in a lift in performance from 79.7 percent accuracy without the transform to about 84.5 percent with the transform, better than the normal transform that achieved a score of 81.7 percent.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.845 (0.074)<\/pre>\n<p>We chose the number of quantiles as an arbitrary number, in this case, 100.<\/p>\n<p>This hyperparameter can be tuned to explore the effect of the resolution of the transform on the resulting skill of the model.<\/p>\n<p>The example below performs this experiment and plots the mean accuracy for different &ldquo;<em>n_quantiles<\/em>&rdquo; values from 1 to 99.<\/p>\n<pre class=\"crayon-plain-tag\"># explore number of quantiles on classification accuracy\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom 
sklearn.preprocessing import QuantileTransformer\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n\r\n# get the dataset\r\ndef get_dataset():\r\n\t# load dataset\r\n\turl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\n\tdataset = read_csv(url, header=None)\r\n\tdata = dataset.values\r\n\t# separate into input and output columns\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# ensure inputs are floats and output is an integer label\r\n\tX = X.astype('float32')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y\r\n\r\n# get a list of models to evaluate\r\ndef get_models():\r\n\tmodels = dict()\r\n\tfor i in range(1,100):\r\n\t\t# define the pipeline\r\n\t\ttrans = QuantileTransformer(n_quantiles=i, output_distribution='uniform')\r\n\t\tmodel = KNeighborsClassifier()\r\n\t\tmodels[str(i)] = Pipeline(steps=[('t', trans), ('m', model)])\r\n\treturn models\r\n\r\n# evaluate a given model using cross-validation\r\ndef evaluate_model(model):\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n\treturn scores\r\n\r\n# define dataset\r\nX, y = get_dataset()\r\n# get the models to evaluate\r\nmodels = get_models()\r\n# evaluate the models and store results\r\nresults = list()\r\nfor name, model in models.items():\r\n\tscores = evaluate_model(model)\r\n\tresults.append(mean(scores))\r\n\tprint('&gt;%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\r\n# plot model performance for comparison\r\npyplot.plot(results)\r\npyplot.show()<\/pre>\n<p>Running the example reports the mean classification accuracy for each value of the &ldquo;<em>n_quantiles<\/em>&rdquo; argument.<\/p>\n<p>We can see that, surprisingly, smaller values resulted in better accuracy, with values such as 4 achieving an accuracy of about 85.4 
percent.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;1 0.466 (0.016)\r\n&gt;2 0.813 (0.085)\r\n&gt;3 0.840 (0.080)\r\n&gt;4 0.854 (0.075)\r\n&gt;5 0.848 (0.072)\r\n&gt;6 0.851 (0.071)\r\n&gt;7 0.845 (0.071)\r\n&gt;8 0.848 (0.066)\r\n&gt;9 0.848 (0.071)\r\n&gt;10 0.843 (0.074)\r\n...<\/pre>\n<p>A line plot is created showing the number of quantiles used in the transform versus the classification accuracy of the resulting model.<\/p>\n<p>We can see a bump for values less than 10, followed by a drop and then flat performance.<\/p>\n<p>The results highlight that there is likely some benefit in exploring different distributions and numbers of quantiles to see if better performance can be achieved.<\/p>\n<div id=\"attachment_10331\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10331\" class=\"size-full wp-image-10331\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Line-Plot-of-Number-of-Quantiles-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset.png\" alt=\"Line Plot of Number of Quantiles vs. 
Classification Accuracy of KNN on the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Line-Plot-of-Number-of-Quantiles-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Line-Plot-of-Number-of-Quantiles-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Line-Plot-of-Number-of-Quantiles-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Line-Plot-of-Number-of-Quantiles-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10331\" class=\"wp-caption-text\">Line Plot of Number of Quantiles vs. 
Classification Accuracy of KNN on the Sonar Dataset<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Continuous Probability Distributions for Machine Learning<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/how-to-transform-target-variables-for-regression-with-scikit-learn\/\">How to Transform Target Variables for Regression With Scikit-Learn<\/a><\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\">Sonar Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.names\">Sonar Dataset Description<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing-transformer\">Non-linear transformation, scikit-learn Guide<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.QuantileTransformer.html\">sklearn.preprocessing.QuantileTransformer API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Quantile_function\">Quantile function, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use quantile transforms to change the distribution of numeric variables for machine learning.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.<\/li>\n<li>Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.<\/li>\n<li>How to use the QuantileTransformer to change the probability distribution of 
numeric variables to improve the performance of predictive models.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/quantile-transforms-for-machine-learning\/\">How to Use Quantile Transforms for Machine Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/quantile-transforms-for-machine-learning\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Numerical input variables may have a highly skewed or non-standard distribution. This could be caused by outliers in the data, multi-modal distributions, [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/19\/how-to-use-quantile-transforms-for-machine-learning\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3475,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3474"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3474"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3474\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3475"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3474"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3474"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3474"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}