{"id":3483,"date":"2020-05-21T19:00:53","date_gmt":"2020-05-21T19:00:53","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/21\/how-to-use-discretization-transforms-for-machine-learning\/"},"modified":"2020-05-21T19:00:53","modified_gmt":"2020-05-21T19:00:53","slug":"how-to-use-discretization-transforms-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/21\/how-to-use-discretization-transforms-for-machine-learning\/","title":{"rendered":"How to Use Discretization Transforms for Machine Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Numerical input variables may have a highly skewed or non-standard distribution.<\/p>\n<p>This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.<\/p>\n<p>Many machine learning algorithms prefer or perform better when numerical input variables have a standard probability distribution.<\/p>\n<p>The discretization transform provides an automatic way to change a numeric input variable to have a different data distribution, which in turn can be used as input to a predictive model.<\/p>\n<p>In this tutorial, you will discover how to use discretization transforms to map numerical values to discrete categories for machine learning.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Many machine learning algorithms prefer or perform better when numerical variables with non-standard probability distributions are made discrete.<\/li>\n<li>Discretization transforms are a technique for transforming numerical input or output variables to have discrete ordinal labels.<\/li>\n<li>How to use the KBinsDiscretizer to change the structure and distribution of numeric variables to improve the performance of predictive models.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_10344\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" 
decoding=\"async\" aria-describedby=\"caption-attachment-10344\" class=\"size-full wp-image-10344\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Discretization-Transforms-for-Machine-Learning.jpg\" alt=\"How to Use Discretization Transforms for Machine Learning\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Discretization-Transforms-for-Machine-Learning.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Discretization-Transforms-for-Machine-Learning-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/How-to-Use-Discretization-Transforms-for-Machine-Learning-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-10344\" class=\"wp-caption-text\">How to Use Discretization Transforms for Machine Learning<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/kateed\/37017732716\/\">Kate Russell<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into six parts; they are:<\/p>\n<ol>\n<li>Change Data Distribution<\/li>\n<li>Discretization Transforms<\/li>\n<li>Sonar Dataset<\/li>\n<li>Uniform Discretization Transform<\/li>\n<li>K-means Discretization Transform<\/li>\n<li>Quantile Discretization Transform<\/li>\n<\/ol>\n<h2>Change Data Distribution<\/h2>\n<p>Some machine learning algorithms may prefer or require categorical or ordinal input variables, such as some decision tree and rule-based algorithms.<\/p>\n<blockquote>\n<p>Some classification and clustering algorithms deal with nominal attributes only and cannot handle ones measured on a numeric scale.<\/p>\n<\/blockquote>\n<p>&mdash; Page 296, <a href=\"https:\/\/amzn.to\/2tzBoXF\">Data Mining: Practical Machine Learning Tools and Techniques<\/a>, 4th edition, 
2016.<\/p>\n<p>Further, the performance of many machine learning algorithms degrades for variables that have non-standard probability distributions.<\/p>\n<p>This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.<\/p>\n<p>Some input variables may have a highly skewed distribution, such as an exponential distribution where the most common observations are bunched together. Some input variables may have outliers that cause the distribution to be highly spread.<\/p>\n<p>These concerns and others, like non-standard distributions and multi-modal distributions, can make a dataset challenging to model with a range of machine learning models.<\/p>\n<p>As such, it is often desirable to transform each input variable to have a standard probability distribution.<\/p>\n<p>One approach is to use a transform of the numerical variable to have a discrete probability distribution where each numerical value is assigned a label and the labels have an ordered (ordinal) relationship.<\/p>\n<p>This is called a <strong>binning<\/strong> or <strong>discretization transform<\/strong> and can improve the performance of some machine learning models for datasets by making the probability distribution of numerical input variables discrete.<\/p>\n<h2>Discretization Transforms<\/h2>\n<p>A <a href=\"https:\/\/en.wikipedia.org\/wiki\/Discretization_of_continuous_features\">discretization transform<\/a> will map numerical variables onto discrete values.<\/p>\n<blockquote>\n<p>Binning, also known as categorization or discretization, is the process of translating a quantitative variable into a set of two or more qualitative buckets (i.e., categories).<\/p>\n<\/blockquote>\n<p>&mdash; Page 129, <a href=\"https:\/\/amzn.to\/2Yvcupn\">Feature Engineering and Selection<\/a>, 2019.<\/p>\n<p>Values for the variable are grouped together into discrete bins and each bin is assigned a unique integer 
such that the ordinal relationship between the bins is preserved.<\/p>\n<p>The use of bins is often referred to as binning or <em>k<\/em>-bins, where <em>k<\/em> refers to the number of groups to which a numeric variable is mapped.<\/p>\n<p>The mapping provides a high-order ranking of values that can smooth out the relationships between observations. The transformation can be applied to each numeric input variable in the training dataset and then provided as input to a machine learning model to learn a predictive modeling task.<\/p>\n<blockquote>\n<p>The determination of the bins must be included inside of the resampling process.<\/p>\n<\/blockquote>\n<p>&mdash; Page 132, <a href=\"https:\/\/amzn.to\/2Yvcupn\">Feature Engineering and Selection<\/a>, 2019.<\/p>\n<p>Different methods for grouping the values into k discrete bins can be used; common techniques include:<\/p>\n<ul>\n<li><strong>Uniform<\/strong>: Each bin has the same width in the span of possible values for the variable.<\/li>\n<li><strong>Quantile<\/strong>: Each bin has the same number of values, split based on percentiles.<\/li>\n<li><strong>Clustered<\/strong>: Clusters are identified and examples are assigned to each group.<\/li>\n<\/ul>\n<p>The discretization transform is available in the scikit-learn Python machine learning library via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.KBinsDiscretizer.html\">KBinsDiscretizer class<\/a>.<\/p>\n<p>The &ldquo;<em>strategy<\/em>&rdquo; argument controls the manner in which the input variable is divided, as either &ldquo;<em>uniform<\/em>,&rdquo; &ldquo;<em>quantile<\/em>,&rdquo; or &ldquo;<em>kmeans<\/em>.&rdquo;<\/p>\n<p>The &ldquo;<em>n_bins<\/em>&rdquo; argument controls the number of bins that will be created and must be set based on the choice of strategy, e.g. 
&ldquo;<em>uniform<\/em>&rdquo; is flexible, &ldquo;<em>quantile<\/em>&rdquo; requires an &ldquo;<em>n_bins<\/em>&rdquo; smaller than the number of observations so that the percentiles are sensible, and &ldquo;<em>kmeans<\/em>&rdquo; requires a number of clusters that can reasonably be found.<\/p>\n<p>The &ldquo;<em>encode<\/em>&rdquo; argument controls whether the transform will map each value to an integer value by setting &ldquo;<em>ordinal<\/em>&rdquo; or a one-hot encoding &ldquo;<em>onehot<\/em>.&rdquo; An ordinal encoding is almost always preferred, although a one-hot encoding may allow a model to learn non-ordinal relationships between the groups, such as in the case of the <em>k<\/em>-means clustering strategy.<\/p>\n<p>We can demonstrate the <em>KBinsDiscretizer<\/em>&nbsp;with a small worked example. We can generate a sample of&nbsp;<a href=\"https:\/\/machinelearningmastery.com\/how-to-generate-random-numbers-in-python\/\">random Gaussian numbers<\/a>. The KBinsDiscretizer can then be used to convert the floating-point values into a fixed number of discrete categories with a ranked ordinal relationship.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># demonstration of the discretization transform\r\nfrom numpy.random import randn\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom matplotlib import pyplot\r\n# generate gaussian data sample\r\ndata = randn(1000)\r\n# histogram of the raw data\r\npyplot.hist(data, bins=25)\r\npyplot.show()\r\n# reshape data to have rows and columns\r\ndata = data.reshape((len(data),1))\r\n# discretization transform the raw data\r\nkbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')\r\ndata_trans = kbins.fit_transform(data)\r\n# summarize first few rows\r\nprint(data_trans[:10, :])\r\n# histogram of the transformed data\r\npyplot.hist(data_trans, bins=10)\r\npyplot.show()<\/pre>\n<p>Running the example first creates a sample of 1,000 random Gaussian floating-point 
values and plots the data as a histogram.<\/p>\n<div id=\"attachment_10931\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10931\" class=\"size-full wp-image-10931\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Data-With-a-Gaussian-Distribution.png\" alt=\"Histogram of Data With a Gaussian Distribution\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Data-With-a-Gaussian-Distribution.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Data-With-a-Gaussian-Distribution-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Data-With-a-Gaussian-Distribution-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Histogram-of-Data-With-a-Gaussian-Distribution-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10931\" class=\"wp-caption-text\">Histogram of Data With a Gaussian Distribution<\/p>\n<\/div>\n<p>Next the KBinsDiscretizer is used to map the numerical values to categorical values. 
We configure the transform to create 10 categories (0 to 9), to output the result in ordinal format (integers) and to divide the range of the input data uniformly.<\/p>\n<p>A sample of the transformed data is printed, clearly showing the integer format of the data as expected.<\/p>\n<pre class=\"crayon-plain-tag\">[[5.]\r\n [3.]\r\n [2.]\r\n [6.]\r\n [7.]\r\n [5.]\r\n [3.]\r\n [4.]\r\n [4.]\r\n [2.]]<\/pre>\n<p>Finally, a histogram is created showing the 10 discrete categories and how the observations are distributed across these groups, following the same pattern as the original data with a Gaussian shape.<\/p>\n<div id=\"attachment_10932\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10932\" class=\"size-full wp-image-10932\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/05\/Historam-of-Transformed-Data-With-Discrete-Categories.png\" alt=\"Histogram of Transformed Data With Discrete Categories\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Historam-of-Transformed-Data-With-Discrete-Categories.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Historam-of-Transformed-Data-With-Discrete-Categories-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Historam-of-Transformed-Data-With-Discrete-Categories-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/05\/Historam-of-Transformed-Data-With-Discrete-Categories-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10932\" class=\"wp-caption-text\">Histogram of Transformed Data With Discrete Categories<\/p>\n<\/div>\n<p>In the following sections, we will take a closer look at how to use the discretization transform 
on a real dataset.<\/p>\n<p>Next, let&rsquo;s introduce the dataset.<\/p>\n<h2>Sonar Dataset<\/h2>\n<p>The sonar dataset is a standard machine learning dataset for binary classification.<\/p>\n<p>It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.<\/p>\n<p>A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. <a href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Top performance<\/a> on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.<\/p>\n<p>The dataset describes sonar returns of rocks or simulated mines.<\/p>\n<p>You can learn more about the dataset here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\">Sonar Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.names\">Sonar Dataset Description<\/a><\/li>\n<\/ul>\n<p>No need to download the dataset; we will download it automatically as part of our worked examples.<\/p>\n<p>First, let&rsquo;s load and summarize the dataset. 
The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas.plotting import scatter_matrix\r\nfrom matplotlib import pyplot\r\n# Load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataset.shape)\r\n# summarize each variable\r\nprint(dataset.describe())\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the shape of the loaded dataset.<\/p>\n<p>This confirms the 60 input variables, one output variable, and 208 rows of data.<\/p>\n<p>A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.<\/p>\n<pre class=\"crayon-plain-tag\">(208, 61)\r\n               0           1           2   ...          57          58          59\r\ncount  208.000000  208.000000  208.000000  ...  208.000000  208.000000  208.000000\r\nmean     0.029164    0.038437    0.043832  ...    0.007949    0.007941    0.006507\r\nstd      0.022991    0.032960    0.038428  ...    0.006470    0.006181    0.005031\r\nmin      0.001500    0.000600    0.001500  ...    0.000300    0.000100    0.000600\r\n25%      0.013350    0.016450    0.018950  ...    0.003600    0.003675    0.003100\r\n50%      0.022800    0.030800    0.034300  ...    0.005800    0.006400    0.005300\r\n75%      0.035550    0.047950    0.057950  ...    0.010350    0.010325    0.008525\r\nmax      0.137100    0.233900    0.305900  ...    
0.044000    0.036400    0.043900\r\n\r\n[8 rows x 60 columns]<\/pre>\n<p>Finally, a histogram is created for each input variable.<\/p>\n<p>If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.<\/p>\n<div id=\"attachment_10339\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10339\" class=\"size-full wp-image-10339\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-1.png\" alt=\"Histogram Plots of Input Variables for the Sonar Binary Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-1.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-1-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-1-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Input-Variables-for-the-Sonar-Binary-Classification-Dataset-1-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10339\" class=\"wp-caption-text\">Histogram Plots of Input Variables for the Sonar Binary Classification Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s fit and evaluate a machine learning model on the raw dataset.<\/p>\n<p>We will use a <a href=\"https:\/\/machinelearningmastery.com\/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch\/\">k-nearest neighbor 
algorithm<\/a> with default hyperparameters and evaluate it using <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified k-fold cross-validation<\/a>.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the raw sonar dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define and configure the model\r\nmodel = KNeighborsClassifier()\r\n# evaluate the model\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report model performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example evaluates a KNN model on the raw sonar dataset.<\/p>\n<p>We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ball-park of good performance (88 percent).<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.797 (0.073)<\/pre>\n<p>Next, let&rsquo;s explore a uniform discretization transform of the dataset.<\/p>\n<h2>Uniform Discretization Transform<\/h2>\n<p>A uniform discretization transform will preserve the probability distribution of each input 
variable but will make it discrete with the specified number of ordinal groups or labels.<\/p>\n<p>We can apply the uniform discretization transform using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.KBinsDiscretizer.html\">KBinsDiscretizer<\/a> class and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &ldquo;<em>uniform<\/em>.&rdquo; We must also set the desired number of bins via the &ldquo;<em>n_bins<\/em>&rdquo; argument; in this case, we will use 10.<\/p>\n<p>Once defined, we can call the <em>fit_transform()<\/em> function and pass it our dataset to create a uniform discretization transformed version of our dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform a uniform discretization transform of the dataset\r\ntrans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')\r\ndata = trans.fit_transform(data)<\/pre>\n<p>Let&rsquo;s try it on our sonar dataset.<\/p>\n<p>The complete example of creating a uniform discretization transform of the sonar dataset and plotting histograms of the result is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># visualize a uniform ordinal discretization transform of the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom pandas.plotting import scatter_matrix\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# retrieve just the numeric input values\r\ndata = dataset.values[:, :-1]\r\n# perform a uniform discretization transform of the dataset\r\ntrans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')\r\ndata = trans.fit_transform(data)\r\n# convert the array back to a dataframe\r\ndataset = DataFrame(data)\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example transforms the 
dataset and plots histograms of each input variable.<\/p>\n<p>We can see that the shape of the histograms generally matches the shape of the raw dataset, although in this case, each variable has a fixed number of 10 values or ordinal groups.<\/p>\n<div id=\"attachment_10340\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10340\" class=\"size-full wp-image-10340\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Uniform-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset.png\" alt=\"Histogram Plots of Uniform Discretization Transformed Input Variables for the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Uniform-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Uniform-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Uniform-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Uniform-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10340\" class=\"wp-caption-text\">Histogram Plots of Uniform Discretization Transformed Input Variables for the Sonar Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s evaluate the same KNN model as the previous section, but in this case on a uniform discretization transform of the dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre 
class=\"crayon-plain-tag\"># evaluate knn on the sonar dataset with uniform ordinal discretization transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the pipeline\r\ntrans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')\r\nmodel = KNeighborsClassifier()\r\npipeline = Pipeline(steps=[('t', trans), ('m', model)])\r\n# evaluate the pipeline\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report pipeline performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example, we can see that the uniform discretization transform results in a lift in performance from 79.7 percent accuracy without the transform to about 82.7 percent with the transform.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.827 (0.082)<\/pre>\n<p>Next, let&rsquo;s take a closer look at the k-means discretization transform.<\/p>\n<h2>K-means Discretization Transform<\/h2>\n<p>A K-means discretization transform will attempt to fit k clusters for each input variable and then assign each observation to a 
cluster.<\/p>\n<p>Unless the empirical distribution of the variable is complex, the number of clusters is likely to be small, such as 3 to 5.<\/p>\n<p>We can apply the K-means discretization transform using the <em>KBinsDiscretizer<\/em> class and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &ldquo;<em>kmeans<\/em>.&rdquo; We must also set the desired number of bins via the &ldquo;<em>n_bins<\/em>&rdquo; argument; in this case, we will use three.<\/p>\n<p>Once defined, we can call the <em>fit_transform()<\/em> function and pass it our dataset to create a K-means discretization transformed version of our dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform a k-means discretization transform of the dataset\r\ntrans = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')\r\ndata = trans.fit_transform(data)<\/pre>\n<p>Let&rsquo;s try it on our sonar dataset.<\/p>\n<p>The complete example of creating a K-means discretization transform of the sonar dataset and plotting histograms of the result is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># visualize a k-means ordinal discretization transform of the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom pandas.plotting import scatter_matrix\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# retrieve just the numeric input values\r\ndata = dataset.values[:, :-1]\r\n# perform a k-means discretization transform of the dataset\r\ntrans = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')\r\ndata = trans.fit_transform(data)\r\n# convert the array back to a dataframe\r\ndataset = DataFrame(data)\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example transforms the dataset and plots histograms of each input 
variable.<\/p>\n<p>We can see that the observations for each input variable are organized into one of three groups, some of which appear to be quite even in terms of observations, and others much less so.<\/p>\n<div id=\"attachment_10341\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10341\" class=\"size-full wp-image-10341\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-K-means-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset.png\" alt=\"Histogram Plots of K-means Discretization Transformed Input Variables for the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-K-means-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-K-means-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-K-means-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-K-means-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10341\" class=\"wp-caption-text\">Histogram Plots of K-means Discretization Transformed Input Variables for the Sonar Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s evaluate the same KNN model as the previous section, but in this case on a K-means discretization transform of the dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate 
knn on the sonar dataset with k-means ordinal discretization transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the pipeline\r\ntrans = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')\r\nmodel = KNeighborsClassifier()\r\npipeline = Pipeline(steps=[('t', trans), ('m', model)])\r\n# evaluate the pipeline\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report pipeline performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example, we can see that the K-means discretization transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.4 percent with the transform, although this is slightly lower than the result for the uniform discretization transform in the previous section.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.814 (0.088)<\/pre>\n<p>Next, let&rsquo;s take a closer look at the quantile discretization transform.<\/p>\n<h2>Quantile Discretization Transform<\/h2>\n<p>A quantile discretization transform will attempt to split the observations for each input variable into k 
groups, where the number of observations assigned to each group is approximately equal.<\/p>\n<p>Unless there is a large number of observations or a complex empirical distribution, the number of bins should be kept small, such as 5 to 10.<\/p>\n<p>We can apply the quantile discretization transform using the <em>KBinsDiscretizer<\/em> class and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &ldquo;<em>quantile<\/em>.&rdquo; We must also set the desired number of bins via the &ldquo;<em>n_bins<\/em>&rdquo; argument; in this case, we will use 10.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# perform a quantile discretization transform of the dataset\r\ntrans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')\r\ndata = trans.fit_transform(data)<\/pre>\n<p>The example below applies the quantile discretization transform and creates histogram plots of each of the transformed variables.<\/p>\n<pre class=\"crayon-plain-tag\"># visualize a quantile ordinal discretization transform of the sonar dataset\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\n# retrieve just the numeric input values\r\ndata = dataset.values[:, :-1]\r\n# perform a quantile discretization transform of the dataset\r\ntrans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')\r\ndata = trans.fit_transform(data)\r\n# convert the array back to a dataframe\r\ndataset = DataFrame(data)\r\n# histograms of the variables\r\ndataset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example transforms the dataset and plots histograms of each input variable.<\/p>\n<p>We can see that the histograms all show a uniform probability distribution for each input variable, where each of the 
10 groups has the same number of observations.<\/p>\n<div id=\"attachment_10342\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10342\" class=\"size-full wp-image-10342\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Quantile-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset.png\" alt=\"Histogram Plots of Quantile Discretization Transformed Input Variables for the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Quantile-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Quantile-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Quantile-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Histogram-Plots-of-Quantile-Discretization-Transformed-Input-Variables-for-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10342\" class=\"wp-caption-text\">Histogram Plots of Quantile Discretization Transformed Input Variables for the Sonar Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s evaluate the same KNN model as the previous section, but in this case, on a quantile discretization transform of the raw dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate knn on the sonar dataset with quantile ordinal discretization transform\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import 
read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom sklearn.pipeline import Pipeline\r\n# load dataset\r\nurl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\ndataset = read_csv(url, header=None)\r\ndata = dataset.values\r\n# separate into input and output columns\r\nX, y = data[:, :-1], data[:, -1]\r\n# ensure inputs are floats and output is an integer label\r\nX = X.astype('float32')\r\ny = LabelEncoder().fit_transform(y.astype('str'))\r\n# define the pipeline\r\ntrans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')\r\nmodel = KNeighborsClassifier()\r\npipeline = Pipeline(steps=[('t', trans), ('m', model)])\r\n# evaluate the pipeline\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\nn_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n# report pipeline performance\r\nprint('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))<\/pre>\n<p>Running the example, we can see that the quantile transform results in a lift in performance from 79.7 percent accuracy without the transform to about 84.0 percent with the transform, better than the uniform and K-means methods of the previous sections.<\/p>\n<pre class=\"crayon-plain-tag\">Accuracy: 0.840 (0.072)<\/pre>\n<p>We chose the number of bins arbitrarily; in this case, 10.<\/p>\n<p>This hyperparameter can be tuned to explore the effect of the resolution of the transform on the resulting skill of the model.<\/p>\n<p>The example below performs this experiment and plots the mean accuracy for different &ldquo;<em>n_bins<\/em>&rdquo; values from two to 10.<\/p>\n<pre class=\"crayon-plain-tag\"># explore 
number of discrete bins on classification accuracy\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.preprocessing import KBinsDiscretizer\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n\r\n# get the dataset\r\ndef get_dataset():\r\n\t# load dataset\r\n\turl = \"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\"\r\n\tdataset = read_csv(url, header=None)\r\n\tdata = dataset.values\r\n\t# separate into input and output columns\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# ensure inputs are floats and output is an integer label\r\n\tX = X.astype('float32')\r\n\ty = LabelEncoder().fit_transform(y.astype('str'))\r\n\treturn X, y\r\n\r\n# get a list of models to evaluate\r\ndef get_models():\r\n\tmodels = dict()\r\n\tfor i in range(2,11):\r\n\t\t# define the pipeline\r\n\t\ttrans = KBinsDiscretizer(n_bins=i, encode='ordinal', strategy='quantile')\r\n\t\tmodel = KNeighborsClassifier()\r\n\t\tmodels[str(i)] = Pipeline(steps=[('t', trans), ('m', model)])\r\n\treturn models\r\n\r\n# evaluate a given model using cross-validation\r\ndef evaluate_model(model, X, y):\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\n\treturn scores\r\n\r\n# get the dataset\r\nX, y = get_dataset()\r\n# get the models to evaluate\r\nmodels = get_models()\r\n# evaluate the models and store results\r\nresults, names = list(), list()\r\nfor name, model in models.items():\r\n\tscores = evaluate_model(model, X, y)\r\n\tresults.append(scores)\r\n\tnames.append(name)\r\n\tprint('&gt;%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\r\n# plot model performance for 
comparison\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example reports the mean classification accuracy for each value of the &ldquo;<em>n_bins<\/em>&rdquo; argument.<\/p>\n<p>We can see that, surprisingly, a small number of bins gave the best result, with three bins achieving an accuracy of about 86.7 percent.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;2 0.806 (0.080)\r\n&gt;3 0.867 (0.070)\r\n&gt;4 0.835 (0.083)\r\n&gt;5 0.838 (0.070)\r\n&gt;6 0.836 (0.071)\r\n&gt;7 0.854 (0.071)\r\n&gt;8 0.837 (0.077)\r\n&gt;9 0.841 (0.069)\r\n&gt;10 0.840 (0.072)<\/pre>\n<p>Box and whisker plots are created to summarize the classification accuracy scores for each number of discrete bins on the dataset.<\/p>\n<p>We can see a small bump in accuracy at three bins, after which the scores drop and remain roughly flat for larger values.<\/p>\n<p>The results highlight that there is likely some benefit in exploring different numbers of discrete bins for the chosen method to see if better performance can be achieved.<\/p>\n<div id=\"attachment_10343\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10343\" class=\"size-full wp-image-10343\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Number-of-Discrete-Bins-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset.png\" alt=\"Box Plots of Number of Discrete Bins vs. 
Classification Accuracy of KNN on the Sonar Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Number-of-Discrete-Bins-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Number-of-Discrete-Bins-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Number-of-Discrete-Bins-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-Plots-of-Number-of-Discrete-Bins-vs-Classification-Accuracy-of-KNN-on-the-Sonar-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10343\" class=\"wp-caption-text\">Box Plots of Number of Discrete Bins vs. 
Classification Accuracy of KNN on the Sonar Dataset<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Continuous Probability Distributions for Machine Learning<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/how-to-transform-target-variables-for-regression-with-scikit-learn\/\">How to Transform Target Variables for Regression With Scikit-Learn<\/a><\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/2tzBoXF\">Data Mining: Practical Machine Learning Tools and Techniques<\/a>, 4th edition, 2016.<\/li>\n<li><a href=\"https:\/\/amzn.to\/2Yvcupn\">Feature Engineering and Selection<\/a>, 2019.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.csv\">Sonar Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/sonar.names\">Sonar Dataset Description<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing-transformer\">Non-linear transformation, scikit-learn Guide<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.KBinsDiscretizer.html\">sklearn.preprocessing.KBinsDiscretizer API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Discretization_of_continuous_features\">Discretization of continuous features, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use discretization transforms to map numerical values to discrete categories for machine learning.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Many machine learning algorithms prefer or perform better when numerical input variables with 
non-standard probability distributions are made discrete.<\/li>\n<li>Discretization transforms are a technique for transforming numerical input or output variables to have discrete ordinal labels.<\/li>\n<li>How to use the KBinsDiscretizer to change the structure and distribution of numeric variables to improve the performance of predictive models.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/discretization-transforms-for-machine-learning\/\">How to Use Discretization Transforms for Machine Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/discretization-transforms-for-machine-learning\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Numerical input variables may have a highly skewed or non-standard distribution. 
This could be caused by outliers in the data, multi-modal distributions, [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/05\/21\/how-to-use-discretization-transforms-for-machine-learning\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3484,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3483"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3483"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3483\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3484"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3483"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3483"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3483"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}