{"id":3522,"date":"2020-06-02T19:00:09","date_gmt":"2020-06-02T19:00:09","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/06\/02\/iterative-imputation-for-missing-values-in-machine-learning\/"},"modified":"2020-06-02T19:00:09","modified_gmt":"2020-06-02T19:00:09","slug":"iterative-imputation-for-missing-values-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/06\/02\/iterative-imputation-for-missing-values-in-machine-learning\/","title":{"rendered":"Iterative Imputation for Missing Values in Machine Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Datasets may have missing values, and this can cause problems for many machine learning algorithms.<\/p>\n<p>As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.<\/p>\n<p>A sophisticated approach involves defining a model to predict each missing feature as a function of all other features and to repeat this process of estimating feature values multiple times. The repetition allows the refined estimated values for other features to be used as input in subsequent iterations of predicting missing values. 
This is generally referred to as iterative imputation.<\/p>\n<p>In this tutorial, you will discover how to use iterative imputation strategies for missing data in machine learning.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Missing values must be marked with NaN values and can be replaced with iteratively estimated values.<\/li>\n<li>How to load a CSV file with missing values, mark the missing values with NaN, and report the number and percentage of missing values for each column.<\/li>\n<li>How to impute missing values with iterative models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_10537\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10537\" class=\"size-full wp-image-10537\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/07\/Iterative-Imputation-for-Missing-Values-in-Machine-Learning.jpg\" alt=\"Iterative Imputation for Missing Values in Machine Learning\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/Iterative-Imputation-for-Missing-Values-in-Machine-Learning.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/Iterative-Imputation-for-Missing-Values-in-Machine-Learning-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/07\/Iterative-Imputation-for-Missing-Values-in-Machine-Learning-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-10537\" class=\"wp-caption-text\">Iterative Imputation for Missing Values in Machine Learning<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/macskapocs\/44808123531\/\">Gergely Csatari<\/a>, some rights 
reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Iterative Imputation<\/li>\n<li>Horse Colic Dataset<\/li>\n<li>Iterative Imputation With IterativeImputer\n<ol>\n<li>IterativeImputer Data Transform<\/li>\n<li>IterativeImputer and Model Evaluation<\/li>\n<li>IterativeImputer and Different Imputation Order<\/li>\n<li>IterativeImputer and Different Number of Iterations<\/li>\n<li>IterativeImputer Transform When Making a Prediction<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Iterative Imputation<\/h2>\n<p>A dataset may have missing values.<\/p>\n<p>These are rows of data where one or more values or columns in that row are not present. The values may be missing completely or they may be marked with a special character or value, such as a question mark &ldquo;?&rdquo;.<\/p>\n<p>Values could be missing for many reasons, often specific to the problem domain, and might include reasons such as corrupt measurements or unavailability.<\/p>\n<p>Most machine learning algorithms require numeric input values, and a value to be present for each row and column in a dataset. As such, missing values can cause problems for machine learning algorithms.<\/p>\n<p>As such, it is common to identify missing values in a dataset and replace them with a numeric value. This is called data imputing, or missing data imputation.<\/p>\n<p>One approach to imputing missing values is to use an <strong>iterative imputation model<\/strong>.<\/p>\n<p>Iterative imputation refers to a process where each feature is modeled as a function of the other features, e.g. a regression problem where missing values are predicted. 
Each feature is imputed sequentially, one after the other, allowing prior imputed values to be used as part of a model in predicting subsequent features.<\/p>\n<p>It is iterative because this process is repeated multiple times, allowing ever improved estimates of missing values to be calculated as missing values across all features are estimated.<\/p>\n<p>This approach may be generally referred to as fully conditional specification (FCS) or multivariate imputation by chained equations (MICE).<\/p>\n<blockquote>\n<p>This methodology is attractive if the multivariate distribution is a reasonable description of the data. FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10&ndash;20) is often sufficient.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/www.jstatsoft.org\/article\/view\/v045i03\">mice: Multivariate Imputation by Chained Equations in R<\/a>, 2009.<\/p>\n<p>Different regression algorithms can be used to estimate the missing values for each feature, although linear methods are often used for simplicity. The number of iterations of the procedure is often kept small, such as 10. Finally, the order in which features are processed can also be chosen, such as from the feature with the fewest missing values to the feature with the most missing values.<\/p>\n<p>Now that we are familiar with iterative methods for missing value imputation, let&rsquo;s take a look at a dataset with missing values.<\/p>\n<h2>Horse Colic Dataset<\/h2>\n<p>The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.<\/p>\n<p>There are 300 rows and 27 input variables with one output variable. 
It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.<\/p>\n<p>A naive model can achieve a classification accuracy of about 67 percent, and a top performing model can achieve an accuracy of about 85.2 percent using three repeats of 10-fold cross-validation. This defines the range of expected modeling performance on the dataset.<\/p>\n<p>The dataset has many missing values for many of the columns where each missing value is marked with a question mark character (&ldquo;?&rdquo;).<\/p>\n<p>Below provides an example of rows from the dataset with marked missing values.<\/p>\n<pre class=\"crayon-plain-tag\">2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2\r\n1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2\r\n2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1\r\n1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1\r\n...<\/pre>\n<p>You can learn more about the dataset here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv\">Horse Colic Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.names\">Horse Colic Dataset Description<\/a><\/li>\n<\/ul>\n<p>No need to download the dataset as we will download it automatically in the worked examples.<\/p>\n<p>Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.<\/p>\n<p>We can load the dataset using the read_csv() Pandas function and specify the &ldquo;na_values&rdquo; to load values of &lsquo;?&rsquo; as missing, marked with a NaN value.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, 
na_values='?')<\/pre>\n<p>Once loaded, we can review the loaded data to confirm that &ldquo;?&rdquo; values are marked as NaN.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the first few rows\r\nprint(dataframe.head())<\/pre>\n<p>We can then enumerate each column and report the number of rows with missing values for the column.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the number of rows with missing values for each column\r\nfor i in range(dataframe.shape[1]):\r\n\t# count number of rows with missing values\r\n\tn_miss = dataframe[[i]].isnull().sum()\r\n\tperc = n_miss \/ dataframe.shape[0] * 100\r\n\tprint('&gt; %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))<\/pre>\n<p>Tying this together, the complete example of loading and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># summarize the horse colic dataset\r\nfrom pandas import read_csv\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\n# summarize the first few rows\r\nprint(dataframe.head())\r\n# summarize the number of rows with missing values for each column\r\nfor i in range(dataframe.shape[1]):\r\n\t# count number of rows with missing values\r\n\tn_miss = dataframe[[i]].isnull().sum()\r\n\tperc = n_miss \/ dataframe.shape[0] * 100\r\n\tprint('&gt; %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))<\/pre>\n<p>Running the example first loads the dataset and summarizes the first five rows.<\/p>\n<p>We can see that the missing values that were marked with a &ldquo;?&rdquo; character have been replaced with NaN values.<\/p>\n<pre class=\"crayon-plain-tag\">0   1        2     3      4     5    6   ...   21   22  23     24  25  26  27\r\n0  2.0   1   530101  38.5   66.0  28.0  3.0  ...  NaN  2.0   2  11300   0   0   2\r\n1  1.0   1   534817  39.2   88.0  20.0  NaN  ...  
2.0  3.0   2   2208   0   0   2\r\n2  2.0   1   530334  38.3   40.0  24.0  1.0  ...  NaN  1.0   2      0   0   0   1\r\n3  1.0   9  5290409  39.1  164.0  84.0  4.0  ...  5.3  2.0   1   2208   0   0   1\r\n4  2.0   1   530255  37.3  104.0  35.0  NaN  ...  NaN  2.0   2   4300   0   0   2\r\n\r\n[5 rows x 28 columns]<\/pre>\n<p>Next, we can see the list of all columns in the dataset and the number and percentage of missing values.<\/p>\n<p>We can see that some columns (e.g. column indexes 1 and 2) have no missing values and other columns (e.g. column indexes 15 and 21) have many or even a majority of missing values.<\/p>\n<pre class=\"crayon-plain-tag\">&gt; 0, Missing: 1 (0.3%)\r\n&gt; 1, Missing: 0 (0.0%)\r\n&gt; 2, Missing: 0 (0.0%)\r\n&gt; 3, Missing: 60 (20.0%)\r\n&gt; 4, Missing: 24 (8.0%)\r\n&gt; 5, Missing: 58 (19.3%)\r\n&gt; 6, Missing: 56 (18.7%)\r\n&gt; 7, Missing: 69 (23.0%)\r\n&gt; 8, Missing: 47 (15.7%)\r\n&gt; 9, Missing: 32 (10.7%)\r\n&gt; 10, Missing: 55 (18.3%)\r\n&gt; 11, Missing: 44 (14.7%)\r\n&gt; 12, Missing: 56 (18.7%)\r\n&gt; 13, Missing: 104 (34.7%)\r\n&gt; 14, Missing: 106 (35.3%)\r\n&gt; 15, Missing: 247 (82.3%)\r\n&gt; 16, Missing: 102 (34.0%)\r\n&gt; 17, Missing: 118 (39.3%)\r\n&gt; 18, Missing: 29 (9.7%)\r\n&gt; 19, Missing: 33 (11.0%)\r\n&gt; 20, Missing: 165 (55.0%)\r\n&gt; 21, Missing: 198 (66.0%)\r\n&gt; 22, Missing: 1 (0.3%)\r\n&gt; 23, Missing: 0 (0.0%)\r\n&gt; 24, Missing: 0 (0.0%)\r\n&gt; 25, Missing: 0 (0.0%)\r\n&gt; 26, Missing: 0 (0.0%)\r\n&gt; 27, Missing: 0 (0.0%)<\/pre>\n<p>Now that we are familiar with the horse colic dataset that has missing values, let&rsquo;s look at how we can use iterative imputation.<\/p>\n<h2>Iterative Imputation With IterativeImputer<\/h2>\n<p>The scikit-learn machine learning library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.impute.IterativeImputer.html\">IterativeImputer class<\/a> that supports iterative imputation.<\/p>\n<p>In this section, we will 
explore how to effectively use the <em>IterativeImputer<\/em> class.<\/p>\n<h3>IterativeImputer Data Transform<\/h3>\n<p>It is a data transform that is first configured based on the method used to estimate the missing values. By default, a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.BayesianRidge.html\">BayesianRidge<\/a> model is employed that uses a function of all other input features. Features are filled in ascending order, from those with the fewest missing values to those with the most.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define imputer\r\nimputer = IterativeImputer(estimator=BayesianRidge(), n_nearest_features=None, imputation_order='ascending')<\/pre>\n<p>Then the imputer is fit on a dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit on the dataset\r\nimputer.fit(X)<\/pre>\n<p>The fit imputer is then applied to a dataset to create a copy of the dataset with all missing values for each column replaced with an estimated value.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# transform the dataset\r\nXtrans = imputer.transform(X)<\/pre>\n<p>The <em>IterativeImputer<\/em> class cannot be used directly because it is experimental.<\/p>\n<p>If you try to use it directly, you will get an error as follows:<\/p>\n<pre class=\"crayon-plain-tag\">ImportError: cannot import name 'IterativeImputer'<\/pre>\n<p>Instead, you must add an additional import statement to add support for the IterativeImputer class, as follows:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nfrom sklearn.experimental import enable_iterative_imputer<\/pre>\n<p>We can demonstrate its usage on the horse colic dataset and confirm it works by summarizing the total number of missing values in the dataset before and after the transform.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># iterative imputation transform for the horse colic dataset\r\nfrom numpy import isnan\r\nfrom pandas import read_csv\r\nfrom 
sklearn.experimental import enable_iterative_imputer\r\nfrom sklearn.impute import IterativeImputer\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\n# split into input and output elements\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\n# print total missing\r\nprint('Missing: %d' % sum(isnan(X).flatten()))\r\n# define imputer\r\nimputer = IterativeImputer()\r\n# fit on the dataset\r\nimputer.fit(X)\r\n# transform the dataset\r\nXtrans = imputer.transform(X)\r\n# print total missing\r\nprint('Missing: %d' % sum(isnan(Xtrans).flatten()))<\/pre>\n<p>Running the example first loads the dataset and reports the total number of missing values in the dataset as 1,605.<\/p>\n<p>The transform is configured, fit, and performed and the resulting new dataset has no missing values, confirming it was performed as we expected.<\/p>\n<p>Each missing value was replaced with a value estimated by the model.<\/p>\n<pre class=\"crayon-plain-tag\">Missing: 1605\r\nMissing: 0<\/pre>\n<\/p>\n<h3>IterativeImputer and Model Evaluation<\/h3>\n<p>It is a good practice to evaluate machine learning models on a dataset using <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation<\/a>.<\/p>\n<p>To correctly apply iterative missing data imputation and avoid data leakage, it is required that the models for each column are calculated on the training dataset only, then applied to the train and test sets for each fold in the dataset.<\/p>\n<p>This can be achieved by creating a modeling pipeline where the first step is the iterative imputation, then the second step is the model. 
This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline class<\/a>.<\/p>\n<p>For example, the <em>Pipeline<\/em> below uses an <em>IterativeImputer<\/em> with the default strategy, followed by a random forest model.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define modeling pipeline\r\nmodel = RandomForestClassifier()\r\nimputer = IterativeImputer()\r\npipeline = Pipeline(steps=[('i', imputer), ('m', model)])<\/pre>\n<p>We can evaluate the imputed dataset and random forest modeling pipeline for the horse colic dataset with repeated 10-fold cross-validation.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate iterative imputation and random forest for the horse colic dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.experimental import enable_iterative_imputer\r\nfrom sklearn.impute import IterativeImputer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\n# split into input and output elements\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\n# define modeling pipeline\r\nmodel = RandomForestClassifier()\r\nimputer = IterativeImputer()\r\npipeline = Pipeline(steps=[('i', imputer), ('m', model)])\r\n# define model evaluation\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example correctly applies data 
imputation to each fold of the cross-validation procedure.<\/p>\n<p>The pipeline is evaluated using three repeats of 10-fold cross-validation and reports the mean classification accuracy on the dataset as about 81.4 percent, which is a good score.<\/p>\n<pre class=\"crayon-plain-tag\">Mean Accuracy: 0.814 (0.063)<\/pre>\n<p>How do we know that using a default iterative strategy is good or best for this dataset?<\/p>\n<p>The answer is that we don&rsquo;t.<\/p>\n<h3>IterativeImputer and Different Imputation Order<\/h3>\n<p>By default, imputation is performed in ascending order from the feature with the fewest missing values to the feature with the most.<\/p>\n<p>This makes sense as we want to have more complete data when it comes time to estimate missing values for columns where the majority of values are missing.<\/p>\n<p>Nevertheless, we can experiment with different imputation order strategies, such as descending, right-to-left (Arabic), left-to-right (Roman), and random.<\/p>\n<p>The example below evaluates and compares each available imputation order configuration.<\/p>\n<pre class=\"crayon-plain-tag\"># compare iterative imputation strategies for the horse colic dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.experimental import enable_iterative_imputer\r\nfrom sklearn.impute import IterativeImputer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\n# split into input and output elements\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\n# evaluate each strategy on the dataset\r\nresults = list()\r\nstrategies = ['ascending', 
'descending', 'roman', 'arabic', 'random']\r\nfor s in strategies:\r\n\t# create the modeling pipeline\r\n\tpipeline = Pipeline(steps=[('i', IterativeImputer(imputation_order=s)), ('m', RandomForestClassifier())])\r\n\t# evaluate the model\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\tscores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\t# store results\r\n\tresults.append(scores)\r\n\tprint('&gt;%s %.3f (%.3f)' % (s, mean(scores), std(scores)))\r\n# plot model performance for comparison\r\npyplot.boxplot(results, labels=strategies, showmeans=True)\r\npyplot.xticks(rotation=45)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each imputation order on the horse colic dataset using repeated cross-validation.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.<\/p>\n<p>The mean accuracy of each strategy is reported along the way. The results suggest little difference between most of the methods, with descending (opposite of the default) performing the worst. 
The results suggest that right-to-left (Arabic) order might be better for this dataset with an accuracy of about 80.4 percent.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;ascending 0.801 (0.071)\r\n&gt;descending 0.797 (0.059)\r\n&gt;roman 0.802 (0.060)\r\n&gt;arabic 0.804 (0.068)\r\n&gt;random 0.802 (0.061)<\/pre>\n<p>At the end of the run, a box and whisker plot is created for each set of results, allowing the distribution of results to be compared.<\/p>\n<div id=\"attachment_10535\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10535\" class=\"size-full wp-image-10535\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Imputation-Order-Strategies-Applied-to-the-Horse-Colic-Dataset.png\" alt=\"Box and Whisker Plot of Imputation Order Strategies Applied to the Horse Colic Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Imputation-Order-Strategies-Applied-to-the-Horse-Colic-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Imputation-Order-Strategies-Applied-to-the-Horse-Colic-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Imputation-Order-Strategies-Applied-to-the-Horse-Colic-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Imputation-Order-Strategies-Applied-to-the-Horse-Colic-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10535\" class=\"wp-caption-text\">Box and Whisker Plot of Imputation Order Strategies Applied to the Horse Colic Dataset<\/p>\n<\/div>\n<h3>IterativeImputer and 
Different Number of Iterations<\/h3>\n<p>By default, the IterativeImputer will run the procedure for up to 10 iterations.<\/p>\n<p>It is possible that a large number of iterations may begin to bias or skew the estimate and that fewer iterations may be preferred. The number of iterations of the procedure can be specified via the &ldquo;<em>max_iter<\/em>&rdquo; argument.<\/p>\n<p>It may be interesting to evaluate different numbers of iterations. The example below compares different values for &ldquo;<em>max_iter<\/em>&rdquo; from 1 to 20.<\/p>\n<pre class=\"crayon-plain-tag\"># compare iterative imputation number of iterations for the horse colic dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.experimental import enable_iterative_imputer\r\nfrom sklearn.impute import IterativeImputer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom matplotlib import pyplot\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = read_csv(url, header=None, na_values='?')\r\n# split into input and output elements\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\n# evaluate each strategy on the dataset\r\nresults = list()\r\nstrategies = [str(i) for i in range(1, 21)]\r\nfor s in strategies:\r\n\t# create the modeling pipeline\r\n\tpipeline = Pipeline(steps=[('i', IterativeImputer(max_iter=int(s))), ('m', RandomForestClassifier())])\r\n\t# evaluate the model\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\tscores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\t# store results\r\n\tresults.append(scores)\r\n\tprint('&gt;%s %.3f (%.3f)' % (s, mean(scores), std(scores)))\r\n# plot model performance for 
comparison\r\npyplot.boxplot(results, labels=strategies, showmeans=True)\r\npyplot.xticks(rotation=45)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each number of iterations on the horse colic dataset using repeated cross-validation.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.<\/p>\n<p>The results suggest that very few iterations, such as 1 or 2, might be as or more effective than 9-12 iterations on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;1 0.820 (0.072)\r\n&gt;2 0.813 (0.078)\r\n&gt;3 0.801 (0.066)\r\n&gt;4 0.817 (0.067)\r\n&gt;5 0.808 (0.071)\r\n&gt;6 0.799 (0.059)\r\n&gt;7 0.804 (0.058)\r\n&gt;8 0.809 (0.070)\r\n&gt;9 0.812 (0.068)\r\n&gt;10 0.800 (0.058)\r\n&gt;11 0.818 (0.064)\r\n&gt;12 0.810 (0.073)\r\n&gt;13 0.808 (0.073)\r\n&gt;14 0.799 (0.067)\r\n&gt;15 0.812 (0.075)\r\n&gt;16 0.814 (0.057)\r\n&gt;17 0.812 (0.060)\r\n&gt;18 0.810 (0.069)\r\n&gt;19 0.810 (0.057)\r\n&gt;20 0.802 (0.067)<\/pre>\n<p>At the end of the run, a box and whisker plot is created for each set of results, allowing the distribution of results to be compared.<\/p>\n<div id=\"attachment_10536\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-10536\" class=\"size-full wp-image-10536\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Number-of-Imputation-Iterations-on-the-Horse-Colic-Dataset.png\" alt=\"Box and Whisker Plot of Number of Imputation Iterations on the Horse Colic Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Number-of-Imputation-Iterations-on-the-Horse-Colic-Dataset.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Number-of-Imputation-Iterations-on-the-Horse-Colic-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Number-of-Imputation-Iterations-on-the-Horse-Colic-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Box-and-Whisker-Plot-of-Number-of-Imputation-Iterations-on-the-Horse-Colic-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-10536\" class=\"wp-caption-text\">Box and Whisker Plot of Number of Imputation Iterations on the Horse Colic Dataset<\/p>\n<\/div>\n<h3>IterativeImputer Transform When Making a Prediction<\/h3>\n<p>We may wish to create a final modeling pipeline with the iterative imputation and random forest algorithm, then make a prediction for new data.<\/p>\n<p>This can be achieved by defining the pipeline and fitting it on all available data, then calling the <em>predict()<\/em> function, passing new data in as an argument.<\/p>\n<p>Importantly, the row of new data must mark any missing values using the NaN value.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define new data\r\nrow = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000]<\/pre>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># iterative imputation strategy and prediction for the horse colic dataset\r\nfrom numpy import nan\r\nfrom pandas import read_csv\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.experimental import enable_iterative_imputer\r\nfrom sklearn.impute import IterativeImputer\r\nfrom sklearn.pipeline import Pipeline\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv'\r\ndataframe = 
read_csv(url, header=None, na_values='?')\r\n# split into input and output elements\r\ndata = dataframe.values\r\nX, y = data[:, :-1], data[:, -1]\r\n# create the modeling pipeline\r\npipeline = Pipeline(steps=[('i', IterativeImputer()), ('m', RandomForestClassifier())])\r\n# fit the model\r\npipeline.fit(X, y)\r\n# define new data\r\nrow = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000]\r\n# make a prediction\r\nyhat = pipeline.predict([row])\r\n# summarize prediction\r\nprint('Predicted Class: %d' % yhat[0])<\/pre>\n<p>Running the example fits the modeling pipeline on all available data.<\/p>\n<p>A new row of data is defined with missing values marked with NaNs and a classification prediction is made.<\/p>\n<pre class=\"crayon-plain-tag\">Predicted Class: 2<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Related Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/results-for-standard-classification-and-regression-machine-learning-datasets\/\">Results for Standard Classification and Regression Machine Learning Datasets<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/handle-missing-data-python\/\">How to Handle Missing Data with Python<\/a><\/li>\n<\/ul>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.jstatsoft.org\/article\/view\/v045i03\">mice: Multivariate Imputation by Chained Equations in R<\/a>, 2009.<\/li>\n<li><a href=\"https:\/\/www.jstor.org\/stable\/2984099?seq=1\">A Method of Estimation of Missing Values in Multivariate Data Suitable for use with an Electronic Computer<\/a>, 1960.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/impute.html\">Imputation of missing values, scikit-learn Documentation<\/a>.<\/li>\n<li><a 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.impute.IterativeImputer.html\">sklearn.impute.IterativeImputer API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.csv\">Horse Colic Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/horse-colic.names\">Horse Colic Dataset Description<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use iterative imputation strategies for missing data in machine learning.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Missing values must be marked with NaN values and can be replaced with iteratively estimated values.<\/li>\n<li>How to load a CSV file with missing values, mark the missing values with NaN, and report the number and percentage of missing values for each column.<\/li>\n<li>How to impute missing values with iterative models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/iterative-imputation-for-missing-values-in-machine-learning\/\">Iterative Imputation for Missing Values in Machine Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/iterative-imputation-for-missing-values-in-machine-learning\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Datasets may have missing values, and this can cause problems for many machine learning algorithms. 
As such, it is good practice to [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/06\/02\/iterative-imputation-for-missing-values-in-machine-learning\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3523,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3522"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3522"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3522\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3523"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3522"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3522"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3522"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}