{"id":2928,"date":"2019-12-15T18:00:15","date_gmt":"2019-12-15T18:00:15","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/15\/how-to-transform-target-variables-for-regression-with-scikit-learn\/"},"modified":"2019-12-15T18:00:15","modified_gmt":"2019-12-15T18:00:15","slug":"how-to-transform-target-variables-for-regression-with-scikit-learn","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/15\/how-to-transform-target-variables-for-regression-with-scikit-learn\/","title":{"rendered":"How to Transform Target Variables for Regression With Scikit-Learn"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Data preparation is a big part of applied machine learning.<\/p>\n<p>Correctly preparing your training data can mean the difference between mediocre and extraordinary results, even with very simple linear algorithms.<\/p>\n<p>Performing data preparation operations, such as scaling, is relatively straightforward for input variables and has been made routine in Python via the Pipeline scikit-learn class.<\/p>\n<p>On regression predictive modeling problems where a numerical value must be predicted, it can also be critical to scale and perform other data transformations on the target variable. 
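As a quick taste of what this involves (a toy sketch added for illustration, with made-up values; it assumes scikit-learn is installed), normalizing a target variable maps its raw range onto 0 to 1, and the transform must be inverted before predictions can be reported in the original units:

```python
# Toy sketch: normalize a target variable and invert the transform.
from numpy import array, allclose
from sklearn.preprocessing import MinMaxScaler

y = array([15.0, 21.6, 34.7, 50.0])  # illustrative target values
scaler = MinMaxScaler()
# scikit-learn scalers expect 2D input, hence the reshape
y_scaled = scaler.fit_transform(y.reshape(-1, 1))
print(y_scaled.ravel())  # values now lie between 0 and 1
# map model outputs back to the original units
y_restored = scaler.inverse_transform(y_scaled)
assert allclose(y_restored.ravel(), y)
```

Keeping such a transform of the target in sync with the model by hand is tedious.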
This can be achieved in Python using the <strong>TransformedTargetRegressor<\/strong> class.<\/p>\n<p>In this tutorial, you will discover how to use the TransformedTargetRegressor to scale and transform target variables for regression using the scikit-learn Python machine learning library.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The importance of scaling input and target data for machine learning.<\/li>\n<li>The two approaches to applying data transforms to target variables.<\/li>\n<li>How to use the TransformedTargetRegressor on a real regression dataset.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9217\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9217\" class=\"size-full wp-image-9217\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/How-to-Transform-Target-Variables-for-Regression-With-Scikit-Learn.jpg\" alt=\"How to Transform Target Variables for Regression With Scikit-Learn\" width=\"799\" height=\"533\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/How-to-Transform-Target-Variables-for-Regression-With-Scikit-Learn.jpg 799w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/How-to-Transform-Target-Variables-for-Regression-With-Scikit-Learn-300x200.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/How-to-Transform-Target-Variables-for-Regression-With-Scikit-Learn-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-9217\" class=\"wp-caption-text\">How to Transform Target Variables for Regression With Scikit-Learn<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/kiskadee_3\/37926034661\/\">Don Henise<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Importance of Data 
Scaling<\/li>\n<li>How to Scale Target Variables<\/li>\n<li>Example of Using the TransformedTargetRegressor<\/li>\n<\/ol>\n<h2>Importance of Data Scaling<\/h2>\n<p>It is common to have data where the scale of values differs from variable to variable.<\/p>\n<p>For example, one variable may be in feet, another in meters, and so on.<\/p>\n<p>Some machine learning algorithms perform much better if all of the variables are scaled to the same range, such as scaling all variables to values between 0 and 1, called normalization.<\/p>\n<p>This affects algorithms that use a weighted sum of the input, like linear models and neural networks, as well as models that use distance measures, such as support vector machines and k-nearest neighbors.<\/p>\n<p>As such, it is a good practice to scale input data, and perhaps even try other data transforms, such as making the data more normal (a better fit for a <a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Gaussian probability distribution<\/a>) using a power transform.<\/p>\n<p>This also applies to output variables, called target variables, such as the numerical values predicted on regression predictive modeling problems.<\/p>\n<p>For regression problems, it is often desirable to scale or transform both the input and the target variables.<\/p>\n<p>Scaling input variables is straightforward. 
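A minimal sketch of the idea (illustrative data, not from the original post; assumes scikit-learn is installed) is to fit the scale object on the training data only and then apply the same mapping to both splits:

```python
# Minimal sketch: fit the scaler on training data, apply to train and test.
from numpy import array
from sklearn.preprocessing import MinMaxScaler

train_X = array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test_X = array([[1.5, 200.0]])

scaler = MinMaxScaler()
scaler.fit(train_X)                # learn min/max from training data only
train_X = scaler.transform(train_X)
test_X = scaler.transform(test_X)  # same mapping, no information leakage
```

Fitting on the training split only avoids leaking information from the test set into the transform.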
In scikit-learn, you can use the scale objects manually, or the more convenient Pipeline that allows you to chain a series of data transform objects together before using your model.<\/p>\n<p>The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a> will fit the scale objects on the training data for you and apply the transform to new data, such as when using a model to make a prediction.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# prepare the model with input scaling\r\npipeline = Pipeline(steps=[('normalize', MinMaxScaler()), ('model', LinearRegression())])\r\n# fit pipeline\r\npipeline.fit(train_x, train_y)\r\n# make predictions\r\nyhat = pipeline.predict(test_x)<\/pre>\n<p>The challenge is, what is the equivalent mechanism to scale target variables in scikit-learn?<\/p>\n<h2>How to Scale Target Variables<\/h2>\n<p>There are two ways that you can scale target variables.<\/p>\n<p>The first is to manually manage the transform, and the second is to use a new automatic way for managing the transform.<\/p>\n<ol>\n<li>Manually transform the target variable.<\/li>\n<li>Automatically transform the target variable.<\/li>\n<\/ol>\n<h3>1. Manual Transform of the Target Variable<\/h3>\n<p>Manually managing the scaling of the target variable involves creating and applying the scaling object to the data manually.<\/p>\n<p>It involves the following steps:<\/p>\n<ol>\n<li>Create the transform object, e.g. 
a MinMaxScaler.<\/li>\n<li>Fit the transform on the training dataset.<\/li>\n<li>Apply the transform to the train and test datasets.<\/li>\n<li>Invert the transform on any predictions made.<\/li>\n<\/ol>\n<p>For example, if we wanted to normalize a target variable, we would first define and fit a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">MinMaxScaler object<\/a>. Note that scikit-learn scale objects expect 2D arrays, so a 1D target must first be reshaped into a column vector:<\/p>\n<pre class=\"crayon-plain-tag\"># create target scaler object\r\n...\r\n# reshape 1D targets into 2D column vectors for the scaler\r\ntrain_y = train_y.reshape(-1, 1)\r\ntest_y = test_y.reshape(-1, 1)\r\ntarget_scaler = MinMaxScaler()\r\ntarget_scaler.fit(train_y)<\/pre>\n<p>We would then transform the train and test target variable data.<\/p>\n<pre class=\"crayon-plain-tag\"># transform target variables\r\n...\r\ntrain_y = target_scaler.transform(train_y)\r\ntest_y = target_scaler.transform(test_y)<\/pre>\n<p>Then we would fit our model and use the model to make predictions.<\/p>\n<p>Before the predictions can be used or evaluated with an error metric, we would have to invert the transform.<\/p>\n<pre class=\"crayon-plain-tag\"># invert transform on predictions\r\n...\r\nyhat = model.predict(test_X)\r\n# predictions may be 1D; reshape them before inverting the transform\r\nyhat = target_scaler.inverse_transform(yhat.reshape(-1, 1))<\/pre>\n<p>This is a pain, as it means you cannot use convenience functions in scikit-learn, such as <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.cross_val_score.html\">cross_val_score()<\/a>, to quickly evaluate a model.<\/p>\n<h3>2. 
Automatic Transform of the Target Variable<\/h3>\n<p>An alternate approach is to automatically manage the transform and inverse transform.<\/p>\n<p>This can be achieved by using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.TransformedTargetRegressor.html\">TransformedTargetRegressor<\/a> object that wraps a given model and a scaling object.<\/p>\n<p>It will fit the transform on the target variable using the same training data used to fit the model, then automatically apply the inverse transform to any predictions made when calling predict(), returning predictions in the original scale.<\/p>\n<p>To use the TransformedTargetRegressor, define it by specifying the model and the transform object to use on the target; for example:<\/p>\n<pre class=\"crayon-plain-tag\"># define the target transform wrapper\r\nwrapped_model = TransformedTargetRegressor(regressor=model, transformer=MinMaxScaler())<\/pre>\n<p>Later, the TransformedTargetRegressor instance can be fit like any other model by calling the fit() function and used to make predictions by calling the predict() function.<\/p>\n<pre class=\"crayon-plain-tag\"># use the target transform wrapper\r\n...\r\nwrapped_model.fit(train_X, train_y)\r\nyhat = wrapped_model.predict(test_X)<\/pre>\n<p>This is much easier and allows you to use helpful functions like <em>cross_val_score()<\/em> to evaluate a model.<\/p>\n<p>Now that we are familiar with the TransformedTargetRegressor, let\u2019s look at an example of using it on a real dataset.<\/p>\n<h2>Example of Using the TransformedTargetRegressor<\/h2>\n<p>In this section, we will demonstrate how to use the TransformedTargetRegressor on a real dataset.<\/p>\n<p>We will use the Boston housing regression problem that has 13 inputs and one numerical target and requires learning the relationship between suburb characteristics and house prices.<\/p>\n<p>The dataset can be downloaded from here:<\/p>\n<ul>\n<li><a 
href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.csv\">Boston Housing Dataset (housing.csv)<\/a><\/li>\n<\/ul>\n<p>Download the dataset and save it in your current working directory with the name \u201c<em>housing.csv<\/em>\u201d.<\/p>\n<p>Looking at the dataset, you should see that all variables are numeric.<\/p>\n<pre class=\"crayon-plain-tag\">0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00\r\n0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60\r\n0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70\r\n0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40\r\n0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20\r\n...<\/pre>\n<p>You can learn more about this dataset and the meanings of the columns here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.names\">Boston Housing Data Details (housing.names)<\/a><\/li>\n<\/ul>\n<p>We can confirm that the dataset can be loaded correctly as a NumPy array and split it into input and output variables.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the dataset\r\nfrom numpy import loadtxt\r\n# load data\r\ndataset = loadtxt('housing.csv', delimiter=\",\")\r\n# split into inputs and outputs\r\nX, y = dataset[:, :-1], dataset[:, -1]\r\n# summarize dataset\r\nprint(X.shape, y.shape)<\/pre>\n<p>Running the example prints the shape of the input and output parts of the dataset, showing 13 input variables, one output variable, and 506 rows of data.<\/p>\n<pre class=\"crayon-plain-tag\">(506, 13) (506,)<\/pre>\n<p>We can now prepare an example of using the TransformedTargetRegressor.<\/p>\n<p>A <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyRegressor.html\">naive regression<\/a> model that predicts the mean value of 
the target on this problem can achieve a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.mean_absolute_error.html\">mean absolute error<\/a> (MAE) of about 6.659. We will aim to do better.<\/p>\n<p>In this example, we will fit a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.HuberRegressor.html\">HuberRegressor<\/a> object and normalize the input variables using a Pipeline.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# prepare the model with input scaling\r\npipeline = Pipeline(steps=[('normalize', MinMaxScaler()), ('model', HuberRegressor())])<\/pre>\n<p>Next, we will define a TransformedTargetRegressor instance and set the regressor to the pipeline and the transformer to an instance of a MinMaxScaler object.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# prepare the model with target scaling\r\nmodel = TransformedTargetRegressor(regressor=pipeline, transformer=MinMaxScaler())<\/pre>\n<p>We can then evaluate the model with normalization of the input and output variables using 10-fold cross-validation.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate model\r\ncv = KFold(n_splits=10, shuffle=True, random_state=1)\r\nscores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)<\/pre>\n<p>Tying this all together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of normalizing input and output variables for regression.\r\nfrom numpy import mean\r\nfrom numpy import absolute\r\nfrom numpy import loadtxt\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import KFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import HuberRegressor\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import TransformedTargetRegressor\r\n# load data\r\ndataset = loadtxt('housing.csv', delimiter=\",\")\r\n# split into inputs and outputs\r\nX, y = dataset[:, 
:-1], dataset[:, -1]\r\n# prepare the model with input scaling\r\npipeline = Pipeline(steps=[('normalize', MinMaxScaler()), ('model', HuberRegressor())])\r\n# prepare the model with target scaling\r\nmodel = TransformedTargetRegressor(regressor=pipeline, transformer=MinMaxScaler())\r\n# evaluate model\r\ncv = KFold(n_splits=10, shuffle=True, random_state=1)\r\nscores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)\r\n# convert scores to positive\r\nscores = absolute(scores)\r\n# summarize the result\r\ns_mean = mean(scores)\r\nprint('Mean MAE: %.3f' % (s_mean))<\/pre>\n<p>Running the example evaluates the model with normalization of the input and output variables.<\/p>\n<p>Your specific results may vary given the stochastic learning algorithm and differences in library versions.<\/p>\n<p>In this case, we achieve a MAE of about 3.1, much better than a naive model that achieved about 6.6.<\/p>\n<pre class=\"crayon-plain-tag\">Mean MAE: 3.191<\/pre>\n<p>We are not restricted to using scaling objects; for example, we can also explore using other data transforms on the target variable, such as the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.PowerTransformer.html\">PowerTransformer<\/a>, that can make each variable more-Gaussian-like (using the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Power_transform\">Yeo-Johnson transform<\/a>) and improve the performance of linear models.<\/p>\n<p>By default, the PowerTransformer also performs a standardization of each variable after performing the transform.<\/p>\n<p>The complete example of using a PowerTransformer on the input and target variables of the housing dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of power transform input and output variables for regression.\r\nfrom numpy import mean\r\nfrom numpy import absolute\r\nfrom numpy import loadtxt\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom 
sklearn.model_selection import KFold\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.linear_model import HuberRegressor\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.compose import TransformedTargetRegressor\r\n# load data\r\ndataset = loadtxt('housing.csv', delimiter=\",\")\r\n# split into inputs and outputs\r\nX, y = dataset[:, :-1], dataset[:, -1]\r\n# prepare the model with input scaling\r\npipeline = Pipeline(steps=[('power', PowerTransformer()), ('model', HuberRegressor())])\r\n# prepare the model with target scaling\r\nmodel = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())\r\n# evaluate model\r\ncv = KFold(n_splits=10, shuffle=True, random_state=1)\r\nscores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)\r\n# convert scores to positive\r\nscores = absolute(scores)\r\n# summarize the result\r\ns_mean = mean(scores)\r\nprint('Mean MAE: %.3f' % (s_mean))<\/pre>\n<p>Running the example evaluates the model with a power transform of the input and output variables.<\/p>\n<p>Your specific results may vary given the stochastic learning algorithm and differences in library versions.<\/p>\n<p>In this case, we see further improvement to a MAE of about 2.9.<\/p>\n<pre class=\"crayon-plain-tag\">Mean MAE: 2.926<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/compose.html#transforming-target-in-regression\">Transforming target in regression scikit-learn API.<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.TransformedTargetRegressor.html\">sklearn.compose.TransformedTargetRegressor API.<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">sklearn.preprocessing.MinMaxScaler 
API.<\/a><\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.PowerTransformer.html\">sklearn.preprocessing.PowerTransformer API.<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.csv\">Boston Housing Dataset (housing.csv)<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/housing.names\">Boston Housing Data Details (housing.names)<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use the TransformedTargetRegressor to scale and transform target variables for regression in scikit-learn.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The importance of scaling input and target data for machine learning.<\/li>\n<li>The two approaches to applying data transforms to target variables.<\/li>\n<li>How to use the TransformedTargetRegressor on a real regression dataset.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-transform-target-variables-for-regression-with-scikit-learn\/\">How to Transform Target Variables for Regression With Scikit-Learn<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-transform-target-variables-for-regression-with-scikit-learn\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Data preparation is a big part of applied machine learning. 
Correctly preparing your training data can mean the difference between mediocre and [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/15\/how-to-transform-target-variables-for-regression-with-scikit-learn\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2929,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2928"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2928"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2928\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2929"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2928"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2928"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2928"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}