{"id":2949,"date":"2019-12-19T18:00:38","date_gmt":"2019-12-19T18:00:38","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/19\/use-the-columntransformer-for-numerical-and-categorical-data-in-python\/"},"modified":"2019-12-19T18:00:38","modified_gmt":"2019-12-19T18:00:38","slug":"use-the-columntransformer-for-numerical-and-categorical-data-in-python","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/19\/use-the-columntransformer-for-numerical-and-categorical-data-in-python\/","title":{"rendered":"Use the ColumnTransformer for Numerical and Categorical Data in Python"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>You must prepare your raw data using data transforms prior to fitting a machine learning model.<\/p>\n<p>This is required to ensure that you best expose the structure of your predictive modeling problem to the learning algorithms.<\/p>\n<p>Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. 
It can be challenging when you have a dataset with mixed types and you want to selectively apply data transforms to some, but not all, input features.<\/p>\n<p>Thankfully, the scikit-learn Python machine learning library provides the <strong>ColumnTransformer<\/strong> that allows you to selectively apply data transforms to different columns in your dataset.<\/p>\n<p>In this tutorial, you will discover how to use the ColumnTransformer to selectively apply data transforms to columns in a dataset with mixed data types.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The challenge of using data transformations with datasets that have mixed data types.<\/li>\n<li>How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.<\/li>\n<li>How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to categorical and numerical data columns.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9260\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9260\" class=\"size-full wp-image-9260\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Use-the-ColumnTransformer-for-Numerical-and-Categorical-Data-in-Python.jpg\" alt=\"Use the ColumnTransformer for Numerical and Categorical Data in Python\" width=\"799\" height=\"533\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Use-the-ColumnTransformer-for-Numerical-and-Categorical-Data-in-Python.jpg 799w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Use-the-ColumnTransformer-for-Numerical-and-Categorical-Data-in-Python-300x200.jpg 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Use-the-ColumnTransformer-for-Numerical-and-Categorical-Data-in-Python-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 
799px\"><\/p>\n<p id=\"caption-attachment-9260\" class=\"wp-caption-text\">Use the ColumnTransformer for Numerical and Categorical Data in Python<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/designsbykari\/6205452745\/\">Kari<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Challenge of Transforming Different Data Types<\/li>\n<li>How to use the ColumnTransformer<\/li>\n<li>Data Preparation for the Abalone Regression Dataset<\/li>\n<\/ol>\n<h2>Challenge of Transforming Different Data Types<\/h2>\n<p>It is important to prepare data prior to modeling.<\/p>\n<p>This may involve replacing missing values, scaling numerical values, and one hot encoding categorical data.<\/p>\n<p>Data transforms can be performed using the scikit-learn library; for example, the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.impute.SimpleImputer.html\">SimpleImputer<\/a> class can be used to replace missing values, the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">MinMaxScaler<\/a> class can be used to scale numerical values, and the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">OneHotEncoder<\/a> can be used to encode categorical variables.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# prepare transform\r\nscaler = MinMaxScaler()\r\n# fit transform on training data\r\nscaler.fit(train_X)\r\n# transform training data\r\ntrain_X = scaler.transform(train_X)<\/pre>\n<p>Sequences of different transforms can also be chained together using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a>, such as imputing missing values, then scaling numerical values.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define pipeline\r\npipeline = 
Pipeline(steps=[('i', SimpleImputer(strategy='median')), ('s', MinMaxScaler())])\r\n# transform training data\r\ntrain_X = pipeline.fit_transform(train_X)<\/pre>\n<p>It is very common to want to perform different data preparation techniques on different columns in your input data.<\/p>\n<p>For example, you may want to impute missing numerical values with a median value and then scale them, and impute missing categorical values using the most frequent value and then one hot encode the categories.<\/p>\n<p>Traditionally, this would require you to separate the numerical and categorical data and then manually apply the transforms on those groups of features before combining the columns back together in order to fit and evaluate a model.<\/p>\n<p>Now, you can use the ColumnTransformer to perform this operation for you.<\/p>\n<h2>How to use the ColumnTransformer<\/h2>\n<p>The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">ColumnTransformer<\/a> is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.<\/p>\n<p>For example, it allows you to apply a specific transform or sequence of transforms to just the numerical columns, and a separate sequence of transforms to just the categorical columns.<\/p>\n<p>To use the ColumnTransformer, you must specify a list of transformers.<\/p>\n<p>Each transformer is a three-element tuple that defines the name of the transformer, the transform to apply, and the column indices to apply it to. 
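<\/p>\n<p>Before looking at examples of the tuple format, it may help to see the manual alternative described above. The sketch below is purely illustrative (the tiny array and its column layout are made up, not from this tutorial): it separates the numerical and categorical groups, transforms each group, then recombines the columns with NumPy:<\/p>\n<pre class=\"crayon-plain-tag\"># manual alternative to the ColumnTransformer (illustrative sketch)\r\nfrom numpy import array, hstack\r\nfrom sklearn.preprocessing import MinMaxScaler, OneHotEncoder\r\n\r\n# columns 0 and 1 are numerical, columns 2 and 3 are categorical\r\ndata = array([[1.0, 10.0, 'a', 'x'], [2.0, 20.0, 'b', 'y'], [3.0, 30.0, 'a', 'x']], dtype=object)\r\n# separate the groups of columns\r\nX_num = data[:, :2].astype(float)\r\nX_cat = data[:, 2:]\r\n# scale the numerical group\r\nX_num = MinMaxScaler().fit_transform(X_num)\r\n# one hot encode the categorical group\r\nX_cat = OneHotEncoder().fit_transform(X_cat).toarray()\r\n# combine the columns back together to fit a model\r\nX = hstack((X_num, X_cat))\r\nprint(X.shape)<\/pre>\n<p>This quickly becomes error-prone as the number of column groups and transforms grows, which is exactly the bookkeeping the ColumnTransformer automates. Returning to the transformer tuples, each entry takes the following form. 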
For example:<\/p>\n<ul>\n<li>(Name, Object, Columns)<\/li>\n<\/ul>\n<p>For example, the ColumnTransformer below applies a OneHotEncoder to columns 0 and 1.<\/p>\n<pre class=\"crayon-plain-tag\">transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])<\/pre>\n<p>The example below applies a SimpleImputer with median imputing for numerical columns 0 and 1, and SimpleImputer with most frequent imputing to categorical columns 2 and 3.<\/p>\n<pre class=\"crayon-plain-tag\">t = [('num', SimpleImputer(strategy='median'), [0, 1]), ('cat', SimpleImputer(strategy='most_frequent'), [2, 3])]\r\ntransformer = ColumnTransformer(transformers=t)<\/pre>\n<p>Any columns not specified in the list of \u201c<em>transformers<\/em>\u201d are dropped from the dataset by default; this can be changed by setting the \u201c<em>remainder<\/em>\u201d argument.<\/p>\n<p>Setting <em>remainder=\u2019passthrough\u2019<\/em> will mean that all columns not specified in the list of \u201c<em>transformers<\/em>\u201d will be passed through without transformation, instead of being dropped.<\/p>\n<p>For example, if columns 0 and 1 were numerical and columns 2 and 3 were categorical and we wanted to just transform the categorical data and pass through the numerical columns unchanged, we could define the ColumnTransformer as follows:<\/p>\n<pre class=\"crayon-plain-tag\">transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [2, 3])], remainder='passthrough')<\/pre>\n<p>Once the transformer is defined, it can be used to transform a dataset.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\ntransformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])\r\n# transform training data\r\ntrain_X = transformer.fit_transform(train_X)<\/pre>\n<p>A ColumnTransformer can also be used in a Pipeline to selectively prepare the columns of your dataset before fitting a model on the transformed data.<\/p>\n<p>This is the most likely use case as it 
ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a test dataset via cross-validation or making predictions on new data in the future.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model\r\nmodel = LogisticRegression()\r\n# define transform\r\ntransformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])\r\n# define pipeline\r\npipeline = Pipeline(steps=[('t', transformer), ('m', model)])\r\n# fit the pipeline (transform plus model) on the raw training data\r\npipeline.fit(train_X, train_y)\r\n# make predictions\r\nyhat = pipeline.predict(test_X)<\/pre>\n<p>Now that we are familiar with how to configure and use the ColumnTransformer in general, let\u2019s look at a worked example.<\/p>\n<h2>Data Preparation for the Abalone Regression Dataset<\/h2>\n<p>The abalone dataset is a standard machine learning problem that involves predicting the age of an abalone from its physical measurements.<\/p>\n<p>You can download the dataset and learn more about it here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.csv\">Download Abalone Dataset (abalone.csv)<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.names\">Learn More About the Abalone Dataset (abalone.names)<\/a><\/li>\n<\/ul>\n<p>The dataset has 4,177 examples, 8 input variables, and the target variable is an integer.<\/p>\n<p>A naive model can achieve a mean absolute error (MAE) of about 2.363 (std 0.092) by predicting the mean value, evaluated via 10-fold cross-validation.<\/p>\n<p>We can model this as a regression predictive modeling problem with a support vector machine model (<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.SVR.html\">SVR<\/a>).<\/p>\n<p>Reviewing the data, you can see the first few rows as follows:<\/p>\n<pre 
class=\"crayon-plain-tag\">M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\r\nM,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\r\nF,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\r\nM,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10\r\nI,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7\r\n...<\/pre>\n<p>We can see that the first column is categorical and the remainder of the columns are numerical.<\/p>\n<p>We may want to one hot encode the first column and normalize the remaining numerical columns, and this can be achieved using the ColumnTransformer.<\/p>\n<p>First, we need to load the dataset. We can load the dataset directly from the URL using the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.read_csv.html\">read_csv()<\/a> Pandas function, then split the data into two data frames: one for input and one for the output.<\/p>\n<p>The complete example of loading the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\nfrom pandas import read_csv\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.csv'\r\ndataframe = read_csv(url, header=None)\r\n# split into inputs and outputs\r\nlast_ix = len(dataframe.columns) - 1\r\nX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\nprint(X.shape, y.shape)<\/pre>\n<p><strong>Note<\/strong>: if you have trouble loading the dataset from a URL, you can download the CSV file with the name \u2018<em>abalone.csv<\/em>\u2018 and place it in the same directory as your Python file and change the call to <em>read_csv()<\/em> as follows:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\ndataframe = read_csv('abalone.csv', header=None)<\/pre>\n<p>Running the example, we can see that the dataset is loaded correctly and split into eight input columns and one target column.<\/p>\n<pre class=\"crayon-plain-tag\">(4177, 8) (4177,)<\/pre>\n<p>Next, we can use the <em>select_dtypes()<\/em> function to select the column indexes that match 
different data types.<\/p>\n<p>We are interested in a list of columns that are numerical columns marked as \u2018<em>float64<\/em>\u2018 or \u2018<em>int64<\/em>\u2018 in Pandas, and a list of categorical columns, marked as \u2018<em>object<\/em>\u2018 or \u2018<em>bool<\/em>\u2018 type in Pandas.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# determine categorical and numerical features\r\nnumerical_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\ncategorical_ix = X.select_dtypes(include=['object', 'bool']).columns<\/pre>\n<p>We can then use these lists in the ColumnTransformer to one hot encode the categorical variables, which should just be the first column.<\/p>\n<p>We can also use the list of numerical columns to normalize the remaining data.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the data preparation for the columns\r\nt = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]\r\ncol_transform = ColumnTransformer(transformers=t)<\/pre>\n<p>Next, we can define our SVR model and define a Pipeline that first uses the ColumnTransformer, then fits the model on the prepared dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the model\r\nmodel = SVR(kernel='rbf',gamma='scale',C=100)\r\n# define the data preparation and modeling pipeline\r\npipeline = Pipeline(steps=[('prep',col_transform), ('m', model)])<\/pre>\n<p>Finally, we can evaluate the model using 10-fold cross-validation and calculate the mean absolute error, averaged across all 10 evaluations of the pipeline.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the model cross-validation configuration\r\ncv = KFold(n_splits=10, shuffle=True, random_state=1)\r\n# evaluate the pipeline using cross validation and calculate MAE\r\nscores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)\r\n# convert MAE scores to positive values\r\nscores = absolute(scores)\r\n# summarize the model 
performance\r\nprint('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this all together, the complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of using the ColumnTransformer for the Abalone dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom numpy import absolute\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import KFold\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.svm import SVR\r\n\r\n# load dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/abalone.csv'\r\ndataframe = read_csv(url, header=None)\r\n# split into inputs and outputs\r\nlast_ix = len(dataframe.columns) - 1\r\nX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\nprint(X.shape, y.shape)\r\n# determine categorical and numerical features\r\nnumerical_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\ncategorical_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n# define the data preparation for the columns\r\nt = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]\r\ncol_transform = ColumnTransformer(transformers=t)\r\n# define the model\r\nmodel = SVR(kernel='rbf',gamma='scale',C=100)\r\n# define the data preparation and modeling pipeline\r\npipeline = Pipeline(steps=[('prep',col_transform), ('m', model)])\r\n# define the model cross-validation configuration\r\ncv = KFold(n_splits=10, shuffle=True, random_state=1)\r\n# evaluate the pipeline using cross validation and calculate MAE\r\nscores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)\r\n# convert MAE scores to positive values\r\nscores = absolute(scores)\r\n# summarize the model performance\r\nprint('MAE: %.3f (%.3f)' % 
(mean(scores), std(scores)))<\/pre>\n<p>Running the example evaluates the data preparation pipeline using 10-fold cross-validation.<\/p>\n<p>Your specific results may vary slightly given differences in library versions and numerical precision.<\/p>\n<p>In this case, we achieve an average MAE of about 1.465, which is better than the naive baseline MAE of about 2.363.<\/p>\n<pre class=\"crayon-plain-tag\">(4177, 8) (4177,)\r\nMAE: 1.465 (0.047)<\/pre>\n<p>You now have a template for using the ColumnTransformer on a dataset with mixed data types that you can use and adapt for your own projects in the future.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">sklearn.compose.ColumnTransformer API<\/a>.<\/li>\n<li><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">pandas.read_csv API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.impute.SimpleImputer.html\">sklearn.impute.SimpleImputer API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">sklearn.preprocessing.OneHotEncoder API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">sklearn.preprocessing.MinMaxScaler API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">sklearn.pipeline.Pipeline API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use the ColumnTransformer to selectively apply data transforms to columns in datasets with mixed data types.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The challenge of using data transformations with datasets that have mixed data 
types.<\/li>\n<li>How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.<\/li>\n<li>How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to categorical and numerical data columns.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/columntransformer-for-numerical-and-categorical-data\/\">Use the ColumnTransformer for Numerical and Categorical Data in Python<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/columntransformer-for-numerical-and-categorical-data\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee You must prepare your raw data using data transforms prior to fitting a machine learning model. 
This is required to ensure that [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/19\/use-the-columntransformer-for-numerical-and-categorical-data-in-python\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2950,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2949"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2949"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2949\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2950"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2949"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2949"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2949"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}