{"id":2852,"date":"2019-11-24T18:00:57","date_gmt":"2019-11-24T18:00:57","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/11\/24\/how-to-perform-feature-selection-with-categorical-data\/"},"modified":"2019-11-24T18:00:57","modified_gmt":"2019-11-24T18:00:57","slug":"how-to-perform-feature-selection-with-categorical-data","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/11\/24\/how-to-perform-feature-selection-with-categorical-data\/","title":{"rendered":"How to Perform Feature Selection with Categorical Data"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/an-introduction-to-feature-selection\/\">Feature selection<\/a> is the process of identifying and selecting a subset of input features that are most relevant to the target variable.<\/p>\n<p>Feature selection is often straightforward when working with real-valued data, such as using the Pearson\u2019s correlation coefficient, but can be challenging when working with categorical data.<\/p>\n<p>The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the <a href=\"https:\/\/machinelearningmastery.com\/chi-squared-test-for-machine-learning\/\">chi-squared statistic<\/a> and the <a href=\"https:\/\/machinelearningmastery.com\/information-gain-and-mutual-information\">mutual information statistic<\/a>.<\/p>\n<p>In this tutorial, you will discover how to perform feature selection with categorical input data.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.<\/li>\n<li>How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.<\/li>\n<li>How to perform feature selection for categorical data when fitting and evaluating a classification model.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9076\" style=\"width: 649px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9076\" class=\"size-full wp-image-9076\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/How-to-Perform-Feature-Selection-with-Categorical-Data.jpg\" alt=\"How to Perform Feature Selection with Categorical Data\" width=\"639\" height=\"254\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/How-to-Perform-Feature-Selection-with-Categorical-Data.jpg 639w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/How-to-Perform-Feature-Selection-with-Categorical-Data-300x119.jpg 300w\" sizes=\"(max-width: 639px) 100vw, 639px\"><\/p>\n<p id=\"caption-attachment-9076\" class=\"wp-caption-text\">How to Perform Feature Selection with Categorical Data<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/126654539@N08\/16021168888\/\">Phil Dolby<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Breast Cancer Categorical Dataset<\/li>\n<li>Categorical Feature Selection<\/li>\n<li>Modeling With Selected Features<\/li>\n<\/ol>\n<h2>Breast Cancer Categorical Dataset<\/h2>\n<p>As the basis of this tutorial, we will use the so-called \u201c<a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Breast+Cancer\">Breast cancer<\/a>\u201d 
dataset that has been widely studied as a machine learning dataset since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A naive model can achieve an accuracy of 70% on this dataset. A good score is about 76% +/- 3%. We will aim for this region, but note that the models in this tutorial are not optimized; they are designed to demonstrate the feature selection methods.

You can download the dataset and save the file as "*breast-cancer.csv*" in your current working directory.

- [Breast Cancer Dataset (breast-cancer.csv)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv)

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

```
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...
```

We can load this dataset into memory using the Pandas library.

```python
...
# load the dataset as a pandas DataFrame
data = read_csv(filename, header=None)
# retrieve numpy array
dataset = data.values
```

Once loaded, we can split the columns into input (*X*) and output (*y*) variables for modeling.

```python
...
# split into input (X) and output (y) variables
X = dataset[:, :-1]
y = dataset[:, -1]
```

Finally, we can force all fields in the input data to be string, just in case Pandas tried to map some automatically to numbers (it does try).

```python
...
# format all fields as string
X = X.astype(str)
```

We can tie all of this together into a helpful function that we can reuse later.

```python
# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y
```

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a learning model.

We will use the [train_test_split() function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from scikit-learn and use 67% of the data for training and 33% for testing.

```python
...
# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
```

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.
```python
# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
```

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.

```
Train (191, 9) (191, 1)
Test (95, 9) (95, 1)
```

Now that we are familiar with the dataset, let's look at how we can encode it for modeling.

We can use the [OrdinalEncoder() from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

**Note**: I will leave it as an exercise to you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance. A sketch of how such an ordering can be passed to the encoder follows.
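As a starting point for that exercise, the minimal sketch below shows how an explicit ordering can be supplied via the `categories` argument for a single column. The age ranges listed are an assumption read off the raw file (where the quote characters are part of each value); verify them against your copy of the data.

```python
# a sketch of specifying an explicit category order (not part of the
# tutorial's pipeline); the age ranges below are assumed from the raw file,
# where the quote characters are part of each value
from sklearn.preprocessing import OrdinalEncoder

age_order = ["'20-29'", "'30-39'", "'40-49'", "'50-59'", "'60-69'", "'70-79'"]
oe = OrdinalEncoder(categories=[age_order])
# encode a single column of age values; the integers now follow the given order
print(oe.fit_transform([["'30-39'"], ["'60-69'"]]))  # [[1.] [4.]]
```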
The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below named *prepare_inputs()* takes the input data for the train and test sets and encodes it using an ordinal encoding.

```python
# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc
```

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1. This is a type of ordinal encoding, and scikit-learn provides the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class specifically designed for this purpose. We could just as easily use the *OrdinalEncoder* and achieve the same result, although the *LabelEncoder* is designed for encoding a single variable.

The *prepare_targets()* function integer encodes the output data for the train and test sets.

```python
# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc
```

We can call these functions to prepare our data.

```python
...
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
```

Tying this all together, the complete example of loading and encoding the input and output variables for the breast cancer categorical dataset is listed below.

```python
# example of loading and preparing the breast cancer dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
```

Now that we have loaded and prepared the breast cancer dataset, we can explore feature selection.

## Categorical Feature Selection

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

- Chi-Squared Statistic.
- Mutual Information Statistic.

Let's take a closer look at each in turn.

### Chi-Squared Feature Selection

Pearson's chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

You can learn more about this statistical test in the tutorial:

- [A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset. A small illustrative sketch of the independence test itself follows.
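To make the test concrete, the sketch below applies the chi-squared test of independence to a contingency table of one categorical feature against the class label. It uses SciPy (which the tutorial itself does not import), and the counts are invented for illustration, not taken from the dataset.

```python
# a sketch of the chi-squared independence test on a made-up 2x3
# contingency table (rows: class labels, columns: feature categories)
from scipy.stats import chi2_contingency

table = [[30, 14, 6],
         [18, 22, 10]]
stat, p, dof, expected = chi2_contingency(table)
print('stat=%.3f, p=%.3f, dof=%d' % (stat, p, dof))
# a small p-value suggests the feature and the class label are dependent,
# i.e. the feature may be worth keeping
```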
The scikit-learn machine learning library provides an implementation of the chi-squared test in the [chi2() function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html). This function can be used in a feature selection strategy, such as selecting the top *k* most relevant features (largest values) via the [SelectKBest class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html).

For example, we can define the *SelectKBest* class to use the *chi2()* function and select all features, then transform the train and test sets.

```python
...
fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)
```

We can then print the scores for each variable (largest is better), and plot the scores for each variable as a bar graph to get an idea of how many features we should select.

```python
...
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
```

Tying this together with the data preparation for the breast cancer dataset in the previous section, the complete example is listed below.

```python
# example of chi squared feature selection for categorical data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from matplotlib import pyplot

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=chi2, k='all')
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
```
Running the example first prints the scores calculated for each input feature and the target variable.

**Note**: your specific results may differ. Try running the example a few times.

In this case, we can see the scores are small and it is hard to get an idea from the number alone as to which features are more relevant.

Perhaps features 3, 4, 5, and 8 are most relevant.

```
Feature 0: 0.472553
Feature 1: 0.029193
Feature 2: 2.137658
Feature 3: 29.381059
Feature 4: 8.222601
Feature 5: 8.100183
Feature 6: 1.273822
Feature 7: 0.950682
Feature 8: 3.699989
```

A bar chart of the feature importance scores for each input feature is created.

This clearly shows that feature 3 might be the most relevant (according to chi-squared) and that perhaps four of the nine input features are the most relevant.

We could set k=4 when configuring the *SelectKBest* to select these top four features.

*Bar Chart of the Input Features (x) vs the Chi-Squared Feature Importance (y)*

### Mutual Information Feature Selection

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

You can learn more about mutual information in the following tutorial.

- [What Is Information Gain and Mutual Information for Machine Learning](https://machinelearningmastery.com/information-gain-and-mutual-information)

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the [mutual_info_classif() function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html). A small standalone sketch of the quantity itself follows.
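As an illustration of what is being measured, the sketch below computes the mutual information between one integer-encoded categorical variable and a class label using scikit-learn's `mutual_info_score`; the two arrays are invented for the example, not drawn from the dataset.

```python
# a sketch of mutual information between two categorical variables; the
# integer-encoded values below are made up for illustration
from sklearn.metrics import mutual_info_score

feature = [0, 0, 1, 1, 2, 2, 0, 1]
target = [0, 0, 0, 1, 1, 1, 0, 1]
# larger values mean knowing the feature tells us more about the class
print('MI=%.3f' % mutual_info_score(feature, target))
```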
Like *chi2()*, it can be used in the *SelectKBest* feature selection strategy (and other strategies).

```python
# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=mutual_info_classif, k='all')
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs
```

We can perform feature selection using mutual information on the breast cancer dataset and print and plot the scores (larger is better) as we did in the previous section.

The complete example of using mutual information for categorical feature selection is listed below.

```python
# example of mutual information feature selection for categorical data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from matplotlib import pyplot

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=mutual_info_classif, k='all')
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
```

Running the example first prints the scores calculated for each input feature and the target variable.

**Note**: your specific results may differ. Try running the example a few times.
In this case, we can see that some of the features have a very low score, suggesting that perhaps they can be removed.

Perhaps features 3, 6, 2, and 5 are most relevant.

```
Feature 0: 0.003588
Feature 1: 0.000000
Feature 2: 0.025934
Feature 3: 0.071461
Feature 4: 0.000000
Feature 5: 0.038973
Feature 6: 0.064759
Feature 7: 0.003068
Feature 8: 0.000000
```

A bar chart of the feature importance scores for each input feature is created.

Importantly, a different mixture of features is promoted.

*Bar Chart of the Input Features (x) vs the Mutual Information Feature Importance (y)*

Now that we know how to perform feature selection on categorical data for a classification predictive modeling problem, we can try developing a model using the selected features and compare the results.

## Modeling With Selected Features

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance. A sketch of such a search is shown below.
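As a rough sketch of that idea (this is not code from the tutorial), the snippet below cross-validates a pipeline of *SelectKBest* plus *LogisticRegression* for each scoring function and several values of *k*; the values of *k* tried are arbitrary choices, and for brevity the encoders are fit on all rows, whereas fitting them inside each fold would be stricter.

```python
# a sketch of comparing feature selection configurations with cross-validation
from numpy import mean
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

# load and encode the whole dataset (fit on all rows to keep the sketch short)
data = read_csv('breast-cancer.csv', header=None).values
X = OrdinalEncoder().fit_transform(data[:, :-1].astype(str))
y = LabelEncoder().fit_transform(data[:, -1].astype(str))

# evaluate a SelectKBest + LogisticRegression pipeline for each setting
for score_func in (chi2, mutual_info_classif):
	for k in (2, 4, 6, 8):
		model = Pipeline([
			('fs', SelectKBest(score_func=score_func, k=k)),
			('lr', LogisticRegression(solver='lbfgs'))])
		scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
		print('%s, k=%d: %.3f' % (score_func.__name__, k, mean(scores)))
```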
In this section, we will evaluate a Logistic Regression model with all features compared to a model built from features selected by chi-squared and those features selected via mutual information.

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

### Model Built Using All Features

As a first step, we will evaluate a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.

The complete example is listed below.

```python
# evaluation of a model using all input features
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_enc, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_enc)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
```

Running the example prints the accuracy of the model on the test dataset.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.
In this case, we can see that the model achieves a classification accuracy of about 75%.

We would prefer to use a subset of features that achieves a classification accuracy that is as good or better than this.

```
Accuracy: 75.79
```

### Model Built Using Chi-Squared Features

We can use the chi-squared test to score the features and select the four most relevant features.

The *select_features()* function below is updated to achieve this.

```python
# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=chi2, k=4)
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs
```

The complete example of evaluating a logistic regression model fit and evaluated on data using this feature selection method is listed below.

```python
# evaluation of a model fit using chi squared input features
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=chi2, k=4)
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_fs, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
```

Running the example reports the performance of the model on just four of the nine input features selected using the chi-squared statistic.
**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we see that the model achieved an accuracy of about 74%, a slight drop in performance.

It is possible that some of the features removed are, in fact, adding value directly or in concert with the selected features.

At this stage, we would probably prefer to use all of the input features.

```
Accuracy: 74.74
```

### Model Built Using Mutual Information Features

We can repeat the experiment and select the top four features using a mutual information statistic.

The updated version of the *select_features()* function to achieve this is listed below.

```python
# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=mutual_info_classif, k=4)
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs
```

The complete example of using mutual information for feature selection to fit a logistic regression model is listed below.

```python
# evaluation of a model fit using mutual information input features
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the dataset
def load_dataset(filename):
	# load the dataset as a pandas DataFrame
	data = read_csv(filename, header=None)
	# retrieve numpy array
	dataset = data.values
	# split into input (X) and output (y) variables
	X = dataset[:, :-1]
	y = dataset[:, -1]
	# format all fields as string
	X = X.astype(str)
	return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
	oe = OrdinalEncoder()
	oe.fit(X_train)
	X_train_enc = oe.transform(X_train)
	X_test_enc = oe.transform(X_test)
	return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
	le = LabelEncoder()
	le.fit(y_train)
	y_train_enc = le.transform(y_train)
	y_test_enc = le.transform(y_test)
	return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
	fs = SelectKBest(score_func=mutual_info_classif, k=4)
	fs.fit(X_train, y_train)
	X_train_fs = fs.transform(X_train)
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_fs, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
```
Running the example fits the model on the four top selected features chosen using mutual information.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a small lift in classification accuracy to 76%.

To be sure that the effect is real, it would be a good idea to repeat each experiment multiple times and compare the mean performance. It may also be a good idea to explore using k-fold cross-validation instead of a simple train/test split; a sketch of this follows the result below.

```
Accuracy: 76.84
```
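Following that suggestion, here is a sketch (not code from the tutorial) that estimates the mean and standard deviation of accuracy for the mutual-information, k=4 pipeline under repeated stratified 10-fold cross-validation; the fold and repeat counts are arbitrary choices, and the encoders are again fit on all rows for brevity.

```python
# a sketch of a more robust estimate for the mutual information (k=4)
# configuration using repeated stratified 10-fold cross-validation
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

# load and encode the whole dataset
data = read_csv('breast-cancer.csv', header=None).values
X = OrdinalEncoder().fit_transform(data[:, :-1].astype(str))
y = LabelEncoder().fit_transform(data[:, -1].astype(str))

# evaluate the pipeline many times and summarize the accuracy distribution
model = Pipeline([
	('fs', SelectKBest(score_func=mutual_info_classif, k=4)),
	('lr', LogisticRegression(solver='lbfgs'))])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```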
## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Posts

- [A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)
- [An Introduction to Feature Selection](https://machinelearningmastery.com/an-introduction-to-feature-selection/)
- [Feature Selection For Machine Learning in Python](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
- [What is Information Gain and Mutual Information for Machine Learning](https://machinelearningmastery.com/information-gain-and-mutual-information)

### API

- [sklearn.model_selection.train_test_split API](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [sklearn.preprocessing.OrdinalEncoder API](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
- [sklearn.preprocessing.LabelEncoder API](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
- [sklearn.feature_selection.chi2 API](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
- [sklearn.feature_selection.SelectKBest API](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
- [sklearn.feature_selection.mutual_info_classif API](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html)
- [sklearn.linear_model.LogisticRegression API](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

### Articles

- [Breast Cancer Data Set, UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer)
- [Breast Cancer Raw Dataset](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv)
- [Breast Cancer Description](https://github.com/jbrownlee/Datasets/blob/master/breast-cancer.names)

## Summary

In this tutorial, you discovered how to perform feature selection with categorical input data.

Specifically, you learned:

- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
- How to perform feature selection for categorical data when fitting and evaluating a classification model.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.