{"id":2843,"date":"2019-11-21T18:00:02","date_gmt":"2019-11-21T18:00:02","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/11\/21\/3-ways-to-encode-categorical-variables-for-deep-learning\/"},"modified":"2019-11-21T18:00:02","modified_gmt":"2019-11-21T18:00:02","slug":"3-ways-to-encode-categorical-variables-for-deep-learning","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/11\/21\/3-ways-to-encode-categorical-variables-for-deep-learning\/","title":{"rendered":"3 Ways to Encode Categorical Variables for Deep Learning"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric.<\/p>\n<p>This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.<\/p>\n<p>The two most popular techniques are an <strong>integer encoding<\/strong> and a <strong>one hot encoding<\/strong>, although a newer technique called <strong>learned embedding<\/strong> may provide a useful middle ground between these two methods.<\/p>\n<p>In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The challenge of working with categorical data when using machine learning and deep learning models.<\/li>\n<li>How to integer encode and one hot encode categorical variables for modeling.<\/li>\n<li>How to learn an embedding distributed representation as part of a neural network for categorical variables.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9068\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9068\" class=\"size-full wp-image-9068\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/How-to-Encode-Categorical-Data-for-Deep-Learning-in-Keras.jpg\" alt=\"How to Encode Categorical Data for Deep Learning in Keras\" width=\"640\" height=\"427\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/How-to-Encode-Categorical-Data-for-Deep-Learning-in-Keras.jpg 640w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/How-to-Encode-Categorical-Data-for-Deep-Learning-in-Keras-300x200.jpg 300w\" sizes=\"(max-width: 640px) 100vw, 640px\"><\/p>\n<p id=\"caption-attachment-9068\" class=\"wp-caption-text\">How to Encode Categorical Data for Deep Learning in Keras<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/kendixon\/43872634992\/\">Ken Dixon<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>The Challenge With Categorical Data<\/li>\n<li>Breast Cancer Categorical Dataset<\/li>\n<li>How to Ordinal Encode Categorical Data<\/li>\n<li>How to One Hot Encode Categorical Data<\/li>\n<li>How to Use a Learned Embedding for Categorical Data<\/li>\n<\/ol>\n<h2>The Challenge With Categorical Data<\/h2>\n<p>A categorical variable is a variable whose values are labels rather than numbers.<\/p>\n<p>For example, the variable may be \u201c<em>color<\/em>\u201d and may take on the values \u201c<em>red<\/em>,\u201d \u201c<em>green<\/em>,\u201d and \u201c<em>blue<\/em>.\u201d<\/p>\n<p>Sometimes, the categorical data may have an ordered relationship between the categories, such as \u201c<em>first<\/em>,\u201d \u201c<em>second<\/em>,\u201d and \u201c<em>third<\/em>.\u201d This type of categorical data is referred to as ordinal, and the additional ordering information can be useful.<\/p>\n<p>Machine learning algorithms and deep learning neural networks require that input and output variables are numbers.<\/p>\n<p>This means that categorical data must be encoded to 
numbers before we can use it to fit and evaluate a model.<\/p>\n<p>There are many ways to encode categorical variables for modeling, although the three most common are as follows:<\/p>\n<ol>\n<li><strong>Integer Encoding<\/strong>: Where each unique label is mapped to an integer.<\/li>\n<li><strong>One Hot Encoding<\/strong>: Where each label is mapped to a binary vector.<\/li>\n<li><strong>Learned Embedding<\/strong>: Where a distributed representation of the categories is learned.<\/li>\n<\/ol>\n<p>We will take a closer look at how to encode categorical data for training a deep learning neural network in Keras using each one of these methods.<\/p>\n<h2>Breast Cancer Categorical Dataset<\/h2>\n<p>As the basis of this tutorial, we will use the so-called \u201c<a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Breast+Cancer\">Breast cancer<\/a>\u201d dataset that has been widely studied in machine learning since the 1980s.<\/p>\n<p>The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.<\/p>\n<p>A reasonable classification accuracy score on this dataset is between 68% and 73%. 
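Before working through the dataset, the three encodings listed above can be sketched for a toy \u201ccolor\u201d variable. This is a minimal illustration, not part of the tutorial's code: the alphabetical category order is one arbitrary convention, and the 4-dimensional embedding table is random here, whereas in a neural network its entries would be learned during training.

```python
# Sketch: integer encoding, one hot encoding, and the table lookup
# that underlies a learned embedding, for a toy "color" variable.
import numpy as np

labels = ['red', 'green', 'blue', 'red']
categories = sorted(set(labels))            # ['blue', 'green', 'red']

# 1. Integer encoding: each unique label is mapped to an integer
mapping = {c: i for i, c in enumerate(categories)}
integer = [mapping[c] for c in labels]      # [2, 1, 0, 2]

# 2. One hot encoding: each label is mapped to a binary vector
one_hot = np.zeros((len(labels), len(categories)), dtype=int)
one_hot[np.arange(len(labels)), integer] = 1

# 3. Learned embedding: each integer indexes a row of a weight table;
# the table is random here, but in a network it is learned by backprop
embedding_table = np.random.randn(len(categories), 4)  # 4 dims, arbitrary
embedded = embedding_table[integer]                    # shape (4, 4)
```

Note how the integer encoding keeps one column, the one hot encoding expands it to one column per category, and the embedding replaces each integer with a dense learned vector.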
We will aim for this region, but note that the models in this tutorial are not optimized: <em>they are designed to demonstrate encoding schemes<\/em>.<\/p>\n<p>You can download the dataset and save the file as \u201c<em>breast-cancer.csv<\/em>\u201d in your current working directory.<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer.csv\">Breast Cancer Dataset (breast-cancer.csv)<\/a><\/li>\n<\/ul>\n<p>Looking at the data, we can see that all nine input variables are categorical.<\/p>\n<p>Specifically, all variables are quoted strings; some are ordinal and some are not.<\/p>\n<pre class=\"crayon-plain-tag\">'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'\r\n'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'\r\n'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'\r\n'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'\r\n'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'\r\n...<\/pre>\n<p>We can load this dataset into memory using the Pandas library.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# load the dataset as a pandas DataFrame\r\ndata = read_csv(filename, header=None)\r\n# retrieve numpy array\r\ndataset = data.values<\/pre>\n<p>Once loaded, we can split the columns into input (<em>X<\/em>) and output (<em>y<\/em>) for modeling.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# split into input (X) and output (y) variables\r\nX = dataset[:, :-1]\r\ny = dataset[:,-1]<\/pre>\n<p>Finally, we can force all fields in the input data to be string, just in case Pandas tried to map some automatically to numbers (it does try).<\/p>\n<p>We can also reshape the output variable to be one column (e.g. 
a 2D shape).<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# format all fields as string\r\nX = X.astype(str)\r\n# reshape target to be a 2d array\r\ny = y.reshape((len(y), 1))<\/pre>\n<p>We can tie all of this together into a helpful function that we can reuse later.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(filename):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(filename, header=None)\r\n\t# retrieve numpy array\r\n\tdataset = data.values\r\n\t# split into input (X) and output (y) variables\r\n\tX = dataset[:, :-1]\r\n\ty = dataset[:,-1]\r\n\t# format all fields as string\r\n\tX = X.astype(str)\r\n\t# reshape target to be a 2d array\r\n\ty = y.reshape((len(y), 1))\r\n\treturn X, y<\/pre>\n<p>Once loaded, we can split the data into training and test sets so that we can fit and evaluate a deep learning model.<\/p>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.train_test_split.html\">train_test_split() function<\/a> from scikit-learn and use 67% of the data for training and 33% for testing.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# load the dataset\r\nX, y = load_dataset('breast-cancer.csv')\r\n# split into train and test sets\r\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)<\/pre>\n<p>Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the dataset\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\n\r\n# load the dataset\r\ndef load_dataset(filename):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(filename, header=None)\r\n\t# retrieve numpy array\r\n\tdataset = data.values\r\n\t# split into input (X) and output (y) variables\r\n\tX = dataset[:, :-1]\r\n\ty = dataset[:,-1]\r\n\t# format all 
fields as string\r\n\tX = X.astype(str)\r\n\t# reshape target to be a 2d array\r\n\ty = y.reshape((len(y), 1))\r\n\treturn X, y\r\n\r\n# load the dataset\r\nX, y = load_dataset('breast-cancer.csv')\r\n# split into train and test sets\r\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)\r\n# summarize\r\nprint('Train', X_train.shape, y_train.shape)\r\nprint('Test', X_test.shape, y_test.shape)<\/pre>\n<p>Running the example reports the size of the input and output elements of the train and test sets.<\/p>\n<p>We can see that we have 191 examples for training and 95 for testing.<\/p>\n<pre class=\"crayon-plain-tag\">Train (191, 9) (191, 1)\r\nTest (95, 9) (95, 1)<\/pre>\n<p>Now that we are familiar with the dataset, let\u2019s look at how we can encode it for modeling.<\/p>\n<h2>How to Ordinal Encode Categorical Data<\/h2>\n<p>An ordinal encoding involves mapping each unique label to an integer value.<\/p>\n<p>As such, it is sometimes referred to simply as an integer encoding.<\/p>\n<p>This type of encoding is really only appropriate if there is a known relationship between the categories.<\/p>\n<p>This relationship does exist for some of the variables in the dataset, and ideally, this should be harnessed when preparing the data.<\/p>\n<p>In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.<\/p>\n<p>We can use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OrdinalEncoder.html\">OrdinalEncoder() from scikit-learn<\/a> to encode each variable to integers. 
This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.<\/p>\n<p><strong>Note<\/strong>: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.<\/p>\n<p>The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.<\/p>\n<p>The function below, named <em>prepare_inputs()<\/em>, takes the input data for the train and test sets and encodes it using an ordinal encoding.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare input data\r\ndef prepare_inputs(X_train, X_test):\r\n\toe = OrdinalEncoder()\r\n\toe.fit(X_train)\r\n\tX_train_enc = oe.transform(X_train)\r\n\tX_test_enc = oe.transform(X_test)\r\n\treturn X_train_enc, X_test_enc<\/pre>\n<p>We also need to prepare the target variable.<\/p>\n<p>It is a binary classification problem, so we need to map the two class labels to 0 and 1.<\/p>\n<p>This is a type of ordinal encoding, and scikit-learn provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.LabelEncoder.html\">LabelEncoder<\/a> class specifically designed for this purpose. 
We could just as easily use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OrdinalEncoder.html\">OrdinalEncoder<\/a> and achieve the same result, although the <em>LabelEncoder<\/em> is designed for encoding a single variable.<\/p>\n<p>The <em>prepare_targets()<\/em> function integer encodes the output data for the train and test sets.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare target\r\ndef prepare_targets(y_train, y_test):\r\n\tle = LabelEncoder()\r\n\tle.fit(y_train)\r\n\ty_train_enc = le.transform(y_train)\r\n\ty_test_enc = le.transform(y_test)\r\n\treturn y_train_enc, y_test_enc<\/pre>\n<p>We can call these functions to prepare our data.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# prepare input data\r\nX_train_enc, X_test_enc = prepare_inputs(X_train, X_test)\r\n# prepare output data\r\ny_train_enc, y_test_enc = prepare_targets(y_train, y_test)<\/pre>\n<p>We can now define a neural network model.<\/p>\n<p>We will use the same general model in all of these examples. 
Specifically, a MultiLayer Perceptron (MLP) neural network with one hidden layer with 10 nodes, and one node in the output layer for making binary classifications.<\/p>\n<p>Without going into too much detail, the code below defines the model, fits it on the training dataset, and then evaluates it on the test dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the model\r\nmodel = Sequential()\r\nmodel.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\n# compile the keras model\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# fit the keras model on the dataset\r\nmodel.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)\r\n# evaluate the keras model\r\n_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)\r\nprint('Accuracy: %.2f' % (accuracy*100))<\/pre>\n<p>If you are new to developing neural networks in Keras, I recommend this tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/tutorial-first-neural-network-python-keras\/\">Develop Your First Neural Network in Python Step-By-Step<\/a><\/li>\n<\/ul>\n<p>Tying all of this together, the complete example of preparing the data with an ordinal encoding and fitting and evaluating a neural network on the data is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of ordinal encoding for a neural network\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OrdinalEncoder\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\n\r\n# load the dataset\r\ndef load_dataset(filename):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(filename, header=None)\r\n\t# retrieve numpy array\r\n\tdataset = data.values\r\n\t# split into input (X) and output (y) variables\r\n\tX = 
dataset[:, :-1]\r\n\ty = dataset[:,-1]\r\n\t# format all fields as string\r\n\tX = X.astype(str)\r\n\t# reshape target to be a 2d array\r\n\ty = y.reshape((len(y), 1))\r\n\treturn X, y\r\n\r\n# prepare input data\r\ndef prepare_inputs(X_train, X_test):\r\n\toe = OrdinalEncoder()\r\n\toe.fit(X_train)\r\n\tX_train_enc = oe.transform(X_train)\r\n\tX_test_enc = oe.transform(X_test)\r\n\treturn X_train_enc, X_test_enc\r\n\r\n# prepare target\r\ndef prepare_targets(y_train, y_test):\r\n\tle = LabelEncoder()\r\n\tle.fit(y_train)\r\n\ty_train_enc = le.transform(y_train)\r\n\ty_test_enc = le.transform(y_test)\r\n\treturn y_train_enc, y_test_enc\r\n\r\n# load the dataset\r\nX, y = load_dataset('breast-cancer.csv')\r\n# split into train and test sets\r\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)\r\n# prepare input data\r\nX_train_enc, X_test_enc = prepare_inputs(X_train, X_test)\r\n# prepare output data\r\ny_train_enc, y_test_enc = prepare_targets(y_train, y_test)\r\n# define the  model\r\nmodel = Sequential()\r\nmodel.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\n# compile the keras model\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# fit the keras model on the dataset\r\nmodel.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)\r\n# evaluate the keras model\r\n_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)\r\nprint('Accuracy: %.2f' % (accuracy*100))<\/pre>\n<p>Running the example will fit the model in just a few seconds on any modern hardware (no GPU required).<\/p>\n<p>The loss and the accuracy of the model are reported at the end of each training epoch, and finally, the accuracy of the model on the test dataset is reported.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm. 
Try running the example a few times.<\/p>\n<p>In this case, we can see that the model achieved an accuracy of about 70% on the test dataset.<\/p>\n<p>Not bad, given that an ordinal relationship only exists for some of the input variables, and for those where it does, it was not honored in the encoding.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nEpoch 95\/100\r\n - 0s - loss: 0.5349 - acc: 0.7696\r\nEpoch 96\/100\r\n - 0s - loss: 0.5330 - acc: 0.7539\r\nEpoch 97\/100\r\n - 0s - loss: 0.5316 - acc: 0.7592\r\nEpoch 98\/100\r\n - 0s - loss: 0.5302 - acc: 0.7696\r\nEpoch 99\/100\r\n - 0s - loss: 0.5291 - acc: 0.7644\r\nEpoch 100\/100\r\n - 0s - loss: 0.5277 - acc: 0.7644\r\n\r\nAccuracy: 70.53<\/pre>\n<p>This provides a good starting point when working with categorical data.<\/p>\n<p>A better and more general approach is to use a one hot encoding.<\/p>\n<h2>How to One Hot Encode Categorical Data<\/h2>\n<p>A one hot encoding is appropriate for categorical data where no relationship exists between categories.<\/p>\n<p>It involves representing each categorical variable with a binary vector that has one element for each unique label and marking the class label with a 1 and all other elements 0.<\/p>\n<p>For example, if our variable was \u201c<em>color<\/em>\u201d and the labels were \u201c<em>red<\/em>,\u201d \u201c<em>green<\/em>,\u201d and \u201c<em>blue<\/em>,\u201d we would encode each of these labels as a three-element binary vector as follows:<\/p>\n<ul>\n<li>Red: [1, 0, 0]<\/li>\n<li>Green: [0, 1, 0]<\/li>\n<li>Blue: [0, 0, 1]<\/li>\n<\/ul>\n<p>Then each label in the dataset would be replaced with a vector (one column becomes three). 
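The color example above can be reproduced directly with scikit-learn's <em>OneHotEncoder<\/em>. This small sketch uses made-up values for a second \u201csize\u201d column to show how every categorical column expands into one binary column per unique label:

```python
# Sketch: OneHotEncoder expands each categorical column into one
# binary column per unique label found in that column.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red', 'small'],
              ['green', 'large'],
              ['blue', 'small']])

encoder = OneHotEncoder()
# the default output is a sparse matrix; convert it to view the vectors
X_enc = encoder.fit_transform(X).toarray()

# two input columns (3 colors + 2 sizes) become five binary columns
print(X_enc.shape)  # (3, 5)
```

The same expansion is what turns the nine breast cancer columns into 43.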
This is done for all categorical variables so that our nine input variables or columns become 43 in the case of the breast cancer dataset.<\/p>\n<p>The scikit-learn library provides the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">OneHotEncoder<\/a> to automatically one hot encode one or more variables.<\/p>\n<p>The <em>prepare_inputs()<\/em> function below provides a drop-in replacement function for the example in the previous section. Instead of using an <em>OrdinalEncoder<\/em>, it uses a <em>OneHotEncoder<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare input data\r\ndef prepare_inputs(X_train, X_test):\r\n\tohe = OneHotEncoder()\r\n\tohe.fit(X_train)\r\n\tX_train_enc = ohe.transform(X_train)\r\n\tX_test_enc = ohe.transform(X_test)\r\n\treturn X_train_enc, X_test_enc<\/pre>\n<p>Tying this together, the complete example of one hot encoding the breast cancer categorical dataset and modeling it with a neural network is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of one hot encoding for a neural network\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom keras.models import Sequential\r\nfrom keras.layers import Dense\r\n\r\n# load the dataset\r\ndef load_dataset(filename):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(filename, header=None)\r\n\t# retrieve numpy array\r\n\tdataset = data.values\r\n\t# split into input (X) and output (y) variables\r\n\tX = dataset[:, :-1]\r\n\ty = dataset[:,-1]\r\n\t# format all fields as string\r\n\tX = X.astype(str)\r\n\t# reshape target to be a 2d array\r\n\ty = y.reshape((len(y), 1))\r\n\treturn X, y\r\n\r\n# prepare input data\r\ndef prepare_inputs(X_train, X_test):\r\n\tohe = OneHotEncoder()\r\n\tohe.fit(X_train)\r\n\tX_train_enc = ohe.transform(X_train)\r\n\tX_test_enc = 
ohe.transform(X_test)\r\n\treturn X_train_enc, X_test_enc\r\n\r\n# prepare target\r\ndef prepare_targets(y_train, y_test):\r\n\tle = LabelEncoder()\r\n\tle.fit(y_train)\r\n\ty_train_enc = le.transform(y_train)\r\n\ty_test_enc = le.transform(y_test)\r\n\treturn y_train_enc, y_test_enc\r\n\r\n# load the dataset\r\nX, y = load_dataset('breast-cancer.csv')\r\n# split into train and test sets\r\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)\r\n# prepare input data\r\nX_train_enc, X_test_enc = prepare_inputs(X_train, X_test)\r\n# prepare output data\r\ny_train_enc, y_test_enc = prepare_targets(y_train, y_test)\r\n# define the  model\r\nmodel = Sequential()\r\nmodel.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))\r\nmodel.add(Dense(1, activation='sigmoid'))\r\n# compile the keras model\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# fit the keras model on the dataset\r\nmodel.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)\r\n# evaluate the keras model\r\n_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)\r\nprint('Accuracy: %.2f' % (accuracy*100))<\/pre>\n<p>The example one hot encodes the input categorical data, and also label encodes the target variable as we did in the previous section. The same neural network model is then fit on the prepared dataset.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.<\/p>\n<p>In this case, the model performs reasonably well, achieving an accuracy of about 72%, close to what was seen in the previous section.<\/p>\n<p>A more fair comparison would be to run each configuration 10 or 30 times and compare performance using the mean accuracy. 
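That repeated-evaluation idea can be sketched as a small harness. The <em>build_and_evaluate<\/em> callable here is a hypothetical stand-in for the fit-and-evaluate code above (any function that trains a fresh model and returns its test accuracy):

```python
# Sketch: compare encoding schemes by repeating each run and reporting
# the mean accuracy, rather than trusting a single stochastic run.
import numpy as np

def repeated_evaluation(build_and_evaluate, n_repeats=10):
    # build_and_evaluate is a hypothetical callable that fits a fresh
    # model each call and returns its accuracy on the test set
    scores = [build_and_evaluate() for _ in range(n_repeats)]
    return np.mean(scores), np.std(scores)
```

Comparing mean and standard deviation across 10 or 30 repeats gives a fairer picture of each encoding than any single run.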
Recall that we are more focused on how to encode categorical data in this tutorial rather than getting the best score on this specific dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nEpoch 95\/100\r\n - 0s - loss: 0.3837 - acc: 0.8272\r\nEpoch 96\/100\r\n - 0s - loss: 0.3823 - acc: 0.8325\r\nEpoch 97\/100\r\n - 0s - loss: 0.3814 - acc: 0.8325\r\nEpoch 98\/100\r\n - 0s - loss: 0.3795 - acc: 0.8325\r\nEpoch 99\/100\r\n - 0s - loss: 0.3788 - acc: 0.8325\r\nEpoch 100\/100\r\n - 0s - loss: 0.3773 - acc: 0.8325\r\n\r\nAccuracy: 72.63<\/pre>\n<p>Ordinal and one hot encoding are perhaps the two most popular methods.<\/p>\n<p>A newer technique, called a learned embedding, is similar to one hot encoding but was designed specifically for use with neural networks.<\/p>\n<h2>How to Use a Learned Embedding for Categorical Data<\/h2>\n<p>A learned embedding, or simply an \u201c<em>embedding<\/em>,\u201d is a distributed representation for categorical data.<\/p>\n<p>Each category is mapped to a distinct vector, and the properties of the vector are adapted or learned while training a neural network. The vector space provides a projection of the categories, allowing those categories that are close or related to cluster together naturally.<\/p>\n<p>This provides the benefits of both an ordinal encoding, in that any ordinal relationships can be learned from the data, and a one hot encoding, in that each category is given its own vector representation. Unlike one hot encoding, the input vectors are not sparse (do not have lots of zeros). The downside is that the representation must be learned as part of the model and that many more input variables (columns) are created.<\/p>\n<p>The technique was originally developed to provide a distributed representation for words, e.g. allowing similar words to have similar vector representations. As such, the technique is often referred to as a word embedding, and in the case of text data, algorithms have been developed to learn a representation independent of a neural network. 
For more on this topic, see the post:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/what-are-word-embeddings\/\">What Are Word Embeddings for Text?<\/a><\/li>\n<\/ul>\n<p>An additional benefit of using an embedding is that the vectors to which each category is mapped can be learned in a model of even modest skill, then extracted and used generally as an input representation for the category across a range of different models and applications. That is, they can be learned and reused.<\/p>\n<p>Embeddings can be used in Keras via the <em>Embedding<\/em> layer.<\/p>\n<p>For an example of learning word embeddings for text data in Keras, see the post:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/use-word-embedding-layers-deep-learning-keras\/\">How to Use Word Embedding Layers for Deep Learning with Keras<\/a><\/li>\n<\/ul>\n<p>One embedding layer is required for each categorical variable, and the embedding expects the categories to be ordinal encoded, although no relationship between the categories is assumed.<\/p>\n<p>Each embedding also requires the number of dimensions to use for the distributed representation (vector space). It is common in natural language applications to use 50, 100, or 300 dimensions. For our small example, we will fix the number of dimensions at 10, but this is arbitrary; you should experiment with other values.<\/p>\n<p>First, we can prepare the input data using an ordinal encoding.<\/p>\n<p>The model we will develop will have one separate embedding for each input variable. Therefore, the model will take nine different input datasets. 
As such, we will split the input variables and ordinal encode (integer encoding) each separately using the <em>LabelEncoder<\/em> and return a list of separate prepared train and test input datasets.<\/p>\n<p>The <em>prepare_inputs()<\/em> function below implements this, enumerating over each input variable, integer encoding each correctly using best practices, and returning lists of encoded train and test variables (or one-variable datasets) that can be used as input for our model later.<\/p>\n<pre class=\"crayon-plain-tag\"># prepare input data\r\ndef prepare_inputs(X_train, X_test):\r\n\tX_train_enc, X_test_enc = list(), list()\r\n\t# label encode each column\r\n\tfor i in range(X_train.shape[1]):\r\n\t\tle = LabelEncoder()\r\n\t\tle.fit(X_train[:, i])\r\n\t\t# encode\r\n\t\ttrain_enc = le.transform(X_train[:, i])\r\n\t\ttest_enc = le.transform(X_test[:, i])\r\n\t\t# store\r\n\t\tX_train_enc.append(train_enc)\r\n\t\tX_test_enc.append(test_enc)\r\n\treturn X_train_enc, X_test_enc<\/pre>\n<p>Now we can construct the model.<\/p>\n<p>We must construct the model differently in this case because we will have nine input layers, with nine embeddings the outputs of which (the nine different 10-element vectors) need to be concatenated into one long vector before being passed as input to the dense layers.<\/p>\n<p>We can achieve this using the functional Keras API. If you are new to the Keras functional API, see the post:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/keras-functional-api-deep-learning\/\">How to Use the Keras Functional API for Deep Learning<\/a><\/li>\n<\/ul>\n<p>First, we can enumerate each variable and construct an input layer and connect it to an embedding layer, and store both layers in lists. 
We need a reference to all of the input layers when defining the model, and we need a reference to each embedding layer to concatenate them with a merge layer.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# prepare each input head\r\nin_layers = list()\r\nem_layers = list()\r\nfor i in range(len(X_train_enc)):\r\n\t# calculate the number of unique inputs\r\n\tn_labels = len(unique(X_train_enc[i]))\r\n\t# define input layer\r\n\tin_layer = Input(shape=(1,))\r\n\t# define embedding layer\r\n\tem_layer = Embedding(n_labels, 10)(in_layer)\r\n\t# store layers\r\n\tin_layers.append(in_layer)\r\n\tem_layers.append(em_layer)<\/pre>\n<p>We can then merge all of the embedding layers, define the hidden layer and output layer, then define the model.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# concat all embeddings\r\nmerge = concatenate(em_layers)\r\ndense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)\r\noutput = Dense(1, activation='sigmoid')(dense)\r\nmodel = Model(inputs=in_layers, outputs=output)<\/pre>\n<p>When using a model with multiple inputs, we will need to specify a list that has one dataset for each input, e.g. a list of nine arrays each with one column in the case of our dataset. Thankfully, this is the format we returned from our <em>prepare_inputs()<\/em> function.<\/p>\n<p>Therefore, fitting and evaluating the model looks like it does in the previous section.<\/p>\n<p>Additionally, we will plot the model by calling the <em>plot_model()<\/em> function and save it to file. This requires that pygraphviz and pydot are installed, which can be a pain on some systems. 
<strong>If you have trouble<\/strong>, just comment out the import statement and call to <em>plot_model()<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# compile the keras model\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# plot graph\r\nplot_model(model, show_shapes=True, to_file='embeddings.png')\r\n# fit the keras model on the dataset\r\nmodel.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)\r\n# evaluate the keras model\r\n_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)\r\nprint('Accuracy: %.2f' % (accuracy*100))<\/pre>\n<p>Tying this all together, the complete example of using a separate embedding for each categorical input variable in a multi-input layer model is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># example of learned embedding encoding for a neural network\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom keras.models import Model\r\nfrom keras.layers import Input\r\nfrom keras.layers import Dense\r\nfrom keras.layers import Embedding\r\nfrom keras.layers.merge import concatenate\r\nfrom keras.utils import plot_model\r\n\r\n# load the dataset\r\ndef load_dataset(filename):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(filename, header=None)\r\n\t# retrieve numpy array\r\n\tdataset = data.values\r\n\t# split into input (X) and output (y) variables\r\n\tX = dataset[:, :-1]\r\n\ty = dataset[:,-1]\r\n\t# format all fields as string\r\n\tX = X.astype(str)\r\n\t# reshape target to be a 2d array\r\n\ty = y.reshape((len(y), 1))\r\n\treturn X, y\r\n\r\n# prepare input data\r\ndef prepare_inputs(X_train, X_test):\r\n\tX_train_enc, X_test_enc = list(), list()\r\n\t# label encode each column\r\n\tfor i in range(X_train.shape[1]):\r\n\t\tle = LabelEncoder()\r\n\t\tle.fit(X_train[:, i])\r\n\t\t# encode\r\n\t\ttrain_enc = 
le.transform(X_train[:, i])\r\n\t\ttest_enc = le.transform(X_test[:, i])\r\n\t\t# store\r\n\t\tX_train_enc.append(train_enc)\r\n\t\tX_test_enc.append(test_enc)\r\n\treturn X_train_enc, X_test_enc\r\n\r\n# prepare target\r\ndef prepare_targets(y_train, y_test):\r\n\tle = LabelEncoder()\r\n\tle.fit(y_train)\r\n\ty_train_enc = le.transform(y_train)\r\n\ty_test_enc = le.transform(y_test)\r\n\treturn y_train_enc, y_test_enc\r\n\r\n# load the dataset\r\nX, y = load_dataset('breast-cancer.csv')\r\n# split into train and test sets\r\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)\r\n# prepare input data\r\nX_train_enc, X_test_enc = prepare_inputs(X_train, X_test)\r\n# prepare output data\r\ny_train_enc, y_test_enc = prepare_targets(y_train, y_test)\r\n# make output 3d\r\ny_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))\r\ny_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))\r\n# prepare each input head\r\nin_layers = list()\r\nem_layers = list()\r\nfor i in range(len(X_train_enc)):\r\n\t# calculate the number of unique inputs\r\n\tn_labels = len(unique(X_train_enc[i]))\r\n\t# define input layer\r\n\tin_layer = Input(shape=(1,))\r\n\t# define embedding layer\r\n\tem_layer = Embedding(n_labels, 10)(in_layer)\r\n\t# store layers\r\n\tin_layers.append(in_layer)\r\n\tem_layers.append(em_layer)\r\n# concat all embeddings\r\nmerge = concatenate(em_layers)\r\ndense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)\r\noutput = Dense(1, activation='sigmoid')(dense)\r\nmodel = Model(inputs=in_layers, outputs=output)\r\n# compile the keras model\r\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\r\n# plot graph\r\nplot_model(model, show_shapes=True, to_file='embeddings.png')\r\n# fit the keras model on the dataset\r\nmodel.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)\r\n# evaluate the keras model\r\n_, accuracy = model.evaluate(X_test_enc, y_test_enc, 
verbose=0)\r\nprint('Accuracy: %.2f' % (accuracy*100))<\/pre>\n<p>Running the example prepares the data as described above, fits the model, and reports the performance.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.<\/p>\n<p>In this case, the model performs reasonably well, matching what we saw for the one hot encoding in the previous section.<\/p>\n<p>Because the learned vectors were trained as part of a skilled model, it is possible to save them and reuse them as a general representation of these variables in other models that operate on the same data. This is a useful and compelling reason to explore this encoding.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nEpoch 15\/20\r\n - 0s - loss: 0.4891 - acc: 0.7696\r\nEpoch 16\/20\r\n - 0s - loss: 0.4845 - acc: 0.7749\r\nEpoch 17\/20\r\n - 0s - loss: 0.4783 - acc: 0.7749\r\nEpoch 18\/20\r\n - 0s - loss: 0.4763 - acc: 0.7906\r\nEpoch 19\/20\r\n - 0s - loss: 0.4696 - acc: 0.7906\r\nEpoch 20\/20\r\n - 0s - loss: 0.4660 - acc: 0.7958\r\n\r\nAccuracy: 72.63<\/pre>\n<p>To confirm our understanding of the model, a plot is created and saved to the file embeddings.png in the current working directory.<\/p>\n<p>The plot shows the nine inputs each mapped to a 10 element vector, meaning that the actual input to the model is a 90 element vector.<\/p>\n<p><strong>Note<\/strong>: Click the image to see the large version.<\/p>\n<div id=\"attachment_9067\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/09\/Plot-of-the-Model-Architecture-with-Separate-Inputs-and-Embeddings-for-each-Categorical-Variable.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9067\" class=\"wp-image-9067 size-large\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/09\/Plot-of-the-Model-Architecture-with-Separate-Inputs-and-Embeddings-for-each-Categorical-Variable-1024x134.png\" alt=\"Plot of the Model Architecture With Separate Inputs and Embeddings for each Categorical Variable\" width=\"1024\" height=\"134\" srcset=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/09\/Plot-of-the-Model-Architecture-with-Separate-Inputs-and-Embeddings-for-each-Categorical-Variable-1024x134.png 1024w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/09\/Plot-of-the-Model-Architecture-with-Separate-Inputs-and-Embeddings-for-each-Categorical-Variable-300x39.png 300w, https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/09\/Plot-of-the-Model-Architecture-with-Separate-Inputs-and-Embeddings-for-each-Categorical-Variable-768x100.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/a><\/p>\n<p id=\"caption-attachment-9067\" class=\"wp-caption-text\">Plot of the Model Architecture With Separate Inputs and Embeddings for each Categorical Variable<br \/>Click to Enlarge.<\/p>\n<\/div>\n<h2>Common Questions<\/h2>\n<p>This section lists some common questions and answers when encoding categorical data.<\/p>\n<h4><strong>Q. What if I have a mixture of numeric and categorical data?<\/strong><\/h4>\n<p>Or, what if I have a mixture of categorical and ordinal data?<\/p>\n<p>You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.<\/p>\n<h4><strong>Q. What if I have hundreds of categories?<\/strong><\/h4>\n<p>Or, what if I concatenate many one hot encoded vectors to create a many thousand element input vector?<\/p>\n<p>You can use a one hot encoding up to thousands and tens of thousands of categories. 
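<\/p>\n<p>For a sense of scale, the minimal sketch below (using a hypothetical single variable with 1,000 categories, not the breast cancer dataset) shows how large the one hot encoded input becomes, and how much smaller an <em>Embedding<\/em> layer for the same variable would be:<\/p>\n<pre class=\"crayon-plain-tag\"># sketch: size of a one hot encoding for a hypothetical high-cardinality variable\r\nfrom numpy import array\r\nfrom sklearn.preprocessing import OneHotEncoder\r\n\r\n# one categorical column with 1,000 distinct labels\r\nvalues = array([['code%d' % i] for i in range(1000)])\r\n# one hot encode; the result has one column per category\r\nonehot = OneHotEncoder().fit_transform(values)\r\nprint(onehot.shape)\r\n# (1000, 1000): each row is a 1,000 element input vector,\r\n# where an Embedding(1000, 10) would use just 10 elements per row<\/pre>\n<p>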
Also, having large vectors as input sounds intimidating, but the models can generally handle it.<\/p>\n<p>Try an embedding; it offers the benefit of a smaller vector space (a projection) and the representation can have more meaning.<\/p>\n<h4>Q. What encoding technique is the best?<\/h4>\n<p>This is unknowable.<\/p>\n<p>Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Posts<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/tutorial-first-neural-network-python-keras\/\">Develop Your First Neural Network in Python Step-By-Step<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/why-one-hot-encode-data-in-machine-learning\/\">Why One-Hot Encode Data in Machine Learning?<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/data-preparation-gradient-boosting-xgboost-python\/\">Data Preparation for Gradient Boosting with XGBoost in Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/what-are-word-embeddings\/\">What Are Word Embeddings for Text?<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/use-word-embedding-layers-deep-learning-keras\/\">How to Use Word Embedding Layers for Deep Learning with Keras<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/keras-functional-api-deep-learning\/\">How to Use the Keras Functional API for Deep Learning<\/a><\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.train_test_split.html\">sklearn.model_selection.train_test_split API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OrdinalEncoder.html\">sklearn.preprocessing.OrdinalEncoder API<\/a>.<\/li>\n<li><a 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.LabelEncoder.html\">sklearn.preprocessing.LabelEncoder API<\/a>.<\/li>\n<li><a href=\"https:\/\/keras.io\/layers\/embeddings\/\">Embedding Keras API<\/a>.<\/li>\n<li><a href=\"https:\/\/keras.io\/visualization\/\">Visualization Keras API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Breast+Cancer\">Breast Cancer Data Set, UCI Machine Learning Repository<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/breast-cancer.csv\">Breast Cancer Raw Dataset<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/jbrownlee\/Datasets\/blob\/master\/breast-cancer.names\">Breast Cancer Description<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to encode categorical data when developing neural network models in Keras.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The challenge of working with categorical data when using machine learning and deep learning models.<\/li>\n<li>How to integer encode and one hot encode categorical variables for modeling.<\/li>\n<li>How to learn an embedding distributed representation as part of a neural network for categorical variables.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-prepare-categorical-data-for-deep-learning-in-python\/\">3 Ways to Encode Categorical Variables for Deep Learning<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-to-prepare-categorical-data-for-deep-learning-in-python\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Machine learning and deep learning 
models, like those in Keras, require all input and output variables to be numeric. This means that [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/11\/21\/3-ways-to-encode-categorical-variables-for-deep-learning\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2844,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2843"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2843"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2843\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2844"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2843"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2843"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2843"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}