{"id":3208,"date":"2020-03-05T18:00:45","date_gmt":"2020-03-05T18:00:45","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/05\/imbalanced-classification-with-the-adult-income-dataset\/"},"modified":"2020-03-05T18:00:45","modified_gmt":"2020-03-05T18:00:45","slug":"imbalanced-classification-with-the-adult-income-dataset","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/05\/imbalanced-classification-with-the-adult-income-dataset\/","title":{"rendered":"Imbalanced Classification with the Adult Income Dataset"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Many binary classification tasks do not have an equal number of examples from each class, e.g. the class distribution is skewed or imbalanced.<\/p>\n<p>A popular example is the adult income dataset that involves predicting personal income levels as above or below $50,000 per year based on personal details such as relationship and education level. There are many more cases of incomes less than $50K than above $50K, although the skew is not severe.<\/p>\n<p>This means that techniques for imbalanced classification can be used whilst model performance can still be reported using classification accuracy, as is used with balanced classification problems.<\/p>\n<p>In this tutorial, you will discover how to develop and evaluate a model for the imbalanced adult income classification dataset.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to systematically evaluate a suite of machine learning models with a robust test harness.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more <a 
href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-python\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_9714\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9714\" class=\"size-full wp-image-9714\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Predict-Income.jpg\" alt=\"Develop an Imbalanced Classification Model to Predict Income\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Predict-Income.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Predict-Income-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Predict-Income-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-9714\" class=\"wp-caption-text\">Develop an Imbalanced Classification Model to Predict Income<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/kirt_edblom\/32562259528\/\">Kirt Edblom<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Adult Income Dataset<\/li>\n<li>Explore the Dataset<\/li>\n<li>Model Test and Baseline Result<\/li>\n<li>Evaluate Models<\/li>\n<li>Make Prediction on New Data<\/li>\n<\/ol>\n<h2>Adult Income Dataset<\/h2>\n<p>In this project, we will use a standard imbalanced machine learning dataset referred to as the &ldquo;<em>Adult Income<\/em>&rdquo; or simply the &ldquo;<a 
href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Adult\"><em>adult<\/em><\/a>&rdquo; dataset.<\/p>\n<p>The dataset is credited to Ronny Kohavi and Barry Becker and was drawn from the 1994 <a href=\"https:\/\/www.census.gov\/\">United States Census Bureau<\/a> data and involves using personal details such as education level to predict whether an individual will earn more or less than $50,000 per year.<\/p>\n<blockquote>\n<p>The Adult dataset is from the Census Bureau and the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education, hours of work per week, etc.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/dl.acm.org\/citation.cfm?id=3001502\">Scaling Up The Accuracy Of Naive-bayes Classifiers: A Decision-tree Hybrid<\/a>, 1996.<\/p>\n<p>The dataset provides 14 input variables that are a mixture of categorical, ordinal, and numerical data types. The complete list of variables is as follows:<\/p>\n<ul>\n<li>Age.<\/li>\n<li>Workclass.<\/li>\n<li>Final Weight.<\/li>\n<li>Education.<\/li>\n<li>Education Number of Years.<\/li>\n<li>Marital-status.<\/li>\n<li>Occupation.<\/li>\n<li>Relationship.<\/li>\n<li>Race.<\/li>\n<li>Sex.<\/li>\n<li>Capital-gain.<\/li>\n<li>Capital-loss.<\/li>\n<li>Hours-per-week.<\/li>\n<li>Native-country.<\/li>\n<\/ul>\n<p>The dataset contains missing values that are marked with a question mark character (?).<\/p>\n<p>There are a total of 48,842 rows of data, and 3,620 with missing values, leaving 45,222 complete rows.<\/p>\n<p>There are two class values &lsquo;<em>&gt;50K<\/em>&lsquo; and &lsquo;<em>&lt;=50K<\/em>&lsquo;, meaning it is a binary classification task. 
The classes are imbalanced, with a skew toward the &lsquo;<em>&lt;=50K<\/em>&lsquo; class label.<\/p>\n<ul>\n<li><strong>&lsquo;&gt;50K&rsquo;<\/strong>: minority class, approximately 25%.<\/li>\n<li><strong>&lsquo;&lt;=50K&rsquo;<\/strong>: majority class, approximately 75%.<\/li>\n<\/ul>\n<p>Given that the class imbalance is not severe and that both class labels are equally important, it is common to use classification accuracy or classification error to report model performance on this dataset.<\/p>\n<p>Using the predefined train and test sets, a good reported classification error is approximately 14 percent, or a classification accuracy of about 86 percent. This might provide a target to aim for when working on this dataset.<\/p>\n<p>Next, let&rsquo;s take a closer look at the data.<\/p>\n<h2>Explore the Dataset<\/h2>\n<p>The Adult dataset is a widely used standard machine learning dataset, used to explore and demonstrate many machine learning algorithms, both generally and those designed specifically for imbalanced classification.<\/p>\n<p>First, download the dataset and save it in your current working directory with the name &ldquo;<em>adult-all.csv<\/em>&rdquo;.<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/adult-all.csv\">Download Adult Dataset (adult-all.csv)<\/a><\/li>\n<\/ul>\n<p>Review the contents of the file.<\/p>\n<p>The first few lines of the file should look as follows:<\/p>\n<pre class=\"crayon-plain-tag\">39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,&lt;=50K\r\n50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,&lt;=50K\r\n38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,&lt;=50K\r\n53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,&lt;=50K\r\n28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,&lt;=50K\r\n...<\/pre>\n<p>We can see that the input variables are a mixture of numerical and categorical or ordinal data types, where the non-numerical columns are represented using strings. At a minimum, the categorical variables will need to be ordinal or one-hot encoded.<\/p>\n<p>We can also see that the target variable is represented using strings. This column will need to be label encoded with 0 for the majority class and 1 for the minority class, as is the custom for binary imbalanced classification tasks.<\/p>\n<p>Missing values are marked with a &lsquo;<em>?<\/em>&lsquo; character. 
These values will need to be imputed, or given the small number of examples, these rows could be deleted from the dataset.<\/p>\n<p>The dataset can be loaded as a <em>DataFrame<\/em> using the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">read_csv() Pandas function<\/a>, specifying the filename, that there is no header line, and that strings like &lsquo; <em>?<\/em>&lsquo; should be parsed as <em>NaN<\/em> (missing) values.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the dataset location\r\nfilename = 'adult-all.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None, na_values='?')<\/pre>\n<p>Once loaded, we can remove the rows that contain one or more missing values.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# drop rows with missing\r\ndataframe = dataframe.dropna()<\/pre>\n<p>We can summarize the number of rows and columns by printing the shape of the DataFrame.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)<\/pre>\n<p>We can also summarize the number of examples in each class using the <a href=\"https:\/\/docs.python.org\/3\/library\/collections.html#collections.Counter\">Counter object<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Tying this together, the complete example of loading and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the dataset\r\nfrom pandas import read_csv\r\nfrom collections import Counter\r\n# define the dataset location\r\nfilename = 'adult-all.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None, na_values='?')\r\n# drop rows with 
missing\r\ndataframe = dataframe.dropna()\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Running the example first loads the dataset and confirms the number of rows and columns, that is 45,222 rows without missing values and 14 input variables and one target variable.<\/p>\n<p>The class distribution is then summarized, confirming a modest class imbalance with approximately 75 percent for the majority class (&lt;=50K) and approximately 25 percent for the minority class (&gt;50K).<\/p>\n<pre class=\"crayon-plain-tag\">(45222, 15)\r\nClass= &lt;=50K, Count=34014, Percentage=75.216%\r\nClass= &gt;50K, Count=11208, Percentage=24.784%<\/pre>\n<p>We can also take a look at the distribution of the numerical input variables by creating a histogram for each.<\/p>\n<p>First, we can select the columns with numeric variables by calling the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.select_dtypes.html\">select_dtypes() function<\/a> on the DataFrame. We can then select just those columns from the DataFrame.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# select columns with numerical data types\r\nnum_ix = df.select_dtypes(include=['int64', 'float64']).columns\r\n# select a subset of the dataframe with the chosen columns\r\nsubset = df[num_ix]<\/pre>\n<p>We can then create histograms of each numeric input variable. 
The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create histograms of numeric input variables\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'adult-all.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None, na_values='?')\r\n# drop rows with missing\r\ndf = df.dropna()\r\n# select columns with numerical data types\r\nnum_ix = df.select_dtypes(include=['int64', 'float64']).columns\r\n# select a subset of the dataframe with the chosen columns\r\nsubset = df[num_ix]\r\n# create a histogram plot of each numeric variable\r\nsubset.hist()\r\npyplot.show()<\/pre>\n<p>Running the example creates the figure with one histogram subplot for each of the six input variables in the dataset. The title of each subplot indicates the column number in the DataFrame (e.g. zero-offset).<\/p>\n<p>We can see many different distributions, some with Gaussian-like distributions, others with seemingly exponential or discrete distributions. 
We can also see that they all appear to have a very different scale.<\/p>\n<p>Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.<\/p>\n<div id=\"attachment_9712\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9712\" class=\"size-full wp-image-9712\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Numeric-Variables-in-the-Adult-Imbalanced-Classification-Dataset.png\" alt=\"Histogram of Numeric Variables in the Adult Imbalanced Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Numeric-Variables-in-the-Adult-Imbalanced-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Numeric-Variables-in-the-Adult-Imbalanced-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Numeric-Variables-in-the-Adult-Imbalanced-Classification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Numeric-Variables-in-the-Adult-Imbalanced-Classification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9712\" class=\"wp-caption-text\">Histogram of Numeric Variables in the Adult Imbalanced Classification Dataset<\/p>\n<\/div>\n<p>Now that we have reviewed the dataset, let&rsquo;s look at developing a test harness for evaluating candidate models.<\/p>\n<h2>Model Test and Baseline Result<\/h2>\n<p>We will evaluate candidate models using repeated stratified k-fold cross-validation.<\/p>\n<p>The <a 
href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation procedure<\/a> provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 45,222\/10, or about 4,522 examples.<\/p>\n<p>Stratified means that each fold will contain the same mixture of examples by class, that is about 75 percent to 25 percent for the majority and minority classes respectively. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.<\/p>\n<p>This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">RepeatedStratifiedKFold<\/a> scikit-learn class.<\/p>\n<p>We will predict a class label for each example and measure model performance using classification accuracy.<\/p>\n<p>The <em>evaluate_model()<\/em> function below will take the loaded dataset and a defined model and will evaluate it using repeated stratified k-fold cross-validation, then return a list of accuracy scores that can later be summarized.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores<\/pre>\n<p>We can define a function to load the dataset and label encode the target column.<\/p>\n<p>We will also return a list of categorical and numeric columns in case we decide to transform them later when fitting models.<\/p>\n<pre 
class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None, na_values='?')\r\n\t# drop rows with missing\r\n\tdataframe = dataframe.dropna()\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix<\/pre>\n<p>Finally, we can evaluate a baseline model on the dataset using this test harness.<\/p>\n<p>When using classification accuracy, a naive model will predict the majority class for all cases. This provides a baseline in model performance on this problem by which all other models can be compared.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">DummyClassifier<\/a> class from the scikit-learn library and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &lsquo;<em>most_frequent<\/em>&lsquo;.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='most_frequent')<\/pre>\n<p>Once the model is evaluated, we can report the mean and standard deviation of the accuracy scores directly.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this together, the complete example of loading the Adult dataset, evaluating a baseline model, and reporting the performance is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># test harness and 
baseline model evaluation for the adult dataset\r\nfrom collections import Counter\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom numpy import hstack\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.dummy import DummyClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None, na_values='?')\r\n\t# drop rows with missing\r\n\tdataframe = dataframe.dropna()\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'adult-all.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# summarize the loaded dataset\r\nprint(X.shape, y.shape, Counter(y))\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='most_frequent')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example first loads and summarizes the dataset.<\/p>\n<p>We can see that we have the correct number of 
rows loaded. Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.<\/p>\n<p>Next, the average classification accuracy score is reported.<\/p>\n<p>In this case, we can see that the baseline algorithm achieves an accuracy of about 75.2%. This score provides a lower limit on model skill; any model that achieves an average accuracy above about 75.2% has skill, whereas models that achieve a score below this value do not have skill on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">(45222, 14) (45222,) Counter({0: 34014, 1: 11208})\r\nMean Accuracy: 0.752 (0.000)<\/pre>\n<p>Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.<\/p>\n<h2>Evaluate Models<\/h2>\n<p>In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.<\/p>\n<p>The goal is both to demonstrate how to work through the problem systematically and to show the capability of some techniques designed for imbalanced classification problems.<\/p>\n<p>The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).<\/p>\n<p><strong>Can you do better?<\/strong> If you can achieve better classification accuracy using the same test harness, I&rsquo;d love to hear about it. 
Let me know in the comments below.<\/p>\n<h3>Evaluate Machine Learning Algorithms<\/h3>\n<p>Let&rsquo;s start by evaluating a mixture of machine learning models on the dataset.<\/p>\n<p>It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn&rsquo;t.<\/p>\n<p>We will evaluate the following machine learning models on the adult dataset:<\/p>\n<ul>\n<li>Decision Tree (CART)<\/li>\n<li>Support Vector Machine (SVM)<\/li>\n<li>Bagged Decision Trees (BAG)<\/li>\n<li>Random Forest (RF)<\/li>\n<li>Gradient Boosting Machine (GBM)<\/li>\n<\/ul>\n<p>We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 100.<\/p>\n<p>We will define each model in turn and add them to a list so that we can evaluate them sequentially. The <em>get_models()<\/em> function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.<\/p>\n<pre class=\"crayon-plain-tag\"># define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# CART\r\n\tmodels.append(DecisionTreeClassifier())\r\n\tnames.append('CART')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=100))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=100))\r\n\tnames.append('RF')\r\n\t# GBM\r\n\tmodels.append(GradientBoostingClassifier(n_estimators=100))\r\n\tnames.append('GBM')\r\n\treturn models, names<\/pre>\n<p>We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.<\/p>\n<p>We will one-hot encode the categorical input variables using a <a 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">OneHotEncoder<\/a>, and we will normalize the numerical input variables using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">MinMaxScaler<\/a>. These operations must be performed within each train\/test split during the cross-validation process, where the encoding and scaling operations are fit on the training set and applied to the train and test sets.<\/p>\n<p>An easy way to implement this is to use a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a> where the first step is a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">ColumnTransformer<\/a> that applies a <em>OneHotEncoder<\/em> to just the categorical variables, and a <em>MinMaxScaler<\/em> to just the numerical input variables. To achieve this, we need a list of the column indices for categorical and numerical input variables.<\/p>\n<p>The <em>load_dataset()<\/em> function we defined in the previous section loads and returns both the dataset and lists of columns that have categorical and numerical data types. This can be used to prepare a <em>Pipeline<\/em> to wrap each model prior to evaluating it. 
First, the <em>ColumnTransformer<\/em> is defined, which specifies what transform to apply to each type of column, then this is used as the first step in a <em>Pipeline<\/em> that ends with the specific model that will be fit and evaluated.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define steps\r\nsteps = [('c',OneHotEncoder(handle_unknown='ignore'),cat_ix), ('n',MinMaxScaler(),num_ix)]\r\n# one hot encode categorical, normalize numerical\r\nct = ColumnTransformer(steps)\r\n# wrap the model in a pipeline\r\npipeline = Pipeline(steps=[('t',ct),('m',models[i])])\r\n# evaluate the model and store results\r\nscores = evaluate_model(X, y, pipeline)<\/pre>\n<p>We can summarize the mean accuracy for each algorithm; this will help to directly compare algorithms.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize performance\r\nprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))<\/pre>\n<p>At the end of the run, we will create a separate box and whisker plot for each algorithm&rsquo;s sample of results. 
These plots will use the same y-axis scale so we can compare the distribution of results directly.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the adult imbalanced dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># spot check machine learning algorithms on the adult imbalanced dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.ensemble import GradientBoostingClassifier\r\nfrom sklearn.ensemble import BaggingClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None, na_values='?')\r\n\t# drop rows with missing\r\n\tdataframe = dataframe.dropna()\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, 
model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# CART\r\n\tmodels.append(DecisionTreeClassifier())\r\n\tnames.append('CART')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=100))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=100))\r\n\tnames.append('RF')\r\n\t# GBM\r\n\tmodels.append(GradientBoostingClassifier(n_estimators=100))\r\n\tnames.append('GBM')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'adult-all.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# define steps\r\n\tsteps = [('c',OneHotEncoder(handle_unknown='ignore'),cat_ix), ('n',MinMaxScaler(),num_ix)]\r\n\t# one hot encode categorical, normalize numerical\r\n\tct = ColumnTransformer(steps)\r\n\t# wrap the model in a pipeline\r\n\tpipeline = Pipeline(steps=[('t',ct),('m',models[i])])\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\tresults.append(scores)\r\n\t# summarize performance\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each algorithm in turn and reports the mean and standard deviation of classification accuracy.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p><strong>What scores did 
you get?<\/strong><br \/>\nPost your results in the comments below.<\/p>\n<p>In this case, we can see that all of the chosen algorithms are skillful, achieving a classification accuracy above 75.2%. We can see that the ensemble decision tree algorithms perform the best, with gradient boosting achieving the highest classification accuracy of about 86.3%.<\/p>\n<p>This is slightly better than the result reported in the original paper, albeit with a different model evaluation procedure.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;CART 0.812 (0.005)\r\n&gt;SVM 0.837 (0.005)\r\n&gt;BAG 0.852 (0.004)\r\n&gt;RF 0.849 (0.004)\r\n&gt;GBM 0.863 (0.004)<\/pre>\n<p>A figure is created showing one box and whisker plot for each algorithm&rsquo;s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.<\/p>\n<p>We can see that the distribution of scores for each algorithm appears to be above the baseline of about 75%, perhaps with a few outliers (circles on the plot). The distribution for each algorithm appears compact, with the median and mean aligning, suggesting the models are quite stable on this dataset and scores do not form a skewed distribution.<\/p>\n<p>This highlights that it is not just the central tendency of the model performance that is important, but also the spread and even worst-case result that should be considered. 
This is especially true when there are only a limited number of examples of the minority class.<\/p>\n<div id=\"attachment_9713\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9713\" class=\"size-full wp-image-9713\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Adult-Dataset.png\" alt=\"Box and Whisker Plot of Machine Learning Models on the Imbalanced Adult Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Adult-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Adult-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Adult-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Adult-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9713\" class=\"wp-caption-text\">Box and Whisker Plot of Machine Learning Models on the Imbalanced Adult Dataset<\/p>\n<\/div>\n<h2>Make Prediction on New Data<\/h2>\n<p>In this section, we can fit a final model and use it to make predictions on single rows of data.<\/p>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.GradientBoostingClassifier.html\">GradientBoostingClassifier<\/a> model, which achieved a classification accuracy of about 86.3%, as our final model. 
Fitting the final model involves defining the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">ColumnTransformer<\/a> to encode the categorical variables and scale the numerical variables, then constructing a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a> to perform these transforms on the training set prior to fitting the model.<\/p>\n<p>The <em>Pipeline<\/em> can then be used to make predictions on new data directly, and will automatically encode and scale new data using the same operations as were performed on the training dataset.<\/p>\n<p>First, we can define the model as a pipeline.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model to evaluate\r\nmodel = GradientBoostingClassifier(n_estimators=100)\r\n# one hot encode categorical, normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# transform the inputs, then fit the model\r\npipeline = Pipeline(steps=[('t',ct), ('m',model)])<\/pre>\n<p>Once defined, we can fit it on the entire training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit the model\r\npipeline.fit(X, y)<\/pre>\n<p>Once fit, we can use it to make predictions for new data by calling the <em>predict()<\/em> function. 
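The same transform-then-model pattern can be sanity-checked end to end on a tiny made-up dataset; the column values below are hypothetical and are not the adult data, this is only a sketch of the mechanics:

```python
# Minimal runnable sketch of the ColumnTransformer + Pipeline pattern
# on a tiny made-up dataset (not the adult data).
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# columns 0 and 1 are categorical, column 2 is numerical
X = np.array([['a', 'x', 1.0],
              ['b', 'y', 2.0],
              ['a', 'y', 3.0],
              ['b', 'x', 4.0]], dtype=object)
y = np.array([0, 0, 1, 1])

# one hot encode the categorical columns, normalize the numerical column
ct = ColumnTransformer([('c', OneHotEncoder(handle_unknown='ignore'), [0, 1]),
                        ('n', MinMaxScaler(), [2])])
# chain the transforms and the model so predict() prepares new rows automatically
pipeline = Pipeline(steps=[('t', ct), ('m', GradientBoostingClassifier(n_estimators=10))])
pipeline.fit(X, y)
# a new row is encoded and scaled with the fitted transforms before prediction
print(pipeline.predict(np.array([['a', 'y', 3.0]], dtype=object)))
```

Because the transforms are fitted inside the pipeline, there is no way to accidentally apply a differently-fitted encoder or scaler to new rows.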
This will return the class label of 0 for &ldquo;&lt;=50K&rdquo;, or 1 for &ldquo;&gt;50K&rdquo;.<\/p>\n<p>Importantly, we must use the <em>ColumnTransformer<\/em> within the <em>Pipeline<\/em> to correctly prepare new data using the same transforms.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define a row of data\r\nrow = [...]\r\n# make prediction\r\nyhat = pipeline.predict([row])<\/pre>\n<p>To demonstrate this, we can use the fit model to predict labels for a few cases where we know the outcome.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and make predictions on the adult dataset\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.ensemble import GradientBoostingClassifier\r\nfrom sklearn.pipeline import Pipeline\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None, na_values='?')\r\n\t# drop rows with missing\r\n\tdataframe = dataframe.dropna()\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# define the location of the dataset\r\nfull_path = 'adult-all.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = GradientBoostingClassifier(n_estimators=100)\r\n# one hot encode categorical, 
normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# transform the inputs, then fit the model\r\npipeline = Pipeline(steps=[('t',ct), ('m',model)])\r\n# fit the model\r\npipeline.fit(X, y)\r\n# evaluate on some &lt;=50K cases (known class 0)\r\nprint('&lt;=50K cases:')\r\ndata = [[24, 'Private', 161198, 'Bachelors', 13, 'Never-married', 'Prof-specialty', 'Not-in-family', 'White', 'Male', 0, 0, 25, 'United-States'],\r\n\t[23, 'Private', 214542, 'Some-college', 10, 'Never-married', 'Farming-fishing', 'Own-child', 'White', 'Male', 0, 0, 40, 'United-States'],\r\n\t[38, 'Private', 309122, '10th', 6, 'Divorced', 'Machine-op-inspct', 'Not-in-family', 'White', 'Female', 0, 0, 40, 'United-States']]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 0)' % (label))\r\n# evaluate on some &gt;50K cases (known class 1)\r\nprint('&gt;50K cases:')\r\ndata = [[55, 'Local-gov', 107308, 'Masters', 14, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0, 0, 40, 'United-States'],\r\n\t[53, 'Self-emp-not-inc', 145419, '1st-4th', 2, 'Married-civ-spouse', 'Exec-managerial', 'Husband', 'White', 'Male', 7688, 0, 67, 'Italy'],\r\n\t[44, 'Local-gov', 193425, 'Masters', 14, 'Married-civ-spouse', 'Prof-specialty', 'Wife', 'White', 'Female', 4386, 0, 40, 'United-States']]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 1)' % (label))<\/pre>\n<p>Running the example first fits the model on the entire training dataset.<\/p>\n<p>Then the fit model is used to predict the labels for a few &lt;=50K cases chosen from the dataset file. We can see that all cases are correctly predicted. Then some &gt;50K cases are used as input to the model and their labels are predicted. 
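Where a probability is preferred over a crisp class label, for example for the threshold moving mentioned earlier, the fitted model's <em>predict_proba()<\/em> function can be used instead of <em>predict()<\/em>. A minimal sketch on a tiny made-up one-feature dataset (hypothetical values, not the adult data):

```python
# Sketch: retrieving class probabilities instead of a crisp label.
# The one-feature dataset below is made up for illustration only.
from sklearn.ensemble import GradientBoostingClassifier

X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
model = GradientBoostingClassifier(n_estimators=10).fit(X, y)
# crisp class label, as used in the tutorial
print(model.predict([[0.95]]))
# probability of each class, one column per class
print(model.predict_proba([[0.95]]))
```

The probabilities sum to 1.0 across the classes, so a custom decision threshold can be applied to the positive-class column.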
As we might have hoped, the correct labels are predicted.<\/p>\n<pre class=\"crayon-plain-tag\">&lt;=50K cases:\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\n&gt;50K cases:\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)<\/pre>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/dl.acm.org\/citation.cfm?id=3001502\">Scaling Up The Accuracy Of Naive-bayes Classifiers: A Decision-tree Hybrid<\/a>, 1996.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.select_dtypes.html\">pandas.DataFrame.select_dtypes API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">sklearn.model_selection.RepeatedStratifiedKFold API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">sklearn.dummy.DummyClassifier API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/adult-all.csv\">Adult Dataset CSV<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/adult.names\">Adult Dataset Description<\/a>.<\/li>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/adult\">Adult Dataset, UCI Machine Learning Repository<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop and evaluate a model for the imbalanced adult income classification dataset.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to systematically evaluate a suite of machine learning models with a robust test 
harness.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-the-adult-income-dataset\/\">Imbalanced Classification with the Adult Income Dataset<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-the-adult-income-dataset\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Many binary classification tasks do not have an equal number of examples from each class, e.g. the class distribution is skewed or [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/05\/imbalanced-classification-with-the-adult-income-dataset\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3209,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3208"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3208"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3208\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3209"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3208"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3208"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3208"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}