{"id":3181,"date":"2020-02-27T18:00:02","date_gmt":"2020-02-27T18:00:02","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/02\/27\/develop-a-model-for-the-imbalanced-classification-of-good-and-bad-credit\/"},"modified":"2020-02-27T18:00:02","modified_gmt":"2020-02-27T18:00:02","slug":"develop-a-model-for-the-imbalanced-classification-of-good-and-bad-credit","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/02\/27\/develop-a-model-for-the-imbalanced-classification-of-good-and-bad-credit\/","title":{"rendered":"Develop a Model for the Imbalanced Classification of Good and Bad Credit"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Misclassification errors on the minority class are more important than other types of prediction errors for some imbalanced classification tasks.<\/p>\n<p>One example is the problem of classifying bank customers as to whether they should receive a loan or not. Giving a loan to a bad customer marked as a good customer results in a greater cost to the bank than denying a loan to a good customer marked as a bad customer.<\/p>\n<p>This requires careful selection of a performance metric that both promotes minimizing misclassification errors in general, and favors minimizing one type of misclassification error over another.<\/p>\n<p>The <strong>German credit dataset<\/strong> is a standard imbalanced classification dataset that has this property of differing costs to misclassification errors. 
Models fit on this dataset can be evaluated using the <a href=\"https:\/\/machinelearningmastery.com\/fbeta-measure-for-machine-learning\/\">Fbeta-Measure<\/a>, which both quantifies model performance generally and captures the requirement that one type of misclassification error is more costly than another.<\/p>\n<p>In this tutorial, you will discover how to develop and evaluate a model for the imbalanced German credit classification dataset.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to evaluate a suite of machine learning models and improve their performance with data undersampling techniques.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more <a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-python\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<ul>\n<li><strong>Update Feb\/2020<\/strong>: Added section on further model improvements.<\/li>\n<\/ul>\n<div id=\"attachment_9681\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9681\" class=\"size-full wp-image-9681\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Develop-an-Imbalanced-Classification-Model-to-Predict-Good-and-Bad-Credit.jpg\" alt=\"Develop an Imbalanced Classification Model to Predict Good and Bad Credit\" width=\"800\" height=\"450\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Develop-an-Imbalanced-Classification-Model-to-Predict-Good-and-Bad-Credit.jpg 800w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Develop-an-Imbalanced-Classification-Model-to-Predict-Good-and-Bad-Credit-300x169.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Develop-an-Imbalanced-Classification-Model-to-Predict-Good-and-Bad-Credit-768x432.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-9681\" class=\"wp-caption-text\">Develop an Imbalanced Classification Model to Predict Good and Bad Credit<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/alnieves\/43544649282\/\">AL Nieves<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>German Credit Dataset<\/li>\n<li>Explore the Dataset<\/li>\n<li>Model Test and Baseline Result<\/li>\n<li>Evaluate Models\n<ol>\n<li>Evaluate Machine Learning Algorithms<\/li>\n<li>Evaluate Undersampling<\/li>\n<li>Further Model Improvements<\/li>\n<\/ol>\n<\/li>\n<li>Make Prediction on New Data<\/li>\n<\/ol>\n<h2>German Credit Dataset<\/h2>\n<p>In this project, we will use a standard imbalanced machine learning dataset referred to as the &ldquo;<a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Statlog+(German+Credit+Data)\">German Credit<\/a>&rdquo; dataset or simply &ldquo;<em>German<\/em>.&rdquo;<\/p>\n<p>The dataset was used as part of the Statlog project, a European-based initiative in the 1990s to evaluate and compare a large number (at the time) of machine learning algorithms on a range of different classification tasks. The dataset is credited to Hans Hofmann.<\/p>\n<blockquote>\n<p>The fragmentation amongst different disciplines has almost certainly hindered communication and progress. 
The StatLog project was designed to break down these divisions by selecting classification procedures regardless of historical pedigree, testing them on large-scale and commercially important problems, and hence to determine to what extent the various techniques met the needs of industry.<\/p>\n<\/blockquote>\n<p>&mdash; Page 4, <a href=\"https:\/\/amzn.to\/33oQT1q\">Machine Learning, Neural and Statistical Classification<\/a>, 1994.<\/p>\n<p>The German credit dataset describes financial and banking details for customers and the task is to determine whether the customer is good or bad. The assumption is that the task involves predicting whether a customer will pay back a loan or credit.<\/p>\n<p>The dataset includes 1,000 examples and 20 input variables, 7 of which are numerical (integer) and 13 are categorical.<\/p>\n<ul>\n<li>Status of existing checking account<\/li>\n<li>Duration in month<\/li>\n<li>Credit history<\/li>\n<li>Purpose<\/li>\n<li>Credit amount<\/li>\n<li>Savings account<\/li>\n<li>Present employment since<\/li>\n<li>Installment rate in percentage of disposable income<\/li>\n<li>Personal status and sex<\/li>\n<li>Other debtors<\/li>\n<li>Present residence since<\/li>\n<li>Property<\/li>\n<li>Age in years<\/li>\n<li>Other installment plans<\/li>\n<li>Housing<\/li>\n<li>Number of existing credits at this bank<\/li>\n<li>Job<\/li>\n<li>Number of dependents<\/li>\n<li>Telephone<\/li>\n<li>Foreign worker<\/li>\n<\/ul>\n<p>Some of the categorical variables have an ordinal relationship, such as &ldquo;<em>Savings account<\/em>,&rdquo; although most do not.<\/p>\n<p>There are two classes, 1 for good customers and 2 for bad customers. Good customers are the default or negative class, whereas bad customers are the exception or positive class. 
A total of 70 percent of the examples are good customers, whereas the remaining 30 percent of examples are bad customers.<\/p>\n<ul>\n<li><strong>Good Customers<\/strong>: Negative or majority class (70%).<\/li>\n<li><strong>Bad Customers<\/strong>: Positive or minority class (30%).<\/li>\n<\/ul>\n<p>A cost matrix is provided with the dataset that gives a different penalty to each misclassification error for the positive class. Specifically, a cost of five is applied to a false negative (marking a bad customer as good) and a cost of one is assigned for a false positive (marking a good customer as bad).<\/p>\n<ul>\n<li><strong>Cost for False Negative<\/strong>: 5<\/li>\n<li><strong>Cost for False Positive<\/strong>: 1<\/li>\n<\/ul>\n<p>This suggests that the positive class is the focus of the prediction task and that it is more costly to the bank or financial institution to give money to a bad customer than to not give money to a good customer. This must be taken into account when selecting a performance metric.<\/p>\n<p>Next, let&rsquo;s take a closer look at the data.<\/p>\n<h2>Explore the Dataset<\/h2>\n<p>First, download the dataset and save it in your current working directory with the name &ldquo;<em>german.csv<\/em>&rdquo;.<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/german.csv\">Download German Credit Dataset (german.csv)<\/a><\/li>\n<\/ul>\n<p>Review the contents of the file.<\/p>\n<p>The first few lines of the file should look as follows:<\/p>\n<pre class=\"crayon-plain-tag\">A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1\r\nA12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201,2\r\nA14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201,1\r\nA11,42,A32,A42,7882,A61,A74,2,A93,A103,4,A122,45,A143,A153,1,A173,2,A191,A201,1\r\nA11,24,A33,A40,4870,A61,A73,3,A93,A101,4,A124,53,A143,A153,2,A173,2,A191,A201,2\r\n...<\/pre>\n<p>We can see that the categorical columns are encoded with an <em>Axxx<\/em> format, where &ldquo;<em>x<\/em>&rdquo; are integers for different labels. A one-hot encoding of the categorical variables will be required.<\/p>\n<p>We can also see that the numerical variables have different scales, e.g. 6, 48, and 12 in column 2, and 1169, 5951, etc. in column 5. This suggests that scaling of the integer columns will be needed for those algorithms that are sensitive to scale.<\/p>\n<p>The target variable or class is the last column and contains values of 1 and 2. 
These will need to be label encoded to 0 and 1, respectively, to meet the general expectation for imbalanced binary classification tasks where 0 represents the negative case and 1 represents the positive case.<\/p>\n<p>The dataset can be loaded as a DataFrame using the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">read_csv() Pandas function<\/a>, specifying the location and the fact that there is no header line.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the dataset location\r\nfilename = 'german.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)<\/pre>\n<p>Once loaded, we can summarize the number of rows and columns by printing the shape of the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html\">DataFrame<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)<\/pre>\n<p>We can also summarize the number of examples in each class using the <a href=\"https:\/\/docs.python.org\/3\/library\/collections.html\">Counter<\/a> object.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Tying this together, the complete example of loading and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the dataset\r\nfrom pandas import read_csv\r\nfrom collections import Counter\r\n# define the dataset location\r\nfilename = 'german.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in 
counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Running the example first loads the dataset and confirms the number of rows and columns, that is 1,000 rows and 20 input variables and 1 target variable.<\/p>\n<p>The class distribution is then summarized, confirming the number of good and bad customers and the percentage of cases in the minority and majority classes.<\/p>\n<pre class=\"crayon-plain-tag\">(1000, 21)\r\nClass=1, Count=700, Percentage=70.000%\r\nClass=2, Count=300, Percentage=30.000%<\/pre>\n<p>We can also take a look at the distribution of the seven numerical input variables by creating a histogram for each.<\/p>\n<p>First, we can select the columns with numeric variables by calling the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.select_dtypes.html\">select_dtypes() function<\/a> on the DataFrame. We can then select just those columns from the DataFrame. We would expect there to be seven, plus the numerical class labels.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# select columns with numerical data types\r\nnum_ix = df.select_dtypes(include=['int64', 'float64']).columns\r\n# select a subset of the dataframe with the chosen columns\r\nsubset = df[num_ix]<\/pre>\n<p>We can then create histograms of each numeric input variable. 
The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create histograms of numeric input variables\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'german.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None)\r\n# select columns with numerical data types\r\nnum_ix = df.select_dtypes(include=['int64', 'float64']).columns\r\n# select a subset of the dataframe with the chosen columns\r\nsubset = df[num_ix]\r\n# create a histogram plot of each numeric variable\r\nax = subset.hist()\r\n# disable axis labels to avoid the clutter\r\nfor axis in ax.flatten():\r\n\taxis.set_xticklabels([])\r\n\taxis.set_yticklabels([])\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example creates the figure with one histogram subplot for each of the seven input variables and one class label in the dataset. The title of each subplot indicates the column number in the DataFrame (e.g. zero-offset from 0 to 20).<\/p>\n<p>We can see many different distributions, some with <a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Gaussian-like distributions<\/a>, others with seemingly exponential or discrete distributions.<\/p>\n<p>Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.<\/p>\n<div id=\"attachment_9677\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9677\" class=\"size-full wp-image-9677\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/Histogram-of-Numeric-Variables-in-the-German-Credit-Dataset.png\" alt=\"Histogram of Numeric Variables in the German Credit Dataset\" width=\"1280\" height=\"960\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-of-Numeric-Variables-in-the-German-Credit-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-of-Numeric-Variables-in-the-German-Credit-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-of-Numeric-Variables-in-the-German-Credit-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-of-Numeric-Variables-in-the-German-Credit-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9677\" class=\"wp-caption-text\">Histogram of Numeric Variables in the German Credit Dataset<\/p>\n<\/div>\n<p>Now that we have reviewed the dataset, let&rsquo;s look at developing a test harness for evaluating candidate models.<\/p>\n<h2>Model Test and Baseline Result<\/h2>\n<p>We will evaluate candidate models using repeated stratified k-fold cross-validation.<\/p>\n<p>The <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation procedure<\/a> provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 1000\/10 or 100 examples.<\/p>\n<p>Stratified means that each fold will contain the same mixture of examples by class, that is about 70 percent to 30 percent good to bad customers. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. 
We will use three repeats.<\/p>\n<p>This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">RepeatedStratifiedKFold scikit-learn class<\/a>.<\/p>\n<p>We will predict class labels of whether a customer is good or not. Therefore, we need a measure that is appropriate for evaluating the predicted class labels.<\/p>\n<p>The focus of the task is on the positive class (bad customers). Precision and recall are a good place to start. Maximizing precision will minimize the false positives and maximizing recall will minimize the false negatives in the predictions made by a model.<\/p>\n<ul>\n<li>Precision = TruePositives \/ (TruePositives + FalsePositives)<\/li>\n<li>Recall = TruePositives \/ (TruePositives + FalseNegatives)<\/li>\n<\/ul>\n<p>Using the F-Measure will calculate the harmonic mean between precision and recall. This is a good single number that can be used to compare and select a model on this problem. The issue is that false negatives are more damaging than false positives.<\/p>\n<ul>\n<li>F-Measure = (2 * Precision * Recall) \/ (Precision + Recall)<\/li>\n<\/ul>\n<p>Remember that false negatives on this dataset are cases of a bad customer being marked as a good customer and being given a loan. 
False positives are cases of a good customer being marked as a bad customer and not being given a loan.<\/p>\n<ul>\n<li><strong>False Negative<\/strong>: Bad Customer (class 1) predicted as a Good Customer (class 0).<\/li>\n<li><strong>False Positive<\/strong>: Good Customer (class 0) predicted as a Bad Customer (class 1).<\/li>\n<\/ul>\n<p>False negatives are more costly to the bank than false positives.<\/p>\n<ul>\n<li>Cost(False Negatives) &gt; Cost(False Positives)<\/li>\n<\/ul>\n<p>Put another way, we are interested in the F-measure that will summarize a model&rsquo;s ability to minimize misclassification errors for the positive class, but we want to favor models that are better at minimizing false negatives over false positives.<\/p>\n<p>This can be achieved by using a version of the F-measure that calculates a weighted <a href=\"https:\/\/machinelearningmastery.com\/arithmetic-geometric-and-harmonic-means-for-machine-learning\/\">harmonic mean<\/a> of precision and recall but favors higher recall scores over precision scores. This is called the <a href=\"https:\/\/machinelearningmastery.com\/fbeta-measure-for-machine-learning\/\">Fbeta-measure<\/a>, a generalization of F-measure, where &ldquo;<em>beta<\/em>&rdquo; is a parameter that defines the weighting of the two scores.<\/p>\n<ul>\n<li>Fbeta-Measure = ((1 + beta^2) * Precision * Recall) \/ (beta^2 * Precision + Recall)<\/li>\n<\/ul>\n<p>A beta value of 2 places more weight on recall than on precision and is referred to as the F2-measure.<\/p>\n<ul>\n<li>F2-Measure = ((1 + 2^2) * Precision * Recall) \/ (2^2 * Precision + Recall)<\/li>\n<\/ul>\n<p>We will use this measure to evaluate models on the German credit dataset. This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.fbeta_score.html\">fbeta_score() scikit-learn function<\/a>.<\/p>\n<p>We can define a function to load the dataset and split the columns into input and output variables. 
We will one-hot encode the categorical variables and label encode the target variable. You might recall that a one-hot encoding replaces the categorical variable with one new column for each value of the variable and marks values with a 1 in the column for that value.<\/p>\n<p>First, we must split the DataFrame into input and output variables.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# split into inputs and outputs\r\nlast_ix = len(dataframe.columns) - 1\r\nX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]<\/pre>\n<p>Next, we need to select all input variables that are categorical, then apply a one-hot encoding and leave the numerical variables untouched.<\/p>\n<p>This can be achieved using a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">ColumnTransformer<\/a> and defining the transform as a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">OneHotEncoder<\/a> applied only to the column indices for categorical variables.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# select categorical features\r\ncat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n# one hot encode cat features only\r\nct = ColumnTransformer([('o',OneHotEncoder(),cat_ix)], remainder='passthrough')\r\nX = ct.fit_transform(X)<\/pre>\n<p>We can then label encode the target variable.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# label encode the target variable to have the classes 0 and 1\r\ny = LabelEncoder().fit_transform(y)<\/pre>\n<p>The <em>load_dataset()<\/em> function below ties all of this together and loads and prepares the dataset for modeling.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a DataFrame\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), 
dataframe[last_ix]\r\n\t# select categorical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\t# one hot encode cat features only\r\n\tct = ColumnTransformer([('o',OneHotEncoder(),cat_ix)], remainder='passthrough')\r\n\tX = ct.fit_transform(X)\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y<\/pre>\n<p>Next, we need a function that will evaluate a set of predictions using the <em>fbeta_score()<\/em> function with <em>beta<\/em> set to 2.<\/p>\n<pre class=\"crayon-plain-tag\"># calculate f2 score\r\ndef f2(y_true, y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)<\/pre>\n<p>We can then define a function that will evaluate a given model on the dataset and return a list of F2-Measure scores for each fold and repeat.<\/p>\n<p>The <em>evaluate_model()<\/em> function below implements this, taking the dataset and model as arguments and returning the list of scores.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores<\/pre>\n<p>Finally, we can evaluate a baseline model on the dataset using this test harness.<\/p>\n<p>A model that predicts the minority class for all examples will achieve a maximum recall score and a baseline precision score. 
This provides a baseline in model performance on this problem by which all other models can be compared.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">DummyClassifier<\/a> class from the scikit-learn library and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &ldquo;<em>constant<\/em>&rdquo; and the &ldquo;<em>constant<\/em>&rdquo; argument to &ldquo;<em>1<\/em>&rdquo; for the minority class.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='constant', constant=1)<\/pre>\n<p>Once the model is evaluated, we can report the mean and standard deviation of the F2-Measure scores directly.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean F2: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this together, the complete example of loading the German Credit dataset, evaluating a baseline model, and reporting the performance is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># test harness and baseline model evaluation for the german credit dataset\r\nfrom collections import Counter\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.metrics import fbeta_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.dummy import DummyClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a DataFrame\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = 
dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\t# one hot encode cat features only\r\n\tct = ColumnTransformer([('o',OneHotEncoder(),cat_ix)], remainder='passthrough')\r\n\tX = ct.fit_transform(X)\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# calculate f2 score\r\ndef f2(y_true, y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# summarize the loaded dataset\r\nprint(X.shape, y.shape, Counter(y))\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='constant', constant=1)\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean F2: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example first loads and summarizes the dataset.<\/p>\n<p>We can see that we have the correct number of rows loaded, and through the one-hot encoding of the categorical input variables, we have increased the number of input variables from 20 to 61. 
That suggests that the 13 categorical variables were encoded into a total of 54 columns.<\/p>\n<p>Importantly, we can see that the class labels have the correct mapping to integers with 0 for the majority class and 1 for the minority class, customary for imbalanced binary classification datasets.<\/p>\n<p>Next, the average of the F2-Measure scores is reported.<\/p>\n<p>In this case, we can see that the baseline algorithm achieves an F2-Measure of about 0.682. This score provides a lower limit on model skill; any model that achieves an average F2-Measure above about 0.682 has skill, whereas models that achieve a score below this value do not have skill on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">(1000, 61) (1000,) Counter({0: 700, 1: 300})\r\nMean F2: 0.682 (0.000)<\/pre>\n<p>Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.<\/p>\n<h2>Evaluate Models<\/h2>\n<p>In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.<\/p>\n<p>The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.<\/p>\n<p>The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).<\/p>\n<p><strong>Can you do better? <\/strong>If you can achieve better F2-Measure performance using the same test harness, I&rsquo;d love to hear about it. 
Let me know in the comments below.<\/p>\n<h3>Evaluate Machine Learning Algorithms<\/h3>\n<p>Let&rsquo;s start by evaluating a mixture of probabilistic machine learning models on the dataset.<\/p>\n<p>It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn&rsquo;t.<\/p>\n<p>We will evaluate the following machine learning models on the German credit dataset:<\/p>\n<ul>\n<li>Logistic Regression (LR)<\/li>\n<li>Linear Discriminant Analysis (LDA)<\/li>\n<li>Naive Bayes (NB)<\/li>\n<li>Gaussian Process Classifier (GPC)<\/li>\n<li>Support Vector Machine (SVM)<\/li>\n<\/ul>\n<p>We will use mostly default model hyperparameters.<\/p>\n<p>We will define each model in turn and add them to a list so that we can evaluate them sequentially. The <em>get_models()<\/em> function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.<\/p>\n<pre class=\"crayon-plain-tag\"># define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='liblinear'))\r\n\tnames.append('LR')\r\n\t# LDA\r\n\tmodels.append(LinearDiscriminantAnalysis())\r\n\tnames.append('LDA')\r\n\t# NB\r\n\tmodels.append(GaussianNB())\r\n\tnames.append('NB')\r\n\t# GPC\r\n\tmodels.append(GaussianProcessClassifier())\r\n\tnames.append('GPC')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\treturn models, names<\/pre>\n<p>We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.<\/p>\n<p>We will one-hot encode the categorical input variables as we did in the previous section, and in this case, we will normalize the numerical input variables. 
This is best performed using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\">MinMaxScaler<\/a> within each fold of the cross-validation evaluation process.<\/p>\n<p>An easy way to implement this is to use a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a> where the first step is a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">ColumnTransformer<\/a> that applies a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">OneHotEncoder<\/a> to just the categorical variables, and a <em>MinMaxScaler<\/em> to just the numerical input variables. To achieve this, we need a list of the column indices for categorical and numerical input variables.<\/p>\n<p>We can update the <em>load_dataset()<\/em> to return the column indexes as well as the input and output elements of the dataset. 
The updated version of this function is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix<\/pre>\n<p>We can then call this function to get the data and the list of categorical and numerical variables.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)<\/pre>\n<p>These can be used to prepare a <em>Pipeline<\/em> to wrap each model prior to evaluating it.<\/p>\n<p>First, the <em>ColumnTransformer<\/em> is defined, which specifies what transform to apply to each type of column, then this is used as the first step in a <em>Pipeline<\/em> that ends with the specific model that will be fit and evaluated.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# one hot encode categorical, normalize numerical\r\n\tct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n\t# wrap the model in a pipeline\r\n\tpipeline = Pipeline(steps=[('t',ct),('m',models[i])])\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, pipeline)<\/pre>\n<p>We can summarize the mean F2-Measure for each algorithm; this will help to directly compare algorithms.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize and store\r\nprint('&gt;%s %.3f (%.3f)' % (names[i], 
mean(scores), std(scores)))<\/pre>\n<p>At the end of the run, we will create a separate box and whisker plot for each algorithm&rsquo;s sample of results.<\/p>\n<p>These plots will use the same y-axis scale so we can compare the distribution of results directly.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the German credit dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># spot check machine learning algorithms on the german credit dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.metrics import fbeta_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\r\nfrom sklearn.naive_bayes import GaussianNB\r\nfrom sklearn.gaussian_process import GaussianProcessClassifier\r\nfrom sklearn.svm import SVC\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the 
target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# calculate f2-measure\r\ndef f2_measure(y_true, y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2_measure)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='liblinear'))\r\n\tnames.append('LR')\r\n\t# LDA\r\n\tmodels.append(LinearDiscriminantAnalysis())\r\n\tnames.append('LDA')\r\n\t# NB\r\n\tmodels.append(GaussianNB())\r\n\tnames.append('NB')\r\n\t# GPC\r\n\tmodels.append(GaussianProcessClassifier())\r\n\tnames.append('GPC')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# one hot encode categorical, normalize numerical\r\n\tct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n\t# wrap the model in a pipeline\r\n\tpipeline = Pipeline(steps=[('t',ct),('m',models[i])])\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each 
algorithm in turn and reports the mean and standard deviation of the F2-Measure scores.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, we can see that none of the tested models has an F2-measure above the baseline of predicting the minority class in all cases (0.682). None of the models is skillful. This is surprising, although it suggests that perhaps the decision boundary between the two classes is noisy.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;LR 0.497 (0.072)\r\n&gt;LDA 0.519 (0.072)\r\n&gt;NB 0.639 (0.049)\r\n&gt;GPC 0.219 (0.061)\r\n&gt;SVM 0.436 (0.077)<\/pre>\n<p>A figure is created showing one box and whisker plot for each algorithm&rsquo;s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.<\/p>\n<div id=\"attachment_9969\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9969\" class=\"size-full wp-image-9969\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-German-Credit-Dataset3.png\" alt=\"Box and Whisker Plot of Machine Learning Models on the Imbalanced German Credit Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-German-Credit-Dataset3.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-German-Credit-Dataset3-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-German-Credit-Dataset3-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-German-Credit-Dataset3-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9969\" class=\"wp-caption-text\">Box and Whisker Plot of Machine Learning Models on the Imbalanced German Credit Dataset<\/p>\n<\/div>\n<p>Now that we have some results, let&rsquo;s see if we can improve them with some undersampling.<\/p>\n<h3>Evaluate Undersampling<\/h3>\n<p>Undersampling is perhaps the least widely used technique when addressing an imbalanced classification task, as most of the focus is put on oversampling the minority class with SMOTE.<\/p>\n<p>Undersampling can help to remove examples from the majority class along the decision boundary that make the problem challenging for classification algorithms.<\/p>\n<p>In this experiment we will test the following undersampling algorithms:<\/p>\n<ul>\n<li>Tomek Links (TL)<\/li>\n<li>Edited Nearest Neighbors (ENN)<\/li>\n<li>Repeated Edited Nearest Neighbors (RENN)<\/li>\n<li>One Sided Selection (OSS)<\/li>\n<li>Neighborhood Cleaning Rule (NCR)<\/li>\n<\/ul>\n<p>The Tomek Links and ENN methods select examples from the majority class to delete, whereas OSS and NCR both select examples to keep and examples to delete. We will use the balanced version of the logistic regression algorithm to test each undersampling method, to keep things simple.<\/p>\n<p>The <em>get_models()<\/em> function from the previous section can be updated to return a list of undersampling techniques to test with the logistic regression algorithm. 
We use the implementations of these algorithms from the imbalanced-learn library.<\/p>\n<p>The updated version of the <em>get_models()<\/em> function defining the undersampling methods is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># define undersampling models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# TL\r\n\tmodels.append(TomekLinks())\r\n\tnames.append('TL')\r\n\t# ENN\r\n\tmodels.append(EditedNearestNeighbours())\r\n\tnames.append('ENN')\r\n\t# RENN\r\n\tmodels.append(RepeatedEditedNearestNeighbours())\r\n\tnames.append('RENN')\r\n\t# OSS\r\n\tmodels.append(OneSidedSelection())\r\n\tnames.append('OSS')\r\n\t# NCR\r\n\tmodels.append(NeighbourhoodCleaningRule())\r\n\tnames.append('NCR')\r\n\treturn models, names<\/pre>\n<p>The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a> provided by scikit-learn does not know about undersampling algorithms. Therefore, we must use the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.pipeline.Pipeline.html\">Pipeline<\/a> implementation provided by the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/\">imbalanced-learn library<\/a>.<\/p>\n<p>As in the previous section, the first step of the pipeline will be one hot encoding of categorical variables and normalization of numerical variables, and the final step will be fitting the model. 
Here, the middle step will be the undersampling technique, correctly applied within the cross-validation evaluation on the training dataset only.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model to evaluate\r\nmodel = LogisticRegression(solver='liblinear', class_weight='balanced')\r\n# one hot encode categorical, normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# scale, then undersample, then fit model\r\npipeline = Pipeline(steps=[('t',ct), ('s', models[i]), ('m',model)])\r\n# evaluate the model and store results\r\nscores = evaluate_model(X, y, pipeline)<\/pre>\n<p>We would expect the undersampling to result in a lift in skill for logistic regression, ideally above the baseline performance of predicting the minority class in all cases.<\/p>\n<p>Tying this together, the complete example of evaluating logistic regression with different undersampling methods on the German credit dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate undersampling with logistic regression on the imbalanced german credit dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.metrics import fbeta_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom matplotlib import pyplot\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.under_sampling import TomekLinks\r\nfrom imblearn.under_sampling import EditedNearestNeighbours\r\nfrom imblearn.under_sampling import RepeatedEditedNearestNeighbours\r\nfrom imblearn.under_sampling 
import NeighbourhoodCleaningRule\r\nfrom imblearn.under_sampling import OneSidedSelection\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# calculate f2-measure\r\ndef f2_measure(y_true, y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2_measure)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define undersampling models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# TL\r\n\tmodels.append(TomekLinks())\r\n\tnames.append('TL')\r\n\t# ENN\r\n\tmodels.append(EditedNearestNeighbours())\r\n\tnames.append('ENN')\r\n\t# RENN\r\n\tmodels.append(RepeatedEditedNearestNeighbours())\r\n\tnames.append('RENN')\r\n\t# OSS\r\n\tmodels.append(OneSidedSelection())\r\n\tnames.append('OSS')\r\n\t# NCR\r\n\tmodels.append(NeighbourhoodCleaningRule())\r\n\tnames.append('NCR')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in 
range(len(models)):\r\n\t# define model to evaluate\r\n\tmodel = LogisticRegression(solver='liblinear', class_weight='balanced')\r\n\t# one hot encode categorical, normalize numerical\r\n\tct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n\t# scale, then undersample, then fit model\r\n\tpipeline = Pipeline(steps=[('t',ct), ('s', models[i]), ('m',model)])\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates the logistic regression algorithm with five different undersampling techniques.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, we can see that three of the five undersampling techniques resulted in an F2-measure that provides an improvement over the baseline of 0.682. Specifically, ENN, RENN, and NCR improve on the baseline, with Repeated Edited Nearest Neighbors achieving the best performance with an F2-measure of about 0.714.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;TL 0.669 (0.057)\r\n&gt;ENN 0.706 (0.048)\r\n&gt;RENN 0.714 (0.041)\r\n&gt;OSS 0.670 (0.054)\r\n&gt;NCR 0.693 (0.052)<\/pre>\n<p>Box and whisker plots are created for each evaluated undersampling technique, showing that they generally have the same spread.<\/p>\n<p>It is encouraging to see that for the well-performing methods, the boxes spread up to around 0.8, and the mean and median for all three methods are around 0.7. 
This highlights that the distributions are skewing high and are let down on occasion by a few bad evaluations.<\/p>\n<div id=\"attachment_9970\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9970\" class=\"size-full wp-image-9970\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Logistic-Regression-With-Undersampling-on-the-Imbalanced-German-Credit-Dataset.png\" alt=\"Box and Whisker Plot of Logistic Regression With Undersampling on the Imbalanced German Credit Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Logistic-Regression-With-Undersampling-on-the-Imbalanced-German-Credit-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Logistic-Regression-With-Undersampling-on-the-Imbalanced-German-Credit-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Logistic-Regression-With-Undersampling-on-the-Imbalanced-German-Credit-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/02\/Box-and-Whisker-Plot-of-Logistic-Regression-With-Undersampling-on-the-Imbalanced-German-Credit-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9970\" class=\"wp-caption-text\">Box and Whisker Plot of Logistic Regression With Undersampling on the Imbalanced German Credit Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s look at some further model improvements before we use a final model to make predictions on new data.<\/p>\n<h3>Further Model Improvements<\/h3>\n<p>This is a new section that departs slightly from the section above. 
Here, we will test specific models that result in a further lift in F2-measure performance and I will update this section as new models are reported\/discovered.<\/p>\n<h4>Improvement #1: InstanceHardnessThreshold<\/h4>\n<p>An F2-measure of about <strong>0.727<\/strong>&nbsp;can be achieved using balanced Logistic Regression with <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.InstanceHardnessThreshold.html\">InstanceHardnessThreshold<\/a> undersampling.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># improve performance on the imbalanced german credit dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.metrics import fbeta_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.under_sampling import InstanceHardnessThreshold\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# calculate f2-measure\r\ndef f2_measure(y_true, 
y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2_measure)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = LogisticRegression(solver='liblinear', class_weight='balanced')\r\n# define the data sampling\r\nsampling = InstanceHardnessThreshold()\r\n# one hot encode categorical, normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# scale, then sample, then fit model\r\npipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])\r\n# evaluate the model and store results\r\nscores = evaluate_model(X, y, pipeline)\r\nprint('%.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example gives the following results; your results may vary given the stochastic nature of the learning algorithm.<\/p>\n<pre class=\"crayon-plain-tag\">0.727 (0.033)<\/pre>\n<\/p>\n<h4>Improvement #2: SMOTEENN<\/h4>\n<p>An F2-measure of about <strong>0.730<\/strong>&nbsp;can be achieved using LDA with <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.combine.SMOTEENN.html\">SMOTEENN<\/a>, where the <em>enn<\/em> argument is set to an <em>EditedNearestNeighbours<\/em> instance with <em>sampling_strategy<\/em> set to <em>'majority'<\/em>.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># improve performance on the imbalanced german credit dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import 
OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.metrics import fbeta_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.combine import SMOTEENN\r\nfrom imblearn.under_sampling import EditedNearestNeighbours\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# calculate f2-measure\r\ndef f2_measure(y_true, y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2_measure)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = LinearDiscriminantAnalysis()\r\n# define the data sampling\r\nsampling = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))\r\n# one hot encode 
categorical, normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# scale, then sample, then fit model\r\npipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])\r\n# evaluate the model and store results\r\nscores = evaluate_model(X, y, pipeline)\r\nprint('%.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example gives the following results; your results may vary given the stochastic nature of the learning algorithm.<\/p>\n<pre class=\"crayon-plain-tag\">0.730 (0.046)<\/pre>\n<\/p>\n<h4>Improvement #3: SMOTEENN with StandardScaler and RidgeClassifier<\/h4>\n<p>An F2-measure of about <strong>0.741<\/strong> can be achieved with further changes to the SMOTEENN approach: using a RidgeClassifier instead of LDA, and a StandardScaler for the numeric inputs instead of a MinMaxScaler.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># improve performance on the imbalanced german credit dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import StandardScaler\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.metrics import fbeta_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.linear_model import RidgeClassifier\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.combine import SMOTEENN\r\nfrom imblearn.under_sampling import EditedNearestNeighbours\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), 
dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# calculate f2-measure\r\ndef f2_measure(y_true, y_pred):\r\n\treturn fbeta_score(y_true, y_pred, beta=2)\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(f2_measure)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = RidgeClassifier()\r\n# define the data sampling\r\nsampling = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))\r\n# one hot encode categorical, standardize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',StandardScaler(),num_ix)])\r\n# scale, then sample, then fit model\r\npipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])\r\n# evaluate the model and store results\r\nscores = evaluate_model(X, y, pipeline)\r\nprint('%.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example gives the following results; your results may vary given the stochastic nature of the learning algorithm.<\/p>\n<pre class=\"crayon-plain-tag\">0.741 (0.034)<\/pre>\n<p><strong>Can you do even better?<\/strong><br \/>\nLet me know in the comments below.<\/p>\n<h2>Make Prediction on New Data<\/h2>\n<p>Given the variance in results, any of the undersampling methods is probably a sufficient choice. 
In this case, we will select logistic regression with Repeated ENN.<\/p>\n<p>This model had an F2-measure of about 0.714 on our test harness.<\/p>\n<p>We will use this as our final model and use it to make predictions on new data.<\/p>\n<p>First, we can define the model as a pipeline.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model to evaluate\r\nmodel = LogisticRegression(solver='liblinear', class_weight='balanced')\r\n# one hot encode categorical, normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# scale, then undersample, then fit model\r\npipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])<\/pre>\n<p>Once defined, we can fit it on the entire training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit the model\r\npipeline.fit(X, y)<\/pre>\n<p>Once fit, we can use it to make predictions for new data by calling the <em>predict()<\/em> function. This will return the class label of 0 for &ldquo;<em>good customer<\/em>&rdquo;, or 1 for &ldquo;<em>bad customer<\/em>&rdquo;.<\/p>\n<p>Importantly, we must use the <em>ColumnTransformer<\/em> that was fit on the training dataset in the <em>Pipeline<\/em> to correctly prepare new data using the same transforms.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define a row of data\r\nrow = [...]\r\n# make prediction\r\nyhat = pipeline.predict([row])<\/pre>\n<p>To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know if the case is a good customer or bad.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and make predictions for the german credit dataset\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.compose import 
ColumnTransformer\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.under_sampling import RepeatedEditedNearestNeighbours\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdataframe = read_csv(full_path, header=None)\r\n\t# split into inputs and outputs\r\n\tlast_ix = len(dataframe.columns) - 1\r\n\tX, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]\r\n\t# select categorical and numerical features\r\n\tcat_ix = X.select_dtypes(include=['object', 'bool']).columns\r\n\tnum_ix = X.select_dtypes(include=['int64', 'float64']).columns\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X.values, y, cat_ix, num_ix\r\n\r\n# define the location of the dataset\r\nfull_path = 'german.csv'\r\n# load the dataset\r\nX, y, cat_ix, num_ix = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = LogisticRegression(solver='liblinear', class_weight='balanced')\r\n# one hot encode categorical, normalize numerical\r\nct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])\r\n# scale, then undersample, then fit model\r\npipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])\r\n# fit the model\r\npipeline.fit(X, y)\r\n# evaluate on some good customers cases (known class 0)\r\nprint('Good Customers:')\r\ndata = [['A11', 6, 'A34', 'A43', 1169, 'A65', 'A75', 4, 'A93', 'A101', 4, 'A121', 67, 'A143', 'A152', 2, 'A173', 1, 'A192', 'A201'],\r\n\t['A14', 12, 'A34', 'A46', 2096, 'A61', 'A74', 2, 'A93', 'A101', 3, 'A121', 49, 'A143', 'A152', 1, 'A172', 2, 'A191', 'A201'],\r\n\t['A11', 42, 'A32', 'A42', 7882, 'A61', 'A74', 2, 'A93', 'A103', 4, 'A122', 45, 'A143', 'A153', 1, 'A173', 2, 'A191', 'A201']]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# 
summarize\r\n\tprint('&gt;Predicted=%d (expected 0)' % (label))\r\n# evaluate on some bad customers (known class 1)\r\nprint('Bad Customers:')\r\ndata = [['A13', 18, 'A32', 'A43', 2100, 'A61', 'A73', 4, 'A93', 'A102', 2, 'A121', 37, 'A142', 'A152', 1, 'A173', 1, 'A191', 'A201'],\r\n\t['A11', 24, 'A33', 'A40', 4870, 'A61', 'A73', 3, 'A93', 'A101', 4, 'A124', 53, 'A143', 'A153', 2, 'A173', 2, 'A191', 'A201'],\r\n\t['A11', 24, 'A32', 'A43', 1282, 'A62', 'A73', 4, 'A92', 'A101', 2, 'A123', 32, 'A143', 'A152', 1, 'A172', 1, 'A191', 'A201']]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 1)' % (label))<\/pre>\n<p>Running the example first fits the model on the entire training dataset.<\/p>\n<p>Then the fit model is used to predict the label of a good customer for cases chosen from the dataset file. We can see that all of these cases are correctly predicted.<\/p>\n<p>Then some cases of actual bad customers are used as input to the model and the label is predicted. 
In this run, the correct label is predicted for two of the three cases. This highlights that although we chose a good model, it is not perfect.<\/p>\n<pre class=\"crayon-plain-tag\">Good Customers:\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\nBad Customers:\r\n&gt;Predicted=0 (expected 1)\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/33oQT1q\">Machine Learning, Neural and Statistical Classification<\/a>, 1994.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.select_dtypes.html\">pandas.DataFrame.select_dtypes API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.fbeta_score.html\">sklearn.metrics.fbeta_score API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.compose.ColumnTransformer.html\">sklearn.compose.ColumnTransformer API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.OneHotEncoder.html\">sklearn.preprocessing.OneHotEncoder API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.pipeline.Pipeline.html\">imblearn.pipeline.Pipeline API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Statlog+(German+Credit+Data)\">Statlog (German Credit Data) Dataset, UCI Machine Learning Repository<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/german.csv\">German Credit Dataset<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/german.names\">German Credit Dataset Description<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop and 
evaluate a model for the imbalanced German credit classification dataset.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to evaluate a suite of machine learning models and improve their performance with data undersampling techniques.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-of-good-and-bad-credit\/\">Develop a Model for the Imbalanced Classification of Good and Bad Credit<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-of-good-and-bad-credit\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Misclassification errors on the minority class are more important than other types of prediction errors for some imbalanced classification tasks. 
One example [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/02\/27\/develop-a-model-for-the-imbalanced-classification-of-good-and-bad-credit\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3182,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3181"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3181"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3181\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3182"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3181"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3181"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3181"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}