{"id":3050,"date":"2020-01-21T18:00:30","date_gmt":"2020-01-21T18:00:30","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/01\/21\/combine-oversampling-and-undersampling-for-imbalanced-classification\/"},"modified":"2020-01-21T18:00:30","modified_gmt":"2020-01-21T18:00:30","slug":"combine-oversampling-and-undersampling-for-imbalanced-classification","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/01\/21\/combine-oversampling-and-undersampling-for-imbalanced-classification\/","title":{"rendered":"Combine Oversampling and Undersampling for Imbalanced Classification"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Resampling methods are designed to add or remove examples from the training dataset in order to change the class distribution.<\/p>\n<p>Once the class distributions are more balanced, the suite of standard machine learning classification algorithms can be fit successfully on the transformed datasets.<\/p>\n<p>Oversampling methods duplicate or create new synthetic examples in the minority class, whereas undersampling methods delete or merge examples in the majority class. 
Both types of resampling can be effective when used in isolation, although they can be more effective when used together.<\/p>\n<p>In this tutorial, you will discover how to combine oversampling and undersampling techniques for imbalanced classification.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to define a sequence of oversampling and undersampling methods to be applied to a training dataset or when evaluating a classifier model.<\/li>\n<li>How to manually combine oversampling and undersampling methods for imbalanced classification.<\/li>\n<li>How to use pre-defined and well-performing combinations of resampling methods for imbalanced classification.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_9461\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9461\" class=\"size-full wp-image-9461\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/01\/Combine-Oversampling-and-Undersampling-for-Imbalanced-Classification.jpg\" alt=\"Combine Oversampling and Undersampling for Imbalanced Classification\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/01\/Combine-Oversampling-and-Undersampling-for-Imbalanced-Classification.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/01\/Combine-Oversampling-and-Undersampling-for-Imbalanced-Classification-300x200.jpg 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/01\/Combine-Oversampling-and-Undersampling-for-Imbalanced-Classification-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-9461\" class=\"wp-caption-text\">Combine Oversampling and Undersampling for Imbalanced Classification<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/137294100@N08\/43934817620\/\">Radek Kucharski<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into four parts; they are:<\/p>\n<ol>\n<li>Binary Test Problem and Decision Tree Model<\/li>\n<li>Imbalanced-Learn Library<\/li>\n<li>Manually Combine Over- and Undersampling Methods\n<ol>\n<li>Manually Combine Random Oversampling and Undersampling<\/li>\n<li>Manually Combine SMOTE and Random Undersampling<\/li>\n<\/ol>\n<\/li>\n<li>Use Predefined Combinations of Resampling Methods\n<ol>\n<li>Combination of SMOTE and Tomek Links Undersampling<\/li>\n<li>Combination of SMOTE and Edited Nearest Neighbors Undersampling<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Binary Test Problem and Decision Tree Model<\/h2>\n<p>Before we dive into combinations of oversampling and undersampling methods, let&rsquo;s define a synthetic dataset and model.<\/p>\n<p>We can define a synthetic binary classification dataset using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> from the scikit-learn library.<\/p>\n<p>For example, we can create 10,000 examples with two input variables and a 1:100 class distribution as follows:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)<\/pre>\n<p>We can then create a scatter plot of the dataset via the <a 
href=\"https:\/\/matplotlib.org\/3.1.1\/api\/_as_gen\/matplotlib.pyplot.scatter.html\">scatter() Matplotlib function<\/a> to understand the spatial relationship of the examples in each class and their imbalance.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying this together, the complete example of creating an imbalanced classification dataset and plotting the examples is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Generate and plot a synthetic imbalanced classification dataset\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the class distribution, showing an approximate 1:100 class distribution with 9,900 examples in class 0 and 100 in class 1.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})<\/pre>\n<p>Next, a scatter plot is created showing all of the examples in the dataset. 
We can see a large mass of examples for class 0 (blue) and a small number of examples for class 1 (orange).<\/p>\n<p>We can also see that the classes overlap with some examples from class 1 clearly within the part of the feature space that belongs to class 0.<\/p>\n<div id=\"attachment_9459\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9459\" class=\"size-full wp-image-9459\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-1.png\" alt=\"Scatter Plot of Imbalanced Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-1.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-1-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-1-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-1-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9459\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Classification Dataset<\/p>\n<\/div>\n<p>We can fit a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">DecisionTreeClassifier model<\/a> on this dataset. 
It is a good model to test because it is sensitive to the class distribution in the training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model\r\nmodel = DecisionTreeClassifier()<\/pre>\n<p>We can evaluate the model using <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified k-fold cross-validation<\/a> with three repeats and 10 folds.<\/p>\n<p>The <a href=\"https:\/\/machinelearningmastery.com\/roc-curves-and-precision-recall-curves-for-classification-in-python\/\">ROC area under curve (AUC) measure<\/a> can be used to estimate the performance of the model. It can be optimistic for severely imbalanced datasets, although it does correctly show relative improvements in model performance.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f' % mean(scores))<\/pre>\n<p>Tying this together, the example below evaluates a decision tree model on the imbalanced classification dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluates a decision tree model on the imbalanced dataset\r\nfrom numpy import mean\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\n# generate 2 class dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n# summarize 
performance\r\nprint('Mean ROC AUC: %.3f' % mean(scores))<\/pre>\n<p>Running the example reports the average ROC AUC for the decision tree on the dataset over three repeats of 10-fold cross-validation (i.e. the average over 30 different model evaluations).<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm and the evaluation procedure. Try running the example a few times.<\/p>\n<p>In this example, you can see that the model achieved a ROC AUC of about 0.76. This provides a baseline on this dataset, which we can use to compare different combinations of oversampling and undersampling methods on the training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Mean ROC AUC: 0.762<\/pre>\n<p>Now that we have a test problem, model, and test harness, let&rsquo;s look at manual combinations of oversampling and undersampling methods.<\/p>\n<\/p>\n<h2>Imbalanced-Learn Library<\/h2>\n<p>In these examples, we will use the implementations provided by the <a href=\"https:\/\/github.com\/scikit-learn-contrib\/imbalanced-learn\">imbalanced-learn Python library<\/a>, which can be installed via pip as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install imbalanced-learn<\/pre>\n<p>You can confirm that the installation was successful by printing the version of the installed library:<\/p>\n<pre class=\"crayon-plain-tag\"># check version number\r\nimport imblearn\r\nprint(imblearn.__version__)<\/pre>\n<p>Running the example will print the version number of the installed library; for example:<\/p>\n<pre class=\"crayon-plain-tag\">0.5.0<\/pre>\n<\/p>\n<h2>Manually Combine Over- and Undersampling Methods<\/h2>\n<p>The imbalanced-learn Python library provides a range of resampling techniques, as well as a Pipeline class that can be used to create a combined sequence of resampling methods to apply to a dataset.<\/p>\n<p>We can use the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.pipeline.Pipeline.html\">Pipeline<\/a> to construct a sequence of oversampling and undersampling techniques to apply to a dataset. For example:<\/p>\n<pre class=\"crayon-plain-tag\"># define resampling\r\nover = ...\r\nunder = ...\r\n# define pipeline\r\npipeline = Pipeline(steps=[('o', over), ('u', under)])<\/pre>\n<p>This pipeline first applies an oversampling technique to a dataset, then applies undersampling to the output of the oversampling transform before returning the final outcome. 
It allows transforms to be stacked or applied in sequence on a dataset.<\/p>\n<p>The pipeline can then be used to transform a dataset; for example:<\/p>\n<pre class=\"crayon-plain-tag\"># fit and apply the pipeline\r\nX_resampled, y_resampled = pipeline.fit_resample(X, y)<\/pre>\n<p>Alternately, a model can be added as the last step in the pipeline.<\/p>\n<p>This allows the pipeline to be treated as a model. When it is fit on a training dataset, the transforms are first applied to the training dataset, then the transformed dataset is provided to the model so that it can develop a fit.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model\r\nmodel = ...\r\n# define resampling\r\nover = ...\r\nunder = ...\r\n# define pipeline\r\npipeline = Pipeline(steps=[('o', over), ('u', under), ('m', model)])<\/pre>\n<p>Recall that the resampling is only applied to the training dataset, not the test dataset.<\/p>\n<p>When used in <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation<\/a>, the entire sequence of transforms and fit is applied on each training dataset comprised of cross-validation folds. This is important as both the transforms and fit are performed without knowledge of the holdout set, which avoids <a href=\"https:\/\/machinelearningmastery.com\/data-leakage-machine-learning\/\">data leakage<\/a>. 
For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)<\/pre>\n<p>Now that we know how to manually combine resampling methods, let&rsquo;s look at two examples.<\/p>\n<h3>Manually Combine Random Oversampling and Undersampling<\/h3>\n<p>A good starting point for combining resampling techniques is to start with random or naive methods.<\/p>\n<p>Although they are simple, and often ineffective when applied in isolation, they can be effective when combined.<\/p>\n<p>Random oversampling involves randomly duplicating examples in the minority class, whereas random undersampling involves randomly deleting examples from the majority class.<\/p>\n<p>As these two transforms are performed on separate classes, neither changes the examples that the other acts on. Note, however, that each <em>sampling_strategy<\/em> ratio is computed from the class counts at the point where the transform is applied, so the values used below assume that oversampling comes first.<\/p>\n<p>The example below defines a pipeline that first oversamples the minority class to 10 percent of the majority class, then undersamples the majority class until the minority class is 50 percent of its size (that is, the majority class is left with twice as many examples as the minority class), and then fits a decision tree model.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define resampling\r\nover = RandomOverSampler(sampling_strategy=0.1)\r\nunder = RandomUnderSampler(sampling_strategy=0.5)\r\n# define pipeline\r\npipeline = Pipeline(steps=[('o', over), ('u', under), ('m', model)])<\/pre>\n<p>The complete example of evaluating this combination on the binary classification problem is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># combination of random oversampling and undersampling for imbalanced classification\r\nfrom numpy import mean\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom 
sklearn.tree import DecisionTreeClassifier\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.over_sampling import RandomOverSampler\r\nfrom imblearn.under_sampling import RandomUnderSampler\r\n# generate dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define resampling\r\nover = RandomOverSampler(sampling_strategy=0.1)\r\nunder = RandomUnderSampler(sampling_strategy=0.5)\r\n# define pipeline\r\npipeline = Pipeline(steps=[('o', over), ('u', under), ('m', model)])\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f' % mean(scores))<\/pre>\n<p>Running the example evaluates the system of transforms and the model and summarizes the performance as the mean ROC AUC.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm, resampling algorithms, and the evaluation procedure. 
Try running the example a few times.<\/p>\n<p>In this case, we can see a modest lift in ROC AUC performance from 0.76 with no transforms to about 0.81 with random over- and undersampling.<\/p>\n<pre class=\"crayon-plain-tag\">Mean ROC AUC: 0.814<\/pre>\n<\/p>\n<h3>Manually Combine SMOTE and Random Undersampling<\/h3>\n<p>We are not limited to using random resampling methods.<\/p>\n<p>Perhaps the most popular oversampling method is the Synthetic Minority Oversampling Technique, or SMOTE for short.<\/p>\n<p>SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample as a point along that line.<\/p>\n<p>The authors of the technique recommend using SMOTE on the minority class, followed by an undersampling technique on the majority class.<\/p>\n<blockquote>\n<p>The combination of SMOTE and under-sampling performs better than plain under-sampling.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE: Synthetic Minority Over-sampling Technique<\/a>, 2011.<\/p>\n<p>We can combine SMOTE with <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.RandomUnderSampler.html\">RandomUnderSampler<\/a>. 
Again, the two procedures are performed on different subsets of the training dataset, although the ratios below assume that SMOTE is applied first.<\/p>\n<p>The pipeline below implements this combination, first applying SMOTE to bring the minority class distribution to 10 percent of the majority class, then using <em>RandomUnderSampler<\/em> to reduce the majority class until the minority class is 50 percent of its size before fitting a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">DecisionTreeClassifier<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define pipeline\r\nover = SMOTE(sampling_strategy=0.1)\r\nunder = RandomUnderSampler(sampling_strategy=0.5)\r\nsteps = [('o', over), ('u', under), ('m', model)]<\/pre>\n<p>The example below evaluates this combination on our imbalanced binary classification problem.<\/p>\n<pre class=\"crayon-plain-tag\"># combination of SMOTE and random undersampling for imbalanced classification\r\nfrom numpy import mean\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom imblearn.pipeline import Pipeline\r\nfrom imblearn.over_sampling import SMOTE\r\nfrom imblearn.under_sampling import RandomUnderSampler\r\n# generate dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define pipeline\r\nover = SMOTE(sampling_strategy=0.1)\r\nunder = RandomUnderSampler(sampling_strategy=0.5)\r\nsteps = [('o', over), ('u', under), ('m', model)]\r\npipeline = Pipeline(steps=steps)\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate 
model\r\nscores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f' % mean(scores))<\/pre>\n<p>Running the example evaluates the system of transforms and the model and summarizes the performance as the mean ROC AUC.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm, resampling algorithms, and the evaluation procedure. Try running the example a few times.<\/p>\n<p>In this case, we can see another lift in ROC AUC performance from about 0.81 to about 0.83.<\/p>\n<pre class=\"crayon-plain-tag\">Mean ROC AUC: 0.833<\/pre>\n<\/p>\n<h2>Use Predefined Combinations of Resampling Methods<\/h2>\n<p>There are combinations of oversampling and undersampling methods that have proven effective and together may be considered resampling techniques.<\/p>\n<p>Two examples are the combination of SMOTE with Tomek Links undersampling and SMOTE with Edited Nearest Neighbors undersampling.<\/p>\n<p>The imbalanced-learn Python library provides implementations for both of these combinations directly. Let&rsquo;s take a closer look at each in turn.<\/p>\n<h3>Combination of SMOTE and Tomek Links Undersampling<\/h3>\n<p>SMOTE is an oversampling method that synthesizes new plausible examples in the minority class.<\/p>\n<p>Tomek Links refers to a method for identifying pairs of nearest neighbors in a dataset that have different classes. Removing one or both of the examples in these pairs (such as the examples in the majority class) has the effect of making the decision boundary in the training dataset less noisy or ambiguous.<\/p>\n<p>Gustavo Batista, et al. 
tested combining these methods in their 2003 paper titled &ldquo;<a href=\"http:\/\/www.inf.ufrgs.br\/maslab\/pergamus\/pubs\/balancing-training-data-for.pdf\">Balancing Training Data for Automated Annotation of Keywords: a Case Study<\/a>.&rdquo;<\/p>\n<p>Specifically, first the SMOTE method is applied to oversample the minority class to a balanced distribution, then examples in Tomek Links from the majority classes are identified and removed.<\/p>\n<blockquote>\n<p>In this work, only majority class examples that participate of a Tomek link were removed, since minority class examples were considered too rare to be discarded. [&hellip;] In our work, as minority class examples were artificially created and the data sets are currently balanced, then both majority and minority class examples that form a Tomek link, are removed.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"http:\/\/www.inf.ufrgs.br\/maslab\/pergamus\/pubs\/balancing-training-data-for.pdf\">Balancing Training Data for Automated Annotation of Keywords: a Case Study<\/a>, 2003.<\/p>\n<p>The combination was shown to provide a reduction in false negatives at the cost of an increase in false positives for a binary classification task.<\/p>\n<p>We can implement this combination using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.combine.SMOTETomek.html\">SMOTETomek class<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define resampling\r\nresample = SMOTETomek()<\/pre>\n<p>The SMOTE configuration can be set via the &ldquo;<em>smote<\/em>&rdquo; argument and takes a configured <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.over_sampling.SMOTE.html\">SMOTE<\/a> instance. 
The Tomek Links configuration can be set via the &ldquo;<em>tomek<\/em>&rdquo; argument and takes a configured <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.TomekLinks.html\">TomekLinks<\/a>&nbsp;object.<\/p>\n<p>The default is to balance the dataset with SMOTE then remove Tomek links from all classes. This is the approach used in another paper that explores this combination titled &ldquo;<a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>.&rdquo;<\/p>\n<blockquote>\n<p>&hellip; we propose applying Tomek links to the over-sampled training set as a data cleaning method. Thus, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>, 2004.<\/p>\n<p>Alternately, we can configure the combination to only remove links from the majority class as described in the 2003 paper by specifying the &ldquo;<em>tomek<\/em>&rdquo; argument with an instance of <em>TomekLinks<\/em> with the &ldquo;<em>sampling_strategy<\/em>&rdquo; argument set to only undersample the &lsquo;<em>majority<\/em>&rsquo; class; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define resampling\r\nresample = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))<\/pre>\n<p>We can evaluate this combined resampling strategy with a decision tree classifier on our binary classification problem.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># combined SMOTE and Tomek Links resampling for imbalanced classification\r\nfrom numpy import mean\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection 
import RepeatedStratifiedKFold\r\nfrom imblearn.pipeline import Pipeline\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom imblearn.combine import SMOTETomek\r\nfrom imblearn.under_sampling import TomekLinks\r\n# generate dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define resampling\r\nresample = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))\r\n# define pipeline\r\npipeline = Pipeline(steps=[('r', resample), ('m', model)])\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f' % mean(scores))<\/pre>\n<p>Running the example evaluates the system of transforms and the model and summarizes the performance as the mean ROC AUC.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithm, resampling algorithms, and the evaluation procedure. Try running the example a few times.<\/p>\n<p>In this case, the combined resampling strategy improves on the baseline of about 0.76 but does not offer a benefit over the manual combination of SMOTE and random undersampling (about 0.83) for this model on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">Mean ROC AUC: 0.815<\/pre>\n<\/p>\n<h3>Combination of SMOTE and Edited Nearest Neighbors Undersampling<\/h3>\n<p>SMOTE may be the most popular oversampling technique and can be combined with many different undersampling techniques.<\/p>\n<p>Another very popular undersampling method is the Edited Nearest Neighbors, or ENN, rule. This rule uses <em>k=3<\/em> nearest neighbors to locate examples in a dataset that are misclassified by their neighbors; those examples are then removed. It can be applied to all classes or just those examples in the majority class.<\/p>\n<p>Gustavo Batista, et al. 
explore many combinations of oversampling and undersampling methods compared to the methods used in isolation in their 2004 paper titled &ldquo;<a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>.&rdquo;<\/p>\n<p>This includes the combinations:<\/p>\n<ul>\n<li>Condensed Nearest Neighbors + Tomek Links<\/li>\n<li>SMOTE + Tomek Links<\/li>\n<li>SMOTE + Edited Nearest Neighbors<\/li>\n<\/ul>\n<p>Regarding this final combination, the authors comment that ENN is more aggressive at downsampling the majority class than Tomek Links, providing more in-depth cleaning. They apply the method, removing examples from both the majority and minority classes.<\/p>\n<blockquote>\n<p>&hellip; ENN is used to remove examples from both classes. Thus, any example that is misclassified by its three nearest neighbors is removed from the training set.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>, 2004.<\/p>\n<p>This can be implemented via the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.combine.SMOTEENN.html\">SMOTEENN class<\/a> in the imbalanced-learn library.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define resampling\r\nresample = SMOTEENN()<\/pre>\n<p>The SMOTE configuration can be set as a SMOTE object via the &ldquo;<em>smote<\/em>&rdquo; argument, and the ENN configuration can be set as an <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.EditedNearestNeighbours.html\">EditedNearestNeighbours<\/a> object via the &ldquo;<em>enn<\/em>&rdquo; argument. 
SMOTE defaults to balancing the distribution, followed by ENN, which by default removes misclassified examples from all classes.<\/p>\n<p>We could change the ENN to only remove examples from the majority class by setting the &ldquo;<em>enn<\/em>&rdquo; argument to an <em>EditedNearestNeighbours<\/em> instance with the <em>sampling_strategy<\/em> argument set to &lsquo;<em>majority<\/em>&rsquo;.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define resampling\r\nresample = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))<\/pre>\n<p>We can evaluate the default strategy (editing examples in all classes) with a decision tree classifier on our imbalanced dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># combined SMOTE and Edited Nearest Neighbors resampling for imbalanced classification\r\nfrom numpy import mean\r\nfrom sklearn.datasets import make_classification\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom imblearn.pipeline import Pipeline\r\nfrom sklearn.tree import DecisionTreeClassifier\r\nfrom imblearn.combine import SMOTEENN\r\n# generate dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# define model\r\nmodel = DecisionTreeClassifier()\r\n# define resampling\r\nresample = SMOTEENN()\r\n# define pipeline\r\npipeline = Pipeline(steps=[('r', resample), ('m', model)])\r\n# define evaluation procedure\r\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n# evaluate model\r\nscores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f' % mean(scores))<\/pre>\n<p>Running the example evaluates the system of transforms and the model and summarizes the performance as the mean ROC AUC.<\/p>\n<p>Your specific results 
will vary given the stochastic nature of the learning algorithm, resampling algorithms, and the evaluation procedure. Try running the example a few times.<\/p>\n<p>In this case, we see a further lift in performance over SMOTE with the random undersampling method from about 0.81 to about 0.85.<\/p>\n<pre class=\"crayon-plain-tag\">Mean ROC AUC: 0.856<\/pre>\n<p>This result highlights that editing the oversampled minority class may also be an important consideration that could easily be overlooked.<\/p>\n<p>This matches the finding of the 2004 paper, in which the authors report that SMOTE with Tomek Links and SMOTE with ENN perform well across a range of datasets.<\/p>\n<blockquote>\n<p>Our results show that the over-sampling methods in general, and Smote + Tomek and Smote + ENN (two of the methods proposed in this work) in particular for data sets with few positive (minority) examples, provided very good results in practice.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>, 2004.<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE: Synthetic Minority Over-sampling Technique<\/a>, 2002.<\/li>\n<li><a href=\"http:\/\/www.inf.ufrgs.br\/maslab\/pergamus\/pubs\/balancing-training-data-for.pdf\">Balancing Training Data for Automated Annotation of Keywords: a Case Study<\/a>, 2003.<\/li>\n<li><a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>, 2004.<\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/307Xlva\">Learning from Imbalanced Data Sets<\/a>, 2018.<\/li>\n<li><a href=\"https:\/\/amzn.to\/32K9K6d\">Imbalanced Learning: Foundations, Algorithms, and Applications<\/a>, 
2013.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/github.com\/scikit-learn-contrib\/imbalanced-learn\">imbalanced-learn, GitHub<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/combine.html\">Combination of over- and under-sampling, Imbalanced Learn User Guide<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.over_sampling.RandomOverSampler.html\">imblearn.over_sampling.RandomOverSampler API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.pipeline.Pipeline.html\">imblearn.pipeline.Pipeline API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.RandomUnderSampler.html\">imblearn.under_sampling.RandomUnderSampler API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.over_sampling.SMOTE.html\">imblearn.over_sampling.SMOTE API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.combine.SMOTETomek.html\">imblearn.combine.SMOTETomek API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.combine.SMOTEENN.html\">imblearn.combine.SMOTEENN API<\/a>.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Oversampling_and_undersampling_in_data_analysis\">Oversampling and undersampling in data analysis, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to combine oversampling and undersampling techniques for imbalanced classification.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to define a sequence of oversampling and undersampling methods to be applied to a training dataset or when evaluating a classifier model.<\/li>\n<li>How to manually combine oversampling and undersampling methods for imbalanced 
classification.<\/li>\n<li>How to use pre-defined and well-performing combinations of resampling methods for imbalanced classification.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/combine-oversampling-and-undersampling-for-imbalanced-classification\/\">Combine Oversampling and Undersampling for Imbalanced Classification<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/combine-oversampling-and-undersampling-for-imbalanced-classification\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Resampling methods are designed to add or remove examples from the training dataset in order to change the class distribution. Once the [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/01\/21\/combine-oversampling-and-undersampling-for-imbalanced-classification\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3051,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3050"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3050"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3050\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3051"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3050"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3050"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3050"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}