{"id":3043,"date":"2020-01-19T18:00:47","date_gmt":"2020-01-19T18:00:47","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/01\/19\/undersampling-algorithms-for-imbalanced-classification\/"},"modified":"2020-01-19T18:00:47","modified_gmt":"2020-01-19T18:00:47","slug":"undersampling-algorithms-for-imbalanced-classification","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/01\/19\/undersampling-algorithms-for-imbalanced-classification\/","title":{"rendered":"Undersampling Algorithms for Imbalanced Classification"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Resampling methods are designed to change the composition of a training dataset for an imbalanced classification task.<\/p>\n<p>Most of the attention of resampling methods for imbalanced classification is put on oversampling the minority class. Nevertheless, a suite of techniques has been developed for undersampling the majority class that can be used in conjunction with effective oversampling methods.<\/p>\n<p>There are many different types of undersampling techniques, although most can be grouped into those that select examples to keep in the transformed dataset, those that select examples to delete, and hybrids that combine both types of methods.<\/p>\n<p>In this tutorial, you will discover undersampling methods for imbalanced classification.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to use the Near-Miss and Condensed Nearest Neighbor Rule methods that select examples to keep from the majority class.<\/li>\n<li>How to use Tomek Links and the Edited Nearest Neighbors Rule methods that select examples to delete from the majority class.<\/li>\n<li>How to use One-Sided Selection and the Neighborhood Cleaning Rule that combine methods for choosing examples to keep and delete from the majority class.<\/li>\n<\/ul>\n<p>Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much 
more <a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-python\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_9452\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9452\" class=\"size-full wp-image-9452\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/01\/How-to-Use-Undersampling-Algorithms-for-Imbalanced-Classification.jpg\" alt=\"How to Use Undersampling Algorithms for Imbalanced Classification\" width=\"800\" height=\"450\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/01\/How-to-Use-Undersampling-Algorithms-for-Imbalanced-Classification.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/01\/How-to-Use-Undersampling-Algorithms-for-Imbalanced-Classification-300x169.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/01\/How-to-Use-Undersampling-Algorithms-for-Imbalanced-Classification-768x432.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-9452\" class=\"wp-caption-text\">How to Use Undersampling Algorithms for Imbalanced Classification<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/vgd1951\/40748796933\/\">nuogein<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Undersampling for Imbalanced Classification<\/li>\n<li>Imbalanced-Learn Library<\/li>\n<li>Methods that Select Examples to Keep\n<ol>\n<li>Near Miss Undersampling<\/li>\n<li>Condensed Nearest Neighbor Rule for Undersampling<\/li>\n<\/ol>\n<\/li>\n<li>Methods that Select Examples to Delete\n<ol>\n<li>Tomek Links for Undersampling<\/li>\n<li>Edited Nearest Neighbors Rule for 
Undersampling<\/li>\n<\/ol>\n<\/li>\n<li>Combinations of Keep and Delete Methods\n<ol>\n<li>One-Sided Selection for Undersampling<\/li>\n<li>Neighborhood Cleaning Rule for Undersampling<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Undersampling for Imbalanced Classification<\/h2>\n<p>Undersampling refers to a group of techniques designed to balance the class distribution for a classification dataset that has a skewed class distribution.<\/p>\n<p>An imbalanced class distribution will have one or more classes with few examples (the minority classes) and one or more classes with many examples (the majority classes). It is best understood in the context of a binary (two-class) classification problem where class 0 is the majority class and class 1 is the minority class.<\/p>\n<p>Undersampling techniques remove examples from the training dataset that belong to the majority class in order to better balance the class distribution, such as reducing the skew from a 1:100 to a 1:10, 1:2, or even a 1:1 class distribution. This is different from oversampling that involves adding examples to the minority class in an effort to reduce the skew in the class distribution.<\/p>\n<blockquote>\n<p>&hellip; undersampling, that consists of reducing the data by eliminating examples belonging to the majority class with the objective of equalizing the number of examples of each class &hellip;<\/p>\n<\/blockquote>\n<p>&mdash; Page 82, <a href=\"https:\/\/amzn.to\/307Xlva\">Learning from Imbalanced Data Sets<\/a>, 2018.<\/p>\n<p>Undersampling methods can be used directly on a training dataset that can then, in turn, be used to fit a machine learning model. 
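The skew reductions mentioned above (for example, from 1:100 down to 1:10 or 1:1) translate directly into a target number of majority class examples to keep. A minimal sketch of that arithmetic (the helper function name is hypothetical, for illustration only):

```python
def majority_target_count(n_minority, desired_ratio):
    """Number of majority examples to keep for a desired minority:majority ratio.

    desired_ratio is given as (minority_part, majority_part), e.g. (1, 10) for 1:10.
    Hypothetical helper, not part of any library.
    """
    minority_part, majority_part = desired_ratio
    return n_minority * majority_part // minority_part

# Reducing a 1:100 skew (100 minority, 10,000 majority examples):
print(majority_target_count(100, (1, 10)))  # 1000 majority examples -> 1:10
print(majority_target_count(100, (1, 1)))   # 100 majority examples  -> balanced 1:1
```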
Typically, undersampling methods are used in conjunction with an oversampling technique for the minority class, and this combination often results in better performance than using oversampling or undersampling alone on the training dataset.<\/p>\n<p>The simplest undersampling technique involves randomly selecting examples from the majority class and deleting them from the training dataset. This is referred to as random undersampling. Although simple and effective, a limitation of this technique is that examples are removed without any concern for how useful or important they might be in determining the decision boundary between the classes. This means it is possible, or even likely, that useful information will be deleted.<\/p>\n<blockquote>\n<p>The major drawback of random undersampling is that this method can discard potentially useful data that could be important for the induction process. The removal of data is a critical decision to be made, hence many the proposal of undersampling use heuristics in order to overcome the limitations of the non- heuristics decisions.<\/p>\n<\/blockquote>\n<p>&mdash; Page 83, <a href=\"https:\/\/amzn.to\/307Xlva\">Learning from Imbalanced Data Sets<\/a>, 2018.<\/p>\n<p>An extension of this approach is to be more discerning regarding the examples from the majority class that are deleted. This typically involves heuristics or learning models that attempt to identify redundant examples for deletion or useful examples for non-deletion.<\/p>\n<p>There are many undersampling techniques that use these types of heuristics. 
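The random undersampling idea described above can be sketched with nothing more than the standard library. In practice you would use a library implementation (imbalanced-learn provides one), but a minimal illustration, with a hypothetical function name and a `ratio` argument meaning the desired majority-to-minority multiple after sampling, looks like this:

```python
import random

def random_undersample(X, y, majority_label=0, ratio=1.0, seed=1):
    """Randomly drop majority-class rows until the majority count is
    ratio * minority count. Minimal sketch for illustration only."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority_label]
    minr = [i for i, label in enumerate(y) if label != majority_label]
    keep_n = int(len(minr) * ratio)                 # majority examples to retain
    kept_maj = rng.sample(maj, min(keep_n, len(maj)))
    keep = sorted(kept_maj + minr)                  # indices surviving the undersample
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy 1:9 imbalance: 90 majority (class 0) examples and 10 minority (class 1).
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10
Xs, ys = random_undersample(X, y, ratio=1.0)        # balanced 1:1 result
```

Note that which majority examples survive is entirely a matter of chance, which is exactly the limitation the heuristic methods below try to address.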
In the following sections, we will review some of the more common methods and develop an intuition for their operation on a synthetic imbalanced binary classification dataset.<\/p>\n<p>We can define a synthetic binary classification dataset using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> from the scikit-learn library. For example, we can create 10,000 examples with two input variables and a 1:100 distribution as follows:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\nn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)<\/pre>\n<p>We can then create a scatter plot of the dataset via the <a href=\"https:\/\/matplotlib.org\/3.1.1\/api\/_as_gen\/matplotlib.pyplot.scatter.html\">scatter() Matplotlib function<\/a> to understand the spatial relationship of the examples in each class and their imbalance.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Tying this together, the complete example of creating an imbalanced classification dataset and plotting the examples is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Generate and plot a synthetic imbalanced classification dataset\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in 
counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the class distribution, showing an approximate 1:100 class distribution with about 10,000 examples with class 0 and 100 with class 1.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})<\/pre>\n<p>Next, a scatter plot is created showing all of the examples in the dataset. We can see a large mass of examples for class 0 (blue) and a small number of examples for class 1 (orange). We can also see that the classes overlap with some examples from class 1 clearly within the part of the feature space that belongs to class 0.<\/p>\n<div id=\"attachment_9439\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9439\" class=\"size-full wp-image-9439\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset.png\" alt=\"Scatter Plot of Imbalanced Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Classification-Dataset-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9439\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Classification 
Dataset<\/p>\n<\/div>\n<p>This plot provides the starting point for developing the intuition for the effect that different undersampling techniques have on the majority class.<\/p>\n<p>Next, we can begin to review popular undersampling methods made available via the <a href=\"https:\/\/github.com\/scikit-learn-contrib\/imbalanced-learn\">imbalanced-learn Python library<\/a>.<\/p>\n<p>There are many different methods to choose from. We will divide them into methods that select what examples from the majority class to keep, methods that select examples to delete, and combinations of both approaches.<\/p>\n<h2>Imbalanced-Learn Library<\/h2>\n<p>In these examples, we will use the implementations provided by the <a 
href=\"https:\/\/github.com\/scikit-learn-contrib\/imbalanced-learn\">imbalanced-learn Python library<\/a>, which can be installed via pip as follows:<\/p>\n<pre class=\"crayon-plain-tag\">sudo pip install imbalanced-learn<\/pre>\n<p>You can confirm that the installation was successful by printing the version of the installed library:<\/p>\n<pre class=\"crayon-plain-tag\"># check version number\r\nimport imblearn\r\nprint(imblearn.__version__)<\/pre>\n<p>Running the example will print the version number of the installed library; for example:<\/p>\n<pre class=\"crayon-plain-tag\">0.5.0<\/pre>\n<h2>Methods that Select Examples to Keep<\/h2>\n<p>In this section, we will take a closer look at two methods that choose which examples from the majority class to keep: the near-miss family of methods and the popular condensed nearest neighbor rule.<\/p>\n<h3>Near Miss Undersampling<\/h3>\n<p>Near Miss refers to a collection of undersampling methods that select examples based on the distance of majority class examples to minority class examples.<\/p>\n<p>The approaches were proposed by Jianping Zhang and Inderjeet Mani in their 2003 paper titled &ldquo;<a href=\"https:\/\/www.site.uottawa.ca\/~nat\/Workshop2003\/jzhang.pdf\">KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction<\/a>.&rdquo;<\/p>\n<p>There are three versions of the technique, named NearMiss-1, NearMiss-2, and NearMiss-3.<\/p>\n<p><strong>NearMiss-1<\/strong> selects examples from the majority class that have the smallest average distance to the three closest examples from the minority class. <strong>NearMiss-2<\/strong> selects examples from the majority class that have the smallest average distance to the three furthest examples from the minority class. 
<strong>NearMiss-3<\/strong> involves selecting a given number of the closest majority class examples for each example in the minority class.<\/p>\n<p>Here, distance is determined in feature space using Euclidean distance or similar.<\/p>\n<ul>\n<li><strong>NearMiss-1<\/strong>: Majority class examples with minimum average distance to three closest minority class examples.<\/li>\n<li><strong>NearMiss-2<\/strong>: Majority class examples with minimum average distance to three furthest minority class examples.<\/li>\n<li><strong>NearMiss-3<\/strong>: Majority class examples with minimum distance to each minority class example.<\/li>\n<\/ul>\n<p>NearMiss-3 seems desirable, given that it will only keep those majority class examples that are on the decision boundary.<\/p>\n<p>We can implement the Near Miss methods using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.NearMiss.html\">NearMiss imbalanced-learn class<\/a>.<\/p>\n<p>The type of near-miss strategy used is defined by the &ldquo;<em>version<\/em>&rdquo; argument, which by default is set to 1 for NearMiss-1, but can be set to 2 or 3 for the other two methods.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the undersampling method\r\nundersample = NearMiss(version=1)<\/pre>\n<p>By default, the technique will undersample the majority class to have the same number of examples as the minority class, although this can be changed by setting the <em>sampling_strategy<\/em> argument to the desired ratio of minority to majority examples after resampling.<\/p>\n<p>First, we can demonstrate NearMiss-1, which selects only those majority class examples with a minimum average distance to the three closest minority class instances, defined by the <em>n_neighbors<\/em> argument.<\/p>\n<p>We would expect the retained majority class examples to cluster around the minority class examples in the region where the two classes overlap.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample imbalanced dataset with 
NearMiss-1\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import NearMiss\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = NearMiss(version=1, n_neighbors=3)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example undersamples the majority class and creates a scatter plot of the transformed dataset.<\/p>\n<p>We can see that, as expected, only those examples in the majority class that are closest to the minority class examples in the overlapping area were retained.<\/p>\n<div id=\"attachment_9440\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9440\" class=\"size-full wp-image-9440\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-1.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled with NearMiss-1\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-1.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-1-300x225.png 
300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-1-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-1-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9440\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled with NearMiss-1<\/p>\n<\/div>\n<p>Next, we can demonstrate the NearMiss-2 strategy, which is an inverse to NearMiss-1. It selects examples that are closest to the most distant examples from the minority class, defined by the <em>n_neighbors<\/em> argument.<\/p>\n<p>This is not an intuitive strategy from the description alone.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample imbalanced dataset with NearMiss-2\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import NearMiss\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = NearMiss(version=2, n_neighbors=3)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example, we can see that the NearMiss-2 selects examples that appear to be in the center of mass 
for the overlap between the two classes.<\/p>\n<div id=\"attachment_9441\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9441\" class=\"size-full wp-image-9441\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-2.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With NearMiss-2\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-2.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-2-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-2-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-2-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9441\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With NearMiss-2<\/p>\n<\/div>\n<p>Finally, we can try NearMiss-3, which selects the closest examples from the majority class for each minority class example.<\/p>\n<p>The <em>n_neighbors_ver3<\/em> argument determines the number of examples to select for each minority example, although the desired balancing ratio set via <em>sampling_strategy<\/em> will filter this so that the desired balance is achieved.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample imbalanced dataset with NearMiss-3\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom 
imblearn.under_sampling import NearMiss\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = NearMiss(version=3, n_neighbors_ver3=3)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>As expected, we can see that each example in the minority class that was in the region of overlap with the majority class has up to three neighbors from the majority class.<\/p>\n<div id=\"attachment_9442\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9442\" class=\"size-full wp-image-9442\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-3.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With NearMiss-3\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-3.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-3-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-3-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-3-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9442\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With NearMiss-3<\/p>\n<\/div>\n<h3>Condensed Nearest Neighbor Rule Undersampling<\/h3>\n<p>Condensed Nearest Neighbors, or CNN for short, is an undersampling technique that seeks a subset of a collection of samples that results in no loss in model performance, referred to as a minimal consistent set.<\/p>\n<blockquote>\n<p>&hellip; the notion of a consistent subset of a sample set. This is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/ieeexplore.ieee.org\/document\/1054155\">The Condensed Nearest Neighbor Rule (Corresp.)<\/a>, 1968.<\/p>\n<p>It is achieved by enumerating the examples in the dataset and adding them to the &ldquo;<em>store<\/em>&rdquo; only if they cannot be classified correctly by the current contents of the store. 
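The store-building loop just described can be sketched in plain Python. This is a simplified 1-NN illustration for two-dimensional points with Euclidean distance, not the library implementation; the function name is hypothetical:

```python
def build_cnn_store(X, y, minority_label=1):
    """Sketch of the CNN 'store': seed it with every minority example, then add
    a majority example only when the current store misclassifies it (1-NN rule)."""
    store = [(x, t) for x, t in zip(X, y) if t == minority_label]

    def predict(p):
        # 1-NN classification against the current contents of the store
        return min(store, key=lambda s: (s[0][0] - p[0]) ** 2 + (s[0][1] - p[1]) ** 2)[1]

    for x, t in zip(X, y):
        if t != minority_label and predict(x) != t:
            store.append((x, t))
    return store

# Tiny example: one minority point at the origin, three majority points to the right.
# Only the first majority point is misclassified by the seeded store, so the final
# store holds the minority point plus that one boundary majority point.
X = [(0.0, 0.0), (5.0, 0.0), (5.5, 0.0), (10.0, 0.0)]
y = [1, 0, 0, 0]
store = build_cnn_store(X, y)
```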
This approach was proposed to reduce the memory requirements for the k-Nearest Neighbors (KNN) algorithm by Peter Hart in the 1968 correspondence titled &ldquo;<a href=\"https:\/\/ieeexplore.ieee.org\/document\/1054155\">The Condensed Nearest Neighbor Rule<\/a>.&rdquo;<\/p>\n<p>When used for imbalanced classification, the store is comprised of all examples in the minority set and only examples from the majority set that cannot be classified correctly are added incrementally to the store.<\/p>\n<p>We can implement the Condensed Nearest Neighbor for undersampling using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.CondensedNearestNeighbour.html\">CondensedNearestNeighbour class<\/a> from the imbalanced-learn library.<\/p>\n<p>During the procedure, the KNN algorithm is used to classify points to determine if they are to be added to the store or not. The k value is set via the <em>n_neighbors<\/em> argument and defaults to 1.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the undersampling method\r\nundersample = CondensedNearestNeighbour(n_neighbors=1)<\/pre>\n<p>It&rsquo;s a relatively slow procedure, so small datasets and small <em>k<\/em> values are preferred.<\/p>\n<p>The complete example of demonstrating the Condensed Nearest Neighbor rule for undersampling is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample and plot imbalanced dataset with the Condensed Nearest Neighbor Rule\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import CondensedNearestNeighbour\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = 
CondensedNearestNeighbour(n_neighbors=1)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first reports the skewed distribution of the raw dataset, then the more balanced distribution for the transformed dataset.<\/p>\n<p>We can see that the resulting distribution is about 1:2 minority to majority examples. This highlights that although the <em>sampling_strategy<\/em> argument seeks to balance the class distribution, the algorithm will continue to add misclassified examples to the store (transformed dataset). This is a desirable property.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})\r\nCounter({0: 188, 1: 100})<\/pre>\n<p>A scatter plot of the resulting dataset is created. 
We can see that the focus of the algorithm is the decision boundary between the two classes: all of the minority class examples are retained, along with those majority class examples that sit closest to them.<\/p>\n<div id=\"attachment_9443\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9443\" class=\"size-full wp-image-9443\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Condensed-Nearest-Neighbor-Rule.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With the Condensed Nearest Neighbor Rule\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Condensed-Nearest-Neighbor-Rule.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Condensed-Nearest-Neighbor-Rule-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Condensed-Nearest-Neighbor-Rule-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Condensed-Nearest-Neighbor-Rule-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9443\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With the Condensed Nearest Neighbor Rule<\/p>\n<\/div>\n<h2>Methods that Select Examples to Delete<\/h2>\n<p>In this section, we will take a closer look at methods that select examples from the majority class to delete, including the popular Tomek Links method and the Edited Nearest Neighbors rule.<\/p>\n<h3>Tomek 
Links for Undersampling<\/h3>\n<p>A criticism of the Condensed Nearest Neighbor Rule is that examples are selected randomly, especially initially.<\/p>\n<p>This has the effect of allowing redundant examples into the store, as well as examples that are internal to the mass of the distribution rather than on the class boundary.<\/p>\n<blockquote>\n<p>The condensed nearest-neighbor (CNN) method chooses samples randomly. This results in a)retention of unnecessary samples and b) occasional retention of internal rather than boundary samples.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309452\">Two modifications of CNN<\/a>, 1976.<\/p>\n<p>Two modifications to the CNN procedure were proposed by Ivan Tomek in his 1976 paper titled &ldquo;<a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309452\">Two modifications of CNN<\/a>.&rdquo; One of the modifications (Method 2) is a rule that finds pairs of examples, one from each class, that together have the smallest Euclidean distance to each other in feature space.<\/p>\n<p>This means that in a binary classification problem with classes 0 and 1, a pair would have an example from each class and would be closest neighbors across the dataset.<\/p>\n<blockquote>\n<p>In words, instances a and b define a Tomek Link if: (i) instance a&rsquo;s nearest neighbor is b, (ii) instance b&rsquo;s nearest neighbor is a, and (iii) instances a and b belong to different classes.<\/p>\n<\/blockquote>\n<p>&mdash; Page 46, <a href=\"https:\/\/amzn.to\/32K9K6d\">Imbalanced Learning: Foundations, Algorithms, and Applications<\/a>, 2013.<\/p>\n<p>These cross-class pairs are now generally referred to as &ldquo;<em>Tomek Links<\/em>&rdquo; and are valuable as they define the class boundary.<\/p>\n<blockquote>\n<p>Method 2 has another potentially important property: It finds pairs of boundary points which participate in the formation of the (piecewise-linear) boundary. 
[&hellip;] Such methods could use these pairs to generate progressively simpler descriptions of acceptably accurate approximations of the original completely specified boundaries.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309452\">Two modifications of CNN<\/a>, 1976.<\/p>\n<p>The procedure for finding Tomek Links can be used to locate all cross-class nearest neighbors. If the examples in the minority class are held constant, the procedure finds all of the examples in the majority class that are closest to the minority class; these ambiguous examples can then be removed.<\/p>\n<blockquote>\n<p>From this definition, we see that instances that are in Tomek Links are either boundary instances or noisy instances. This is due to the fact that only boundary instances and noisy instances will have nearest neighbors, which are from the opposite class.<\/p>\n<\/blockquote>\n<p>&mdash; Page 46, <a href=\"https:\/\/amzn.to\/32K9K6d\">Imbalanced Learning: Foundations, Algorithms, and Applications<\/a>, 2013.<\/p>\n<p>We can implement the Tomek Links method for undersampling using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.TomekLinks.html\">TomekLinks imbalanced-learn class<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the undersampling method\r\nundersample = TomekLinks()<\/pre>\n<p>The complete example demonstrating Tomek Links for undersampling is listed below.<\/p>\n<p>Because the procedure only removes so-named &ldquo;<em>Tomek Links<\/em>&rdquo;, we would not expect the resulting transformed dataset to be balanced, only less ambiguous along the class boundary.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample and plot imbalanced dataset with Tomek Links\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import TomekLinks\r\nfrom matplotlib import pyplot\r\nfrom 
numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = TomekLinks()\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the class distribution for the raw dataset, then the transformed dataset.<\/p>\n<p>We can see that only 26 examples from the majority class were removed.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})\r\nCounter({0: 9874, 1: 100})<\/pre>\n<p>The scatter plot of the transformed dataset does not make the minor editing to the majority class obvious.<\/p>\n<p>This highlights that although finding the ambiguous examples on the class boundary is useful, alone, it is not a great undersampling technique. 
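The mutual-nearest-neighbor test quoted earlier is simple enough to sketch by hand. Below is a minimal NumPy sketch for illustration only; the helper name `find_tomek_links` and the tiny dataset are invented here and are not part of imbalanced-learn:

```python
import numpy as np

def find_tomek_links(X, y):
    # Pair (a, b) is a Tomek Link if a's nearest neighbor is b,
    # b's nearest neighbor is a, and a and b have different labels.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nn = dist.argmin(axis=1)        # index of each point's nearest neighbor
    return [(a, int(b)) for a, b in enumerate(nn)
            if nn[b] == a and y[a] != y[b] and a < b]

# two pure clusters plus one ambiguous cross-class pair in the middle
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0],
              [0.5, 0.0], [0.55, 0.0]])
y = np.array([0, 0, 1, 1, 0, 1])
print(find_tomek_links(X, y))  # [(4, 5)]
```

On a real dataset, the TomekLinks class performs this search for us and removes the majority class member of each detected pair.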
In practice, the Tomek Links procedure is often combined with other methods, such as the Condensed Nearest Neighbor Rule.<\/p>\n<blockquote>\n<p>The choice to combine Tomek Links and CNN is natural, as Tomek Links can be said to remove borderline and noisy instances, while CNN removes redundant instances.<\/p>\n<\/blockquote>\n<p>&mdash; Page 46, <a href=\"https:\/\/amzn.to\/32K9K6d\">Imbalanced Learning: Foundations, Algorithms, and Applications<\/a>, 2013.<\/p>\n<div id=\"attachment_9444\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9444\" class=\"size-full wp-image-9444\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Tomeks-Links-Method.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With the Tomek Links Method\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Tomeks-Links-Method.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Tomeks-Links-Method-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Tomeks-Links-Method-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Tomeks-Links-Method-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9444\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With the Tomek Links Method<\/p>\n<\/div>\n<h3>Edited Nearest Neighbors Rule for Undersampling<\/h3>\n<p>Another rule for finding 
ambiguous and noisy examples in a dataset is called Edited Nearest Neighbors, or sometimes ENN for short.<\/p>\n<p>This rule involves using <em>k=3<\/em> nearest neighbors to locate those examples in a dataset that are misclassified and that are then removed before a k=1 classification rule is applied. This approach of resampling and classification was proposed by Dennis Wilson in his 1972 paper titled &ldquo;<a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309137\">Asymptotic Properties of Nearest Neighbor Rules Using Edited Data<\/a>.&rdquo;<\/p>\n<blockquote>\n<p>The modified three-nearest neighbor rule which uses the three-nearest neighbor rule to edit the preclassified samples and then uses a single-nearest neighbor rule to make decisions is a particularly attractive rule.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309137\">Asymptotic Properties of Nearest Neighbor Rules Using Edited Data<\/a>, 1972.<\/p>\n<p>When used as an undersampling procedure, the rule can be applied to each example in the majority class, allowing those examples that are misclassified as belonging to the minority class to be removed, and those correctly classified to remain.<\/p>\n<p>It is also applied to each example in the minority class where those examples that are misclassified have their nearest neighbors from the majority class deleted.<\/p>\n<blockquote>\n<p>&hellip; for each instance a in the dataset, its three nearest neighbors are computed. If a is a majority class instance and is misclassified by its three nearest neighbors, then a is removed from the dataset. 
Alternatively, if a is a minority class instance and is misclassified by its three nearest neighbors, then the majority class instances among a&rsquo;s neighbors are removed.<\/p>\n<\/blockquote>\n<p>&mdash; Page 46, <a href=\"https:\/\/amzn.to\/32K9K6d\">Imbalanced Learning: Foundations, Algorithms, and Applications<\/a>, 2013.<\/p>\n<p>The Edited Nearest Neighbors rule can be implemented using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.EditedNearestNeighbours.html\">EditedNearestNeighbours imbalanced-learn class<\/a>.<\/p>\n<p>The <em>n_neighbors<\/em> argument controls the number of neighbors to use in the editing rule, which defaults to three, as in the paper.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the undersampling method\r\nundersample = EditedNearestNeighbours(n_neighbors=3)<\/pre>\n<p>The complete example of demonstrating the ENN rule for undersampling is listed below.<\/p>\n<p>Like Tomek Links, the procedure only removes noisy and ambiguous points along the class boundary. 
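The majority-class half of the editing rule can be sketched directly with NumPy. This is a simplified illustration under stated assumptions: the majority label is known in advance, the minority-side deletion step is omitted, and the helper name `enn_undersample` is invented for this sketch:

```python
import numpy as np
from collections import Counter

def enn_undersample(X, y, majority_label=0, k=3):
    # Drop each majority class example whose k nearest neighbors
    # (excluding itself) vote for a different class; keep everything else.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    keep = []
    for i in range(len(X)):
        if y[i] != majority_label:
            keep.append(i)  # this sketch never deletes minority examples
            continue
        neighbors = np.argsort(dist[i])[:k]
        vote = Counter(y[neighbors]).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)
    keep = np.array(keep)
    return X[keep], y[keep]

# majority cluster near 0.0 plus one noisy majority point (2.05)
# sitting inside the minority cluster near 2.0
X = np.array([[0.0], [0.1], [0.2], [0.3], [2.0], [2.1], [2.2], [2.05]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])
X_ed, y_ed = enn_undersample(X, y)
print(len(y), '->', len(y_ed))  # 8 -> 7: the noisy point is edited out
```

The EditedNearestNeighbours class covers the same idea without hand-picking the majority label, and the full rule described above additionally deletes the majority class neighbors of misclassified minority examples.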
As such, we would not expect the resulting transformed dataset to be balanced.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample and plot imbalanced dataset with the Edited Nearest Neighbor rule\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import EditedNearestNeighbours\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = EditedNearestNeighbours(n_neighbors=3)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first summarizes the class distribution for the raw dataset, then the transformed dataset.<\/p>\n<p>We can see that only 94 examples from the majority class were removed.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})\r\nCounter({0: 9806, 1: 100})<\/pre>\n<p>Given the small amount of undersampling performed, the change to the mass of majority examples is not obvious from the plot.<\/p>\n<p>Also, like Tomek Links, the Edited Nearest Neighbor Rule gives best results when combined with another undersampling method.<\/p>\n<div id=\"attachment_9445\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9445\" class=\"size-full wp-image-9445\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Edited-Nearest-Neighbor-Rule.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With the Edited Nearest Neighbor Rule\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Edited-Nearest-Neighbor-Rule.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Edited-Nearest-Neighbor-Rule-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Edited-Nearest-Neighbor-Rule-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Edited-Nearest-Neighbor-Rule-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9445\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With the Edited Nearest Neighbor Rule<\/p>\n<\/div>\n<p>Ivan Tomek, developer of Tomek Links, explored extensions of the Edited Nearest Neighbor Rule in his 1976 paper titled &ldquo;<a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309523\">An Experiment with the Edited Nearest-Neighbor Rule<\/a>.&rdquo;<\/p>\n<p>Among his experiments was a repeated ENN method that invoked the continued editing of the dataset using the ENN rule for a fixed number of iterations, referred to as &ldquo;<em>unlimited editing<\/em>.&rdquo;<\/p>\n<blockquote>\n<p>&hellip; unlimited repetition of Wilson&rsquo;s editing (in fact, editing is always stopped after a finite number of steps because after a certain number of repetitions the design set becomes immune to 
further elimination)<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309523\">An Experiment with the Edited Nearest-Neighbor Rule<\/a>, 1976.<\/p>\n<p>He also describes a method referred to as &ldquo;<em>all k-NN<\/em>&rdquo; that removes all examples from the dataset that were classified incorrectly.<\/p>\n<p>Both of these additional editing procedures are also available via the imbalanced-learn library via the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.RepeatedEditedNearestNeighbours.html\">RepeatedEditedNearestNeighbours<\/a> and <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.AllKNN.html\">AllKNN<\/a> classes.<\/p>\n<h2>Combinations of Keep and Delete Methods<\/h2>\n<p>In this section, we will take a closer look at techniques that combine the techniques we have already looked at to both keep and delete examples from the majority class, such as One-Sided Selection and the Neighborhood Cleaning Rule.<\/p>\n<h3>One-Sided Selection for Undersampling<\/h3>\n<p>One-Sided Selection, or OSS for short, is an undersampling technique that combines Tomek Links and the Condensed Nearest Neighbor (CNN) Rule.<\/p>\n<p>Specifically, Tomek Links are ambiguous points on the class boundary and are identified and removed in the majority class. The CNN method is then used to remove redundant examples from the majority class that are far from the decision boundary.<\/p>\n<blockquote>\n<p>OSS is an undersampling method resulting from the application of Tomek links followed by the application of US-CNN. Tomek links are used as an undersampling method and removes noisy and borderline majority class examples. 
[&hellip;] US-CNN aims to remove examples from the majority class that are distant from the decision border.<\/p>\n<\/blockquote>\n<p>&mdash; Page 84, <a href=\"https:\/\/amzn.to\/307Xlva\">Learning from Imbalanced Data Sets<\/a>, 2018.<\/p>\n<p>This combination of methods was proposed by Miroslav Kubat and Stan Matwin in their 1997 paper titled &ldquo;<a href=\"https:\/\/sci2s.ugr.es\/keel\/pdf\/algorithm\/congreso\/kubat97addressing.pdf\">Addressing The Curse Of Imbalanced Training Sets: One-sided Selection<\/a>.&rdquo;<\/p>\n<p>The CNN procedure occurs in one-step and involves first adding all minority class examples to the store and some number of majority class examples (e.g. 1), then classifying all remaining majority class examples with KNN (<em>k=1<\/em>) and adding those that are misclassified to the store.<\/p>\n<div id=\"attachment_9447\" style=\"width: 610px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9447\" class=\"size-full wp-image-9447\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Overview-of-the-One-Sided-Selection-for-Undersampling-Procedure2.png\" alt=\"Overview of the One-Sided Selection for Undersampling Procedure\" width=\"600\" height=\"351\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Overview-of-the-One-Sided-Selection-for-Undersampling-Procedure2.png 600w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Overview-of-the-One-Sided-Selection-for-Undersampling-Procedure2-300x176.png 300w\" sizes=\"(max-width: 600px) 100vw, 600px\"><\/p>\n<p id=\"caption-attachment-9447\" class=\"wp-caption-text\">Overview of the One-Sided Selection for Undersampling Procedure<br \/>Taken from Addressing The Curse Of Imbalanced Training Sets: One-sided Selection.<\/p>\n<\/div>\n<p>We can implement the OSS undersampling strategy via the <a 
href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.OneSidedSelection.html\">OneSidedSelection imbalanced-learn class<\/a>.<\/p>\n<p>The number of seed examples can be set with <em>n_seeds_S<\/em> and defaults to 1 and the <em>k<\/em> for KNN can be set via the <em>n_neighbors<\/em> argument and defaults to 1.<\/p>\n<p>Given that the CNN procedure occurs in one block, it is more useful to have a larger seed sample of the majority class in order to effectively remove redundant examples. In this case, we will use a value of 200.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the undersampling method\r\nundersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)<\/pre>\n<p>The complete example of applying OSS on the binary classification problem is listed below.<\/p>\n<p>We might expect a large number of redundant examples from the majority class to be removed from the interior of the distribution (e.g. away from the class boundary).<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample and plot imbalanced dataset with One-Sided Selection\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import OneSidedSelection\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], 
label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first reports the class distribution in the raw dataset, then the transformed dataset.<\/p>\n<p>We can see that a large number of examples from the majority class were removed, consisting of both redundant examples (removed via CNN) and ambiguous examples (removed via Tomek Links). The ratio for this dataset is now around 1:10, down from 1:100.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})\r\nCounter({0: 940, 1: 100})<\/pre>\n<p>A scatter plot of the transformed dataset is created, showing that most of the remaining majority class examples lie around the class boundary, near the overlapping examples from the minority class.<\/p>\n<p>It might be interesting to explore larger seed samples from the majority class and different values of <em>k<\/em> used in the one-step CNN procedure.<\/p>\n<div id=\"attachment_9448\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9448\" class=\"size-full wp-image-9448\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-One-Sided-Selection.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With One-Sided Selection\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-One-Sided-Selection.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-One-Sided-Selection-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-One-Sided-Selection-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-One-Sided-Selection-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9448\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With One-Sided Selection<\/p>\n<\/div>\n<h3>Neighborhood Cleaning Rule for Undersampling<\/h3>\n<p>The Neighborhood Cleaning Rule, or NCR for short, is an undersampling technique that combines both the Condensed Nearest Neighbor (CNN) Rule to remove redundant examples and the Edited Nearest Neighbors (ENN) Rule to remove noisy or ambiguous examples.<\/p>\n<p>Like One-Sided Selection (OSS), the CNN method is applied in a one-step manner, then the examples that are misclassified according to a KNN classifier are removed, as per the ENN rule. Unlike OSS, fewer of the redundant examples are removed and more attention is placed on &ldquo;<em>cleaning<\/em>&rdquo; those examples that are retained.<\/p>\n<p>The reason for this is to focus less on improving the balance of the class distribution and more on the quality (lack of ambiguity) of the examples that are retained in the majority class.<\/p>\n<blockquote>\n<p>&hellip; the quality of classification results does not necessarily depend on the size of the class. 
Therefore, we should consider, besides the class distribution, other characteristics of data, such as noise, that may hamper classification.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/link.springer.com\/chapter\/10.1007%2F3-540-48229-6_9\">Improving Identification of Difficult Small Classes by Balancing Class Distribution<\/a>, 2001.<\/p>\n<p>This approach was proposed by Jorma Laurikkala in her 2001 paper titled &ldquo;<a href=\"https:\/\/link.springer.com\/chapter\/10.1007%2F3-540-48229-6_9\">Improving Identification of Difficult Small Classes by Balancing Class Distribution<\/a>.&rdquo;<\/p>\n<p>The approach involves first selecting all examples from the minority class. Then all of the ambiguous examples in the majority class are identified using the ENN rule and removed. Finally, a one-step version of CNN is used where those remaining examples in the majority class that are misclassified against the store are removed, but only if the number of examples in the majority class is larger than half the size of the minority class.<\/p>\n<div id=\"attachment_9449\" style=\"width: 610px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9449\" class=\"size-full wp-image-9449\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Summary-of-the-Neighborhood-Cleaning-Rule-Algorithm.png\" alt=\"Summary of the Neighborhood Cleaning Rule Algorithm\" width=\"600\" height=\"224\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Summary-of-the-Neighborhood-Cleaning-Rule-Algorithm.png 600w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Summary-of-the-Neighborhood-Cleaning-Rule-Algorithm-300x112.png 300w\" sizes=\"(max-width: 600px) 100vw, 600px\"><\/p>\n<p id=\"caption-attachment-9449\" class=\"wp-caption-text\">Summary of the Neighborhood Cleaning Rule Algorithm.<br \/>Taken from 
Improving Identification of Difficult Small Classes by Balancing Class Distribution.<\/p>\n<\/div>\n<p>This technique can be implemented using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.NeighbourhoodCleaningRule.html\">NeighbourhoodCleaningRule imbalanced-learn class<\/a>. The number of neighbors used in the ENN and CNN steps can be specified via the <em>n_neighbors<\/em> argument, which defaults to three. The <em>threshold_cleaning<\/em> argument controls whether or not the CNN step is applied to a given class, which might be useful if there are multiple minority classes with similar sizes. It is kept at the default of 0.5.<\/p>\n<p>The complete example of applying NCR on the binary classification problem is listed below.<\/p>\n<p>Given the focus on data cleaning over removing redundant examples, we would expect only a modest reduction in the number of examples in the majority class.<\/p>\n<pre class=\"crayon-plain-tag\"># Undersample and plot imbalanced dataset with the neighborhood cleaning rule\r\nfrom collections import Counter\r\nfrom sklearn.datasets import make_classification\r\nfrom imblearn.under_sampling import NeighbourhoodCleaningRule\r\nfrom matplotlib import pyplot\r\nfrom numpy import where\r\n# define dataset\r\nX, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,\r\n\tn_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)\r\n# summarize class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# define the undersampling method\r\nundersample = NeighbourhoodCleaningRule(n_neighbors=3, threshold_cleaning=0.5)\r\n# transform the dataset\r\nX, y = undersample.fit_resample(X, y)\r\n# summarize the new class distribution\r\ncounter = Counter(y)\r\nprint(counter)\r\n# scatter plot of examples by class label\r\nfor label, _ in counter.items():\r\n\trow_ix = where(y == label)[0]\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], 
label=str(label))\r\npyplot.legend()\r\npyplot.show()<\/pre>\n<p>Running the example first reports the class distribution in the raw dataset, then the transformed dataset.<\/p>\n<p>We can see that only 114 examples from the majority class were removed.<\/p>\n<pre class=\"crayon-plain-tag\">Counter({0: 9900, 1: 100})\r\nCounter({0: 9786, 1: 100})<\/pre>\n<p>Given the limited and focused amount of undersampling performed, the change to the mass of majority examples is not obvious from the scatter plot that is created.<\/p>\n<div id=\"attachment_9450\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9450\" class=\"size-full wp-image-9450\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Neighborhood-Cleaning-Rule.png\" alt=\"Scatter Plot of Imbalanced Dataset Undersampled With the Neighborhood Cleaning Rule\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Neighborhood-Cleaning-Rule.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Neighborhood-Cleaning-Rule-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Neighborhood-Cleaning-Rule-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-the-Neighborhood-Cleaning-Rule-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9450\" class=\"wp-caption-text\">Scatter Plot of Imbalanced Dataset Undersampled With the Neighborhood Cleaning 
Rule<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.site.uottawa.ca\/~nat\/Workshop2003\/jzhang.pdf\">kNN Approach To Unbalanced Data Distributions: A Case Study Involving Information Extraction<\/a>, 2003.<\/li>\n<li><a href=\"https:\/\/ieeexplore.ieee.org\/document\/1054155\">The Condensed Nearest Neighbor Rule (Corresp.)<\/a>, 1968<\/li>\n<li><a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309452\">Two modifications of CNN<\/a>, 1976.<\/li>\n<li><a href=\"https:\/\/sci2s.ugr.es\/keel\/pdf\/algorithm\/congreso\/kubat97addressing.pdf\">Addressing The Curse Of Imbalanced Training Sets: One-sided Selection<\/a>, 1997.<\/li>\n<li><a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309137\">Asymptotic Properties of Nearest Neighbor Rules Using Edited Data<\/a>, 1972.<\/li>\n<li><a href=\"https:\/\/ieeexplore.ieee.org\/document\/4309523\">An Experiment with the Edited Nearest-Neighbor Rule<\/a>, 1976.<\/li>\n<li><a href=\"https:\/\/link.springer.com\/chapter\/10.1007%2F3-540-48229-6_9\">Improving Identification of Difficult Small Classes by Balancing Class Distribution<\/a>, 2001.<\/li>\n<\/ul>\n<h3>Books<\/h3>\n<ul>\n<li><a href=\"https:\/\/amzn.to\/307Xlva\">Learning from Imbalanced Data Sets<\/a>, 2018.<\/li>\n<li><a href=\"https:\/\/amzn.to\/32K9K6d\">Imbalanced Learning: Foundations, Algorithms, and Applications<\/a>, 2013.<\/li>\n<\/ul>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/under_sampling.html\">Under-sampling, Imbalanced-Learn User Guide<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.NearMiss.html\">imblearn.under_sampling.NearMiss API<\/a><\/li>\n<li><a 
href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.CondensedNearestNeighbour.html\">imblearn.under_sampling.CondensedNearestNeighbour API<\/a><\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.TomekLinks.html\">imblearn.under_sampling.TomekLinks API<\/a><\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.OneSidedSelection.html\">imblearn.under_sampling.OneSidedSelection API<\/a><\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.EditedNearestNeighbours.html\">imblearn.under_sampling.EditedNearestNeighbours API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.under_sampling.NeighbourhoodCleaningRule.html\">imblearn.under_sampling.NeighbourhoodCleaningRule API<\/a><\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Oversampling_and_undersampling_in_data_analysis\">Oversampling and undersampling in data analysis, Wikipedia<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered undersampling methods for imbalanced classification.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to use the Near-Miss and Condensed Nearest Neighbor Rule methods that select examples to keep from the majority class.<\/li>\n<li>How to use Tomek Links and the Edited Nearest Neighbors Rule methods that select examples to delete from the majority class.<\/li>\n<li>How to use One-Sided Selection and the Neighborhood Cleaning Rule that combine methods for choosing examples to keep and delete from the majority class.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" 
href=\"https:\/\/machinelearningmastery.com\/undersampling-algorithms-for-imbalanced-classification\/\">Undersampling Algorithms for Imbalanced Classification<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/undersampling-algorithms-for-imbalanced-classification\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Resampling methods are designed to change the composition of a training dataset for an imbalanced classification task. Most of the attention of [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/01\/19\/undersampling-algorithms-for-imbalanced-classification\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3044,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3043"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3043"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3043\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.ph
p\/wp-json\/wp\/v2\/media\/3044"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3043"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3043"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3043"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}