{"id":2969,"date":"2019-12-26T18:00:36","date_gmt":"2019-12-26T18:00:36","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/26\/develop-an-intuition-for-severely-skewed-class-distributions\/"},"modified":"2019-12-26T18:00:36","modified_gmt":"2019-12-26T18:00:36","slug":"develop-an-intuition-for-severely-skewed-class-distributions","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/26\/develop-an-intuition-for-severely-skewed-class-distributions\/","title":{"rendered":"Develop an Intuition for Severely Skewed Class Distributions"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>An imbalanced classification problem is a problem that involves predicting a class label where the distribution of class labels in the training dataset is not equal.<\/p>\n<p>A challenge for beginners working with imbalanced classification problems is understanding what a specific skewed class distribution means. For example, what are the differences and implications of a 1:10 vs. a 1:100 class ratio?<\/p>\n<p>Differences in the class distribution for an imbalanced classification problem will influence the choice of data preparation and modeling algorithms. 
Therefore, it is critical that practitioners develop an intuition for the implications of different class distributions.<\/p>\n<p>In this tutorial, you will discover how to develop a practical intuition for imbalanced and highly skewed class distributions.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to create a synthetic dataset for binary classification and plot the examples by class.<\/li>\n<li>How to create synthetic classification datasets with any given class distribution.<\/li>\n<li>How different skewed class distributions actually look in practice.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9296\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9296\" class=\"size-full wp-image-9296\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/How-to-Develop-an-Intuition-Severely-Skewed-Class-Distributions.jpg\" alt=\"How to Develop an Intuition Severely Skewed Class Distributions\" width=\"800\" height=\"450\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/How-to-Develop-an-Intuition-Severely-Skewed-Class-Distributions.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/How-to-Develop-an-Intuition-Severely-Skewed-Class-Distributions-300x169.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/How-to-Develop-an-Intuition-Severely-Skewed-Class-Distributions-768x432.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-9296\" class=\"wp-caption-text\">Develop an Intuition for Severely Skewed Class Distributions<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/kasio69\/40486873483\/\">Boris Kasimov<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three 
parts; they are:<\/p>\n<ol>\n<li>Create and Plot a Binary Classification Problem<\/li>\n<li>Create Synthetic Dataset With Class Distribution<\/li>\n<li>Effect of Skewed Class Distributions<\/li>\n<\/ol>\n<h2>Create and Plot a Binary Classification Problem<\/h2>\n<p>The scikit-learn Python machine learning library provides <a href=\"https:\/\/machinelearningmastery.com\/generate-test-datasets-python-scikit-learn\/\">functions for generating synthetic datasets<\/a>.<\/p>\n<p>The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">make_blobs() function<\/a> can be used to generate a specified number of examples from a test classification problem with a specified number of classes. The function returns the input and output parts of each example ready for modeling.<\/p>\n<p>For example, the snippet below will generate 1,000 examples for a two-class (binary) classification problem with two input variables. The class labels take the values 0 and 1.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\nX, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)<\/pre>\n<p>Once generated, we can then plot the dataset to get an intuition for the spatial relationship between the examples.<\/p>\n<p>Because there are only two input variables, we can create a scatter plot to plot each example as a point. This can be achieved with the <a href=\"https:\/\/matplotlib.org\/3.1.1\/api\/_as_gen\/matplotlib.pyplot.scatter.html\">scatter() matplotlib function<\/a>.<\/p>\n<p>The color of the points can then be varied based on the class values. This can be achieved by first selecting the array indexes for the examples for a given class, then only plotting those points, then repeating the select-and-plot process for the other class. 
The <a href=\"https:\/\/docs.scipy.org\/doc\/numpy\/reference\/generated\/numpy.where.html\">where() NumPy function<\/a> can be used to retrieve the array indexes that match a criterion, such as a class label having a given value.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# create scatter plot for samples from each class\r\nfor class_value in range(2):\r\n\t# get row indexes for samples with this class\r\n\trow_ix = where(y == class_value)\r\n\t# create scatter of these samples\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1])<\/pre>\n<p>Tying this together, the complete example of creating a binary classification test dataset and plotting the examples as a scatter plot is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># generate binary classification dataset and plot\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\nfrom sklearn.datasets import make_blobs\r\n# generate dataset\r\nX, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)\r\n# create scatter plot for samples from each class\r\nfor class_value in range(2):\r\n\t# get row indexes for samples with this class\r\n\trow_ix = where(y == class_value)\r\n\t# create scatter of these samples\r\n\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1])\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>Running the example creates the dataset and scatter plot, showing the examples for each of the two classes with different colors.<\/p>\n<p>We can see that there is an equal number of examples in each class, in this case, 500, and that we can imagine drawing a line to reasonably separate the classes, much like a classification model might when learning to discriminate between the examples.<\/p>\n<div id=\"attachment_9291\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9291\" class=\"size-full wp-image-9291\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset.png\" alt=\"Scatter Plot of Binary Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9291\" class=\"wp-caption-text\">Scatter Plot of Binary Classification Dataset<\/p>\n<\/div>\n<p>Now that we know how to create a synthetic binary classification dataset and plot the examples, let\u2019s look at the effect of class imbalances on the dataset.<\/p>\n<h2>Create Synthetic Dataset With Class Distribution<\/h2>\n<p>The <em>make_blobs()<\/em> function will always create synthetic datasets with an equal class distribution.<\/p>\n<p>Nevertheless, we can use this function to create synthetic classification datasets with arbitrary class distributions with a few extra lines of code.<\/p>\n<p>A class distribution can be defined as a dictionary where the key is the class value (e.g. 
0 or 1) and the value is the number of randomly generated examples to include in the dataset.<\/p>\n<p>For example, an equal class distribution with 5,000 examples in each class would be defined as:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the class distribution\r\nproportions = {0:5000, 1:5000}<\/pre>\n<p>We can then find the largest class count in the distribution and use the <em>make_blobs()<\/em> function to create a dataset with that many examples for each of the classes.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# determine the number of classes\r\nn_classes = len(proportions)\r\n# determine the number of examples to generate for each class\r\nlargest = max([v for k,v in proportions.items()])\r\nn_samples = largest * n_classes<\/pre>\n<p>This is a good starting point, but will give us more samples than are required for each class label.<\/p>\n<p>We can then enumerate the class labels and select the desired number of examples for each class to make up the dataset that will be returned.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# collect the examples\r\nX_list, y_list = list(), list()\r\nfor k,v in proportions.items():\r\n\trow_ix = where(y == k)[0]\r\n\tselected = row_ix[:v]\r\n\tX_list.append(X[selected, :])\r\n\ty_list.append(y[selected])<\/pre>\n<p>We can tie this together into a new function named <em>get_dataset()<\/em> that will take a class distribution and return a synthetic dataset with that class distribution.<\/p>\n<pre class=\"crayon-plain-tag\"># create a dataset with a given class distribution\r\ndef get_dataset(proportions):\r\n\t# determine the number of classes\r\n\tn_classes = len(proportions)\r\n\t# determine the number of examples to generate for each class\r\n\tlargest = max([v for k,v in proportions.items()])\r\n\tn_samples = largest * n_classes\r\n\t# create dataset\r\n\tX, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, 
cluster_std=3)\r\n\t# collect the examples\r\n\tX_list, y_list = list(), list()\r\n\tfor k,v in proportions.items():\r\n\t\trow_ix = where(y == k)[0]\r\n\t\tselected = row_ix[:v]\r\n\t\tX_list.append(X[selected, :])\r\n\t\ty_list.append(y[selected])\r\n\treturn vstack(X_list), hstack(y_list)<\/pre>\n<p>The function can take any number of classes, although we will use it for simple binary classification problems.<\/p>\n<p>Next, we can take the code from the previous section for creating a scatter plot of a dataset and place it in a helper function. Below is the <em>plot_dataset()<\/em> function that will plot the dataset and show a legend to indicate the mapping of colors to class labels.<\/p>\n<pre class=\"crayon-plain-tag\"># scatter plot of dataset, different color for each class\r\ndef plot_dataset(X, y):\r\n\t# create scatter plot for samples from each class\r\n\tn_classes = len(unique(y))\r\n\tfor class_value in range(n_classes):\r\n\t\t# get row indexes for samples with this class\r\n\t\trow_ix = where(y == class_value)[0]\r\n\t\t# create scatter of these samples\r\n\t\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))\r\n\t# show a legend\r\n\tpyplot.legend()\r\n\t# show the plot\r\n\tpyplot.show()<\/pre>\n<p>Finally, we can test these new functions.<\/p>\n<p>We will define a dataset with 5,000 examples for each class (10,000 total examples), and plot the result.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create and plot synthetic dataset with a given class distribution\r\nfrom numpy import unique\r\nfrom numpy import hstack\r\nfrom numpy import vstack\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\nfrom sklearn.datasets import make_blobs\r\n\r\n# create a dataset with a given class distribution\r\ndef get_dataset(proportions):\r\n\t# determine the number of classes\r\n\tn_classes = len(proportions)\r\n\t# determine the number of examples to generate for 
each class\r\n\tlargest = max([v for k,v in proportions.items()])\r\n\tn_samples = largest * n_classes\r\n\t# create dataset\r\n\tX, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)\r\n\t# collect the examples\r\n\tX_list, y_list = list(), list()\r\n\tfor k,v in proportions.items():\r\n\t\trow_ix = where(y == k)[0]\r\n\t\tselected = row_ix[:v]\r\n\t\tX_list.append(X[selected, :])\r\n\t\ty_list.append(y[selected])\r\n\treturn vstack(X_list), hstack(y_list)\r\n\r\n# scatter plot of dataset, different color for each class\r\ndef plot_dataset(X, y):\r\n\t# create scatter plot for samples from each class\r\n\tn_classes = len(unique(y))\r\n\tfor class_value in range(n_classes):\r\n\t\t# get row indexes for samples with this class\r\n\t\trow_ix = where(y == class_value)[0]\r\n\t\t# create scatter of these samples\r\n\t\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))\r\n\t# show a legend\r\n\tpyplot.legend()\r\n\t# show the plot\r\n\tpyplot.show()\r\n\r\n# define the class distribution\r\nproportions = {0:5000, 1:5000}\r\n# generate dataset\r\nX, y = get_dataset(proportions)\r\n# plot dataset\r\nplot_dataset(X, y)<\/pre>\n<p>Running the example creates the dataset and plots the result as before, although this time with our provided class distribution.<\/p>\n<p>In this case, we have many more examples for each class and a helpful legend to indicate the mapping of plot colors to class labels.<\/p>\n<div id=\"attachment_9292\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9292\" class=\"size-full wp-image-9292\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-Provided-Class-Distribution.png\" alt=\"Scatter Plot of Binary Classification Dataset With Provided Class Distribution\" width=\"1280\" height=\"960\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-Provided-Class-Distribution.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-Provided-Class-Distribution-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-Provided-Class-Distribution-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-Provided-Class-Distribution-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9292\" class=\"wp-caption-text\">Scatter Plot of Binary Classification Dataset With Provided Class Distribution<\/p>\n<\/div>\n<p>Now that we have the tools to create and plot a synthetic dataset with arbitrary skewed class distributions, let\u2019s look at the effect of different distributions.<\/p>\n<h2>Effect of Skewed Class Distributions<\/h2>\n<p>It is important to develop an intuition for the spatial relationship between the classes under different class imbalances.<\/p>\n<p>For example, what does a 1:1000 class distribution actually look like?<\/p>\n<p>It is an abstract relationship and we need to tie it to something concrete.<\/p>\n<p>We can generate synthetic test datasets with different imbalanced class distributions and use them as a basis for developing an intuition for the skewed distributions we are likely to encounter in real datasets.<\/p>\n<p>Reviewing scatter plots of different class distributions can give a rough feel for the relationship between the classes, which is useful when selecting techniques and evaluating models on similar class distributions in the future. 
They provide a point of reference.<\/p>\n<p>We have already seen a 1:1 relationship in the previous section (e.g. 5000:5000).<\/p>\n<p>Note that when working with binary classification problems, especially imbalanced problems, it is important that the majority class is assigned to class 0 and the minority class is assigned to class 1. This is because many evaluation metrics will assume this relationship.<\/p>\n<p>Therefore, we can ensure our class distributions meet this practice by defining the majority class first and then the minority class in the call to the <em>get_dataset()<\/em> function; for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the class distribution\r\nproportions = {0:10000, 1:10}\r\n# generate dataset\r\nX, y = get_dataset(proportions)\r\n...<\/pre>\n<p>In this section, we can look at different skewed class distributions with the size of the minority class decreasing on a log scale, such as:<\/p>\n<ul>\n<li>1:10 or {0:10000, 1:1000}<\/li>\n<li>1:100 or {0:10000, 1:100}<\/li>\n<li>1:1000 or {0:10000, 1:10}<\/li>\n<\/ul>\n<p>Let\u2019s take a closer look at each class distribution in turn.<\/p>\n<h3>1:10 Imbalanced Class Distribution<\/h3>\n<p>A 1:10 class distribution with 10,000 to 1,000 examples means that there will be 11,000 examples in the dataset, with about 91 percent for class 0 and about 9 percent for class 1.<\/p>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create and plot synthetic dataset with a given class distribution\r\nfrom numpy import unique\r\nfrom numpy import hstack\r\nfrom numpy import vstack\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\nfrom sklearn.datasets import make_blobs\r\n\r\n# create a dataset with a given class distribution\r\ndef get_dataset(proportions):\r\n\t# determine the number of classes\r\n\tn_classes = len(proportions)\r\n\t# determine the number of examples to generate for each class\r\n\tlargest = max([v for k,v in 
proportions.items()])\r\n\tn_samples = largest * n_classes\r\n\t# create dataset\r\n\tX, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)\r\n\t# collect the examples\r\n\tX_list, y_list = list(), list()\r\n\tfor k,v in proportions.items():\r\n\t\trow_ix = where(y == k)[0]\r\n\t\tselected = row_ix[:v]\r\n\t\tX_list.append(X[selected, :])\r\n\t\ty_list.append(y[selected])\r\n\treturn vstack(X_list), hstack(y_list)\r\n\r\n# scatter plot of dataset, different color for each class\r\ndef plot_dataset(X, y):\r\n\t# create scatter plot for samples from each class\r\n\tn_classes = len(unique(y))\r\n\tfor class_value in range(n_classes):\r\n\t\t# get row indexes for samples with this class\r\n\t\trow_ix = where(y == class_value)[0]\r\n\t\t# create scatter of these samples\r\n\t\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))\r\n\t# show a legend\r\n\tpyplot.legend()\r\n\t# show the plot\r\n\tpyplot.show()\r\n\r\n# define the class distribution\r\nproportions = {0:10000, 1:1000}\r\n# generate dataset\r\nX, y = get_dataset(proportions)\r\n# plot dataset\r\nplot_dataset(X, y)<\/pre>\n<p>Running the example creates the dataset with the defined class distribution and plots the result.<\/p>\n<p>Although the imbalance seems stark, the plot shows that a minority class with about 10 percent as many points as the majority class is not as bad as we might think.<\/p>\n<p>The relationship appears manageable, although if the classes overlapped significantly, we could imagine a very different story.<\/p>\n<div id=\"attachment_9293\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9293\" class=\"size-full wp-image-9293\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-10-Class-Distribution.png\" alt=\"Scatter Plot of Binary Classification Dataset 
With A 1 to 10 Class Distribution\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-10-Class-Distribution.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-10-Class-Distribution-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-10-Class-Distribution-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-10-Class-Distribution-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9293\" class=\"wp-caption-text\">Scatter Plot of Binary Classification Dataset With A 1 to 10 Class Distribution<\/p>\n<\/div>\n<h3>1:100 Imbalanced Class Distribution<\/h3>\n<p>A 1:100 class distribution with 10,000 to 100 examples means that there will be 10,100 examples in the dataset, with about 99 percent for class 0 and about 1 percent for class 1.<\/p>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create and plot synthetic dataset with a given class distribution\r\nfrom numpy import unique\r\nfrom numpy import hstack\r\nfrom numpy import vstack\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\nfrom sklearn.datasets import make_blobs\r\n\r\n# create a dataset with a given class distribution\r\ndef get_dataset(proportions):\r\n\t# determine the number of classes\r\n\tn_classes = len(proportions)\r\n\t# determine the number of examples to generate for each class\r\n\tlargest = max([v for k,v in proportions.items()])\r\n\tn_samples = largest * n_classes\r\n\t# create dataset\r\n\tX, y = 
make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)\r\n\t# collect the examples\r\n\tX_list, y_list = list(), list()\r\n\tfor k,v in proportions.items():\r\n\t\trow_ix = where(y == k)[0]\r\n\t\tselected = row_ix[:v]\r\n\t\tX_list.append(X[selected, :])\r\n\t\ty_list.append(y[selected])\r\n\treturn vstack(X_list), hstack(y_list)\r\n\r\n# scatter plot of dataset, different color for each class\r\ndef plot_dataset(X, y):\r\n\t# create scatter plot for samples from each class\r\n\tn_classes = len(unique(y))\r\n\tfor class_value in range(n_classes):\r\n\t\t# get row indexes for samples with this class\r\n\t\trow_ix = where(y == class_value)[0]\r\n\t\t# create scatter of these samples\r\n\t\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))\r\n\t# show a legend\r\n\tpyplot.legend()\r\n\t# show the plot\r\n\tpyplot.show()\r\n\r\n# define the class distribution\r\nproportions = {0:10000, 1:100}\r\n# generate dataset\r\nX, y = get_dataset(proportions)\r\n# plot dataset\r\nplot_dataset(X, y)<\/pre>\n<p>Running the example creates the dataset with the defined class distribution and plots the result.<\/p>\n<p>A 1 to 100 relationship is a large skew.<\/p>\n<p>The plot makes this clear with what feels like a sprinkling of points compared to the enormous mass of the majority class.<\/p>\n<p>It is most likely that a real-world dataset will fall somewhere on the line between a 1:10 and 1:100 class distribution and the plot for 1:100 really highlights the need to carefully consider each point in the minority class, both in terms of measurement errors (e.g. 
outliers) and in terms of prediction errors that might be made by a model.<\/p>\n<div id=\"attachment_9294\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9294\" class=\"size-full wp-image-9294\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-100-Class-Distribution.png\" alt=\"Scatter Plot of Binary Classification Dataset With A 1 to 100 Class Distribution\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-100-Class-Distribution.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-100-Class-Distribution-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-100-Class-Distribution-768x576.png 768w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-100-Class-Distribution-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9294\" class=\"wp-caption-text\">Scatter Plot of Binary Classification Dataset With A 1 to 100 Class Distribution<\/p>\n<\/div>\n<h3>1:1000 Imbalanced Class Distribution<\/h3>\n<p>A 1:1000 class distribution with 10,000 to 10 examples means that there will be 10,010 examples in the dataset, with about 99.9 percent for class 0 and about 0.1 percent for class 1.<\/p>\n<p>The complete code example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create and plot synthetic dataset with a given class distribution\r\nfrom numpy import unique\r\nfrom numpy import 
hstack\r\nfrom numpy import vstack\r\nfrom numpy import where\r\nfrom matplotlib import pyplot\r\nfrom sklearn.datasets import make_blobs\r\n\r\n# create a dataset with a given class distribution\r\ndef get_dataset(proportions):\r\n\t# determine the number of classes\r\n\tn_classes = len(proportions)\r\n\t# determine the number of examples to generate for each class\r\n\tlargest = max([v for k,v in proportions.items()])\r\n\tn_samples = largest * n_classes\r\n\t# create dataset\r\n\tX, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)\r\n\t# collect the examples\r\n\tX_list, y_list = list(), list()\r\n\tfor k,v in proportions.items():\r\n\t\trow_ix = where(y == k)[0]\r\n\t\tselected = row_ix[:v]\r\n\t\tX_list.append(X[selected, :])\r\n\t\ty_list.append(y[selected])\r\n\treturn vstack(X_list), hstack(y_list)\r\n\r\n# scatter plot of dataset, different color for each class\r\ndef plot_dataset(X, y):\r\n\t# create scatter plot for samples from each class\r\n\tn_classes = len(unique(y))\r\n\tfor class_value in range(n_classes):\r\n\t\t# get row indexes for samples with this class\r\n\t\trow_ix = where(y == class_value)[0]\r\n\t\t# create scatter of these samples\r\n\t\tpyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))\r\n\t# show a legend\r\n\tpyplot.legend()\r\n\t# show the plot\r\n\tpyplot.show()\r\n\r\n# define the class distribution\r\nproportions = {0:10000, 1:10}\r\n# generate dataset\r\nX, y = get_dataset(proportions)\r\n# plot dataset\r\nplot_dataset(X, y)<\/pre>\n<p>Running the example creates the dataset with the defined class distribution and plots the result.<\/p>\n<p>As we might already suspect, a 1 to 1,000 relationship is aggressive. 
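<\/p>\n<p>Before interpreting the plot, it can help to confirm the skew numerically. The snippet below is a small sketch that is not part of the original listings: it uses stand-in labels shaped like the <em>y<\/em> array returned by <em>get_dataset()<\/em> for the {0:10000, 1:10} distribution, and summarizes the class distribution with the standard library <em>Counter<\/em>:<\/p>\n

```python
from collections import Counter

# stand-in labels matching the {0:10000, 1:10} distribution used below;
# in practice this would be the y array returned by get_dataset()
y = [0] * 10000 + [1] * 10

# count and report the number of examples in each class
counter = Counter(y)
for label, count in sorted(counter.items()):
    print('Class %d: %d (%.3f%%)' % (label, count, 100 * count / len(y)))
# prints:
# Class 0: 10000 (99.900%)
# Class 1: 10 (0.100%)
```
<p>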
In our chosen setup, just 10 examples of the minority class are present alongside 10,000 examples of the majority class.<\/p>\n<p>With such a lack of data, we can see that on modeling problems with such a dramatic skew we should probably spend a lot of time on the few minority examples that are available and see whether domain knowledge can be used in some way. Automatic modeling methods will face a tough challenge.<\/p>\n<p>This example also highlights another important aspect orthogonal to the class distribution: the absolute number of examples. For example, although the dataset has a 1:1000 class distribution, having only 10 examples of the minority class is very challenging. However, if we had the same class distribution with 1,000,000 examples of the majority class and 1,000 examples of the minority class, the additional 990 minority class examples would likely be invaluable in developing an effective model.<\/p>\n<div id=\"attachment_9295\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9295\" class=\"size-full wp-image-9295\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-1000-Class-Distribution.png\" alt=\"Scatter Plot of Binary Classification Dataset With A 1 to 1000 Class Distribution\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-1000-Class-Distribution.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-1000-Class-Distribution-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-1000-Class-Distribution-768x576.png 768w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/10\/Scatter-Plot-of-Binary-Classification-Dataset-With-A-1-to-1000-Class-Distribution-1024x768.png 1024w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9295\" class=\"wp-caption-text\">Scatter Plot of Binary Classification Dataset With A 1 to 1000 Class Distribution<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>API<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_blobs.html\">sklearn.datasets.make_blobs API<\/a>.<\/li>\n<li><a href=\"https:\/\/matplotlib.org\/3.1.1\/api\/_as_gen\/matplotlib.pyplot.scatter.html\">matplotlib.pyplot.scatter API<\/a>.<\/li>\n<li><a href=\"https:\/\/docs.scipy.org\/doc\/numpy\/reference\/generated\/numpy.where.html\">numpy.where API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop a practical intuition for imbalanced and highly skewed class distributions.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to create a synthetic dataset for binary classification and plot the examples by class.<\/li>\n<li>How to create synthetic classification datasets with any given class distribution.<\/li>\n<li>How different skewed class distributions actually look in practice.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/how-to-develop-an-intuition-skewed-class-distributions\/\">Develop an Intuition for Severely Skewed Class Distributions<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a 
href=\"https:\/\/machinelearningmastery.com\/how-to-develop-an-intuition-skewed-class-distributions\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee An imbalanced classification problem is a problem that involves predicting a class label where the distribution of class labels in the training [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/26\/develop-an-intuition-for-severely-skewed-class-distributions\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2970,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2969"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2969"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2969\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2970"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2969"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2969"},{"taxonomy":"post
_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2969"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}