{"id":3739,"date":"2020-08-06T19:00:35","date_gmt":"2020-08-06T19:00:35","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/06\/multi-class-imbalanced-classification\/"},"modified":"2020-08-06T19:00:35","modified_gmt":"2020-08-06T19:00:35","slug":"multi-class-imbalanced-classification","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/06\/multi-class-imbalanced-classification\/","title":{"rendered":"Multi-Class Imbalanced Classification"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Imbalanced classification refers to prediction tasks where the distribution of examples across class labels is not equal.<\/p>\n<p>Most imbalanced classification examples focus on binary classification tasks, yet many of the tools and techniques for imbalanced classification also directly support multi-class classification problems.<\/p>\n<p>In this tutorial, you will discover how to use the tools of imbalanced classification with a multi-class dataset.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>About the glass identification standard imbalanced multi-class prediction problem.<\/li>\n<li>How to use SMOTE oversampling for imbalanced multi-class classification.<\/li>\n<li>How to use cost-sensitive learning for imbalanced multi-class classification.<\/li>\n<\/ul>\n<p>Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more <a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-python\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_11038\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11038\" class=\"size-full wp-image-11038\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/08\/Multi-Class-Imbalanced-Classification.jpg\" alt=\"Multi-Class Imbalanced Classification\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/08\/Multi-Class-Imbalanced-Classification.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/08\/Multi-Class-Imbalanced-Classification-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/08\/Multi-Class-Imbalanced-Classification-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-11038\" class=\"wp-caption-text\">Multi-Class Imbalanced Classification<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/istolethetv\/8623391483\/\">istolethetv<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Glass Multi-Class Classification Dataset<\/li>\n<li>SMOTE Oversampling for Multi-Class Classification<\/li>\n<li>Cost-Sensitive Learning for Multi-Class Classification<\/li>\n<\/ol>\n<h2>Glass Multi-Class Classification Dataset<\/h2>\n<p>In this tutorial, we will focus on the standard imbalanced multi-class classification problem referred to as &ldquo;<strong>Glass Identification<\/strong>&rdquo; or simply &ldquo;<em>glass<\/em>.&rdquo;<\/p>\n<p>The dataset describes the chemical properties of glass and involves classifying samples of glass using their chemical properties as one of six classes. 
The dataset was credited to <a href=\"https:\/\/www.lexvisio.com\/expert-witness\/vina-r-spiehler-phd-dabft-spiehler-associates\">Vina Spiehler<\/a> in 1987.<\/p>\n<p>Ignoring the sample identification number, there are nine input variables that summarize the properties of the glass dataset; they are:<\/p>\n<ul>\n<li>RI: Refractive Index<\/li>\n<li>Na: Sodium<\/li>\n<li>Mg: Magnesium<\/li>\n<li>Al: Aluminum<\/li>\n<li>Si: Silicon<\/li>\n<li>K: Potassium<\/li>\n<li>Ca: Calcium<\/li>\n<li>Ba: Barium<\/li>\n<li>Fe: Iron<\/li>\n<\/ul>\n<p>The chemical compositions are measured as the weight percent in corresponding oxide.<\/p>\n<p>There are seven types of glass listed; they are:<\/p>\n<ul>\n<li>Class 1: building windows (float processed)<\/li>\n<li>Class 2: building windows (non-float processed)<\/li>\n<li>Class 3: vehicle windows (float processed)<\/li>\n<li>Class 4: vehicle windows (non-float processed)<\/li>\n<li>Class 5: containers<\/li>\n<li>Class 6: tableware<\/li>\n<li>Class 7: headlamps<\/li>\n<\/ul>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Float_glass\">Float glass<\/a> refers to the process used to make the glass.<\/p>\n<p>There are 214 observations in the dataset and the number of observations in each class is imbalanced. Note that there are no examples for class 4 (non-float processed vehicle windows) in the dataset.<\/p>\n<ul>\n<li>Class 1: 70 examples<\/li>\n<li>Class 2: 76 examples<\/li>\n<li>Class 3: 17 examples<\/li>\n<li>Class 4: 0 examples<\/li>\n<li>Class 5: 13 examples<\/li>\n<li>Class 6: 9 examples<\/li>\n<li>Class 7: 29 examples<\/li>\n<\/ul>\n<p>Although there are minority classes, all classes are equally important in this prediction problem.<\/p>\n<p>The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7). 
There are 163 examples of window glass and 51 examples of non-window glass.<\/p>\n<ul>\n<li>Window Glass: 163 examples<\/li>\n<li>Non-Window Glass: 51 examples<\/li>\n<\/ul>\n<p>Another division of the observations would be between float processed glass and non-float processed glass, in the case of window glass only. This division is more balanced.<\/p>\n<ul>\n<li>Float Glass: 87 examples<\/li>\n<li>Non-Float Glass: 76 examples<\/li>\n<\/ul>\n<p>You can learn more about the dataset here:<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv\">Glass Dataset (glass.csv)<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.names\">Glass Dataset Description (glass.names)<\/a><\/li>\n<\/ul>\n<p>No need to download the dataset; we will download it automatically as part of the worked examples.<\/p>\n<p>Below is a sample of the first few rows of the data.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1\r\n1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1\r\n1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1\r\n1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1\r\n1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1\r\n...<\/pre>\n<p>We can see that all inputs are numeric and the target variable in the final column is the integer encoded class label.<\/p>\n<p>You can learn more about how to work through this dataset as part of a project in the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/imbalanced-multiclass-classification-with-the-glass-identification-dataset\/\">Imbalanced Multiclass Classification with the Glass Identification Dataset<\/a><\/li>\n<\/ul>\n<p>Now that we are familiar with the glass multi-class classification dataset, let&rsquo;s explore how we can use standard imbalanced classification tools with it.<\/p>\n<h2>SMOTE Oversampling for Multi-Class Classification<\/h2>\n<p>Oversampling refers to copying or synthesizing new examples of the minority classes so that the number of examples in the minority class better resembles or matches the number of examples in the majority classes.<\/p>\n<p>Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling TEchnique, or SMOTE for short. This technique was described by Nitesh Chawla, et al. 
in their 2002 paper named for the technique, &ldquo;<a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE: Synthetic Minority Over-sampling Technique<\/a>.&rdquo;<\/p>\n<p>You can learn more about SMOTE in the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/smote-oversampling-for-imbalanced-classification\/\">SMOTE for Imbalanced Classification with Python<\/a><\/li>\n<\/ul>\n<p>The <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/index.html\">imbalanced-learn library<\/a> provides an implementation of <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.over_sampling.SMOTE.html\">SMOTE<\/a> that is compatible with the popular scikit-learn library.<\/p>\n<p>First, the library must be installed. We can install it using pip as follows:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">sudo pip install imbalanced-learn<\/pre>\n<p>We can confirm that the installation was successful by printing the version of the installed library:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># check version number\r\nimport imblearn\r\nprint(imblearn.__version__)<\/pre>\n<p>Running the example will print the version number of the installed library; for example:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">0.6.2<\/pre>\n<p>Before we apply SMOTE, let&rsquo;s first load the dataset and confirm the number of examples in each class.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># load and summarize the dataset\r\nfrom pandas import read_csv\r\nfrom collections import Counter\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\n# define the dataset location\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(url, header=None)\r\ndata = df.values\r\n# split into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# label encode the target variable\r\ny = 
LabelEncoder().fit_transform(y)\r\n# summarize distribution\r\ncounter = Counter(y)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(y) * 100\r\n\tprint('Class=%d, n=%d (%.3f%%)' % (k, v, per))\r\n# plot the distribution\r\npyplot.bar(counter.keys(), counter.values())\r\npyplot.show()<\/pre>\n<p>Running the example first downloads the dataset and splits it into input and output elements.<\/p>\n<p>The number of rows in each class is then reported, confirming that some classes, such as 0 and 1, have many more examples (70 or more) than other classes, such as 3 and 4 (fewer than 15).<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Class=0, n=70 (32.710%)\r\nClass=1, n=76 (35.514%)\r\nClass=2, n=17 (7.944%)\r\nClass=3, n=13 (6.075%)\r\nClass=4, n=9 (4.206%)\r\nClass=5, n=29 (13.551%)<\/pre>\n<p>A bar chart is created providing a visualization of the class breakdown of the dataset.<\/p>\n<p>This gives a clearer idea that classes 0 and 1 have many more examples than classes 2, 3, 4 and 5.<\/p>\n<div id=\"attachment_11035\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11035\" class=\"size-full wp-image-11035\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset.png\" alt=\"Histogram of Examples in Each Class in the Glass Multi-Class Classification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-11035\" class=\"wp-caption-text\">Histogram of Examples in Each Class in the Glass Multi-Class Classification Dataset<\/p>\n<\/div>\n<p>Next, we can apply SMOTE to oversample the dataset.<\/p>\n<p>By default, SMOTE will oversample all classes to have the same number of examples as the class with the most examples.<\/p>\n<p>In this case, class 1 has the most examples with 76, therefore, SMOTE will oversample all classes to have 76 examples.<\/p>\n<p>The complete example of oversampling the glass dataset with SMOTE is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># example of oversampling a multi-class classification dataset\r\nfrom pandas import read_csv\r\nfrom imblearn.over_sampling import SMOTE\r\nfrom collections import Counter\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\n# define the dataset location\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(url, header=None)\r\ndata = df.values\r\n# split into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# label encode the target variable\r\ny = LabelEncoder().fit_transform(y)\r\n# transform the dataset\r\noversample = SMOTE()\r\nX, y = oversample.fit_resample(X, y)\r\n# summarize distribution\r\ncounter = Counter(y)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(y) * 100\r\n\tprint('Class=%d, n=%d (%.3f%%)' % (k, v, per))\r\n# plot the 
distribution\r\npyplot.bar(counter.keys(), counter.values())\r\npyplot.show()<\/pre>\n<p>Running the example first loads the dataset and applies SMOTE to it.<\/p>\n<p>The distribution of examples in each class is then reported, confirming that each class now has 76 examples, as we expected.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Class=0, n=76 (16.667%)\r\nClass=1, n=76 (16.667%)\r\nClass=2, n=76 (16.667%)\r\nClass=3, n=76 (16.667%)\r\nClass=4, n=76 (16.667%)\r\nClass=5, n=76 (16.667%)<\/pre>\n<p>A bar chart of the class distribution is also created, providing a strong visual indication that all classes now have the same number of examples.<\/p>\n<div id=\"attachment_11036\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11036\" class=\"size-full wp-image-11036\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Default-SMOTE-Oversampling.png\" alt=\"Histogram of Examples in Each Class in the Glass Multi-Class Classification Dataset After Default SMOTE Oversampling\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Default-SMOTE-Oversampling.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Default-SMOTE-Oversampling-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Default-SMOTE-Oversampling-1024x768.png 1024w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Default-SMOTE-Oversampling-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-11036\" class=\"wp-caption-text\">Histogram of Examples in Each Class in the Glass Multi-Class Classification Dataset After Default SMOTE Oversampling<\/p>\n<\/div>\n<p>Instead of using the default strategy of SMOTE to oversample all classes to the number of examples in the majority class, we could specify the number of examples to oversample in each class.<\/p>\n<p>For example, we could oversample classes 0 and 1 to 100 examples and the remaining classes to 200 examples. This can be achieved by creating a dictionary that maps class labels to the number of desired examples in each class, then specifying this via the &ldquo;<em>sampling_strategy<\/em>&rdquo; argument to the SMOTE class.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# transform the dataset\r\nstrategy = {0:100, 1:100, 2:200, 3:200, 4:200, 5:200}\r\noversample = SMOTE(sampling_strategy=strategy)\r\nX, y = oversample.fit_resample(X, y)<\/pre>\n<p>Tying this together, the complete example of using a custom oversampling strategy for SMOTE is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># example of oversampling a multi-class classification dataset with a custom strategy\r\nfrom pandas import read_csv\r\nfrom imblearn.over_sampling import SMOTE\r\nfrom collections import Counter\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\n# define the dataset location\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(url, header=None)\r\ndata = df.values\r\n# split into input and output elements\r\nX, y = data[:, :-1], data[:, -1]\r\n# 
label encode the target variable\r\ny = LabelEncoder().fit_transform(y)\r\n# transform the dataset\r\nstrategy = {0:100, 1:100, 2:200, 3:200, 4:200, 5:200}\r\noversample = SMOTE(sampling_strategy=strategy)\r\nX, y = oversample.fit_resample(X, y)\r\n# summarize distribution\r\ncounter = Counter(y)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(y) * 100\r\n\tprint('Class=%d, n=%d (%.3f%%)' % (k, v, per))\r\n# plot the distribution\r\npyplot.bar(counter.keys(), counter.values())\r\npyplot.show()<\/pre>\n<p>Running the example creates the desired sampling and summarizes the effect on the dataset, confirming the intended result.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Class=0, n=100 (10.000%)\r\nClass=1, n=100 (10.000%)\r\nClass=2, n=200 (20.000%)\r\nClass=3, n=200 (20.000%)\r\nClass=4, n=200 (20.000%)\r\nClass=5, n=200 (20.000%)<\/pre>\n<p>Note: you may see warnings that can be safely ignored for the purposes of this example, such as:<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">UserWarning: After over-sampling, the number of samples (200) in class 5 will be larger than the number of samples in the majority class (class #1 -&gt; 76)<\/pre>\n<p>A bar chart of the class distribution is also created confirming the specified class distribution after data sampling.<\/p>\n<div id=\"attachment_11037\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-11037\" class=\"size-full wp-image-11037\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Custom-SMOTE-Oversampling.png\" alt=\"Histogram of Examples in Each Class in the Glass Multi-Class Classification Dataset After Custom SMOTE Oversampling\" width=\"1280\" height=\"960\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Custom-SMOTE-Oversampling.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Custom-SMOTE-Oversampling-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Custom-SMOTE-Oversampling-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/06\/Histogram-of-Examples-in-Each-Class-in-the-Glass-Multi-Class-Classification-Dataset-After-Custom-SMOTE-Oversampling-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-11037\" class=\"wp-caption-text\">Histogram of Examples in Each Class in the Glass Multi-Class Classification Dataset After Custom SMOTE Oversampling<\/p>\n<\/div>\n<p><strong>Note<\/strong>: when using data sampling like SMOTE, it must only be applied to the training dataset, not the entire dataset. I recommend using a Pipeline to ensure that the SMOTE method is correctly used when evaluating models and making predictions with models.<\/p>\n<p>You can see an example of the correct usage of SMOTE in a Pipeline in this tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/smote-oversampling-for-imbalanced-classification\/\">SMOTE for Imbalanced Classification with Python<\/a><\/li>\n<\/ul>\n<h2>Cost-Sensitive Learning for Multi-Class Classification<\/h2>\n<p>Most machine learning algorithms assume that all classes have an equal number of examples.<\/p>\n<p>This is not the case in multi-class imbalanced classification. 
Algorithms can be modified to bias learning toward those classes that have fewer examples in the training dataset. This is generally called cost-sensitive learning.<\/p>\n<p>For more on cost-sensitive learning, see the tutorial:<\/p>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/cost-sensitive-learning-for-imbalanced-classification\/\">Cost-Sensitive Learning for Imbalanced Classification<\/a><\/li>\n<\/ul>\n<p>The <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html\">RandomForestClassifier class<\/a> in scikit-learn supports cost-sensitive learning via the &ldquo;<em>class_weight<\/em>&rdquo; argument.<\/p>\n<p>By default, the random forest class assigns equal weight to each class.<\/p>\n<p>We can evaluate the classification accuracy of the default random forest class weighting on the glass imbalanced multi-class classification dataset.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># baseline model and test harness for the glass identification dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.ensemble import RandomForestClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, 
random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define the reference model\r\nmodel = RandomForestClassifier(n_estimators=1000)\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example evaluates the default random forest algorithm with 1,000 trees on the glass dataset using <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">repeated stratified k-fold cross-validation<\/a>.<\/p>\n<p>The mean and standard deviation classification accuracy are reported at the end of the run.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. 
Try running the example a few times.<\/p>\n<p>In this case, we can see that the default model achieved a classification accuracy of about 79.6 percent.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Mean Accuracy: 0.796 (0.047)<\/pre>\n<p>We can set the &ldquo;<em>class_weight<\/em>&rdquo; argument to the value &ldquo;<em>balanced<\/em>&rdquo;, which will automatically calculate a class weighting that ensures each class receives an equal weighting during the training of the model.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the model\r\nmodel = RandomForestClassifier(n_estimators=1000, class_weight='balanced')<\/pre>\n<p>Tying this together, the complete example is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># cost sensitive random forest with default class weights\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.ensemble import RandomForestClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\n# load the dataset\r\nX, y = 
load_dataset(full_path)\r\n# define the model\r\nmodel = RandomForestClassifier(n_estimators=1000, class_weight='balanced')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example reports the mean and standard deviation classification accuracy of the cost-sensitive version of random forest on the glass dataset.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.<\/p>\n<p>In this case, we can see that the cost-sensitive model achieved a lift in classification accuracy over the cost-insensitive version of the algorithm, with 80.2 percent classification accuracy vs. 79.6 percent.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Mean Accuracy: 0.802 (0.044)<\/pre>\n<p>The &ldquo;<em>class_weight<\/em>&rdquo; argument takes a dictionary of class labels mapped to a class weighting value.<\/p>\n<p>We can use this to specify a custom weighting, such as a default weighting of 1.0 for classes 0 and 1 that have many examples and a double class weighting of 2.0 for the other classes.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">...\r\n# define the model\r\nweights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}\r\nmodel = RandomForestClassifier(n_estimators=1000, class_weight=weights)<\/pre>\n<p>Tying this together, the complete example of using a custom class weighting for cost-sensitive learning on the glass multi-class imbalanced classification problem is listed below.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\"># cost sensitive random forest with custom class weightings\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom 
sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.ensemble import RandomForestClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define the model\r\nweights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}\r\nmodel = RandomForestClassifier(n_estimators=1000, class_weight=weights)\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example reports the mean and standard deviation classification accuracy of the cost-sensitive version of random forest on the glass dataset with custom weights.<\/p>\n<p>Your specific results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. 
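As an aside on the "balanced" mode used earlier: the weighting it implies can be reproduced by hand from the class counts reported at the start of the tutorial. The sketch below assumes scikit-learn's documented formula, weight_c = n_samples / (n_classes * count_c), and cross-checks it against the `compute_class_weight` helper; it is a fact-checking aid, not part of the tutorial's worked examples.

```python
# sketch: what class_weight='balanced' computes, reproduced by hand
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# class counts for the glass dataset, as reported earlier in the tutorial
counts = {0: 70, 1: 76, 2: 17, 3: 13, 4: 9, 5: 29}
y = np.repeat(list(counts.keys()), list(counts.values()))
# the documented 'balanced' formula: n_samples / (n_classes * count_c)
manual = len(y) / (len(counts) * np.array(list(counts.values()), dtype=float))
auto = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
for label, w in zip(counts, auto):
    print('Class=%d, weight=%.3f' % (label, w))
```

The rarest classes receive the largest weights; for example, class 4 with only 9 examples gets 214 / (6 * 9), or about 3.96, so its training errors count roughly four times as much as an average example's.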
Try running the example a few times.<\/p>\n<p>In this case, we can see that we achieved a further lift in accuracy from about 80.2 percent with balanced class weighting to 80.8 percent with a more biased class weighting.<\/p>\n<pre class=\"urvanov-syntax-highlighter-plain-tag\">Mean Accuracy: 0.808 (0.059)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Related Tutorials<\/h3>\n<ul>\n<li><a href=\"https:\/\/machinelearningmastery.com\/imbalanced-multiclass-classification-with-the-glass-identification-dataset\/\">Imbalanced Multiclass Classification with the Glass Identification Dataset<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/smote-oversampling-for-imbalanced-classification\/\">SMOTE for Imbalanced Classification with Python<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/cost-sensitive-logistic-regression\/\">Cost-Sensitive Logistic Regression for Imbalanced Classification<\/a><\/li>\n<li><a href=\"https:\/\/machinelearningmastery.com\/cost-sensitive-learning-for-imbalanced-classification\/\">Cost-Sensitive Learning for Imbalanced Classification<\/a><\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.over_sampling.SMOTE.html\">imblearn.over_sampling.SMOTE API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html\">sklearn.ensemble.RandomForestClassifier API<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to use the tools of imbalanced classification with a multi-class dataset.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>About the glass identification standard imbalanced multi-class prediction problem.<\/li>\n<li>How to use SMOTE oversampling for imbalanced multi-class classification.<\/li>\n<li>How to use cost-sensitive learning for imbalanced 
multi-class classification.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/multi-class-imbalanced-classification\/\">Multi-Class Imbalanced Classification<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/multi-class-imbalanced-classification\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Imbalanced classification are those prediction tasks where the distribution of examples across class labels is not equal. Most imbalanced classification examples focus [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/08\/06\/multi-class-imbalanced-classification\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3740,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3739"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3739"}],"version-history":[{"count":0,"href":"
https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3739\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3740"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3739"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3739"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3739"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}