{"id":2978,"date":"2019-12-29T18:00:14","date_gmt":"2019-12-29T18:00:14","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/29\/standard-machine-learning-datasets-for-imbalanced-classification\/"},"modified":"2019-12-29T18:00:14","modified_gmt":"2019-12-29T18:00:14","slug":"standard-machine-learning-datasets-for-imbalanced-classification","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/29\/standard-machine-learning-datasets-for-imbalanced-classification\/","title":{"rendered":"Standard Machine Learning Datasets for Imbalanced Classification"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>An imbalanced classification problem is a problem that involves predicting a class label where the distribution of class labels in the training dataset is skewed.<\/p>\n<p>Many real-world classification problems have an imbalanced class distribution, therefore it is important for machine learning practitioners to get familiar with working with these types of problems.<\/p>\n<p>In this tutorial, you will discover a suite of standard machine learning datasets for imbalanced classification.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Standard machine learning datasets with an imbalance of two classes.<\/li>\n<li>Standard datasets for multiclass classification with a skewed class distribution.<\/li>\n<li>Popular imbalanced classification datasets used for machine learning competitions.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_9318\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9318\" class=\"size-full wp-image-9318\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Standard-Machine-Learning-Datasets-for-Imbalanced-Classification.jpg\" alt=\"Standard Machine Learning Datasets for Imbalanced Classification\" width=\"799\" height=\"533\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Standard-Machine-Learning-Datasets-for-Imbalanced-Classification.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Standard-Machine-Learning-Datasets-for-Imbalanced-Classification-300x200.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Standard-Machine-Learning-Datasets-for-Imbalanced-Classification-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-9318\" class=\"wp-caption-text\">Standard Machine Learning Datasets for Imbalanced Classification<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/graeme\/47214646532\/\">Graeme Churchard<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Binary Classification Datasets<\/li>\n<li>Multiclass Classification Datasets<\/li>\n<li>Competition and Other Datasets<\/li>\n<\/ol>\n<h2>Binary Classification Datasets<\/h2>\n<p>Binary classification predictive modeling problems are those with two classes.<\/p>\n<p>Typically, imbalanced binary classification problems describe a normal state (class 0) and an abnormal state (class 1), such as fraud, a diagnosis, or a fault.<\/p>\n<p>In this section, we will take a closer look at three standard binary classification machine learning datasets with a class imbalance. 
These are datasets that are small enough to fit in memory and have been well studied, providing the basis of investigation in many research papers.<\/p>\n<p>The names of these datasets are as follows:<\/p>\n<ul>\n<li>Pima Indians Diabetes (Pima)<\/li>\n<li>Haberman Breast Cancer (Haberman)<\/li>\n<li>German Credit (German)<\/li>\n<\/ul>\n<p>Each dataset will be loaded and the nature of the class imbalance will be summarized.<\/p>\n<h3>Pima Indians Diabetes (Pima)<\/h3>\n<p>Each record describes the medical details of a female, and the prediction is the onset of diabetes within the next five years.<\/p>\n<ul>\n<li>More Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.names\">pima-indians-diabetes.names<\/a><\/li>\n<li>Dataset: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv\">pima-indians-diabetes.csv<\/a><\/li>\n<\/ul>\n<p>Below provides a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">6,148,72,35,0,33.6,0.627,50,1\r\n1,85,66,29,0,26.6,0.351,31,0\r\n8,183,64,0,0,23.3,0.672,32,1\r\n1,89,66,23,94,28.1,0.167,21,0\r\n0,137,40,35,168,43.1,2.288,33,1\r\n...<\/pre>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Pima Indians Diabetes dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/pima-indians-diabetes.csv'\r\ndataframe = read_csv(url, header=None)\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, :-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class 
Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 768\r\nN Inputs: 8\r\nN Classes: 2\r\nClasses: [0. 1.]\r\nClass Breakdown:\r\n - Class 0.0: 500 (65.10417%)\r\n - Class 1.0: 268 (34.89583%)<\/pre>\n<\/p>\n<h3>Haberman Breast Cancer (Haberman)<\/h3>\n<p>Each record describes the medical details of a patient and the prediction is whether the patient survived after five years or not.<\/p>\n<ul>\n<li>More Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/haberman.names\">haberman.names<\/a><\/li>\n<li>Dataset: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/haberman.csv\">haberman.csv<\/a><\/li>\n<li><a href=\"http:\/\/archive.ics.uci.edu\/ml\/datasets\/haberman's+survival\">Additional Information<\/a><\/li>\n<\/ul>\n<p>Below provides a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">30,64,1,1\r\n30,62,3,1\r\n30,65,0,1\r\n31,59,2,1\r\n31,65,4,1\r\n...<\/pre>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Haberman Breast Cancer dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/haberman.csv'\r\ndataframe = read_csv(url, header=None)\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, :-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class 
Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 306\r\nN Inputs: 3\r\nN Classes: 2\r\nClasses: [1 2]\r\nClass Breakdown:\r\n - Class 1: 225 (73.52941%)\r\n - Class 2: 81 (26.47059%)<\/pre>\n<\/p>\n<h3>German Credit (German)<\/h3>\n<p>Each record describes the financial details of a person and the prediction is whether the person is a good credit risk.<\/p>\n<ul>\n<li>More Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/german.names\">german.names<\/a><\/li>\n<li>Dataset: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/german.csv\">german.csv<\/a><\/li>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/statlog+(german+credit+data)\">Additional Information<\/a><\/li>\n<\/ul>\n<p>Below provides a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1\r\nA12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201,2\r\nA14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201,1\r\nA11,42,A32,A42,7882,A61,A74,2,A93,A103,4,A122,45,A143,A153,1,A173,2,A191,A201,1\r\nA11,24,A33,A40,4870,A61,A73,3,A93,A101,4,A124,53,A143,A153,2,A173,2,A191,A201,2\r\n...<\/pre>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the German Credit dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/german.csv'\r\ndataframe = read_csv(url, header=None)\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, 
:-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 1000\r\nN Inputs: 20\r\nN Classes: 2\r\nClasses: [1 2]\r\nClass Breakdown:\r\n - Class 1: 700 (70.00000%)\r\n - Class 2: 300 (30.00000%)<\/pre>\n<\/p>\n<h2>Multiclass Classification Datasets<\/h2>\n<p>Multiclass classification predictive modeling problems are those with more than two classes.<\/p>\n<p>Typically, imbalanced multiclass classification problems describe multiple different events, some significantly more common than others.<\/p>\n<p>In this section, we will take a closer look at three standard multiclass classification machine learning datasets with a class imbalance. 
These are datasets that are small enough to fit in memory and have been well studied, providing the basis of investigation in many research papers.<\/p>\n<p>The names of these datasets are as follows:<\/p>\n<ul>\n<li>Glass Identification (Glass)<\/li>\n<li>E-coli (Ecoli)<\/li>\n<li>Thyroid Gland (Thyroid)<\/li>\n<\/ul>\n<p><strong>Note<\/strong>: it is common in research papers to transform imbalanced multiclass classification problems into imbalanced binary classification problems by grouping the majority classes into a single class and keeping the smallest minority class as the second class.<\/p>\n<p>Each dataset will be loaded and the nature of the class imbalance will be summarized.<\/p>\n<h3>Glass Identification (Glass)<\/h3>\n<p>Each record describes the chemical content of glass and the prediction is the type of glass.<\/p>\n<ul>\n<li>More Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.names\">glass.names<\/a><\/li>\n<li>Dataset: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv\">glass.csv<\/a><\/li>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/glass+identification\">Additional Information<\/a><\/li>\n<\/ul>\n<p>Below is a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1\r\n1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1\r\n1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1\r\n1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1\r\n1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1\r\n...<\/pre>\n<p>The row identifier in the first column of the original UCI file has already been removed from this copy of the dataset, so the first column shown above is the refractive index.<\/p>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Glass Identification dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\nurl = 
'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv'\r\ndataframe = read_csv(url, header=None)\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, :-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 214\r\nN Inputs: 9\r\nN Classes: 6\r\nClasses: [1. 2. 3. 5. 6. 7.]\r\nClass Breakdown:\r\n - Class 1.0: 70 (32.71028%)\r\n - Class 2.0: 76 (35.51402%)\r\n - Class 3.0: 17 (7.94393%)\r\n - Class 5.0: 13 (6.07477%)\r\n - Class 6.0: 9 (4.20561%)\r\n - Class 7.0: 29 (13.55140%)<\/pre>\n<\/p>\n<h3>E-coli (Ecoli)<\/h3>\n<p>Each record describes the result of different tests and the prediction is the name of the protein localization site.<\/p>\n<ul>\n<li>More Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/ecoli.names\">ecoli.names<\/a><\/li>\n<li>Dataset: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/ecoli.csv\">ecoli.csv<\/a><\/li>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/ecoli\">Additional Information<\/a><\/li>\n<\/ul>\n<p>Below is a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp\r\n0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp\r\n0.56,0.40,0.48,0.50,0.49,0.37,0.46,cp\r\n0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp\r\n0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp\r\n...<\/pre>\n<p>The sequence name in the first column of the original UCI file has already been removed from this copy of the dataset, 
leaving seven numeric inputs and the class label.<\/p>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Ecoli dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/ecoli.csv'\r\ndataframe = read_csv(url, header=None)\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, :-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 336\r\nN Inputs: 7\r\nN Classes: 8\r\nClasses: ['cp' 'im' 'imL' 'imS' 'imU' 'om' 'omL' 'pp']\r\nClass Breakdown:\r\n - Class cp: 143 (42.55952%)\r\n - Class im: 77 (22.91667%)\r\n - Class imL: 2 (0.59524%)\r\n - Class imS: 2 (0.59524%)\r\n - Class imU: 35 (10.41667%)\r\n - Class om: 20 (5.95238%)\r\n - Class omL: 5 (1.48810%)\r\n - Class pp: 52 (15.47619%)<\/pre>\n<\/p>\n<h3>Thyroid Gland (Thyroid)<\/h3>\n<p>Each record describes the result of different tests on a thyroid and the prediction is the medical diagnosis of the thyroid.<\/p>\n<ul>\n<li>More Details: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/new-thyroid.names\">new-thyroid.names<\/a><\/li>\n<li>Dataset: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/new-thyroid.csv\">new-thyroid.csv<\/a><\/li>\n<li><a 
href=\"http:\/\/archive.ics.uci.edu\/ml\/datasets\/thyroid+disease\">Additional Information<\/a><\/li>\n<\/ul>\n<p>Below provides a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">107,10.1,2.2,0.9,2.7,1\r\n113,9.9,3.1,2.0,5.9,1\r\n127,12.9,2.4,1.4,0.6,1\r\n109,5.3,1.6,1.4,1.5,1\r\n105,7.3,1.5,1.5,-0.1,1\r\n...<\/pre>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Thyroid Gland dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/new-thyroid.csv'\r\ndataframe = read_csv(url, header=None)\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, :-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 215\r\nN Inputs: 5\r\nN Classes: 3\r\nClasses: [1. 2. 
3.]\r\nClass Breakdown:\r\n - Class 1.0: 150 (69.76744%)\r\n - Class 2.0: 35 (16.27907%)\r\n - Class 3.0: 30 (13.95349%)<\/pre>\n<\/p>\n<h2>Competition and Other Datasets<\/h2>\n<p>This section lists additional datasets that appear less often in research papers, are larger, or were used as the basis of machine learning competitions.<\/p>\n<p>The names of these datasets are as follows:<\/p>\n<ul>\n<li>Credit Card Fraud (Credit)<\/li>\n<li>Porto Seguro Auto Insurance Claim (Porto Seguro)<\/li>\n<\/ul>\n<p>Each dataset will be loaded and the nature of the class imbalance will be summarized.<\/p>\n<h3>Credit Card Fraud (Credit)<\/h3>\n<p>Each record describes a credit card transaction, and the prediction is whether the transaction is fraudulent.<\/p>\n<p>This dataset is about 144 megabytes uncompressed or 66 megabytes compressed.<\/p>\n<ul>\n<li>Download: <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/creditcardfraud.zip\">creditcardfraud.zip<\/a><\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/mlg-ulb\/creditcardfraud\">Additional Information<\/a><\/li>\n<\/ul>\n<p>Download the dataset and unzip it into your current working directory.<\/p>\n<p>Below is a sample of the first few lines of the file, including the header row.<\/p>\n<pre 
class=\"crayon-plain-tag\">\"Time\",\"V1\",\"V2\",\"V3\",\"V4\",\"V5\",\"V6\",\"V7\",\"V8\",\"V9\",\"V10\",\"V11\",\"V12\",\"V13\",\"V14\",\"V15\",\"V16\",\"V17\",\"V18\",\"V19\",\"V20\",\"V21\",\"V22\",\"V23\",\"V24\",\"V25\",\"V26\",\"V27\",\"V28\",\"Amount\",\"Class\"\r\n0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,\"0\"\r\n0,1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,1.61272666105479,1.06523531137287,0.48909501589608,-0.143772296441519,0.635558093258208,0.463917041022171,-0.114804663102346,-0.183361270123994,-0.145783041325259,-0.0690831352230203,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.00898309914322813,0.0147241691924927,2.69,\"0\"\r\n1,-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,0.624501459424895,0.066083685268831,0.717292731410831,-0.165945922763554,2.34586494901581,-2.89008319444231,1.10996937869599,-0.121359313195888,-2.26185709530414,0.524979725224404,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,378.66,\"0\"\r\n1,-0.966271711572087,-0.185226008082898,1.79299333957872,-0.863291275036453,-0.0103088796030823,1.24720316752486,0.237608939771
78,0.377435874652262,-1.38702406270197,-0.0549519224713749,-0.226487263835401,0.178228225877303,0.507756869957169,-0.28792374549456,-0.631418117709045,-1.0596472454325,-0.684092786345479,1.96577500349538,-1.2326219700892,-0.208037781160366,-0.108300452035545,0.00527359678253453,-0.190320518742841,-1.17557533186321,0.647376034602038,-0.221928844458407,0.0627228487293033,0.0614576285006353,123.5,\"0\"\r\n...<\/pre>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Credit Card Fraud dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\ndataframe = read_csv('creditcard.csv')\r\n# get the values\r\nvalues = dataframe.values\r\nX, y = values[:, :-1], values[:, -1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 284807\r\nN Inputs: 30\r\nN Classes: 2\r\nClasses: [0. 
1.]\r\nClass Breakdown:\r\n - Class 0.0: 284315 (99.82725%)\r\n - Class 1.0: 492 (0.17275%)<\/pre>\n<\/p>\n<h3>Porto Seguro Auto Insurance Claim (Porto Seguro)<\/h3>\n<p>Each record describes people\u2019s car insurance details and prediction involves whether or not the person will make an insurance claim.<\/p>\n<p>This data is about 42 megabytes compressed.<\/p>\n<ul>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/porto-seguro-safe-driver-prediction\/data\">Download and Additional Information<\/a><\/li>\n<\/ul>\n<p>Download the dataset and unzip it into your current working directory.<\/p>\n<p>Below provides a sample of the first five rows of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\">id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,ps_ind_11_bin,ps_ind_12_bin,ps_ind_13_bin,ps_ind_14,ps_ind_15,ps_ind_16_bin,ps_ind_17_bin,ps_ind_18_bin,ps_reg_01,ps_reg_02,ps_reg_03,ps_car_01_cat,ps_car_02_cat,ps_car_03_cat,ps_car_04_cat,ps_car_05_cat,ps_car_06_cat,ps_car_07_cat,ps_car_08_cat,ps_car_09_cat,ps_car_10_cat,ps_car_11_cat,ps_car_11,ps_car_12,ps_car_13,ps_car_14,ps_car_15,ps_calc_01,ps_calc_02,ps_calc_03,ps_calc_04,ps_calc_05,ps_calc_06,ps_calc_07,ps_calc_08,ps_calc_09,ps_calc_10,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin\r\n7,0,2,2,5,1,0,0,1,0,0,0,0,0,0,0,11,0,1,0,0.7,0.2,0.7180703307999999,10,1,-1,0,1,4,1,0,0,1,12,2,0.4,0.8836789178,0.3708099244,3.6055512755000003,0.6,0.5,0.2,3,1,10,1,10,1,5,9,1,5,8,0,1,1,0,0,1\r\n9,0,1,1,7,0,0,0,0,1,0,0,0,0,0,0,3,0,0,1,0.8,0.4,0.7660776723,11,1,-1,0,-1,11,1,1,2,1,19,3,0.316227766,0.6188165191,0.3887158345,2.4494897428,0.3,0.1,0.3,2,1,9,5,8,1,7,3,1,1,9,0,1,1,0,1,0\r\n13,0,5,4,9,1,0,0,0,1,0,0,0,0,0,0,12,1,0,0,0.0,0.0,-1.0,7,1,-1,0,-1,14,1,1,2,1,60,1,0.316227766,0.6415857163,0.34727510710000004,3.3166247904,0.5,0.7,0.1,2,2,9,1,8,2,7,4,2,7,7,0,1,1,0,1,0\r
\n16,0,0,1,2,0,0,1,0,0,0,0,0,0,0,0,8,1,0,0,0.9,0.2,0.5809475019,7,1,0,0,1,11,1,1,3,1,104,1,0.3741657387,0.5429487899000001,0.2949576241,2.0,0.6,0.9,0.1,2,4,7,1,8,4,2,2,2,4,9,0,0,0,0,0,0\r\n...<\/pre>\n<p>The example below loads and summarizes the class breakdown of the dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># Summarize the Porto Seguro\u2019s Safe Driver Prediction dataset\r\nfrom numpy import unique\r\nfrom pandas import read_csv\r\n# load the dataset\r\ndataframe = read_csv('train.csv')\r\n# get the values\r\nvalues = dataframe.values\r\n# the first column is the row id and the second column is the target\r\nX, y = values[:, 2:], values[:, 1]\r\n# gather details\r\nn_rows = X.shape[0]\r\nn_cols = X.shape[1]\r\nclasses = unique(y)\r\nn_classes = len(classes)\r\n# summarize\r\nprint('N Examples: %d' % n_rows)\r\nprint('N Inputs: %d' % n_cols)\r\nprint('N Classes: %d' % n_classes)\r\nprint('Classes: %s' % classes)\r\nprint('Class Breakdown:')\r\n# class breakdown\r\nbreakdown = ''\r\nfor c in classes:\r\n\ttotal = len(y[y == c])\r\n\tratio = (total \/ float(len(y))) * 100\r\n\tprint(' - Class %s: %d (%.5f%%)' % (str(c), total, ratio))<\/pre>\n<p>Running the example provides the following output.<\/p>\n<pre class=\"crayon-plain-tag\">N Examples: 595212\r\nN Inputs: 57\r\nN Classes: 2\r\nClasses: [0. 
1.]\r\nClass Breakdown:\r\n - Class 0.0: 573518 (96.35525%)\r\n - Class 1.0: 21694 (3.64475%)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/dl.acm.org\/citation.cfm?id=1007735\">A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data<\/a>, 2004.<\/li>\n<li><a href=\"https:\/\/ieeexplore.ieee.org\/document\/5978225\">A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches<\/a>, 2011.<\/li>\n<\/ul>\n<h3>Articles<\/h3>\n<ul>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/datasets\/\">imbalanced-learn, Dataset loading utilities<\/a>.<\/li>\n<li><a href=\"https:\/\/sci2s.ugr.es\/keel\/imbalanced.php\">KEEL-dataset Repository: Imbalanced data sets<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered a suite of standard machine learning datasets for imbalanced classification.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Standard machine learning datasets with an imbalance of two classes.<\/li>\n<li>Standard datasets for multiclass classification with a skewed class distribution.<\/li>\n<li>Popular imbalanced classification datasets used for machine learning competitions.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/standard-machine-learning-datasets-for-imbalanced-classification\/\">Standard Machine Learning Datasets for Imbalanced Classification<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/standard-machine-learning-datasets-for-imbalanced-classification\/\">Go to 
Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee An imbalanced classification problem is a problem that involves predicting a class label where the distribution of class labels in the training [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/12\/29\/standard-machine-learning-datasets-for-imbalanced-classification\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":2979,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2978"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2978"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2978\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/2979"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2978"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2978"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2978"}]
,"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}