{"id":3232,"date":"2020-03-12T18:00:11","date_gmt":"2020-03-12T18:00:11","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/12\/imbalanced-multiclass-classification-with-the-glass-identification-dataset\/"},"modified":"2020-03-12T18:00:11","modified_gmt":"2020-03-12T18:00:11","slug":"imbalanced-multiclass-classification-with-the-glass-identification-dataset","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/12\/imbalanced-multiclass-classification-with-the-glass-identification-dataset\/","title":{"rendered":"Imbalanced Multiclass Classification with the Glass Identification Dataset"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Multiclass classification problems are those where a label must be predicted, but there are more than two labels that may be predicted.<\/p>\n<p>These are challenging predictive modeling problems because a sufficiently representative number of examples of each class is required for a model to learn the problem. It is made challenging when the number of examples in each class is imbalanced, or skewed toward one or a few of the classes with very few examples of other classes.<\/p>\n<p>Problems of this type are referred to as imbalanced multiclass classification problems and they require both the careful design of an evaluation metric and test harness and choice of machine learning models. 
The glass identification dataset is a standard dataset for exploring the challenge of imbalanced multiclass classification.<\/p>\n<p>In this tutorial, you will discover how to develop and evaluate a model for the imbalanced multiclass glass identification dataset.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to systematically evaluate a suite of machine learning models with a robust test harness.<\/li>\n<li>How to fit a final model and use it to predict the class labels for specific examples.<\/li>\n<\/ul>\n<p>Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more <a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-python\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_9763\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9763\" class=\"size-full wp-image-9763\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/03\/Evaluate-Models-for-the-Imbalanced-Multiclass-Glass-Identification-Dataset.jpg\" alt=\"Evaluate Models for the Imbalanced Multiclass Glass Identification Dataset\" width=\"800\" height=\"531\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Evaluate-Models-for-the-Imbalanced-Multiclass-Glass-Identification-Dataset.jpg 800w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Evaluate-Models-for-the-Imbalanced-Multiclass-Glass-Identification-Dataset-300x199.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Evaluate-Models-for-the-Imbalanced-Multiclass-Glass-Identification-Dataset-768x510.jpg 768w\" 
sizes=\"(max-width: 800px) 100vw, 800px\"><\/p>\n<p id=\"caption-attachment-9763\" class=\"wp-caption-text\">Evaluate Models for the Imbalanced Multiclass Glass Identification Dataset<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/pocheco\/14906013416\/\">Sarah Nichols<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Glass Identification Dataset<\/li>\n<li>Explore the Dataset<\/li>\n<li>Model Test and Baseline Result<\/li>\n<li>Evaluate Models<\/li>\n<li>Make Predictions on New Data<\/li>\n<\/ol>\n<h2>Glass Identification Dataset<\/h2>\n<p>In this project, we will use a standard imbalanced machine learning dataset referred to as the &ldquo;<em>Glass Identification<\/em>&rdquo; dataset, or simply &ldquo;<em>glass<\/em>.&rdquo;<\/p>\n<p>The dataset describes the chemical properties of glass and involves classifying samples of glass using their chemical properties as one of six classes. The dataset was credited to <a href=\"https:\/\/www.lexvisio.com\/expert-witness\/vina-r-spiehler-phd-dabft-spiehler-associates\">Vina Spiehler<\/a> in 1987.<\/p>\n<p>Ignoring the sample identification number, there are nine input variables that summarize the properties of the glass dataset; they are:<\/p>\n<ul>\n<li><strong>RI<\/strong>: refractive index<\/li>\n<li><strong>Na<\/strong>: Sodium<\/li>\n<li><strong>Mg<\/strong>: Magnesium<\/li>\n<li><strong>Al<\/strong>: Aluminum<\/li>\n<li><strong>Si<\/strong>: Silicon<\/li>\n<li><strong>K<\/strong>: Potassium<\/li>\n<li><strong>Ca<\/strong>: Calcium<\/li>\n<li><strong>Ba<\/strong>: Barium<\/li>\n<li><strong>Fe<\/strong>: Iron<\/li>\n<\/ul>\n<p>The chemical compositions are measured as the weight percent in corresponding oxide.<\/p>\n<p>There are seven types of glass listed; they are:<\/p>\n<ul>\n<li><strong>Class 1<\/strong>: building windows (float processed)<\/li>\n<li><strong>Class 2<\/strong>: building windows (non-float 
processed)<\/li>\n<li><strong>Class 3<\/strong>: vehicle windows (float processed)<\/li>\n<li><strong>Class 4<\/strong>: vehicle windows (non-float processed)<\/li>\n<li><strong>Class 5<\/strong>: containers<\/li>\n<li><strong>Class 6<\/strong>: tableware<\/li>\n<li><strong>Class 7<\/strong>: headlamps<\/li>\n<\/ul>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Float_glass\">Float glass<\/a> refers to the process used to make the glass.<\/p>\n<p>There are 214 observations in the dataset and the number of observations in each class is imbalanced. Note that there are no examples for class 4 (non-float processed vehicle windows) in the dataset.<\/p>\n<ul>\n<li><strong>Class 1<\/strong>: 70 examples<\/li>\n<li><strong>Class 2<\/strong>: 76 examples<\/li>\n<li><strong>Class 3<\/strong>: 17 examples<\/li>\n<li><strong>Class 4<\/strong>: 0 examples<\/li>\n<li><strong>Class 5<\/strong>: 13 examples<\/li>\n<li><strong>Class 6<\/strong>: 9 examples<\/li>\n<li><strong>Class 7<\/strong>: 29 examples<\/li>\n<\/ul>\n<p>Although there are minority classes, all classes are equally important in this prediction problem.<\/p>\n<p>The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7). There are 163 examples of window glass and 51 examples of non-window glass.<\/p>\n<ul>\n<li><strong>Window Glass<\/strong>: 163 examples<\/li>\n<li><strong>Non-Window Glass<\/strong>: 51 examples<\/li>\n<\/ul>\n<p>Another division of the observations would be between float processed glass and non-float processed glass, in the case of window glass only. 
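<\/p>\n<p>These two groupings follow directly from the class counts listed earlier; a quick arithmetic check (plain Python, no dataset file needed):<\/p>\n<pre class=\"crayon-plain-tag\"># confirm the window and float groupings from the class counts above\r\ncounts = {1: 70, 2: 76, 3: 17, 4: 0, 5: 13, 6: 9, 7: 29}\r\nwindow = sum(counts[c] for c in (1, 2, 3, 4))\r\nnon_window = sum(counts[c] for c in (5, 6, 7))\r\nprint('Window=%d, Non-window=%d' % (window, non_window))\r\nfloat_glass = sum(counts[c] for c in (1, 3))\r\nnon_float = sum(counts[c] for c in (2, 4))\r\nprint('Float=%d, Non-float=%d' % (float_glass, non_float))<\/pre>\n<p>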
This division is more balanced.<\/p>\n<ul>\n<li><strong>Float Glass<\/strong>: 87 examples<\/li>\n<li><strong>Non-Float Glass<\/strong>: 76 examples<\/li>\n<\/ul>\n<p>Next, let&rsquo;s take a closer look at the data.<\/p>\n<h2>Explore the Dataset<\/h2>\n<p>First, download the dataset and save it in your current working directory with the name &ldquo;<em>glass.csv<\/em>&rdquo;.<\/p>\n<p>Note that this version of the dataset has the first column (the row number) removed, as it does not contain generalizable information for modeling.<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv\">Download Glass Identification Dataset (glass.csv)<\/a><\/li>\n<\/ul>\n<p>Review the contents of the file.<\/p>\n<p>The first 
few lines of the file should look as follows:<\/p>\n<pre class=\"crayon-plain-tag\">1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1\r\n1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1\r\n1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1\r\n1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1\r\n1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1\r\n...<\/pre>\n<p>We can see that the input variables are numeric and that the class label is an integer in the final column.<\/p>\n<p>All of the chemical input variables share the same units (weight percent in the corresponding oxide), whereas the first variable, the refractive index, is measured on a different scale. As such, data scaling may be required for some modeling algorithms.<\/p>\n<p>The dataset can be loaded as a DataFrame using the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">read_csv() Pandas function<\/a>, specifying the location of the dataset and the fact that there is no header line.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the dataset location\r\nfilename = 'glass.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)<\/pre>\n<p>Once loaded, we can summarize the number of rows and columns by printing the shape of the <em>DataFrame<\/em>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)<\/pre>\n<p>We can also summarize the number of examples in each class using the <a href=\"https:\/\/docs.python.org\/3\/library\/collections.html\">Counter<\/a> object.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Tying this together, the complete example of loading and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the 
dataset\r\nfrom pandas import read_csv\r\nfrom collections import Counter\r\n# define the dataset location\r\nfilename = 'glass.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Running the example first loads the dataset and confirms the number of rows and columns, which are 214 rows and 9 input variables and 1 target variable.<\/p>\n<p>The class distribution is then summarized, confirming the severe skew in the observations for each class.<\/p>\n<pre class=\"crayon-plain-tag\">(214, 10)\r\nClass=1, Count=70, Percentage=32.710%\r\nClass=2, Count=76, Percentage=35.514%\r\nClass=3, Count=17, Percentage=7.944%\r\nClass=5, Count=13, Percentage=6.075%\r\nClass=6, Count=9, Percentage=4.206%\r\nClass=7, Count=29, Percentage=13.551%<\/pre>\n<p>We can also take a look at the distribution of the input variables by creating a histogram for each.<\/p>\n<p>The complete example of creating histograms of all variables is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create histograms of all variables\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'glass.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None)\r\n# create a histogram plot of each variable\r\ndf.hist()\r\n# show the plot\r\npyplot.show()<\/pre>\n<p>We can see that some of the variables have a <a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Gaussian-like distribution<\/a> and others appear to have an exponential or even a bimodal distribution.<\/p>\n<p>Depending on the choice of algorithm, the data may benefit 
from standardization of some variables and perhaps a power transform.<\/p>\n<div id=\"attachment_9760\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9760\" class=\"size-full wp-image-9760\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Variables-in-the-Glass-Identification-Dataset.png\" alt=\"Histogram of Variables in the Glass Identification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Variables-in-the-Glass-Identification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Variables-in-the-Glass-Identification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Variables-in-the-Glass-Identification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-of-Variables-in-the-Glass-Identification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9760\" class=\"wp-caption-text\">Histogram of Variables in the Glass Identification Dataset<\/p>\n<\/div>\n<p>Now that we have reviewed the dataset, let&rsquo;s look at developing a test harness for evaluating candidate models.<\/p>\n<h2>Model Test and Baseline Result<\/h2>\n<p>We will evaluate candidate models using repeated stratified k-fold cross-validation.<\/p>\n<p>The <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation<\/a> procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. 
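<\/p>\n<p>With 5 folds and three repeats (the configuration used below), the mechanics of the procedure can be sketched before touching the real data; the snippet below uses placeholder features and a label vector with the same class counts as the glass dataset (an illustration only, not part of the tutorial pipeline):<\/p>\n<pre class=\"crayon-plain-tag\"># sketch: repeated stratified k-fold yields n_splits * n_repeats evaluations\r\nfrom numpy import array, zeros\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\n# label counts match the glass dataset: 70, 76, 17, 13, 9, 29\r\ny = array([0]*70 + [1]*76 + [2]*17 + [3]*13 + [4]*9 + [5]*29)\r\nX = zeros((len(y), 9))\r\ncv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)\r\nsplits = list(cv.split(X, y))\r\nprint('Evaluations: %d' % len(splits))<\/pre>\n<p>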
We will use k=5, meaning each fold will contain about 214\/5, or about 43 examples.<\/p>\n<p>Stratified means that each fold will aim to contain the same mixture of examples by class as the entire training dataset. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.<\/p>\n<p>This means a single model will be fit and evaluated 5 * 3, or 15, times and the mean and standard deviation of these runs will be reported.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">RepeatedStratifiedKFold<\/a> scikit-learn class.<\/p>\n<p>All classes are equally important. There are minority classes that are represented by only about 4 or 6 percent of the data, yet no class accounts for more than about 35 percent of the dataset.<\/p>\n<p>As such, in this case, we will use classification accuracy to evaluate models.<\/p>\n<p>First, we can define a function to load the dataset, split the columns into input and output variables, and use a label encoder to ensure class labels are numbered sequentially from 0 to 5.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have classes 0 to 5\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y<\/pre>\n<p>We can define a function to evaluate a candidate model using stratified repeated 5-fold cross-validation and then return a list of scores calculated on the model for each fold and repeat. 
The <em>evaluate_model()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores<\/pre>\n<p>We can then call the <em>load_dataset()<\/em> function to load and confirm the glass identification dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the location of the dataset\r\nfull_path = 'glass.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# summarize the loaded dataset\r\nprint(X.shape, y.shape, Counter(y))<\/pre>\n<p>In this case, we will evaluate the baseline strategy of predicting the majority class in all cases.<\/p>\n<p>This can be implemented automatically using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">DummyClassifier<\/a> class and setting the &ldquo;<em>strategy<\/em>&rdquo; to &ldquo;<em>most_frequent<\/em>&rdquo; that will predict the most common class (e.g. 
class 2) in the training dataset.<\/p>\n<p>As such, we would expect this model to achieve a classification accuracy of about 35 percent, given that this is the proportion of the most common class in the training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='most_frequent')<\/pre>\n<p>We can then evaluate the model by calling our <em>evaluate_model()<\/em> function and report the mean and standard deviation of the results.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this all together, the complete example of evaluating the baseline model on the glass identification dataset using classification accuracy is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># baseline model and test harness for the glass identification dataset\r\nfrom collections import Counter\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.dummy import DummyClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have classes 0 to 5\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, 
n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'glass.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# summarize the loaded dataset\r\nprint(X.shape, y.shape, Counter(y))\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='most_frequent')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example first loads the dataset and correctly reports the number of cases as 214, along with the distribution of class labels as we expect.<\/p>\n<p>The <em>DummyClassifier<\/em> with the chosen strategy is then evaluated using repeated stratified k-fold cross-validation, and the mean and standard deviation of the classification accuracy are reported as about 35.5 percent.<\/p>\n<p>This score provides a baseline on this dataset by which all other classification algorithms can be compared. Achieving a score above about 35.5 percent indicates that a model has skill on this dataset, and a score at or below this value indicates that the model does not have skill on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">(214, 9) (214,) Counter({1: 76, 0: 70, 5: 29, 2: 17, 3: 13, 4: 9})\r\nMean Accuracy: 0.355 (0.011)<\/pre>\n<p>Now that we have a test harness and a baseline performance, we can begin to evaluate some models on this dataset.<\/p>\n<h2>Evaluate Models<\/h2>\n<p>In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.<\/p>\n<p>The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).<\/p>\n<p><strong>Can you do better?<\/strong> If you can achieve better classification accuracy using the same test harness, I&rsquo;d love to hear about it. 
Let me know in the comments below.<\/p>\n<h3>Evaluate Machine Learning Algorithms<\/h3>\n<p>Let&rsquo;s evaluate a mixture of machine learning models on the dataset.<\/p>\n<p>It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention and what doesn&rsquo;t.<\/p>\n<p>We will evaluate the following machine learning models on the glass dataset:<\/p>\n<ul>\n<li>Support Vector Machine (SVM)<\/li>\n<li>k-Nearest Neighbors (KNN)<\/li>\n<li>Bagged Decision Trees (BAG)<\/li>\n<li>Random Forest (RF)<\/li>\n<li>Extra Trees (ET)<\/li>\n<\/ul>\n<p>We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.<\/p>\n<p>We will define each model in turn and add them to a list so that we can evaluate them sequentially. The <em>get_models()<\/em> function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.<\/p>\n<pre class=\"crayon-plain-tag\"># define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='auto'))\r\n\tnames.append('SVM')\r\n\t# KNN\r\n\tmodels.append(KNeighborsClassifier())\r\n\tnames.append('KNN')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=1000))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000))\r\n\tnames.append('RF')\r\n\t# ET\r\n\tmodels.append(ExtraTreesClassifier(n_estimators=1000))\r\n\tnames.append('ET')\r\n\treturn models, names<\/pre>\n<p>We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# evaluate the model and store 
results\r\n\tscores = evaluate_model(X, y, models[i])\r\n\tresults.append(scores)\r\n\t# summarize performance\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))<\/pre>\n<p>At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the glass identification dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># spot check machine learning algorithms on the glass identification dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.ensemble import ExtraTreesClassifier\r\nfrom sklearn.ensemble import BaggingClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have classes 0 to 5\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# 
define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='auto'))\r\n\tnames.append('SVM')\r\n\t# KNN\r\n\tmodels.append(KNeighborsClassifier())\r\n\tnames.append('KNN')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=1000))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000))\r\n\tnames.append('RF')\r\n\t# ET\r\n\tmodels.append(ExtraTreesClassifier(n_estimators=1000))\r\n\tnames.append('ET')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'glass.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, models[i])\r\n\tresults.append(scores)\r\n\t# summarize performance\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, we can see that all of the tested algorithms have skill, achieving an accuracy above the default of 35.5 percent.<\/p>\n<p>The results suggest that ensembles of decision trees perform well on this dataset, with perhaps random forest performing the best overall achieving a classification accuracy of approximately 79.6 percent.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;SVM 0.669 (0.057)\r\n&gt;KNN 0.647 (0.055)\r\n&gt;BAG 0.767 (0.070)\r\n&gt;RF 0.796 (0.062)\r\n&gt;ET 0.776 (0.057)<\/pre>\n<p>A figure is created showing one box and whisker plot for each algorithm&rsquo;s sample of 
results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.<\/p>\n<p>We can see that the distributions of scores for the ensembles of decision trees clustered together separate from the other algorithms tested. In most cases, the mean and median are close on the plot, suggesting a somewhat symmetrical distribution of scores that may indicate the models are stable.<\/p>\n<div id=\"attachment_9761\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9761\" class=\"size-full wp-image-9761\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Glass-Identification-Dataset.png\" alt=\"Box and Whisker Plot of Machine Learning Models on the Imbalanced Glass Identification Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Glass-Identification-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Glass-Identification-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Glass-Identification-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Glass-Identification-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9761\" class=\"wp-caption-text\">Box and Whisker Plot of Machine 
Learning Models on the Imbalanced Glass Identification Dataset<\/p>\n<\/div>\n<p>Now that we have seen how to evaluate models on this dataset, let&rsquo;s look at how we can use a final model to make predictions.<\/p>\n<h2>Make Predictions on New Data<\/h2>\n<p>In this section, we can fit a final model and use it to make predictions on single rows of data.<\/p>\n<p>We will use the Random Forest model as our final model, which achieved a classification accuracy of about 79 percent.<\/p>\n<p>First, we can define the model.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model to evaluate\r\nmodel = RandomForestClassifier(n_estimators=1000)<\/pre>\n<p>Once defined, we can fit it on the entire training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit the model\r\nmodel.fit(X, y)<\/pre>\n<p>Once fit, we can use it to make predictions for new data by calling the <em>predict()<\/em> function.<\/p>\n<p>This will return the class label for each example.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define a row of data\r\nrow = [...]\r\n# predict the class label\r\nyhat = model.predict([row])<\/pre>\n<p>To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and make predictions on the glass identification dataset\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.ensemble import RandomForestClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have classes 0 to 5\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# define the 
location of the dataset\r\nfull_path = 'glass.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = RandomForestClassifier(n_estimators=1000)\r\n# fit the model\r\nmodel.fit(X, y)\r\n# known class 0 (class=1 in the dataset)\r\nrow = [1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00]\r\nprint('&gt;Predicted=%d (expected 0)' % (model.predict([row])))\r\n# known class 1 (class=2 in the dataset)\r\nrow = [1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12]\r\nprint('&gt;Predicted=%d (expected 1)' % (model.predict([row])))\r\n# known class 2 (class=3 in the dataset)\r\nrow = [1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00]\r\nprint('&gt;Predicted=%d (expected 2)' % (model.predict([row])))\r\n# known class 3 (class=5 in the dataset)\r\nrow = [1.51915,12.73,1.85,1.86,72.69,0.60,10.09,0.00,0.00]\r\nprint('&gt;Predicted=%d (expected 3)' % (model.predict([row])))\r\n# known class 4 (class=6 in the dataset)\r\nrow = [1.51115,17.38,0.00,0.34,75.41,0.00,6.65,0.00,0.00]\r\nprint('&gt;Predicted=%d (expected 4)' % (model.predict([row])))\r\n# known class 5 (class=7 in the dataset)\r\nrow = [1.51556,13.87,0.00,2.54,73.23,0.14,9.41,0.81,0.01]\r\nprint('&gt;Predicted=%d (expected 5)' % (model.predict([row])))<\/pre>\n<p>Running the example first fits the model on the entire training dataset.<\/p>\n<p>Then the fit model is used to predict the label for one example taken from each of the six classes.<\/p>\n<p>We can see that the correct class label is predicted for each of the chosen examples. 
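<\/p>\n<p>Note that these predictions are the encoded labels (0 to 5), not the original class numbers from the file (1, 2, 3, 5, 6, 7). If the original numbering is needed, the fitted <em>LabelEncoder<\/em> can be inverted; a minimal sketch, refitting an encoder on the known class numbers for illustration:<\/p>\n<pre class=\"crayon-plain-tag\"># sketch: map an encoded prediction back to the original class number\r\nfrom numpy import array\r\nfrom sklearn.preprocessing import LabelEncoder\r\n# the class numbers present in the dataset file\r\nencoder = LabelEncoder().fit(array([1, 2, 3, 5, 6, 7]))\r\nyhat = 3 # an encoded label, as returned by model.predict()\r\nprint('Original class: %d' % encoder.inverse_transform([yhat])[0])<\/pre>\n<p>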
Nevertheless, given an estimated accuracy of about 79 percent, we expect roughly 1 in 5 predictions to be wrong on average, and these errors may not be distributed equally across the classes.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=2 (expected 2)\r\n&gt;Predicted=3 (expected 3)\r\n&gt;Predicted=4 (expected 4)\r\n&gt;Predicted=5 (expected 5)<\/pre>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">pandas.read_csv API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">sklearn.dummy.DummyClassifier API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html\">sklearn.ensemble.RandomForestClassifier API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/glass+identification\">Glass Identification Dataset, UCI Machine Learning Repository<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.csv\">Glass Identification Dataset<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/glass.names\">Glass Identification Dataset Description<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop and evaluate a model for the imbalanced multiclass glass identification dataset.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to systematically evaluate a suite of machine learning models with a robust test harness.<\/li>\n<li>How to fit a final model and use it to predict the class labels for specific examples.<\/li>\n<\/ul>\n<p>Do you 
have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/imbalanced-multiclass-classification-with-the-glass-identification-dataset\/\">Imbalanced Multiclass Classification with the Glass Identification Dataset<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Multiclass classification problems are those where a label must be predicted, but there are more than two labels that may be predicted. [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/12\/imbalanced-multiclass-classification-with-the-glass-identification-dataset\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3233,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3232"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3232"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3232\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3233"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}