{"id":3200,"date":"2020-03-03T18:00:39","date_gmt":"2020-03-03T18:00:39","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/03\/predictive-model-for-the-phoneme-imbalanced-classification-dataset\/"},"modified":"2020-03-03T18:00:39","modified_gmt":"2020-03-03T18:00:39","slug":"predictive-model-for-the-phoneme-imbalanced-classification-dataset","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/03\/predictive-model-for-the-phoneme-imbalanced-classification-dataset\/","title":{"rendered":"Predictive Model for the Phoneme Imbalanced Classification Dataset"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Many binary classification tasks do not have an equal number of examples from each class, e.g. the class distribution is skewed or imbalanced.<\/p>\n<p>Nevertheless, accuracy is equally important in both classes.<\/p>\n<p>An example is the classification of vowel sounds from European languages as either nasal or oral on speech recognition where there are many more examples of nasal than oral vowels. Classification accuracy is important for both classes, although accuracy as a metric cannot be used directly. 
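To make this concrete, here is a minimal back-of-the-envelope sketch in plain Python (using the class counts reported later in this tutorial: 3,818 nasal and 1,586 oral examples) of why accuracy cannot be used directly: a model that only ever predicts the majority class scores a respectable accuracy while being useless on the minority class.

```python
# Class counts for the Phoneme dataset, as reported later in this tutorial.
n_nasal = 3818  # majority class (class 0)
n_oral = 1586   # minority class (class 1)
n_total = n_nasal + n_oral

# A naive model that always predicts "nasal" is correct on every nasal
# example and wrong on every oral example, yet overall accuracy looks good.
accuracy = n_nasal / n_total
print('Accuracy of an all-nasal model: %.3f' % accuracy)  # about 0.707

# The same model never detects a single oral vowel, so its accuracy on the
# minority class is zero, a failure that overall accuracy completely hides.
print('Accuracy on the oral class: %.3f' % (0 / n_oral))
```

This is why a metric that accounts for performance on both classes, such as the G-mean used later in this tutorial, is preferred over raw accuracy.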
Additionally, data sampling techniques may be required to transform the training dataset to make it more balanced when fitting machine learning algorithms.<\/p>\n<p>In this tutorial, you will discover how to develop and evaluate models for imbalanced binary classification of nasal and oral phonemes.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to evaluate a suite of machine learning models and improve their performance with data oversampling techniques.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more <a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-with-python\/\">in my new book<\/a>, with 30 step-by-step tutorials and full Python source code.<\/p>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_9703\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9703\" class=\"size-full wp-image-9703\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/03\/Predictive-Model-for-the-Phoneme-Imbalanced-Classification-Dataset.jpg\" alt=\"Predictive Model for the Phoneme Imbalanced Classification Dataset\" width=\"799\" height=\"533\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Predictive-Model-for-the-Phoneme-Imbalanced-Classification-Dataset.jpg 799w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Predictive-Model-for-the-Phoneme-Imbalanced-Classification-Dataset-300x200.jpg 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Predictive-Model-for-the-Phoneme-Imbalanced-Classification-Dataset-768x512.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-9703\" class=\"wp-caption-text\">Predictive Model for the Phoneme Imbalanced Classification Dataset<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/blachswan\/33812905292\/\">Ed Dunens<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Phoneme Dataset<\/li>\n<li>Explore the Dataset<\/li>\n<li>Model Test and Baseline Result<\/li>\n<li>Evaluate Models\n<ol>\n<li>Evaluate Machine Learning Algorithms<\/li>\n<li>Evaluate Data Oversampling Algorithms<\/li>\n<\/ol>\n<\/li>\n<li>Make Predictions on New Data<\/li>\n<\/ol>\n<h2>Phoneme Dataset<\/h2>\n<p>In this project, we will use a standard imbalanced machine learning dataset referred to as the &ldquo;<em>Phoneme<\/em>&rdquo; dataset.<\/p>\n<p>This dataset is credited to the ESPRIT (<a href=\"https:\/\/en.wikipedia.org\/wiki\/European_Strategic_Program_on_Research_in_Information_Technology\">European Strategic Program on Research in Information Technology<\/a>) project titled &ldquo;<em>ROARS<\/em>&rdquo; (Robust Analytical Speech Recognition System) and described in progress reports and technical reports from that project.<\/p>\n<blockquote>\n<p>The goal of the ROARS project is to increase the robustness of an existing analytical speech recognition system (i.e., one using knowledge about syllables, phonemes and phonetic features), and to use it as part of a speech understanding system with connected words and dialogue capability. 
This system will be evaluated for a specific application in two European languages<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/www.aclweb.org\/anthology\/H91-1007.pdf\">ESPRIT: The European Strategic Programme for Research and development in Information Technology<\/a>.<\/p>\n<p>The goal of the dataset was to distinguish between nasal and oral vowels.<\/p>\n<p>Vowel sounds were spoken and recorded to digital files. Then audio features were automatically extracted from each sound.<\/p>\n<blockquote>\n<p>Five different attributes were chosen to characterize each vowel: they are the amplitudes of the five first harmonics AHi, normalised by the total energy Ene (integrated on all the frequencies): AHi\/Ene. Each harmonic is signed: positive when it corresponds to a local maximum of the spectrum and negative otherwise.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/phoneme.names\">Phoneme Dataset Description<\/a>.<\/p>\n<p>There are two classes for the two types of sounds; they are:<\/p>\n<ul>\n<li><strong>Class 0<\/strong>: Nasal Vowels (majority class).<\/li>\n<li><strong>Class 1<\/strong>: Oral Vowels (minority class).<\/li>\n<\/ul>\n<p>Next, let&rsquo;s take a closer look at the data.<\/p>\n<h2>Explore the Dataset<\/h2>\n<p>The Phoneme dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques designed specifically for imbalanced classification.<\/p>\n<p>One example is the popular <a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE data oversampling technique<\/a>.<\/p>\n<p>First, download the dataset and save it in your current working directory with the name &ldquo;<em>phoneme.csv<\/em>&ldquo;.<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/phoneme.csv\">Download Phoneme Dataset (phoneme.csv)<\/a><\/li>\n<\/ul>\n<p>Review the contents of the file.<\/p>\n<p>The first few lines of the file should look as follows:<\/p>\n<pre class=\"crayon-plain-tag\">1.24,0.875,-0.205,-0.078,0.067,0\r\n0.268,1.352,1.035,-0.332,0.217,0\r\n1.567,0.867,1.3,1.041,0.559,0\r\n0.279,0.99,2.555,-0.738,0.0,0\r\n0.307,1.272,2.656,-0.946,-0.467,0\r\n...<\/pre>\n<p>We can see that the given input variables are numeric and class labels are 0 and 1 for nasal and oral respectively.<\/p>\n<p>The dataset can be loaded as a DataFrame using the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">read_csv() Pandas function<\/a>, specifying the location and the fact that there is no header line.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the dataset location\r\nfilename = 'phoneme.csv'\r\n# load the csv file as a data 
frame\r\ndataframe = read_csv(filename, header=None)<\/pre>\n<p>Once loaded, we can summarize the number of rows and columns by printing the shape of the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html\">DataFrame<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)<\/pre>\n<p>We can also summarize the number of examples in each class using the <a href=\"https:\/\/docs.python.org\/3\/library\/collections.html#collections.Counter\">Counter<\/a> object.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Tying this together, the complete example of loading and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the dataset\r\nfrom pandas import read_csv\r\nfrom collections import Counter\r\n# define the dataset location\r\nfilename = 'phoneme.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Running the example first loads the dataset and confirms the number of rows and columns, that is 5,404 rows and five input variables and one target variable.<\/p>\n<p>The class distribution is then summarized, confirming a modest class imbalance with approximately 70 percent for the majority class (<em>nasal<\/em>) and approximately 30 percent for the minority class (<em>oral<\/em>).<\/p>\n<pre class=\"crayon-plain-tag\">(5404, 6)\r\nClass=0.0, 
Count=3818, Percentage=70.651%\r\nClass=1.0, Count=1586, Percentage=29.349%<\/pre>\n<p>We can also take a look at the distribution of the five numerical input variables by creating a histogram for each.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create histograms of numeric input variables\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'phoneme.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None)\r\n# histograms of all variables\r\ndf.hist()\r\npyplot.show()<\/pre>\n<p>Running the example creates the figure with one histogram subplot for each of the five numerical input variables in the dataset, as well as the numerical class label.<\/p>\n<p>We can see that the variables have differing scales, although most appear to have a Gaussian or <a href=\"https:\/\/machinelearningmastery.com\/continuous-probability-distributions-for-machine-learning\/\">Gaussian-like distribution<\/a>.<\/p>\n<p>Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps also the use of a power transform.<\/p>\n<div id=\"attachment_9699\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9699\" class=\"size-full wp-image-9699\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Histogram-Plots-of-the-Variables-for-the-Phoneme-Dataset.png\" alt=\"Histogram Plots of the Variables for the Phoneme Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-Plots-of-the-Variables-for-the-Phoneme-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-Plots-of-the-Variables-for-the-Phoneme-Dataset-300x225.png 300w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-Plots-of-the-Variables-for-the-Phoneme-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Histogram-Plots-of-the-Variables-for-the-Phoneme-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9699\" class=\"wp-caption-text\">Histogram Plots of the Variables for the Phoneme Dataset<\/p>\n<\/div>\n<p>We can also create a scatter plot for each pair of input variables, called a scatter plot matrix.<\/p>\n<p>This can be helpful to see if any variables relate to each other or change in the same direction, e.g. are correlated.<\/p>\n<p>We can also color the dots of each scatter plot according to the class label. In this case, the majority class (<em>nasal<\/em>) will be mapped to blue dots and the minority class (<em>oral<\/em>) will be mapped to red dots.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create pairwise scatter plots of numeric input variables\r\nfrom pandas import read_csv\r\nfrom pandas import DataFrame\r\nfrom pandas.plotting import scatter_matrix\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'phoneme.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None)\r\n# define a mapping of class values to colors\r\ncolor_dict = {0:'blue', 1:'red'}\r\n# map each row to a color based on the class value\r\ncolors = [color_dict[x] for x in df.values[:, -1]]\r\n# drop the target variable\r\ninputs = DataFrame(df.values[:, :-1])\r\n# pairwise scatter plots of all numerical variables\r\nscatter_matrix(inputs, diagonal='kde', color=colors)\r\npyplot.show()<\/pre>\n<p>Running the example creates a figure showing the scatter plot matrix, with five plots by five plots, comparing each of the five numerical input variables with each other. 
The diagonal of the matrix shows the density distribution of each variable.<\/p>\n<p>Each pairing appears twice, both above and below the top-left to bottom-right diagonal, providing two ways to review the same variable interactions.<\/p>\n<p>We can see that the distributions for many variables do differ for the two class labels, suggesting that some reasonable discrimination between the classes will be feasible.<\/p>\n<div id=\"attachment_9700\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9700\" class=\"size-full wp-image-9700\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Phoneme-Dataset.png\" alt=\"Scatter Plot Matrix by Class for the Numerical Input Variables in the Phoneme Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Phoneme-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Phoneme-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Phoneme-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Phoneme-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9700\" class=\"wp-caption-text\">Scatter Plot Matrix by Class for the Numerical Input Variables in the Phoneme Dataset<\/p>\n<\/div>\n<p>Now that we have reviewed the dataset, let&rsquo;s look at developing a 
test harness for evaluating candidate models.<\/p>\n<h2>Model Test and Baseline Result<\/h2>\n<p>We will evaluate candidate models using repeated stratified k-fold cross-validation.<\/p>\n<p>The <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation procedure<\/a> provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 5404\/10 or about 540 examples.<\/p>\n<p>Stratified means that each fold will contain the same mixture of examples by class, that is about 70 percent to 30 percent nasal to oral vowels. Repetition indicates that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.<\/p>\n<p>This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">RepeatedStratifiedKFold scikit-learn class<\/a>.<\/p>\n<p>Class labels will be predicted and both class labels are equally important. 
Therefore, we will select a metric that quantifies the performance of a model on both classes separately.<\/p>\n<p>You may remember that the sensitivity is a measure of the accuracy for the positive class and specificity is a measure of the accuracy of the negative class.<\/p>\n<ul>\n<li>Sensitivity = TruePositives \/ (TruePositives + FalseNegatives)<\/li>\n<li>Specificity = TrueNegatives \/ (TrueNegatives + FalsePositives)<\/li>\n<\/ul>\n<p>The G-mean seeks a balance of these scores, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Geometric_mean\">geometric mean<\/a>, where poor performance for one or the other results in a low G-mean score.<\/p>\n<ul>\n<li>G-Mean = sqrt(Sensitivity * Specificity)<\/li>\n<\/ul>\n<p>We can calculate the G-mean for a set of predictions made by a model using the <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.metrics.geometric_mean_score.html\">geometric_mean_score() function<\/a> provided by the imbalanced-learn library.<\/p>\n<p>We can define a function to load the dataset and split the columns into input and output variables. The <em>load_dataset()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\treturn X, y<\/pre>\n<p>We can then define a function that will evaluate a given model on the dataset and return a list of G-Mean scores for each fold and repeat. 
The <em>evaluate_model()<\/em> function below implements this, taking the dataset and model as arguments and returning the list of scores.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(geometric_mean_score)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores<\/pre>\n<p>Finally, we can evaluate a baseline model on the dataset using this test harness.<\/p>\n<p>A model that predicts the majority class label (0) or the minority class label (1) for all cases will result in a G-mean of zero. As such, a good default strategy would be to randomly predict one class label or another with a 50 percent probability and aim for a G-mean of about 0.5.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">DummyClassifier<\/a> class from the scikit-learn library and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &lsquo;<em>uniform<\/em>&lsquo;.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='uniform')<\/pre>\n<p>Once the model is evaluated, we can report the mean and standard deviation of the G-mean scores directly.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean G-Mean: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># test harness and baseline model evaluation\r\nfrom collections import Counter\r\nfrom numpy import mean\r\nfrom numpy import 
std\r\nfrom pandas import read_csv\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom imblearn.metrics import geometric_mean_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.dummy import DummyClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(geometric_mean_score)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'phoneme.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# summarize the loaded dataset\r\nprint(X.shape, y.shape, Counter(y))\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='uniform')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean G-Mean: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example first loads and summarizes the dataset.<\/p>\n<p>We can see that we have the correct number of rows loaded and that we have five audio-derived input variables.<\/p>\n<p>Next, the average of the G-Mean scores is reported.<\/p>\n<p>Your specific results will vary given the stochastic nature of the algorithm; consider running the example a few times.<\/p>\n<p>In this case, we can see that the baseline algorithm achieves a G-Mean of about 0.509, close to the theoretical expectation of 0.5. 
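The value of 0.5 follows directly from the G-mean formula given earlier: a uniform-random classifier gets about half of each class right regardless of the imbalance, and the geometric mean of two halves is one half. A minimal sketch of that arithmetic (expected values only, not a simulation):

```python
from math import sqrt

# A uniform-random classifier labels roughly half of each class correctly,
# no matter how skewed the class distribution is.
sensitivity = 0.5  # expected accuracy on the positive (oral) class
specificity = 0.5  # expected accuracy on the negative (nasal) class

# G-Mean = sqrt(Sensitivity * Specificity)
print('Expected G-Mean of random guessing: %.3f' % sqrt(sensitivity * specificity))  # 0.500

# A model that only ever predicts one class scores zero on the other class,
# so the geometric mean collapses to zero, as noted above.
print('G-Mean of an all-one-class model: %.3f' % sqrt(1.0 * 0.0))  # 0.000
```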
This score provides a lower limit on model skill; any model that achieves an average G-Mean above about 0.509 (or really above 0.5) has skill, whereas models that achieve a score below this value do not have skill on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">(5404, 5) (5404,) Counter({0.0: 3818, 1.0: 1586})\r\nMean G-Mean: 0.509 (0.020)<\/pre>\n<p>Now that we have a test harness and a baseline performance, we can begin to evaluate some models on this dataset.<\/p>\n<h2>Evaluate Models<\/h2>\n<p>In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.<\/p>\n<p>The goal is both to demonstrate how to work through the problem systematically and to show the capability of some techniques designed for imbalanced classification problems.<\/p>\n<p>The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).<\/p>\n<p><strong>Can you do better?<\/strong> If you can achieve better G-mean performance using the same test harness, I&rsquo;d love to hear about it. 
Let me know in the comments below.<\/p>\n<h3>Evaluate Machine Learning Algorithms<\/h3>\n<p>Let&rsquo;s start by evaluating a mixture of machine learning models on the dataset.<\/p>\n<p>It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn&rsquo;t.<\/p>\n<p>We will evaluate the following machine learning models on the phoneme dataset:<\/p>\n<ul>\n<li>Logistic Regression (LR)<\/li>\n<li>Support Vector Machine (SVM)<\/li>\n<li>Bagged Decision Trees (BAG)<\/li>\n<li>Random Forest (RF)<\/li>\n<li>Extra Trees (ET)<\/li>\n<\/ul>\n<p>We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.<\/p>\n<p>We will define each model in turn and add them to a list so that we can evaluate them sequentially. The <em>get_models()<\/em> function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.<\/p>\n<pre class=\"crayon-plain-tag\"># define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='lbfgs'))\r\n\tnames.append('LR')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=1000))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000))\r\n\tnames.append('RF')\r\n\t# ET\r\n\tmodels.append(ExtraTreesClassifier(n_estimators=1000))\r\n\tnames.append('ET')\r\n\treturn models, names<\/pre>\n<p>We can then enumerate the list of models in turn and evaluate each, reporting the mean G-Mean and storing the scores for later plotting.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in 
range(len(models)):\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, models[i])\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))<\/pre>\n<p>At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the phoneme dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># spot check machine learning algorithms on the phoneme dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom imblearn.metrics import geometric_mean_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.ensemble import ExtraTreesClassifier\r\nfrom sklearn.ensemble import BaggingClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(geometric_mean_score)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, 
scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='lbfgs'))\r\n\tnames.append('LR')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=1000))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000))\r\n\tnames.append('RF')\r\n\t# ET\r\n\tmodels.append(ExtraTreesClassifier(n_estimators=1000))\r\n\tnames.append('ET')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'phoneme.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, models[i])\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each algorithm in turn and reports the mean and standard deviation G-Mean.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, we can see that all of the tested algorithms have skill, achieving a G-Mean above the default of 0.5. The results suggest that the ensembles of decision trees perform better on this dataset, with perhaps Extra Trees (ET) performing the best with a G-Mean of about 0.896.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;LR 0.637 (0.023)\r\n&gt;SVM 0.801 (0.022)\r\n&gt;BAG 0.888 (0.017)\r\n&gt;RF 0.892 (0.018)\r\n&gt;ET 0.896 (0.017)<\/pre>\n<p>A figure is created showing one box and whisker plot for each algorithm&rsquo;s sample 
of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.<\/p>\n<p>We can see that all three tree ensemble algorithms (BAG, RF, and ET) have a tight distribution and a mean and median that closely align, perhaps suggesting a non-skewed, Gaussian-like distribution of scores, i.e. stable results.<\/p>\n<div id=\"attachment_9701\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9701\" class=\"size-full wp-image-9701\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Phoneme-Dataset.png\" alt=\"Box and Whisker Plot of Machine Learning Models on the Imbalanced Phoneme Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Phoneme-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Phoneme-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Phoneme-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Phoneme-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9701\" class=\"wp-caption-text\">Box and Whisker Plot of Machine Learning Models on the Imbalanced Phoneme Dataset<\/p>\n<\/div>\n<p>Now that we have a good first set of results, let&rsquo;s see if we can improve them 
with data oversampling methods.<\/p>\n<h3>Evaluate Data Oversampling Algorithms<\/h3>\n<p>Data sampling provides a way to better prepare the imbalanced training dataset prior to fitting a model.<\/p>\n<p>The simplest oversampling technique is to duplicate examples in the minority class, called random oversampling. Perhaps the most popular oversampling method is SMOTE, which creates new synthetic examples for the minority class.<\/p>\n<p>We will test five different oversampling methods; specifically:<\/p>\n<ul>\n<li>Random Oversampling (ROS)<\/li>\n<li>SMOTE (SMOTE)<\/li>\n<li>Borderline SMOTE (BLSMOTE)<\/li>\n<li>SVM SMOTE (SVMSMOTE)<\/li>\n<li>ADASYN (ADASYN)<\/li>\n<\/ul>\n<p>Each technique will be tested with the best-performing algorithm from the previous section, specifically Extra Trees.<\/p>\n<p>We will use the default hyperparameters for each oversampling algorithm, which will oversample the minority class to have the same number of examples as the majority class in the training dataset.<\/p>\n<p>The expectation is that each oversampling technique will result in a lift in performance compared to the algorithm without oversampling, with the smallest lift provided by Random Oversampling and perhaps the best lift provided by SMOTE or one of its variations.<\/p>\n<p>We can update the <em>get_models()<\/em> function to return lists of oversampling algorithms to evaluate; for example:<\/p>\n<pre class=\"crayon-plain-tag\"># define oversampling models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# RandomOverSampler\r\n\tmodels.append(RandomOverSampler())\r\n\tnames.append('ROS')\r\n\t# SMOTE\r\n\tmodels.append(SMOTE())\r\n\tnames.append('SMOTE')\r\n\t# BorderlineSMOTE\r\n\tmodels.append(BorderlineSMOTE())\r\n\tnames.append('BLSMOTE')\r\n\t# SVMSMOTE\r\n\tmodels.append(SVMSMOTE())\r\n\tnames.append('SVMSMOTE')\r\n\t# ADASYN\r\n\tmodels.append(ADASYN())\r\n\tnames.append('ADASYN')\r\n\treturn models, 
names<\/pre>\n<p>We can then enumerate each and create a <a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.pipeline.Pipeline.html\">Pipeline<\/a> from the imbalanced-learn library that is aware of how to oversample a training dataset. This will ensure that the training dataset within the cross-validation model evaluation is sampled correctly, without data leakage that could result in an optimistic evaluation of model performance.<\/p>\n<p>First, we will normalize the input variables because most oversampling techniques make use of a nearest-neighbor algorithm, and it is important that all variables have the same scale when using such techniques. This will be followed by a given oversampling algorithm, ending with the Extra Trees algorithm fit on the oversampled training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the model\r\nmodel = ExtraTreesClassifier(n_estimators=1000)\r\n# define the pipeline steps\r\nsteps = [('s', MinMaxScaler()), ('o', models[i]), ('m', model)]\r\n# define the pipeline\r\npipeline = Pipeline(steps=steps)\r\n# evaluate the model and store results\r\nscores = evaluate_model(X, y, pipeline)<\/pre>\n<p>Tying this together, the complete example of evaluating oversampling algorithms with Extra Trees on the phoneme dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># data oversampling algorithms on the phoneme imbalanced dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom imblearn.metrics import geometric_mean_score\r\nfrom sklearn.metrics import make_scorer\r\nfrom sklearn.ensemble import ExtraTreesClassifier\r\nfrom imblearn.over_sampling import RandomOverSampler\r\nfrom imblearn.over_sampling import SMOTE\r\nfrom 
imblearn.over_sampling import BorderlineSMOTE\r\nfrom imblearn.over_sampling import SVMSMOTE\r\nfrom imblearn.over_sampling import ADASYN\r\nfrom imblearn.pipeline import Pipeline\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# define the model evaluation metric\r\n\tmetric = make_scorer(geometric_mean_score)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define oversampling models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# RandomOverSampler\r\n\tmodels.append(RandomOverSampler())\r\n\tnames.append('ROS')\r\n\t# SMOTE\r\n\tmodels.append(SMOTE())\r\n\tnames.append('SMOTE')\r\n\t# BorderlineSMOTE\r\n\tmodels.append(BorderlineSMOTE())\r\n\tnames.append('BLSMOTE')\r\n\t# SVMSMOTE\r\n\tmodels.append(SVMSMOTE())\r\n\tnames.append('SVMSMOTE')\r\n\t# ADASYN\r\n\tmodels.append(ADASYN())\r\n\tnames.append('ADASYN')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'phoneme.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# define the model\r\n\tmodel = ExtraTreesClassifier(n_estimators=1000)\r\n\t# define the pipeline steps\r\n\tsteps = [('s', MinMaxScaler()), ('o', models[i]), ('m', model)]\r\n\t# define the pipeline\r\n\tpipeline = Pipeline(steps=steps)\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, 
pipeline)\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each oversampling method with the Extra Trees model on the dataset.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, as we expected, each oversampling technique resulted in a lift in performance over the ET algorithm without any oversampling (0.896), except for the random oversampling technique.<\/p>\n<p>The results suggest that the modified versions of SMOTE and ADASYN performed better than default SMOTE, and in this case, ADASYN achieved the best G-Mean score of 0.910.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;ROS 0.894 (0.018)\r\n&gt;SMOTE 0.906 (0.015)\r\n&gt;BLSMOTE 0.909 (0.013)\r\n&gt;SVMSMOTE 0.909 (0.014)\r\n&gt;ADASYN 0.910 (0.013)<\/pre>\n<p>The distribution of results can be compared with box and whisker plots.<\/p>\n<p>We can see that all of the methods have roughly the same tight distribution of results and that the differences in means can be used to select a model.<\/p>\n<div id=\"attachment_9702\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9702\" class=\"size-full wp-image-9702\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Extra-Trees-Models-with-Data-Oversampling-on-the-Imbalanced-Phoneme-Dataset.png\" alt=\"Box and Whisker Plot of Extra Trees Models With Data Oversampling on the Imbalanced Phoneme Dataset\" width=\"1280\" height=\"960\" 
srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Extra-Trees-Models-with-Data-Oversampling-on-the-Imbalanced-Phoneme-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Extra-Trees-Models-with-Data-Oversampling-on-the-Imbalanced-Phoneme-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Extra-Trees-Models-with-Data-Oversampling-on-the-Imbalanced-Phoneme-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/12\/Box-and-Whisker-Plot-of-Extra-Trees-Models-with-Data-Oversampling-on-the-Imbalanced-Phoneme-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9702\" class=\"wp-caption-text\">Box and Whisker Plot of Extra Trees Models With Data Oversampling on the Imbalanced Phoneme Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s see how we might use a final model to make predictions on new data.<\/p>\n<h2>Make Prediction on New Data<\/h2>\n<p>In this section, we will fit a final model and use it to make predictions on single rows of data<\/p>\n<p>We will use the ADASYN oversampled version of the Extra Trees model as the final model and a normalization scaling on the data prior to fitting the model and making a prediction. 
Using the pipeline will ensure that the transform is always performed correctly.<\/p>\n<p>First, we can define the model as a pipeline.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the model\r\nmodel = ExtraTreesClassifier(n_estimators=1000)\r\n# define the pipeline steps\r\nsteps = [('s', MinMaxScaler()), ('o', ADASYN()), ('m', model)]\r\n# define the pipeline\r\npipeline = Pipeline(steps=steps)<\/pre>\n<p>Once defined, we can fit it on the entire training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit the model\r\npipeline.fit(X, y)<\/pre>\n<p>Once fit, we can use it to make predictions for new data by calling the <em>predict()<\/em> function. This will return the class label of 0 for &ldquo;<em>nasal<\/em>&rdquo; or 1 for &ldquo;<em>oral<\/em>&rdquo;.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define a row of data\r\nrow = [...]\r\n# make prediction\r\nyhat = pipeline.predict([row])<\/pre>\n<p>To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know whether the case is nasal or oral.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and make predictions on the phoneme dataset\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nfrom imblearn.over_sampling import ADASYN\r\nfrom sklearn.ensemble import ExtraTreesClassifier\r\nfrom imblearn.pipeline import Pipeline\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\treturn X, y\r\n\r\n# define the location of the dataset\r\nfull_path = 'phoneme.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define the model\r\nmodel = ExtraTreesClassifier(n_estimators=1000)\r\n# define the pipeline 
steps\r\nsteps = [('s', MinMaxScaler()), ('o', ADASYN()), ('m', model)]\r\n# define the pipeline\r\npipeline = Pipeline(steps=steps)\r\n# fit the model\r\npipeline.fit(X, y)\r\n# evaluate on some nasal cases (known class 0)\r\nprint('Nasal:')\r\ndata = [[1.24,0.875,-0.205,-0.078,0.067],\r\n\t[0.268,1.352,1.035,-0.332,0.217],\r\n\t[1.567,0.867,1.3,1.041,0.559]]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 0)' % (label))\r\n# evaluate on some oral cases (known class 1)\r\nprint('Oral:')\r\ndata = [[0.125,0.548,0.795,0.836,0.0],\r\n\t[0.318,0.811,0.818,0.821,0.86],\r\n\t[0.151,0.642,1.454,1.281,-0.716]]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 1)' % (label))<\/pre>\n<p>Running the example first fits the model on the entire training dataset.<\/p>\n<p>Then the fit model is used to predict the label of nasal cases chosen from the dataset file. We can see that all cases are correctly predicted.<\/p>\n<p>Then some oral cases are used as input to the model and the label is predicted. 
As we might have hoped, the correct labels are predicted for all cases.<\/p>\n<pre class=\"crayon-plain-tag\">Nasal:\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\nOral:\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.aclweb.org\/anthology\/H91-1007.pdf\">ESPRIT: The European Strategic Programme for Research and development in Information Technology<\/a>.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">sklearn.model_selection.RepeatedStratifiedKFold API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">sklearn.dummy.DummyClassifier API<\/a>.<\/li>\n<li><a href=\"https:\/\/imbalanced-learn.readthedocs.io\/en\/stable\/generated\/imblearn.metrics.geometric_mean_score.html\">imblearn.metrics.geometric_mean_score API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/phoneme.csv\">Phoneme Dataset<\/a><\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/phoneme.names\">Phoneme Dataset Description<\/a><\/li>\n<li><a href=\"https:\/\/sci2s.ugr.es\/keel\/dataset.php?cod=105\">Phoneme Dataset on KEEL<\/a><\/li>\n<li><a href=\"https:\/\/www.elen.ucl.ac.be\/neural-nets\/Research\/Projects\/ELENA\/databases\/REAL\/phoneme\/\">Phoneme Dataset on the ELENA Project<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop and evaluate models for imbalanced binary classification of nasal and oral phonemes.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to load and explore 
the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to evaluate a suite of machine learning models and improve their performance with data oversampling techniques.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/predictive-model-for-the-phoneme-imbalanced-classification-dataset\/\">Predictive Model for the Phoneme Imbalanced Classification Dataset<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/predictive-model-for-the-phoneme-imbalanced-classification-dataset\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Many binary classification tasks do not have an equal number of examples from each class, e.g. 
the class distribution is skewed or [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/03\/predictive-model-for-the-phoneme-imbalanced-classification-dataset\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3201,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3200"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3200"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3200\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3201"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3200"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3200"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3200"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}