{"id":3194,"date":"2020-03-01T18:00:55","date_gmt":"2020-03-01T18:00:55","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/01\/imbalanced-classification-model-to-detect-mammography-microcalcifications\/"},"modified":"2020-03-01T18:00:55","modified_gmt":"2020-03-01T18:00:55","slug":"imbalanced-classification-model-to-detect-mammography-microcalcifications","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/01\/imbalanced-classification-model-to-detect-mammography-microcalcifications\/","title":{"rendered":"Imbalanced Classification Model to Detect Mammography Microcalcifications"},"content":{"rendered":"<p>Author: Jason Brownlee<\/p>\n<div>\n<p>Cancer detection is a popular example of an imbalanced classification problem because there are often significantly more cases of non-cancer than actual cancer.<\/p>\n<p>A standard imbalanced classification dataset is the mammography dataset that involves detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram. 
This dataset was constructed by scanning the images, segmenting them into candidate objects, and using computer vision techniques to describe each candidate object.<\/p>\n<p>It is a popular dataset for imbalanced classification because of the severe class imbalance: 98 percent of candidate microcalcifications are not cancer and only 2 percent were labeled as cancer by an experienced radiologist.<\/p>\n<p>In this tutorial, you will discover how to develop and evaluate models for the imbalanced mammography cancer classification dataset.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to evaluate a suite of machine learning models and improve their performance with cost-sensitive techniques.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Let&rsquo;s get started.<\/p>\n<div id=\"attachment_9690\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9690\" class=\"size-full wp-image-9690\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Detect-Microcalcifications.jpg\" alt=\"Develop an Imbalanced Classification Model to Detect Microcalcifications\" width=\"799\" height=\"453\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Detect-Microcalcifications.jpg 799w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Detect-Microcalcifications-300x170.jpg 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2020\/03\/Develop-an-Imbalanced-Classification-Model-to-Detect-Microcalcifications-768x435.jpg 768w\" sizes=\"(max-width: 799px) 100vw, 799px\"><\/p>\n<p id=\"caption-attachment-9690\" class=\"wp-caption-text\">Develop an Imbalanced Classification Model to Detect Microcalcifications<br \/>Photo by <a href=\"https:\/\/flickr.com\/photos\/volvob12b\/44164103752\/\">Bernard Spragg. NZ<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>Mammography Dataset<\/li>\n<li>Explore the Dataset<\/li>\n<li>Model Test and Baseline Result<\/li>\n<li>Evaluate Models\n<ol>\n<li>Evaluate Machine Learning Algorithms<\/li>\n<li>Evaluate Cost-Sensitive Algorithms<\/li>\n<\/ol>\n<\/li>\n<li>Make Predictions on New Data<\/li>\n<\/ol>\n<h2>Mammography Dataset<\/h2>\n<p>In this project, we will use a standard imbalanced machine learning dataset referred to as the &ldquo;<em>mammography<\/em>&rdquo; dataset or sometimes &ldquo;<em>Woods Mammography<\/em>.&rdquo;<\/p>\n<p>The dataset is credited to Kevin Woods, et al. and the 1993 paper titled &ldquo;<a href=\"https:\/\/www.worldscientific.com\/doi\/abs\/10.1142\/S0218001493000698\">Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography<\/a>.&rdquo;<\/p>\n<p>The focus of the problem is on detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram.<\/p>\n<p>The dataset was built from 24 mammograms with a known cancer diagnosis, which were scanned. 
The images were then pre-processed using image segmentation computer vision algorithms to extract candidate objects from the mammogram images. Once segmented, the objects were then manually labeled by an experienced radiologist.<\/p>\n<p>A total of 29 features were extracted from the segmented objects thought to be most relevant to pattern recognition, which was reduced to 18, then finally to seven, as follows (taken directly from the paper):<\/p>\n<ul>\n<li>Area of object (in pixels).<\/li>\n<li>Average gray level of the object.<\/li>\n<li>Gradient strength of the object&rsquo;s perimeter pixels.<\/li>\n<li>Root mean square noise fluctuation in the object.<\/li>\n<li>Contrast, average gray level of the object minus the average of a two-pixel wide border surrounding the object.<\/li>\n<li>A low order moment based on shape descriptor.<\/li>\n<\/ul>\n<p>There are two classes and the goal is to distinguish between microcalcifications and non-microcalcifications using the features for a given segmented object.<\/p>\n<ul>\n<li><strong>Non-microcalcifications<\/strong>: negative case, or majority class.<\/li>\n<li><strong>Microcalcifications<\/strong>: positive case, or minority class.<\/li>\n<\/ul>\n<p>A number of models were evaluated and compared in the original paper, such as neural networks, decision trees, and k-nearest neighbors. Models were evaluated using <a href=\"https:\/\/machinelearningmastery.com\/roc-curves-and-precision-recall-curves-for-classification-in-python\/\">ROC Curves<\/a> and compared using the area under ROC Curve, or ROC AUC for short.<\/p>\n<p>ROC Curves and area under ROC Curves were chosen with the intent to minimize the false-positive rate (complement of the specificity) and maximize the true-positive rate (sensitivity), the two axes of the ROC Curve. 
The use of the ROC Curves also suggests the desire for a probabilistic model from which an operator can select a probability threshold as the cut-off between the acceptable false positive and true positive rates.<\/p>\n<p>Their results suggested a &ldquo;<em>linear classifier<\/em>&rdquo; (seemingly a <a href=\"https:\/\/machinelearningmastery.com\/classification-as-conditional-probability-and-the-naive-bayes-algorithm\/\">Gaussian Naive Bayes classifier<\/a>) performed the best with a ROC AUC of 0.936 averaged over 100 runs.<\/p>\n<p>Next, let&rsquo;s take a closer look at the data.<\/p>\n<\/p>\n<h2>Explore the Dataset<\/h2>\n<p>The Mammography dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques 
designed specifically for imbalanced classification.<\/p>\n<p>One example is the popular <a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE data oversampling technique<\/a>.<\/p>\n<p>A version of this dataset was made available that has some differences to the dataset described in the original paper.<\/p>\n<p>First, download the dataset and save it in your current working directory with the name &ldquo;<em>mammography.csv<\/em>&rdquo;<\/p>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/mammography.csv\">Download Mammography Dataset (mammography.csv)<\/a><\/li>\n<\/ul>\n<p>Review the contents of the file.<\/p>\n<p>The first few lines of the file should look as follows:<\/p>\n<pre class=\"crayon-plain-tag\">0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223,'-1'\r\n0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324,'-1'\r\n-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324,'-1'\r\n0.54608818,0.13141457,-0.45638679,-0.85955255,-0.37786573,-0.94572324,'-1'\r\n-0.10298725,-0.3949941,-0.14081588,0.97970269,-0.37786573,1.0135658,'-1'\r\n...<\/pre>\n<p>We can see that the dataset has six rather than the seven input variables. It is possible that the first input variable listed in the paper (area in pixels) was removed from this version of the dataset.<\/p>\n<p>The input variables are numerical (real-valued) and the target variable is the string with &lsquo;-1&rsquo; for the majority class and &lsquo;1&rsquo; for the minority class. 
These values will need to be encoded as 0 and 1 respectively to meet the expectations of classification algorithms on binary imbalanced classification problems.<\/p>\n<p>The dataset can be loaded as a DataFrame using the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\">read_csv() Pandas function<\/a>, specifying the location and the fact that there is no header line.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the dataset location\r\nfilename = 'mammography.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)<\/pre>\n<p>Once loaded, we can summarize the number of rows and columns by printing the shape of the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html\">DataFrame<\/a>.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)<\/pre>\n<p>We can also summarize the number of examples in each class using the <a href=\"https:\/\/docs.python.org\/3\/library\/collections.html#collections.Counter\">Counter<\/a> object.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Tying this together, the complete example of loading and summarizing the dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># load and summarize the dataset\r\nfrom pandas import read_csv\r\nfrom collections import Counter\r\n# define the dataset location\r\nfilename = 'mammography.csv'\r\n# load the csv file as a data frame\r\ndataframe = read_csv(filename, header=None)\r\n# summarize the shape of the dataset\r\nprint(dataframe.shape)\r\n# summarize the class distribution\r\ntarget = dataframe.values[:,-1]\r\ncounter = Counter(target)\r\nfor k,v in 
counter.items():\r\n\tper = v \/ len(target) * 100\r\n\tprint('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))<\/pre>\n<p>Running the example first loads the dataset and confirms the number of rows and columns, that is, 11,183 rows with six input variables and one target variable.<\/p>\n<p>The class distribution is then summarized, confirming the severe class imbalance with approximately 98 percent for the majority class (no cancer) and approximately 2 percent for the minority class (cancer).<\/p>\n<pre class=\"crayon-plain-tag\">(11183, 7)\r\nClass='-1', Count=10923, Percentage=97.675%\r\nClass='1', Count=260, Percentage=2.325%<\/pre>\n<p>The dataset appears to generally match the dataset described in the SMOTE paper, specifically in terms of the ratio of negative to positive examples.<\/p>\n<blockquote>\n<p>A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE: Synthetic Minority Over-sampling Technique<\/a>, 2002.<\/p>\n<p>The specific number of examples in the minority and majority classes also matches the paper.<\/p>\n<blockquote>\n<p>The experiments were conducted on the mammography dataset. There were 10923 examples in the majority class and 260 examples in the minority class originally.<\/p>\n<\/blockquote>\n<p>&mdash; <a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE: Synthetic Minority Over-sampling Technique<\/a>, 2002.<\/p>\n<p>I believe this is the same dataset, although I cannot explain the mismatch in the number of input features, e.g. 
six compared to seven in the original paper.<\/p>\n<p>We can also take a look at the distribution of the six numerical input variables by creating a histogram for each.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create histograms of numeric input variables\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'mammography.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None)\r\n# histograms of all variables\r\ndf.hist()\r\npyplot.show()<\/pre>\n<p>Running the example creates the figure with one histogram subplot for each of the six numerical input variables in the dataset.<\/p>\n<p>We can see that the variables have differing scales and that most of the variables have an exponential distribution, e.g. most cases falling into one bin, and the rest falling into a long tail. The final variable appears to have a bimodal distribution.<\/p>\n<p>Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.<\/p>\n<div id=\"attachment_9686\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9686\" class=\"size-full wp-image-9686\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/Histogram-Plots-of-the-Numerical-Input-Variables-for-the-Mammography-Dataset.png\" alt=\"Histogram Plots of the Numerical Input Variables for the Mammography Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-Plots-of-the-Numerical-Input-Variables-for-the-Mammography-Dataset.png 1280w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-Plots-of-the-Numerical-Input-Variables-for-the-Mammography-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-Plots-of-the-Numerical-Input-Variables-for-the-Mammography-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Histogram-Plots-of-the-Numerical-Input-Variables-for-the-Mammography-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9686\" class=\"wp-caption-text\">Histogram Plots of the Numerical Input Variables for the Mammography Dataset<\/p>\n<\/div>\n<p>We can also create a scatter plot for each pair of input variables, called a scatter plot matrix.<\/p>\n<p>This can be helpful to see if any variables relate to each other or change in the same direction, e.g. are correlated.<\/p>\n<p>We can also color the dots of each scatter plot according to the class label. 
In this case, the majority class (no cancer) will be mapped to blue dots and the minority class (cancer) will be mapped to red dots.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># create pairwise scatter plots of numeric input variables\r\nfrom pandas import read_csv\r\nfrom pandas.plotting import scatter_matrix\r\nfrom matplotlib import pyplot\r\n# define the dataset location\r\nfilename = 'mammography.csv'\r\n# load the csv file as a data frame\r\ndf = read_csv(filename, header=None)\r\n# define a mapping of class values to colors\r\ncolor_dict = {\"'-1'\":'blue', \"'1'\":'red'}\r\n# map each row to a color based on the class value\r\ncolors = [color_dict[str(x)] for x in df.values[:, -1]]\r\n# pairwise scatter plots of all numerical variables\r\nscatter_matrix(df, diagonal='kde', color=colors)\r\npyplot.show()<\/pre>\n<p>Running the example creates a figure showing the scatter plot matrix, with six plots by six plots, comparing each of the six numerical input variables with each other. 
The diagonal of the matrix shows the density distribution of each variable.<\/p>\n<p>Each pairing appears twice, both above and below the top-left to bottom-right diagonal, providing two ways to review the same variable interactions.<\/p>\n<p>We can see that the distributions for many variables do differ for the two class labels, suggesting that some reasonable discrimination between the cancer and no cancer cases will be feasible.<\/p>\n<div id=\"attachment_9687\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9687\" class=\"size-full wp-image-9687\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Mammography-Dataset.png\" alt=\"Scatter Plot Matrix by Class for the Numerical Input Variables in the Mammography Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Mammography-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Mammography-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Mammography-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Scatter-Plot-Matrix-by-Class-for-the-Numerical-Input-Variables-in-the-Mammography-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9687\" class=\"wp-caption-text\">Scatter Plot Matrix by Class for the Numerical Input Variables in the Mammography Dataset<\/p>\n<\/div>\n<p>Now that we have reviewed 
the dataset, let&rsquo;s look at developing a test harness for evaluating candidate models.<\/p>\n<h2>Model Test and Baseline Result<\/h2>\n<p>We will evaluate candidate models using repeated stratified k-fold cross-validation.<\/p>\n<p>The <a href=\"https:\/\/machinelearningmastery.com\/k-fold-cross-validation\/\">k-fold cross-validation procedure<\/a> provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 11183\/10 or about 1,118 examples.<\/p>\n<p>Stratified means that each fold will contain the same mixture of examples by class, that is about 98 percent to 2 percent no-cancer to cancer objects. Repetition indicates that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.<\/p>\n<p>This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">RepeatedStratifiedKFold scikit-learn class<\/a>.<\/p>\n<p>We will evaluate and compare models using the area under ROC Curve or ROC AUC calculated via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.roc_auc_score.html\">roc_auc_score() function<\/a>.<\/p>\n<p>We can define a function to load the dataset and split the columns into input and output variables. We will correctly encode the class labels as 0 and 1. 
The <em>load_dataset()<\/em> function below implements this.<\/p>\n<pre class=\"crayon-plain-tag\"># load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y<\/pre>\n<p>We can then define a function that will evaluate a given model on the dataset and return a list of ROC AUC scores for each fold and repeat.<\/p>\n<p>The <em>evaluate_model()<\/em> function below implements this, taking the dataset and model as arguments and returning the list of scores.<\/p>\n<pre class=\"crayon-plain-tag\"># evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n\treturn scores<\/pre>\n<p>Finally, we can evaluate a baseline model on the dataset using this test harness.<\/p>\n<p>A model that predicts a random class in proportion to the base rate of each class will achieve a ROC AUC of about 0.5, the baseline in performance on this dataset. 
This is a so-called &ldquo;no skill&rdquo; classifier.<\/p>\n<p>This can be achieved using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">DummyClassifier<\/a> class from the scikit-learn library and setting the &ldquo;<em>strategy<\/em>&rdquo; argument to &lsquo;<em>stratified<\/em>&rsquo;.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='stratified')<\/pre>\n<p>Once the model is evaluated, we can report the mean and standard deviation of the ROC AUC scores directly.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># test harness and baseline model evaluation\r\nfrom collections import Counter\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.dummy import DummyClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, 
y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define the location of the dataset\r\nfull_path = 'mammography.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# summarize the loaded dataset\r\nprint(X.shape, y.shape, Counter(y))\r\n# define the reference model\r\nmodel = DummyClassifier(strategy='stratified')\r\n# evaluate the model\r\nscores = evaluate_model(X, y, model)\r\n# summarize performance\r\nprint('Mean ROC AUC: %.3f (%.3f)' % (mean(scores), std(scores)))<\/pre>\n<p>Running the example first loads and summarizes the dataset.<\/p>\n<p>We can see that we have the correct number of rows loaded, and that we have six computer vision derived input variables. Importantly, we can see that the class labels have the correct mapping to integers with 0 for the majority class and 1 for the minority class, customary for imbalanced binary classification datasets.<\/p>\n<p>Next, the average of the ROC AUC scores is reported.<\/p>\n<p>As expected, the no-skill classifier achieves a mean ROC AUC of approximately 0.5, the score of a model with no ability to separate the classes. This provides a baseline in performance, above which a model can be considered skillful on this dataset.<\/p>\n<pre class=\"crayon-plain-tag\">(11183, 6) (11183,) Counter({0: 10923, 1: 260})\r\nMean ROC AUC: 0.503 (0.016)<\/pre>\n<p>Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.<\/p>\n<h2>Evaluate Models<\/h2>\n<p>In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.<\/p>\n<p>The goal is both to demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.<\/p>\n<p>The reported performance is good, but not highly optimized (e.g. 
hyperparameters are not tuned).<\/p>\n<p><strong>Can you do better?<\/strong> If you can achieve better ROC AUC performance using the same test harness, I&rsquo;d love to hear about it. Let me know in the comments below.<\/p>\n<h3>Evaluate Machine Learning Algorithms<\/h3>\n<p>Let&rsquo;s start by evaluating a mixture of machine learning models on the dataset.<\/p>\n<p>It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn&rsquo;t.<\/p>\n<p>We will evaluate the following machine learning models on the mammography dataset:<\/p>\n<ul>\n<li>Logistic Regression (LR)<\/li>\n<li>Support Vector Machine (SVM)<\/li>\n<li>Bagged Decision Trees (BAG)<\/li>\n<li>Random Forest (RF)<\/li>\n<li>Gradient Boosting Machine (GBM)<\/li>\n<\/ul>\n<p>We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.<\/p>\n<p>We will define each model in turn and add them to a list so that we can evaluate them sequentially. 
The <em>get_models()<\/em> function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.<\/p>\n<pre class=\"crayon-plain-tag\"># define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='lbfgs'))\r\n\tnames.append('LR')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=1000))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000))\r\n\tnames.append('RF')\r\n\t# GBM\r\n\tmodels.append(GradientBoostingClassifier(n_estimators=1000))\r\n\tnames.append('GBM')\r\n\treturn models, names<\/pre>\n<p>We can then enumerate the list of models in turn and evaluate each, reporting the mean ROC AUC and storing the scores for later plotting.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# evaluate the model and store results\r\n\tscores = evaluate_model(X, y, models[i])\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))<\/pre>\n<p>At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the mammography dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># spot check machine learning algorithms on the mammography dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom 
sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom sklearn.ensemble import GradientBoostingClassifier\r\nfrom sklearn.ensemble import BaggingClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a numpy array\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='lbfgs'))\r\n\tnames.append('LR')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale'))\r\n\tnames.append('SVM')\r\n\t# Bagging\r\n\tmodels.append(BaggingClassifier(n_estimators=1000))\r\n\tnames.append('BAG')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000))\r\n\tnames.append('RF')\r\n\t# GBM\r\n\tmodels.append(GradientBoostingClassifier(n_estimators=1000))\r\n\tnames.append('GBM')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'mammography.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# evaluate the model and store results\r\n\tscores = 
evaluate_model(X, y, models[i])\r\n\tresults.append(scores)\r\n\t# summarize and store\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each algorithm in turn and reports the mean and standard deviation of the ROC AUC.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, we can see that all of the tested algorithms have skill, achieving a ROC AUC above the no-skill baseline of 0.5.<\/p>\n<p>The results suggest that the ensembles of decision trees perform better on this dataset, with Random Forest perhaps performing the best with a ROC AUC of about 0.950.<\/p>\n<p>It is interesting to note that this is better than the ROC AUC of 0.936 reported in the paper, although we used a different model evaluation procedure.<\/p>\n<p>The evaluation was a little unfair to the LR and SVM algorithms as we did not scale the input variables prior to fitting the model. We can explore this in the next section.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;LR 0.919 (0.040)\r\n&gt;SVM 0.880 (0.049)\r\n&gt;BAG 0.941 (0.041)\r\n&gt;RF 0.950 (0.036)\r\n&gt;GBM 0.918 (0.037)<\/pre>\n<p>A figure is created showing one box and whisker plot for each algorithm&rsquo;s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.<\/p>\n<p>We can see that both BAG and RF have a tight distribution and a mean and median that closely align, perhaps suggesting a non-skewed and Gaussian distribution of scores, e.g. 
stable.<\/p>\n<div id=\"attachment_9688\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9688\" class=\"size-full wp-image-9688\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset.png\" alt=\"Box and Whisker Plot of Machine Learning Models on the Imbalanced Mammography Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset-1024x768.png 1024w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plot-of-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9688\" class=\"wp-caption-text\">Box and Whisker Plot of Machine Learning Models on the Imbalanced Mammography Dataset<\/p>\n<\/div>\n<p>Now that we have a good first set of results, let&rsquo;s see if we can improve them with cost-sensitive classifiers.<\/p>\n<h3>Evaluate Cost-Sensitive Algorithms<\/h3>\n<p>Some machine learning algorithms can be adapted to pay more attention to one class than another when fitting the model.<\/p>\n<p>These are referred to as cost-sensitive machine learning models and they can be used for imbalanced classification by specifying a cost that is inversely proportional to the class 
distribution. For example, with a 98 percent to 2 percent distribution for the majority and minority classes, we can give errors on the minority class a weighting of 98 and errors on the majority class a weighting of 2.<\/p>\n<p>Three algorithms that offer this capability are:<\/p>\n<ul>\n<li>Logistic Regression (LR)<\/li>\n<li>Support Vector Machine (SVM)<\/li>\n<li>Random Forest (RF)<\/li>\n<\/ul>\n<p>This can be achieved in scikit-learn by setting the &ldquo;<em>class_weight<\/em>&rdquo; argument to &ldquo;<em>balanced<\/em>&rdquo;, which makes these algorithms cost-sensitive.<\/p>\n<p>For example, the updated <em>get_models()<\/em> function below defines the cost-sensitive version of these three algorithms to be evaluated on our dataset.<\/p>\n<pre class=\"crayon-plain-tag\"># define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='lbfgs', class_weight='balanced'))\r\n\tnames.append('LR')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale', class_weight='balanced'))\r\n\tnames.append('SVM')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000, class_weight='balanced'))\r\n\tnames.append('RF')\r\n\treturn models, names<\/pre>\n<p>Additionally, when exploring the dataset, we noticed that many of the variables had a seemingly exponential data distribution. Sometimes we can better spread the data for a variable by applying a power transform to it. This will be particularly helpful to the LR and SVM algorithms and may also help the RF algorithm.<\/p>\n<p>We can implement this within each fold of the cross-validation model evaluation process using a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.pipeline.Pipeline.html\">Pipeline<\/a>. 
The first step will learn the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.PowerTransformer.html\">PowerTransformer<\/a> on the training set folds and apply it to the training and test set folds. The second step will be the model that we are evaluating. The pipeline can then be evaluated directly using our <em>evaluate_model()<\/em> function, for example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define pipeline steps\r\nsteps = [('p', PowerTransformer()), ('m', models[i])]\r\n# define pipeline\r\npipeline = Pipeline(steps=steps)\r\n# evaluate the pipeline and store results\r\nscores = evaluate_model(X, y, pipeline)<\/pre>\n<p>Tying this together, the complete example of evaluating power-transformed, cost-sensitive machine learning algorithms on the mammography dataset is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># cost-sensitive machine learning algorithms on the mammography dataset\r\nfrom numpy import mean\r\nfrom numpy import std\r\nfrom pandas import read_csv\r\nfrom matplotlib import pyplot\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.model_selection import cross_val_score\r\nfrom sklearn.model_selection import RepeatedStratifiedKFold\r\nfrom sklearn.linear_model import LogisticRegression\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.ensemble import RandomForestClassifier\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# evaluate a model\r\ndef evaluate_model(X, y, model):\r\n\t# define evaluation procedure\r\n\tcv = 
RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\r\n\t# evaluate model\r\n\tscores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)\r\n\treturn scores\r\n\r\n# define models to test\r\ndef get_models():\r\n\tmodels, names = list(), list()\r\n\t# LR\r\n\tmodels.append(LogisticRegression(solver='lbfgs', class_weight='balanced'))\r\n\tnames.append('LR')\r\n\t# SVM\r\n\tmodels.append(SVC(gamma='scale', class_weight='balanced'))\r\n\tnames.append('SVM')\r\n\t# RF\r\n\tmodels.append(RandomForestClassifier(n_estimators=1000, class_weight='balanced'))\r\n\tnames.append('RF')\r\n\treturn models, names\r\n\r\n# define the location of the dataset\r\nfull_path = 'mammography.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define models\r\nmodels, names = get_models()\r\nresults = list()\r\n# evaluate each model\r\nfor i in range(len(models)):\r\n\t# define pipeline steps\r\n\tsteps = [('p', PowerTransformer()), ('m', models[i])]\r\n\t# define pipeline\r\n\tpipeline = Pipeline(steps=steps)\r\n\t# evaluate the pipeline and store results\r\n\tscores = evaluate_model(X, y, pipeline)\r\n\tresults.append(scores)\r\n\t# summarize performance\r\n\tprint('&gt;%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))\r\n# plot the results\r\npyplot.boxplot(results, labels=names, showmeans=True)\r\npyplot.show()<\/pre>\n<p>Running the example evaluates each algorithm in turn and reports the mean and standard deviation of the ROC AUC.<\/p>\n<p>Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.<\/p>\n<p>In this case, we can see that all three of the tested algorithms achieved a lift in ROC AUC compared to their non-transformed and cost-insensitive versions. 
It would be interesting to repeat the experiment without the transform to see whether the transform, the cost-sensitive versions of the algorithms, or both produced the lift in performance.<\/p>\n<p>In this case, we can see the SVM achieved the best performance, performing better than RF in this and the previous section and achieving a mean ROC AUC of about 0.957.<\/p>\n<pre class=\"crayon-plain-tag\">&gt;LR 0.922 (0.036)\r\n&gt;SVM 0.957 (0.024)\r\n&gt;RF 0.951 (0.035)<\/pre>\n<p>Box and whisker plots are then created comparing the distribution of ROC AUC scores.<\/p>\n<p>The SVM distribution appears compact compared to the other two models. As such, its performance is likely stable, and it may make a good choice for a final model.<\/p>\n<div id=\"attachment_9689\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-9689\" class=\"size-full wp-image-9689\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plots-of-Cost-Sensitive-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset.png\" alt=\"Box and Whisker Plots of Cost-Sensitive Machine Learning Models on the Imbalanced Mammography Dataset\" width=\"1280\" height=\"960\" srcset=\"http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plots-of-Cost-Sensitive-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset.png 1280w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plots-of-Cost-Sensitive-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset-300x225.png 300w, http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plots-of-Cost-Sensitive-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset-1024x768.png 1024w, 
http:\/\/3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com\/wp-content\/uploads\/2019\/11\/Box-and-Whisker-Plots-of-Cost-Sensitive-Machine-Learning-Models-on-the-Imbalanced-Mammography-Dataset-768x576.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/p>\n<p id=\"caption-attachment-9689\" class=\"wp-caption-text\">Box and Whisker Plots of Cost-Sensitive Machine Learning Models on the Imbalanced Mammography Dataset<\/p>\n<\/div>\n<p>Next, let&rsquo;s see how we might use a final model to make predictions on new data.<\/p>\n<h2>Make Predictions on New Data<\/h2>\n<p>In this section, we will fit a final model and use it to make predictions on single rows of data.<\/p>\n<p>We will use the cost-sensitive version of the SVM model as the final model, with a power transform applied to the data prior to fitting the model and making a prediction. Using the pipeline will ensure that the transform is always performed correctly on input data.<\/p>\n<p>First, we can define the model as a pipeline.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define model to evaluate\r\nmodel = SVC(gamma='scale', class_weight='balanced')\r\n# power transform then fit model\r\npipeline = Pipeline(steps=[('t', PowerTransformer()), ('m', model)])<\/pre>\n<p>Once defined, we can fit it on the entire training dataset.<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# fit the model\r\npipeline.fit(X, y)<\/pre>\n<p>Once fit, we can use it to make predictions for new data by calling the <em>predict()<\/em> function. 
This will return the class label of 0 for &ldquo;<em>no cancer<\/em>&rdquo;, or 1 for &ldquo;<em>cancer<\/em>&rdquo;.<\/p>\n<p>For example:<\/p>\n<pre class=\"crayon-plain-tag\">...\r\n# define a row of data\r\nrow = [...]\r\n# make prediction with the fit pipeline so the transform is applied\r\nyhat = pipeline.predict([row])<\/pre>\n<p>To demonstrate this, we can use the fit model to predict labels for a few cases where we know whether the case is no cancer or cancer.<\/p>\n<p>The complete example is listed below.<\/p>\n<pre class=\"crayon-plain-tag\"># fit a model and make predictions on the mammography dataset\r\nfrom pandas import read_csv\r\nfrom sklearn.preprocessing import LabelEncoder\r\nfrom sklearn.preprocessing import PowerTransformer\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.pipeline import Pipeline\r\n\r\n# load the dataset\r\ndef load_dataset(full_path):\r\n\t# load the dataset as a pandas DataFrame\r\n\tdata = read_csv(full_path, header=None)\r\n\t# retrieve numpy array\r\n\tdata = data.values\r\n\t# split into input and output elements\r\n\tX, y = data[:, :-1], data[:, -1]\r\n\t# label encode the target variable to have the classes 0 and 1\r\n\ty = LabelEncoder().fit_transform(y)\r\n\treturn X, y\r\n\r\n# define the location of the dataset\r\nfull_path = 'mammography.csv'\r\n# load the dataset\r\nX, y = load_dataset(full_path)\r\n# define model to evaluate\r\nmodel = SVC(gamma='scale', class_weight='balanced')\r\n# power transform then fit model\r\npipeline = Pipeline(steps=[('t', PowerTransformer()), ('m', model)])\r\n# fit the model\r\npipeline.fit(X, y)\r\n# evaluate on some no cancer cases (known class 0)\r\nprint('No Cancer:')\r\ndata = [[0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223],\r\n\t[0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324],\r\n\t[-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324]]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = 
yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 0)' % (label))\r\n# evaluate on some cancer cases (known class 1)\r\nprint('Cancer:')\r\ndata = [[2.0158239,0.15353258,-0.32114211,2.1923706,-0.37786573,0.96176503],\r\n\t[2.3191888,0.72860087,-0.50146835,-0.85955255,-0.37786573,-0.94572324],\r\n\t[0.19224721,-0.2003556,-0.230979,1.2003796,2.2620867,1.132403]]\r\nfor row in data:\r\n\t# make prediction\r\n\tyhat = pipeline.predict([row])\r\n\t# get the label\r\n\tlabel = yhat[0]\r\n\t# summarize\r\n\tprint('&gt;Predicted=%d (expected 1)' % (label))<\/pre>\n<p>Running the example first fits the model on the entire training dataset.<\/p>\n<p>Then the fit model is used to predict the label of no cancer cases chosen from the dataset file. We can see that all cases are correctly predicted.<\/p>\n<p>Then some cases of actual cancer are used as input to the model and the label is predicted. As we might have hoped, the correct labels are predicted for all cases.<\/p>\n<pre class=\"crayon-plain-tag\">No Cancer:\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\n&gt;Predicted=0 (expected 0)\r\nCancer:\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)\r\n&gt;Predicted=1 (expected 1)<\/pre>\n<\/p>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Papers<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.worldscientific.com\/doi\/abs\/10.1142\/S0218001493000698\">Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography<\/a>, 1993.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1106.1813\">SMOTE: Synthetic Minority Over-sampling Technique<\/a>, 2002.<\/li>\n<\/ul>\n<h3>APIs<\/h3>\n<ul>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RepeatedStratifiedKFold.html\">sklearn.model_selection.RepeatedStratifiedKFold API<\/a>.<\/li>\n<li><a 
href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.roc_auc_score.html\">sklearn.metrics.roc_auc_score API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.dummy.DummyClassifier.html\">sklearn.dummy.DummyClassifier API<\/a>.<\/li>\n<li><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.SVC.html\">sklearn.svm.SVC API<\/a>.<\/li>\n<\/ul>\n<h3>Dataset<\/h3>\n<ul>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/mammography.csv\">Mammography Dataset<\/a>.<\/li>\n<li><a href=\"https:\/\/raw.githubusercontent.com\/jbrownlee\/Datasets\/master\/mammography.names\">Mammography Dataset Description<\/a>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop and evaluate models for the imbalanced mammography cancer classification dataset.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to load and explore the dataset and generate ideas for data preparation and model selection.<\/li>\n<li>How to evaluate a suite of machine learning models and improve their performance with cost-sensitive techniques.<\/li>\n<li>How to fit a final model and use it to predict class labels for specific cases.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>\nAsk your questions in the comments below and I will do my best to answer.<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-model-to-detect-microcalcifications\/\">Imbalanced Classification Model to Detect Mammography Microcalcifications<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/machinelearningmastery.com\/\">Machine Learning Mastery<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/machinelearningmastery.com\/imbalanced-classification-model-to-detect-microcalcifications\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Jason Brownlee Cancer detection is a popular example of 
an imbalanced classification problem because there are often significantly more cases of non-cancer than actual [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2020\/03\/01\/imbalanced-classification-model-to-detect-mammography-microcalcifications\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":3195,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3194"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=3194"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/3194\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/3195"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=3194"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=3194"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=3194"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}