Best Resources for Imbalanced Classification

Author: Jason Brownlee

Classification is a predictive modeling problem that involves predicting a class label for a given example.

It is generally assumed that the distribution of examples in the training dataset is even across all of the classes. In practice, this is rarely the case.

Classification predictive modeling problems where the distribution of examples across class labels is not equal (i.e. is skewed) are called “imbalanced classification.”

Typically, a slight imbalance is not a problem and standard machine learning techniques can be used. In those cases where the imbalance is severe, such as a 1:100, 1:1000, or higher ratio of the minority to the majority class, then specialized techniques are required.
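To make these ratios concrete, the short sketch below counts the examples in each class and reports the minority-to-majority ratio. The label list and the imbalance_ratio() helper are made up for illustration and are not part of any library.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (minority count, majority count, majority/minority ratio)."""
    counts = Counter(labels)
    minority = min(counts.values())
    majority = max(counts.values())
    return minority, majority, majority / minority


# Hypothetical labels: 10 minority examples among 1,000 majority examples.
y = [1] * 10 + [0] * 1000

minority, majority, ratio = imbalance_ratio(y)
print(f"Minority: {minority}, Majority: {majority}, Ratio: 1:{ratio:.0f}")
```

Running a check like this on your own labels is a quick way to see whether the imbalance is slight or severe.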

Specialized techniques are required for classification problems with a severe class imbalance because most machine learning models used for classification were designed and tested around the assumption that the class distribution is equal. As such, they often perform poorly on the minority class or produce misleading results.
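One common way results become misleading is through classification accuracy. The sketch below, using made-up labels with a 1:100 ratio, scores a naive "model" that always predicts the majority class. It is an illustration of the problem under these assumptions, not a benchmark.

```python
import numpy as np

# Hypothetical 1:100 dataset: 10 minority (1) and 1,000 majority (0) examples.
y_true = np.array([1] * 10 + [0] * 1000)

# A naive model that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Fraction of all examples predicted correctly.
accuracy = (y_pred == y_true).mean()

# Fraction of minority examples predicted correctly (recall on class 1).
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.3f}")
print(f"Minority recall: {minority_recall:.3f}")
```

The accuracy looks excellent even though not a single minority example is detected, which is why metrics other than accuracy are usually preferred for imbalanced problems.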

In this tutorial, you will discover the best resources that you can use to get started with imbalanced classification.

After completing this tutorial, you will know:

The best books on the topic of machine learning for imbalanced classification.
The best survey papers that introduce the topic of class imbalance.
The best Python libraries that you can use to develop solutions for your imbalanced dataset.

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

Books on Imbalanced Classification
Survey Papers on Imbalanced Classification
Python Libraries for Imbalanced Classification

Books on Imbalanced Classification

Addressing imbalanced classification predictive modeling problems with machine learning is a relatively new area of study.

Nevertheless, given the pervasiveness of imbalanced classification datasets, a few books and book chapters are available on the topic.

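In this section, we will take a closer look at the following books on imbalanced classification for machine learning:

Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
Learning from Imbalanced Data Sets, 2018.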

I will also include the following book that features a dedicated chapter on the topic:

Applied Predictive Modeling, 2013.

There are two other books I found that are related, but perhaps more tangentially, and I won’t cover them in more detail; they were:

Let’s take a closer look at the books.

Imbalanced Learning: Foundations, Algorithms, and Applications

This book is a collection of papers that form chapters, edited by two academics that have written a lot on the topic: Haibo He and Yunqian Ma.

The book was published in 2013.

Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

The book is designed to bring a postgraduate student or academic up to speed with the field of imbalanced learning. This is a more general field than imbalanced classification, as it includes other problem types where the training dataset may be imbalanced, such as regression and clustering.

Specifically, we define imbalanced learning as the learning process for data representation and information extraction with severe data distribution skews to develop effective decision boundaries to support the decision-making process. The learning process could involve supervised learning, unsupervised learning, semi-supervised learning, or a combination of two or all of them. The task of imbalanced learning could also be applied to regression, classification, or clustering tasks.

— Pages 1-2, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

It provides an excellent starting point for a practitioner to get an overview of the field and the techniques.

The table of contents for this book is listed below.

1. Introduction
2. Foundations of Imbalanced Learning
3. Imbalanced Datasets: From Sampling to Classifiers
4. Ensemble Methods for Class Imbalance Learning
5. Class Imbalance Learning Methods for Support Vector Machines
6. Class Imbalance and Active Learning
7. Nonstationary Stream Data Learning with Imbalanced Class Distribution
8. Assessment Metrics for Imbalanced Learning

Learn more about the book here.

Learning from Imbalanced Data Sets

This book is also a collection of papers on the topic of machine learning for imbalanced datasets, although it feels more cohesive than the previous book, “Imbalanced Learning.”

The book was written or edited by a laundry list of academics (Alberto Fernández, Salvador García, Mikel Galar, Ronaldo Prati, Bartosz Krawczyk, and Francisco Herrera) and was published in 2018.

Learning from Imbalanced Data Sets, 2018.

Similar to the previous book, this book is designed to bring postgraduate students and engineers up to speed with the field of machine learning for imbalanced datasets.

The intended audience of this book are developers and engineers aiming to apply imbalance-learning techniques to solve different kinds of real-world problems, as well as researchers and students needing a comprehensive review on techniques, methodologies, and tools for learning from imbalanced data.

— Page viii, Learning from Imbalanced Data Sets, 2018.

The book reads as more systematic (e.g. working through a project end-to-end) and practical than the previous book, which reads as more academic (pet methods or subfields). I would recommend buying both together if you have the budget.

The table of contents for this book is listed below.

1. Introduction to KDD and Data Science
2. Foundations on Imbalanced Classification
3. Performance Measures
4. Cost-Sensitive Learning
5. Data Level Preprocessing Methods
6. Algorithm-Level Approaches
7. Ensemble Learning
8. Imbalanced Classification with Multiple Classes
9. Dimensionality Reduction for Imbalanced Learning
10. Data Intrinsic Characteristics
11. Learning from Imbalanced Data Streams
12. Non-classical Imbalanced Classification Problems
13. Imbalanced Classification for Big Data
14. Software and Libraries for Imbalanced Classification

Learn more about the book here.

Applied Predictive Modeling

This is one of my favorite handbooks for applied machine learning, written by Max Kuhn and Kjell Johnson and focused on R.

The book was published in 2013, but the general advice is probably timeless.

Applied Predictive Modeling, 2013.

Although the whole book is a great read, it has one chapter dedicated to the problem of imbalanced classification.

Chapter 16: Remedies for Severe Class Imbalance

The approach to the chapter is a case study on a “Caravan Policy Ownership” dataset. The authors work through this problem to demonstrate a suite of different practical techniques for handling a severe class imbalance.

This chapter is required reading for a practical demonstration on how to work through a real-world imbalanced dataset using modern methods.

The sections of this chapter are as follows:

16.1 Case Study: Predicting Caravan Policy Ownership
16.2 The Effect of Class Imbalance
16.3 Model Tuning
16.4 Alternate Cutoffs
16.5 Adjusting Prior Probabilities
16.6 Unequal Case Weights
16.7 Sampling Methods
16.8 Cost-Sensitive Training
16.9 Computing

Learn more about the book here.

Survey Papers on Imbalanced Classification

There are thousands of publications on machine learning methods for imbalanced classification and related problems and techniques.

Instead of enumerating the best papers in the field, in this section, we will take a look at some of the best survey papers.

A survey paper is a paper that gives a broad overview of a field, positions the techniques within it, and shows how they relate to each other. Survey papers are designed to help newcomers to the field, such as postgraduate students and engineers, get up to speed rapidly.

As a practitioner, reading a survey paper may be more efficient than skimming books on the topic.

There are many great survey papers to choose from; my recommended favorites are as follows:

Learning From Imbalanced Data: Open Challenges And Future Directions, Bartosz Krawczyk, 2016.
A Survey of Predictive Modelling under Imbalanced Distributions, Paula Branco, Luis Torgo, and Rita Ribeiro, 2015.
Classification Of Imbalanced Data: A Review, Yanmin Sun, Andrew Wong, Mohamed Kamel, 2009.
Learning from Imbalanced Data, Haibo He and Edwardo Garcia, 2009.

I also recommend study papers, that is, papers that compare one or more standard techniques on a suite of standard machine learning datasets. In this case, the techniques are designed to address the imbalanced class distribution and the standard datasets have a skewed class distribution.

These papers quickly reveal what methods work (or are popular) and what datasets are useful as benchmarks.

Some examples of good papers of this type include:

Python Libraries for Imbalanced Classification

Python has rapidly become the preferred programming language for applied machine learning.

Scikit-Learn Library

The go-to library for machine learning in Python is scikit-learn, which provides data preparation, machine learning algorithms, and model evaluation schemes, among other techniques.

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language.

— Scikit-learn: Machine Learning in Python, 2011.

Although not designed around the problem of imbalanced classification, the scikit-learn library does provide some tools for handling imbalanced datasets, such as:

Support for a range of metrics, e.g. ROC AUC, precision and recall, F1, Brier score and more.
Support for class weighting in a number of algorithms, e.g. decision trees, SVM and more.
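As a minimal sketch of both points, the example below fits a class-weighted decision tree on a synthetic imbalanced dataset and reports several of the metrics listed above. The dataset parameters (sample size, roughly 1:100 weighting, random seeds) are arbitrary choices for illustration, and the scores will vary with them.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (brier_score_loss, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset with roughly a 1:100 class distribution (illustrative).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.99, 0.01],
                           flip_y=0, random_state=1)

# Stratify the split so both halves keep the same class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

# class_weight="balanced" weights classes inversely to their frequency.
model = DecisionTreeClassifier(class_weight="balanced", random_state=1)
model.fit(X_train, y_train)

preds = model.predict(X_test)
probs = model.predict_proba(X_test)[:, 1]  # probability of the minority class

auc = roc_auc_score(y_test, probs)
precision = precision_score(y_test, preds, zero_division=0)
recall = recall_score(y_test, preds, zero_division=0)
f1 = f1_score(y_test, preds, zero_division=0)
brier = brier_score_loss(y_test, probs)

print(f"ROC AUC: {auc:.3f}")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
print(f"Brier score: {brier:.3f}")
```

Other estimators that accept a class_weight argument, such as SVC, can be swapped in for the decision tree without changing the rest of the code.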

Imbalanced-Learn Library

A project related to scikit-learn dedicated to the problem of imbalanced classification is called imbalanced-learn.

It provides techniques that can be used for imbalanced classification in conjunction with the scikit-learn library, allowing learning algorithms and model evaluation techniques to be shared between the libraries.

imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition.

— Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, 2016.

The library focuses on providing oversampling and undersampling techniques to make the class distribution more equal in a training dataset prior to fitting a given machine learning model.
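The idea behind oversampling can be sketched in a few lines. Note that the example below does not use imbalanced-learn itself. It is a plain NumPy stand-in for the simplest technique, random oversampling, and the random_oversample() helper and the data are hypothetical. The library provides ready-made classes for this, such as RandomOverSampler and SMOTE, which are applied via a fit_resample() method.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many examples as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for label, count in zip(classes, counts):
        label_idx = np.flatnonzero(y == label)
        # Sample with replacement to make up the shortfall for this class.
        extra = rng.choice(label_idx, size=target - count, replace=True)
        selected.append(np.concatenate([label_idx, extra]))
    keep = np.concatenate(selected)
    return X[keep], y[keep]


# Hypothetical training data: 1,000 majority and 10 minority examples.
X = np.arange(2020, dtype=float).reshape(1010, 2)
y = np.array([0] * 1000 + [1] * 10)

X_res, y_res = random_oversample(X, y)
print("Before:", np.bincount(y))
print("After:", np.bincount(y_res))
```

Resampling of this kind should be applied to the training dataset only, so that the test data keeps the original class distribution.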

For more on imbalanced-learn, see:

Summary

In this tutorial, you discovered the best resources that you can use to get started with imbalanced classification.

Specifically, you learned:

The best books on the topic of machine learning for imbalanced classification.
The best survey papers that introduce the topic of class imbalance.
The best Python libraries that you can use to develop solutions for your imbalanced dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Best Resources for Imbalanced Classification appeared first on Machine Learning Mastery.
