Author: Jason Brownlee

Probability is a field of mathematics that quantifies uncertainty.

It is undeniably a pillar of the field of machine learning, and many recommend it as a prerequisite subject to study prior to getting started. This is misleading advice, as probability makes more sense to a practitioner once they have the context of the applied machine learning process in which to interpret it.

In this post, you will discover why machine learning practitioners should study probabilities to improve their skills and capabilities.

After reading this post, you will know:

- Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
- Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
- The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Let’s get started.

## Overview

This tutorial is divided into seven parts; they are:

- Reasons to NOT Learn Probability
- Class Membership Requires Predicting a Probability
- Some Algorithms Are Designed Using Probability
- Models Are Trained Using a Probabilistic Framework
- Models Can Be Tuned With a Probabilistic Framework
- Probabilistic Measures Are Used to Evaluate Model Skill
- One More Reason

## Reasons to NOT Learn Probability

Before we go through the reasons that you should learn probability, let’s start by taking a brief look at the reasons why you should not.

I think you should not study probability if you are just getting started with applied machine learning.

- **It’s not required**. Having an appreciation for the abstract theory that underlies some machine learning algorithms is not required in order to use machine learning as a tool to solve problems.
- **It’s slow**. Taking months to years to study an entire related field before starting machine learning will delay you achieving your goals of being able to work through predictive modeling problems.
- **It’s a huge field**. Not all of probability is relevant to theoretical machine learning, let alone applied machine learning.

I recommend a breadth-first approach to getting started in applied machine learning.

I call this the results-first approach. It is where you start by learning and practicing the steps for working through a predictive modeling problem end-to-end (e.g. how to get results) with a tool (such as scikit-learn and Pandas in Python).

This process then provides the skeleton and context for progressively deepening your knowledge, such as how algorithms work and, eventually, the math that underlies them.

Once you know how to work through a predictive modeling problem, it is time to deepen your understanding of probability. Let’s look at why.

## 1. Class Membership Requires Predicting a Probability

Classification predictive modeling problems are those where an example is assigned a given label.

An example that you may be familiar with is the iris flowers dataset where we have four measurements of a flower and the goal is to assign one of three different known species of iris flower to the observation.

We can model the problem as directly assigning a class label to each observation.

- **Input**: Measurements of a flower.
- **Output**: One iris species.

A more common approach is to frame the problem as a probabilistic class membership, where the probability of an observation belonging to each known class is predicted.

- **Input**: Measurements of a flower.
- **Output**: Probability of membership to each iris species.

Framing the problem as a prediction of class membership probabilities simplifies the modeling problem and makes it easier for a model to learn. It allows the model to capture ambiguity in the data, which in turn allows a downstream process, such as the user, to interpret the probabilities in the context of the domain.

The probabilities can be transformed into a crisp class label by choosing the class with the largest probability. The probabilities can also be scaled or transformed using a probability calibration process.
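As a minimal sketch of that first step, the snippet below turns a set of predicted class membership probabilities into a crisp label. The probability values and the helper name `to_crisp_label` are made up for illustration; in practice the probabilities would come from a fitted model.

```python
def to_crisp_label(probabilities):
    """Return the class name with the largest predicted probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical predicted probabilities for one iris flower (illustrative values only).
predicted = {"setosa": 0.05, "versicolor": 0.70, "virginica": 0.25}

# A class membership prediction should sum to 1 across the known classes.
total = sum(predicted.values())

label = to_crisp_label(predicted)
print(label, total)
```

Note that the crisp label throws away the fact that the model gave a 25% chance to a second species, which is exactly the ambiguity a downstream user may care about.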

Choosing a class membership framing of the problem, and interpreting the predictions made by the model, both require a basic understanding of probability.

## 2. Some Algorithms Are Designed Using Probability

There are algorithms that are specifically designed to harness the tools and methods from probability.

These include individual algorithms, such as the Naive Bayes algorithm, which is constructed using Bayes Theorem with some simplifying assumptions.

- Naive Bayes
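To make the idea concrete, here is a small sketch of the Naive Bayes calculation for categorical features. The function name, the spam/ham classes, and all of the probabilities are hypothetical values chosen for illustration; the simplifying ("naive") assumption is that features are conditionally independent given the class, so their likelihoods can simply be multiplied.

```python
def naive_bayes_posterior(priors, likelihoods, observation):
    """Apply Bayes Theorem with the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: {value: P(value | class)}}}
    observation: {feature: value}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in observation.items():
            # Multiply in P(feature = value | class) for each feature.
            score *= likelihoods[cls][feature][value]
        scores[cls] = score
    # Normalize so the posterior probabilities sum to 1.
    evidence = sum(scores.values())
    return {cls: score / evidence for cls, score in scores.items()}


# Hypothetical numbers for a toy spam filter (not taken from any real dataset).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"has_offer": {True: 0.8, False: 0.2}, "has_link": {True: 0.5, False: 0.5}},
    "ham": {"has_offer": {True: 0.1, False: 0.9}, "has_link": {True: 0.2, False: 0.8}},
}

posterior = naive_bayes_posterior(priors, likelihoods, {"has_offer": True, "has_link": True})
print(posterior)
```

Even though ham is the more likely class before seeing any data, the two observed features shift most of the probability mass to spam.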

It also extends to whole fields of study, such as probabilistic graphical models, often called graphical models or PGM for short, and designed around Bayes Theorem.

- Probabilistic Graphical Models

A notable type of graphical model is the Bayesian Belief Network, or Bayes Net, which is capable of capturing the conditional dependencies between variables.

- Bayesian Belief Networks

## 3. Models Are Trained Using a Probabilistic Framework

Many machine learning models are trained using an iterative algorithm designed under a probabilistic framework.

Perhaps the most common is the framework of maximum likelihood estimation, sometimes shortened to MLE. This is a framework for estimating model parameters (e.g. weights) given observed data.

For example, under the assumption of Gaussian noise, this is the framework that underlies the ordinary least squares estimate of a linear regression model.
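A minimal sketch of maximum likelihood estimation, using a much simpler model than linear regression, may help. Here the single parameter is the probability `p` of a coin landing heads, the data is a made-up sequence of flips, and the search is a brute-force scan over candidate values rather than the iterative optimization a real library would use.

```python
import math


def log_likelihood(p, data):
    """Log-likelihood of Bernoulli parameter p given a list of 0/1 outcomes."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)


# Hypothetical observations: 7 heads (1) and 3 tails (0).
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Candidate parameter values 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]

# The maximum likelihood estimate is the candidate that makes the data most probable.
best_p = max(candidates, key=lambda p: log_likelihood(p, data))
print(best_p)
```

The estimate lands on the fraction of heads in the data, which is the known closed-form maximum likelihood solution for this model.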

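The expectation-maximization algorithm, or EM for short, is an approach for maximum likelihood estimation often used for unsupervised data clustering, e.g. fitting a mixture of Gaussian distributions. The k-Means clustering algorithm, which estimates k means for k clusters, can be viewed as a closely related variant that assigns each example to exactly one cluster.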

For models that predict class membership, maximum likelihood estimation provides the framework for minimizing the difference or divergence between an observed and predicted probability distribution. This is used in classification algorithms like logistic regression as well as deep learning neural networks.

It is common to measure this difference between probability distributions during training using cross-entropy. Entropy, cross-entropy, and the KL divergence between distributions come from the field of information theory, which builds directly upon probability theory. For example, the information of an event is calculated directly as the negative log of its probability, and entropy is the expected value of this quantity across all events.
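The sketch below calculates entropy, cross-entropy, and KL divergence for two small discrete distributions so you can see how they relate. The distributions `p` and `q` are arbitrary illustrative values, and natural logarithms are used, so the results are in nats rather than bits.

```python
import math


def entropy(p):
    """Expected negative log probability of events drawn from p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected negative log probability under q of events drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Extra cost of describing events from p using q instead of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical "observed" (p) and "predicted" (q) distributions over three classes.
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
print(h_p, h_pq, kl_pq)
```

Cross-entropy equals the entropy of `p` plus the KL divergence from `p` to `q`, so for a fixed target distribution, minimizing cross-entropy and minimizing KL divergence lead to the same predicted distribution.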

## 4. Models Can Be Tuned With a Probabilistic Framework

It is common to tune the hyperparameters of a machine learning model, such as k for kNN or the learning rate in a neural network.

Typical approaches include grid searching ranges of hyperparameters or randomly sampling hyperparameter combinations.
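As a rough sketch of these two baseline approaches, the snippet below runs a grid search and a random search over two hyperparameters. The `score` function is a hypothetical stand-in for a real evaluation, such as cross-validated accuracy of a kNN model, and the hyperparameter names and ranges are assumptions chosen for illustration.

```python
import itertools
import random


def score(k, weight):
    """Hypothetical stand-in for a model evaluation (higher is better)."""
    return -((k - 7) ** 2) - (0.0 if weight == "distance" else 1.0)


# Grid search: evaluate every combination of the listed values.
grid = {"k": [1, 3, 5, 7, 9], "weight": ["uniform", "distance"]}
grid_configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best_grid = max(grid_configs, key=lambda config: score(**config))

# Random search: evaluate a fixed budget of randomly sampled combinations.
rng = random.Random(1)
random_configs = [
    {"k": rng.randint(1, 15), "weight": rng.choice(["uniform", "distance"])}
    for _ in range(10)
]
best_random = max(random_configs, key=lambda config: score(**config))

print(best_grid, best_random)
```

Neither approach uses the results of earlier evaluations to decide what to try next.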

Bayesian optimization is a more efficient approach to hyperparameter optimization that involves a directed search of the space of possible configurations, focusing on those configurations that are most likely to result in better performance.

As its name suggests, the approach was devised from and harnesses Bayes Theorem when sampling the space of possible configurations.

## 5. Probabilistic Measures Are Used to Evaluate Model Skill

For those algorithms where a prediction of probabilities is made, evaluation measures are required to summarize the performance of the model.

There are many measures used to summarize the performance of a model based on predicted probabilities. Common examples include aggregate measures like log loss and Brier score.
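Both measures are simple to calculate by hand for a binary classification problem. The sketch below uses made-up labels and predicted probabilities; the function names are my own, and the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and label (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels and two sets of predicted probabilities for class 1.
y_true = [1, 0, 1, 1, 0]
skillful = [0.9, 0.1, 0.8, 0.7, 0.2]
no_skill = [0.5, 0.5, 0.5, 0.5, 0.5]

print(log_loss(y_true, skillful), brier_score(y_true, skillful))
print(log_loss(y_true, no_skill), brier_score(y_true, no_skill))
```

Always predicting 0.5 gives a log loss of log(2) and a Brier score of 0.25, which makes a useful reference point when deciding whether a model has any skill.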

For binary classification tasks where a single probability score is predicted, Receiver Operating Characteristic, or ROC, curves can be constructed to explore different cut-offs that can be used when interpreting the prediction that, in turn, result in different trade-offs. The area under the ROC curve, or ROC AUC, can also be calculated as an aggregate measure.
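To show what the curve is made of, here is a minimal sketch that sweeps each distinct predicted score as a cut-off, records the false positive rate and true positive rate at each one, and then integrates the curve with the trapezoidal rule. The labels and scores are illustrative values, the function names are my own, and the sketch assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per cut-off."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Sweep cut-offs from the strictest (highest score) to the most lenient.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = area_under_curve(points)
print(points, roc_auc)
```

Each point on the curve is one possible trade-off between catching more positives and raising more false alarms, while the area summarizes the ranking quality of the scores across all cut-offs.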

Choice and interpretation of these scoring methods require a foundational understanding of probability theory.

## One More Reason

If I could give one more reason, it would be: Because it is fun.

Seriously.

Learning probability, at least the way I teach it with practical examples and executable code, is a lot of fun. Once you can see how the operations work on real data, it is hard to avoid developing a strong intuition for a subject that is often quite unintuitive.

Do you have more reasons why it is critical for an intermediate machine learning practitioner to learn probability?

Let me know in the comments below.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Pattern Recognition and Machine Learning, 2006.
- Machine Learning: A Probabilistic Perspective, 2012.
- Machine Learning, 1997.

### Posts

- A Gentle Introduction to Probability Scoring Methods in Python
- How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python
- How to Choose Loss Functions When Training Deep Learning Neural Networks

### Articles

- Graphical model, Wikipedia.
- Maximum likelihood estimation, Wikipedia.
- Expectation-maximization algorithm, Wikipedia.
- Cross entropy, Wikipedia.
- Kullback-Leibler divergence, Wikipedia.
- Bayesian optimization, Wikipedia.

## Summary

In this post, you discovered why, as a machine learning practitioner, you should deepen your understanding of probability.

Specifically, you learned:

- Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
- Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
- The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 5 Reasons to Learn Probability for Machine Learning appeared first on Machine Learning Mastery.