Author: Jason Brownlee

Bayes Theorem provides a principled way of calculating a conditional probability.

It is a deceptively simple calculation, yet it can be used to easily calculate the conditional probability of events where intuition often fails.

Bayes Theorem also provides a way for thinking about the evaluation and selection of different models for a given dataset in applied machine learning. Maximizing the probability of a model fitting a dataset is more generally referred to as maximum a posteriori, or MAP for short, and provides a probabilistic framework for predictive modeling.

In this post, you will discover Bayes Theorem for calculating conditional probabilities.

After reading this post, you will know:

- An intuition for Bayes Theorem from a perspective of conditional probability.
- An intuition for Bayes Theorem from the perspective of machine learning.
- How to calculate conditional probability using Bayes Theorem for a real world example.

Discover Bayes optimization, naive Bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

**Update Oct/2019**: Join the discussion about this tutorial on HackerNews.

## Overview

This tutorial is divided into three parts; they are:

- Bayes Theorem of Conditional Probability
- Bayes Theorem of Modeling Hypotheses
- Worked Example of Bayes Theorem

## Bayes Theorem of Conditional Probability

Before we dive into Bayes theorem, let’s review marginal, joint, and conditional probability (for more details see this longer tutorial).

Recall that marginal probability is the probability of an event, irrespective of other random variables. If the random variable is independent of other variables, then the marginal probability is the probability of the event directly; otherwise, if the variable depends on other variables, then the marginal probability is the probability of the event summed over all outcomes of those other variables, a calculation called the sum rule.

**Marginal Probability**: The probability of an event irrespective of the outcomes of other random variables, e.g. P(A).

The joint probability is the probability of two (or more) simultaneous events, often described in terms of events A and B from two dependent random variables, e.g. X and Y. The joint probability is often written in terms of just the outcomes, e.g. P(A and B).

**Joint Probability**: Probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

The conditional probability is the probability of one event given the occurrence of another event, often described in terms of events A and B from two dependent random variables e.g. X and Y.

**Conditional Probability**: Probability of one (or more) event given the occurrence of another event, e.g. P(A given B) or P(A | B).

The joint probability can be calculated using the conditional probability; for example:

- P(A, B) = P(A | B) * P(B)

This is called the product rule. Importantly, the joint probability is symmetrical, meaning that:

- P(A, B) = P(B, A)

The conditional probability can be calculated using the joint probability; for example:

- P(A | B) = P(A, B) / P(B)

The conditional probability is not symmetrical; for example:

- P(A | B) != P(B | A)
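These relationships can be checked with a tiny sketch in Python. The joint and marginal probabilities below are made-up values purely for illustration:

```python
# demonstrate the product rule and the asymmetry of conditional probability
# contrived probabilities for two events A and B
p_a_and_b = 0.1   # P(A, B), the joint probability
p_a = 0.25        # P(A)
p_b = 0.4         # P(B)
# conditional probability from the joint: P(A|B) = P(A, B) / P(B)
p_a_given_b = p_a_and_b / p_b
# reverse conditional: P(B|A) = P(A, B) / P(A)
p_b_given_a = p_a_and_b / p_a
# product rule recovers the joint: P(A, B) = P(A|B) * P(B)
print(p_a_given_b * p_b)          # 0.1
# the two conditionals are not equal
print(p_a_given_b, p_b_given_a)   # 0.25 0.4
```

As expected, P(A|B) and P(B|A) differ, even though both are derived from the same joint probability.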

Nevertheless, one conditional probability can be calculated using the other conditional probability; for example:

- P(A|B) = P(B|A) * P(A) / P(B)

The reverse is also true; for example:

- P(B|A) = P(A|B) * P(B) / P(A)

This alternate approach of calculating the conditional probability is useful either when the joint probability is challenging to calculate, or when the reverse conditional probability is available or easy to calculate.

This alternate calculation of the conditional probability is referred to as Bayes Rule or Bayes Theorem, named for Reverend Thomas Bayes, who is credited with first describing it. It is grammatically correct to refer to it as Bayes’ Theorem (with the apostrophe), but it is common to omit the apostrophe for simplicity.

It is often the case that we do not have access to the denominator, e.g. P(B).

As such, there is an alternate formulation of Bayes Theorem in which the denominator P(B) is expanded using the complement of A, where the probability of not A is given by 1 – P(A), stated as P(not A). This alternate formulation is described below:

- P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))
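As a quick sanity check with made-up numbers, the expanded denominator is just P(B) computed via the sum rule, so both forms of Bayes Theorem give the same posterior:

```python
# verify that the expanded denominator equals P(B), using contrived values
p_a = 0.3               # P(A)
p_b_given_a = 0.6       # P(B|A)
p_b_given_not_a = 0.2   # P(B|not A)
p_not_a = 1 - p_a       # P(not A)
# sum rule: P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
# Bayes Theorem with the denominator given directly
posterior_direct = p_b_given_a * p_a / p_b
# Bayes Theorem with the expanded denominator
posterior_expanded = (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_not_a * p_not_a)
print(posterior_direct, posterior_expanded)
```

Both calculations produce the same result (0.5625 with these values), confirming the two formulations are equivalent.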

### Want to Learn Probability for Machine Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Bayes Theorem of Modeling Hypotheses

Bayes Theorem is a useful tool in applied machine learning.

It provides a way of thinking about the relationship between data and a model.

A machine learning algorithm or model is a specific way of thinking about the structured relationships in the data. In this way, a model can be thought of as a hypothesis about the relationships in the data, such as the relationship between input (*X*) and output (*y*). The practice of applied machine learning is the testing and analysis of different hypotheses (models) on a given dataset.

If this idea of thinking of a model as a hypothesis is new to you, see this tutorial on the topic.

Bayes Theorem provides a probabilistic model to describe the relationship between data (*D*) and a hypothesis (*h*); for example:

- P(h|D) = P(D|h) * P(h) / P(D)

Breaking this down, it says that the probability of a given hypothesis holding or being true given some observed data can be calculated as the probability of observing the data given the hypothesis multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.

Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.

— Page 156, Machine Learning, 1997.

Under this framework, each piece of the calculation has a specific name; for example:

- P(h|D): Posterior probability of the hypothesis given the data (the thing we want to calculate).
- P(h): Prior probability of the hypothesis.
- P(D|h): Likelihood of the data given the hypothesis.
- P(D): Probability of the data regardless of the hypothesis, sometimes called the evidence.

This gives a useful framework for thinking about and modeling a machine learning problem.

If we have some prior domain knowledge about the hypothesis, this is captured in the prior probability. If we don’t, then all hypotheses may have the same prior probability.

If the probability of observing the data P(D) increases, then the probability of the hypothesis holding given the data P(h|D) decreases. Conversely, if the probability of the hypothesis P(h) or the probability of observing the data given the hypothesis P(D|h) increases, then the probability of the hypothesis holding given the data P(h|D) increases.

The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, … in H) being true given the observed data.

Seeking the hypothesis with the maximum posterior probability is called maximum a posteriori estimation, or MAP for short.

Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

— Page 157, Machine Learning, 1997.

Under this framework, the probability of the data (D) is constant as it is used in the assessment of each hypothesis. Therefore, it can be removed from the calculation to give the simplified unnormalized estimate as follows:

- max h in H P(D|h) * P(h)

If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform prior probability; this term then becomes a constant too and can be removed from the calculation to give the following:

- max h in H P(D|h)

That is, the goal is to locate a hypothesis that best explains the observed data.
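The MAP idea above can be sketched in a few lines of Python. The candidate hypotheses, their priors, and their likelihoods below are invented purely for illustration:

```python
# sketch of MAP hypothesis selection: score each hypothesis by P(D|h) * P(h)
# contrived priors and likelihoods for three candidate hypotheses
priors = {'h1': 0.5, 'h2': 0.3, 'h3': 0.2}       # P(h)
likelihoods = {'h1': 0.1, 'h2': 0.4, 'h3': 0.3}  # P(D|h)
# unnormalized posterior score for each hypothesis
scores = {h: likelihoods[h] * priors[h] for h in priors}
# the MAP hypothesis maximizes P(D|h) * P(h)
map_h = max(scores, key=scores.get)
print(map_h)  # h2
```

Note that h2 wins even though h1 has the largest prior, because the data is much more likely under h2; dividing each score by P(D) would not change the ranking, which is why the denominator can be dropped.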

## Worked Example of Bayes Theorem

Bayes theorem is best understood with a real-life worked example with real numbers to demonstrate the calculations.

An excellent and widely used example of the benefit of Bayes Theorem is in the analysis of a medical diagnostic test.

Problem: Consider a human population that may or may not have cancer (Cancer is True or False) and a medical test that returns positive or negative for detecting cancer (Test is Positive or Negative), e.g. like a mammogram for detecting breast cancer. If a patient has the test and it comes back positive, what is the probability that the patient has cancer?

Medical diagnostic tests are not perfect; they make errors. Sometimes a patient will have cancer, but the test will not detect it. This ability of the test to detect cancer is referred to as the **sensitivity**, or the true positive rate.

In this case, we will contrive a value for the test. The test is good, but not great, with a true positive rate or sensitivity of 85%. That is, of all the people who have cancer and are tested, 85% of them will get a positive result from the test.

- P(Test=Positive | Cancer=True) = 0.85

Given this information, our intuition would suggest that there is an 85% probability that the patient has cancer.

And again, **our intuitions of probability are wrong**.

This type of error in interpreting probabilities is so common that it has its own name; it is referred to as the base rate fallacy.

It has this name because the error in estimating the probability of an event is caused by ignoring the base rate. That is, ignoring the probability of having cancer, regardless of the diagnostic test.

In this case, we can assume the probability of breast cancer is low, and use a contrived value of one person in 5,000, or 0.0002 (0.02%).

- P(Cancer=True) = 0.02%.

We can correctly calculate the probability of a patient having cancer given a positive test result using Bayes Theorem.

- P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)

We know the probability of the test being positive given that the patient has cancer is 85%, and we know the base rate or the prior probability of a given patient having cancer is 0.02%; we can plug these values in:

- P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / P(Test=Positive)

We don’t know P(Test=Positive), but we do know the complement probability of P(Cancer=True), that is P(Cancer=False):

- P(Cancer=False) = 1 – P(Cancer=True)
- = 1 – 0.0002
- = 0.9998

We can therefore state the calculation as follows:

- P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / (P(Test=Positive|Cancer=True) * P(Cancer=True) + P(Test=Positive|Cancer=False) * P(Cancer=False))

We can plug in our known values as follows:

- P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / (0.85 * 0.0002 + P(Test=Positive|Cancer=False) * 0.9998)

We still do not know the probability of a positive test result given no cancer. This requires additional information.

Specifically, we need to know how good the test is at correctly identifying people that do not have cancer. That is, returning a negative result (Test=Negative) when the patient does not have cancer (Cancer=False), called the true negative rate or the **specificity**. We will use a contrived value of 95%.

- P(Test=Negative | Cancer=False) = 0.95

Using this, we can calculate the false positive or false alarm rate as the complement of the true negative rate.

- P(Test=Positive|Cancer=False) = 1 – P(Test=Negative | Cancer=False)
- = 1 – 0.95
- = 0.05

We can plug this false alarm rate into our Bayes Theorem as follows:

- P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / (0.85 * 0.0002 + 0.05 * 0.9998)
- = 0.00017 / (0.00017 + 0.04999)
- = 0.00017 / 0.05016
- = 0.003389154704944

The correct calculation suggests that if a patient receives a positive result from this test, there is only a 0.33% chance that they actually have cancer. It is a terrible diagnostic test!

The example also shows that correct calculation of the conditional probability requires additional information, such as the **base rate**, the **sensitivity** (or true positive rate), and the **specificity** (or true negative rate). In this case:

- **Base Rate**: 0.02% of people have cancer.
- **Sensitivity**: 85% of people with cancer will get a positive test result.
- **Specificity**: 95% of people without cancer will get a negative test result.

To make this example concrete, the example below performs the same calculation in Python, allowing you to play with the parameters and test different scenarios.

```python
# example of calculating bayes theorem for a diagnostic test

# calculate P(A|B) given P(A), P(B|A), P(not B|not A)
def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a):
    # calculate P(not A)
    not_a = 1 - p_a
    # calculate P(B|not A)
    b_given_not_a = 1 - p_not_b_given_not_a
    # calculate P(A|B)
    p_a_given_b = (p_b_given_a * p_a) / (p_b_given_a * p_a + b_given_not_a * not_a)
    return p_a_given_b

# P(A), the base rate
base_rate = 1 / 5000
# P(B|A), the sensitivity
sensitivity = 0.85
# P(not B|not A), the specificity
specificity = 0.95
# calculate P(A|B)
result = bayes_theorem(base_rate, sensitivity, specificity)
# summarize
print('P(A|B) = %.3f%%' % (result * 100))
```

Running the example calculates the probability that a patient has cancer given the test returns a positive result, matching our manual calculation.

```
P(A|B) = 0.339%
```

We can see that Bayes Theorem allows us to be even more precise.

For example, if we had more information about the patient (e.g. their age) and about the domain (e.g. cancer rates for age ranges), we could offer an even more accurate probability estimate.
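To see how sensitive the posterior is to such extra information, the same calculation can be re-run over a range of base rates while keeping the sensitivity and specificity fixed. The function is restated here so the snippet stands alone, and the larger base rates below are hypothetical values chosen for illustration:

```python
# explore how the posterior changes as the base rate changes
def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a):
    # P(B|not A) is the false positive rate, the complement of the specificity
    b_given_not_a = 1 - p_not_b_given_not_a
    # Bayes Theorem with the expanded denominator
    return (p_b_given_a * p_a) / (p_b_given_a * p_a + b_given_not_a * (1 - p_a))

sensitivity = 0.85   # P(Test=Positive | Cancer=True)
specificity = 0.95   # P(Test=Negative | Cancer=False)
# hypothetical base rates: 1 in 5,000, 1 in 100, 1 in 10
for base_rate in [1 / 5000, 1 / 100, 1 / 10]:
    posterior = bayes_theorem(base_rate, sensitivity, specificity)
    print('base rate=%.4f -> P(Cancer|Positive)=%.3f%%' % (base_rate, posterior * 100))
```

With these contrived numbers, the same positive test result means roughly a 0.3% chance of cancer at a 1-in-5,000 base rate, but closer to a 65% chance at a 1-in-10 base rate, which is exactly why ignoring the base rate is such a costly mistake.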

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Pattern Recognition and Machine Learning, 2006.
- Machine Learning, 1997.
- Pattern Classification, 2nd Edition, 2001.
- Machine Learning: A Probabilistic Perspective, 2012.

### Articles

- Conditional probability, Wikipedia.
- Bayes’ theorem, Wikipedia.
- Maximum a posteriori estimation, Wikipedia.
- False positives and false negatives, Wikipedia.
- Base rate fallacy, Wikipedia.
- Sensitivity and specificity, Wikipedia.

## Summary

In this post, you discovered Bayes Theorem for calculating conditional probabilities.

Specifically, you learned:

- An intuition for Bayes Theorem from a perspective of conditional probability.
- An intuition for Bayes Theorem from the perspective of machine learning.
- How to calculate conditional probability using Bayes Theorem for a real world example.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Bayes Theorem for Machine Learning appeared first on Machine Learning Mastery.