A Gentle Introduction to Uncertainty in Machine Learning

Author: Jason Brownlee

Applied machine learning requires managing uncertainty.

There are many sources of uncertainty in a machine learning project, including variance in the specific data values, the sample of data collected from the domain, and in the imperfect nature of any models developed from such data.

Managing the uncertainty that is inherent in machine learning for predictive modeling can be achieved via the tools and techniques from probability, a field specifically designed to handle uncertainty.

In this post, you will discover the challenge of uncertainty in machine learning.

After reading this post, you will know:

Uncertainty is the biggest source of difficulty for beginners in machine learning, especially developers.
Noise in data, incomplete coverage of the domain, and imperfect models provide the three main sources of uncertainty in machine learning.
Probability provides the foundation and tools for quantifying, handling, and harnessing uncertainty in applied machine learning.

Let’s get started.


Overview

This tutorial is divided into five parts; they are:

Uncertainty in Machine Learning
Noise in Observations
Incomplete Coverage of the Domain
Imperfect Model of the Problem
How to Manage Uncertainty

Uncertainty in Machine Learning

Applied machine learning requires getting comfortable with uncertainty.

Uncertainty means working with imperfect or incomplete information.

Uncertainty is fundamental to the field of machine learning, yet it is one of the aspects that causes the most difficulty for beginners, especially those coming from a developer background.

For software engineers and developers, computers are deterministic. You write a program, and the computer does what you say. Algorithms are analyzed based on space or time complexity and can be chosen to optimize whichever is most important to the project, like execution speed or memory constraints.

Predictive modeling with machine learning involves fitting a model to map examples of inputs to an output, such as a number in the case of a regression problem or a class label in the case of a classification problem.

Naturally, the beginner asks reasonable questions, such as:

What are the best features that I should use?
What is the best algorithm for my dataset?

The answers to these questions are unknown and might even be unknowable, at least exactly.

Many branches of computer science deal mostly with entities that are entirely deterministic and certain. […] Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory.

— Page 54, Deep Learning, 2016.

This is the major cause of difficulty for beginners.

The answers are unknown because of uncertainty, and the remedy is to systematically evaluate different options until a good or good-enough set of features and/or algorithm is discovered for a specific prediction problem.

There are three main sources of uncertainty in machine learning, and in the following sections, we will take a look at each in turn.

Noise in Observations

Observations from the domain are not crisp; instead, they contain noise.

An observation from the domain is often referred to as an “instance” or a “sample” and is one row of data. It is what was measured or what was collected. It is the data that describes the object or subject. It is the input to a model and the expected output.

An example might be one set of measurements of one iris flower and the species of flower that was measured in the case of training data.

Sepal length: 	5.1 cm
Sepal width: 	3.5 cm
Petal length: 	1.4 cm
Petal width: 	0.2 cm
Species: 		Iris setosa

In the case of new data for which a prediction is to be made, it is just the measurements without the species of flower.

Sepal length: 	5.1 cm
Sepal width: 	3.5 cm
Petal length: 	1.4 cm
Petal width: 	0.2 cm
Species: 		?

Noise refers to variability in the observation.

Variability could be natural, such as a larger or smaller flower than normal. It could also be an error, such as a slip when measuring or a typo when writing it down.

This variability impacts not just the inputs or measurements but also the outputs; for example, an observation could have an incorrect class label.

This means that although we have observations for the domain, we must expect some variability or randomness.

The real world, and in turn, real data, is messy or imperfect. As practitioners, we must remain skeptical of the data and develop systems to expect and even harness this uncertainty.

This is why so much time is spent on reviewing statistics of data and creating visualizations to help identify those aberrant or unusual cases: so-called data cleaning.
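
As a rough illustration of that kind of review, the sketch below simulates noisy sepal length measurements, injects one typo, and flags values that sit far from the median. The numbers, the injected typo, and the helper name flag_unusual are hypothetical, and the median-based rule with a threshold of 3.5 is one common convention rather than the only reasonable choice.

```python
import random
import statistics

def flag_unusual(values, threshold=3.5):
    """Return indices of values whose robust z-score exceeds threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less affected by outliers than the mean and standard deviation.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly Gaussian.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

random.seed(1)
# Hypothetical "true" sepal length of 5.1 cm plus natural and measurement noise.
measurements = [random.gauss(5.1, 0.3) for _ in range(30)]
# Simulate a typo when writing a value down: 5.1 recorded as 51.0.
measurements[7] = 51.0

suspects = flag_unusual(measurements)
print("mean:  ", round(statistics.mean(measurements), 2))
print("median:", round(statistics.median(measurements), 2))
print("flagged indices:", suspects)
```

Note how a single bad value pulls the mean well away from the median. Comparing the two is a quick first check before looking at individual rows.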

Incomplete Coverage of the Domain

Observations from a domain used to train a model are a sample and incomplete by definition.

In statistics, a random sample refers to a collection of observations chosen from the domain without systematic bias. In practice, there will always be some bias.

For example, we might choose to measure the size of randomly selected flowers in one garden. The flowers are randomly selected, but the scope is limited to one garden. Scope can be increased to gardens in one city, across a country, across a continent, and so on.

A suitable level of variance and bias in the sample is required such that the sample is representative of the task or project for which the data or model will be used.
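
The effect of scope can be sketched with simulated data. The example below assumes three hypothetical gardens with different typical petal lengths (the garden names, means, and sample sizes are invented for illustration) and compares an estimate from a random sample of one garden with an estimate from a random sample across all of them.

```python
import random
import statistics

random.seed(7)

# Hypothetical population: petal lengths (cm) in three gardens whose
# growing conditions shift the typical flower size.
garden_means = {"garden_a": 1.2, "garden_b": 1.5, "garden_c": 1.9}
population = {
    name: [random.gauss(mu, 0.15) for _ in range(2000)]
    for name, mu in garden_means.items()
}
all_flowers = [x for flowers in population.values() for x in flowers]
true_mean = statistics.mean(all_flowers)

# Random sample restricted to one garden: random, but narrow in scope.
one_garden_sample = random.sample(population["garden_a"], 100)
# Random sample drawn across every garden: wider scope.
wide_sample = random.sample(all_flowers, 100)

narrow_error = abs(statistics.mean(one_garden_sample) - true_mean)
wide_error = abs(statistics.mean(wide_sample) - true_mean)

print("population mean:     ", round(true_mean, 3))
print("one-garden estimate: ", round(statistics.mean(one_garden_sample), 3))
print("all-gardens estimate:", round(statistics.mean(wide_sample), 3))
```

Both samples are random, yet the one-garden sample is only representative of that garden. Whether that matters depends on where the model will be used.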

We aim to collect or obtain a suitably representative random sample of observations to train and evaluate a machine learning model. Often, we have little control over the sampling process. Instead, we access a database or CSV file and the data we have is the data we must work with.

In all cases, we will never have all of the observations. If we did, a predictive model would not be required.

This means that there will always be some unobserved cases. There will be part of the problem domain for which we do not have coverage. No matter how well we encourage our models to generalize, we can only hope that they cover the cases in the training dataset and the salient cases that are not in it.

This is why we split a dataset into train and test sets or use resampling methods like k-fold cross-validation. We do this to handle the uncertainty in the representativeness of our dataset and estimate the performance of a modeling procedure on data not used in that procedure.
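
A minimal sketch of k-fold cross-validation is shown below, using only the Python standard library. The dataset is simulated, and the nearest-centroid classifier and the helper names (k_fold_indices, fit_centroids, predict) are assumptions chosen to keep the example short. In practice, a library such as scikit-learn would typically be used instead.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_centroids(rows):
    """'Train' a nearest-centroid classifier: the mean input per class."""
    by_label = {}
    for x, label in rows:
        by_label.setdefault(label, []).append(x)
    return {label: statistics.mean(xs) for label, xs in by_label.items()}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

rng = random.Random(3)
# Hypothetical one-feature dataset: two overlapping classes of petal length.
data = [(rng.gauss(1.5, 0.4), "short") for _ in range(50)]
data += [(rng.gauss(2.5, 0.4), "long") for _ in range(50)]

scores = []
for fold in k_fold_indices(len(data), k=5, rng=rng):
    held_out = set(fold)
    # Fit on everything except the held-out fold, then score on that fold.
    train = [row for i, row in enumerate(data) if i not in held_out]
    test = [data[i] for i in fold]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, x) == label for x, label in test)
    scores.append(correct / len(test))

print("fold accuracies:", [round(s, 2) for s in scores])
print("mean accuracy:  ", round(statistics.mean(scores), 3))
print("std deviation:  ", round(statistics.stdev(scores), 3))
```

The spread of the fold scores is as informative as their mean. It gives a rough sense of how much the estimate of performance depends on which rows happened to be held out.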

Imperfect Model of the Problem

A machine learning model will always have some error.

This is often summarized as “all models are wrong,” or more completely in an aphorism by George Box:

All models are wrong but some are useful

This applies not just to the model, the artifact, but to the whole procedure used to prepare it, including the choice and preparation of data, choice of training hyperparameters, and the interpretation of model predictions.

Model error could mean imperfect predictions, such as predicting a quantity in a regression problem that is quite different to what was expected, or predicting a class label that does not match what would be expected.

This type of error in prediction is expected given the uncertainty we have about the data that we have just discussed, both in terms of noise in the observations and incomplete coverage of the domain.

Another type of error is an error of omission. We leave out details or abstract them in order to generalize to new cases. This is achieved by selecting models that are simpler but more robust to the specifics of the data, as opposed to complex models that may be highly specialized to the training data. As such, we might and often do choose a model known to make errors on the training dataset with the expectation that the model will generalize better to new cases and have better overall performance.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule.

— Page 55, Deep Learning, 2016.

Nevertheless, predictions are required.

Given that we know the models will make errors, we handle this uncertainty by seeking a model that is good enough. This is often interpreted as selecting a model that is skillful compared to a naive method or other established learning models, i.e. good relative performance.
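
The idea of skill relative to a naive method can be sketched as follows. The data is simulated from an assumed linear relationship, the naive method always predicts the mean of the training targets, and the helper names fit_line and mean_absolute_error are hypothetical. The choice of mean absolute error as the metric is also an assumption.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return statistics.mean(abs(a - p) for a, p in zip(actual, predicted))

rng = random.Random(11)
# Hypothetical noisy relationship: y = 2x + 1 plus Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]

# Simple holdout split: first 150 rows to train, last 50 to test.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Naive method: always predict the mean of the training targets.
naive_prediction = statistics.mean(train_y)
naive_mae = mean_absolute_error(test_y, [naive_prediction] * len(test_y))

# Candidate model: a fitted straight line.
slope, intercept = fit_line(train_x, train_y)
model_mae = mean_absolute_error(test_y, [slope * x + intercept for x in test_x])

print("naive MAE:", round(naive_mae, 3))
print("model MAE:", round(model_mae, 3))
print("model is skillful:", model_mae < naive_mae)
```

The model still makes errors because of the noise in the data. It only needs to make smaller errors than the naive method to be considered skillful.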

How to Manage Uncertainty

Uncertainty in applied machine learning is managed using probability.

Probability is the field of mathematics designed to handle, manipulate, and harness uncertainty.

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.

— Page 12, Pattern Recognition and Machine Learning, 2006.

In fact, probability theory is central to the broader field of artificial intelligence.

Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience.

— Page 802, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

The methods and tools from probability provide the foundation and way of thinking about the random or stochastic nature of the predictive modeling problems addressed with machine learning; for example:

In terms of noisy observations, probability and statistics help us to understand and quantify the expected value and variability of variables in our observations from the domain.
In terms of the incomplete coverage of the domain, probability helps to understand and quantify the expected distribution and density of observations in the domain.
In terms of model error, probability helps to understand and quantify the expected capability and variance in performance of our predictive models when applied to new data.

But this is just the beginning, as probability provides the foundation for the training of many machine learning models through a framework called maximum likelihood estimation, which sits behind models such as linear regression, logistic regression, artificial neural networks, and many more.
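
To make maximum likelihood estimation concrete, the sketch below estimates the single parameter of a Bernoulli model from a handful of invented 0/1 labels. It searches a grid of candidate values for simplicity, which is an assumption of this example. Real models are usually fit with an analytical solution or an iterative optimizer rather than a grid.

```python
import math

def bernoulli_log_likelihood(p, outcomes):
    """Log-likelihood of 0/1 outcomes under a Bernoulli(p) model."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in outcomes)

# Hypothetical labels: 1 = "Iris setosa", 0 = any other species.
outcomes = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Search a grid of candidate parameter values, excluding 0 and 1,
# where the log-likelihood is undefined.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: bernoulli_log_likelihood(p, outcomes))

sample_proportion = sum(outcomes) / len(outcomes)
print("maximum likelihood estimate:", best_p)
print("sample proportion:          ", sample_proportion)
```

For this simple model, the parameter value that makes the observed data most likely is the proportion of positive labels in the sample. The same principle, choosing parameters that maximize the likelihood of the data, carries over to more complex models.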

Probability also provides the basis for developing specific algorithms, such as Naive Bayes, as well as entire subfields of study in machine learning, such as graphical models like the Bayesian Belief Network.

Probabilistic methods form the basis of a plethora of techniques for data mining and machine learning.

— Page 336, Data Mining: Practical Machine Learning Tools and Techniques. 4th edition, 2016.

The procedures we use in applied machine learning are carefully chosen to address the sources of uncertainty that we have discussed, but understanding why the procedures were chosen requires a basic understanding of probability and probability theory.

Summary

In this post, you discovered the challenge of uncertainty in machine learning.

Specifically, you learned:

Uncertainty is the biggest source of difficulty for beginners in machine learning, especially developers.
Noise in data, incomplete coverage of the domain, and imperfect models provide the three main sources of uncertainty in machine learning.
Probability provides the foundation and tools for quantifying, handling, and harnessing uncertainty in applied machine learning.


The post A Gentle Introduction to Uncertainty in Machine Learning appeared first on Machine Learning Mastery.
