Framework for Better Deep Learning

Author: Jason Brownlee

Modern deep learning libraries such as Keras allow you to define and start fitting a wide range of neural network models in minutes with just a few lines of code.

Nevertheless, it is still challenging to configure a neural network to get good performance on a new predictive modeling problem.

The challenge of getting good performance can be broken down into three main areas: problems with learning, problems with generalization, and problems with predictions.

Once you have diagnosed the specific type of problem that you are having with a network, a suite of classical and modern techniques can then be selected to address the issue and improve performance.

In this post, you will discover a framework for diagnosing performance problems with deep learning models and techniques that you can use to target and improve each specific performance problem.

After reading this post, you will know:

Defining and fitting neural networks has never been easier, although getting good performance on new problems remains challenging.
Neural network modeling performance problems can be decomposed into learning, generalization, and prediction type problems.
There are decades of techniques as well as modern methods that can be used to target each type of model performance problem.

Let’s get started.

Overview

This tutorial is divided into seven parts; they are:

Neural Network Renaissance
Challenge of Configuring Neural Networks
Framework for Systematically Better Deep Learning
Better Learning Techniques
Better Generalization Techniques
Better Predictions Techniques
How to Use the Framework

Neural Network Renaissance

Historically, neural network models had to be coded from scratch.

You might spend days or weeks translating poorly described mathematics into code and days or weeks more debugging your code just to get a simple neural network model to run.

Those days are in the past.

Today, you can define and begin fitting most types of neural networks in minutes with just a few lines of code, thanks to open source libraries such as Keras built on top of sophisticated mathematical libraries such as TensorFlow.

This means that standard models such as Multilayer Perceptrons can be developed and evaluated rapidly, as well as more sophisticated models that may previously have been beyond the capabilities of most practitioners to implement such as Convolutional Neural Networks and Recurrent Neural Networks like the Long Short-Term Memory network.

As deep learning practitioners, we live in amazing and productive times.

Nevertheless, even though new neural network models can be defined and evaluated rapidly, there remains little guidance on how to actually configure neural network models in order to get the most out of them.

Challenge of Configuring Neural Networks

Configuring neural network models is often referred to as a “dark art.”

This is because there are no hard and fast rules for configuring a network for a given problem. We cannot analytically calculate the optimal model type or model configuration for a given dataset.

Instead, there are decades' worth of techniques, heuristics, tips, tricks, and other tacit knowledge spread across code, papers, blog posts, and in people's heads.

A shortcut to configuring a neural network on a problem is to copy the configuration of another network for a similar problem. But this strategy rarely leads to good results as model configurations are not transferable across problems. It is also likely that you work on predictive modeling problems that are quite unlike other problems described in the literature.

Fortunately, there are techniques that are known to address specific issues when configuring and training a neural network that are available in modern deep learning libraries like Keras.

Further, discoveries have been made in the past 5 to 10 years in areas such as activation functions, adaptive learning rates, regularization methods, and ensemble techniques that have been shown to dramatically improve the performance of neural network models regardless of their specific type.

The techniques are available; you just need to know what they are and when to use them.

Framework for Systematically Better Deep Learning

Unfortunately, you cannot simply grid search across the techniques used to improve deep learning performance.

Almost universally, they uniquely change aspects of the training data, learning process, model architecture, and more. Instead, you must diagnose the type of performance problem you are having with your model, then carefully choose and evaluate a given intervention tailored to that diagnosed problem.

There are three types of problems that are straightforward to diagnose with regard to poor performance of a deep learning neural network model; they are:

Problems with Learning. Problems with learning manifest in a model that cannot effectively learn a training dataset or shows slow progress or bad performance when learning the training dataset.
Problems with Generalization. Problems with generalization manifest in a model that overfits the training dataset and performs poorly on a holdout dataset.
Problems with Predictions. Problems with predictions manifest in the stochastic training algorithm having a strong influence on the final model, causing a high variance in behavior and performance.

This breakdown provides a systematic approach to thinking about the performance of your deep learning model.

There is some natural overlap and interaction between these areas of concern. For example, problems with learning affect the ability of the model to generalize as well as the variance in the predictions made from a final model.

The sequential relationship between the three areas in the proposed breakdown allows the issue of deep learning model performance to be first isolated, then targeted with a specific technique or methodology.

We can summarize techniques that assist with each of these problems as follows:

Better Learning. Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
Better Generalization. Techniques that improve the performance of a neural network model on a holdout dataset.
Better Predictions. Techniques that reduce the variance in the performance of a final model.

Now that we have a framework for systematically diagnosing a performance problem with a deep learning neural network, let’s take a look at some examples of techniques that may be used in each of these three areas of concern.

Better Learning Techniques

Better learning techniques are those changes to a neural network model or learning algorithm that improve or accelerate the adaptation of the model weights in response to a training dataset.

In this section, we will review the techniques used to improve the adaptation of the model weights.

This begins with the careful configuration of the hyperparameters related to optimizing the neural network model using the stochastic gradient descent algorithm and updating the weights using the backpropagation of error algorithm; for example:

Configure Batch Size. Including exploring whether variations such as batch, stochastic (online), or mini-batch gradient descent are more appropriate.
Configure Learning Rate. Including understanding the effect of different learning rates on your problem and whether modern adaptive learning rate methods such as Adam would be appropriate.
Configure Loss Function. Including understanding the way different loss functions must be interpreted and whether an alternate loss function would be appropriate for your problem.
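
To make these three hyperparameters concrete, here is a minimal sketch in plain Python (not Keras, so that it runs without dependencies). The function name fit_linear, the toy dataset, and the default values are assumptions chosen for illustration. It fits a one-input linear model with mean squared error as the loss; setting batch_size to 1 gives stochastic (online) gradient descent, and setting it to the dataset size gives batch gradient descent.

```python
import random


def fit_linear(xs, ys, learning_rate=0.1, batch_size=2, epochs=500, seed=1):
    """Fit y = w * x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # The learning rate controls the size of each weight update.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

Re-running with a different learning_rate or batch_size and comparing how quickly the loss falls is a small-scale version of the sensitivity analysis you would perform on a real network.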

This also includes simple data preparation and the automatic rescaling of inputs at deeper layers.

Data Scaling Techniques. Including the sensitivity that small network weights have to the scale of input variables and the impact that large errors in the target variable have on weight updates.
Batch Normalization. Including the sensitivity to changes in the distribution of inputs to layers deep in a network model and the benefits of standardizing layer inputs to add consistency of input and stability to the learning process.
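
As a rough illustration of standardizing inputs, the sketch below rescales one input variable to zero mean and unit standard deviation. The helper names fit_scaler and standardize are hypothetical. The statistics are estimated on the training data only and then reused for new data; batch normalization applies a similar standardization to the inputs of layers inside the network, per mini-batch, which this sketch does not attempt to reproduce.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate the mean and standard deviation from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant column, which would cause a division by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Rescale values using previously estimated statistics."""
    return [(v - mu) / sigma for v in values]


train = [10.0, 20.0, 30.0, 40.0, 50.0]
mu, sigma = fit_scaler(train)
scaled_train = standardize(train, mu, sigma)
scaled_new = standardize([60.0], mu, sigma)  # new data reuses the training statistics
print(scaled_train, scaled_new)
```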

Stochastic gradient descent is a general optimization algorithm that can be applied to a wide range of problems. Nevertheless, the optimization process (or learning process) can become unstable and specific interventions are required; for example:

Vanishing Gradients. Error gradients shrink as they are propagated back through a deep multiple-layered network, so that layers close to the input layer do not have their weights updated; this can be addressed using modern activation functions such as the rectified linear activation function.
Exploding Gradients. Large weight updates cause a numerical overflow or underflow, making the network weights take on a NaN or Inf value; this can be addressed using gradient scaling or gradient clipping.
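
The two remedies for exploding gradients can be sketched in a few lines. This is an illustration under assumptions: the gradient is represented as a flat list of floats, and the function names scale_by_norm and clip_by_value are hypothetical, not taken from any library.

```python
import math


def scale_by_norm(gradient, max_norm):
    """Gradient scaling: shrink the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    factor = max_norm / norm
    return [g * factor for g in gradient]  # direction is preserved


def clip_by_value(gradient, limit):
    """Gradient clipping: force each element into the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradient]


gradient = [3.0, -40.0]
print(scale_by_norm(gradient, max_norm=1.0))
print(clip_by_value(gradient, limit=0.5))
```

Scaling keeps the direction of the update and changes only its length, whereas clipping by value can change the direction because each element is limited independently.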

The limitation of data on some predictive modeling problems can prevent effective learning. Specialized techniques can be used to jump-start the optimization process, providing a useful initial set of weights or even whole models that can be used for feature extraction; for example:

Greedy Layer-Wise Pretraining. Where layers are added one at a time to a model, learning to interpret the output of prior layers and permitting the development of much deeper models: a milestone technique in the field of deep learning.
Transfer Learning. Where a model is trained on a different, but somehow related, predictive modeling problem and then used to seed the weights or used wholesale as a feature extraction model to provide input to a model trained on the problem of interest.

Are there additional techniques that you use to improve learning?
Let me know in the comments below.

Better Generalization Techniques

Better generalization techniques are those that change the neural network model or learning algorithm to reduce the effect of the model overfitting the training dataset and improve the performance of the model on a holdout validation or test dataset.

In this section, we will review the techniques used to reduce the generalization error of the model during training.

Techniques that are designed to reduce generalization error are commonly referred to as regularization techniques. Almost universally, regularization is achieved by somehow reducing or limiting model complexity.

Perhaps the most widely understood measure of model complexity is the size or magnitude of the model weights. A model with large weights is a sign that it may be overly specialized to the inputs in the training data, making it unstable when making predictions on new unseen data. Keeping weights small via weight regularization is a powerful and widely used technique.

Weight Regularization. A change to the loss function that penalizes a model in proportion to the norm (magnitude) of the model weights, encouraging smaller weights and, in turn, a lower complexity model.
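
A minimal sketch of the idea: the penalty is computed from the weights and added to the ordinary loss, scaled by a coefficient. The names l1_penalty, l2_penalty, and regularized_loss, and the coefficient name alpha, are assumptions made for this example.

```python
def l1_penalty(weights):
    """Sum of absolute weight values (L1 norm)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weight values (squared L2 norm)."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, alpha=0.01, penalty=l2_penalty):
    """Loss on the data plus a penalty proportional to the size of the weights."""
    return data_loss + alpha * penalty(weights)


small = regularized_loss(0.5, [0.1, -0.2], alpha=0.1)
large = regularized_loss(0.5, [1.0, -2.0], alpha=0.1)
print(small, large)  # the same data loss costs more with larger weights
```

The coefficient alpha is one more hyperparameter to tune: too small and the penalty has no effect, too large and the model may underfit.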

Rather than simply encouraging the weights to remain small via an updated loss function, it is possible to force the weights to be small using a constraint.

Weight Constraint. Update to the model to rescale the weights when the vector norm of the weights exceeds a threshold.

The output of a neural network layer, regardless of where that layer is in the stack of layers, can be thought of as an internal representation or set of extracted features with regard to the input. Simpler internal representations can have a regularizing effect on the model and can be encouraged through constraints that encourage sparsity (zero values).

Activity Regularization. A change to the loss function that penalizes a model in proportion to the norm (magnitude) of the layer activations, encouraging smaller or more sparse internal representations.

Noise can be added to the model to encourage robustness with regard to the raw inputs or outputs from prior layers during training; for example:

Input Noise. Addition of statistical variation or noise at the input layer or between hidden layers to reduce the model’s dependence on specific input values.
Dropout. Probabilistically removing nodes (and with them their connections) while training the network to break tight coupling between nodes across layers.
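
Both forms of noise can be sketched on a plain list of values. This is an illustration only: the function names are hypothetical, the noise is assumed to be Gaussian, and the dropout shown is the common "inverted" variant, which rescales the surviving values during training so that nothing needs to change when making predictions.

```python
import random


def add_gaussian_noise(inputs, stddev, rng):
    """Add zero-mean Gaussian noise to each input value (training only)."""
    return [x + rng.gauss(0.0, stddev) for x in inputs]


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in the range [0, 1)")
    keep = 1.0 - rate
    # Dividing by `keep` leaves the expected value of each activation unchanged.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(1)
print(add_gaussian_noise([0.5, 0.5, 0.5], stddev=0.1, rng=rng))
print(dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
```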

Often, overfitting can occur due simply to training the model for too long on the training dataset. A simple solution is to stop the training early.

Early Stopping. Monitor model performance on the holdout validation dataset during training and stop the training process when performance on the validation set starts to degrade.
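
The stopping rule can be sketched as a loop over the validation loss recorded at the end of each epoch. The function name and the patience parameter (how many epochs without improvement to tolerate) are assumptions for this example; in a real training loop you would also save the model weights at the best epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop training here
    return len(val_losses) - 1, best_epoch  # ran out of epochs without stopping


val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

A patience greater than one guards against stopping on a single noisy epoch, at the cost of a little extra training.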

Are there additional techniques that you use to improve generalization?
Let me know in the comments below.

Better Predictions Techniques

Better prediction techniques are those that complement the model training process in order to reduce the variance in the expected performance of the final model.

In this section, we will review the techniques to reduce the expected variance of a final deep learning neural network model.

The variance in the performance of the final model can be reduced by adding bias. The most common way to introduce bias to the final model is to combine the predictions from multiple models. This is referred to as ensemble learning.

More than reducing the variance of the performance of a final model, ensemble learning can also result in better predictive performance.

Effective ensemble learning methods require that each contributing model have skill, meaning that the models make predictions that are better than random, but that the prediction errors between the models have a low correlation. This means that the ensemble member models should have skill, but in different ways.

This can be achieved by varying one aspect of the ensemble; for example:

Vary the training data used to fit each member.
Vary the members that contribute to the ensemble prediction.
Vary the way that the predictions from the ensemble members are combined.

The training data can be varied by fitting models on different subsamples of the dataset.

This might involve fitting and retaining models on different randomly selected subsets of the training dataset, retaining models for each fold in a k-fold cross-validation, or retaining models across different samples with replacement using the bootstrap method (e.g. bootstrap aggregation). Collectively, we can think of these methods as resampling ensembles.

Resampling Ensemble. Ensemble of models fit on different samples of the training dataset.
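
As one example of a resampling scheme, the sketch below draws bootstrap samples (sampling rows with replacement) so that each ensemble member can be fit on a slightly different version of the training dataset. The function name bootstrap_samples and the toy dataset are assumptions; fitting a model on each sample is left out.

```python
import random


def bootstrap_samples(dataset, n_members, seed=1):
    """Draw one bootstrap sample (with replacement) per ensemble member."""
    rng = random.Random(seed)
    n = len(dataset)
    samples = []
    for _ in range(n_members):
        # Each sample has the same size as the dataset, but rows may repeat
        # and some rows will be left out.
        chosen = [rng.randrange(n) for _ in range(n)]
        samples.append([dataset[i] for i in chosen])
    return samples


dataset = list(range(10))  # stand-in for ten training rows
for sample in bootstrap_samples(dataset, n_members=3):
    print(sorted(sample))
```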

Perhaps the simplest way to vary the members of the ensemble involves gathering models from multiple runs of the learning algorithm on the training dataset. The stochastic learning algorithm will produce a slightly different fit on each run that, in turn, will have slightly different performance. Averaging the models across multiple runs helps keep the performance consistent.

Model Averaging Ensemble. Retrain models across multiple runs of the same learning algorithm on the same dataset.

Variations on this approach may involve training models with different hyperparameter configurations.

It can be expensive to train multiple final deep learning models, especially when one model may take days or weeks to fit.

An alternative is to collect models for use as contributing ensemble members during a single training run; for example:

Horizontal Ensemble. Ensemble members collected from a contiguous block of training epochs towards the end of a single training run.
Snapshot Ensemble. A training run using an aggressive cyclic learning rate where ensemble members are collected at the trough of each cycle of the learning rate.
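
To show when members would be collected in each case, here is a sketch of the epoch bookkeeping only (no model is trained). It assumes a cosine-shaped cyclic learning rate that restarts at its maximum at the start of every cycle, which is one common choice for snapshot ensembles; the function names and the example numbers are hypothetical.

```python
import math


def cosine_cyclic_lr(epoch, total_epochs, n_cycles, lr_max):
    """Learning rate that falls from lr_max toward zero within each cycle."""
    cycle_len = math.ceil(total_epochs / n_cycles)
    position = epoch % cycle_len  # how far into the current cycle we are
    return lr_max / 2 * (math.cos(math.pi * position / cycle_len) + 1)


def snapshot_epochs(total_epochs, n_cycles):
    """Epochs at the trough of each cycle, where a snapshot would be saved."""
    cycle_len = math.ceil(total_epochs / n_cycles)
    return [e for e in range(total_epochs) if (e + 1) % cycle_len == 0]


def horizontal_epochs(total_epochs, n_members):
    """The last n_members epochs of a run, used by a horizontal ensemble."""
    return list(range(max(0, total_epochs - n_members), total_epochs))


print(snapshot_epochs(100, 5))
print(horizontal_epochs(100, 3))
print(cosine_cyclic_lr(19, 100, 5, lr_max=0.1), cosine_cyclic_lr(20, 100, 5, lr_max=0.1))
```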

The simplest way to combine the predictions from multiple ensemble members is to calculate the average of the predictions in the case of regression, or the statistical mode or most frequent prediction in the case of classification.
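
Both combination rules fit in a couple of lines of Python. The function names are assumptions for this example, and the inputs are the predictions made by each ensemble member for a single input.

```python
from collections import Counter
from statistics import mean


def average_prediction(member_predictions):
    """Regression: the mean of the members' numeric predictions."""
    return mean(member_predictions)


def majority_vote(member_labels):
    """Classification: the most frequent label predicted by the members."""
    # With a tie, Counter returns the tied label that was seen first.
    return Counter(member_labels).most_common(1)[0][0]


print(average_prediction([2.0, 3.0, 4.0]))
print(majority_vote(["cat", "dog", "cat"]))
```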

Alternately, the best way to combine the predictions from multiple models can be learned; for example:

Weighted Average Ensemble (blending). The contribution from each ensemble member to an ensemble prediction is weighted using learned coefficients that indicate the trust in each model.
Stacked Generalization (stacking). A new model is trained to learn how to best combine the predictions from the ensemble members.
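
As a small illustration of a weighted average, the sketch below combines two members and chooses the weight with a simple grid search on a holdout set. The grid search is an assumption made to keep the example short; it stands in for whatever method you use to learn the coefficients, and the function names are hypothetical.

```python
def weighted_average(member_predictions, weights):
    """Combine one prediction per member using non-negative weights."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, member_predictions)) / total


def search_two_member_weight(preds_a, preds_b, targets, steps=10):
    """Find the weight on member A (member B gets the rest) with the lowest holdout MSE."""
    best_alpha, best_error = 0.0, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        error = sum(
            (weighted_average([a, b], [alpha, 1.0 - alpha]) - t) ** 2
            for a, b, t in zip(preds_a, preds_b, targets)
        ) / len(targets)
        if error < best_error:
            best_alpha, best_error = alpha, error
    return best_alpha


targets = [1.0, 2.0, 3.0]
preds_a = [1.1, 2.1, 2.9]  # member A is close to the targets
preds_b = [2.0, 3.0, 1.0]  # member B is poor
print(search_two_member_weight(preds_a, preds_b, targets))
```

Stacking generalizes this idea by replacing the weighted sum with a model trained on the members' holdout predictions.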

As an alternative to combining the predictions from the ensemble members, the models themselves may be combined; for example:

Average Model Weight Ensemble. Weights from multiple neural network models are averaged into a single model used to make a prediction.
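
A minimal sketch of averaging weights, assuming each model is represented as a flat list of weights and that all models share exactly the same architecture (for example, models saved at different epochs of one training run). The function name is hypothetical.

```python
def average_model_weights(models):
    """Average corresponding weights across models with identical shapes."""
    if len({len(weights) for weights in models}) != 1:
        raise ValueError("all models must have the same number of weights")
    n = len(models)
    # zip(*models) groups the i-th weight of every model together.
    return [sum(group) / n for group in zip(*models)]


checkpoints = [
    [1.0, 2.0, -1.0],
    [3.0, 4.0, 1.0],
]
print(average_model_weights(checkpoints))
```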

Are there additional techniques that you use to reduce the variance of the final model?
Let me know in the comments below.

How to Use the Framework

We can think of the organization of techniques into the three areas of better learning, generalization, and prediction as a systematic framework for improving the performance of your neural network model.

There are too many techniques to reasonably investigate and evaluate each in your project.

Instead, you need to be methodical and use the techniques in a targeted way to address a defined problem.

Step 1: Diagnose the Performance Problem

The first step in using this framework is to diagnose the performance problem that you are having with your model.

A robust diagnostic tool is to calculate a learning curve of loss and a problem-specific metric (like RMSE for regression or accuracy for classification) on a train and validation dataset over a given number of training epochs.

If the loss on the training dataset is poor, stuck, or fails to improve, perhaps you have a learning problem.
If the loss or problem-specific metric on the training dataset continues to improve but gets worse on the validation dataset, perhaps you have a generalization problem.
If the loss or problem-specific metric on the validation dataset shows a high variance towards the end of the run, perhaps you have a prediction problem.
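
These three checks can be turned into a crude automated first pass over recorded learning curves. Everything specific in this sketch is an assumption: the function name diagnose, the thresholds poor_loss, gap, and noisy_std, and the number of final epochs inspected (tail) are hypothetical values that you would need to set for your own loss scale. It also assumes the run was long enough for the curves to level off, and it is no substitute for plotting the curves and looking at them.

```python
from statistics import mean, pstdev


def diagnose(train_loss, val_loss, poor_loss=0.5, gap=0.1, noisy_std=0.05, tail=3):
    """Suggest a problem type from per-epoch training and validation loss."""
    # Learning problem: training loss is still poor or never improved.
    if train_loss[-1] > poor_loss or train_loss[-1] >= train_loss[0]:
        return "learning"
    best_val = min(val_loss)
    best_epoch = val_loss.index(best_val)
    recent_val = val_loss[-tail:]
    # Generalization problem: training loss kept improving after the best
    # validation epoch while validation loss drifted upward.
    if train_loss[-1] < train_loss[best_epoch] and mean(recent_val) > best_val + gap:
        return "generalization"
    # Prediction problem: validation loss is noisy near the end of the run.
    if pstdev(recent_val) > noisy_std:
        return "prediction"
    return "none detected"


train = [1.0, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
val = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6, 0.7, 0.8]
print(diagnose(train, val))
```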

Step 2: Select and Evaluate a Technique

Review the techniques that are designed to address your problem.

Select a technique that appears to be a good fit for your model and problem. This may require some prior experience with the techniques and may be challenging for a beginner.

Thankfully, there are heuristics and best-practices that work well on most problems.

For example:

Learning Problem: Tune the hyperparameters of the learning algorithm; the learning rate in particular offers the biggest leverage.
Generalization Problem: Use weight regularization with early stopping, which works well on most models with most problems, or try dropout with early stopping.
Prediction Problem: Average the predictions from models collected over multiple runs, or over multiple epochs of one run, to add sufficient bias.

Pick an intervention, then read up a little on it, including how it works, why it works, and, importantly, find examples of how practitioners before you have used it to get an idea of how you might use it on your problem.

Step 3: Go To Step 1

Once you have identified an issue and addressed it with an intervention, repeat the process.

Developing a better model is an iterative process that may require multiple interventions at multiple levels that complement each other.

This is an empirical process. This means that you are reliant on the robustness of your test harness to give you a reliable summary of performance before and after an intervention. Spend the time to ensure your test harness is robust, and make sure that the train, validation, and test datasets are clean and provide a suitably representative sample of observations from your problem domain.

Summary

In this post, you discovered a framework for diagnosing performance problems with deep learning models and techniques that you can use to target and improve each specific performance problem.

Specifically, you learned:

Defining and fitting neural networks has never been easier, although getting good performance on new problems remains challenging.
Neural network modeling performance problems can be decomposed into learning, generalization, and prediction type problems.
There are decades of techniques as well as modern methods that can be used to target each type of model performance problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Framework for Better Deep Learning appeared first on Machine Learning Mastery.
