BlockDrop to Accelerate Neural Network Inference by IBM Research

Author: Sharmistha Chatterjee

Scaling AI with Dynamic Inference Paths in Neural Networks

Introduction

IBM Research, in collaboration with the University of Texas at Austin and the University of Maryland, has created a technology called BlockDrop that promises to speed up convolutional neural network operations without any loss of fidelity.

This could further expand the use of neural nets, particularly in settings with limited computing capability.

Increases in accuracy have been accompanied by increasingly complex and deep network architectures. This presents a problem for domains where fast inference is essential, particularly in delay-sensitive and real-time scenarios such as autonomous driving, robotic navigation, or user-interactive applications on mobile devices.

Further research shows that dropout-style regularization, which works well for fully connected layers, is less effective for convolutional layers: activation units in these layers are spatially correlated, so information can still flow through convolutional networks despite dropout.

The BlockDrop method introduced by IBM Research is complementary to existing model compression techniques, as this form of structured dropout drops spatially correlated information. The residual blocks of a neural network that are kept for evaluation can be pruned further for greater speedup.

Residual block: the building block of a ResNet

The figure above illustrates the BlockDrop mechanism for an image fed to a convolutional network. It highlights the activation units that contain semantic information about the input image. Dropping activations at random is not effective at removing semantic information, because nearby activations contain closely related information. A better strategy is to drop contiguous regions, which removes certain semantic information (e.g., head or feet) and compels the remaining units to learn features for classifying the input image.
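The difference between the two dropping strategies can be sketched on a small grid of activations. This is only an illustration under stated assumptions: the function names, the 6x6 grid, and the square shape of the dropped region are hypothetical choices, not taken from the BlockDrop code.

```python
import random
from typing import List

Mask = List[List[int]]  # 1 = activation kept, 0 = activation dropped


def random_drop_mask(height: int, width: int, n_drop: int, rng: random.Random) -> Mask:
    """Drop n_drop individual activations chosen independently at random."""
    mask = [[1] * width for _ in range(height)]
    cells = [(r, c) for r in range(height) for c in range(width)]
    for r, c in rng.sample(cells, n_drop):
        mask[r][c] = 0
    return mask


def contiguous_drop_mask(height: int, width: int, block: int, rng: random.Random) -> Mask:
    """Drop one block x block square whose top-left corner is chosen at random."""
    top = rng.randrange(height - block + 1)
    left = rng.randrange(width - block + 1)
    mask = [[1] * width for _ in range(height)]
    for r in range(top, top + block):
        for c in range(left, left + block):
            mask[r][c] = 0
    return mask


rng = random.Random(0)
print("scattered drops:")
for row in random_drop_mask(6, 6, 9, rng):
    print(row)
print("contiguous drop:")
for row in contiguous_drop_mask(6, 6, 3, rng):
    print(row)
```

Both masks remove nine activations, but only the contiguous one removes a whole neighbourhood, so the surrounding units cannot simply fill in the missing information.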

Policy Network for Dynamic Inference Paths

The BlockDrop mechanism learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. It exploits the robustness of Residual Networks (ResNets) by dropping layers that are not needed to reach the desired level of accuracy, resulting in a dynamic selection of residual blocks for each novel image. It thus aids in:

Allocating system resources in a more efficient manner.
Facilitating further insights into ResNets, e.g., whether different blocks encode information about objects.
Achieving minimal block usage based on image-specific decisions to optimally drop blocks.
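
To make the idea of a dynamic inference path concrete, here is a minimal sketch that assumes a toy vector input and hypothetical stand-ins for the learned layers inside each block. A residual block adds a learned residual to its input, so a dropped block reduces to the identity and is never evaluated.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def forward_with_policy(x: Vector, blocks: Sequence[Block], keep: Sequence[int]) -> Vector:
    """Run only the residual blocks whose keep flag is 1; dropped blocks act as identity."""
    if len(blocks) != len(keep):
        raise ValueError("need exactly one keep/drop decision per block")
    out = list(x)
    for block, k in zip(blocks, keep):
        if k:
            residual = block(out)
            out = [o + r for o, r in zip(out, residual)]
        # When k == 0 the block is never called, which is where the savings come from.
    return out


def make_scale_block(factor: float) -> Block:
    """Hypothetical stand-in for a block's learned layers."""
    return lambda v: [factor * vi for vi in v]


blocks = [make_scale_block(1.0), make_scale_block(0.5), make_scale_block(2.0)]
print(forward_with_policy([1.0], blocks, keep=[1, 1, 1]))  # all blocks run: [9.0]
print(forward_with_policy([1.0], blocks, keep=[1, 0, 1]))  # second block skipped: [6.0]
```

In a real ResNet the keep/drop vector would come from the policy network and differ from image to image; here it is passed in by hand.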

Specifically, given a pre-trained ResNet, a policy network is trained in an associative reinforcement learning setting for the dual reward of utilizing a minimal number of blocks while preserving recognition accuracy. Experiments on CIFAR and ImageNet reveal that the learned policies not only accelerate inference but also encode meaningful visual information. With this method, a ResNet-101 model achieves a speedup of 20% on average, going as high as 36% for some images, while maintaining the same 76.4% top-1 accuracy on ImageNet.
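
The dual reward can be sketched as a single function. The exact shape used here is an assumption for illustration: a correct prediction earns more the fewer blocks it used (one minus the squared fraction of blocks kept), and a wrong prediction incurs a fixed penalty `gamma`. The function name and the default value of `gamma` are hypothetical.

```python
from typing import Sequence


def block_usage_reward(keep: Sequence[int], correct: bool, gamma: float = 5.0) -> float:
    """Reward sparse block usage when the prediction is right; penalize mistakes."""
    usage = sum(keep) / len(keep)  # fraction of blocks that were executed
    if correct:
        return 1.0 - usage ** 2    # assumed quadratic penalty on block usage
    return -gamma                  # assumed constant penalty for a wrong prediction


print(block_usage_reward([1, 0, 0, 1], correct=True))   # 0.75
print(block_usage_reward([1, 1, 1, 1], correct=True))   # 0.0
print(block_usage_reward([1, 0, 0, 0], correct=False))  # -5.0
```

A larger `gamma` pushes the policy towards keeping accuracy, while a smaller one pushes it towards dropping more blocks.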

The BlockDrop strategy learns a model, referred to as the policy network, that, given a novel input image, outputs the posterior probabilities of all the binary decisions for dropping or keeping each block in a pre-trained ResNet. The policy network is trained using curriculum learning to maximize a reward that incentivizes the use of as few blocks as possible while preserving the prediction accuracy.
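
As a rough sketch of how such per-block probabilities might be turned into decisions, the code below treats each block as an independent Bernoulli variable. Sampling during training, thresholding at 0.5 at test time, and all function names are assumptions made for illustration, not details confirmed by this article.

```python
import math
import random
from typing import List, Sequence


def sample_decisions(probs: Sequence[float], rng: random.Random) -> List[int]:
    """Sample one keep (1) / drop (0) decision per block, as one might during training."""
    return [1 if rng.random() < p else 0 for p in probs]


def threshold_decisions(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Deterministic decisions, as one might use at test time."""
    return [1 if p >= threshold else 0 for p in probs]


def log_prob(probs: Sequence[float], decisions: Sequence[int]) -> float:
    """Log-likelihood of a decision vector; each probability must lie strictly in (0, 1)."""
    return sum(math.log(p if u else 1.0 - p) for p, u in zip(probs, decisions))


probs = [0.9, 0.2, 0.6]  # hypothetical policy output for a 3-block network
decisions = threshold_decisions(probs)
print(decisions)                            # [1, 0, 1]
print(math.exp(log_prob(probs, decisions)))
print(sample_decisions(probs, random.Random(0)))
```

The log-likelihood is the quantity a policy-gradient update would weight by the reward, so decision vectors that earn high reward become more probable.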

In addition, the pre-trained ResNet is jointly fine-tuned with the policy network to produce feature transformations targeted at block-dropping behavior. The method represents an instantiation of associative reinforcement learning, where all the decisions are taken in a single step given the context (i.e., the input instance). This makes policy execution lightweight and scalable to very deep networks.

A recurrent model such as an LSTM could also serve as the policy network; however, the research findings show a CNN to be more efficient with similar performance.

The figure above gives a conceptual overview of BlockDrop, which learns a policy to select the minimal configuration of blocks needed to correctly classify a given input image. The resulting instance-specific paths in the network not only reflect the image’s difficulty (easier samples use fewer blocks) but also encode meaningful visual information (patterns of blocks correspond to clusters of visual features).

The figure above depicts the policy network architecture of BlockDrop. For any new image, the policy network outputs a drop or keep decision for each block in a pre-trained ResNet. The active blocks that remain are then used to evaluate the prediction. Policy rewards account for both block usage and prediction accuracy. The policy network is trained to optimize the expected reward with a curriculum learning strategy, and is then jointly fine-tuned with the ResNet.
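
One curriculum that fits this description, and an assumption here because the schedule is not spelled out above, is to start by keeping almost every block and let the policy control one more block at each stage, beginning from the end of the network. The function name and the `step` parameter are hypothetical.

```python
from typing import List, Sequence


def curriculum_decisions(policy_decisions: Sequence[int], step: int) -> List[int]:
    """Force the first K - step blocks to be kept; the policy decides the last `step` blocks."""
    k = len(policy_decisions)
    learned = max(0, min(step, k))  # number of trailing blocks under policy control
    fixed = k - learned             # number of leading blocks that are always kept
    return [1] * fixed + list(policy_decisions[fixed:])


policy_output = [0, 1, 0, 0]  # hypothetical raw decisions for a 4-block network
for step in range(5):
    print(step, curriculum_decisions(policy_output, step))
# 0 [1, 1, 1, 1]
# 1 [1, 1, 1, 0]
# 2 [1, 1, 0, 0]
# 3 [1, 1, 0, 0]
# 4 [0, 1, 0, 0]
```

Growing the search space gradually avoids asking the policy to explore every keep/drop combination from the first training step.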

The figure above shows samples from ImageNet. The top row contains images that are correctly classified with the fewest blocks, while samples in the bottom row use the most blocks. Samples using fewer blocks are indeed easier to identify, since they contain single, frontal-view objects positioned in the center, while samples that require more blocks contain several objects, occlusion, or cluttered backgrounds.

This is consistent with the hypothesis that block usage is a function of instance difficulty, with BlockDrop automatically learning to “sort” images into easy and hard cases.

Library and Usage

See the source code and comments on the project's GitHub page.

Conclusion

In this blog we have discussed the BlockDrop strategy, which aims to speed up inference in neural networks. It has the following characteristics:

Speeds up AI-based computer vision operations.
Takes approximately 200 times less power per pixel than comparable systems using traditional hardware.
Facilitates deployment of top-performing deep neural network models on mobile devices by effectively reducing the storage and computational costs of such networks.
Determines the minimal configuration of layers, or blocks, needed to correctly classify a given input image. Simpler images allow more layers to be removed, saving more time.
Applies to ResNets for faster inference by selectively choosing which residual blocks to evaluate, in a learned and optimized manner conditioned on the input.
Extensive experiments on CIFAR and ImageNet show considerable gains over existing methods in terms of the efficiency and accuracy trade-off.

References

BlockDrop: Dynamic Inference Paths in Residual Networks
