Introduction to Dropout to Regularize Deep Neural Networks

Author: saurav singla

Dropout means dropping out units, both hidden and visible, in a neural network. It is a very popular method for overcoming overfitting in neural networks.

Deep learning architectures are getting deeper and deeper. With these bigger networks, we can achieve better prediction accuracy. However, this was not the case a few years ago, when deep learning suffered from overfitting. Then, around 2012, Hinton et al. proposed the idea of Dropout in their paper: randomly excluding subsets of features at each iteration of the training procedure. The concept revolutionized deep learning, and a significant part of the success we have with deep learning is attributed to Dropout.

Before Dropout, a major research area was regularization. The introduction of regularization methods in neural networks, such as L1 and L2 weight penalties, began in the mid-2000s. However, these regularizations did not completely solve the overfitting issue.

Wager et al. (2013) showed in their paper that dropout regularization was better than L2 regularization for learning weights for features.

Dropout is a technique in which randomly selected neurons are ignored during training. They are “dropped out” at random. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to the neuron on the backward pass.

You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

Although dropout has proven to be a very successful technique, the reasons for its success are not yet well understood at a theoretical level.

In a standard feedforward pass, the weights multiply the inputs, the bias is added, and the result is passed to the activation function. With dropout, the forward pass changes as follows:

  • Generate a dropout mask of Bernoulli random variables (for example, 1.0 * (np.random.random(size) > p), where p is the dropout rate).
  • Apply the mask to the inputs, disconnecting some neurons.
  • Use this thinned layer to multiply by the weights and add the bias.
  • Finally, use the activation function.
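
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name dropout_forward, the layer sizes, and the choice of ReLU as the activation are assumptions made for the example, and p is taken to be the probability of dropping a unit.

```python
import numpy as np

def dropout_forward(x, W, b, p, rng):
    """One dense layer with dropout applied to its inputs.

    p is the dropout rate (probability of dropping a unit).
    """
    # Step 1: Bernoulli mask, 1.0 keeps a unit and 0.0 drops it.
    mask = 1.0 * (rng.random(x.shape) > p)
    # Step 2: apply the mask to the inputs, disconnecting some neurons.
    x_thinned = x * mask
    # Step 3: multiply by the weights and add the bias.
    z = x_thinned @ W + b
    # Step 4: activation function (ReLU is an assumption for this sketch).
    return np.maximum(z, 0.0), mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))   # 4 training cases, 5 input units
W = rng.normal(size=(5, 3))   # 5 inputs -> 3 outputs
b = np.zeros(3)
y, mask = dropout_forward(x, W, b, p=0.5, rng=rng)
```

The sketch covers the training-time pass only. At test time no units are dropped, and the activations or weights are usually rescaled to compensate.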

All the weights are shared across the potentially exponential number of thinned networks, and during backpropagation only the weights of the sampled “thinned network” are updated.
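
To make this concrete, the sketch below runs one forward and backward pass through a single linear layer with a fixed mask and checks which weights move. The mask values, the squared-output loss, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(1, 4))               # one training case, 4 input units
W = rng.normal(size=(4, 2))               # weights shared by every thinned network
mask = np.array([[1.0, 0.0, 1.0, 0.0]])   # units 1 and 3 dropped (fixed for illustration)

# Forward pass through the thinned network (linear output, no bias, for brevity).
x_thinned = x * mask
y = x_thinned @ W

# Backward pass for the loss 0.5 * sum(y ** 2), whose gradient w.r.t. y is y itself.
grad_y = y
grad_W = x_thinned.T @ grad_y             # shape (4, 2)

# Plain SGD step: rows of W that belong to dropped units do not change.
learning_rate = 0.1
W_new = W - learning_rate * grad_W
```

Rows 1 and 3 of grad_W are zero because the corresponding inputs were masked, so only the weights attached to the surviving units are updated in this step.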

According to Srivastava (2013), dropout neural networks can be trained with stochastic gradient descent. Dropout is applied independently for each training case in each minibatch. Dropout can be used with any activation function; their experiments with logistic, tanh, and rectified linear units yielded comparable results but required different amounts of training time, with rectified linear units being the fastest to train.
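
Sampling dropout independently for each training case can be illustrated by drawing a mask with the same shape as the minibatch. The batch size, layer width, and dropout rate below are assumed values chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                        # dropout rate (assumed value)
batch = np.ones((8, 20))       # 8 training cases, 20 units each

# One Bernoulli draw per case and per unit, so every row gets its own mask.
mask = 1.0 * (rng.random(batch.shape) > p)
thinned = batch * mask

# Number of units that survive in each training case.
kept_per_case = mask.sum(axis=1)
```

Each row of mask defines a different thinned network, so a single minibatch effectively trains several thinned networks at once.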


Kingma et al. (2015) noted that dropout requires specifying the dropout rates, which are the probabilities of dropping a neuron. The dropout rates are usually optimized using grid search. Variational Dropout is an elegant interpretation of Gaussian Dropout as a special case of Bayesian regularization. This method allows us to tune the dropout rate and can, in principle, be used to set individual dropout rates for each layer, neuron, or even weight.
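
Gaussian Dropout replaces the Bernoulli mask with multiplicative Gaussian noise centred on 1. The sketch below is a rough illustration, not the Variational Dropout method itself: it assumes the noise variance alpha = p / (1 - p), which matches the variance of a rescaled Bernoulli mask with dropout rate p, and the function name gaussian_dropout is hypothetical.

```python
import numpy as np

def gaussian_dropout(x, p, rng):
    """Multiply activations by noise drawn from N(1, alpha), alpha = p / (1 - p)."""
    alpha = p / (1.0 - p)
    noise = rng.normal(loc=1.0, scale=np.sqrt(alpha), size=x.shape)
    return x * noise

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out = gaussian_dropout(x, p=0.2, rng=rng)   # alpha = 0.25 for p = 0.2
```

Because the noise has mean 1, the expected activation is unchanged. In Variational Dropout, alpha is treated as a parameter that can be learned instead of being fixed by hand.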


Another experiment, by Ba and Frey (2013), increased the number of hidden units in the deep learning algorithm. One notable thing about dropout regularization is that it achieves considerably better performance with large numbers of hidden units, since all units have an equal probability of being excluded.


  • Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability that is too low has minimal effect, and a value that is too high results in under-learning by the network.
  • You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
  • Use dropout on incoming (visible) as well as hidden units. Applying dropout at each layer of the network has shown good results.
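
As a rough sketch of the last tip, the function below applies dropout to the input of every layer in a small network, using a 20% rate on the visible units and 50% on the hidden units. The function name forward_with_dropout, the layer sizes, the ReLU activation, and these particular rates are assumptions chosen within the range suggested above.

```python
import numpy as np

def forward_with_dropout(x, layers, rates, rng):
    """Forward pass that applies dropout to the input of every layer.

    layers: list of (W, b) pairs; rates: one dropout rate per layer input.
    """
    h = x
    for i, ((W, b), p) in enumerate(zip(layers, rates)):
        mask = 1.0 * (rng.random(h.shape) > p)   # drop visible or hidden units
        h = (h * mask) @ W + b
        if i < len(layers) - 1:                  # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(10, 32)) * 0.1, np.zeros(32)),  # visible -> hidden
    (rng.normal(size=(32, 1)) * 0.1, np.zeros(1)),    # hidden -> output
]
x = rng.normal(size=(16, 10))
out = forward_with_dropout(x, layers, rates=[0.2, 0.5], rng=rng)
```

Passing rates=[0.0, 0.0] turns dropout off and recovers the ordinary forward pass, which makes it easy to compare the two settings.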


  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), pp.1929-1958.
  • Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R.R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
  • Wager, S., Wang, S. and Liang, P.S., 2013. Dropout training as adaptive regularization. In Advances in neural information processing systems (pp. 351-359).
  • Srivastava, N., 2013. Improving neural networks with dropout. The University of Toronto, 182(566), p.7.
  • Kingma, D.P., Salimans, T. and Welling, M., 2015. Variational dropout and the local reparameterization trick. In Advances in neural information processing systems (pp. 2575-2583).
  • Ba, J. and Frey, B., 2013. Adaptive dropout for training deep neural networks. In Advances in neural information processing systems (pp. 3084-3092).

