Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks (1905.05928v1)

Published 15 May 2019 in cs.LG and stat.ML

Abstract: In this work, we propose a novel technique to boost training efficiency of a neural network. Our work is based on an excellent idea that whitening the inputs of neural networks can achieve a fast convergence speed. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, whose inputs would be made more independent. However, determining independent components is a computationally intensive task. To overcome this challenge, we propose to implement an IC layer by combining two popular techniques, Batch Normalization and Dropout, in a new manner that we can rigorously prove that Dropout can quadratically reduce the mutual information and linearly reduce the correlation between any pair of neurons with respect to the dropout layer parameter $p$. As demonstrated experimentally, the IC layer consistently outperforms the baseline approaches with more stable training process, faster convergence speed and better convergence limit on CIFAR10/100 and ILSVRC2012 datasets. The implementation of our IC layer makes us rethink the common practices in the design of neural networks. For example, we should not place Batch Normalization before ReLU since the non-negative responses of ReLU will make the weight layer updated in a suboptimal way, and we can achieve better performance by combining Batch Normalization and Dropout together as an IC layer.

Citations (78)

Summary

  • The paper proposes a novel Independent-Component (IC) layer combining BatchNorm and Dropout to improve deep network training convergence and stability.
  • Experiments on the CIFAR and ILSVRC datasets show that ResNets with IC layers achieve faster, more stable convergence than traditional BatchNorm implementations.
  • The findings suggest re-evaluating conventional architectures, showing that placing IC layers before weight layers significantly boosts training efficiency and outcome stability.

Reconsidering Batch Normalization and Dropout in Deep Neural Networks

The paper "Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks" investigates the integration and optimization of Batch Normalization (BatchNorm) and Dropout techniques to enhance the training efficiency of deep neural networks (DNNs). This research challenges traditional practices and proposes a novel Independent-Component (IC) layer aimed at improving the independence of neural activations, which enhances convergence speed and stability.

Primarily, the authors propose combining BatchNorm and Dropout to create an Independent-Component (IC) layer. This design is grounded in the theory that reducing correlation and mutual information between neurons can expedite convergence, a hypothesis the authors support with information-theoretic analysis. Specifically, the IC layer, which applies BatchNorm followed by Dropout, makes neural activations more independent, reducing mutual information quadratically and correlation linearly as functions of the dropout parameter $p$.
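To make the construction concrete, the following is a minimal sketch of an IC layer in PyTorch, assuming 2D convolutional feature maps; the class name, module choices, and default dropout rate are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

class ICLayer(nn.Module):
    """Sketch of an Independent-Component (IC) layer: BatchNorm followed by
    Dropout, intended to sit immediately before a weight (conv/linear) layer."""

    def __init__(self, num_features: int, p: float = 0.05):
        super().__init__()
        # BatchNorm normalizes each activation to zero mean and unit variance.
        self.bn = nn.BatchNorm2d(num_features)
        # Dropout then reduces dependence between activations; the paper argues
        # correlation falls linearly and mutual information quadratically in p.
        self.drop = nn.Dropout(p)

    def forward(self, x):
        return self.drop(self.bn(x))
```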

Traditionally, whitening has been applied only as a preprocessing step, because performing it as a layer transform inside a network is computationally demanding. This paper breaks from that convention by approximating the effect of whitening inside the network with a computationally light construction that does not compromise performance. Through theoretical analysis, the paper shows that Dropout reduces the mutual information between pairs of neurons while retaining the information carried by each individual neuron, with the trade-off controlled by the dropout probability $p$.
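To illustrate the linear-in-$p$ claim for correlation, here is a short calculation (an illustrative sketch, not the paper's proof; the notation and the non-scaled dropout convention are assumptions). Suppose the inputs $a_i$ to the dropout stage are already BatchNorm-normalized, so each has zero mean and unit variance, and dropout multiplies each $a_i$ by an independent Bernoulli mask $m_i$ with keep probability $1-p$. Then

$$
\mathrm{Cov}(m_i a_i,\, m_j a_j) = (1-p)^2\,\mathrm{Cov}(a_i, a_j),
\qquad
\mathrm{Var}(m_i a_i) = (1-p)\,\mathbb{E}[a_i^2] = 1-p,
$$

so the correlation coefficient after dropout becomes

$$
\mathrm{Corr}(m_i a_i,\, m_j a_j)
= \frac{(1-p)^2\,\mathrm{Cov}(a_i, a_j)}{\sqrt{(1-p)(1-p)}}
= (1-p)\,\mathrm{Corr}(a_i, a_j),
$$

i.e., the pairwise correlation shrinks linearly in $p$, consistent with the paper's stated result.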

Experimental results support these claims. ResNet architectures modified to incorporate IC layers showed notable improvements in convergence speed and training stability. On CIFAR10, CIFAR100, and ILSVRC2012, the reformulated networks attained better performance than their traditional BatchNorm counterparts. The evidence indicates that placing IC layers before weight layers, rather than after activations as is typical, yields superior training behavior.
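For concreteness, below is a minimal sketch of how a basic residual block could be rearranged under the IC-before-weight ordering described above. The exact block layout, dropout rate, and the placement of the final addition are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ICBasicBlock(nn.Module):
    """Basic residual block with an IC layer (BatchNorm + Dropout)
    placed directly before each convolution, and ReLU after it."""

    def __init__(self, channels: int, p: float = 0.05):
        super().__init__()
        self.ic1 = nn.Sequential(nn.BatchNorm2d(channels), nn.Dropout(p))
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.ic2 = nn.Sequential(nn.BatchNorm2d(channels), nn.Dropout(p))
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(self.ic1(x)))
        out = self.conv2(self.ic2(out))
        return out + x  # identity shortcut (same channel count assumed)

# Example: run a dummy batch through the block.
if __name__ == "__main__":
    block = ICBasicBlock(channels=16)
    y = block(torch.randn(2, 16, 32, 32))
    print(y.shape)  # torch.Size([2, 16, 32, 32])
```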

The implications of these findings are substantial both theoretically and practically. Theoretically, the paper highlights a connection between activation independence and network training dynamics, advocating for refined placements and roles of normalization layers within deep architectures. Practically, it prompts a reevaluation of conventional network architectures, suggesting that optimal layer ordering can provide tangible boosts in training efficiency and outcome stability.

Future research could further explore this paradigm, potentially integrating advanced normalization techniques such as layer normalization, instance normalization, or group normalization within the IC layer framework. Investigating the scalability and adaptability of this approach across more diverse network architectures and task domains could also be fruitful, potentially offering a new standard in neural network design optimization practices.
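As one concrete direction, a hypothetical generalized IC layer could accept any normalization module in place of BatchNorm. The sketch below is purely illustrative; this generalization is not part of the paper, and the helper name and channel counts are made up.

```python
import torch.nn as nn

def make_ic_layer(norm: nn.Module, p: float = 0.05) -> nn.Sequential:
    """Hypothetical generalized IC layer: any normalization followed by Dropout."""
    return nn.Sequential(norm, nn.Dropout(p))

# Possible instantiations (the channel count of 64 is arbitrary):
ic_batch = make_ic_layer(nn.BatchNorm2d(64))
ic_group = make_ic_layer(nn.GroupNorm(num_groups=8, num_channels=64))
ic_instance = make_ic_layer(nn.InstanceNorm2d(64))
```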

In summary, this paper sets forth a strong conceptual and practical framework for improving DNN training through innovative rethinking of normalization and dropout strategies, paving the way for future exploration into more efficacious architectural designs in deep learning.
