
How Does Batch Normalization Help Optimization? (1805.11604v5)

Published 29 May 2018 in stat.ML, cs.LG, and cs.NE

Abstract: Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.

Authors (4)
  1. Shibani Santurkar (26 papers)
  2. Dimitris Tsipras (22 papers)
  3. Andrew Ilyas (39 papers)
  4. Aleksander Madry (86 papers)
Citations (1,466)

Summary

  • The paper’s main contribution is revealing that BatchNorm smooths the loss landscape, making gradients more predictive and stable.
  • It disproves the common belief that BatchNorm’s success stems from mitigating internal covariate shift by demonstrating consistent performance even with induced noise.
  • Empirical and theoretical analyses highlight that improved Lipschitz continuity and β-smoothness contribute to faster convergence in BatchNorm models.

How Does Batch Normalization Help Optimization?

The paper "How Does Batch Normalization Help Optimization?" presents a thorough investigation into the effectiveness of Batch Normalization (BatchNorm) in the optimization of Deep Neural Networks (DNNs). The authors disprove the widely held belief that BatchNorm's success is attributable to its ability to reduce internal covariate shift (ICS). Instead, they reveal a more fundamental reason for BatchNorm's optimization benefits: it smoothens the optimization landscape, thereby making the gradients more predictive and stable.

Key Findings

Disproving the Link to Internal Covariate Shift

The paper begins by challenging the assumption that BatchNorm's primary contribution is the mitigation of ICS. Internal covariate shift refers to the change in the distribution of layer inputs due to updates in the preceding layers, potentially complicating the training process. Despite its widespread acceptance, the authors find minimal empirical support for this notion.

Two key results substantiate this claim:

  1. Experimental Evidence Against ICS Reduction: Experiments show that intentionally injecting noise (mimicking covariate shift) after BatchNorm layers does not degrade optimization performance (a sketch of this setup follows this list). This contradicts the idea that reducing ICS is necessary for BatchNorm's effectiveness.
  2. Quantitative Analysis of ICS: Metrics like the stability of activation distributions and the consistency of gradient directions reveal that BatchNorm does not significantly stabilize distributions compared to non-BatchNorm networks, and may even exhibit worse ICS metrics.
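
The noise-injection experiment referenced in item 1 can be approximated by following each BatchNorm layer with a random, time-varying scale and shift, which deliberately re-introduces distributional instability. The PyTorch module below is a rough sketch under that reading; the module name, noise magnitude, and placement are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class NoisyBatchNorm(nn.Module):
    """Hypothetical BatchNorm variant that re-injects covariate shift.

    After standard normalization, activations are rescaled and shifted by fresh
    random values at every training step, so the distribution feeding the next
    layer keeps changing even though BatchNorm is applied.
    """

    def __init__(self, num_features, noise_std=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)
        self.noise_std = noise_std

    def forward(self, x):
        x = self.bn(x)
        if self.training:
            # Per-feature random scale and shift, resampled each forward pass.
            scale = 1.0 + self.noise_std * torch.randn(x.size(1), device=x.device)
            shift = self.noise_std * torch.randn(x.size(1), device=x.device)
            x = x * scale + shift
        return x
```

Dropping such a module in place of a plain BatchNorm1d layer and comparing training curves mirrors the spirit of the experiment: if reducing ICS were essential, the noisy variant should train noticeably worse, which the paper finds it does not.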

Smoothness in the Optimization Process

The authors then pivot to identify what truly contributes to BatchNorm's success. They assert that BatchNorm enhances the smoothness of the optimization landscape. This is articulated through several observations and theoretical results:

  1. Lipschitz Continuity and Gradient Predictiveness: BatchNorm yields a loss function with a smaller effective Lipschitz constant, so the loss changes less abruptly and the gradients are more reliable. This allows for larger and more stable gradient steps during training.
  2. Empirical Smoothness Analysis: Empirical evidence demonstrates that models with BatchNorm exhibit less variability in both the loss values and gradient magnitudes when exploring the loss surface along gradient directions (a sketch of such a probe follows this list). This reduction in gradient variability supports faster convergence.
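
In the spirit of this analysis, one can probe the landscape by stepping from the current parameters along the gradient direction with several step sizes and recording how much the loss and the gradient change. The helper below is an illustrative sketch, not the paper's measurement code; the function name, step-size grid, and use of a single batch are assumptions.

```python
import torch

def landscape_stats(model, loss_fn, batch, step_sizes=(0.01, 0.05, 0.1, 0.25, 0.5)):
    """Measure loss and gradient variation along the current gradient direction."""
    inputs, targets = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient at the current point.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)
    original = [p.detach().clone() for p in params]

    losses, grad_changes = [], []
    for eta in step_sizes:
        # Move to p0 - eta * g and re-evaluate the loss and gradient there.
        with torch.no_grad():
            for p, p0, g in zip(params, original, grads):
                p.copy_(p0 - eta * g)
        new_loss = loss_fn(model(inputs), targets)
        new_grads = torch.autograd.grad(new_loss, params)
        losses.append(new_loss.item())
        diff_sq = sum(((g - ng) ** 2).sum() for g, ng in zip(grads, new_grads))
        grad_changes.append(diff_sq.sqrt().item())

    # Restore the original parameters.
    with torch.no_grad():
        for p, p0 in zip(params, original):
            p.copy_(p0)
    return losses, grad_changes
```

A narrow spread of the recorded losses and small gradient changes across step sizes correspond to the smoother, more predictable landscape the paper reports for BatchNorm networks.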

Theoretical Justifications

To provide a theoretical foundation, the paper explores the Lipschitz properties and β-smoothness of loss and gradients. The key theoretical results include:

  1. Improved Lipschitzness: The reparameterization induced by BatchNorm rescales and bounds the gradient of the loss with respect to the activations, reducing the loss function's effective Lipschitz constant. This caps the rate at which the loss can change, contributing to training stability.
  2. Enhanced β-Smoothness: BatchNorm also improves the smoothness of the gradients themselves, meaning that the gradients do not exhibit abrupt changes, allowing for more predictive and reliable updates.
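
For concreteness, the two properties invoked here have standard definitions, stated below for a generic function f; in the paper they are applied to the loss viewed as a function of the activations or weights.

```latex
% f is L-Lipschitz: the function value changes at a bounded rate.
\lvert f(x_1) - f(x_2) \rvert \le L \,\lVert x_1 - x_2 \rVert
% f is \beta-smooth: the gradient itself is \beta-Lipschitz, so a gradient
% computed at the current point remains informative after taking a step.
\lVert \nabla f(x_1) - \nabla f(x_2) \rVert \le \beta \,\lVert x_1 - x_2 \rVert
```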

Generalization to Other Normalization Techniques

The paper further investigates whether the smoothing effect is unique to BatchNorm. By analyzing other normalization schemes (e.g., ℓ_p-norm-based normalization strategies), the authors find that several alternative techniques provide similar benefits in terms of optimization performance and landscape smoothness. This suggests that while BatchNorm is highly effective, it is not uniquely superior in its ability to smooth the optimization landscape.
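
As an illustration, one way to build such an ℓ_p-based variant is to normalize the centered activations by a batch-averaged ℓ_p norm instead of the standard deviation. The NumPy sketch below is one plausible instantiation; the exact normalization constant and centering choice are assumptions, not necessarily the variants evaluated in the paper.

```python
import numpy as np

def lp_norm_forward(x, gamma, beta, p=1, eps=1e-5):
    """Normalize each feature of x (batch, features) by a batch-averaged l_p norm."""
    mu = x.mean(axis=0)
    centered = x - mu
    # (mean |x - mu|^p)^(1/p) per feature; for p=2 this recovers the standard deviation.
    scale = (np.abs(centered) ** p).mean(axis=0) ** (1.0 / p)
    return gamma * centered / (scale + eps) + beta
```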

Implications and Future Directions

The findings in this paper have significant implications:

  1. Practical Optimization: Understanding that the effectiveness of BatchNorm is due to its impact on optimization smoothness, rather than reduction of ICS, can lead to more informed choices in model architecture and training strategies. This shifts the focus towards exploring other normalization techniques that might offer even better performance.
  2. Theoretical Insights: The paper prompts a reevaluation of several entrenched theories in deep learning literature. It encourages further research into the mathematical properties underpinning optimization in deep networks.
  3. Generalization and Robustness: Although this work primarily focuses on training stability and efficiency, the smoothing effects observed might also contribute to the generalization capabilities of the trained models. Future research could investigate how these optimization properties impact generalization performance.

Conclusion

The research by Santurkar et al. offers a nuanced understanding of BatchNorm, shifting the narrative from internal covariate shift reduction to a more fundamental reshaping of the optimization landscape. This insight not only clarifies the mechanism through which BatchNorm facilitates efficient training but also opens avenues for further advancements in the training methodologies of DNNs. The results urge a broader and deeper exploration of normalization techniques, pushing the field towards increasingly robust and efficient machine learning models.
