An Analytical Examination of L2 Regularization with Normalization Techniques in Deep Learning
The paper "L2 Regularization versus Batch and Weight Normalization" investigates the role of regularization in the context of normalized deep neural networks. It questions the assumed purpose of regularization to mitigate overfitting in models employing normalization techniques such as Batch Normalization (BN), Weight Normalization (WN), and Layer Normalization (LN). Through theoretical exposition and experimental validation, the paper demonstrates that regularization's role deviates significantly from traditional assumptions when applied in conjunction with normalization methods.
Main Findings
- Lack of Regularizing Effect with Normalization: The paper argues that L2 regularization does not exert a classical regularizing influence in neural networks that use normalization. Instead, it affects only the scale of the weights, which in turn changes the effective learning rate. The reason is that a normalized function is invariant to the scaling of its weights, which neutralizes the regularizing intent (see the numerical sketch after this list).
- Decoupling from Overfitting Prevention: Contrary to common belief, the strength of the L2 penalty modulates the effective learning rate rather than acting as a countermeasure against overfitting. The penalty shrinks the magnitude of the weights, but because the normalized function is invariant to weight scaling, this has no effect on the complexity of the function being learned.
- Normalization's Impact on Effective Learning Rate: The interplay between weight scale and effective learning rate becomes the critical factor. As the L2 penalty drives the weights toward smaller norms, it inadvertently increases the effective learning rate, conflicting with the intuitive goal of regularization as a stabilizer.
- Behavior of Popular Optimization Methods: The analysis extends to Adam and RMSProp, showing that they only partially compensate for the dependence of the effective learning rate on the weight scale; the scaling relations are sketched after this list.
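To make the invariance argument concrete, here is a minimal numerical sketch (illustrative only, not code from the paper): a batch-normalized linear layer whose loss is unchanged when the weights are rescaled, while its gradient shrinks by the same factor, which is precisely how a fixed learning rate turns into a larger effective learning rate on smaller weights.

```python
import numpy as np

def batch_norm(z, eps=1e-12):
    """Normalize each feature of a batch to zero mean and unit variance."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def loss(w, x, y):
    """Toy scalar loss: batch-normalized linear layer followed by squared error."""
    z = batch_norm(x @ w)          # pre-activations, normalized over the batch
    return np.mean((z.squeeze() - y) ** 2)

def numerical_grad(f, w, h=1e-6):
    """Central-difference gradient of f with respect to w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e.flat[i] = h
        g.flat[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))       # batch of 64 inputs with 5 features
y = rng.normal(size=64)            # arbitrary regression targets
w = rng.normal(size=(5, 1))

alpha = 10.0                       # rescale the weights by a constant factor

# 1) The loss is invariant to the scale of w ...
print(loss(w, x, y), loss(alpha * w, x, y))            # nearly identical values

# 2) ... but the gradient shrinks by the same factor, so a fixed learning rate
#    acts like a larger one on the rescaled (smaller-norm) weights.
g  = numerical_grad(lambda v: loss(v, x, y), w)
ga = numerical_grad(lambda v: loss(v, x, y), alpha * w)
print(np.linalg.norm(g) / np.linalg.norm(ga))          # ratio close to alpha
```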
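The scaling relations behind the points above can be summarized as follows, stated only up to proportionality (the paper gives the precise derivations). Because a normalized network satisfies f(αw) = f(w), its gradient scales inversely with the weight norm, so plain SGD behaves as if the learning rate were divided by the squared weight norm, whereas Adam and RMSProp, which rescale gradients by their own magnitude, remove only one factor of the weight norm and therefore compensate only partially:

```latex
% Scaling behaviour of a scale-invariant (normalized) network f, stated up to constants.
\[
  f(\alpha w) = f(w)
  \;\Longrightarrow\;
  \nabla_w f(\alpha w) = \frac{1}{\alpha}\,\nabla_w f(w),
  \qquad
  \eta_{\text{eff}}^{\text{SGD}} \propto \frac{\eta}{\lVert w \rVert_2^{2}},
  \qquad
  \eta_{\text{eff}}^{\text{Adam/RMSProp}} \propto \frac{\eta}{\lVert w \rVert_2}.
\]
```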
Experimental Validation
The experimental results support the theoretical analysis by demonstrating a clear correlation between the L2 regularization parameter and the effective learning rate across optimization schemes. A series of experiments on the CIFAR-10 dataset illustrates how the weight scale is governed by the regularization strength, in particular when training with Nesterov momentum and with Adam. These results highlight the largely overlooked influence of weight regularization on the dynamics and effectiveness of training in normalized neural networks.
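The measurement itself is easy to reproduce in spirit. The sketch below is a hypothetical set-up, not the paper's code, and it uses synthetic data in place of CIFAR-10 purely to keep it self-contained: it trains a small batch-normalized network with SGD plus Nesterov momentum under different weight-decay settings and logs the norm of the convolution weights together with the resulting effective learning rate η/‖w‖².

```python
import torch
import torch.nn as nn

# Hypothetical monitoring sketch: for each weight-decay setting, train briefly and log the
# norm of the convolution weights (scale-invariant under the following BatchNorm) and the
# corresponding effective learning rate eta / ||w||^2.
torch.manual_seed(0)
x = torch.randn(512, 3, 8, 8)                     # synthetic stand-in for image batches
y = torch.randint(0, 10, (512,))

for weight_decay in (0.0, 1e-3, 1e-2):
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
    )
    lr = 0.1
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=0.9, nesterov=True, weight_decay=weight_decay)
    for step in range(200):
        idx = torch.randint(0, 512, (64,))
        loss = nn.functional.cross_entropy(model(x[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    w_norm = model[0].weight.norm().item()
    print(f"wd={weight_decay:.0e}  ||w||={w_norm:.3f}  eta_eff~{lr / w_norm**2:.4f}")
```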
Implications
- Practical Implications: For practitioners, the paper suggests reconsidering the necessity and typical usage of L2 regularization in networks with normalization. Adjusting common practices around learning rate schedules and weight regularization could improve training efficacy when normalization is employed (a minimal sketch follows this list).
- Theoretical Insights: The insights provided challenge traditional views on regularization and normalization, suggesting a more nuanced understanding is needed regarding their interaction. This may inform future research aspiring to develop regularization strategies that are inherently compatible with normalization.
- Potential Future Research: Future work may explore regularization techniques designed around the scale invariance introduced by normalization, or empirically validated schemes for learning rate adjustment that stabilize the training of normalized networks.
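As a concrete illustration of the practical point above (a common practice used here as an assumption, not a prescription from the paper), one adjustment is to apply weight decay selectively via optimizer parameter groups, keeping normalization and bias parameters decay-free so that any L2 penalty on the remaining, scale-invariant weights can be tuned deliberately as an effective-learning-rate control rather than as an overfitting remedy:

```python
import torch
import torch.nn as nn

# Example network; the Linear layer assumes 32x32 (CIFAR-sized) inputs.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Heuristic: 1-D tensors are biases or BatchNorm affine parameters; exempt them
    # from weight decay. The convolution weights (followed by BN) remain in the decay
    # group, where the penalty mainly acts on the effective learning rate.
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9, nesterov=True,
)
```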
In conclusion, this paper contributes a nuanced perspective on L2 regularization in the context of normalization, adding to the ongoing discussion on how to train deep learning models efficiently and effectively. The interplay between regularization strength and normalization warrants further exploration, with implications for both theoretical understanding and practical application in the deep learning community.