- The paper demonstrates a setting in which an ℓ2-regularized neural network learns with O(d) samples while the NTK requires Ω(d²), yielding a provably smaller generalization gap.
- The paper introduces novel theoretical tools that establish lower bounds for kernel methods, revealing NTK's limitations in capturing informative features in high-dimensional settings.
- The paper proves that gradient descent with small added noise efficiently finds global optima of the regularized loss for infinite-width networks, and experiments confirm the predicted gains in test accuracy and margin.
Regularization Matters: Generalization and Optimization of Neural Nets vs. Their Induced Kernel
This paper conducts a rigorous investigation into the interplay between regularization, generalization, and optimization in neural networks compared to their kernel counterparts, specifically focusing on the Neural Tangent Kernel (NTK). The authors present a compelling argument that explicit regularization can significantly influence the generalization abilities of neural networks, particularly in settings where the NTK fails to perform efficiently.
Main Contributions
- Sample Complexity and Generalization Gap: The paper delivers a stark comparison between the sample complexity of neural networks and the NTK in a specifically constructed setting. It exhibits a distribution on which a regularized neural network learns effectively with O(d) samples, whereas the NTK requires Ω(d²) samples. This result is pivotal in showing that explicit ℓ2 regularization allows neural networks to attain lower generalization error by maximizing the network's margin, which the NTK framework cannot achieve with comparable efficiency.
- Theoretical Tools and Novel Lower Bounds: The authors develop new analytical tools to support their claims, including techniques for establishing lower bounds for kernel methods. They show that kernel methods such as the NTK cannot capture the informative features in high dimensions without a prohibitive increase in sample complexity. These techniques are of independent interest beyond this work, offering insight into the limitations of kernel methods on other complex data distributions.
- Optimization of Regularized Neural Networks: The paper proves that for infinite-width two-layer networks, gradient descent with small added noise finds global optima of the regularized neural net loss efficiently, in polynomial time. This result stands in contrast to prior work lacking explicit convergence rates, and it shows that regularized neural networks can be optimized in practice to achieve improved generalization.
- Margin Maximization through Regularization: Drawing on margin-based generalization theory, the authors show that the global minima of weakly-regularized logistic loss for neural networks are max-margin solutions. This property theoretically justifies the superior generalization observed in regularized neural networks and aligns with empirical findings. The paper establishes that these conclusions hold across varying depths and widths, providing a foundation for understanding the benefits of over-parameterization in neural networks.
- Experimental Validation: The paper includes experiments showing that explicit regularization yields higher test accuracy and larger-margin solutions than training without regularization. The experiments reinforce the theoretical claims, indicating that the benefits of regularization are not artifacts of specific assumptions but hold across different architectures and datasets.
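The optimization contribution, gradient descent with small added noise on an ℓ2-regularized two-layer network, can be sketched on a toy problem. This is an illustrative sketch only: the data, network width, step size, noise scale, and iteration count below are arbitrary choices for demonstration, not the paper's construction or schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification task: the label depends on a single coordinate.
n, d, m = 200, 5, 64                         # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])

# Two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x)
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)
lam = 1e-3                                   # l2 regularization strength (illustrative)

def forward(X, W, a):
    return np.maximum(X @ W.T, 0.0) @ a

def reg_loss(X, y, W, a):
    z = y * forward(X, W, a)
    return np.mean(np.logaddexp(0.0, -z)) + lam * (np.sum(W**2) + np.sum(a**2))

loss0 = reg_loss(X, y, W, a)
lr, noise = 0.2, 1e-4
for _ in range(1000):
    h = np.maximum(X @ W.T, 0.0)             # (n, m) hidden activations
    z = y * (h @ a)
    g = -y * 0.5 * (1.0 - np.tanh(z / 2.0))  # = -y * sigmoid(-z), stable form
    grad_a = (h.T @ g) / n + 2 * lam * a
    grad_W = (((X @ W.T) > 0) * g[:, None] * a[None, :]).T @ X / n + 2 * lam * W
    # Small Gaussian perturbations added to each gradient step
    W -= lr * (grad_W + noise * rng.normal(size=W.shape))
    a -= lr * (grad_a + noise * rng.normal(size=a.shape))

train_acc = np.mean(np.sign(forward(X, W, a)) == y)
final_loss = reg_loss(X, y, W, a)
```

On this separable toy data the regularized loss decreases and the network fits the training set; the noise here plays the role of the small perturbations in the paper's polynomial-time guarantee, though the guarantee itself concerns the infinite-width limit.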
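The margin-maximization claim can be illustrated in the simplest homogeneous case: ℓ2-regularized logistic regression, where shrinking the regularization strength drives the normalized margin of the minimizer toward the max margin. The data construction and hyperparameters below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (0.5 + np.abs(X[:, 0]))    # guarantees a margin along the first axis

def train(lam, steps=4000, lr=0.1):
    """Gradient descent on the l2-regularized logistic loss."""
    w = np.zeros(d)
    for _ in range(steps):
        z = y * (X @ w)
        s = y * 0.5 * (1.0 - np.tanh(z / 2.0))   # y * sigmoid(-z), stable form
        w -= lr * (-(s @ X) / n + 2 * lam * w)
    return w

def normalized_margin(w):
    return np.min(y * (X @ w)) / np.linalg.norm(w)

# Weaker regularization should give a larger normalized margin.
margins = [normalized_margin(train(lam)) for lam in (1e-1, 1e-2, 1e-3)]
```

As the regularization strength decreases, the normalized margin grows toward the max-margin (SVM) value, mirroring the paper's result that weakly-regularized logistic loss yields max-margin solutions for homogeneous models.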
Implications and Future Directions
The implications of these findings are multifaceted:
- Practical Regularization Strategies: Understanding how regularization shapes both the margin and generalization of neural networks can inform the design of more effective training strategies and model architectures for diverse applications.
- Kernel Method Development: The results suggest a need for reevaluating assumptions about the efficacy of kernel methods against adaptive neural architectures in high-dimensional spaces.
- Computational Limits: The paper paves the way for future exploration of whether kernel methods can, at any reasonable computational or sample cost, match the generalization achieved by neural networks.
In future research, extending the exploration to deeper architectures, non-homogeneous activations, and alternative regularization norms could prove insightful. Additionally, investigating the implications of these results on understanding implicit regularization in optimization dynamics, especially in settings with non-convex loss landscapes, could yield new perspectives on neural network learning dynamics.