- The paper introduces a scalable method for analyzing Hessian eigenvalue density, revealing that large isolated eigenvalues slow network convergence.
- The study shows that batch normalization mitigates the formation of these isolated eigenvalues, enhancing overall training efficiency.
- It employs advanced numerical techniques like the Lanczos algorithm to achieve high-precision Hessian spectrum estimates on large-scale models.
Overview of Neural Network Optimization via Hessian Eigenvalue Density
The paper investigates the optimization of deep neural networks (DNNs) through detailed analysis of the Hessian spectrum of the training loss. By developing a tool for estimating the full Hessian eigenvalue density, the authors test several hypotheses from the literature about the smoothness, curvature, and sharpness of the loss landscape.
The work diverges from prior studies, which were limited to small models or to computing only a handful of extreme eigenvalues, by enabling scalable, accurate analysis of contemporary networks trained on datasets at the scale of ImageNet. Notably, the experiments reveal that networks without batch normalization rapidly develop isolated large eigenvalues, which are largely absent in batch-normalized architectures. These isolated eigenvalues slow optimization: gradient energy concentrates in the corresponding eigenspaces, hindering progress along other directions.
Methodology
The authors employ techniques from numerical linear algebra to estimate the entire Hessian spectrum efficiently. Validated on small models, where it matches exact eigenvalue densities to roughly double-precision accuracy (errors on the order of 10⁻¹⁴), the method scales to models with tens of millions of parameters. By combining the Lanczos algorithm with concentration results for quadratic forms, the tool enables high-resolution tracking of the Hessian spectrum throughout DNN training.
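The sketch below is a minimal illustration of this style of estimator, not the authors' implementation. It runs plain Lanczos from random probe vectors using a caller-supplied Hessian-vector product `hvp` (an assumed callable, typically built with an autodiff framework) and reads quadrature nodes and weights off the resulting tridiagonal matrices; full reorthogonalization and the Gaussian smoothing needed to plot a density are omitted for brevity.

```python
import numpy as np

def lanczos_spectrum(hvp, dim, num_steps=90, num_probes=10, seed=0):
    """Approximate the Hessian eigenvalue density via Lanczos quadrature.

    hvp: callable v -> Hv (Hessian-vector product), assumed to be supplied
         by the caller, e.g. via automatic differentiation.
    dim: number of model parameters.
    Returns one (nodes, weights) pair per random probe; averaging the
    weighted point masses (and smoothing them) approximates the density.
    """
    rng = np.random.default_rng(seed)
    all_nodes, all_weights = [], []
    for _ in range(num_probes):
        # Random unit-norm probe; concentration of quadratic forms means a
        # handful of probes suffices in high dimension.
        v = rng.standard_normal(dim)
        v /= np.linalg.norm(v)
        alphas, betas = [], []
        v_prev = np.zeros(dim)
        beta = 0.0
        for _ in range(num_steps):
            w = hvp(v) - beta * v_prev           # three-term recurrence
            alpha = np.dot(w, v)
            w -= alpha * v
            beta = np.linalg.norm(w)
            alphas.append(alpha)
            if beta < 1e-10:                      # Krylov space exhausted
                break
            betas.append(beta)
            v_prev, v = v, w / beta
        # Eigendecomposition of the small tridiagonal matrix T yields
        # quadrature nodes (Ritz values) and weights (squared first
        # components of T's eigenvectors).
        off = betas[:len(alphas) - 1]
        T = np.diag(alphas) + np.diag(off, 1) + np.diag(off, -1)
        evals, evecs = np.linalg.eigh(T)
        all_nodes.append(evals)
        all_weights.append(evecs[0, :] ** 2)
    return all_nodes, all_weights
```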
Key Findings
- Eigenvalue Behavior: Large, isolated Hessian eigenvalues in non-batch-normalized networks are associated with slow convergence. The gradient's energy lies mostly along these outlier directions, so parameter updates make little progress elsewhere (see the sketch after this list for how that fraction can be measured).
- Batch Normalization Impact: Batch normalization suppresses the formation of isolated large eigenvalues, improving optimization efficiency. The authors also observe that it reduces the coupling between stochastic gradients and the outlier eigendirections.
- Practical Implications: The tool and findings illuminate the optimization dynamics of DNNs and add clarity to ongoing debates about loss-surface geometry and about design and training choices such as learning rates, residual connections, and batch normalization.
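As a concrete illustration of the quantity behind the first finding, the hypothetical helper below measures the fraction of squared gradient norm lying in the span of a few approximate top Hessian eigenvectors (e.g. Ritz vectors from a Lanczos run). The function and argument names are assumptions for this sketch, not the paper's API.

```python
import numpy as np

def gradient_energy_in_top_eigenspace(grad, top_eigenvectors):
    """Fraction of squared gradient norm captured by the outlier eigenspace.

    grad: flattened gradient vector, shape (dim,).
    top_eigenvectors: array of shape (k, dim) with orthonormal rows,
        e.g. approximate top eigenvectors of the Hessian.
    Values near 1 mean the gradient is dominated by the isolated
    directions, the regime the paper associates with slow optimization.
    """
    projections = top_eigenvectors @ grad      # coordinates in the eigenbasis
    return float(np.sum(projections ** 2) / np.dot(grad, grad))
```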
Implications and Future Directions
The implications of these findings are manifold. In practice, enhancing training dynamics through the control of eigenvalue distributions could lead to more efficient network training strategies and architectures. From a theoretical standpoint, the research provides a robust framework for examining the interplay of architectural features with optimization speed and stability.
Future work could apply this spectral-analysis methodology to architecturally diverse models, potentially guiding the design of networks with inherently favorable optimization dynamics. More speculatively, a refined understanding of curvature through Hessian spectra might also inform generalization heuristics, if curvature measures turn out to correlate with model performance.
In summary, the paper significantly contributes to the understanding of neural network optimization by unveiling the critical role of Hessian eigenvalue distributions, with batch normalization emerging as a pivotal mechanism in shaping these distributions. The presented analytical framework and conclusions will likely direct future inquiries into enhancing optimization methods and network design principles.