On the Power-Law Hessian Spectrums in Deep Learning (2201.13011v2)

Published 31 Jan 2022 in cs.LG, physics.bio-ph, and q-bio.BM

Abstract: It is well-known that the Hessian of the deep loss landscape matters to the optimization, generalization, and even robustness of deep learning. Recent works empirically discovered that the Hessian spectrum in deep learning has a two-component structure that consists of a small number of large eigenvalues and a large number of nearly-zero eigenvalues. However, the theoretical mechanism or the mathematics behind the Hessian spectrum is still largely under-explored. To the best of our knowledge, we are the first to demonstrate that the Hessian spectrums of well-trained deep neural networks exhibit simple power-law structures. Inspired by statistical physics theories and the spectral analysis of natural proteins, we provide a maximum-entropy theoretical interpretation to explain why the power-law structures exist and suggest a spectral parallel between protein evolution and the training of deep neural networks. By conducting extensive experiments, we further use the power-law spectral framework as a useful tool to explore multiple novel behaviors of deep learning.

Citations (7)

Summary

  • The paper provides robust empirical evidence that well-trained deep neural networks exhibit power-law distributed Hessian eigenvalues across varied architectures and datasets.
  • It introduces a maximum entropy framework that theoretically explains the emergence of power-law structures in the Hessian spectrum and links them to optimization stability.
  • The study demonstrates that the power-law slope metric correlates with generalization performance, guiding improvements in training dynamics and model robustness.

On the Power-Law Hessian Spectrums in Deep Learning

The paper "On the Power-Law Hessian Spectrums in Deep Learning" provides an in-depth exploration of a phenomenon observed in the training of deep neural networks (DNNs): the emergence of power-law distributions in the spectrum of the Hessian matrix. The Hessian is critical in understanding the optimization landscape, generalization capabilities, and robustness of DNNs. Prior work has noted that the Hessian spectrum often exhibits a two-component structure with a few large eigenvalues and a multitude of near-zero eigenvalues, yet a comprehensive theoretical explanation has been lacking.

Key Contributions and Findings

  1. Empirical Discovery of Power-Law Structures: The authors provide significant empirical evidence showing that well-trained DNNs exhibit a power-law distribution in their Hessian eigenvalues. This discovery is grounded in extensive experimental evaluation across several datasets, including MNIST and CIFAR-10/100, and different model architectures such as LeNet and ResNet18. The power-law behavior was consistently observed across various training conditions and optimizers, reaffirming its robustness.
  2. Theoretical Interpretation Using Maximum Entropy: The paper introduces a maximum-entropy-based theoretical framework to explain the emergence of power-law distributions in Hessian spectrums. This approach draws parallels with statistical physics, positing that trained DNNs reach an entropy-maximizing state that naturally leads to power-law structured spectra. This insight significantly adds to the theoretical understanding of why such spectral properties arise in trained models (a generic derivation sketch is given after this list).
  3. Implications for Deep Learning Dynamics: The research ties the power-law spectrum to several key aspects of deep learning (a slope-fitting sketch follows this list):
     - Robustness of Learning Spaces: The power-law structure entails the large eigengaps needed for the robustness of the low-dimensional spaces in which learning occurs, with theoretical guarantees via the Davis-Kahan theorem.
     - Generalization and Overfitting: The slope magnitude of the fitted power-law line, denoted \hat{s}, serves as a novel metric for predicting the sharpness of minima and generalization performance; a smaller \hat{s} indicates flatter minima and better generalization.
     - Training Dynamics and Optimization: The spectrum's adherence to power laws across training settings suggests both fundamental constraints and opportunities to choose batch sizes and learning rates more effectively.

  4. Comparison with Protein Science: A novel interdisciplinary conjecture is offered by drawing an analogy between protein structures and deep learning models. Just as proteins exhibit power-law behavior in their vibrational spectra, deep networks do so in their Hessian spectrums, suggesting universal principles governing complex systems.
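
To make the maximum-entropy mechanism in item 2 concrete, the following is an illustrative sketch only: it assumes a particular constraint (a fixed mean of log-eigenvalue), which need not match the paper's exact derivation, but it shows how entropy maximization generically produces a power-law density.

```latex
% Illustrative maximum-entropy derivation of a power law (a sketch under an
% assumed constraint; the paper's own derivation may differ in detail).
% Maximize the entropy of the eigenvalue density p(\lambda) subject to
% normalization and a fixed mean of \log\lambda:
\max_{p}\ -\!\int p(\lambda)\,\ln p(\lambda)\,d\lambda
\quad\text{s.t.}\quad
\int p(\lambda)\,d\lambda = 1,
\qquad
\int p(\lambda)\,\ln\lambda\,d\lambda = \mu .
% Stationarity of the Lagrangian,
%   -\ln p(\lambda) - 1 - \alpha - \beta\,\ln\lambda = 0,
% yields the exponential-family solution
p(\lambda) \;\propto\; e^{-\beta\ln\lambda} \;=\; \lambda^{-\beta},
% i.e. a power-law density whose exponent \beta is fixed by the constraint value \mu.
```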
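
The slope metric \hat{s} from item 3 is obtained by fitting a line to the sorted eigenvalues in log-log coordinates. The minimal Python sketch below is not the authors' code: it synthesizes a power-law spectrum in place of real Hessian eigenvalues (which for a DNN would typically come from an iterative eigensolver such as Lanczos) and recovers the slope by least squares.

```python
# Minimal sketch (not the authors' code): estimate the power-law slope
# metric \hat{s} by least-squares fitting in log-log coordinates.
# A synthetic power-law spectrum keeps the example self-contained.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "top-k" Hessian eigenvalues: lambda_k ~ lambda_1 * k^(-s), with noise.
k = np.arange(1, 201)                  # ranks 1..200
true_s, lam1 = 0.8, 50.0               # assumed slope and largest eigenvalue
eigvals = lam1 * k.astype(float) ** (-true_s) * np.exp(0.05 * rng.standard_normal(k.size))

def powerlaw_slope(eigvals: np.ndarray) -> float:
    """Fit log(lambda_k) = log(lambda_1) - s*log(k) and return the slope magnitude s."""
    lam = np.sort(eigvals)[::-1]       # eigenvalues in descending order
    ranks = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(lam), deg=1)
    return -slope                      # \hat{s} is the magnitude of the fitted slope

s_hat = powerlaw_slope(eigvals)
print(f"fitted slope s_hat = {s_hat:.3f} (ground truth {true_s})")
```

A smaller fitted \hat{s} corresponds to a flatter decay of the spectrum, which the paper associates with flatter minima and better generalization.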

Implications and Future Directions

This work opens several avenues for future research in the theoretical and empirical domains of machine learning. The theoretical constructs and empirical evidence provided form a robust foundation for exploring the spectral properties of the loss landscape in greater depth. Future research may focus on:

  • Extending the understanding of power-law dynamics to other components of the training process, such as the gradient noise covariance.
  • Investigating how power-law properties influence the design and training of novel neural architectures and optimization strategies.
  • Delving deeper into the interdisciplinary connections hinted at with protein science, potentially unveiling broader applications.

The findings underline the complexity and interconnectedness of components within DNNs, offering a new lens to evaluate and enhance both theoretical frameworks and practical applications in deep learning. The elucidation of power-law spectrums ties together threads across multiple domains, potentially catalyzing new breakthroughs in understanding the fundamental behaviors of neural networks.
