A Fine-Grained Spectral Perspective on Neural Networks (1907.10599v4)

Published 24 Jul 2019 in cs.LG, cs.NE, and stat.ML

Abstract: Are neural networks biased toward simple functions? Does depth always help learn more complex features? Is training the last layer of a network as good as training all layers? How to set the range for learning rate tuning? These questions seem unrelated at face value, but in this work we give all of them a common treatment from the spectral perspective. We will study the spectra of the Conjugate Kernel, CK, (also called the Neural Network-Gaussian Process Kernel), and the Neural Tangent Kernel, NTK. Roughly, the CK and the NTK tell us respectively "what a network looks like at initialization" and "what a network looks like during and after training." Their spectra then encode valuable information about the initial distribution and the training and generalization properties of neural networks. By analyzing the eigenvalues, we lend novel insights into the questions put forth at the beginning, and we verify these insights by extensive experiments of neural networks. We derive fast algorithms for computing the spectra of CK and NTK when the data is uniformly distributed over the boolean cube, and show this spectra is the same in high dimensions when data is drawn from isotropic Gaussian or uniformly over the sphere. Code replicating our results is available at github.com/thegregyang/NNspectra.

Summary

  • The paper challenges the universal simplicity bias by showing that network function complexity varies with activation, depth, and weight variance.
  • The paper demonstrates that there exists an optimal network depth that maximizes feature learning before excessive depth degrades performance.
  • The paper reveals how spectral analysis of CK and NTK can guide hyperparameter tuning, including predicting maximal learning rates for efficient SGD training.

A Fine-Grained Spectral Perspective on Neural Networks

In this paper, the authors examine the spectral properties of neural networks through a detailed analysis of two central objects: the Conjugate Kernel (CK) and the Neural Tangent Kernel (NTK). The motivation is to answer several fundamental questions about neural networks: whether they are biased toward simpler functions, what role depth plays in learning complex features, and how to approach practical hyperparameter tuning. The paper treats these questions rigorously through a spectral analysis of the CK and NTK and draws out the implications for the initialization, training, and generalization behavior of neural networks.

Spectral Analysis of CK and NTK

The CK and NTK offer complementary views of a neural network across its lifecycle. The CK describes the network at initialization, capturing the Gaussian process distribution of an infinitely wide network, while the NTK governs training, describing how the network evolves as an effectively linear model around its initialization. The authors derive efficient algorithms to compute the spectra of the CK and NTK when the data is uniformly distributed over the boolean cube, and show that in high dimensions the spectra are essentially the same for isotropic Gaussian data or data uniform on the sphere.
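
The page itself contains no code, but as a rough illustration of what these two kernels are, here is a minimal sketch of the standard infinite-width CK (NNGP) and NTK recursions for a bias-free ReLU MLP under He-style initialization (sigma_w^2 = 2, unit-norm inputs). The function names and this particular parametrization are illustrative assumptions, not the paper's exact setup, which also covers erf activations and general sigma_w, sigma_b.

```python
import numpy as np

def _relu_cov(c):
    # 2 * E[relu(u) relu(v)] for (u, v) ~ N(0, [[1, c], [c, 1]]); the arc-cosine
    # kernel, with the factor 2 coming from sigma_w^2 = 2
    c = np.clip(c, -1.0, 1.0)
    return (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / np.pi

def _relu_der_cov(c):
    # 2 * E[relu'(u) relu'(v)] for the same bivariate Gaussian
    c = np.clip(c, -1.0, 1.0)
    return (np.pi - np.arccos(c)) / np.pi

def ck_ntk(X, depth):
    """Infinite-width CK (NNGP kernel) and NTK Gram matrices of a bias-free ReLU MLP
    with `depth` hidden layers and sigma_w^2 = 2.  Rows of X are assumed unit-norm."""
    ck = X @ X.T          # layer-0 kernel: plain inner products (diagonal = 1)
    ntk = ck.copy()
    for _ in range(depth):
        ck_next = _relu_cov(ck)                    # Sigma^(l)
        ntk = ck_next + _relu_der_cov(ck) * ntk    # Theta^(l) = Sigma^(l) + Sigma_dot^(l) * Theta^(l-1)
        ck = ck_next
    return ck, ntk

# toy usage: 8 random unit vectors in R^16
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
ck, ntk = ck_ntk(X, depth=3)
print(ck.shape, ntk.shape)   # (8, 8) (8, 8)
```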

Key Findings and Contributions

  1. Simplicity Bias: The authors refute the universality of the simplicity bias in neural networks. While prior work suggested that neural networks favor simple functions, this paper shows that the bias is non-universal and depends heavily on the activation function, weight variance, and depth. Specifically, relu networks exhibit a strong simplicity bias, concentrating spectral mass on low-degree polynomials, whereas erf networks with sufficient depth and weight variance can behave like white noise, eliminating the bias entirely (a sketch of how these degree-by-degree eigenvalues can be computed follows this list).
  2. Optimal Network Depth: The research introduces the notion of an optimal depth that varies with the complexity of the target function. Contrary to the idea that deeper networks are inherently better, the paper finds that depth improves the kernels' ability to express complex features only up to a point: there is a depth that maximizes learnability of a given target, beyond which performance degrades.
  3. Impact of Hyperparameters: By examining the NTK spectrum, the authors give a nuanced account of how hyperparameters affect learning. Training all layers (NTK dynamics) learns complex features better than training only the last layer (CK dynamics), consistent with the NTK being biased toward more complex functions than the CK.
  4. Maximal Learning Rate Prediction: The paper predicts the maximal learning rate for stochastic gradient descent (SGD) on realistic datasets such as MNIST and CIFAR10, as well as on synthetic data distributions. This prediction gives a reliable upper bound for learning rate tuning and makes training more efficient.
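
To make the degree-by-degree picture concrete, the sketch below computes the eigenvalues of any kernel of the form K(x, y) = phi(x.y / d) on the boolean cube {-1, +1}^d: the eigenfunctions are the degree-k monomials, and the degree-k eigenvalue is an expectation over z = x * y (coordinatewise), evaluated here by a naive double sum over Hamming weights. The helper names and the per-degree "fraction of variance" readout are illustrative assumptions; the paper's fast algorithm and its fractional variance heuristic are defined more carefully.

```python
import numpy as np
from math import comb

def boolean_cube_eigenvalues(phi, d, max_degree=None):
    """Eigenvalues mu_k (indexed by polynomial degree k) of K(x, y) = phi(x.y / d)
    on {-1, +1}^d.  Each degree-k eigenvalue has multiplicity C(d, k).
    mu_k = E_z[phi(sum(z)/d) * z_1 ... z_k], z uniform on {-1, +1}^d."""
    if max_degree is None:
        max_degree = d
    mus = []
    for k in range(max_degree + 1):
        mu = 0.0
        for j in range(k + 1):               # number of +1s among the k "active" coordinates
            for m in range(d - k + 1):       # number of +1s among the other d - k coordinates
                c = (2 * (j + m) - d) / d    # normalized inner product x.y / d
                w = comb(k, j) * comb(d - k, m) * (-1) ** (k - j)
                mu += w * phi(c)
        mus.append(mu / 2 ** d)
    return np.array(mus)

def relu_ck_phi(c, depth=2):
    # CK of a depth-`depth` ReLU MLP as a function of the normalized inner product c
    # (same arc-cosine recursion as in the earlier sketch)
    c = np.clip(c, -1.0, 1.0)
    for _ in range(depth):
        c = (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / np.pi
    return c

d = 16
mu = boolean_cube_eigenvalues(lambda c: relu_ck_phi(c, depth=2), d)
mult = np.array([comb(d, k) for k in range(d + 1)])
frac_var = mu * mult / (mu * mult).sum()     # share of kernel variance at each degree
print(np.round(frac_var[:5], 4))             # for relu, the mass concentrates on low degrees
```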

Implications and Future Directions

This research has both theoretical and practical implications. The insight that networks favor certain function complexities, depending on activation, depth, and weight variance, can inform architecture design and initialization strategies, and the identification of an optimal network depth offers concrete guidance for building more efficient models. On the practical side, the paper's method for predicting the maximal learning rate addresses a persistent challenge in hyperparameter optimization and can reduce the cost of tuning experiments.
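
As a rough sense of how a kernel spectrum constrains the learning rate, the sketch below applies the classical stability condition for gradient descent on a sum-of-squares loss in the linearized (kernel) regime: step sizes above 2 / lambda_max of the NTK Gram matrix diverge. This is a generic stand-in rather than the paper's exact prediction (which covers SGD and depends on parametrization and loss normalization), and the Gram matrix here is a random positive semidefinite placeholder; in practice one would use an NTK Gram matrix such as the one from the first sketch.

```python
import numpy as np

def max_stable_lr(ntk_gram):
    # Stability bound for gradient descent on a *summed* squared loss in the
    # linearized regime; with a mean (1/n) loss the bound scales by n.
    lam_max = np.linalg.eigvalsh(ntk_gram).max()
    return 2.0 / lam_max

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
ntk_gram = A @ A.T / 8.0     # placeholder symmetric PSD matrix for the demo
print(max_stable_lr(ntk_gram))
```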

Future research could further refine the fractional variance heuristic introduced for understanding generalization properties, aiming for greater precision in predicting test losses. Additionally, extending this spectral analysis framework to other neural architectures, such as convolutional or recurrent networks, may yield broader insights across the deep learning field. In conclusion, this paper provides a compelling spectral perspective on neural networks, challenging prevailing assumptions and paving the way for more informed neural network design and training techniques.
