
The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies (1906.00425v3)

Published 2 Jun 2019 in cs.LG, eess.SP, and stat.ML

Abstract: We study the relationship between the frequency of a function and the speed at which a neural network learns it. We build on recent results that show that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system. When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions. We derive the corresponding eigenvalues for each frequency after introducing a bias term in the model. This bias term had been omitted from the linear network model without significantly affecting previous theoretical results. However, we show theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low frequency functions with odd frequencies. Our results lead to specific predictions of the time it will take a network to learn functions of varying frequency. These predictions match the empirical behavior of both shallow and deep networks.

Authors (4)
  1. Ronen Basri (42 papers)
  2. David Jacobs (36 papers)
  3. Yoni Kasten (29 papers)
  4. Shira Kritchman (3 papers)
Citations (201)

Summary

  • The paper introduces a bias term into the spectral analysis of overparameterized networks, enabling the learning of odd-frequency functions that are otherwise unreachable.
  • It demonstrates that low-frequency functions converge faster than high-frequency ones, with the time to learn a frequency-k component scaling as O(k²).
  • Experimental validations confirm that incorporating bias corrects convergence issues in overparameterized neural networks.

Analysis of Neural Network Convergence with Bias: Frequency-based Dynamics

The paper "The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies" by Basri, Jacobs, Kasten, and Kritchman provides an in-depth theoretical and empirical analysis of how neural networks learn functions characterized by different frequencies. This paper leverages recent insights into the dynamics of overparameterized neural networks when trained with gradient descent, particularly focusing on spectral properties of these processes.

The core of the analysis builds on the observation that the training dynamics of overparameterized networks can be well approximated by a linear system, and that when the training data are uniformly distributed over a hypersphere the eigenfunctions of this system are the spherical harmonics. The authors' central contribution is to derive the eigenvalue associated with each frequency once a bias term is included in the model, a term that earlier analyses had omitted without significantly affecting their theoretical results.
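
In this linearized picture, the training residual decays independently along each eigenfunction. A schematic version of the decomposition is given below; the notation is chosen here for illustration, and the paper's exact kernel and constants differ:

$$
\frac{d}{dt}\, r(t) \approx -H\, r(t), \qquad r(t) = u(t) - y,
$$

$$
f^*(x) = \sum_{k,j} a_{k,j}\, Y_{k,j}(x) \quad\Longrightarrow\quad r_{k,j}(t) \approx e^{-\lambda_k t}\, a_{k,j},
$$

where $H$ is the limiting kernel of the linearized dynamics, $Y_{k,j}$ are the spherical harmonics of frequency $k$, and $\lambda_k$ is the eigenvalue shared by all harmonics of that frequency. The time needed to fit a frequency-$k$ component to accuracy $\epsilon$ is then roughly $\lambda_k^{-1} \log(1/\epsilon)$. The paper computes $\lambda_k$ for the model with bias and shows that, without bias, odd frequencies $k \geq 3$ lie in the null space of the kernel and are never learned.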

Key Contributions and Results

  1. Spectral Eigenanalysis with Bias:
    • The paper breaks new ground by introducing bias terms into the spectral analysis of neural networks. This adjustment is pivotal because networks without bias cannot represent or learn certain frequency components, specifically simple low-frequency functions of odd frequency.
  2. Convergence Rate Predictions:
    • Through both theoretical derivations and experimental validations, the authors show that learning speed is governed by a function's frequency content. Low-frequency functions are learned more quickly than high-frequency ones, a prediction supported by empirical data showing the time to learn a component of frequency $k$ scaling as $O(k^2)$.
  3. Bias Impact on Odd Frequencies:
    • Notably, when bias is absent, as shown in the experimental section, networks fail to learn simple functions with odd frequencies $k \geq 3$. With bias introduced, these odd frequencies are no longer in the null space and can be learned effectively, converging at rates comparable to those of even frequencies (see the code sketch after this list).
  4. Experimental Validation:
    • By analyzing a variety of network architectures, including deeper networks with and without skip connections, the observed convergence rates consistently follow the theoretical predictions, thus confirming the robustness of the frequency-based analysis.
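
The qualitative claims in items 3 and 4 can be probed with a small numerical experiment. The sketch below (written for illustration; the width, learning rate, and epoch budget are arbitrary choices, not the paper's configuration) fits a pure-frequency target $\sin(k\theta)$ on the unit circle with a wide two-layer ReLU network, with and without bias terms, and reports how many epochs are needed to reach a fixed error threshold:

```python
# Illustrative sketch (not the authors' code): compare how quickly a wide
# two-layer ReLU network fits sin(k * theta) on the unit circle with and
# without bias terms. Hyperparameters here are arbitrary choices.
import math
import torch

def make_data(n=256, k=3):
    theta = torch.linspace(0, 2 * math.pi, n + 1)[:-1]
    x = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)  # points on S^1
    y = torch.sin(k * theta).unsqueeze(1)                         # frequency-k target
    return x, y

def epochs_to_fit(k, use_bias, width=2048, lr=0.1, epochs=20000, tol=1e-3):
    torch.manual_seed(0)
    x, y = make_data(k=k)
    model = torch.nn.Sequential(
        torch.nn.Linear(2, width, bias=use_bias),
        torch.nn.ReLU(),
        torch.nn.Linear(width, 1, bias=use_bias),
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        loss = torch.mean((model(x) - y) ** 2)  # full-batch MSE
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < tol:
            return epoch          # epochs needed to reach the error threshold
    return None                   # did not converge within the budget

if __name__ == "__main__":
    for k in [1, 2, 3, 4]:
        print(f"k={k}: with bias -> {epochs_to_fit(k, True)}, "
              f"without bias -> {epochs_to_fit(k, False)}")
```

If the theory holds, the bias-free network should fail to fit the odd frequency $k = 3$ within any reasonable budget, while the network with bias fits all frequencies, taking longer as $k$ grows.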

Theoretical Implications

The paper fosters a deeper understanding of how gradient descent acts as a frequency-based regularization mechanism. Because low-frequency components are learned first, overparameterized networks naturally gravitate toward simpler function representations even within very large solution spaces, which helps explain why such models can avoid overfitting: they settle on smoother solutions before fitting high-frequency detail.
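
Concretely, stopping gradient descent at time $t$ in the linearized model leaves each frequency component only partially fit, which acts like a low-pass filter. A schematic statement, using the notation introduced above with constants suppressed, is:

$$
\hat f_t \approx \sum_{k,j} \left(1 - e^{-\lambda_k t}\right) a_{k,j}\, Y_{k,j},
$$

so components with $\lambda_k t \gg 1$ are essentially learned while those with $\lambda_k t \ll 1$ are not. Since the time to learn frequency $k$ grows roughly as $O(k^2)$ in the settings analyzed, longer training admits progressively higher frequencies, which is what makes early stopping an effective frequency-selective regularizer (see the first practical implication below).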

Practical Implications and Future Directions

  1. Early Stopping as a Regularization Tool:
    • Early stopping becomes a practical regularizer: halting training before high-frequency components are fit keeps the learned function dominated by low-frequency structure and leaves high-frequency noise largely unlearned.
  2. Network Design Considerations:
    • The results suggest that architectural and training choices, such as including bias terms and tuning the learning rate, are important for exploiting the network's innate preference for low-frequency learning.
  3. Extensions to Complex Tasks:
    • While the analysis assumes an idealized setting (normalized data distributed uniformly on a hypersphere), future work can explore these dynamics under real-world data distributions, potentially yielding new insights in domains such as signal processing and vision.

In summary, this paper refines our understanding of neural network training dynamics through a frequency lens, emphasizing the impact of bias and offering an analytical perspective that could influence network design and training methodologies, with potential relevance extending well beyond the academic context.
