Emergence and scaling laws in SGD learning of shallow neural networks (2504.19983v1)

Published 28 Apr 2025 in cs.LG and stat.ML

Abstract: We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k>2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}^*_p\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging ``extensive-width'' regime $P\gg 1$ and permit diverging condition number in the second-layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.

Summary

Overview of "Emergence and Scaling Laws in SGD Learning of Shallow Neural Networks"

The paper "Emergence and Scaling Laws in SGD Learning of Shallow Neural Networks" by Ren, Nichani, Wu, and Lee, explores the dynamics and complexity of training two-layer neural networks using Stochastic Gradient Descent (SGD) in high-dimensional spaces. Focused on the learning mechanisms of such networks operating over isotropic Gaussian data, the paper provides a nuanced analysis of SGD's behavior and efficiency in the extensive-width regime where the number of neurons far exceeds typical settings. Herein, the primary attributes and implications of the paper are explored.

Key Contributions and Analysis

Complexity of Learning Two-layer Networks

The paper scrutinizes the optimization trajectory of learning two-layer networks with many neurons (extensive width), orthonormal teacher directions, and a possibly diverging condition number in the second layer. The analysis centers on a student-teacher setup in which a student network is trained to learn a target function realized by a teacher network. The teacher is an additive model $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p \cdot \sigma(\langle \boldsymbol{x}, \boldsymbol{v}_p^* \rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0, \boldsymbol{I}_d)$, where $\{\boldsymbol{v}_p^*\}$ are orthonormal directions and the $a_p$ are non-negative second-layer coefficients satisfying prescribed conditions such as power-law decay $a_p \asymp p^{-\beta}$. This formulation highlights the difficulty posed by extensive-width networks ($P \gg 1$) with large condition number $a_{\max}/a_{\min} \gg 1$ in a nonlinear feature-learning setting.
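
To make the setup concrete, the following is a minimal NumPy sketch of such a teacher network. The specific even activation (the fourth Hermite polynomial, which has information exponent $k = 4$), the dimensions, and the power-law exponent are illustrative assumptions, not choices fixed by the paper.

```python
import numpy as np

def he4(z):
    # 4th Hermite polynomial: an even activation with information exponent k = 4
    return z**4 - 6 * z**2 + 3

def make_teacher(d, P, beta=1.5, rng=None):
    """Build a teacher f_*(x) = sum_p a_p * sigma(<x, v_p*>) with orthonormal
    directions and power-law second-layer coefficients a_p ~ p^{-beta},
    normalized so that sum_p a_p^2 = 1 (illustrative choices)."""
    rng = np.random.default_rng(rng)
    # Orthonormal signal directions via QR of a Gaussian matrix
    V = np.linalg.qr(rng.standard_normal((d, P)))[0].T   # shape (P, d)
    a = np.arange(1, P + 1, dtype=float) ** (-beta)
    a /= np.linalg.norm(a)                               # enforce sum_p a_p^2 = 1
    def f_star(X):                                       # X: (n, d) Gaussian inputs
        return he4(X @ V.T) @ a
    return f_star, V, a

# Example: isotropic Gaussian data in d = 256 with P = 32 teacher neurons
f_star, V, a = make_teacher(d=256, P=32, beta=1.5, rng=0)
X = np.random.default_rng(1).standard_normal((8, 256))
print(f_star(X))
```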

Emergent Learning Curves

The paper introduces emergent learning curves in which the recovery of each signal direction exhibits a sharp transition from a high-error state to an efficient learning phase: an extended plateau of near-stagnation followed by rapid error reduction under SGD. This transition mechanism offers insight into the dynamics of gradient-based learning in high dimensions, particularly when the activation has information exponent $\mathrm{IE}(\sigma) > 2$, in which case the gradient signal at weak overlap is small and long plateaus arise before each direction is recovered.
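
As a caricature of this plateau-then-transition behavior, the sketch below integrates a simple ODE for the overlap between a student direction and one teacher neuron, assuming the growth rate scales like $a_p\, m^{k-1}$ at small overlap $m$. The step size, initial overlap, and saturation term are illustrative and are not the paper's exact dynamics.

```python
import numpy as np

def overlap_curve(k=4, a=1.0, m0=0.05, lr=1e-2, steps=200_000):
    """Phenomenological caricature of one teacher direction's recovery:
    the overlap m = <w, v_p*> grows roughly like dm/dt ~ a * m^{k-1} * (1 - m^2)
    for an activation with information exponent k, giving a long plateau
    followed by a sharp transition (illustrative, not the paper's exact SDE)."""
    m = np.empty(steps)
    m[0] = m0
    for t in range(1, steps):
        m[t] = m[t - 1] + lr * a * m[t - 1] ** (k - 1) * (1 - m[t - 1] ** 2)
    return m

# Directions with larger second-layer coefficients a_p escape the plateau earlier.
for a_p in (1.0, 0.5, 0.25):
    m = overlap_curve(a=a_p)
    escape = np.argmax(m > 0.5) if (m > 0.5).any() else None
    print(f"a_p = {a_p:4.2f}: step where overlap first exceeds 0.5 -> {escape}")
```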

Scaling Laws in High-dimensional Learning

Scaling laws akin to those observed in empirical studies of neural networks are derived theoretically. When the second-layer coefficients of the teacher follow a power law, the population mean squared error (MSE) decays as a predictable power law in the number of training samples (equivalently, SGD steps) and in the number of student parameters. These laws describe the functional relationship between the MSE loss and resources such as data and compute, providing a theoretical counterpart to the scaling relationships reported in practice.
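
The following toy computation illustrates the aggregation mechanism described in the abstract: each teacher neuron contributes $a_p^2$ to the population MSE until an assumed recovery time that grows polynomially in $p$, and summing these abrupt drops over many neurons yields a smoothly decaying loss. The exponents used here are placeholders, not the exponents derived in the paper.

```python
import numpy as np

# Direction p contributes a_p^2 to the MSE until an (assumed) recovery time T_p
# that grows polynomially in p; beta and gamma below are illustrative placeholders.
P, beta, gamma = 10_000, 1.5, 2.0
a2 = np.arange(1, P + 1, dtype=float) ** (-2 * beta)
a2 /= a2.sum()                                   # enforce sum_p a_p^2 = 1
T = np.arange(1, P + 1, dtype=float) ** gamma    # assumed recovery time of neuron p

for t in np.logspace(0, 8, 9):
    mse = a2[T > t].sum()                        # unrecovered directions still contribute
    print(f"t = {t:9.0f}   MSE ~ {mse:.3e}")
# The printed loss decays smoothly, approximately as a power of t, even though
# each individual direction switches from "unlearned" to "learned" abruptly.
```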

Implications and Future Directions

Theoretical and Practical Implications

The rigorous guarantees show that, under well-defined conditions, online SGD learns the target efficiently even in very high dimensions and in the extensive-width regime with many teacher neurons. The insight that many individually abrupt learning curves aggregate into a smooth power-law scaling helps explain behaviors observed in large-scale models, and may inform hyperparameter choices and architectural design in deep learning practice.

Future Research

While the paper elucidates much about extensive-width networks, settings with anisotropic input data or broader classes of second-layer coefficients remain open. Extending the analysis to other optimizers and training strategies, such as adaptive-moment methods or layer-wise training, could further clarify how these scaling laws carry over to non-standard training paradigms.

Conclusion

In conclusion, this paper offers a rigorous analysis of the computational complexity and emergent dynamics of SGD learning for shallow networks. Grounded in concepts from statistical physics and high-dimensional statistical inference, the results not only align with empirical observations but also extend existing frameworks by deriving explicit scaling laws, providing a foundation for future investigations into high-dimensional neural network training and its implications for AI scalability.

By synthesizing complexity analysis and scaling laws, the authors lay the groundwork for further theoretical and practical work in deep learning, offering a principled approach to understanding and optimizing extensive-width neural networks.