A Dynamical Model of Neural Scaling Laws (2402.01092v4)

Published 2 Feb 2024 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why the scaling of performance with training time and with model size have different power law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps are increased faster than model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late time exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.

A Dynamical Model of Neural Scaling Laws

This paper presents a comprehensive study of neural scaling laws, introducing a dynamical model that analyzes how neural network performance improves as a function of training time, model size, and dataset size. The core objective is to understand how these improvements scale when compute resources are allocated optimally, a concept referred to as the compute-optimal scaling law. The authors employ a random feature model trained with gradient descent, providing a solvable framework that reproduces several empirical observations about scaling laws in neural networks.
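
As a rough illustration of the kind of model being analyzed, the sketch below trains a random feature model with full-batch gradient descent on a synthetic linear teacher and records the test loss as the number of features ("width") and the number of steps vary. The specific choices here (the tanh nonlinearity, the linear teacher, the problem sizes) are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch (illustrative, not the paper's exact setup): a random feature
# model y_hat(x) = w . tanh(F x), with the projection F frozen and only the
# readout w trained by full-batch gradient descent on a synthetic linear teacher.
# Sweeping the width and the number of steps shows the kind of time- and
# model-size-dependent improvements the theory studies.
import numpy as np

rng = np.random.default_rng(0)
D, P, P_test = 128, 512, 2048                 # input dim, train and test set sizes
w_star = rng.normal(size=D) / np.sqrt(D)      # hypothetical linear teacher

def sample(n):
    X = rng.normal(size=(n, D))
    return X, X @ w_star

X_tr, y_tr = sample(P)
X_te, y_te = sample(P_test)

def train(width, steps):
    F = rng.normal(size=(D, width)) / np.sqrt(D)      # frozen random projection
    phi_tr, phi_te = np.tanh(X_tr @ F), np.tanh(X_te @ F)
    H = phi_tr.T @ phi_tr / P                         # Hessian of the quadratic loss
    lr = 1.0 / np.linalg.eigvalsh(H)[-1]              # step size set by top eigenvalue
    w = np.zeros(width)
    test_loss = []
    for _ in range(steps):
        w -= lr * phi_tr.T @ (phi_tr @ w - y_tr) / P  # full-batch gradient step
        test_loss.append(np.mean((phi_te @ w - y_te) ** 2))
    return test_loss

for width in (64, 256, 1024):
    curve = train(width, steps=500)
    print(f"width={width:5d}  test loss after 500 steps: {curve[-1]:.4f}")
```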

Key Contributions and Findings

  1. Asymmetric Scaling and Power Law Exponents: The paper shows that performance follows power laws with different exponents in training time and in model size. This discrepancy implies an asymmetric compute-optimal scaling strategy in which the number of training steps is increased more rapidly than the number of model parameters, consistent with recent empirical findings (a worked example of this allocation follows this list).
  2. Convergence Dynamics in Finite Width Models: It is noted that early in training, models converge to their infinite-width dynamics at a rate of $1/\text{width}$. However, at later times, the convergence rate is observed to be $\text{width}^{-c}$, where the constant $c$ is architecture- and task-dependent. The theoretical model successfully captures these dynamics.
  3. Training and Test Loss Gap Formation: The paper provides a theoretical account of how the gap between training and test loss builds up over time due to repeated reuse of data. This analysis offers insight into the transition between effectively online and offline training regimes.
  4. Mode Errors and Learning Trajectories: The paper introduces a transfer function that describes how the error along each kernel eigenfunction (the mode error) evolves over training iterations, giving a detailed picture of the learning trajectory.
  5. Universal Early-Time Corrections: The research concludes that early-time corrections universally scale as $1/\text{width}$ or $1/\text{dataset size}$, unveiling uniform behavior across model dynamics before task-specific training dynamics emerge at later stages.
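
To make the asymmetry in item 1 concrete, consider a purely illustrative additive power-law ansatz (the functional form and the exponents $\alpha$, $\beta$ are assumptions for exposition, not the paper's derived loss expression): $\mathcal{L}(N, t) \approx a N^{-\alpha} + b t^{-\beta}$ with compute budget $C \propto N t$. Minimizing over $N$ at fixed $C$ gives $N^{\ast} \propto C^{\beta/(\alpha+\beta)}$ and $t^{\ast} \propto C^{\alpha/(\alpha+\beta)}$, so the allocation is asymmetric whenever the two exponents differ; in particular, when the model-size exponent exceeds the time exponent, training steps should grow faster with compute than parameters.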

Implications and Future Directions

The theoretical insights drawn from this model have several practical and theoretical implications:

  • Architecture and Hyperparameter Tuning: The results suggest that scaling architecture and adjusting hyperparameters asymmetrically could improve compute efficiency, especially when compute resources are constrained.
  • Data and Model Bottlenecks: Identifying and leveraging model and data bottlenecks can inform strategies for dataset construction, model architecture selection, and resource allocation in training large neural networks.
  • Compute-Optimal Scaling Strategy: The derived power-law relationships and scaling strategy offer a quantitative means to plan resource allocation effectively for both model training and deployment (see the fitting sketch after this list).
  • Insight into Generalizability: Understanding how ensemble strategies and data variation impact test set performance can inform robust machine learning practices, particularly for generalization under constrained dataset conditions.
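
As a toy illustration of that planning step, the following sketch fits power-law exponents to hypothetical loss measurements (the numbers are placeholders, not data from the paper) and uses the additive-power-law allocation from above to split a compute budget between parameters and training steps.

```python
# A hedged sketch (illustrative only): fit power-law exponents by a log-log
# least-squares fit to measured losses, then split a compute budget C ~ N * t
# between parameters N and steps t using the additive-power-law allocation
# discussed above. The loss values below are made up for demonstration.
import numpy as np

def fit_exponent(x, loss):
    """Fit loss ~ const * x**(-exponent) via a linear fit in log-log space."""
    slope, _ = np.polyfit(np.log(x), np.log(loss), 1)
    return -slope

# Hypothetical measurements (placeholders, not results from the paper).
widths    = np.array([64, 128, 256, 512], dtype=float)
loss_vs_N = np.array([0.20, 0.13, 0.085, 0.055])
steps     = np.array([1e3, 1e4, 1e5, 1e6])
loss_vs_t = np.array([0.30, 0.19, 0.12, 0.075])

alpha = fit_exponent(widths, loss_vs_N)   # model-size exponent
beta  = fit_exponent(steps, loss_vs_t)    # training-time exponent

C = 1e9                                   # illustrative compute budget, C ~ N * t
N_opt = C ** (beta / (alpha + beta))
t_opt = C ** (alpha / (alpha + beta))
print(f"alpha={alpha:.2f}, beta={beta:.2f}, N* ~ {N_opt:.3g}, t* ~ {t_opt:.3g}")
```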

The model is particularly notable for its tractability, allowing extensions to other training paradigms such as momentum and discrete-time optimization. While the model captures key aspects of scaling laws, the authors note that additional work incorporating feature-learning dynamics could further enhance understanding, as their experiments show substantial deviations from the predicted trends due to ongoing kernel evolution.
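
For example, one such extension is standard heavy-ball momentum, which replaces the plain gradient step with $v_{t+1} = \mu v_t - \eta \nabla_w \mathcal{L}(w_t)$ and $w_{t+1} = w_t + v_{t+1}$; because the random feature loss is quadratic in the readout weights, this remains a linear recursion in $(w_t, v_t)$ that the same kind of mode-by-mode analysis can track. (This is the generic momentum update, not the paper's specific parametrization.)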

Overall, this paper serves as a critical step towards formulating a unified theoretical framework for neural scaling laws that connects compute efficiency with model training dynamics, providing valuable insights for future research in neural network optimization and scaling.

Authors (3)
  1. Blake Bordelon (27 papers)
  2. Alexander Atanasov (14 papers)
  3. Cengiz Pehlevan (81 papers)
Citations (24)