A Dynamical Model of Neural Scaling Laws (2402.01092v4)

Published 2 Feb 2024 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why the scaling of performance with training time and with model size have different power law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps are increased faster than model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late time exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.

A Dynamical Model of Neural Scaling Laws

This paper presents a comprehensive study of neural scaling laws, introducing a dynamical model that analyzes how neural network performance improves as a function of training time, model size, and dataset size. The core objective is to understand how these improvements scale when compute resources are allocated optimally, a concept referred to as the compute-optimal scaling law. The authors employ a random feature model trained with gradient descent, providing a solvable framework that reproduces several empirical observations about scaling laws in neural networks.
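
As a rough illustration of the kind of model being analyzed, the sketch below trains a random feature model with full-batch gradient descent on a synthetic linear teacher and records the test loss as the number of features ("width") and the number of steps vary. The specific choices here (the tanh nonlinearity, the linear teacher, the problem sizes) are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch (illustrative, not the paper's exact setup): a random feature
# model y_hat(x) = w . tanh(F x), with the projection F frozen and only the
# readout w trained by full-batch gradient descent on a synthetic linear teacher.
# Sweeping the width and the number of steps shows the kind of time- and
# model-size-dependent improvements the theory studies.
import numpy as np

rng = np.random.default_rng(0)
D, P, P_test = 128, 512, 2048                 # input dim, train and test set sizes
w_star = rng.normal(size=D) / np.sqrt(D)      # hypothetical linear teacher

def sample(n):
    X = rng.normal(size=(n, D))
    return X, X @ w_star

X_tr, y_tr = sample(P)
X_te, y_te = sample(P_test)

def train(width, steps):
    F = rng.normal(size=(D, width)) / np.sqrt(D)      # frozen random projection
    phi_tr, phi_te = np.tanh(X_tr @ F), np.tanh(X_te @ F)
    H = phi_tr.T @ phi_tr / P                         # Hessian of the quadratic loss
    lr = 1.0 / np.linalg.eigvalsh(H)[-1]              # step size set by top eigenvalue
    w = np.zeros(width)
    test_loss = []
    for _ in range(steps):
        w -= lr * phi_tr.T @ (phi_tr @ w - y_tr) / P  # full-batch gradient step
        test_loss.append(np.mean((phi_te @ w - y_te) ** 2))
    return test_loss

for width in (64, 256, 1024):
    curve = train(width, steps=500)
    print(f"width={width:5d}  test loss after 500 steps: {curve[-1]:.4f}")
```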

Key Contributions and Findings

  1. Asymmetric Scaling and Power Law Exponents: The paper shows that performance follows power laws with different exponents in training time and in model size. This discrepancy implies an asymmetric compute-optimal scaling strategy in which the number of training steps is increased more rapidly than the number of model parameters, consistent with recent empirical findings (a worked example of this allocation follows this list).
  2. Convergence Dynamics in Finite Width Models: It is noted that early in training, models converge to their infinite-width dynamics at a rate of $1/\text{width}$. However, at later times, the convergence rate is observed to be $\text{width}^{-c}$, where the constant $c$ is architecture- and task-dependent. The theoretical model successfully captures these dynamics.
  3. Training and Test Loss Gap Formation: The paper provides a theoretical account of how the gap between training and test loss builds up over time due to repeated reuse of data. This analysis offers insight into the transition between effectively online and offline training regimes.
  4. Mode Errors and Learning Trajectories: The paper introduces a transfer function that describes how the error along each kernel eigenfunction (the mode error) evolves over training iterations, giving a detailed picture of the learning trajectory.
  5. Universal Early-Time Corrections: The research concludes that early-time corrections universally scale as $1/\text{width}$ or $1/\text{dataset size}$, unveiling uniform behavior across model dynamics before task-specific training dynamics emerge at later stages.
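
To make the asymmetry in item 1 concrete, consider a purely illustrative additive power-law ansatz (the functional form and the exponents $\alpha$, $\beta$ are assumptions for exposition, not the paper's derived loss expression): $\mathcal{L}(N, t) \approx a N^{-\alpha} + b t^{-\beta}$ with compute budget $C \propto N t$. Minimizing over $N$ at fixed $C$ gives $N^{\ast} \propto C^{\beta/(\alpha+\beta)}$ and $t^{\ast} \propto C^{\alpha/(\alpha+\beta)}$, so the allocation is asymmetric whenever the two exponents differ; in particular, when the model-size exponent exceeds the time exponent, training steps should grow faster with compute than parameters.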

Implications and Future Directions

The theoretical insights drawn from this model have several practical and theoretical implications:

  • Architecture and Hyperparameter Tuning: The results suggest that scaling architecture and adjusting hyperparameters asymmetrically could improve compute efficiency, especially when compute resources are constrained.
  • Data and Model Bottlenecks: Identifying and leveraging model and data bottlenecks can inform strategies for dataset construction, model architecture selection, and resource allocation in training large neural networks.
  • Compute-Optimal Scaling Strategy: The derived power-law relationships and scaling strategy offer a quantitative means to plan resource allocation effectively for both model training and deployment (see the fitting sketch after this list).
  • Insight into Generalizability: Understanding how ensemble strategies and data variation impact test set performance can inform robust machine learning practices, particularly for generalization under constrained dataset conditions.
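
As a toy illustration of that planning step, the following sketch fits power-law exponents to hypothetical loss measurements (the numbers are placeholders, not data from the paper) and uses the additive-power-law allocation from above to split a compute budget between parameters and training steps.

```python
# A hedged sketch (illustrative only): fit power-law exponents by a log-log
# least-squares fit to measured losses, then split a compute budget C ~ N * t
# between parameters N and steps t using the additive-power-law allocation
# discussed above. The loss values below are made up for demonstration.
import numpy as np

def fit_exponent(x, loss):
    """Fit loss ~ const * x**(-exponent) via a linear fit in log-log space."""
    slope, _ = np.polyfit(np.log(x), np.log(loss), 1)
    return -slope

# Hypothetical measurements (placeholders, not results from the paper).
widths    = np.array([64, 128, 256, 512], dtype=float)
loss_vs_N = np.array([0.20, 0.13, 0.085, 0.055])
steps     = np.array([1e3, 1e4, 1e5, 1e6])
loss_vs_t = np.array([0.30, 0.19, 0.12, 0.075])

alpha = fit_exponent(widths, loss_vs_N)   # model-size exponent
beta  = fit_exponent(steps, loss_vs_t)    # training-time exponent

C = 1e9                                   # illustrative compute budget, C ~ N * t
N_opt = C ** (beta / (alpha + beta))
t_opt = C ** (alpha / (alpha + beta))
print(f"alpha={alpha:.2f}, beta={beta:.2f}, N* ~ {N_opt:.3g}, t* ~ {t_opt:.3g}")
```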

The model is particularly notable for its tractability, allowing extensions to other training paradigms such as momentum and discrete-time optimization. While the model captures key aspects of scaling laws, the authors note that additional work incorporating feature-learning dynamics could further enhance understanding, as their experiments show substantial deviations from the predicted trends due to ongoing kernel evolution.
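
For example, one such extension is standard heavy-ball momentum, which replaces the plain gradient step with $v_{t+1} = \mu v_t - \eta \nabla_w \mathcal{L}(w_t)$ and $w_{t+1} = w_t + v_{t+1}$; because the random feature loss is quadratic in the readout weights, this remains a linear recursion in $(w_t, v_t)$ that the same kind of mode-by-mode analysis can track. (This is the generic momentum update, not the paper's specific parametrization.)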

Overall, this paper serves as a critical step towards formulating a unified theoretical framework for neural scaling laws that connects compute efficiency with model training dynamics, providing valuable insights for future research in neural network optimization and scaling.

Authors (3)
  1. Blake Bordelon (27 papers)
  2. Alexander Atanasov (14 papers)
  3. Cengiz Pehlevan (81 papers)
Citations (24)