Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time (2504.13110v1)

Published 17 Apr 2025 in stat.ML and cs.LG

Abstract: We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain ``self-concordance'' property in these problems -- where the local Hessian of a particle is bounded by a constant times the particle's velocity -- polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.

Summary

  • The paper quantifies the gap between finite-width networks and their mean-field limit (propagation of chaos) over polynomial time horizons, moving beyond prior analyses that are valid only up to logarithmic time.
  • The analysis couples the finite-width and mean-field dynamics through a differential inequality and structured potential functions that control how errors accumulate over time.
  • The results inform efficient training protocols and show an advantage of feature-learning neural networks over kernel methods for estimating single-index models.

Propagation of Chaos in One-Hidden-Layer Neural Networks beyond Logarithmic Time

The paper rigorously analyzes the training dynamics of one-hidden-layer neural networks in the mean-field scaling regime. The authors study the approximation gap between finite-width networks and their infinite-width counterparts when both are trained with projected gradient descent. Their pivotal contribution is a tight characterization of this gap via a differential equation governed by the mean-field dynamics. The work thereby quantifies the finite-width effect, known as propagation of chaos, over polynomial time horizons, in contrast to earlier analyses whose error bounds grow exponentially in time and therefore remain meaningful only up to logarithmic horizons.
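
To fix notation, here is a schematic rendering of the two objects being compared (our notation; the paper's precise setup may differ). A width-$N$ network in the mean-field scaling and its infinite-width counterpart are

$$f_N(x) = \frac{1}{N}\sum_{i=1}^{N} \sigma(x;\theta_i), \qquad f_\infty(x) = \int \sigma(x;\theta)\, d\mu_t(\theta),$$

and the approximation gap is typically measured by coupling each finite-width particle $\theta_i(t)$ to a mean-field particle $\bar\theta_i(t)$ started from the same initialization and bounding $\max_i \|\theta_i(t) - \bar\theta_i(t)\|$ uniformly over the training horizon.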

Mean-field Dynamics and Local Strong Convexity

At the core of the paper is the mean-field regime, in which the evolution of the network's parameters is described as a flow in the infinite-dimensional space of probability measures over parameters. This analysis is contrasted with the Neural Tangent Kernel (NTK) approach, in which the parameters remain close to their initialization, a regime that largely precludes feature learning.
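
As a point of reference, here is a hedged sketch of the mean-field description that is standard in this literature (ignoring the projection step of projected gradient descent; details may differ from the paper's exact setup): the parameter distribution $\mu_t$ evolves by a continuity equation driven by the first variation of the population risk $R$,

$$\partial_t \mu_t = \nabla_\theta \cdot \Big( \mu_t \, \nabla_\theta \frac{\delta R}{\delta \mu}(\mu_t, \theta) \Big),$$

so the features $\sigma(\cdot;\theta)$ genuinely move during training. The NTK regime, by contrast, linearizes the network around its initialization, so the predictor stays inside a fixed kernel's function class and the features do not adapt to the data.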

The authors introduce the notions of a "local Hessian" and an "interaction Hessian" to bound how neurons influence one another through their velocities and positions. The local Hessian captures the sensitivity of a neuron's velocity to its own position and governs the stability of the training dynamics, whereas the interaction Hessian captures the non-local effects exerted by the other neurons; the positive semi-definite (PSD) structure of these matrices provides a self-regulating mechanism.
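
In this language, writing $v(\theta;\mu_t)$ for the velocity field that drives each particle in the mean-field dynamics, a schematic formalization of these two objects (our notation, following the definition given in the abstract) is

$$H_{\mathrm{loc}}(\theta) := \nabla_\theta\, v(\theta;\mu_t), \qquad H_{\mathrm{int}}(\theta,\theta') := \nabla_{\theta'}\, \frac{\delta v}{\delta \mu}(\theta;\mu_t)(\theta'),$$

where $\delta v/\delta\mu$ denotes a first variation in the measure. The exact form used in the paper may differ, but the local term tracks each particle's own stability while the interaction term tracks the cross-particle coupling.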

Theoretical Contributions: Propagation of Chaos

The paper delivers a series of theoretical results under the umbrella of propagation of chaos, ensuring that the behavior of polynomial-width networks closely tracks the infinite-width limit. The results rest on a set of assumptions including local strong convexity and an averaged notion of neuron stability. A key insight is that, unlike previous works restricted to logarithmic time scales, the approximation can be controlled over polynomial time by constructing a structured potential function that leverages the inherent biases and interactions in the network.

  1. Bound Stability and Growth: The paper details conditions under which deviations from the mean-field limit, captured by discrepancies in neural velocities, remain controlled over time, using the self-concordant behavior of the dynamics to prevent exponential error propagation (see the sketch after this list).
  2. Effective Coupling: The authors introduce a potential function built on a balanced spectral decomposition of the interaction Hessian to manage the amplified errors arising near the support of the target distribution.
  3. Local Strong Convexity: Local strong convexity ensures that neurons remain confined near the optimal configuration over finite time horizons, enabling error correction while preserving global stability.
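
To see why the self-concordance property in item 1 blocks exponential-in-time error growth, here is a heuristic Grönwall computation in our notation (a sketch of the mechanism, not the paper's actual argument). If $\Delta_t$ denotes the deviation of a finite-width particle from its mean-field counterpart, a first-order expansion of the velocity field gives

$$\frac{d}{dt}\|\Delta_t\| \lesssim \|H_{\mathrm{loc}}(\bar\theta_t)\|\,\|\Delta_t\| + (\text{interaction and fluctuation terms}),$$

and Grönwall's inequality, ignoring the latter terms and starting from an initial discrepancy $\Delta_0$, yields

$$\|\Delta_T\| \lesssim \|\Delta_0\|\, \exp\!\Big(\int_0^T \|H_{\mathrm{loc}}(\bar\theta_t)\|\,dt\Big) \le \|\Delta_0\|\, \exp\!\Big(C \int_0^T \|v(\bar\theta_t;\mu_t)\|\,dt\Big),$$

where the last step uses the self-concordance bound $\|H_{\mathrm{loc}}\| \le C\|v\|$. The integral on the right is $C$ times the path length traveled by the mean-field particle, which can remain bounded even when the runtime $T$ is polynomially large in $d$; the error budget is thus tied to how far particles move rather than to how long training runs.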

Application: Single-Index Models

As an application, the authors examine well-specified single-index models with arbitrarily large information exponent. They show that their theory yields convergence in time polynomial in the ambient dimension $d$, extending beyond previous barriers. For learning such models, this indicates a concrete advantage in training complexity for neural networks that actively learn features over kernel methods.
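
For concreteness, the standard formulation of this problem class (stated here in generic form, not quoted from the paper) is

$$y = g^*(\langle w^*, x\rangle), \qquad x \sim \mathcal{N}(0, I_d), \quad \|w^*\| = 1,$$

with the information exponent of the link function $g^*$ defined as the index of its first nonzero Hermite coefficient,

$$k := \min\{\, j \ge 1 : \mathbb{E}_{z\sim\mathcal{N}(0,1)}[\,g^*(z)\,\mathrm{He}_j(z)\,] \neq 0 \,\}.$$

A larger $k$ means the correlation between a randomly initialized neuron and $w^*$ is weaker at the start of training, which forces gradient-based methods to run for a number of steps that grows polynomially in $d$, with degree increasing in $k$; the contribution here is that polynomially many neurons suffice to track the mean-field dynamics over such long horizons.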

Implications and Future Directions

The insights from this work have implications for both theory and practice. By shifting from a primarily asymptotic perspective to a non-asymptotic one, the authors provide tools for understanding and quantifying neural network performance in realistic settings where dimension and computation time are limited. Practically, the results can inform the design of more computationally efficient training protocols, suggesting that appropriately structured overparameterization can enhance learning efficiency without incurring prohibitive computational cost.

Future research directions include extending these mechanisms to deeper, multi-layer architectures and to settings where the target function's complexity exceeds the assumptions of the current framework. Bridging the mean-field analysis with empirical observations from practical deep learning could also yield improvements in both the speed and accuracy of contemporary learning systems.
