Quantitative Convergence of Shallow Neural Networks
- The main contribution of this line of work is establishing quantitative convergence rates, with explicit bounds on error decay as a function of network width and overparameterization.
- Analyses employ local linearization and spectral control of the Jacobian to guarantee that gradient descent makes effective progress on nonconvex shallow networks.
- The results bridge theory and practice by reducing impractical overparameterization requirements and providing actionable parameter regimes for reliable model training.
Quantitative convergence of shallow neural networks concerns the rigorous characterization and estimation of convergence rates—typically of training error, excess risk, or empirical processes—when learning (potentially nonconvex) shallow neural architectures from data using gradient-based algorithms. Unlike qualitative results, which may guarantee eventual convergence without specifying rates or parameter regimes, quantitative convergence provides explicit dependencies on network width, sample size, initialization, overparameterization, architecture, loss decay modalities (linear, exponential, polynomial), and properties of data and activations. This area addresses the gap between the impractically large overparameterization requirements of classical theory and the often much weaker requirements observed in practice. The following sections synthesize key definitions, methodologies, parameter regimes, and theoretical guarantees in contemporary literature.
1. Definitions and Formal Guarantees
In the context of shallow (one-hidden-layer) neural networks, quantitative convergence refers to explicit rate statements or bounds, usually of the form $\mathcal{L}(\theta_t) \le (1 - c)^t\,\mathcal{L}(\theta_0)$, with an explicit contraction factor $c \in (0,1)$ depending on step size, width, and data, for geometric/exponential decay (Oymak et al., 2019), or analogous rates for excess risk or expected loss.
- Global minimum: The set of parameters achieving zero training loss (exact interpolation of the labels); overparameterization typically yields many such global optima.
- Overparameterization: The regime where the number of network parameters (on the order of the hidden width $m$ times the input dimension $d$) is sufficiently larger than, or scales with, the number of training samples $n$. The minimal regimes that guarantee global geometric convergence are of central interest (Oymak et al., 2019, Song et al., 2021, Razborov, 2022).
- Polyak–Łojasiewicz (PL) inequality: A relaxation of strong convexity leveraged in nonconvex analyses, stating that $\tfrac{1}{2}\|\nabla L(\theta)\|^2 \ge \mu\,(L(\theta) - L^*)$ for some $\mu > 0$. A local or global PL-type lower bound on curvature along gradient flows enables linear convergence analyses, even for nonconvex landscapes (Song et al., 2021, Dana et al., 24 Feb 2025); see the short derivation after this list.
- NTK regime: The training dynamics are linearized around initialization, with the Neural Tangent Kernel (NTK) remaining nearly constant, often guaranteeing exponential convergence for overparameterized networks (Xu et al., 7 Dec 2024, Liao et al., 2021), but possibly stifling feature learning (Caron et al., 2023).
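For concreteness, the following display is the textbook way a PL constant translates into a linear (geometric) rate for gradient descent; it additionally assumes that the loss $L$ is $\beta$-smooth and that the step size is $1/\beta$, and it is not specific to any of the cited papers.

```latex
% PL inequality + beta-smoothness => linear convergence of gradient descent
% (standard argument; theta_{t+1} = theta_t - (1/beta) * grad L(theta_t))
L(\theta_{t+1}) \;\le\; L(\theta_t) - \tfrac{1}{2\beta}\,\|\nabla L(\theta_t)\|^2
\;\le\; L(\theta_t) - \tfrac{\mu}{\beta}\,\bigl(L(\theta_t) - L^*\bigr)
\;\;\Longrightarrow\;\;
L(\theta_{t+1}) - L^* \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)\,\bigl(L(\theta_t) - L^*\bigr).
```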
2. Parameter Regimes for Convergence
Overparameterization Thresholds
A central object of study is the minimal network width $m$ (as a function of the input dimension $d$ and the number of samples $n$) such that convergence guarantees hold.
Paper | Network Class | Overparam. Threshold | Rate/Guarantee |
---|---|---|---|
(Oymak et al., 2019) | Shallow, smooth/ReLU | Moderate | Geometric (exponential) |
(Song et al., 2021) | Shallow, smooth, std. init | Subquadratic in $n$ | Geometric |
(Razborov, 2022) | Shallow, ReLU, both layers | | Exponential, uniform NTK conditioning |
(Polaczyk et al., 2022) | Shallow, ReLU | | Global, exponential under SGD |
(Dana et al., 24 Feb 2025) | Shallow, ReLU, high-dim data | | Exponential, PL trajectory |
In all cases, compared to early results requiring widths quadratic in $n$ or worse (in NTK analyses), more recent work demonstrates that subquadratic regimes, or even far smaller widths for high-dimensional, nearly orthogonal data, suffice for global, geometric convergence under stochastic or batch gradient descent.
Impact of Data Geometry
High-dimensional data (with input dimension $d$ large relative to the sample size $n$) and weak correlations among samples allow extremely modest widths for convergence. For near-orthogonal data, a single neuron per sample is sufficient in principle (Dana et al., 24 Feb 2025). The numerical check below illustrates the underlying near-orthogonality of generic high-dimensional data.
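The snippet is a generic check with synthetic Gaussian data (the sizes and seed are arbitrary choices, not taken from the cited work): when $d \gg n$, the empirical Gram matrix of i.i.d. samples is close to the identity, i.e., the samples are nearly orthogonal and well separated.

```python
import numpy as np

# Near-orthogonality of i.i.d. high-dimensional data (illustrative check only).
rng = np.random.default_rng(0)
n, d = 50, 5000                                   # n samples in dimension d >> n
X = rng.standard_normal((n, d)) / np.sqrt(d)      # rows have norm ~1

G = X @ X.T                                       # empirical Gram matrix
off_diag = G - np.diag(np.diag(G))
print("max |off-diagonal entry|:", np.abs(off_diag).max())       # small, a few multiples of 1/sqrt(d)
print("smallest Gram eigenvalue:", np.linalg.eigvalsh(G).min())  # bounded well away from 0
```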
3. Training Dynamics and Methodology
Local Linearization and Jacobian Spectrum
Convergence under moderate overparameterization is often established via a local linearization approach:
- Jacobian Spectrum Control: Quantitative bounds are derived for the minimum and maximum singular values of the network output's Jacobian with respect to the parameters, both at initialization and along the optimization trajectory. Sufficient overparameterization ensures the minimum singular value remains lower-bounded, guaranteeing the local landscape is well-conditioned and gradient steps are effective (Oymak et al., 2019, Song et al., 2021, Razborov, 2022); a small numerical illustration follows this list.
- Random Matrix Theory: Matrix Chernoff bounds, Bernstein's inequality, and related random matrix results are employed to estimate the spectrum of Jacobians and the NTK, particularly in the region around initialization when the weights have not moved far (Song et al., 2021, Xu et al., 7 Dec 2024).
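As a concrete illustration of the quantity being controlled (a minimal numerical sketch with arbitrary sizes and a tanh activation, not the construction of any cited paper), the snippet below assembles the Jacobian of a one-hidden-layer network's outputs with respect to the hidden-layer weights at random initialization and reports its extreme singular values:

```python
import numpy as np

# Jacobian of a one-hidden-layer tanh network w.r.t. its hidden-layer weights,
# evaluated at random initialization (illustrative sizes only).
rng = np.random.default_rng(0)
n, d, m = 100, 20, 50                             # samples, input dim, width (m*d >= n)
X = rng.standard_normal((n, d)) / np.sqrt(d)      # inputs with norm ~1
W = rng.standard_normal((m, d))                   # hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed output weights

pre = X @ W.T                                     # (n, m) pre-activations
act_grad = 1.0 - np.tanh(pre) ** 2                # sigma'(.) for tanh

# Row i, block j of the Jacobian is a_j * sigma'(w_j . x_i) * x_i^T.
J = ((act_grad * a)[:, :, None] * X[:, None, :]).reshape(n, m * d)

s = np.linalg.svd(J, compute_uv=False)
print("sigma_max(J) =", s[0], "  sigma_min(J) =", s[-1])
# A strictly positive sigma_min at (and near) initialization is the key
# well-conditioning property these analyses establish.
```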
Gradient Descent and SGD
- Full-batch gradient descent and stochastic gradient descent (SGD) are both analyzed. The results typically show that SGD with sufficiently small step size can closely track the limiting gradient flow (possibly formalized as a differential inclusion for non-smooth activations (Polaczyk et al., 2022)), leading to global convergence at a linear rate.
- Key recursion: For discrete iterations, the analysis yields an error recursion of the form $\|f(\theta_{t+1}) - y\|^2 \le \bigl(1 - \eta\,\lambda_{\min}(K_t)\bigr)\,\|f(\theta_t) - y\|^2$, where $K_t$ is the kernel (Gram) matrix of the network Jacobian at step $t$ and $\lambda_{\min}(K_t)$ remains bounded away from zero as the width increases under strong overparameterization (Xu et al., 7 Dec 2024). A numerical illustration of this contraction follows this list.
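A minimal numerical sketch of this contraction (assuming a tanh activation, a fixed output layer, and a step size read off from the kernel spectrum at initialization; sizes are arbitrary, and this is not the exact setting of the cited analyses):

```python
import numpy as np

# Full-batch gradient descent on a wide one-hidden-layer tanh network with
# squared loss, printing the per-iteration residual ratio (illustrative only).
rng = np.random.default_rng(1)
n, d, m = 50, 20, 2000
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))                     # trainable hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # output layer kept fixed

def residual(W):
    return np.tanh(X @ W.T) @ a - y                 # f(W) - y, shape (n,)

def jacobian(W):
    act_grad = 1.0 - np.tanh(X @ W.T) ** 2          # sigma'(.) for tanh, shape (n, m)
    return ((act_grad * a)[:, :, None] * X[:, None, :]).reshape(n, m * d)

# Step size chosen from the spectrum of the kernel J J^T at initialization.
K0 = jacobian(W) @ jacobian(W).T
eta = 1.0 / np.linalg.eigvalsh(K0).max()

r = residual(W)
for t in range(301):
    grad_W = (jacobian(W).T @ r).reshape(m, d)      # gradient of 0.5 * ||r||^2
    W = W - eta * grad_W
    r_new = residual(W)
    if t % 100 == 0:
        print(f"iter {t:3d}  ||r|| = {np.linalg.norm(r_new):.3e}  "
              f"ratio = {np.linalg.norm(r_new) / np.linalg.norm(r):.3f}")
    r = r_new
```

In this overparameterized regime the printed ratio stays below one, mirroring the geometric decay in the recursion above.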
4. Quantitative Rates and Theoretical Results
Geometric and Sublinear Rates
- Under the established overparameterization and initialization conditions, the loss decreases according to $\mathcal{L}(\theta_t) \le (1 - \eta\,\alpha^2)^t\,\mathcal{L}(\theta_0)$, where $\alpha$ is the minimal singular value of the layer-wise Jacobian or Khatri–Rao product and the contraction factor depends on the step size and spectral characteristics (Oymak et al., 2019, Song et al., 2021).
- For orthogonal data, convergence rates can be explicitly bounded above and below, with the lower end of the exponent (i.e., the decay per iteration) of order $1/n$, and sharp phase transitions depending on initialization and neuron activation patterns (Dana et al., 24 Feb 2025).
- When both layers are trained simultaneously, theory guarantees uniform lower bounds on the minimal eigenvalue of the NTK matrix throughout training, which ensures global exponential convergence, even outside the classical NTK regime (Razborov, 2022).
Polyak–Łojasiewicz Analysis
- The loss decay can be equivalently described along the gradient flow trajectory via the local PL constant: $\frac{d}{dt}\,\mathcal{L}(\theta_t) = -\|\nabla\mathcal{L}(\theta_t)\|^2 \le -2\,\mu_t\,\mathcal{L}(\theta_t)$, with $\mu_t$ reflecting local curvature. Lower bounding $\mu_t$ along the trajectory explicitly quantifies exponential rates (Dana et al., 24 Feb 2025); the integrated form is spelled out below.
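Integrating this differential inequality (a standard Grönwall step, stated here under the assumption that the local constant satisfies $\mu_s \ge \bar{\mu} > 0$ along the entire flow and that global minima interpolate, so $L^* = 0$) makes the exponential rate explicit:

```latex
% Gronwall-type integration of the trajectory-wise PL inequality
\frac{d}{dt}\,\mathcal{L}(\theta_t) \le -2\,\mu_t\,\mathcal{L}(\theta_t)
\;\;\Longrightarrow\;\;
\mathcal{L}(\theta_t) \;\le\; \exp\!\Bigl(-2\int_0^t \mu_s\,\mathrm{d}s\Bigr)\,\mathcal{L}(\theta_0)
\;\le\; e^{-2\bar{\mu}\,t}\,\mathcal{L}(\theta_0).
```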
5. Extensions and Model Variants
Non-smooth Activations
- Results are robust to non-differentiable activations (e.g. ReLU). Differential inclusion techniques and generalized derivative notions are leveraged for global convergence proofs (Oymak et al., 2019, Polaczyk et al., 2022).
Layerwise and Multi-rate Training
- Simultaneous multi-rate training of both layers (with independently chosen learning rates for the hidden and output layers) can improve parameterization regimes and allows the translation of convergence results between different initialization norms (Razborov, 2022). A schematic per-layer-rate update is sketched below.
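The following sketch is illustrative only (sizes, output scaling, and the particular step sizes are assumptions, not the prescription of the cited work): both layers of a shallow tanh network are trained simultaneously, each with its own learning rate.

```python
import numpy as np

# Simultaneous training of both layers with independently chosen per-layer rates.
rng = np.random.default_rng(2)
n, d, m = 50, 20, 500
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))                 # hidden-layer weights
a = rng.standard_normal(m)                      # output-layer weights
eta_W, eta_a = 0.4, 0.2                         # independently chosen step sizes

def loss(W, a):
    return 0.5 * np.sum((np.tanh(X @ W.T) @ a / np.sqrt(m) - y) ** 2)

for t in range(301):
    H = np.tanh(X @ W.T)                        # (n, m) hidden activations
    r = H @ a / np.sqrt(m) - y                  # residual f(W, a) - y
    grad_a = H.T @ r / np.sqrt(m)               # gradient w.r.t. output weights
    grad_W = ((r[:, None] * (1 - H ** 2)) * a).T @ X / np.sqrt(m)  # w.r.t. hidden weights
    a -= eta_a * grad_a
    W -= eta_W * grad_W
    if t % 100 == 0:
        print(f"iter {t:3d}  loss = {loss(W, a):.4e}")
```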
Regularization and Generalization
- Overparameterized models trained with appropriate norm regularization (such as controlling the 2-norm or Barron norm of the weights) achieve minimax rates for regression, with error bounds that are independent of the network width and depend only on weight norms and function smoothness (Yang et al., 2023, Beknazaryan, 2022). A schematic penalized objective is shown below.
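Schematically (the exact norm, estimator, and tuning differ across the cited papers; $\|\theta\|_{\mathcal{N}}$ below stands in for whichever weight norm is controlled, e.g. an $\ell_2$- or Barron-type norm, and $\lambda > 0$ is a regularization parameter), such results concern penalized empirical risk minimizers of the form:

```latex
% Generic norm-penalized least-squares objective over shallow networks (schematic)
\widehat{f} \;\in\; \arg\min_{f_\theta \,\in\, \mathcal{F}_{\mathrm{shallow}}}
\;\frac{1}{n}\sum_{i=1}^{n}\bigl(f_\theta(x_i) - y_i\bigr)^2
\;+\; \lambda\,\|\theta\|_{\mathcal{N}}.
```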
6. Comparative Perspective and Open Questions
Regime | Overparam. | Rate | Characteristic | Limitations/Open Problems |
---|---|---|---|---|
Classical NTK | | Geometric | "Lazy" regime | Impractically large width requirements |
Moderate | | Geometric | Nonlinear, local analysis | Tightness, extension to deeper networks |
Subquadratic | | Geometric | Adaptive | Requires smooth activations, not ReLU directly |
High-dimension | | Geometric | Weak interaction among samples | Impractical for structured real-world datasets |
Several open directions remain:
- Precise characterization of minimal overparameterization for arbitrary data distributions beyond synthetic or nearly orthogonal settings.
- Extending theory to deeper architectures and more generic activation regimes, possibly relaxing assumptions of bounded higher derivatives.
- Quantitative relationships bridging convergence rates to generalization error, particularly in high-dimensional and overparameterized regimes.
- Understanding transition phenomena and sharp phasic shifts in convergence rates noted in orthogonal settings.
7. Conclusion
Quantitative convergence results for shallow neural networks establish that, under moderate to subquadratic overparameterization and appropriate initialization, (stochastic) gradient descent almost surely converges at a geometric rate to global minima that interpolate training data, even in highly nonconvex settings (Oymak et al., 2019, Song et al., 2021, Razborov, 2022, Polaczyk et al., 2022). The efficiency of this process is sharply governed by the Jacobian spectrum, NTK regularity, data geometry, and initialization. Recent advances have closed much of the gap between overly pessimistic early theory and the much weaker overparameterization required in practice, particularly in high-dimensional weakly-correlated regimes where logarithmic widths suffice. Many results now explicitly link function smoothness, architectural parameters, and training dynamics to rigorous rates, providing a clear blueprint for guaranteeing convergence and informing both theoretical exploration and practical model design.