Quantitative Convergence of Shallow Neural Networks

Updated 6 October 2025
  • The paper's main contribution is establishing quantitative convergence rates with explicit bounds on error decay dependent on network width and overparameterization.
  • It employs local linearization and spectral analysis of the Jacobian to guarantee effective gradient descent convergence in nonconvex shallow networks.
  • The work bridges theory and practice by reducing impractical overparameterization requirements while providing actionable parameter regimes for reliable model training.

Quantitative convergence of shallow neural networks concerns the rigorous characterization and estimation of convergence rates—typically of training error, excess risk, or empirical processes—when learning (potentially nonconvex) shallow neural architectures from data using gradient-based algorithms. Unlike qualitative results, which may guarantee eventual convergence without specifying rates or parameter regimes, quantitative convergence provides explicit dependencies on network width, sample size, initialization, overparameterization, architecture, loss decay modalities (linear, exponential, polynomial), and properties of data and activations. This area addresses the gap between the impractically large overparameterization requirements of classical theory and the often much weaker requirements observed in practice. The following sections synthesize key definitions, methodologies, parameter regimes, and theoretical guarantees in contemporary literature.

1. Definitions and Formal Guarantees

In the context of shallow (one-hidden-layer) neural networks, quantitative convergence refers to explicit rate statements or bounds, usually of the form

$$\Vert f(W_\tau) - y \Vert \leq (1 - \rho)^\tau \Vert f(W_0) - y \Vert$$

for geometric/exponential decay (Oymak et al., 2019), or analogous rates for excess risk or expected loss.
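
As a concrete, purely illustrative instance of this rate statement, the following NumPy sketch trains a one-hidden-layer ReLU network with a fixed random output layer by full-batch gradient descent and prints the per-step contraction of the training residual. The width, step size, and Gaussian data model are arbitrary demonstration choices, not values taken from the cited papers.

```python
# Minimal sketch (not from the cited papers): measure the empirical per-step
# contraction ||f(W_t) - y|| / ||f(W_{t-1}) - y|| for a small overparameterized
# one-hidden-layer ReLU network trained by full-batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 20, 400                               # samples, input dim, hidden width (k >> n)
X = rng.standard_normal((n, d)) / np.sqrt(d)        # illustrative Gaussian data
y = rng.standard_normal(n)

W = rng.standard_normal((k, d))                     # trained hidden-layer weights
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)    # fixed random output weights

def f(W):
    """Network outputs on the training inputs."""
    return np.maximum(X @ W.T, 0.0) @ v

eta = 1.0                                           # illustrative step size
res_prev = np.linalg.norm(f(W) - y)
for t in range(1, 101):
    act = (X @ W.T > 0).astype(float)               # ReLU activation pattern, shape (n, k)
    r = f(W) - y                                    # residual vector
    grad = ((act * v) * r[:, None]).T @ X           # gradient of 0.5 * ||f(W) - y||^2 w.r.t. W
    W -= eta * grad
    res = np.linalg.norm(f(W) - y)
    if t % 20 == 0:
        print(f"step {t:3d}  residual {res:.3e}  per-step contraction {res / res_prev:.3f}")
    res_prev = res
```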

  • Global minimum: The set of parameters $\Theta^*$ achieving zero training loss (exact interpolation of the labels), typically one of many global optima due to overparameterization.
  • Overparameterization: The regime where the number of network parameters $k \cdot d$ is sufficiently larger than, or scales with, the data sample size $n$. The minimal regimes that guarantee global geometric convergence are of central interest (Oymak et al., 2019, Song et al., 2021, Razborov, 2022).
  • Polyak–Łojasiewicz (PL) inequality: A relaxation of strong convexity leveraged in nonconvex analyses, stating $f(\theta) \le \|\nabla f(\theta)\|^2 / (2\alpha_f)$. A local or global PL-type lower bound on curvature along gradient flows enables linear convergence analyses even for nonconvex landscapes (Song et al., 2021, Dana et al., 24 Feb 2025).
  • NTK regime: The training dynamics are linearized around initialization, with the Neural Tangent Kernel (NTK) remaining nearly constant. This often guarantees exponential convergence for overparameterized networks (Xu et al., 7 Dec 2024, Liao et al., 2021), but can stifle feature learning (Caron et al., 2023).
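
The NTK-regime statement above can be probed numerically. The sketch below (an illustrative construction, not code from the cited works) trains a wide one-hidden-layer ReLU network with a fixed random output layer and compares its final outputs with the first-order (linearized) model built from the Jacobian at initialization; the deviation is typically small at large width and grows if the width is reduced.

```python
# Illustrative check of the NTK-regime picture: at large width, the trained network
# stays close to its linearization around the initial weights. Setup is assumed.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10, 20, 4000
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
W0 = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

def f(W):
    return np.maximum(X @ W.T, 0.0) @ v

def jacobian(W):
    """(n, k*d) Jacobian of the outputs with respect to the hidden weights W."""
    act = (X @ W.T > 0).astype(float)
    return ((act * v)[:, :, None] * X[:, None, :]).reshape(n, -1)

J0, W = jacobian(W0), W0.copy()
eta = 1.0
for _ in range(200):                                # full-batch GD on the squared loss
    act = (X @ W.T > 0).astype(float)
    r = f(W) - y
    W -= eta * ((act * v) * r[:, None]).T @ X

f_lin = f(W0) + J0 @ (W - W0).ravel()               # linearized prediction at the trained W
print("relative deviation from the linearized model:",
      np.linalg.norm(f(W) - f_lin) / np.linalg.norm(f(W)))
```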

2. Parameter Regimes for Convergence

Overparameterization Thresholds

A central object of study is the minimal network width $k$ (for input dimension $d$ and $n$ samples) such that convergence guarantees hold.

| Paper | Network Class | Overparam. Threshold | Rate/Guarantee |
|---|---|---|---|
| (Oymak et al., 2019) | Shallow, smooth/ReLU | $\sqrt{kd} \gtrsim n$ | Geometric (exponential) |
| (Song et al., 2021) | Shallow, smooth, std. init | $kd = \widetilde{O}(n^{3/2})$ | Geometric |
| (Razborov, 2022) | Shallow, ReLU, both layers | $m = \widetilde{O}(S)$ | Exponential, uniform NTK conditioning |
| (Polaczyk et al., 2022) | Shallow, ReLU | $k = \widetilde{\Omega}(N^{1.25})$ | Global, exponential under SGD |
| (Dana et al., 24 Feb 2025) | Shallow, ReLU, high-dim data | $p \gtrsim \log n$ | Exponential, PL trajectory |

In all cases, compared to early results requiring $k = \Omega(n^4)$ or worse (in NTK analyses), more recent work demonstrates that subquadratic (or even $\log n$ for high-dimensional, nearly orthogonal data) regimes suffice for global, geometric convergence under stochastic or batch gradient descent.

Impact of Data Geometry

High-dimensional data (with $d \gtrsim n^2$) and weak correlations among samples allow extremely modest widths for convergence. For near-orthogonal data, a single neuron per sample is sufficient in principle (Dana et al., 24 Feb 2025).
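
A quick numerical illustration of this data-geometry point, under an assumed isotropic Gaussian data model rather than the setting of any specific cited paper: as the ambient dimension grows relative to the sample size, unit-norm samples become nearly orthogonal and the Gram matrix approaches the identity.

```python
# Near-orthogonality of high-dimensional isotropic data (illustrative construction).
import numpy as np

rng = np.random.default_rng(2)
n = 50
for d in (10, 100, 10_000):
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm rows
    off = X @ X.T - np.eye(n)                            # off-diagonal Gram entries
    print(f"d={d:6d}  max |<x_i, x_j>| over i != j: {np.abs(off).max():.3f}")
```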

3. Training Dynamics and Methodology

Local Linearization and Jacobian Spectrum

Convergence under moderate overparameterization is often established via a local linearization approach:

  • Jacobian Spectrum Control: Quantitative bounds are derived for the minimum and maximum singular values of the Jacobian of the network outputs with respect to the parameters, both at initialization and along the optimization trajectory. Sufficient overparameterization ensures the minimum singular value remains lower-bounded, guaranteeing the local landscape is well-conditioned and gradient steps are effective (Oymak et al., 2019, Song et al., 2021, Razborov, 2022); a numerical sketch after this list illustrates these quantities.
  • Random Matrix Theory: Matrix Chernoff bounds, Bernstein's inequality, and related random matrix results are employed to estimate the spectrum of Jacobians and the NTK, particularly in the region around initialization when the weights have not moved far (Song et al., 2021, Xu et al., 7 Dec 2024).
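
As referenced above, the following sketch tracks the extreme singular values of the output Jacobian along a gradient descent trajectory for a small ReLU network with a fixed random output layer. The setup (width, data model, step size) is an illustrative assumption rather than a configuration from the cited papers; the quantity of interest is whether $\sigma_{\min}$ stays bounded away from zero as the weights move.

```python
# Track sigma_min and sigma_max of the output Jacobian along GD (illustrative setup).
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 10, 20, 500
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
W = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

def jacobian(W):
    """(n, k*d) Jacobian of the outputs with respect to the hidden weights W."""
    act = (X @ W.T > 0).astype(float)
    return ((act * v)[:, :, None] * X[:, None, :]).reshape(n, -1)

eta = 1.0
for t in range(201):
    if t % 50 == 0:
        s = np.linalg.svd(jacobian(W), compute_uv=False)
        print(f"step {t:3d}  sigma_min {s[-1]:.3f}  sigma_max {s[0]:.3f}")
    act = (X @ W.T > 0).astype(float)
    r = np.maximum(X @ W.T, 0.0) @ v - y
    W -= eta * ((act * v) * r[:, None]).T @ X      # GD step on 0.5 * ||f(W) - y||^2
```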

Gradient Descent and SGD

  • Full-batch gradient descent and stochastic gradient descent (SGD) are both analyzed. The results typically show that SGD with sufficiently small step size can closely track the limiting gradient flow (possibly formalized as a differential inclusion for non-smooth activations (Polaczyk et al., 2022)), leading to global convergence at a linear rate.
  • Key recursion: For discrete iterations, analysis yields an error recursion of the form

$$\|f(W_{k+1}) - y\| \le (1 - c)\|f(W_k) - y\| + \epsilon(k)$$

where $\epsilon(k) \to 0$ as $k$ increases under strong overparameterization (Xu et al., 7 Dec 2024).
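
A worked toy iteration of this recursion (with arbitrary constants, purely for intuition) shows the residual contracting geometrically down to a floor governed by the perturbation term, which disappears as $\epsilon(k) \to 0$:

```python
# Scalar simulation of the error recursion e_{k+1} = (1 - c) e_k + epsilon(k).
# The constants c, e_0, and the decay of epsilon are arbitrary illustrative choices.
c, e = 0.1, 1.0
for k in range(50):
    eps = 0.5 ** k * 1e-3          # one possible vanishing perturbation sequence
    e = (1 - c) * e + eps
    if k % 10 == 0:
        print(f"k={k:2d}  error bound {e:.3e}")
```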

4. Quantitative Rates and Theoretical Results

Geometric and Sublinear Rates

  • Under the established overparameterization and initialization conditions, the loss decreases according to

$$\|f(W_\tau) - y\| \leq \left(1 - \eta_{\text{eff}} \cdot \sigma_{\min}^2\right)^\tau \|f(W_0) - y\|$$

where $\sigma_{\min}$ is the minimal singular value of the layer-wise Jacobian or Khatri–Rao product and $\eta_{\text{eff}}$ depends on the step size and spectral characteristics (Oymak et al., 2019, Song et al., 2021); a small numerical check of this rate appears after this list.

  • For orthogonal data, convergence rates can be explicitly bounded between $1/n$ and $1/\sqrt{n}$ in the exponent (i.e., decay per iteration), with sharp phase transitions depending on initialization and neuron activation patterns (Dana et al., 24 Feb 2025).
  • When both layers are trained simultaneously, theory guarantees uniform lower bounds on the minimal eigenvalue of the NTK matrix throughout training, which ensures global exponential convergence, even outside the classical NTK regime (Razborov, 2022).
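
The following sketch compares the per-step contraction factor suggested by the Jacobian spectrum at initialization, $1 - \eta\,\sigma_{\min}^2$ (with the raw step size standing in for $\eta_{\text{eff}}$), against the average contraction actually observed over a short run of gradient descent. All numerical choices are illustrative assumptions, not values from the cited papers.

```python
# Compare a spectrum-based per-step contraction factor with the observed one.
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 10, 20, 1000
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
W = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

def f(W):
    return np.maximum(X @ W.T, 0.0) @ v

def sigma_min(W):
    # smallest singular value of the (n, k*d) output Jacobian w.r.t. W
    act = (X @ W.T > 0).astype(float)
    J = ((act * v)[:, :, None] * X[:, None, :]).reshape(n, -1)
    return np.linalg.svd(J, compute_uv=False)[-1]

eta, T = 1.0, 20
factor = 1 - eta * sigma_min(W) ** 2               # spectral factor at initialization
res0 = np.linalg.norm(f(W) - y)
for _ in range(T):
    act = (X @ W.T > 0).astype(float)
    r = f(W) - y
    W -= eta * ((act * v) * r[:, None]).T @ X
avg = (np.linalg.norm(f(W) - y) / res0) ** (1 / T)  # geometric-mean contraction per step
print(f"spectral per-step factor {factor:.3f}   average observed contraction {avg:.3f}")
```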

Polyak–Łojasiewicz Analysis

  • The loss decay can be equivalently described along the gradient flow trajectory via the local PL constant:

$$\frac{d}{dt} L(\theta_t) = -\mu(t)\, L(\theta_t)$$

with $\mu(t)$ reflecting local curvature. Lower bounding $\mu(t)$ explicitly quantifies exponential rates (Dana et al., 24 Feb 2025).
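
Along gradient flow $\tfrac{d}{dt} L = -\|\nabla L\|^2$, so the local PL quantity can be monitored directly as $\mu_t = \|\nabla L(\theta_t)\|^2 / L(\theta_t)$. The sketch below (an illustrative shallow ReLU setup, not taken from the cited papers) prints this ratio along discrete-time gradient descent; a lower bound on it translates into an exponential decay rate for the loss.

```python
# Monitor the local PL ratio ||grad L||^2 / L along gradient descent (illustrative setup).
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 10, 20, 500
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
W = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

eta = 1.0
for t in range(151):
    act = (X @ W.T > 0).astype(float)
    r = np.maximum(X @ W.T, 0.0) @ v - y
    grad = ((act * v) * r[:, None]).T @ X          # gradient of L(W) = 0.5 * ||f(W) - y||^2
    L = 0.5 * np.dot(r, r)
    if t % 50 == 0:
        print(f"step {t:3d}  loss {L:.3e}  local PL ratio {np.sum(grad ** 2) / L:.3f}")
    W -= eta * grad
```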

5. Extensions and Model Variants

Non-smooth Activations

  • Results are robust to non-differentiable activations (e.g. ReLU). Differential inclusion techniques and generalized derivative notions are leveraged for global convergence proofs (Oymak et al., 2019, Polaczyk et al., 2022).

Layerwise and Multi-rate Training

  • Simultaneous multi-rate training of both layers (with independently chosen learning rates) can improve parameterization regimes and allows the translation of convergence results between different initialization norms (Razborov, 2022).

Regularization and Generalization

  • Overparameterized models trained with appropriate norm regularization (such as controlling the 2-norm or Barron norm) achieve minimax rates for regression, with error bounds that are independent of the network width and depend only on weight norms and function smoothness (Yang et al., 2023, Beknazaryan, 2022).

6. Comparative Perspective and Open Questions

| Regime | Overparam. | Rate | Characteristic | Limitations/Open Problems |
|---|---|---|---|---|
| Classical NTK | $k \gg n^4$ | Geometric | “Lazy” regime | Impractically large $k$ |
| Moderate | $\sqrt{kd} \gtrsim n$ | Geometric | Nonlinear, local | Tightness, extension to deeper networks |
| Subquadratic | $kd \sim n^{3/2}$ | Geometric | Adaptive | Requires smooth activations, not ReLU directly |
| High-dimension | $p \sim \log n$ | Geometric | Weak interaction | Impractical for structured real-world datasets |

Several open directions remain:

  • Precise characterization of minimal overparameterization for arbitrary data distributions beyond synthetic or nearly orthogonal settings.
  • Extending theory to deeper architectures and more generic activation regimes, possibly relaxing assumptions of bounded higher derivatives.
  • Quantitative relationships bridging convergence rates to generalization error, particularly in high-dimensional and overparameterized regimes.
  • Understanding transition phenomena and sharp phasic shifts in convergence rates noted in orthogonal settings.

7. Conclusion

Quantitative convergence results for shallow neural networks establish that, under moderate to subquadratic overparameterization and appropriate initialization, (stochastic) gradient descent almost surely converges at a geometric rate to global minima that interpolate training data, even in highly nonconvex settings (Oymak et al., 2019, Song et al., 2021, Razborov, 2022, Polaczyk et al., 2022). The efficiency of this process is sharply governed by the Jacobian spectrum, NTK regularity, data geometry, and initialization. Recent advances have closed much of the gap between overly pessimistic early theory and the much weaker overparameterization required in practice, particularly in high-dimensional weakly-correlated regimes where logarithmic widths suffice. Many results now explicitly link function smoothness, architectural parameters, and training dynamics to rigorous rates, providing a clear blueprint for guaranteeing convergence and informing both theoretical exploration and practical model design.
