
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent

Published 13 Oct 2025 in stat.ML and cs.LG | (2510.12013v1)

Abstract: Stochastic Gradient Descent (SGD) and its Ruppert-Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings are rarely understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q \ge 2$ in general $\ell^s$-norms, and, in particular, the $\ell^{\infty}$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide sharp high-probability concentration analysis which entails the probabilistic bound of high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.

Summary

  • The paper establishes exponential contraction and a unique stationary distribution for SGD iterates via a novel coupling framework.
  • It derives explicit non-asymptotic moment bounds in various norms, including the challenging ℓ∞-norm, crucial for high-dimensional inference.
  • High-probability concentration results guide practical learning rate selection and offer reliable statistical guarantees for constant-step algorithms.


Overview

This paper develops a comprehensive theoretical framework for analyzing Stochastic Gradient Descent (SGD) and its Ruppert–Polyak averaged variant (ASGD) in high-dimensional regimes with constant learning rates. The authors address a significant gap in the literature: while constant learning rates are widely used in practice for large-scale, overparameterized models, existing theoretical results are largely restricted to low-dimensional settings or require decaying learning rates. The paper leverages tools from high-dimensional nonlinear time series, particularly geometric-moment contraction (GMC) and coupling techniques, to establish rigorous statistical guarantees for SGD and ASGD under strong convexity and smoothness assumptions.
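
To fix notation, here is a minimal sketch of the two recursions being analyzed, constant-step SGD and its Ruppert-Polyak average, on a hypothetical linear model; the dimension, step size, and noise level are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: least squares with true parameter beta_star.
d, alpha, n_steps = 20, 0.01, 5_000
beta_star = rng.normal(size=d)

def stochastic_gradient(beta):
    """Single-sample least-squares gradient (x^T beta - y) x, with y = x^T beta_star + noise."""
    x = rng.normal(size=d)
    y = x @ beta_star + 0.1 * rng.normal()
    return (x @ beta - y) * x

beta = np.zeros(d)       # SGD iterate with a constant learning rate alpha
beta_bar = np.zeros(d)   # Ruppert-Polyak (ASGD) running average of the iterates
for k in range(1, n_steps + 1):
    beta -= alpha * stochastic_gradient(beta)
    beta_bar += (beta - beta_bar) / k

print("last-iterate l_inf error:", np.max(np.abs(beta - beta_star)))
print("averaged     l_inf error:", np.max(np.abs(beta_bar - beta_star)))
```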

Main Contributions

1. Geometric-Moment Contraction and Asymptotic Stationarity

The authors reinterpret the SGD recursion as a nonlinear autoregressive process and adapt coupling techniques from high-dimensional time series analysis. They prove that, under a sufficiently small constant learning rate $\alpha$, the effect of initialization is forgotten at an exponential rate in any $\ell^s$-norm ($s \geq 2$ even), including the $\ell^\infty$-norm relevant for high-dimensional sparse estimation. Specifically, for two SGD sequences driven by the same noise but started from different initializations, the $q$-th moment of their distance contracts geometrically:

$$\left(\mathbb{E}\,\|\beta_k - \beta_k'\|_s^q\right)^{1/q} \;\leq\; r_{\alpha,s,q}^{\,k}\, \|\beta_0 - \beta_0'\|_s,$$

where $r_{\alpha,s,q} < 1$ is an explicit contraction constant depending on the learning rate, strong convexity, and Lipschitz constants. This establishes the existence of a unique stationary distribution for the SGD iterates, even in high dimensions.
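
The coupling argument behind this bound is easy to visualize numerically. The sketch below is a toy experiment under an assumed linear-model setup (not the paper's general setting): it runs two constant-step SGD chains on the same data stream from different starting points and tracks their $\ell^s$ distance, which should decay geometrically as the bound predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: linear model, shared noise stream, different initializations.
d, s, alpha, n_steps = 20, 8, 0.01, 3_000
beta_star = rng.normal(size=d)

beta = rng.normal(size=d) * 5.0   # chain started far from beta_star
beta_prime = np.zeros(d)          # chain started at the origin
log_dist = []
for k in range(n_steps):
    x = rng.normal(size=d)
    y = x @ beta_star + 0.1 * rng.normal()
    # The SAME (x, y) drives both chains -- this is the coupling.
    beta -= alpha * (x @ beta - y) * x
    beta_prime -= alpha * (x @ beta_prime - y) * x
    log_dist.append(np.log(np.sum(np.abs(beta - beta_prime) ** s)) / s)

# Geometric-moment contraction shows up as a roughly linear decay of the log l_s distance.
print("log l_s distance at k = 1, 1000, 3000:",
      [round(float(log_dist[k]), 2) for k in (0, 999, 2999)])
```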

2. Non-Asymptotic Moment Bounds in General Norms

Building on the GMC property, the paper derives explicit non-asymptotic $q$-th moment bounds for the error of both SGD and ASGD iterates in arbitrary $\ell^s$-norms, including the max-norm ($\ell^\infty$) obtained by setting $s \approx \log d$. The results generalize classical mean squared error (MSE) analyses to higher moments and norms, which are critical for high-dimensional inference and uncertainty quantification. The moment bounds are dimension-dependent and account for the growth of Lipschitz and noise constants with $d$.

For ASGD, the error decomposition isolates three terms: stochastic variance, initialization bias, and the non-vanishing bias due to the constant learning rate. The authors provide explicit rates for each component, showing that the overall error in the $\ell^\infty$-norm is controlled as

$$\|\bar{\beta}_k - \beta^*\|_{\infty, q} \;\leq\; O\!\left(\sqrt{\frac{\log d}{k}} + \frac{1}{k} + \alpha \cdot \mathrm{poly}(d)\right),$$

where the last term reflects the persistent bias from the constant step size.
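
Under the same toy linear model as above (an assumption for illustration, not the paper's experiments), the trade-off behind this decomposition can be seen directly: the last SGD iterate fluctuates around $\beta^*$ at a level that grows with $\alpha$, while the Ruppert-Polyak average suppresses that fluctuation, leaving only the slowly decaying terms of the bound.

```python
import numpy as np

def run(alpha, n_steps, d=20, seed=0):
    """Constant-step SGD on a toy linear model; returns l_inf errors of the
    last iterate and of the Ruppert-Polyak average (illustrative setup)."""
    rng = np.random.default_rng(seed)
    beta_star = rng.normal(size=d)
    beta, beta_bar = np.zeros(d), np.zeros(d)
    for k in range(1, n_steps + 1):
        x = rng.normal(size=d)
        y = x @ beta_star + 0.5 * rng.normal()
        beta -= alpha * (x @ beta - y) * x
        beta_bar += (beta - beta_bar) / k
    return np.max(np.abs(beta - beta_star)), np.max(np.abs(beta_bar - beta_star))

# Larger alpha -> larger stationary fluctuation of the last iterate;
# the averaged iterate is far less sensitive to the choice of alpha.
for alpha in (0.05, 0.01):
    sgd_err, asgd_err = run(alpha, n_steps=50_000)
    print(f"alpha={alpha}: last iterate {sgd_err:.3f}, average {asgd_err:.3f}")
```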

3. High-Probability Concentration and Complexity Guarantees

The paper develops a high-probability tail bound for the ASGD estimator in high dimensions, using a novel Fuk-Nagaev-type inequality adapted to the functional dependence structure of the SGD process. This yields, for any target error $\varepsilon$ and confidence level $1-\delta$, an explicit bound on the number of iterations $k$ required to achieve $\|\bar{\beta}_k - \beta^*\|_{\infty} \leq \varepsilon$ with probability at least $1-\delta$. The complexity bound is of order $O(1/\varepsilon^2)$ (up to dimension-dependent factors), matching known results in special cases and providing new insight into the dimension dependence for general high-dimensional models.
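
As a rough illustration of how such a complexity bound is read in practice, the snippet below plugs a target accuracy and confidence level into a generic $k \gtrsim C \log(d/\delta)/\varepsilon^2$ template; the constant $C$ and the exact dimension factor are placeholders, not the paper's expression.

```python
import math

def iterations_needed(eps, delta, d, C=1.0):
    """Generic O(1/eps^2) complexity template with a placeholder constant C
    and a placeholder log(d/delta) dimension factor (illustrative only)."""
    return math.ceil(C * math.log(d / delta) / eps ** 2)

for eps in (0.1, 0.05, 0.01):
    print(f"eps={eps}: k >= {iterations_needed(eps, delta=0.05, d=1_000):,}")
```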

4. Gaussian Approximation for ASGD

The authors establish a non-asymptotic Gaussian approximation for the distribution of the averaged SGD iterates, quantifying the rate at which the empirical distribution of ASGD approaches a multivariate normal law. The approximation rate depends on the ratio $d/T$ (dimension to sample size), and the result is valid under finite $q$-th moment assumptions on the gradient noise. This provides a theoretical foundation for constructing confidence intervals and conducting inference in high-dimensional online learning.
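
A quick Monte Carlo sanity check of this kind of result, again on an assumed toy linear model rather than the paper's general setting, is to repeat ASGD many times and inspect whether the rescaled error of a fixed coordinate looks approximately Gaussian:

```python
import numpy as np

d, alpha, k_steps, n_reps = 10, 0.02, 3_000, 200

def asgd_error(seed):
    """One ASGD run on a toy linear model; returns the rescaled error of coordinate 0."""
    rng = np.random.default_rng(seed)
    beta_star = np.ones(d)
    beta, beta_bar = np.zeros(d), np.zeros(d)
    for k in range(1, k_steps + 1):
        x = rng.normal(size=d)
        y = x @ beta_star + rng.normal()
        beta -= alpha * (x @ beta - y) * x
        beta_bar += (beta - beta_bar) / k
    return np.sqrt(k_steps) * (beta_bar[0] - beta_star[0])

errs = np.array([asgd_error(seed) for seed in range(n_reps)])
z = (errs - errs.mean()) / errs.std()
# For an approximately Gaussian limit, skewness should be near 0 and kurtosis near 3.
print("skewness:", round(float((z ** 3).mean()), 2), " kurtosis:", round(float((z ** 4).mean()), 2))
```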

5. Dimension-Dependent Learning Rate Selection

The analysis yields explicit upper bounds on the admissible constant learning rate $\alpha$ as a function of the problem dimension $d$, the strong convexity parameter $\mu$, and the Lipschitz constant $L_{s,q}$. For linear models with sub-Gaussian or sub-exponential covariates, the admissible $\alpha$ scales as $O(1/(d^2 \log d))$, reflecting the increased instability of SGD in high dimensions.
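
Read literally, a scaling of this form can be turned into a back-of-the-envelope step-size rule; in the sketch below the constant c is a hypothetical stand-in for the strong-convexity, Lipschitz, and noise-moment quantities that the paper's bound actually involves.

```python
import math

def admissible_alpha(d, c=1.0):
    """Hypothetical step-size rule following the reported O(1/(d^2 log d)) scaling;
    c is a placeholder for the problem-dependent constants."""
    return c / (d ** 2 * math.log(d))

for d in (10, 100, 1_000):
    print(f"d={d}: alpha <= {admissible_alpha(d):.2e}")
```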

Technical Approach

  • Norm Equivalence and Surrogates: The non-differentiability of the $\ell^\infty$-norm is circumvented by working with $\ell^s$-norms for large $s$, exploiting norm equivalence in high dimensions (see the sketch after this list).
  • High-Dimensional Moment Inequalities: The authors generalize Rio's moment inequality to arbitrary $\ell^s$-norms and $q$-th moments, enabling sharp control of the error propagation in the SGD recursion.
  • Functional Dependence Measures: The dependence structure of the SGD process is quantified using functional dependence measures, allowing the derivation of maximal inequalities and concentration bounds for the ASGD estimator.
  • Explicit Dimension Dependence: All bounds are given with explicit dependence on $d$, $q$, $s$, and the moments of the gradient noise, facilitating practical guidance for algorithm design in high-dimensional settings.
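
The first bullet's norm surrogate is easy to check numerically (a toy illustration on an assumed Gaussian vector, not an argument from the paper): with $s \approx \log d$, the $\ell^s$-norm is a smooth proxy that overestimates the $\ell^\infty$-norm by at most the bounded factor $d^{1/s} \approx e$.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10_000
s = max(2, int(np.ceil(np.log(d))))   # s ~ log d, the max-norm surrogate exponent
v = rng.standard_normal(d)            # assumed test vector, for illustration only

l_inf = np.max(np.abs(v))
l_s = np.sum(np.abs(v) ** s) ** (1.0 / s)

# Norm equivalence: ||v||_inf <= ||v||_s <= d^(1/s) * ||v||_inf,
# and with s ~ log d the inflation factor d^(1/s) = exp(log d / s) stays near e.
print(f"l_inf = {l_inf:.3f}, l_s = {l_s:.3f}, d^(1/s) = {d ** (1.0 / s):.3f}")
```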

Implications and Future Directions

The results provide a rigorous foundation for the use of constant learning-rate SGD and ASGD in high-dimensional, overparameterized models, which are ubiquitous in modern machine learning. The explicit moment and concentration bounds in $\ell^\infty$-norms are particularly relevant for high-dimensional inference, such as constructing confidence intervals and controlling the family-wise error rate in sparse estimation.

The framework is general and can be extended to other online learning algorithms and more complex dependency structures. While the current analysis assumes strong convexity and smoothness, the nonlinear time series perspective and coupling techniques introduced here offer a promising path toward analyzing non-convex and non-smooth objectives, as well as heavy-tailed noise.

Potential future developments include:

  • Extension to non-convex loss functions and adaptive step-size schedules.
  • Application to structured models (e.g., group sparsity, low-rank matrix estimation) where norm-based error control is critical.
  • Development of practical guidelines for learning rate selection and stopping criteria in high-dimensional regimes.
  • Investigation of the interplay between algorithmic stability, generalization, and statistical inference in online learning.

Conclusion

This work closes a critical theoretical gap by providing sharp statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional settings. The integration of high-dimensional time series techniques with online optimization theory yields new insights into the stability, convergence, and reliability of large-scale learning algorithms. The results have direct implications for the design and analysis of optimization algorithms in modern machine learning, particularly in regimes where the number of parameters far exceeds the number of samples.
