Stochastic Gradient Descent
- Stochastic Gradient Descent (SGD) is a fundamental stochastic optimization algorithm that iteratively updates parameters using random mini-batches for scalable machine learning.
- SGD adapts to diverse regimes (convex, nonconvex, and over-parameterized) by exploiting learning-rate schedules and noise structure, achieving exponential convergence in some regimes and sublinear convergence in others.
- Advanced SGD variants employ momentum, variance reduction, and preconditioning techniques to enhance robustness, numerical stability, and generalization in deep learning applications.
Stochastic Gradient Descent (SGD) is a foundational algorithm for large-scale stochastic optimization, particularly for over-parameterized and high-dimensional models prevalent in machine learning. SGD operates by iteratively updating parameters using stochastic approximations of gradients derived from random samples or mini-batches, achieving computational and memory scalability not possible with deterministic methods. Its theoretical, methodological, and practical foundations have been generalized extensively to accommodate nonconvex landscapes, variance-structured noise, various geometric regimes, and adaptive or biased oracles.
1. Algorithmic Formulation and Noise Structure
Let $f:\mathbb{R}^d \to \mathbb{R}$ denote an empirical risk or population loss. The canonical discrete-time SGD iteration reads
$$x_{t+1} = x_t - \eta_t\, g(x_t, \xi_t),$$
where $\eta_t > 0$ is the learning rate, $(\xi_t)_{t \ge 0}$ is an i.i.d. sequence representing sampled data or mini-batch indices, and $g(x,\xi)$ is a random gradient estimator of $\nabla f(x)$ satisfying $\mathbb{E}_\xi[g(x,\xi)] = \nabla f(x)$. Classic assumptions require a uniform variance bound: $\mathbb{E}_\xi\|g(x,\xi) - \nabla f(x)\|^2 \le \sigma^2$. However, in over-parameterized and deep learning settings, the variance is typically loss-dependent, i.e.,
$$\mathbb{E}_\xi\|g(x,\xi) - \nabla f(x)\|^2 \le \sigma^2 f(x),$$
and the noise can be decomposed as $g(x,\xi) = \nabla f(x) + \sqrt{f(x)}\, Y(x,\xi)$ for $Y$ satisfying certain spread-out (e.g., Gaussian) and bounded second-moment conditions (Wojtowytsch, 2021). This machine-learning-type (ML) noise results in vanishing stochasticity as the loss approaches zero, fundamentally altering both convergence analysis and optimization dynamics.
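As a concrete instance of the iteration above, the following minimal NumPy sketch runs mini-batch SGD on an over-parameterized least-squares problem (dimensions, data, and step size are hypothetical illustrative choices, not taken from the cited works); because an interpolating solution exists, the gradient-noise variance vanishes with the loss, as in the ML-noise regime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterized least squares: d > n, so an interpolating solution
# with f(x) = 0 exists and the mini-batch noise vanishes as the loss does.
n, d, batch = 50, 200, 10
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)          # consistent system: min f = 0

def loss(x):
    return 0.5 * np.mean((A @ x - y) ** 2)

def stoch_grad(x):
    idx = rng.choice(n, size=batch, replace=False)    # sample a mini-batch
    return A[idx].T @ (A[idx] @ x - y[idx]) / batch   # unbiased gradient estimate

eta = 0.01                               # constant step size, admissible here
x = np.zeros(d)
for t in range(5000):
    x = x - eta * stoch_grad(x)          # canonical SGD iteration

print(f"final loss: {loss(x):.3e}")      # loss driven toward zero
```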
2. Convergence Theory: Nonconvex, Convex, and Over-parameterized Regimes
SGD's convergence is characterized by the interplay between the geometry of $f$ (e.g., convexity, Łojasiewicz inequality, PL condition), the noise structure, and the learning rate schedule. The classical Robbins–Monro framework requires step sizes $\eta_t$ with $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ to control stationary noise (Patel et al., 2021). In contrast, under ML-type noise with vanishing variance, a constant positive learning rate is admissible, $0 < \eta \le \bar\eta(C_L, \lambda)$, where $C_L$ is a one-sided Lipschitz constant of the gradient and $\lambda$ comes from a Łojasiewicz-type inequality (Wojtowytsch, 2021). In these settings:
- For objectives satisfying a global Łojasiewicz or Polyak–Łojasiewicz (PL) condition, SGD exhibits exponentially fast convergence in expectation, and almost sure exponential convergence to the minimizer in over-parameterized losses (Wojtowytsch, 2021, Lei et al., 2019, Louzi, 8 Dec 2024).
- Ergodicity and Lyapunov analyses extend global convergence to nonconvex objectives under weaker $\alpha$-Hölder smoothness, yielding almost sure stationarity and vanishing gradient norms (Patel et al., 2021, Louzi, 8 Dec 2024).
- Bounded-step-size variants apply in regimes where gradient noise variance shrinks as loss decreases, a property generic in deep learning scenarios.
For nonconvex landscapes, function-value convergence, weak convergence of the iterates, and almost sure convergence to stationary points (possibly random) are established under suitable Lyapunov conditions, diminishing step sizes, and local or global gradient domination (Louzi, 8 Dec 2024, Lei et al., 2019). Rate-optimal guarantees of order $O(T^{-1/2})$ for the expected gradient norm $\min_{t \le T} \mathbb{E}\|\nabla f(x_t)\|^2$ are attainable without uniform gradient bounds (Lei et al., 2019).
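For concreteness, a Robbins–Monro-compliant schedule is a one-liner; the constants below are hypothetical illustrative choices.

```python
def robbins_monro_stepsize(t, eta0=0.1, alpha=0.75):
    """Step size eta_t = eta0 / (1 + t)**alpha.

    Any exponent 1/2 < alpha <= 1 satisfies the Robbins-Monro conditions
    sum_t eta_t = infinity and sum_t eta_t**2 < infinity; under ML-type
    noise the schedule can instead be held constant (eta_t = eta0).
    """
    return eta0 / (1.0 + t) ** alpha
```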
3. Noise Geometry, Effective Temperature, and Landscape Implications
The stochastic gradient noise in SGD is not white; it is typically state-dependent and non-isotropic. Mini-batch sampling induces a gradient noise covariance $C(x) \propto \frac{1-b}{b}\,\Sigma(x)$, where $\Sigma(x)$ is the covariance of the per-sample gradients and $b$ is the batch fraction (Mignacco et al., 2021).
Key developments include:
- Effective temperature (UNSAT phase): In under-parameterized regimes, SGD reaches a stationary non-equilibrium state characterized by an effective temperature $T_{\mathrm{eff}}$, extractable via fluctuation–dissipation plots and scaling with the learning rate and inversely with the batch size. In the over-parameterized SAT phase, the replica distance between two independent SGD trajectories quantifies the residual noise (Mignacco et al., 2021).
- Decision boundary geometry: Higher noise levels (large learning rate $\eta$ or small batch fraction $b$) are empirically correlated with wider decision boundaries (a lower fraction of support vectors), contributing to the phenomenon where SGD preferentially converges to flat minima with better generalization (Mignacco et al., 2021).
From a geometric viewpoint, SGD dynamics can be framed as relativistic gradient flows on a family of Riemannian metrics parameterized by the local diffusion tensor $D(x)$, yielding deterministic flows along geodesics in the associated "diffusion metric". This framework recovers and generalizes natural gradient methods and gives a mechanistic explanation for SGD's implicit bias toward wide, flat regions (Fioresi et al., 2019).
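To make the state dependence concrete, the noise covariance can be estimated from per-sample gradients at a given iterate; the sketch below assumes a finite-sum objective and sampling without replacement, and the exact prefactor is an illustrative convention rather than a quotation from the cited work.

```python
import numpy as np

def minibatch_noise_covariance(per_sample_grads, batch_fraction):
    """Estimate the mini-batch gradient-noise covariance C(x) at one point.

    per_sample_grads: (N, d) array whose i-th row is the gradient of loss_i
    batch_fraction:   b = batch_size / N (sampling without replacement)

    Uses C(x) ~ ((1 - b) / (b * N)) * Sigma(x), where Sigma(x) is the
    empirical covariance of the per-sample gradients; the scaling shows
    how the noise grows as the batch fraction b shrinks.
    """
    N, _ = per_sample_grads.shape
    centered = per_sample_grads - per_sample_grads.mean(axis=0)
    sigma = centered.T @ centered / N          # per-sample covariance Sigma(x)
    b = batch_fraction
    return (1.0 - b) / (b * N) * sigma
```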
4. Extensions: Variance Reduction, Step-Size Schemes, and Implicit/Preconditioned Methods
Variance reduction is critical for accelerating SGD in convex and nonconvex regimes:
- Semi-stochastic and control variate methods: SVRG, SAGA, and S3GD apply explicit variance correction, via snapshots (SVRG), full or partial gradient tables (SAGA), or manifold propagation via anchor-graph diffusions (S3GD), to construct variance-reduced directions at only slightly increased computational cost per iteration (Mu et al., 2015); a minimal SVRG sketch follows the pseudocode below.
- Least-squares control variates: For continuous expectation objectives (not finite sums), constructing a control variate via weighted least-squares fits to recent gradient evaluations (SG-LSCV) enables sublinear convergence without a finite-sum structure. The per-iteration overhead grows polynomially with $m$, the dimension of the surrogate basis, yet is minor in high-cost scenarios such as PDE-constrained optimization (Nobile et al., 28 Jul 2025).
- Adaptive, momentum, and preconditioned variants: Integration of momentum (Nesterov’s acceleration), diagonal preconditioning (AdaGrad/Adam), and adaptive Polyak step size (e.g. SPS, PSPS) has yielded schemes (A2Grad, PSPS) that achieve either optimal deterministic and stochastic rates, or substantial practical robustness to scaling, curvature, and ill-conditioning (Deng et al., 2018, Abdukhakimov et al., 2023, Tran et al., 2015).
Pseudocode for a general preconditioned Polyak-type adaptive step-size SGD (adapted from Abdukhakimov et al., 2023):

```python
for t in range(T):
    g = grad_f_i(w)                  # stochastic gradient of the sampled loss f_i
    B = update_preconditioner(g)     # e.g. AdaGrad/Adam/Hutchinson diagonal estimate
    Binv_g = solve(B, g)             # apply B^{-1}
    eta = f_i(w) / (g @ Binv_g)      # Polyak step size in the B^{-1}-weighted norm
    w = w - eta * Binv_g             # preconditioned update
```
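For comparison, the snapshot-based control-variate construction behind SVRG (first bullet above) can be sketched end to end; the quadratic objective, epoch length, and step size below are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 20
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_i(w, i):                      # gradient of 0.5 * (a_i . w - y_i)^2
    return A[i] * (A[i] @ w - y[i])

def full_grad(w):
    return A.T @ (A @ w - y) / n

w, eta = np.zeros(d), 0.02
for epoch in range(50):
    w_snap = w.copy()                  # snapshot of the current iterate
    mu = full_grad(w_snap)             # one full gradient per epoch
    for _ in range(n):
        i = rng.integers(n)
        # control-variate direction: unbiased, with variance shrinking
        # as both w and w_snap approach the minimizer
        v = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= eta * v

print(f"gradient norm after SVRG: {np.linalg.norm(full_grad(w)):.2e}")
```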
5. Adaptive Data, Biased Estimators, and Non-IID Sampling
SGD's convergence extends beyond the classical IID setting to adaptive data and biased or consistent gradient estimators:
- Adaptive data: For Markov or state-feedback sampling—ubiquitous in reinforcement learning and policy optimization—the convergence rates degrade only by poly-logarithmic factors in the mixing time of the associated controlled Markov chain, provided geometric ergodicity and Lyapunov drift hold. This ensures robust performance in online learning or policy-gradient actor-critic scenarios when step sizes are set according to traditional Robbins–Monro or population-PL principles (Che et al., 2 Oct 2024); a minimal simulation sketch follows this list.
- Consistent but biased gradient estimators: Generalizing the unbiasedness requirement, SGD achieves comparable convergence rates for strongly convex, convex, or nonconvex objectives, provided the estimator is sufficiently consistent (e.g., the probability of a given estimator error decays exponentially in the sample size of local neighborhoods, as for sampled GCNs) (Chen et al., 2018).
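The adaptive-data setting above is easy to simulate: the sketch below runs SGD where the sample index evolves as a simple Markov chain rather than being drawn i.i.d. (the chain, problem data, and step-size schedule are hypothetical choices; geometric ergodicity holds trivially for this lazy random walk).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 10
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)          # consistent system for a clean optimum

def next_index(i):
    """Lazy random walk over sample indices: stay with prob. 1/2, else jump
    uniformly. The chain is geometrically ergodic with uniform stationary
    law, so gradients are unbiased under the stationary distribution."""
    return i if rng.random() < 0.5 else int(rng.integers(n))

w, i = np.zeros(d), 0
for t in range(50000):
    eta = 0.05 / (1 + t) ** 0.51        # Robbins-Monro schedule
    g = A[i] * (A[i] @ w - y[i])        # gradient of the i-th sample loss
    w -= eta * g
    i = next_index(i)                   # Markovian, not i.i.d., sampling

print(f"gradient norm: {np.linalg.norm(A.T @ (A @ w - y) / n):.2e}")
```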
6. Specialized Theoretical Regimes and Practical Guidelines
SGD's asymptotic and finite-time behaviors are tightly linked to the geometry and noise regime:
- Separable systems (exp-smooth monotone loss): For homogeneous linear classifiers on linearly separable data and $\beta$-smooth monotone loss functions (e.g., logistic), SGD with a fixed learning rate achieves zero loss and converges in direction to the $L_2$ max-margin separator at rate $O(1/\log t)$, independently of the batch size when the step size scales linearly with it; the loss decreases as $O(1/t)$, matching gradient descent (Nacson et al., 2018).
- Graduated optimization: Viewing SGD as sequential optimization over a decreasing sequence of increasingly less smoothed versions of (via convolutional nonnegative approximate identities), one can formally justify practical multiscale smoothing approaches for escaping local minima in nonconvex problems (Li et al., 2023).
- Implicit SGD and Polyak–Ruppert averaging: Implicit update rules employing observed Fisher information provide numerical stability and robustness to step-size specification. Weighted and adaptively-averaged iterates (e.g., Polyak–Ruppert, polynomial decay, data-driven) achieve minimax-optimal statistical efficiency, with explicit mean-squared error derivations available in the linear case (Tran et al., 2015, Wei et al., 2023); a minimal averaging sketch follows this list.
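The averaging idea in the last item takes only a few lines; below is a minimal sketch of uniform Polyak–Ruppert averaging on a noisy quadratic (the objective, noise level, and step size are hypothetical choices), illustrating how the averaged iterate suppresses the stationary noise floor of the last iterate.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
H = np.diag(np.linspace(0.5, 5.0, d))    # quadratic f(w) = 0.5 * w^T H w, min at 0
w = rng.standard_normal(d)
w_bar = np.zeros(d)

for t in range(20000):
    g = H @ w + rng.standard_normal(d)    # unbiased but noisy gradient oracle
    w -= 0.01 * g                         # constant step size: fast but noisy
    w_bar += (w - w_bar) / (t + 1)        # running Polyak-Ruppert average

print(f"last iterate error: {np.linalg.norm(w):.3f}")
print(f"averaged error:     {np.linalg.norm(w_bar):.3f}")  # markedly smaller
```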
7. Impact and Outlook
SGD and its numerous variants are indispensable for modern machine learning and large-scale stochastic approximation problems due to their computational efficiency, strong theoretical guarantees under mild conditions, and empirical robustness in nonconvex, high-dimensional, and overparameterized settings. Recent theoretical advances have relaxed traditional assumptions (global Lipschitzness, bounded variance, unbiased gradients, IID data), enabling rigorous convergence assertions in realistic problem domains including adaptive, structured, and biased sampling. SGD's bias toward flat minima, as captured by diffusion geometry and state-dependent noise, offers a foundation for understanding its generalization behavior in deep networks.
These developments pave the way for continued refinement of stochastic optimization methods, addressing challenges in nonconvexity, adaptivity, and variance reduction, and further integrating statistical efficiency, numerical stability, and hyperparameter-free adaptation (Wojtowytsch, 2021, Mignacco et al., 2021, Li et al., 2023, Louzi, 8 Dec 2024, Li et al., 3 Sep 2025, Abdukhakimov et al., 2023, Mu et al., 2015, Deng et al., 2018, Nacson et al., 2018, Che et al., 2 Oct 2024, Tran et al., 2015, Wei et al., 2023, Nobile et al., 28 Jul 2025, Lei et al., 2019, Chen et al., 2018, Fioresi et al., 2019, Mandt et al., 2016, Patel et al., 2021).