Accelerated Stochastic Gradient (ASG) Scheme

Updated 28 June 2025

The Accelerated Stochastic Gradient (ASG) scheme refers to a family of algorithms that improve the iteration complexity and practical performance of stochastic gradient methods, particularly for minimizing finite sums of smooth convex functions. These methods are central to large-scale empirical risk minimization and modern machine learning, where the objective often takes the form

\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)

with each $f_i$ being convex and typically smooth (Nitanda, 2015).
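
To make this finite-sum structure concrete, the short sketch below writes an $L_2$-regularized logistic regression objective in exactly this averaged form. It is an illustrative example only: the arrays X and y, the regularization weight lam, and the helper names are placeholders, not part of the original formulation.

```python
import numpy as np

def f_i(x, i, X, y, lam):
    """Per-example loss f_i(x): logistic loss on example i plus an L2 term."""
    margin = y[i] * X[i].dot(x)           # labels y[i] in {-1, +1}
    return np.log1p(np.exp(-margin)) + 0.5 * lam * x.dot(x)

def f(x, X, y, lam):
    """Finite-sum objective f(x) = (1/n) * sum_i f_i(x)."""
    n = X.shape[0]
    return np.mean([f_i(x, i, X, y, lam) for i in range(n)])
```

Each $f_i$ here is smooth and convex, so the objective fits the setting assumed throughout this article.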

1. Algorithmic Foundations and Structure

The ASG scheme, specifically the AMSVRG (Accelerated efficient Mini-batch SVRG) algorithm, combines two key algorithmic principles:

  • Nesterov's Accelerated Gradient Descent (AGD): Delivers an improved iteration complexity for convex optimization by using momentum—an extrapolation step that combines previous iterates and gradients to accelerate convergence.
  • Stochastic Variance-Reduced Gradient (SVRG): Mitigates the noise inherent in stochastic gradients when working with finite-sum objectives by leveraging periodic computation of the full gradient (the "snapshot"), enabling variance reduction of mini-batch updates.
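
For reference, the first ingredient on its own looks like the minimal Nesterov AGD loop sketched below, using a deterministic gradient. The step size 1/L and the standard FISTA-style momentum schedule are common defaults assumed here; AMSVRG replaces the exact gradient call with the variance-reduced estimator described next.

```python
import numpy as np

def nesterov_agd(grad, x0, L, iters=1000):
    """Plain Nesterov accelerated gradient descent for a smooth convex f."""
    y_prev = x0.copy()
    x = x0.copy()
    t_prev = 1.0
    for _ in range(iters):
        y = x - grad(x) / L                                 # gradient step at the extrapolated point
        t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
        x = y + ((t_prev - 1.0) / t) * (y - y_prev)         # momentum extrapolation
        y_prev, t_prev = y, t
    return y_prev
```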

In AMSVRG, these are unified in a multi-stage mini-batch framework. Each stage executes a sequence of accelerated inner loop iterations, mixing “Nesterov-style” extrapolation, a stochastic gradient step, and a stochastic mirror descent or proximal step:

  1. Convex Combination:

x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k

  2. SGD Step:

y_{k+1} = \arg\min_y \left\{ \eta \langle v_{k+1}, y - x_{k+1} \rangle + \frac{1}{2}\|y - x_{k+1}\|^2 \right\}

  3. Mirror/Proximal Step:

z_{k+1} = \arg\min_z \left\{ \alpha_{k+1} \langle v_{k+1}, z - z_k \rangle + V_{z_k}(z) \right\}

where $v_{k+1}$ is a variance-reduced gradient, constructed so that its expectation matches the true gradient at $x_{k+1}$ while its variance remains controlled,

v_{k+1} = \nabla f_{I_{k+1}}(x_{k+1}) - \nabla f_{I_{k+1}}(y_0) + \tilde{v}

with $I_{k+1}$ a mini-batch, $y_0$ the snapshot point, and $\tilde{v}$ the full gradient at $y_0$.

The algorithm sets stepsize-related parameters as $\eta = 1/L$ and $\alpha_{k+1} = \frac{k+2}{4L}$, and carefully selects $\tau_k$ to ensure acceleration.
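
Putting the three updates and the variance-reduced gradient together, one stage of the inner loop can be sketched as follows. This is a simplified, Euclidean-case sketch (taking $V_{z}(z') = \frac{1}{2}\|z' - z\|^2$, so the mirror step reduces to a plain step on $z$); the per-example gradient oracle grad_i, the stage length m, the mini-batch size b, and the $\tau_k = 2/(k+2)$ schedule are illustrative assumptions rather than the exact choices analyzed in the paper.

```python
import numpy as np

def amsvrg_stage(grad_i, n, y0, L, m=100, b=10, seed=0):
    """One stage of an AMSVRG-style accelerated, variance-reduced inner loop.

    grad_i(x, i) returns the gradient of the i-th component f_i at x;
    y0 is the snapshot point for this stage.
    """
    rng = np.random.default_rng(seed)

    # Full gradient at the snapshot (the term written \tilde{v} above).
    v_tilde = np.mean([grad_i(y0, i) for i in range(n)], axis=0)

    eta = 1.0 / L
    y, z = y0.copy(), y0.copy()
    for k in range(m):
        alpha = (k + 2) / (4.0 * L)
        tau = 2.0 / (k + 2)                      # illustrative extrapolation weight

        # 1. Convex combination of the two sequences.
        x = tau * z + (1.0 - tau) * y

        # Variance-reduced mini-batch gradient at x.
        batch = rng.choice(n, size=b, replace=False)
        v = np.mean([grad_i(x, i) - grad_i(y0, i) for i in batch], axis=0) + v_tilde

        # 2. SGD step: closed form of the argmin with the squared distance.
        y = x - eta * v

        # 3. Mirror/proximal step with the Euclidean Bregman divergence.
        z = z - alpha * v

    return y  # candidate snapshot for the next stage
```

In the multi-stage framework, the returned point seeds the next stage's snapshot $y_0$, and stages are repeated until the target accuracy is reached.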

2. Applicability to Convexity Scenarios

Unlike classical SVRG methods, AMSVRG and related ASG schemes can be applied directly to both strongly convex and general convex (possibly non-strongly convex) problems by adjusting parameters appropriately.

  • General Convex Case: The method applies without requiring artificial regularization or strong convexity parameters. A bounded domain is assumed (with minor technical adjustments if not).
  • Strongly Convex Case: If $f$ is $\mu$-strongly convex, the algorithm uses the squared Euclidean distance as the Bregman divergence and may employ Nesterov-style restarts to further accelerate convergence (a generic restart heuristic is sketched at the end of this section). Here, no domain boundedness assumption is necessary.

This dual applicability is a key distinguishing property compared to prior algorithms which often rely on strong convexity or require problem modification for the general case.
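
For the strongly convex case mentioned above, one simple and widely used restart heuristic is to reset the momentum schedule whenever the objective stops decreasing. The sketch below is a generic function-value restart wrapper under that assumption, not the specific restart rule analyzed for AMSVRG; stage(x) is assumed to run one accelerated stage from snapshot x with a freshly initialized momentum schedule.

```python
def run_with_restarts(stage, f, x0, n_stages=20):
    """Wrap an accelerated stage routine with a function-value restart heuristic."""
    x = x_best = x0
    f_best = f(x0)
    for _ in range(n_stages):
        x = stage(x)          # momentum is implicitly reset at each call
        fx = f(x)
        if fx > f_best:
            x = x_best        # objective increased: restart from the best point so far
        else:
            x_best, f_best = x, fx
    return x_best
```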

3. Complexity Bounds and Comparative Rates

The ASG/AMSVRG scheme achieves or surpasses previous state-of-the-art complexities for both general convex and strongly convex regimes. Letting $\epsilon$ be the target accuracy, $L$ the smoothness constant, $\mu$ the strong convexity constant, and $n$ the data size:

| Convexity Type | Algorithm | Complexity |
|---|---|---|
| General Convex | SAG, SAGA | $\tilde{O}\left((n+L)/\epsilon\right)$ |
| General Convex | AGD | $\tilde{O}\left(n\sqrt{L/\epsilon}\right)$ |
| General Convex | AMSVRG (ASG) | $\tilde{O}\left(n + \min\{L/\epsilon,\ n\sqrt{L/\epsilon}\}\right)$ |
| Strongly Convex | SAG | $\tilde{O}\left(\max\{n,\ L/\mu\}\right)$ |
| Strongly Convex | SVRG | $\tilde{O}\left(n + L/\mu\right)$ |
| Strongly Convex | AGD | $\tilde{O}\left(n\sqrt{L/\mu}\right)$ |
| Strongly Convex | AMSVRG, Acc-Prox-SVRG (ASG) | $\tilde{O}\left(n + \min\{L/\mu,\ n\sqrt{L/\mu}\}\right)$ |

Here, $\tilde{O}(\cdot)$ hides problem-independent constants and logarithmic factors. These rates demonstrate that AMSVRG, and thus the general class of ASG schemes combining variance reduction with acceleration, adapts to problem characteristics, bridging the gap between acceleration-optimal and variance-reduction-optimal performance without added regularization.
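
To see how the min in the general convex bound adapts, the short computation below plugs in two illustrative regimes (the specific values of n, L, and eps are made up for illustration): a large-n, moderate-accuracy regime where the $L/\epsilon$ term is smaller, and a small-n, high-accuracy regime where $n\sqrt{L/\epsilon}$ wins.

```python
import math

def amsvrg_general_convex_bound(n, L, eps):
    """Evaluate the two candidates inside O~(n + min{L/eps, n*sqrt(L/eps)})."""
    term_a = L / eps                     # first candidate in the min
    term_b = n * math.sqrt(L / eps)      # second candidate in the min
    return n + min(term_a, term_b), term_a, term_b

# Large n, moderate accuracy: the min picks L/eps = 1e3.
print(amsvrg_general_convex_bound(n=1_000_000, L=1.0, eps=1e-3))

# Small n, high accuracy: the min picks n*sqrt(L/eps) = 1e6.
print(amsvrg_general_convex_bound(n=100, L=1.0, eps=1e-8))
```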

4. Empirical Results and Implementation Aspects

The AMSVRG algorithm's empirical evaluation on $L_2$-regularized multi-class/binary logistic regression tasks (using the MNIST, covtype, and rcv1 datasets) showed that:

  • AMSVRG matches or outperforms SAGA, SVRG, and other variance-reduced methods across various problem regimes.
  • Its advantages are most pronounced as regularization vanishes (i.e., the problem becomes less strongly convex), where competing methods often slow down or require artificial adjustments.
  • The algorithm’s parallelizable mini-batch nature makes it particularly effective on large-scale or distributed computational platforms; the mini-batch size can be chosen to exploit available hardware, such as multi-core CPUs or GPUs (see the vectorized sketch after this list).
  • Different restart heuristics (for the acceleration schedule) were tested; no single heuristic dominated, suggesting robust practical applicability.
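
As an illustration of why the mini-batch work parallelizes well, the vectorized sketch below computes the variance-reduced gradient for an $L_2$-regularized logistic regression objective using only batched matrix-vector products; the array names, label convention (labels in {-1, +1}), and batch handling are assumptions for this example, not the paper's exact experimental setup.

```python
import numpy as np

def vr_minibatch_grad(x, y0, v_tilde, X, labels, lam, batch):
    """Variance-reduced mini-batch gradient, fully vectorized.

    The only per-example work is X[batch] @ w for w in {x, y0}, which maps
    directly onto BLAS / multi-core / GPU kernels.
    """
    Xb, yb = X[batch], labels[batch]

    def batch_grad(w):
        # Gradient of the averaged logistic loss plus the L2 term over the batch.
        margins = yb * (Xb @ w)
        coeff = -yb / (1.0 + np.exp(margins))
        return Xb.T @ coeff / len(batch) + lam * w

    return batch_grad(x) - batch_grad(y0) + v_tilde
```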

5. Theoretical and Practical Implications

The ASG/AMSVRG scheme demonstrates that it is possible to:

  • Unify acceleration and variance reduction in a mini-batch stochastic method, with strong guarantees for both strongly and non-strongly convex cases, without the need for strong-convexity-inducing regularization.
  • Achieve complexity bounds that match or improve upon the best known rates (up to logarithmic factors) for the general convex and strongly convex cases, outperforming classic accelerated or purely variance-reduced alternatives depending on the problem's relative size and curvature.
  • Leverage mini-batch and parallel computation: The method is naturally suited to parallelization, meaning that in practice—especially on modern hardware—its gains can be amplified.
  • Suggest new directions, such as:
    • Developing more adaptive restart strategies and parameter schedules for scenarios with unknown or changing problem properties.
    • Extending beyond smooth problems, embracing composite objectives or non-smooth cases.
    • Formalizing rules for dynamic mini-batch sizing to maximize resource utilization and convergence.

6. Future Research Directions

Potential avenues include:

  • Parallel/Distributed Scaling: Investigating aggressive scaling on industrial hardware, especially in data center/cloud environments where communication and heterogeneity present additional challenges.
  • Adaptive and Heuristic Restart Schemes: Refining restart rules or formulating theoretically grounded criteria for restarting acceleration, further enhancing robustness and ease of deployment.
  • Composite and Regularized Objectives: Extending the core ASG scheme to the composite and constrained optimization settings, possibly by leveraging proximal operators or new types of variance-reduction techniques.
  • Automated Parameter Selection: Creating more sophisticated yet efficient heuristics, or even data-driven approaches, for tuning mini-batch size and acceleration parameters to automate deployment in production systems.

7. Summary Table: Complexity Comparison

| Scenario | Algorithm | Complexity |
|---|---|---|
| General Convex | AGD | $\tilde{O}(n\sqrt{L/\epsilon})$ |
| General Convex | SAG/SAGA | $\tilde{O}((n+L)/\epsilon)$ |
| General Convex | AMSVRG (ASG) | $\tilde{O}(n + \min\{L/\epsilon,\ n\sqrt{L/\epsilon}\})$ |
| Strongly Convex | AGD | $\tilde{O}(n\sqrt{L/\mu})$ |
| Strongly Convex | SVRG | $\tilde{O}(n + L/\mu)$ |
| Strongly Convex | Acc-Prox-SVRG/AMSVRG | $\tilde{O}(n + \min\{L/\mu,\ n\sqrt{L/\mu}\})$ |

This complexity summary highlights that the AMSVRG/ASG approach adapts to both small-scale and large-scale settings, outperforming standard methods in a wide range of practical contexts.


In conclusion, the Accelerated Stochastic Gradient (ASG) scheme introduces a rigorously justified, robust, and scalable approach to large-scale finite-sum convex optimization, blending momentum-based acceleration and variance reduction for broad practical benefit. Its flexibility, strong theoretical guarantees, and straightforward implementation position it as a core tool in the optimization landscape for statistical machine learning (Nitanda, 2015).