Accelerated Stochastic Gradient (ASG) Scheme

Updated 28 June 2025

The Accelerated Stochastic Gradient (ASG) scheme refers to a family of algorithms that improve the iteration complexity and practical performance of stochastic gradient methods, particularly for minimizing finite sums of smooth convex functions. These methods are central to large-scale empirical risk minimization and modern machine learning, where the objective often takes the form

\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)

with each $f_i$ being convex and typically smooth (Nitanda, 2015).
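
To make this finite-sum structure concrete, the short sketch below writes an $L_2$-regularized logistic regression objective in exactly this averaged form. It is an illustrative example only: the arrays X and y, the regularization weight lam, and the helper names are placeholders, not part of the original formulation.

```python
import numpy as np

def f_i(x, i, X, y, lam):
    """Per-example loss f_i(x): logistic loss on example i plus an L2 term."""
    margin = y[i] * X[i].dot(x)           # labels y[i] in {-1, +1}
    return np.log1p(np.exp(-margin)) + 0.5 * lam * x.dot(x)

def f(x, X, y, lam):
    """Finite-sum objective f(x) = (1/n) * sum_i f_i(x)."""
    n = X.shape[0]
    return np.mean([f_i(x, i, X, y, lam) for i in range(n)])
```

Each $f_i$ here is smooth and convex, so the objective fits the setting assumed throughout this article.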

1. Algorithmic Foundations and Structure

The ASG scheme, specifically the AMSVRG (Accelerated efficient Mini-batch SVRG) algorithm, combines two key algorithmic principles:

  • Nesterov's Accelerated Gradient Descent (AGD): Delivers an improved iteration complexity for convex optimization by using momentum—an extrapolation step that combines previous iterates and gradients to accelerate convergence.
  • Stochastic Variance-Reduced Gradient (SVRG): Mitigates the noise inherent in stochastic gradients when working with finite-sum objectives by leveraging periodic computation of the full gradient (the "snapshot"), enabling variance reduction of mini-batch updates.
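
For reference, the first ingredient on its own looks like the minimal Nesterov AGD loop sketched below, using a deterministic gradient. The step size 1/L and the standard FISTA-style momentum schedule are common defaults assumed here; AMSVRG replaces the exact gradient call with the variance-reduced estimator described next.

```python
import numpy as np

def nesterov_agd(grad, x0, L, iters=1000):
    """Plain Nesterov accelerated gradient descent for a smooth convex f."""
    y_prev = x0.copy()
    x = x0.copy()
    t_prev = 1.0
    for _ in range(iters):
        y = x - grad(x) / L                                 # gradient step at the extrapolated point
        t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
        x = y + ((t_prev - 1.0) / t) * (y - y_prev)         # momentum extrapolation
        y_prev, t_prev = y, t
    return y_prev
```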

In AMSVRG, these are unified in a multi-stage mini-batch framework. Each stage executes a sequence of accelerated inner loop iterations, mixing “Nesterov-style” extrapolation, a stochastic gradient step, and a stochastic mirror descent or proximal step:

  1. Convex Combination:

x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k

  2. SGD Step:

y_{k+1} = \arg\min_y \left\{ \eta \langle v_{k+1}, y - x_{k+1} \rangle + \frac{1}{2}\|y - x_{k+1}\|^2 \right\}

  3. Mirror/Proximal Step:

z_{k+1} = \arg\min_z \left\{ \alpha_{k+1} \langle v_{k+1}, z - z_k \rangle + V_{z_k}(z) \right\}

where $v_{k+1}$ is a variance-reduced gradient, constructed so that its expectation matches the true gradient at $x_{k+1}$ while its variance remains controlled,

v_{k+1} = \nabla f_{I_{k+1}}(x_{k+1}) - \nabla f_{I_{k+1}}(y_0) + \tilde{v}

with $I_{k+1}$ a mini-batch, $y_0$ the snapshot point, and $\tilde{v}$ the full gradient at $y_0$.

The algorithm sets stepsize-related parameters as $\eta = 1/L$ and $\alpha_{k+1} = \frac{k+2}{4L}$, and carefully selects $\tau_k$ to ensure acceleration.
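
Putting the three updates and the variance-reduced gradient together, one stage of the inner loop can be sketched as follows. This is a simplified, Euclidean-case sketch (taking $V_{z}(z') = \frac{1}{2}\|z' - z\|^2$, so the mirror step reduces to a plain step on $z$); the per-example gradient oracle grad_i, the stage length m, the mini-batch size b, and the $\tau_k = 2/(k+2)$ schedule are illustrative assumptions rather than the exact choices analyzed in the paper.

```python
import numpy as np

def amsvrg_stage(grad_i, n, y0, L, m=100, b=10, seed=0):
    """One stage of an AMSVRG-style accelerated, variance-reduced inner loop.

    grad_i(x, i) returns the gradient of the i-th component f_i at x;
    y0 is the snapshot point for this stage.
    """
    rng = np.random.default_rng(seed)

    # Full gradient at the snapshot (the term written \tilde{v} above).
    v_tilde = np.mean([grad_i(y0, i) for i in range(n)], axis=0)

    eta = 1.0 / L
    y, z = y0.copy(), y0.copy()
    for k in range(m):
        alpha = (k + 2) / (4.0 * L)
        tau = 2.0 / (k + 2)                      # illustrative extrapolation weight

        # 1. Convex combination of the two sequences.
        x = tau * z + (1.0 - tau) * y

        # Variance-reduced mini-batch gradient at x.
        batch = rng.choice(n, size=b, replace=False)
        v = np.mean([grad_i(x, i) - grad_i(y0, i) for i in batch], axis=0) + v_tilde

        # 2. SGD step: closed form of the argmin with the squared distance.
        y = x - eta * v

        # 3. Mirror/proximal step with the Euclidean Bregman divergence.
        z = z - alpha * v

    return y  # candidate snapshot for the next stage
```

In the multi-stage framework, the returned point seeds the next stage's snapshot $y_0$, and stages are repeated until the target accuracy is reached.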

2. Applicability to Convexity Scenarios

Unlike classical SVRG methods, AMSVRG and related ASG schemes can be applied directly to both strongly convex and general convex (possibly non-strongly convex) problems by adjusting parameters appropriately.

  • General Convex Case: The method applies without requiring artificial regularization or strong convexity parameters. A bounded domain is assumed (with minor technical adjustments if not).
  • Strongly Convex Case: If $f$ is $\mu$-strongly convex, the algorithm uses the squared Euclidean distance as the Bregman divergence and may employ Nesterov-style restarts to further accelerate convergence (a generic restart heuristic is sketched at the end of this section). Here, no domain boundedness assumption is necessary.

This dual applicability is a key distinguishing property compared to prior algorithms which often rely on strong convexity or require problem modification for the general case.
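
For the strongly convex case mentioned above, one simple and widely used restart heuristic is to reset the momentum schedule whenever the objective stops decreasing. The sketch below is a generic function-value restart wrapper under that assumption, not the specific restart rule analyzed for AMSVRG; stage(x) is assumed to run one accelerated stage from snapshot x with a freshly initialized momentum schedule.

```python
def run_with_restarts(stage, f, x0, n_stages=20):
    """Wrap an accelerated stage routine with a function-value restart heuristic."""
    x = x_best = x0
    f_best = f(x0)
    for _ in range(n_stages):
        x = stage(x)          # momentum is implicitly reset at each call
        fx = f(x)
        if fx > f_best:
            x = x_best        # objective increased: restart from the best point so far
        else:
            x_best, f_best = x, fx
    return x_best
```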

3. Complexity Bounds and Comparative Rates

The ASG/AMSVRG scheme achieves or surpasses previous state-of-the-art complexities for both general convex and strongly convex regimes. Letting $\epsilon$ be the target accuracy, $L$ the smoothness constant, $\mu$ the strong convexity constant, and $n$ the data size:

| Convexity Type | Algorithm | Complexity |
|---|---|---|
| General Convex | SAG, SAGA | $\tilde{O}\left((n+L)/\epsilon\right)$ |
| General Convex | AGD | $\tilde{O}\left(n\sqrt{L/\epsilon}\right)$ |
| General Convex | AMSVRG (ASG) | $\tilde{O}\left(n + \min\{L/\epsilon,\ n\sqrt{L/\epsilon}\}\right)$ |
| Strongly Convex | SAG | $\tilde{O}\left(\max\{n,\ L/\mu\}\right)$ |
| Strongly Convex | SVRG | $\tilde{O}\left(n + L/\mu\right)$ |
| Strongly Convex | AGD | $\tilde{O}\left(n\sqrt{L/\mu}\right)$ |
| Strongly Convex | AMSVRG, Acc-Prox-SVRG (ASG) | $\tilde{O}\left(n + \min\{L/\mu,\ n\sqrt{L/\mu}\}\right)$ |

Here, $\tilde{O}(\cdot)$ hides problem-independent constants and logarithmic factors. These rates demonstrate that AMSVRG, and thus the general class of ASG schemes combining variance reduction with acceleration, adapts to problem characteristics, bridging the gap between acceleration-optimal and variance-reduction-optimal performance without added regularization.
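
To see how the min in the general convex bound adapts, the short computation below plugs in two illustrative regimes (the specific values of n, L, and eps are made up for illustration): a large-n, moderate-accuracy regime where the $L/\epsilon$ term is smaller, and a small-n, high-accuracy regime where $n\sqrt{L/\epsilon}$ wins.

```python
import math

def amsvrg_general_convex_bound(n, L, eps):
    """Evaluate the two candidates inside O~(n + min{L/eps, n*sqrt(L/eps)})."""
    term_a = L / eps                     # first candidate in the min
    term_b = n * math.sqrt(L / eps)      # second candidate in the min
    return n + min(term_a, term_b), term_a, term_b

# Large n, moderate accuracy: the min picks L/eps = 1e3.
print(amsvrg_general_convex_bound(n=1_000_000, L=1.0, eps=1e-3))

# Small n, high accuracy: the min picks n*sqrt(L/eps) = 1e6.
print(amsvrg_general_convex_bound(n=100, L=1.0, eps=1e-8))
```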

4. Empirical Results and Implementation Aspects

The AMSVRG algorithm's empirical evaluation on $L_2$-regularized multi-class/binary logistic regression tasks (using the MNIST, covtype, and rcv1 datasets) showed that:

  • AMSVRG matches or outperforms SAGA, SVRG, and other variance-reduced methods across various problem regimes.
  • Its advantages are most pronounced as regularization vanishes (i.e., the problem becomes less strongly convex), where competing methods often slow down or require artificial adjustments.
  • The algorithm’s parallelizable mini-batch nature makes it particularly effective on large-scale or distributed computational platforms; the mini-batch size can be chosen to exploit available hardware, such as multi-core CPUs or GPUs (see the vectorized sketch after this list).
  • Different restart heuristics (for the acceleration schedule) were tested; no single heuristic dominated, suggesting robust practical applicability.
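
As an illustration of why the mini-batch work parallelizes well, the vectorized sketch below computes the variance-reduced gradient for an $L_2$-regularized logistic regression objective using only batched matrix-vector products; the array names, label convention (labels in {-1, +1}), and batch handling are assumptions for this example, not the paper's exact experimental setup.

```python
import numpy as np

def vr_minibatch_grad(x, y0, v_tilde, X, labels, lam, batch):
    """Variance-reduced mini-batch gradient, fully vectorized.

    The only per-example work is X[batch] @ w for w in {x, y0}, which maps
    directly onto BLAS / multi-core / GPU kernels.
    """
    Xb, yb = X[batch], labels[batch]

    def batch_grad(w):
        # Gradient of the averaged logistic loss plus the L2 term over the batch.
        margins = yb * (Xb @ w)
        coeff = -yb / (1.0 + np.exp(margins))
        return Xb.T @ coeff / len(batch) + lam * w

    return batch_grad(x) - batch_grad(y0) + v_tilde
```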

5. Theoretical and Practical Implications

The ASG/AMSVRG scheme demonstrates that it is possible to:

  • Unify acceleration and variance reduction in a mini-batch stochastic method, with strong guarantees for both strongly and non-strongly convex cases, without the need for strong-convexity-inducing regularization.
  • Achieve complexity bounds that match or improve upon the best known rates (up to logarithmic factors) for the general convex and strongly convex cases, outperforming classic accelerated or purely variance-reduced alternatives depending on the problem's relative size and curvature.
  • Leverage mini-batch and parallel computation: The method is naturally suited to parallelization, meaning that in practice—especially on modern hardware—its gains can be amplified.
  • Suggest new directions, such as:
    • Developing more adaptive restart strategies and parameter schedules for scenarios with unknown or changing problem properties.
    • Extending beyond smooth problems, embracing composite objectives or non-smooth cases.
    • Formalizing rules for dynamic mini-batch sizing to maximize resource utilization and convergence.

6. Future Research Directions

Potential avenues include:

  • Parallel/Distributed Scaling: Investigating aggressive scaling on industrial hardware, especially in data center/cloud environments where communication and heterogeneity present additional challenges.
  • Adaptive and Heuristic Restart Schemes: Refining restart rules or formulating theoretically grounded criteria for restarting acceleration, further enhancing robustness and ease of deployment.
  • Composite and Regularized Objectives: Extending the core ASG scheme to the composite and constrained optimization settings, possibly by leveraging proximal operators or new types of variance-reduction techniques.
  • Automated Parameter Selection: Creating more sophisticated yet efficient heuristics, or even data-driven approaches, for tuning mini-batch size and acceleration parameters to automate deployment in production systems.

7. Summary Table: Complexity Comparison

| Scenario | Algorithm | Complexity |
|---|---|---|
| General Convex | AGD | $\tilde{O}(n\sqrt{L/\epsilon})$ |
| General Convex | SAG/SAGA | $\tilde{O}((n+L)/\epsilon)$ |
| General Convex | AMSVRG (ASG) | $\tilde{O}(n + \min\{L/\epsilon,\ n\sqrt{L/\epsilon}\})$ |
| Strongly Convex | AGD | $\tilde{O}(n\sqrt{L/\mu})$ |
| Strongly Convex | SVRG | $\tilde{O}(n + L/\mu)$ |
| Strongly Convex | Acc-Prox-SVRG/AMSVRG | $\tilde{O}(n + \min\{L/\mu,\ n\sqrt{L/\mu}\})$ |

This complexity summary highlights that the AMSVRG/ASG approach adapts to both small-scale and large-scale settings, outperforming standard methods in a wide range of practical contexts.


In conclusion, the Accelerated Stochastic Gradient (ASG) scheme introduces a rigorously justified, robust, and scalable approach to large-scale finite-sum convex optimization, blending momentum-based acceleration and variance reduction for broad practical benefit. Its flexibility, strong theoretical guarantees, and straightforward implementation position it as a core tool in the optimization landscape for statistical machine learning (Nitanda, 2015).