
Stochastic Average Gradient (SAG)

Updated 6 May 2026
  • Stochastic Average Gradient (SAG) is an incremental optimization algorithm that efficiently solves large-scale finite-sum problems by retaining past gradient evaluations.
  • It leverages a memory mechanism to reduce variance and combine the low iteration cost of stochastic methods with the fast convergence of full gradient approaches.
  • Widely applied in machine learning, SAG offers rigorous convergence guarantees for convex and strongly convex problems and has inspired extensions like SAGA and SVRG.

The Stochastic Average Gradient (SAG) method is an incremental optimization algorithm designed to solve large-scale finite-sum problems involving smooth convex and strongly convex objective functions. SAG combines the low per-iteration cost of stochastic gradient (SG) approaches with the rapid convergence properties of full gradient (FG) methods by maintaining a memory of past gradient evaluations for each function component. This memory-based mechanism and the associated variance reduction enable SAG to achieve significantly improved convergence rates relative to vanilla stochastic methods, making it a foundational algorithm in modern large-scale optimization, particularly in machine learning and statistical inference contexts (Schmidt et al., 2013, Hofmann et al., 2015, Notsawo, 2023, Defazio et al., 2014, Chen et al., 2017).

1. Problem Setting and Algorithmic Structure

SAG is designed for optimization problems represented as a finite sum of smooth convex (or strongly convex) functions:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$

where each $f_i : \mathbb{R}^p \rightarrow \mathbb{R}$ is convex, differentiable, and $L$-smooth (i.e., its gradient is $L$-Lipschitz continuous). SAG is particularly effective in the regime where $n$ is large and a full gradient computation is prohibitively expensive (Schmidt et al., 2013, Defazio et al., 2014).

The core innovation of SAG is its use of a table of stored gradients $y_i^k \approx \nabla f_i(\cdot)$, one for each component, which are incrementally updated as the algorithm progresses.

Pseudocode for the Basic SAG Algorithm

  • Initialization: Set $x^0$, $y_i^0 = 0$ for all $i = 1,\dots,n$, $d^0 = 0$.
  • Iteration ($k = 0, 1, 2, \dots$):

    1. Sample index $i_k \in \{1,\dots,n\}$ uniformly at random.
    2. Compute the new gradient $\nabla f_{i_k}(x^k)$.
    3. Update running sum: $d^{k+1} = d^k - y_{i_k}^k + \nabla f_{i_k}(x^k)$.
    4. Set $y_{i_k}^{k+1} = \nabla f_{i_k}(x^k)$; for $j \neq i_k$, $y_j^{k+1} = y_j^k$.
    5. Update $x^{k+1} = x^k - \frac{\alpha}{n} d^{k+1}$, where $\alpha > 0$ is a constant stepsize (Schmidt et al., 2013, Notsawo, 2023, Hofmann et al., 2015).

This update rule allows SAG to maintain an averaged, memory-based estimate of the full gradient while incurring only $O(1)$ gradient evaluations and $O(p)$ vector operations per iteration.
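The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `sag`, the toy least-squares components, and the choice of stepsize $1/(16L)$ are all picked here for the example.

```python
import random

def sag(grads, x0, step, iters, seed=0):
    """Basic SAG sketch. grads[i](x) returns the gradient of f_i at x as a list."""
    rng = random.Random(seed)
    n, p = len(grads), len(x0)
    x = list(x0)
    y = [[0.0] * p for _ in range(n)]   # gradient table, y_i^0 = 0
    d = [0.0] * p                       # running sum of the table, d^0 = 0
    for _ in range(iters):
        i = rng.randrange(n)            # sample an index uniformly at random
        g = grads[i](x)                 # fresh gradient of f_i at the current iterate
        for j in range(p):
            d[j] += g[j] - y[i][j]      # swap the stale entry out of the running sum
        y[i] = g                        # overwrite the table entry
        for j in range(p):
            x[j] -= step / n * d[j]     # step along the averaged stored gradients
    return x

# Toy problem (assumed for illustration): f_i(x) = 0.5 * (a_i * x - b_i)^2 in one dimension.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]                # consistent data, so the minimizer is x* = 2
grads = [lambda x, ai=ai, bi=bi: [ai * (ai * x[0] - bi)] for ai, bi in zip(a, b)]
L = max(ai * ai for ai in a)            # each f_i is a_i^2-smooth
x_hat = sag(grads, [0.0], step=1.0 / (16 * L), iters=5000)
print(x_hat)
```

Each iteration touches one component gradient and one row of the table, so the work per step does not grow with $n$; only the table `y` does.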

2. Memory Mechanism and Variance Reduction

The principal distinction between SAG and classical SG methods lies in its memory array of historical gradients. Each $y_i$ stores the most recent evaluation of $\nabla f_i$ at some previous iterate, so that over time the stored values converge toward the gradients at the global optimum. The step at each iteration can be interpreted as an incremental aggregated gradient step, yielding a gradient estimate whose variance diminishes as more components are revisited and their stored gradients approach the optimum values (Schmidt et al., 2013, Hofmann et al., 2015, Chen et al., 2017).

For comparison:

  • SGD computes the search direction using only the current sample gradient, discarding all past information. The resulting estimator has persistent variance, which necessitates a decaying stepsize and yields a sublinear convergence rate.

  • SAG utilizes the running average of stored gradients as a control variate, achieving variance reduction and rapid convergence without requiring shrinking stepsizes (Hofmann et al., 2015, Notsawo, 2023).
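The contrast between the two estimators is easiest to see at the minimizer itself. The sketch below uses an assumed toy problem, $f_i(x) = \frac{1}{2}(x - b_i)^2$, and compares the spread of the SGD direction with the spread of the SAG direction once the table holds the gradients at $x^*$; the variable names are chosen here for illustration.

```python
import statistics

# Assumed toy problem: f_i(x) = 0.5 * (x - b_i)^2, so grad f_i(x) = x - b_i and x* = mean(b).
b = [1.0, 2.0, 6.0]
n = len(b)
x_star = sum(b) / n

def grad_i(i, x):
    return x - b[i]

# SGD direction at x*: a single component gradient, which depends on the sampled index.
sgd_dirs = [grad_i(i, x_star) for i in range(n)]

# SAG direction at x* with a table that already holds the gradients at x*:
# refreshing entry i leaves the table unchanged, so the average is the same for every i.
table = [grad_i(i, x_star) for i in range(n)]
sag_dirs = []
for i in range(n):
    refreshed = list(table)
    refreshed[i] = grad_i(i, x_star)
    sag_dirs.append(sum(refreshed) / n)

sgd_var = statistics.pvariance(sgd_dirs)   # variance over the sampled index
sag_var = statistics.pvariance(sag_dirs)
print(sgd_var, sag_var)
```

The SGD direction keeps a nonzero variance even at the solution, which is why its stepsize must shrink, while the SAG direction is identically zero there and a constant stepsize leaves the iterate in place.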

3. Convergence Guarantees and Complexity

SAG provides sharp convergence rates for both convex and strongly convex objectives.

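  • Convex Case (constant stepsize $\alpha = 1/(16L)$):

$$\mathbb{E}[f(\bar{x}^k)] - f(x^*) \le \frac{32n}{k} C_0$$

where $\bar{x}^k$ is the average (or "best") iterate, $x^*$ is a minimizer, and $C_0$ is a constant depending on the initialization. This rate is $O(1/k)$ in the iteration count (equivalently $O(1/t)$ in the number $t = k/n$ of effective data passes), improving upon the typical $O(1/\sqrt{k})$ of SGD (Schmidt et al., 2013, Notsawo, 2023, Hofmann et al., 2015, Morin et al., 2019).

  • Strongly Convex Case ($f$ is $\mu$-strongly convex, same stepsize):

$$\mathbb{E}[f(x^k)] - f(x^*) \le \left(1 - \min\left\{\frac{\mu}{16L}, \frac{1}{8n}\right\}\right)^k C_0$$

yielding a linear (geometric) convergence rate (Schmidt et al., 2013, Zhu et al., 5 Feb 2026, Defazio et al., 2014).

The per-iteration cost is comparable to SGD ($O(p)$ for $p$-dimensional parameters), with memory requirements of $O(np)$ to store the gradient table, and no need for decaying stepsizes (Notsawo, 2023, Zhu et al., 5 Feb 2026, Hofmann et al., 2015).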
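To get a feel for what a linear rate means in practice, the sketch below assumes the strongly convex bound of Schmidt et al. (2013) for stepsize $1/(16L)$, whose per-iteration contraction factor is $1 - \min\{\mu/(16L), 1/(8n)\}$, and counts the iterations needed to shrink the initial constant by a target factor. The helper names and the example values of $\mu$, $L$, and $n$ are hypothetical.

```python
import math

def sag_contraction(mu, L, n):
    """Per-iteration factor 1 - min(mu / (16 L), 1 / (8 n)) from the assumed bound."""
    return 1.0 - min(mu / (16.0 * L), 1.0 / (8.0 * n))

def iterations_to_accuracy(mu, L, n, eps, c0=1.0):
    """Smallest k with rho^k * c0 <= eps, where rho is the contraction factor."""
    shrink = min(mu / (16.0 * L), 1.0 / (8.0 * n))
    # log1p keeps precision when the shrink term is tiny
    return math.ceil(math.log(c0 / eps) / -math.log1p(-shrink))

n = 10_000
k_well = iterations_to_accuracy(mu=1.0, L=10.0, n=n, eps=1e-6)    # the 1/(8n) term binds
k_ill = iterations_to_accuracy(mu=1e-6, L=10.0, n=n, eps=1e-6)    # the mu/(16L) term binds
print(k_well / n, k_ill / n)    # effective passes over the data
```

When the problem is well conditioned relative to $n$, the bound is limited by the $1/(8n)$ term and the pass count is independent of the condition number; for badly conditioned problems the $\mu/(16L)$ term takes over.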

Comparison Table: Work Complexity and Convergence Rates

| Method | Per-iteration cost | Storage | Convergence rate |
|---|---|---|---|
| SGD | $O(p)$ | $O(p)$ | $O(1/\sqrt{k})$; $O(1/k)$ (strongly convex) |
| FG | $O(np)$ | $O(p)$ | $O(1/k)$; geometric (strongly convex) |
| SAG | $O(p)$ | $O(np)$ | $O(1/k)$; geometric (strongly convex) |
| SAGA | $O(p)$ | $O(np)$ | Improved linear; unbiased |
| SVRG | $O(p)$* | $O(p)$ | Improved linear; requires periodic FG pass |

*: Average per-iteration cost; periodic full-gradient passes incur $O(np)$ cost (Hofmann et al., 2015, Defazio et al., 2014).

4. Extensions, Variants, and Enhancements

Multiple enhancements and extensions of SAG have been proposed, addressing limitations and expanding applicability:

  • Non-uniform Sampling: Sampling indices with probabilities proportional to local smoothness parameters ($L_i$) accelerates convergence by aligning sampling with the worst-case smoothness (Schmidt et al., 2013, Schmidt et al., 2015).
  • Mini-batching: Grouping examples for parallelism and reducing storage, with convergence rates preserved under suitable stepsize adjustments (Schmidt et al., 2013).
  • SAG with Momentum / SAG+Adam: Incorporating momentum or adaptive coordinate-wise scaling improves empirical performance on ill-conditioned or nonconvex landscapes; both hybridizations preserve low variance while enhancing optimization dynamics (Notsawo, 2023).
  • Proximal Acceleration: The SAG update extends to composite minimization by including a proximal map for nonsmooth regularizers; convergence rates in this composite regime are established (Schmidt et al., 2015, Driggs et al., 2019).
  • Compositional SAG (C-SAG): For compositional finite-sum objectives, C-SAG maintains memory at both inner and outer function layers, preserving the oracle efficiency of SAG in more complex optimization topologies (Hsieh et al., 2018).
  • Stratified and Structured Variants: SSAG pools gradients across stratified cohorts/classes, reducing the dimension of the memory and accelerating convergence when the number of strata is much smaller than $n$ (Chen et al., 2017).

  • Sufficient-Decrease Variants: Recent work incorporates sufficient-decrease line search into SAG updates (SAG-SD), guaranteeing descent and adapting steps on the fly, with the same linear rates (Shang et al., 2017).
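As an illustration of the non-uniform sampling variant listed above, the sketch below draws indices with probability $L_i / \sum_j L_j$. It shows only the sampling distribution; the schemes analysed in the cited papers involve further details (such as how the stepsize is adjusted) that are not reproduced here. The function name and the smoothness constants are hypothetical.

```python
import random

def lipschitz_sampler(smoothness, seed=0):
    """Return a function drawing index i with probability L_i / sum_j L_j."""
    rng = random.Random(seed)
    indices = range(len(smoothness))
    def draw():
        return rng.choices(indices, weights=smoothness, k=1)[0]
    return draw

# Hypothetical smoothness constants: component 2 is eight times "rougher" than the others.
L_i = [1.0, 1.0, 8.0]
draw = lipschitz_sampler(L_i)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[draw()] += 1
print(counts)   # component 2 should be visited roughly 80% of the time
```

In a SAG loop, `draw()` would replace the uniform index selection, so that components with large $L_i$ have their table entries refreshed more often.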

5. Bias, Unbiasedness, and Theoretical Developments

SAG's gradient estimator is biased in general, due to the averaging over a mixture of current and stale memory entries. Despite this bias, SAG attains provable linear rates; rigorous analysis shows bias vanishes as the stored gradients converge to the true gradients at the optimum (Driggs et al., 2019, Morin et al., 2019).

Unbiased alternatives, such as SAGA, leverage a modified update that ensures the expectation coincides exactly with the true gradient at each iteration, resulting in marginally improved theoretical constants and facilitating easier extension to composite and non-strongly convex settings (Defazio et al., 2014, Driggs et al., 2019).
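The difference between the biased and unbiased estimators can be checked exactly on a small example by enumerating every index that could be sampled. The sketch below assumes the toy components $f_i(x) = \frac{1}{2}(x - b_i)^2$ and a table of stale gradients evaluated at a different point; it uses the standard SAG direction $(\nabla f_j(x) - y_j)/n + \frac{1}{n}\sum_i y_i$ and the SAGA direction $\nabla f_j(x) - y_j + \frac{1}{n}\sum_i y_i$.

```python
# Assumed toy problem: f_i(x) = 0.5 * (x - b_i)^2 in one dimension.
b = [1.0, 2.0, 6.0]
n = len(b)

def grad_i(i, x):
    return x - b[i]

x = 5.0
table = [grad_i(i, 0.0) for i in range(n)]        # stale gradients, evaluated at x = 0
table_mean = sum(table) / n
full_grad = sum(grad_i(i, x) for i in range(n)) / n

# Direction produced when index j is sampled, for each estimator.
sag_dirs = [(grad_i(j, x) - table[j]) / n + table_mean for j in range(n)]
saga_dirs = [(grad_i(j, x) - table[j]) + table_mean for j in range(n)]

# Exact expectation under uniform sampling of j.
sag_mean = sum(sag_dirs) / n
saga_mean = sum(saga_dirs) / n
print(full_grad, sag_mean, saga_mean)
```

With a stale table, the SAGA expectation matches the full gradient while the SAG expectation is pulled toward the table average; the gap closes as the table entries approach the gradients at the current iterate.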

High-probability convergence results and unified proof frameworks have recently bridged the gap between biased (SAG) and unbiased (SAGA) estimators, supplying modular Lyapunov-based analyses and extending guarantees to regimes involving Markov sampling and non-convexity (Zhu et al., 5 Feb 2026).

6. Applications and Empirical Observations

SAG and its variants have been widely applied in machine learning for large-scale empirical risk minimization, conditional random field training, neural network optimization, and structured prediction problems. Empirical studies show:

  • Superior convergence speed relative to classic SGD, requiring 2–5× fewer data passes to reach high-accuracy regimes after a short warm-up (Notsawo, 2023, Hofmann et al., 2015).
  • Robustness on ill-conditioned and nonconvex problems, especially with hybrid momentum or adaptive schemes (Notsawo, 2023).
  • Storage cost $O(np)$ can be limiting for deep learning with massive sample sizes, motivating structured memory reductions and stratified designs (1202.13212, Schmidt et al., 2015, Chen et al., 2017).
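A quick back-of-the-envelope calculation shows why the gradient table becomes the bottleneck. The sizes below are hypothetical and the helper assumes a dense table of 8-byte floats.

```python
def sag_table_bytes(n, p, bytes_per_float=8):
    """Bytes needed for a dense n-by-p gradient table."""
    return n * p * bytes_per_float

# Hypothetical sizes: one million examples, ten thousand parameters, float64 entries.
dense = sag_table_bytes(1_000_000, 10_000)
iterate = sag_table_bytes(1, 10_000)      # a single parameter vector, for comparison
print(dense / 1e9, "GB")                  # 80.0 GB
print(iterate / 1e3, "KB")                # 80.0 KB
```

The table is $n$ times larger than the iterate itself, which is what makes reduced-memory designs attractive at this scale.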

7. Limitations, Open Directions, and Comparative Analysis

The primary limitation of SAG is its $O(np)$ memory footprint, a constraint for massive datasets or high-dimensional settings. The bias in its estimator, while vanishing asymptotically, may induce oscillations or slower transient convergence in early epochs. The step-size restriction, originally conservative, has been relaxed in recent analyses to a stepsize of optimal order, aligning SAG with SAGA and improving practical stability (Morin et al., 2019, Zhu et al., 5 Feb 2026).

Comparative studies show that SAGA generally achieves better constants and applies natively to composite problems, while SVRG (without memory) remains preferable in contexts with tight memory budgets and can be tailored for non-Euclidean geometries or infrequent storage update scenarios (Defazio et al., 2014, Hofmann et al., 2015, Driggs et al., 2019).

Advances in sufficient-decrease techniques, stratified or compositional adaptations, and plug-in variance reduction mechanisms (e.g., SARAH, SARGE) continue to refine the theoretical and empirical landscape. Adaptive sampling, mini-batching, and structured memory layouts are active research areas aimed at extending the reach of SAG-type algorithms to ever larger and more structured machine learning regimes (Schmidt et al., 2015, Hsieh et al., 2018, 1202.13212, Chen et al., 2017).


References: (Schmidt et al., 2013, Hofmann et al., 2015, Notsawo, 2023, Defazio et al., 2014, Schmidt et al., 2015, Chen et al., 2017, Driggs et al., 2019, Hsieh et al., 2018, Dresdner et al., 2022, Morin et al., 2019, Zhu et al., 5 Feb 2026)
