Stochastic Average Gradient (SAG)

Updated 6 May 2026

Stochastic Average Gradient (SAG) is an incremental optimization algorithm that efficiently solves large-scale finite-sum problems by retaining past gradient evaluations.
It leverages a memory mechanism to reduce variance and combine the low iteration cost of stochastic methods with the fast convergence of full gradient approaches.
Widely applied in machine learning, SAG offers rigorous convergence guarantees for convex and strongly convex problems and has inspired extensions like SAGA and SVRG.

The Stochastic Average Gradient (SAG) method is an incremental optimization algorithm designed to solve large-scale finite-sum problems involving smooth convex and strongly convex objective functions. SAG combines the low per-iteration cost of stochastic gradient (SG) approaches with the rapid convergence properties of full gradient (FG) methods by maintaining a memory of past gradient evaluations for each function component. This memory-based mechanism and the associated variance reduction enable SAG to achieve significantly improved convergence rates relative to vanilla stochastic methods, making it a foundational algorithm in modern large-scale optimization, particularly in machine learning and statistical inference contexts (Schmidt et al., 2013, Hofmann et al., 2015, Notsawo, 2023, Defazio et al., 2014, Chen et al., 2017).

1. Problem Setting and Algorithmic Structure

SAG is designed for optimization problems represented as a finite sum of smooth convex (or strongly convex) functions:

$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$

where each $f_i : \mathbb{R}^p \rightarrow \mathbb{R}$ is convex, differentiable, and $L$-smooth (i.e., its gradient is $L$-Lipschitz continuous). SAG is particularly effective in the regime where $n$ is large and a full gradient computation is prohibitively expensive (Schmidt et al., 2013, Defazio et al., 2014).

The core innovation of SAG is its use of a table of stored gradients $y_i^k \approx \nabla f_i(\cdot)$ for each component, which are incrementally updated as the algorithm progresses.

Pseudocode for the Basic SAG Algorithm

Initialization: Set $x^0$, $y_i^0 = 0$ for all $i = 1,\dots, n$, $d^0 = 0$.
Iteration (for $k = 0, 1, 2, \dots$):
1. Sample index $i_k \in \{1, \dots, n\}$ uniformly at random.
2. Compute the new gradient $g^k = \nabla f_{i_k}(x^k)$.
3. Update running sum: $d^{k+1} = d^k - y_{i_k}^k + g^k$.
4. Set $y_{i_k}^{k+1} = g^k$; for $j \neq i_k$, $y_j^{k+1} = y_j^k$.
5. Update $x^{k+1} = x^k - \frac{\alpha}{n} d^{k+1}$, where $\alpha > 0$ is the step size (Schmidt et al., 2013, Notsawo, 2023, Hofmann et al., 2015).

This update rule allows SAG to maintain an averaged, memory-based estimate of the full gradient while incurring only $O(1)$ gradient evaluations and $O(p)$ vector operations per iteration.
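The five steps can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the names `sag` and `grad_i`, the synthetic least-squares problem, and the conservative step size $1/(16L)$ are all assumptions chosen for the example.

```python
import numpy as np

def sag(grad_i, x0, n, step, n_iters, seed=0):
    """Basic SAG loop; grad_i(i, x) returns the gradient of f_i at x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    y = np.zeros((n, x.size))   # gradient table, y_i^0 = 0
    d = np.zeros(x.size)        # running sum of the table rows, d^0 = 0
    for _ in range(n_iters):
        i = rng.integers(n)     # 1. sample an index uniformly
        g = grad_i(i, x)        # 2. fresh gradient of the sampled component
        d += g - y[i]           # 3. swap the stale entry out of the running sum
        y[i] = g                # 4. overwrite the table row, other rows unchanged
        x -= (step / n) * d     # 5. step along the averaged direction
    return x

# Hypothetical test problem: f_i(x) = 0.5 * (a_i @ x - b_i)**2 with noiseless targets.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
L = np.max(np.sum(A ** 2, axis=1))   # largest per-component smoothness constant
x_hat = sag(lambda i, x: (A[i] @ x - b[i]) * A[i], np.zeros(5),
            n=100, step=1.0 / (16 * L), n_iters=40_000)
print("distance to solution:", np.linalg.norm(x_hat - x_true))
```

Keeping the running sum `d` alongside the table is what holds each iteration to $O(p)$ arithmetic: only one row changes, so the sum is patched rather than recomputed.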

2. Memory Mechanism and Variance Reduction

The principal distinction between SAG and classical SG methods lies in its memory array of historical gradients. Each $y_i^k$ stores the most recent evaluation of $\nabla f_i$ at some previous iterate, ensuring that over time, the stored values converge toward the gradients at the global optimum. The step at each iteration can be interpreted as an incremental aggregated gradient step, yielding a gradient estimate whose variance diminishes as more components are revisited and their stored gradients approach the optimum values (Schmidt et al., 2013, Hofmann et al., 2015, Chen et al., 2017).

For comparison:

SGD computes the search direction using only the current sample gradient, discarding all past information. The resulting estimator has persistent variance, which necessitates a decaying stepsize and yields a sublinear convergence rate.
SAG utilizes the running average of stored gradients as a control variate, achieving variance reduction and rapid convergence without requiring shrinking stepsizes (Hofmann et al., 2015, Notsawo, 2023).
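The contrast can be made concrete with a small numerical experiment. The sketch below is an idealised illustration under stated assumptions: a synthetic noisy least-squares problem, an iterate placed close to the minimizer, and a gradient table that already holds the component gradients at the minimizer. It enumerates every possible sample and compares the mean squared error of the two search directions against the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 10
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p) + rng.standard_normal(n)   # noisy targets

def grads(x):
    """Rows are the component gradients of f_i(x) = 0.5 * (a_i @ x - b_i)**2."""
    return (A @ x - b)[:, None] * A

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
x = x_star + 0.01 * rng.standard_normal(p)   # an iterate close to the optimum
table = grads(x_star)                        # idealised memory: gradients stored at x_star

G = grads(x)
full = G.mean(axis=0)
# SGD direction for sample j is row j of G.
sgd_mse = np.mean(np.sum((G - full) ** 2, axis=1))
# SAG direction for sample j is the table average with row j refreshed.
sag_dirs = table.mean(axis=0) + (G - table) / n
sag_mse = np.mean(np.sum((sag_dirs - full) ** 2, axis=1))
print(f"SGD mean squared error: {sgd_mse:.3e}")
print(f"SAG mean squared error: {sag_mse:.3e}")
```

The SGD error stays at the scale of the individual gradients, which do not vanish at the optimum when the targets are noisy, whereas the SAG error shrinks with the distance between the iterate and the point at which the table was filled.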

3. Convergence Guarantees and Complexity

SAG provides sharp convergence rates for both convex and strongly convex objectives.

Convex Case ($\mu = 0$):

$\mathbb{E}[f(\bar{x}^k)] - f(x^*) = O\left(\frac{n}{k}\right)$

where $\bar{x}^k$ is the average (or "best") iterate and $x^*$ is a minimizer of $f$. This rate is $O(1/k)$ in terms of effective data passes, improving upon the typical $O(1/\sqrt{k})$ of SGD (Schmidt et al., 2013, Notsawo, 2023, Hofmann et al., 2015, Morin et al., 2019).

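Strongly Convex Case ($\mu > 0$):

$\mathbb{E}[f(x^k)] - f(x^*) = O\left(\left(1 - \min\left\{\frac{\mu}{16L}, \frac{1}{8n}\right\}\right)^k\right)$

for step size $\alpha = 1/(16L)$, where $\mu$ is the strong-convexity constant of $f$, yielding a linear (geometric) convergence rate (Schmidt et al., 2013, Zhu et al., 5 Feb 2026, Defazio et al., 2014).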

The per-iteration cost is comparable to SGD ($O(p)$ for $p$-dimensional parameters), with memory requirements of $O(np)$ to store the gradient table, and no need for decaying stepsizes (Notsawo, 2023, Zhu et al., 5 Feb 2026, Hofmann et al., 2015).
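To translate a linear rate into iteration counts, the sketch below assumes the contraction factor $1 - \min\{\mu/(16L), 1/(8n)\}$ reported by Schmidt et al. (2013) for step size $1/(16L)$. The function name and the example values of $n$, $\mu$, $L$, and the target reduction are hypothetical.

```python
import math

def sag_iterations(n, mu, L, eps):
    """Smallest k with rho**k <= eps, where rho = 1 - min(mu / (16 L), 1 / (8 n))."""
    rate = min(mu / (16.0 * L), 1.0 / (8.0 * n))
    return math.ceil(math.log(1.0 / eps) / -math.log1p(-rate))

n = 100_000
k = sag_iterations(n=n, mu=1e-3, L=1.0, eps=1e-6)
print(k, "iterations, i.e. about", round(k / n, 1), "passes over the data")
```

With these values the $1/(8n)$ term is the binding one, so the bound calls for on the order of a hundred passes. This is a worst-case figure and typically pessimistic relative to observed behaviour.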

Comparison Table: Work Complexity and Convergence Rates

Method	Per-iteration cost	Storage	Convergence Rate
SGD	$O(p)$	$O(p)$	$O(1/\sqrt{k})$ (convex), $O(1/k)$ (strongly convex)
FG	$O(np)$	$O(p)$	$O(1/k)$ (convex), geometric (strongly convex)
SAG	$O(p)$	$O(np)$	$O(1/k)$ (convex), geometric (strongly convex)
SAGA	$O(p)$	$O(np)$	Improved linear; unbiased
SVRG	$O(p)$*	$O(p)$	Improved linear; requires periodic FG pass

*: Average per-iteration cost; periodic full-gradient passes incur $O(np)$ cost (Hofmann et al., 2015, Defazio et al., 2014).

4. Extensions, Variants, and Enhancements

Multiple enhancements and extensions of SAG have been proposed, addressing limitations and expanding applicability:

Non-uniform Sampling: Sampling indices with probabilities proportional to local smoothness parameters ($p_i \propto L_i$) accelerates convergence by refreshing the components with the largest Lipschitz constants more often (Schmidt et al., 2013, Schmidt et al., 2015).
Mini-batching: Grouping examples for parallelism and reducing storage, with convergence rates preserved under suitable stepsize adjustments (Schmidt et al., 2013).
SAG with Momentum / SAG+Adam: Incorporating momentum or adaptive coordinate-wise scaling improves empirical performance on ill-conditioned or nonconvex landscapes; both hybridizations preserve low variance while enhancing optimization dynamics (Notsawo, 2023).
Proximal Acceleration: The SAG update extends to composite minimization by including a proximal map for nonsmooth regularizers; convergence rates in this composite regime are established (Schmidt et al., 2015, Driggs et al., 2019).
Compositional SAG (C-SAG): For compositional finite-sum objectives, C-SAG maintains memory at both inner and outer function layers, preserving the oracle efficiency of SAG in more complex optimization topologies (Hsieh et al., 2018).
Stratified and Structured Variants: SSAG pools gradients across stratified cohorts/classes, reducing the dimension of the memory and accelerating convergence when the number of strata is much smaller than $n$ (Chen et al., 2017).

Sufficient-Decrease Variants: Recent work incorporates sufficient-decrease line search into SAG updates (SAG-SD), guaranteeing descent and adapting steps on the fly, with the same linear rates (Shang et al., 2017).
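As an illustration of the non-uniform sampling variant listed above, the sketch below draws indices with probability proportional to per-component smoothness constants. It covers only the sampling step, not a full non-uniform SAG method. The rescaled synthetic data and the choice $L_i = \|a_i\|^2$ (valid for least-squares components) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows with very different norms, so the L_i differ widely.
A = rng.standard_normal((1000, 20)) * rng.uniform(0.1, 3.0, size=(1000, 1))
L_i = np.sum(A ** 2, axis=1)      # smoothness constant of 0.5 * (a_i @ x - b_i)**2
probs = L_i / L_i.sum()           # p_i proportional to L_i
draws = rng.choice(len(L_i), size=50_000, p=probs)
counts = np.bincount(draws, minlength=len(L_i))
print("share of draws landing on the top 10% of L_i:",
      counts[np.argsort(L_i)[-100:]].sum() / draws.size)
```

In a full method the drawn index would replace the uniform sample in the SAG loop, usually together with a larger step size than uniform sampling would permit.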

5. Bias, Unbiasedness, and Theoretical Developments

SAG's gradient estimator is biased in general, due to the averaging over a mixture of current and stale memory entries. Despite this bias, SAG attains provable linear rates; rigorous analysis shows bias vanishes as the stored gradients converge to the true gradients at the optimum (Driggs et al., 2019, Morin et al., 2019).

Unbiased alternatives, such as SAGA, leverage a modified update that ensures the expectation coincides exactly with the true gradient at each iteration, resulting in marginally improved theoretical constants and facilitating easier extension to composite and non-strongly convex settings (Defazio et al., 2014, Driggs et al., 2019).
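The difference between the two estimators can be checked numerically. The sketch below uses a random matrix as a stand-in for a stale gradient table (an assumption for illustration only) and averages each estimator over every possible sampled index, which gives its exact expectation under uniform sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
x = rng.standard_normal(p)

G = (A @ x - b)[:, None] * A        # current gradients of f_i(x) = 0.5 * (a_i @ x - b_i)**2
Y = rng.standard_normal((n, p))     # stand-in for a stale gradient table
full, y_bar = G.mean(axis=0), Y.mean(axis=0)

# Row j holds the direction obtained when index j is sampled.
sag_mean = (y_bar + (G - Y) / n).mean(axis=0)    # SAG: correction scaled by 1/n
saga_mean = (y_bar + (G - Y)).mean(axis=0)       # SAGA: correction at full weight

print("SAG bias norm: ", np.linalg.norm(sag_mean - full))
print("SAGA bias norm:", np.linalg.norm(saga_mean - full))
```

The SAG expectation works out to $\frac{1}{n}\nabla f(x) + \left(1 - \frac{1}{n}\right)\bar{y}$, so it matches the full gradient only once the table average $\bar{y}$ does, while the SAGA expectation equals $\nabla f(x)$ for any table contents.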

High-probability convergence results and unified proof frameworks have recently bridged the gap between biased (SAG) and unbiased (SAGA) estimators, supplying modular Lyapunov-based analyses and extending guarantees to regimes involving Markov sampling and non-convexity (Zhu et al., 5 Feb 2026).

6. Applications and Empirical Observations

SAG and its variants have been widely applied in machine learning for large-scale empirical risk minimization, conditional random field training, neural network optimization, and structured prediction problems. Empirical studies show:

Superior convergence speed relative to classic SGD, requiring 2–5× fewer data passes to reach high-accuracy regimes after a short warm-up (Notsawo, 2023, Hofmann et al., 2015).
Robustness on ill-conditioned and nonconvex problems, especially with hybrid momentum or adaptive schemes (Notsawo, 2023).
Storage cost $O(np)$ can be limiting for deep learning with massive sample sizes, motivating structured memory reductions and stratified designs (1202.13212, Schmidt et al., 2015, Chen et al., 2017).

7. Limitations, Open Directions, and Comparative Analysis

The primary limitation of SAG is its $O(np)$ memory footprint, a constraint for massive datasets or high-dimensional settings. The bias in its estimator, while vanishing asymptotically, may induce oscillations or slower transient convergence in early epochs. The step-size restriction, originally conservative ($\alpha = 1/(16L)$), has been relaxed in recent analyses to the optimal order, aligning SAG with SAGA and improving practical stability (Morin et al., 2019, Zhu et al., 5 Feb 2026).
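A back-of-the-envelope calculation shows how quickly the table grows. The helper below is a hypothetical illustration that assumes a dense table of 8-byte floats with one row per sample.

```python
def sag_table_gib(n, p, bytes_per_value=8):
    """Size in GiB of a dense n-by-p gradient table."""
    return n * p * bytes_per_value / 2 ** 30

# One million samples and one thousand parameters.
print(f"{sag_table_gib(1_000_000, 1_000):.2f} GiB")
```

For this example the table alone takes about 7.45 GiB, compared with roughly 8 kB for a single parameter vector of the same dimension.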

Comparative studies show that SAGA generally achieves better constants and applies natively to composite problems, while SVRG (without memory) remains preferable in contexts with tight memory budgets and can be tailored for non-Euclidean geometries or infrequent storage update scenarios (Defazio et al., 2014, Hofmann et al., 2015, Driggs et al., 2019).

Advances in sufficient-decrease techniques, stratified or compositional adaptations, and plug-in variance reduction mechanisms (e.g., SARAH, SARGE) continue to refine the theoretical and empirical landscape. Adaptive sampling, mini-batching, and structured memory layouts are active research areas aimed at extending the reach of SAG-type algorithms to ever larger and more structured machine learning regimes (Schmidt et al., 2015, Hsieh et al., 2018, 1202.13212, Chen et al., 2017).

References: (Schmidt et al., 2013, Hofmann et al., 2015, Notsawo, 2023, Defazio et al., 2014, Schmidt et al., 2015, Chen et al., 2017, Driggs et al., 2019, Hsieh et al., 2018, Dresdner et al., 2022, Morin et al., 2019, Zhu et al., 5 Feb 2026)