Variance-Reduced Algorithms for Finite-Sum Potentials

Updated 11 November 2025

The paper demonstrates a novel variance reduction framework that leverages recursive gradient estimators to achieve tight convergence in both convex and nonconvex settings.
The topic introduces adaptive, parameter-free algorithms that dynamically tune step-sizes to enhance performance across distributed, compressed, and coordinate descent regimes.
The topic offers rigorous complexity bounds and error analysis that underpin robust empirical performance in large-scale machine learning applications.

Variance-reduced algorithms for finite-sum potentials form a major class of stochastic optimization methods that exploit the structure of smooth functions expressed as finite sums. These techniques systematically reduce the variance of stochastic gradient estimators, leading to sharp convergence guarantees in both convex and nonconvex regimes, and have been generalized to cover complex operator inclusions, composition structures, decentralized networks, zeroth-order settings, and non-smooth or constrained objectives. The domain spans both theoretical convergence analysis and robust practical design (parameter-free adaptation, distributed implementation, acceleration, and inverse scaling), constituting the methodological core of modern large-scale optimization in machine learning and related domains.

1. Mathematical Formulation and Algorithmic Frameworks

Variance reduction for finite-sum potentials addresses problems of the form: $f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x),\qquad x\in\mathbb{R}^d,$ where each $f_i$ is $L$ -smooth and may be convex or nonconvex. The overarching aim is to minimize $f(x)$ efficiently when $n$ is large.

The core recursive update paradigm is: $x^{t+1} = x^t - \gamma_t g^t,$ where $g^t$ is a stochastic estimator of the gradient. Modern frameworks formalize $g^t$ through recursive constructions—such as control variates, memory tables, and anchored snapshots—yielding key families:

SVRG (Stochastic Variance Reduced Gradient)/SARAH/PAGE: Epoch-based and loopless constructions with periodic full gradient reference points or probabilistic resets.
SAGA: Maintains a table of historical per-function gradients for fast incremental correction.
Coordinate and Compressed methods (SEGA, JAGUAR, EF21, DIANA, DASHA): Access random coordinate blocks or compressed gradients, exploiting similar recursive error bounds.

A key assumption (Shestakov et al., 6 Nov 2025) is the coupled recursions for the gradient error $e^t := g^t - \nabla f(x^t)$ and auxiliary variance proxies $\sigma_t^2$ : $\begin{split} \mathbb{E}[\|e^t\|^2|\mathcal{F}_t] &\leq (1-\rho_1)\|e^{t-1}\|^2 + A \sigma_{t-1}^2 + B L^2 \|x^t - x^{t-1}\|^2,\ \mathbb{E}[\sigma_t^2|\mathcal{F}_t] &\leq (1-\rho_2)\sigma_{t-1}^2 + C L^2 \|x^t - x^{t-1}\|^2, \end{split}$ with constants determined by the estimator structure, independent of unknown quantities such as $L$ or $f^*$ .

2. Convergence Guarantees and Complexity

Variance-reduced algorithms for finite-sum potentials admit rigorous convergence rates with precise complexity implications for convex, nonconvex, and Polyak–Łojasiewicz (PL) settings:

Nonconvex smooth: For constant step-size $\gamma \leq (1/L)(1 + \nu)^{-1}$ , they achieve

$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E} \|\nabla f(x^t)\|^2 = O(1/T).$

PL condition: If $f(x)-f^*\leq \frac{1}{2\mu}\|\nabla f(x)\|^2$ , then

$V^T \leq (1-\gamma\mu)^T V^0,$

with a Lyapunov function involving function gap, squared estimation error, and variance proxies.

Adaptive sublinear rate (parameter-free execution): Using the step-size schedule

$\gamma_t = \frac{1}{\max\{\nu,1\}^{1-\alpha}\left(\sum_{i=0}^{t-1} \|g^i\|^2\right)^{\alpha}},\quad \alpha\in(0,1/3),$

yields

$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E} \|\nabla f(x^t)\| = O(\max\{\nu^{1/2},1\}/\sqrt{T}),$

matching known lower bounds (Shestakov et al., 6 Nov 2025).

Complexity results hold for both unbiased and biased estimators, encode batch/compression/sketches in constants, and extend to distributed and coordinate variants (parameter-free adaptive VR for EF21, DIANA, DASHA, SEGA, JAGUAR). All classical VR methods (SVRG, SAGA, SARAH, PAGE, ZeroSARAH) emerge as special cases.

3. Adaptive and Parameter-free Algorithms

A fundamental advance is the construction of variance-reduced methods which adapt their step size dynamically without explicit tuning of learning rate, smoothness constant, or batch parameters. The adaptive schedule depends only on observed $\|g^i\|$ and known estimator configuration parameters: $\gamma_t=\frac{1}{\max\{\nu,1\}^{1-\alpha} (S^t)^{\alpha}},\quad S^{t+1} = S^t + \|g^t\|^2.$ Here, $\nu = \sqrt{(B\rho_2 + AC)/(\rho_1\rho_2)}$ aggregates structure-dependent coefficients.

This approach robustly resolves the hyperparameter sensitivity issue present in adaptive methods like STORM and Ada-STORM, and natively handles biased estimators, coordinate sampling, and quantized or compressed communication. Ablation studies (Shestakov et al., 6 Nov 2025) confirm that varying the adaptive hyperparameter $\alpha\in(0,1/3)$ yields expected robustness–performance trade-offs, and adaptive schemes absorb batch-size and compression sensitivity.

4. Generalizations: Distributed, Proximal, Coordinate, and Beyond

Variance-reduced algorithms are tightly integrated into distributed and decentralized frameworks (Xin et al., 2020), generalized monotone inclusions (Cai et al., 2023), and operator splitting (Tran-Dinh, 17 Apr 2025):

Distributed/Decentralized: Variance reduction is combined with gradient tracking to synchronize local changes and provide tight convergence—linear in the strongly convex regime—over peer-to-peer or networked architectures. Storage and communication complexity trade-offs are characterized precisely.
Monotone and saddle-point inclusions: VR-Halpern, VR-forward-backward, and accelerated extra-point methods obtain residual norm $\varepsilon$ guarantees in

$\widetilde{O}(n+\sqrt{n}L/\varepsilon)$

oracle complexity for monotone lipschitz inclusions (Cai et al., 2023, Huang et al., 2022, Tran-Dinh, 17 Apr 2025), near lower bounds.

Nonsmooth and constrained extensions: Proximal VR methods in convex and nonsmooth regimes (Song et al., 2021, Traoré et al., 2023) take advantage of primal–dual reformulations, adaptive dual averaging, and random projections. Complexity improvements over previous methods are significant, notably:

$O(nd\log\min\{n,1/\varepsilon\} + d/\varepsilon)$

for the nonsmooth convex setting (VRPDA), and $O(1/k)$ in stochastic-proximal-point VR variants.

Coordinate and zeroth-order regimes: Accelerated VR coordinate descent (ASVRCD) (Hanzely et al., 2020) achieves

$O((n+\sqrt{nL/\mu})\log(1/\epsilon))$

for finite-sum objectives, matching optimal complexity.

Composition and stochastic perturbation: For problems involving finite-sum plus random perturbations or composition layers, tailored VR algorithms such as S-MISO (Bietti et al., 2016) and SCVRI/SCVRII (Liu et al., 2017) achieve improved iteration and query complexity—sometimes sub- $O(n^{2/3}/\epsilon)$ in composition optimization.
Second-order VR: SVRN (Dereziński, 2022) merges Newton-type methods with VR, accelerating convergence to $O(\log(1/\varepsilon)/\log n)$ passes for large-scale strongly-convex finite sums.

5. Statistical Analysis and Error Bounds

Variance-reduced samplers can be analyzed for sampling accuracy and discretization bias, leveraging discrete Poisson equation methodologies (Lu et al., 6 Nov 2025). For sampling in underdamped Langevin dynamics driven by finite-sum potentials, SVRG/UBU and SAGA/UBU schemes exhibit an explicit phase transition: $|\text{Bias}_h|\leq C \left[ \frac{\sigma^2 d}{m^2p}\min\left\{h,\frac{N^2 h^3}{mp^2}\right\} + \frac{dh^2}{m^3} \right].$ For step-size $h<h^*=(mp^2)/N^2$ , the bias becomes $O(h^2)$ , matching the deterministic discretization regime; above this threshold, $O(h)$ stochastic bias due to the variance of the gradient estimator dominates. This yields critical practical guidance in tuning batch and step sizes.

6. Practical Performance and Empirical Studies

Extensive empirical evaluations (Shestakov et al., 6 Nov 2025, Yang et al., 2022, Liu et al., 2017, Ye et al., 13 Jan 2025, Xin et al., 2020) demonstrate that adaptive parameter-free variance-reduced algorithms outperform both theoretically-parametrized and grid-tuned constant learning rate methods on canonical tasks such as $\ell_2$ -regularized logistic regression and constrained classification:

Adaptive VR algorithms show faster and more robust convergence, with minimal need for parameter tuning, across varying values of batch size, compressor bandwidth, and sketch dimension.
In beamforming and robust classification under convex constraints, variance-reduced relaxation-projection methods attain superior optimality and feasibility rates over previous random projection approaches.
In zeroth-order and decentralized regimes, VR methods maintain optimal scaling in high dimensions and robust convergence under stochastic and communication-limited conditions.

A plausible implication is that the unification and extension of VR to cover not only unbiased SGD-like gradient estimators but also biased, compressed, coordinate, and composite regimes, and to support adaptive parameter-free execution, has led to a regime where practical usage closely matches theoretical lower bounds for finite-sum stochastic optimization across a broad array of applications.

7. Outlook and Ongoing Challenges

While variance-reduced algorithms for finite-sum potentials currently match or approach oracle lower bounds for many problem classes, several avenues for further development remain:

The extension to nonconvex–nonconcave saddle-point and min-max problems, especially with complex constraint sets or in decentralized settings.
Tighter rate guarantees and complexity improvements for high-variance or non-smooth objectives, particularly under practical mini-batch and communication constraints.
Automatic adaptation to heterogeneous or time-varying data, and efficient handling of asynchronous, adaptive, and federated deployment scenarios.
Zeroth-order and functional-constraint settings still present open complexity gaps, especially in the nonconvex and high-dimensional regime.

Variance-reduced methods have become a dominant paradigm for large-scale empirical risk minimization, variational inequalities, and operator-splitting scenarios involving finite-sum or empirical potential structures, with ongoing research continuing to expand their generality, robustness, and efficiency.