
Polyak–Ruppert Averaging

Updated 9 February 2026
  • Polyak–Ruppert averaging is a variance-reduction technique that computes a uniform average of all iterates, yielding estimators with minimal asymptotic variance.
  • It achieves minimax optimality and provides strong theoretical guarantees such as asymptotic normality, high-probability bounds, and bias–variance trade-off in stochastic settings.
  • The method is practically applied in stochastic gradient descent, reinforcement learning, and distributed optimization with minimal extra computational cost.

Polyak–Ruppert averaging is a variance-reduction and efficiency-enhancement technique in stochastic approximation, stochastic gradient descent, and more broadly in iterative algorithms subject to noise or uncertainty. The method replaces the last iterate of the stochastic process by a uniform (or occasionally weighted) average of all iterates, often yielding estimators with minimal asymptotic variance, improved convergence rates, and provable optimality properties—frequently achieving the minimax asymptotic Cramér–Rao lower bound for mean-squared error. This approach has become foundational in stochastic optimization, machine learning, reinforcement learning, and black-box optimization.

1. Method Definition and Algorithmic Foundations

In a generic stochastic approximation context, given iterates $x_k$ produced by an algorithm such as stochastic gradient descent,

$$x_{k+1} = x_k - \gamma_k g(x_k, \xi_k),$$

where $g(x_k, \xi_k)$ is a noisy estimator of the gradient, Polyak–Ruppert averaging constructs the averaged iterate

$$\bar x_n = \frac{1}{n} \sum_{k=1}^n x_k.$$

Uniform averaging is the most studied form, but variants with non-uniform (e.g., geometric) weights have been considered in specific applications (Neu et al., 2018). Averaging can be implemented online with negligible computational and memory overhead.
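
As a concrete illustration of this low overhead, the following minimal Python sketch runs SGD on a toy quadratic with a hypothetical noisy-gradient oracle noisy_grad (both are assumptions for illustration, not taken from the cited works) and maintains the uniform average with a single extra vector and an O(d) update per step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(1.0, 3.0, d))   # Hessian of a toy quadratic f(x) = 0.5 * x^T A x
x_star = np.zeros(d)                     # its minimizer

def noisy_grad(x):
    """Unbiased gradient estimate g(x, xi) = A x + zero-mean noise (illustrative oracle)."""
    return A @ x + 0.1 * rng.standard_normal(d)

x = np.ones(d)        # x_1
x_bar = x.copy()      # running average \bar{x}_1
n = 10_000
for k in range(1, n + 1):
    gamma_k = 0.5 * k ** (-0.75)     # step size gamma_k ~ k^{-3/4}, beta in (1/2, 1)
    x = x - gamma_k * noisy_grad(x)  # SGD step producing x_{k+1}
    x_bar += (x - x_bar) / (k + 1)   # online update of the uniform average

print("last-iterate error:", np.linalg.norm(x - x_star))
print("averaged error    :", np.linalg.norm(x_bar - x_star))
```

Here the average is maintained online with one extra vector; the $k^{-3/4}$ exponent corresponds to the $\beta^* = 3/4$ step-size guidance discussed in Sections 2 and 5.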

In linear stochastic approximation (LSA) and reinforcement learning (RL) algorithms such as temporal difference (TD) learning, PR-averaging is typically applied as

$$\bar\theta_n = \frac{1}{n}\sum_{k=1}^n \theta_k,$$

where

$$\theta_{k+1} = \theta_k + \alpha_k [A_k \theta_k - b_k].$$

The approach similarly generalizes to two-timescale and distributed settings (Kong et al., 14 Feb 2025, Zhang et al., 2022). In streaming, online, or mini-batch frameworks, averaging is adapted to accommodate variable batch sizes; e.g., the average at time $t$ can be defined as $\bar\theta_t = \frac{1}{N_t}\sum_{i=1}^t n_i \theta_{i-1}$ with $N_t = \sum_{i=1}^t n_i$ (Godichon-Baggioni et al., 2021, Godichon-Baggioni et al., 2022).
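
To make the streaming average concrete, the following sketch maintains $N_t$ and the weighted average online as variable-size mini-batches arrive; the toy least-squares model, the helper minibatch_grad, the batch-size range, and the step-size constants are illustrative assumptions rather than the exact setup of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
theta_star = rng.standard_normal(d)   # ground truth of a toy linear model (assumption)

def minibatch_grad(theta, n_i):
    """Average least-squares gradient over a fresh mini-batch of size n_i."""
    X = rng.standard_normal((n_i, d))
    y = X @ theta_star + 0.1 * rng.standard_normal(n_i)
    return X.T @ (X @ theta - y) / n_i

theta = np.zeros(d)       # theta_0
theta_bar = np.zeros(d)   # weighted average \bar{theta}_t
N_t = 0                   # running sample count N_t = n_1 + ... + n_t
for t in range(1, 2001):
    n_i = int(rng.integers(8, 65))   # variable batch size (illustrative)
    N_t += n_i
    # \bar{theta}_t = \bar{theta}_{t-1} + (n_i / N_t) * (theta_{t-1} - \bar{theta}_{t-1})
    theta_bar += (n_i / N_t) * (theta - theta_bar)
    alpha_t = 0.1 * n_i ** (1.0 / 3.0) * t ** (-2.0 / 3.0)  # step ~ n^{1/3} t^{-2/3} (illustrative)
    theta = theta - alpha_t * minibatch_grad(theta, n_i)

print("weighted-average error:", np.linalg.norm(theta_bar - theta_star))
```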

2. Theoretical Guarantees: Asymptotic Normality and Rate Optimality

Polyak–Ruppert averaging delivers minimax-optimal asymptotic and non-asymptotic statistical guarantees in a wide class of stochastic approximation and optimization schemes.

  • Asymptotic Efficiency: Under appropriate regularity (e.g., strong convexity or suitable generalizations), PR-averaged iterates satisfy

$$\sqrt{n}(\bar x_n - x^*) \xrightarrow{d} \mathcal{N}(0, V),$$

where $V = [\nabla^2 f(x^*)]^{-1} S^* [\nabla^2 f(x^*)]^{-1}$ is the Cramér–Rao lower bound for the asymptotic covariance, and $S^*$ is the covariance of the gradient noise (Gadat et al., 2017, Zhu et al., 2019). A toy simulation illustrating this limit appears after this list.

  • Non-asymptotic Bounds: For $f$ strongly convex or satisfying weaker Kurdyka–Łojasiewicz-type conditions, the mean-squared error (MSE) satisfies

$$\mathbb{E}\left[\|\bar\theta_n - \theta^*\|^2\right] \leq \frac{\operatorname{Tr} V}{n} + O(n^{-r_\beta}),$$

where the second-order exponent $r_\beta$ is typically maximized at step-size exponent $\beta^* = 3/4$ (Gadat et al., 2017).

  • Bias–Variance Decomposition: For constant-step-size LSA and Markovian noise, the averaged error decomposes as

$$\mathbb{E}\left[\|\bar \theta_N - \theta^*\|^2\right] = O(1/N) + O(\alpha_N^2),$$

with an $O(1/N)$ variance term and an $O(\alpha_N^2)$ bias term, the latter dominating for slowly decaying step sizes (e.g., $\rho \leq 1/2$) (Lauand et al., 2024, Levin et al., 7 Aug 2025).

  • Concentration and High-Probability Bounds: Explicit empirical confidence regions for $x^*$, with coverage tending to $1 - \delta$ at rate $O(\sqrt{\log(1/\delta)/n})$, are achievable under sub-Gaussian or martingale conditions on the noise (Zhu et al., 2019, Khodadadian et al., 27 May 2025).
  • Extension to Manifolds and Degenerate Minima: Polyak–Ruppert averaging ensures $\sqrt{n}$-rate convergence to the normal component of a stable manifold even when minima are non-isolated, with the limiting covariance reflecting only the normal directions to the manifold (Dereich et al., 2019).
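
The following toy Monte Carlo sketch illustrates the central limit theorem from the asymptotic-efficiency bullet on a one-dimensional quadratic with Gaussian gradient noise; the curvature $h$, noise level $\sigma$, step-size schedule, and run lengths are illustrative assumptions, and at finite $n$ the empirical spread only approximately matches $\sqrt{V} = \sigma/h$.

```python
import numpy as np

rng = np.random.default_rng(2)
h, sigma = 2.0, 1.0        # curvature f''(x*) and gradient-noise std of a 1-D quadratic
n, reps = 10_000, 200      # iterations per run and Monte Carlo repetitions (illustrative)

scaled_errors = []
for _ in range(reps):
    x = x_bar = 1.0
    for k in range(1, n + 1):
        gamma_k = 0.5 * k ** (-0.75)                             # gamma_k ~ k^{-3/4}
        x -= gamma_k * (h * x + sigma * rng.standard_normal())   # noisy gradient step
        x_bar += (x - x_bar) / (k + 1)                           # running average
    scaled_errors.append(np.sqrt(n) * x_bar)   # sqrt(n) * (x_bar - x*), with x* = 0

print("empirical sd of sqrt(n)*(x_bar - x*):", np.std(scaled_errors))
print("predicted sqrt(V) = sigma / h       :", sigma / h)
```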

3. Algorithmic Extensions and Model Generality

Polyak–Ruppert averaging has robust applicability across various stochastic settings and algorithmic classes:

  • Order- and Comparison-Only Oracles: In black-box optimization, PR-averaging applies to “Stochastic Order Oracle” models, using only pairwise comparisons, and yields explicit limiting covariance with sharp Hessian dependence, outperforming non-averaged or non-optimally tuned schemes (Smirnov et al., 2024).
  • Two-Timescale and Coupled Algorithms: In two-timescale linear stochastic approximation, PR-averaging achieves the simultaneous minimax rate $O(1/\sqrt{n})$ for both “fast” and “slow” averaged iterates in the presence of martingale or Markovian noise, a significant improvement over the best rate attainable without averaging (Kong et al., 14 Feb 2025, Butyrin et al., 11 Aug 2025).
  • Streaming and Mini-Batch SGD: With time-varying batches and non-stationary, dependent noise, PR-averaging unifies variance reduction with adaptivity to changing regimes, as long as the underlying process maintains some form of strong convexity or smoothness at the optimum (Godichon-Baggioni et al., 2021, Godichon-Baggioni et al., 2022).
  • Zeroth-Order and Derivative-Free Optimization: PR-averaging extends to stochastic zeroth-order optimization (e.g., via Gaussian smoothing estimators), with the averaged solution achieving valid central limit theorems and online estimators for the limiting covariance (Jin et al., 2021); a sketch follows this list.
  • Distributed and Decentralized Optimization: In dual-accelerated consensus algorithms or decentralized policy evaluation, PR-averaging is crucial for attaining optimally accelerated deterministic bias decay and optimal $O(1/T)$ stochastic variance scaling, even under communication or topology-induced constraints (Zhang et al., 2022).
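
As an illustration of the zeroth-order setting, the sketch below replaces the gradient oracle with a two-point Gaussian-smoothing estimate built from noisy function values only and averages the iterates exactly as before; the toy objective, smoothing radius mu, noise level, and step-size constants are assumptions for illustration, not the precise construction analyzed in the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)
d, mu = 5, 0.05   # dimension and Gaussian-smoothing radius (illustrative choices)

def f_noisy(x):
    """Noisy function-value oracle for a toy quadratic objective (assumption)."""
    return 0.5 * float(x @ x) + 1e-3 * rng.standard_normal()

def zo_gradient(x):
    """Two-point Gaussian-smoothing gradient estimate using function values only."""
    u = rng.standard_normal(d)
    return (f_noisy(x + mu * u) - f_noisy(x - mu * u)) / (2.0 * mu) * u

x = np.ones(d)
x_bar = x.copy()
for k in range(1, 20_001):
    gamma_k = 0.1 * k ** (-0.75)
    x = x - gamma_k * zo_gradient(x)
    x_bar += (x - x_bar) / (k + 1)   # Polyak–Ruppert average of the zeroth-order iterates

print("averaged error (minimizer is the origin):", np.linalg.norm(x_bar))
```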

4. Statistical Inference, Confidence Intervals, and Non-asymptotic Tools

Beyond point estimation, Polyak–Ruppert averaging enables statistically principled inference:

  • Batch Means and Bootstrap Procedures: Construction of confidence sets for $x^*$ leverages batch means or multiplier bootstrap approaches, which avoid direct estimation of the high-dimensional asymptotic covariance and enjoy nonasymptotic coverage control at the desired confidence level $1-\delta$ (Zhu et al., 2019, Samsonov et al., 2024); a minimal batch-means sketch appears after this list.
  • Functional Central Limit Theorems (FCLT): For stochastic-gradient iterates, FCLTs describe the weak convergence of the whole path of PR-averaged estimators to a Brownian motion with explicit covariance, facilitating inference and adaptive stopping (Zhu et al., 2019, Li et al., 2021).
  • Finite-time and High-probability Analysis: Refined results yield finite-sample bounds on the error $\|\bar\theta_n - \theta^*\|$ in $L^2$ and with probability $1-\delta$, with explicit dependence on problem dimension, step size, and mixing time in non-i.i.d. settings (Durmus et al., 2022, Khodadadian et al., 27 May 2025).
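
A minimal batch-means sketch is given below, assuming the SGD trajectory has been stored as an array thetas of shape (n, d); the equal-size batching and the Student-t quantile are simplifications relative to the procedures of the cited papers.

```python
import numpy as np
from scipy import stats

def batch_means_ci(thetas, delta=0.05, num_batches=20):
    """Coordinate-wise confidence intervals for theta* from stored iterates.

    Simplified equal-size batch-means construction (illustrative; the cited
    papers use more refined batching and bootstrap schemes).
    """
    n, d = thetas.shape
    b = n // num_batches                       # batch length
    n_used = b * num_batches
    used = thetas[:n_used]
    batch_means = used.reshape(num_batches, b, d).mean(axis=1)
    theta_bar = used.mean(axis=0)              # Polyak–Ruppert average
    # Batch means behave approximately like independent draws with covariance V / b,
    # so b times their sample covariance estimates the asymptotic covariance V.
    centered = batch_means - theta_bar
    V_hat = b * centered.T @ centered / (num_batches - 1)
    q = stats.t.ppf(1.0 - delta / 2.0, df=num_batches - 1)
    half_width = q * np.sqrt(np.diag(V_hat) / n_used)
    return theta_bar - half_width, theta_bar + half_width
```

Calling batch_means_ci(np.asarray(trajectory)) on a stored (iterations × dimension) trajectory yields intervals for each coordinate of $\theta^*$ without estimating $\nabla^2 f(x^*)$ or $S^*$ separately.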

5. Algorithmic and Practical Considerations

Polyak–Ruppert averaging is characterized by broad stability and minimal tuning requirements, yet specific considerations are critical for optimal performance:

| Setting | Step-size guidance | Key parameter regimes |
| --- | --- | --- |
| Strongly convex SGD | $\gamma_n \sim n^{-\beta}$ | $\beta \in (1/2, 1)$ |
| Linear constant-step LSA | $\alpha \sim n^{-1/2}$ | Balances bias and variance |
| Streaming mini-batch SGD | $\alpha = 2/3$, $\beta = 1/3$ | Robust $O(1/N)$ for any batch |
| Two-timescale SA | $1/2$ | Rate $n^{-1/4}$ when $a \approx b$ |
  • Bias Correction and Extrapolation: In LSA with Markovian noise and constant step size, PR-averaging may leave an $O(\alpha)$ bias. Richardson–Romberg extrapolation using two copies run with step sizes $\alpha$ and $2\alpha$ can eliminate the leading bias and improve the MSE to match the optimal covariance up to $O(1/n)$ (Levin et al., 7 Aug 2025); a sketch of the extrapolation step follows this list.
  • Weighted and Geometric Averaging: Geometric PR-averaging introduces implicit regularization, with weighting equivalent to ridge regression in linear models. This variant provides additional finite-sample control at the variance–bias interface (Neu et al., 2018).
  • Algorithmic Heuristics: For step-size tuning in LSA and related models, practical online halving or stability-based heuristics are available that enable “hands-off” parameter selection while maintaining theoretical guarantees (Lakshminarayanan et al., 2017).
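
The extrapolation step itself is a one-line combination of two averaged runs. The sketch below uses fake_lsa_average, a synthetic stand-in (an assumption, not a real LSA implementation) that returns $\theta^*$ plus an $O(\alpha)$ bias and small noise, purely to show how the leading bias cancels.

```python
import numpy as np

def richardson_romberg(lsa_average, alpha):
    """Cancel the leading O(alpha) bias by combining runs at step sizes alpha and 2*alpha."""
    return 2.0 * lsa_average(alpha) - lsa_average(2.0 * alpha)

# Synthetic stand-in for a constant-step PR-averaged LSA run: returns theta* plus an
# O(alpha) bias term and a little simulation noise. Used only to show the cancellation.
rng = np.random.default_rng(4)
theta_star = 1.0
def fake_lsa_average(alpha):
    return theta_star + 0.8 * alpha + 1e-3 * rng.standard_normal()

print("single averaged run, alpha = 0.1:", fake_lsa_average(0.1))   # biased by about 0.08
print("Richardson-Romberg combination  :", richardson_romberg(fake_lsa_average, 0.1))
```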

6. Impact and Specialized Applications

Polyak–Ruppert averaging fundamentally improved the theoretical understanding and practical efficiency of stochastic optimization algorithms:

  • Attainment of Minimax Optimality: PR-averaging consistently attains the first-order Cramér–Rao lower bound in both asymptotic-variance and finite-time regimes, providing performance unattainable by unaveraged schemes (Gadat et al., 2017, Mou et al., 2020).
  • Black-box and Comparison-Based Optimization: In settings where only order or relative comparisons are available, PR-averaged algorithms achieve provably tighter asymptotic dispersion, facilitating efficient black-box optimization (Smirnov et al., 2024).
  • Non-Isolated and Degenerate Minima: Averaging ensures that stochastic gradient algorithms can retain $\sqrt{n}$-rate normality even when the set of minimizers is a positive-dimensional manifold or “ridge,” mitigating the slow decay in tangential directions (Dereich et al., 2019).
  • Distributed Consensus and Policy Evaluation: In networked and distributed systems, PR-averaging is essential for both accelerated deterministic bias decay and minimax-optimal $O(1/T)$ variance decay, outperforming previous approaches in communication- or network-limited environments (Zhang et al., 2022).
  • Gradient-Free and Quasi-Stochastic Algorithms: When algorithmic noise is deterministic or quasi-stochastic (e.g., in extremum seeking with sinusoidal probing), PR-averaging can induce super-fast convergence rates, sometimes exceeding $O(n^{-1})$ and attaining nearly quartic decay under specialized conditions (Lauand et al., 2022).

The technique’s prominence is reflected in its widespread adoption across areas including off-policy RL, two-timescale actor–critic methods, decentralized optimization, and black-box derivative-free learning.

7. Contemporary Extensions and Future Directions

Recent research continues to extend Polyak–Ruppert averaging to increasingly complex and realistic model classes:

  • Time-Dependent and Biased Streaming Data: Robust non-asymptotic analyses have clarified how averaging interacts with temporal dependencies, slow-mixing Markovian data, and persistent bias, ensuring optimal rates under broad conditions (Godichon-Baggioni et al., 2022, Lauand et al., 2024).
  • General-Purpose Nonasymptotic Concentration: Modular frameworks now produce sharp high-probability error guarantees for averaged stochastic approximation in both contractive and non-contractive regimes, including off-policy reinforcement learning (Khodadadian et al., 27 May 2025).
  • Two-Timescale and Coupled Processes: Nonasymptotic CLTs and Berry–Esseen-type bounds elucidate how rate-optimal averaging is preserved (or lost) depending on the separation of fast and slow time scales, and provide tuning prescriptions for joint minimax behavior (Kong et al., 14 Feb 2025, Butyrin et al., 11 Aug 2025).
  • Inference under Model Misspecification: Extended to zeroth-order and order-oracle black-box optimization, PR averaging secures efficient confidence bounds and minimal-variance estimates even when classical gradient information is inaccessible (Smirnov et al., 2024, Jin et al., 2021).
  • Practical Methodologies: Online covariance estimation, batch means, and multiplier bootstrap facilitate finite-sample inference, while geometric averaging provides implicit regularization and enables fast parameter tuning (Zhu et al., 2019, Samsonov et al., 2024, Neu et al., 2018).

A plausible implication is continued broadening of PR-averaging to high-dimensional, nonconvex, nonstationary, and distributed machine learning scenarios, especially as the underlying theoretical tools extend to handle less restrictive noise, dependency, and convexity conditions.
