
Polyak-Ruppert Averaging

Updated 11 August 2025
  • Polyak-Ruppert Averaging is a stochastic method that forms estimators from the average of past iterates, offering improved convergence and variance reduction.
  • It achieves optimal statistical efficiency with nonasymptotic error bounds and matches the Cramér–Rao lower bound under mild regularity conditions.
  • The technique extends to reinforcement learning, distributed optimization, and nonconvex settings, providing robust performance across diverse applications.

Polyak-Ruppert (PR) Averaging is a stochastic iterative procedure designed to regularize and accelerate convergence in stochastic approximation, notably in stochastic gradient descent (SGD) and related algorithms. Its core idea is to form the estimator not from the final iterate of a stochastic recursion but from the average (or weighted average) of all previous iterates, thereby reducing variance and improving statistical efficiency. Over the past three decades, PR averaging has become foundational across stochastic optimization, reinforcement learning, online learning, and distributed computation due to its ability to attain asymptotic statistical efficiency and to provide sharp, nonasymptotic performance guarantees.

1. Fundamental Principles and Algorithmic Frameworks

Given a stochastic approximation recursion such as

$$\theta_n = \theta_{n-1} - \gamma_n \nabla f(\theta_{n-1}, \xi_n)$$

the PR average is computed as

$$\hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n \theta_i$$

or, more generally, as a weighted sum. The key insight, first made rigorous by Ruppert (1988) and Polyak and Juditsky (1992), is that while individual iterates may converge slowly and noisily, the average $\hat{\theta}_n$ achieves an $O(1/n)$ mean-squared error (MSE) rate with a variance matching the Cramér–Rao lower bound in the strongly convex case, with minimal tuning and mild regularity conditions (Gadat et al., 2017).
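As a concrete sketch of this recursion and its running average, consider a one-dimensional quadratic with additive Gaussian gradient noise; the objective, noise level, and step-size constants below are illustrative choices only, not from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0              # minimizer of f(x) = (x - theta_star)^2 / 2

theta = 0.0                   # SGD iterate theta_n
theta_bar = 0.0               # Polyak-Ruppert average, updated online
n_steps = 50_000

for n in range(1, n_steps + 1):
    grad = (theta - theta_star) + rng.normal()    # unbiased noisy gradient
    theta -= 0.5 * n ** -0.75 * grad              # step size gamma_n = 0.5 * n^{-3/4}
    theta_bar += (theta - theta_bar) / n          # running mean of theta_1, ..., theta_n

print(theta, theta_bar)
```

The running-mean update is algebraically identical to $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n \theta_i$ but requires no storage of past iterates.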

A typical result, under regularity conditions (strong convexity or its relaxed variants), is the nonasymptotic bound

$$\mathbb{E}\bigl[\|\hat{\theta}_n - \theta^\star\|^2\bigr] \leq \frac{\operatorname{Tr}(\Sigma^\star)}{n} + C\, n^{-r_\beta}$$

with $\Sigma^\star$ the asymptotic covariance structure, and $r_\beta$ a second-order rate exponent determined by the step-size schedule (Gadat et al., 2017). In linear stochastic approximation (LSA) with a constant step size, the bias decays as $O(1/n^2)$ and the MSE as $O(1/n)$ under Hurwitz mean dynamics (Lakshminarayanan et al., 2017, Mou et al., 2020, Durmus et al., 2022).

PR averaging also extends to diverse settings, including Markovian and dependent-data noise models, reinforcement learning, distributed optimization, and certain nonconvex problems, as detailed in the sections below.

2. Theoretical Guarantees: Asymptotic and Nonasymptotic Results

PR averaging achieves several optimality properties, often simultaneously:

  • Minimax Asymptotic MSE: $\hat{\theta}_n$ achieves the Cramér–Rao lower bound for the asymptotic variance, that is,

$$\sqrt{n}\,(\hat{\theta}_n - \theta^\star) \xrightarrow{d} N(0, \Sigma^\star)$$

with $\Sigma^\star = [D^2 f(\theta^\star)]^{-1} S^\star\,[D^2 f(\theta^\star)]^{-1}$ (Gadat et al., 2017, Mou et al., 2020).

  • Tight Nonasymptotic Bounds: For decreasing step sizes $\gamma_n = \gamma n^{-\beta}$ with $\beta \in (1/2, 1)$, optimal second-order corrections appear: the MSE is bounded by $O(1/n)$ plus $O(n^{-r_\beta})$ with $r_\beta = \min\{\beta + 1/2,\, 2-\beta\}$ (Gadat et al., 2017). The choice $\beta = 3/4$ yields the sharp correction $r_{3/4} = 5/4$.
  • High-Probability and Moment Bounds: Concentration inequalities demonstrate that with high probability,

$$\|\hat{\theta}_n - \theta^\star\| = O\!\left(\sqrt{\frac{\operatorname{Tr}(\Sigma^\star)\log(1/\delta)}{n}}\right)$$

up to explicit dimension- and problem-dependent constants (Mou et al., 2020, Durmus et al., 2022, Khodadadian et al., 27 May 2025).

  • Gaussian Approximation and Berry–Esseen Bounds: Rates for the distributional approximation (e.g., in the convex or Wasserstein-1 distance) between the normalized error and the corresponding Gaussian are rigorously established, typically attaining $O(n^{-1/4})$ to $O(1/\sqrt{n})$ convergence for decreasing step sizes with exponents between $1/2$ and $3/4$ (Samsonov et al., 26 May 2024, Sheshukova et al., 10 Feb 2025).

These results remain robust in both i.i.d. and Markovian settings, the latter encompassing reinforcement learning scenarios and online learning from dependent data streams (Godichon-Baggioni et al., 2022, Lauand et al., 28 May 2024, Samsonov et al., 26 May 2024, Levin et al., 7 Aug 2025).
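The variance reduction behind these guarantees is easy to observe numerically: over repeated independent runs on a toy quadratic, the averaged iterate attains a smaller mean-squared error than the last iterate. All constants below are arbitrary illustrative choices:

```python
import numpy as np

def run_sgd(rng, n_steps=5_000, gamma0=0.5, beta=0.75):
    """One SGD run on f(x) = x^2 / 2 with additive noise; returns (last iterate, PR average)."""
    theta, theta_bar = 1.0, 0.0
    for n in range(1, n_steps + 1):
        theta -= gamma0 * n ** -beta * (theta + rng.normal())  # noisy gradient step
        theta_bar += (theta - theta_bar) / n                   # online average
    return theta, theta_bar

rng = np.random.default_rng(1)
runs = [run_sgd(rng) for _ in range(200)]
mse_last = np.mean([last ** 2 for last, _ in runs])   # the minimizer is 0
mse_avg = np.mean([avg ** 2 for _, avg in runs])
print(mse_last, mse_avg)
```

With this schedule the averaged MSE tracks the optimal $\operatorname{Tr}(\Sigma^\star)/n$ scale, while the last iterate's error is dominated by the larger $O(\gamma_n)$ fluctuation.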

3. PR Averaging Beyond Strong Convexity: KL and Nonconvex Settings

Classical PR theory assumes uniform strong convexity. However, the modern literature extends optimal convergence rates to non-strongly convex and even certain nonconvex problems:

  • Kurdyka–Łojasiewicz-type conditions generalize convexity and provide sufficient control of the geometry, enabling $O(1/n)$ leading-order rates and sharp higher-order terms for the averaged iterate (Gadat et al., 2017). This encompasses objectives where the gradient only vanishes sufficiently fast (e.g., logistic regression, recursive quantile estimation).
  • Stable Manifolds: When minimizers are not isolated but lie on a manifold, central limit theorems for the PR average are recovered in the normal directions, with larger tangential oscillations that do not affect statistical efficiency after appropriate projection (Dereich et al., 2019).
  • Multilevel and Bias-Corrected Extensions: In algorithms with multilevel estimators or Markovian noise, PR averaging yields CLTs provided deterministic bias is controlled; under suitable extrapolation (e.g., Richardson–Romberg), even the persistent $O(\alpha)$ bias (with constant step size) can be eliminated (Dereich, 2019, Levin et al., 7 Aug 2025).

4. Extensions: Variants, Regularization, and Bootstrap Inference

Several variants and methodological extensions of PR averaging have been proposed:

  • Geometric (Exponentially Weighted) PR Averaging: By weighting iterates with geometrically decaying weights, the averaging induces implicit regularization, and the limiting estimator asymptotically matches ridge regression (Neu et al., 2018). For linear regression with step size $\eta$ and decay $\lambda$, the weighted average

$$\bar{w}_n = \frac{\sum_{t=0}^n (1-\eta\lambda)^t\, w_t}{\sum_{t=0}^n (1-\eta\lambda)^t}$$

converges to the Tikhonov (ridge) solution as $n \to \infty$.
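This correspondence can be checked numerically. The sketch below uses deterministic full gradients in place of SGD for clarity; the toy data, `eta`, and `lam` are arbitrary choices, and with decay factor $q = 1 - \eta\lambda$ the weighted average approaches a ridge solution with regularization close to $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
H, b = X.T @ X / 50, X.T @ y / 50      # least-squares Hessian and linear term

eta, lam, n_steps = 0.05, 0.1, 20_000
q = 1.0 - eta * lam                    # geometric decay factor

w = np.zeros(3)
num, den = np.zeros(3), 0.0            # geometrically weighted running sums
for t in range(n_steps):
    num += q ** t * w
    den += q ** t
    w -= eta * (H @ w - b)             # plain, unregularized gradient descent

w_geo = num / den                      # geometrically averaged iterate
w_ridge = np.linalg.solve(H + lam * np.eye(3), b)   # Tikhonov/ridge solution
print(np.max(np.abs(w_geo - w_ridge)))
```

No explicit regularizer ever enters the gradient step; the ridge shrinkage emerges purely from the geometric weighting of the iterates.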

  • Batch Means and Multiplier Bootstrap: For inference (confidence regions) on $\theta^\star$, batch means and multiplier bootstrap procedures exploit the process-level functional CLT for PR averages (Zhu et al., 2019, Samsonov et al., 26 May 2024, Sheshukova et al., 10 Feb 2025). The batch means method splits the trajectory into batches, empirically cancels the unknown asymptotic covariance, and produces regions with prescribed asymptotic coverage. The multiplier bootstrap runs LSA/SGD recursions with random weights to simulate the distribution of the error in a computationally efficient online fashion.
  • Distributed and Decentralized Averaging: In multi-agent and distributed optimization, dual-based Nesterov-accelerated algorithms coupled with PR averaging achieve exponential bias decay (in network parameters) and $O(1/T)$ error, outperforming primal schemes especially on poorly connected networks (Zhang et al., 2022).
  • Streaming and Time-Dependent Data: PR averaging is particularly effective when combined with time-varying mini-batches, as it stabilizes the variance reduction obtained from large batches while preserving the optimal asymptotic (Cramér–Rao) constant (Godichon-Baggioni et al., 2021, Godichon-Baggioni et al., 2022).
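A simplified sketch of the batch-means construction described above (a constant step size keeps the trajectory's tail roughly stationary; the batch count, burn-in, and all other constants are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = 1.0
alpha, n_steps, burn_in = 0.01, 100_000, 10_000

theta = 0.0
trajectory = np.empty(n_steps)
for n in range(n_steps):
    grad = (theta - theta_star) + rng.normal()    # noisy gradient of (x - 1)^2 / 2
    theta -= alpha * grad
    trajectory[n] = theta

# Batch means: split the post-burn-in trajectory into batches; the spread of
# the batch averages estimates the variance of the overall average without
# knowing the asymptotic covariance.
tail = trajectory[burn_in:]                        # 90_000 samples
n_batches = 30
batch_means = tail.reshape(n_batches, -1).mean(axis=1)
center = tail.mean()
se = batch_means.std(ddof=1) / np.sqrt(n_batches)
ci = (center - 2.05 * se, center + 2.05 * se)      # ~95% t-interval, 29 dof
print(center, ci)
```

The batches must be long relative to the autocorrelation time of the iterates (here roughly $1/\alpha$ steps) for the batch averages to be nearly independent.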

5. Bias, Step-Size, and Statistical Efficiency

The regularization and statistical optimality of PR averaging depend crucially on the interplay of bias, variance, and step-size:

  • Bias–Variance Decomposition: For linear stochastic approximation with step size $\alpha_n = \alpha_0 n^{-\rho}$, the averaged error decomposes into an $\alpha_n \beta_n$ bias term plus an $O(1/\sqrt{n})$ variance term. When $\rho > 1/2$, the bias decays sufficiently fast and does not dominate; otherwise, it may preclude fast convergence unless the bias constant vanishes (Lauand et al., 28 May 2024).
  • Persistent Bias under Constant Step Size and Markov Noise: In LSA with Markovian noise, even after averaging, an $O(\alpha)$ bias may remain. Richardson–Romberg extrapolation (running the algorithm at two step sizes and subtracting) cancels this bias, restoring optimal rates (Levin et al., 7 Aug 2025).
  • High-Order Corrections: Nonasymptotic bounds include higher-order terms of $O(n^{-r_\beta})$, where $r_\beta$ depends on the decay rate of the step size, allowing practitioners to trade sharpness of error for step-size simplicity (Gadat et al., 2017, Sheshukova et al., 10 Feb 2025).
  • Variance Constants: The leading constant in the asymptotic covariance is problem-dependent but independent of detailed step-size tuning, provided conditions are met for averaging to be effective; for LSA, this constant matches the minimax lower bound (Durmus et al., 2022, Samsonov et al., 26 May 2024).
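The Richardson–Romberg idea can be sketched with a toy linear recursion driven by a sticky two-state Markov chain whose state modulates both the gain and the target, so that the averaged iterate retains a step-size-dependent bias; the chain, gains, targets, and step sizes below are all illustrative choices:

```python
import numpy as np

def averaged_lsa(alpha, rng, n_steps=500_000):
    """Constant-step recursion with Markov-modulated gain a(x) and target b(x);
    returns the Polyak-Ruppert average, which keeps an O(alpha) bias."""
    a, b = (0.5, 1.5), (0.0, 2.0)
    x, theta, theta_bar = 0, 0.0, 0.0
    for n in range(1, n_steps + 1):
        if rng.random() < 0.1:            # sticky chain: switch state w.p. 0.1
            x = 1 - x
        theta -= alpha * a[x] * (theta - b[x])
        theta_bar += (theta - theta_bar) / n
    return theta_bar

rng = np.random.default_rng(3)
theta_star = 1.5                          # solves 0.5*t + 1.5*(t - 2) = 0 (uniform stationary law)
avg_1 = averaged_lsa(0.10, rng)
avg_2 = averaged_lsa(0.05, rng)
theta_rr = 2 * avg_2 - avg_1              # extrapolation cancels the first-order bias in alpha
print(abs(avg_1 - theta_star), abs(theta_rr - theta_star))
```

Because the chain state is correlated with the recent iterates, each single-step-size average sits a bias of order $\alpha$ away from $\theta^\star$; combining the two runs removes that leading term.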

6. Applications and Empirical Evidence

PR averaging underpins many core learning and control procedures:

  • Reinforcement Learning: PR averaging enables tabular and function-approximation TD-learning, GTD, and Q-learning to efficiently estimate value functions and optimal Q-values with minimax-optimal error scaling, robust confidence intervals, and sample complexity matching known lower bounds (Lakshminarayanan et al., 2017, Li et al., 2021, Durmus et al., 2022, Samsonov et al., 26 May 2024).
  • Stochastic Order-Oracles and Black-Box Optimization: In scenarios relying only on order comparisons (rather than exact function values), PR averaging reduces the asymptotic covariance dependence from $[\nabla^2 f(x^*)]^{-1}$ to $[\nabla^2 f(x^*)]^{-2}$, narrowing the distribution of the estimator around the true minimizer and eliminating dependence on unknown constants (Smirnov et al., 24 Nov 2024).
  • Extremum Seeking and Gradient-Free Optimization: In deterministic quasi-stochastic settings, PR averaging can accelerate the MSE rate to $O(n^{-4+\delta})$ (for any $\delta > 0$), beyond what is achievable with vanilla quasi-Monte Carlo or unaveraged QSA, conditional on the design of appropriate probing signals and bias cancellation (Lauand et al., 2022).
  • Two-Time-Scale Methods: When fast and slow variables are present (e.g., actor–critic), PR averaging yields $O(1/\sqrt{n})$ expected error for both variable sets, surpassing prior finite-sample guarantees (Kong et al., 14 Feb 2025).

Empirical observations confirm that PR averaging stabilizes iterates, lowers MSE faster than unaveraged methods, and performs robustly to bias and dependencies in data streams, thereby delivering nearly optimal statistical and computational performance (Godichon-Baggioni et al., 2022, Durmus et al., 2022).
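As one concrete reinforcement-learning instance, the sketch below runs tabular TD(0) with a tail (Polyak-Ruppert) average on a two-state Markov reward process and compares against the analytic value function $V = (I - \gamma P)^{-1} r$; the transition matrix, rewards, discount, and step-size schedule are all arbitrary illustrative choices:

```python
import numpy as np

# Two-state Markov reward process with analytic value function V = (I - g P)^{-1} r.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
r = np.array([1.0, 0.0])
g = 0.9
V_true = np.linalg.solve(np.eye(2) - g * P, r)

rng = np.random.default_rng(4)
V = np.zeros(2)
V_bar = np.zeros(2)
n_steps, burn_in = 200_000, 100_000
s = 0
for n in range(1, n_steps + 1):
    s_next = rng.choice(2, p=P[s])                    # sample a transition
    alpha = (1 + n) ** -0.6                           # slowly decaying step size
    V[s] += alpha * (r[s] + g * V[s_next] - V[s])     # TD(0) update
    if n > burn_in:                                   # tail (Polyak-Ruppert) average
        V_bar += (V - V_bar) / (n - burn_in)
    s = s_next
print(V_bar, V_true)
```

Averaging only the tail of the trajectory discards the transient phase, so the averaged estimate is not dragged toward the zero initialization.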

7. Limitations, Parameter Choices, and Open Directions

Despite its broad applicability and optimality, effective use of PR averaging relies on several key considerations:

  • Step-Size Scheduling: Theoretical optimality (MSE, variance minimization) is guaranteed for step-size schedules $\gamma_n = \gamma n^{-\beta}$ with $\beta \in (1/2, 1)$. Constant step sizes are effective only under strong mixing (or in certain i.i.d. or additive-noise regimes) and when bias is controlled. Persistent bias under a constant step size (especially with Markovian noise) may require extrapolation schemes (Lauand et al., 28 May 2024, Levin et al., 7 Aug 2025).
  • Bias Sensitivity: PR averaging cancels variance optimally but not deterministic bias. In applications with model misspecification, non-vanishing drift, or insufficiently decaying step-size, bias must be controlled via more refined procedures.
  • Nonasymptotic Tuning: Although central limit theorems and high-probability bounds are sharp, finite-sample constants, batch parameters, and mini-batch schedules can influence pre-asymptotic performance. Careful calibration (via batch means, multiplier bootstrap, or step-size adaptation heuristics) remains an ongoing area (Zhu et al., 2019, Samsonov et al., 26 May 2024, Khodadadian et al., 27 May 2025).
  • Nonconvex and Pathological Settings: While KL-type results give broad coverage, further generalization of PR averaging to non-Euclidean geometries, highly structured nonconvex objectives, or non-Markovian dependencies remains active (Gadat et al., 2017, Dereich et al., 2019).
  • Distributional Robustness and Heavy-Tailed Regimes: PR averaging's guarantees are classically established under subgaussian or exponentially tailed noise; behavior (and concentration) under heavy-tailed or adversarial noise is less well understood.

In summary, Polyak-Ruppert averaging is a central tool in stochastic approximation, offering minimax-optimal statistical efficiency, robust nonasymptotic guarantees, and practical robustness to step-size tuning and problem structure. Its current theoretical development encompasses strong/non-strong convexity, bias correction and extrapolation, high-probability deviation bounds, and powerful bootstrap methodologies for statistical inference, all supported by empirical validation in modern large-scale statistical and learning environments.
