Polyak-Ruppert Averaging
- Polyak-Ruppert Averaging is a stochastic method that forms estimators from the average of past iterates, offering improved convergence and variance reduction.
- It achieves optimal statistical efficiency with nonasymptotic error bounds and matches the Cramér–Rao lower bound under mild regularity conditions.
- The technique extends to reinforcement learning, distributed optimization, and nonconvex settings, providing robust performance across diverse applications.
Polyak-Ruppert (PR) Averaging is a stochastic iterative procedure designed to regularize and accelerate convergence of stochastic approximation methods, notably stochastic gradient descent (SGD) and related recursions. Its core idea is to form the estimator not from the final iterate of a stochastic recursion but from the average (or weighted average) of all previous iterates, thereby reducing variance and improving statistical efficiency. Over the past three decades, PR averaging has become foundational across stochastic optimization, reinforcement learning, online learning, and distributed computation due to its ability to attain asymptotic statistical efficiency and provide sharp, nonasymptotic performance guarantees.
1. Fundamental Principles and Algorithmic Frameworks
Given a stochastic approximation recursion such as
$$\theta_{n+1} = \theta_n - \gamma_{n+1}\, H(\theta_n, X_{n+1}),$$
the PR average is computed as
$$\bar{\theta}_n = \frac{1}{n} \sum_{k=1}^{n} \theta_k,$$
or, more generally, as a weighted sum. The key insight, first made rigorous by Ruppert (1988) and Polyak and Juditsky (1992), is that while individual iterates may have slow and noisy convergence, the average achieves a mean-squared error (MSE) rate of $O(1/n)$ with a variance matching the Cramér–Rao lower bound in the strongly convex case, with minimal tuning and mild regularity conditions (Gadat et al., 2017).
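As a toy illustration of the recursion-and-average scheme (a 1-D quadratic $f(\theta) = \theta^2/2$; the step-size constants below are illustrative choices, not prescriptions), averaged SGD can be sketched as:

```python
import numpy as np

def averaged_sgd(n_steps, grad_noise=1.0, theta0=5.0, seed=0):
    """SGD on f(theta) = theta^2/2 with noisy gradients g_n = theta_n + xi_n,
    returning both the last iterate and the Polyak-Ruppert average."""
    rng = np.random.default_rng(seed)
    theta, avg = theta0, 0.0
    for n in range(1, n_steps + 1):
        gamma = 0.5 * n ** -0.75                        # decreasing step-size gamma_n ~ n^{-3/4}
        g = theta + grad_noise * rng.standard_normal()  # unbiased gradient estimate
        theta -= gamma * g                              # stochastic approximation step
        avg += (theta - avg) / n                        # running PR average, O(1) memory
    return theta, avg

last, avg = averaged_sgd(20000)   # both approach theta* = 0; the average is far less noisy
```

Here the average is updated recursively, so no trajectory needs to be stored.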
A typical result, under regularity conditions (strong convexity or its relaxed variants), is the nonasymptotic bound
$$\mathbb{E}\big\|\bar{\theta}_n - \theta^\star\big\|^2 \le \frac{\operatorname{Tr}(\Sigma^\star)}{n} + O(n^{-r}), \qquad r > 1,$$
with $\Sigma^\star$ the asymptotic covariance structure and the second-order rate exponent $r$ determined by the step-size schedule (Gadat et al., 2017). In linear stochastic approximation (LSA) with a constant step-size $\gamma$, the bias from the initial condition decays as $O(1/(\gamma n))$ and the MSE as $O(1/n)$ under Hurwitz mean dynamics (Lakshminarayanan et al., 2017, Mou et al., 2020, Durmus et al., 2022).
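A scalar sketch of this constant step-size regime (illustrative constants): the last iterate keeps fluctuating at scale $O(\sqrt{\gamma})$ around the fixed point, while the running average contracts at the $O(1/\sqrt{n})$ scale:

```python
import numpy as np

def constant_step_lsa(n_steps, gamma=0.05, A=1.0, b=2.0, noise=1.0, seed=1):
    """Scalar LSA: theta_{n+1} = theta_n + gamma * (b - A*theta_n + xi_n),
    whose fixed point is theta* = b / A (A > 0 is the scalar Hurwitz condition)."""
    rng = np.random.default_rng(seed)
    theta, avg = 0.0, 0.0
    for n in range(1, n_steps + 1):
        theta += gamma * (b - A * theta + noise * rng.standard_normal())
        avg += (theta - avg) / n
    return theta, avg

last, avg = constant_step_lsa(50000)   # fixed point is theta* = 2.0
```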
PR averaging also extends to diverse settings including:
- Online streaming optimization (update with mini-batches, non-i.i.d. temporal dependence) (Godichon-Baggioni et al., 2021, Godichon-Baggioni et al., 2022)
- Non-strongly convex, KL-type objectives (e.g., online logistic regression, recursive quantile estimation) (Gadat et al., 2017)
- Geometric or exponentially decaying averages (implicit Tikhonov regularization) (Neu et al., 2018)
- Multilevel stochastic approximation (bias/noise balancing across multiple accuracy levels) (Dereich, 2019)
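Both the uniform and the geometrically weighted averages above can be maintained online in $O(1)$ memory; the accumulator below (a hypothetical helper, not taken from the cited papers) covers both cases:

```python
class WeightedIterateAverage:
    """Online Polyak-Ruppert average in O(1) memory.

    decay=None keeps the uniform average (1/n) * sum_k theta_k;
    decay=lam in (0, 1) keeps a geometrically weighted average, where the
    mass of older iterates is discounted by lam at every step.
    """

    def __init__(self, decay=None):
        self.decay = decay
        self.avg = None
        self.weight = 0.0

    def update(self, theta):
        if self.decay is not None:
            self.weight *= self.decay    # discount the total mass of past iterates
        self.weight += 1.0               # the new iterate enters with unit weight
        if self.avg is None:
            self.avg = float(theta)
        else:
            self.avg += (theta - self.avg) / self.weight
        return self.avg
```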
2. Theoretical Guarantees: Asymptotic and Nonasymptotic Results
PR averaging achieves several optimality properties, often simultaneously:
- Minimax Asymptotic MSE: PR averaging achieves the Cramér–Rao lower bound for the asymptotic variance, that is,
$$\sqrt{n}\,\big(\bar{\theta}_n - \theta^\star\big) \xrightarrow{d} \mathcal{N}\big(0, \Sigma^\star\big),$$
with $\Sigma^\star = \nabla^2 f(\theta^\star)^{-1} S^\star \nabla^2 f(\theta^\star)^{-1}$, where $S^\star$ denotes the noise covariance at the optimum (Gadat et al., 2017, Mou et al., 2020).
- Tight Nonasymptotic Bounds: For decreasing step-size $\gamma_n = \gamma_0 n^{-\beta}$ with $\beta \in (1/2, 1)$, optimal second-order corrections appear: the MSE is bounded by $\operatorname{Tr}(\Sigma^\star)/n$ plus $O(n^{-r_\beta})$ with $r_\beta > 1$ (Gadat et al., 2017). The choice $\beta = 3/4$ yields the sharp correction.
- High-Probability and Moment Bounds: Concentration inequalities demonstrate that with probability at least $1 - \delta$,
$$\big\|\bar{\theta}_n - \theta^\star\big\| \lesssim \sqrt{\frac{\operatorname{Tr}(\Sigma^\star)\,\log(1/\delta)}{n}},$$
up to explicit dimension- and problem-dependent constants (Mou et al., 2020, Durmus et al., 2022, Khodadadian et al., 27 May 2025).
- Gaussian Approximation and Berry–Esseen Bounds: Rates for the distributional approximation (e.g., in the convex or Wasserstein-1 distance) between the normalized error $\sqrt{n}(\bar{\theta}_n - \theta^\star)$ and the corresponding Gaussian are rigorously established, typically attaining rates up to $O(n^{-1/4})$ for decreasing step-size $\gamma_k \propto k^{-\beta}$ with $\beta$ ranging from $1/2$ to $3/4$ (Samsonov et al., 26 May 2024, Sheshukova et al., 10 Feb 2025).
These results remain robust in both i.i.d. and Markovian settings, the latter encompassing reinforcement learning scenarios and online learning from dependent data streams (Godichon-Baggioni et al., 2022, Lauand et al., 28 May 2024, Samsonov et al., 26 May 2024, Levin et al., 7 Aug 2025).
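A quick Monte Carlo check of this Gaussian limit on a toy 1-D quadratic (Hessian $A = 1$, so the asymptotic variance reduces to the gradient-noise variance $\sigma^2 = 1$; the replication count and step-size schedule are illustrative):

```python
import numpy as np

def averaged_error(n_steps, rng, sigma=1.0):
    """One run of averaged SGD on f(theta) = theta^2/2; returns bar(theta)_n - theta*."""
    xi = sigma * rng.standard_normal(n_steps)
    theta, avg = 0.0, 0.0
    for n in range(1, n_steps + 1):
        gamma = n ** -0.75
        theta -= gamma * (theta + xi[n - 1])
        avg += (theta - avg) / n
    return avg

rng = np.random.default_rng(42)
n = 2000
scaled = np.sqrt(n) * np.array([averaged_error(n, rng) for _ in range(400)])
# The empirical std of sqrt(n) * (bar(theta)_n - theta*) should be close to
# the asymptotic value (Sigma*)^{1/2} = sigma = 1.
```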
3. PR Averaging Beyond Strong Convexity: KL and Nonconvex Settings
Classical PR theory assumes uniform strong convexity. However, the modern literature extends optimal convergence rates to non-strongly convex and even certain nonconvex problems:
- Kurdyka–Łojasiewicz-type conditions generalize convexity and provide sufficient control of the geometry, enabling leading-order rates and sharp higher-order terms for the averaged iterate (Gadat et al., 2017). This encompasses objectives where the gradient only vanishes sufficiently fast (e.g., logistic regression, recursive quantile estimation).
- Stable Manifolds: When minimizers are not isolated but lie on a manifold, central limit theorems for the PR average are recovered in the normal directions, with larger tangential oscillations that do not affect statistical efficiency after appropriate projection (Dereich et al., 2019).
- Multilevel and Bias-Corrected Extensions: In algorithms with multilevel estimators or Markovian noise, PR averaging yields CLTs provided deterministic bias is controlled; under suitable extrapolation (e.g., Richardson–Romberg), even the persistent bias (with constant step-size) can be eliminated (Dereich, 2019, Levin et al., 7 Aug 2025).
4. Extensions: Variants, Regularization, and Bootstrap Inference
Several variants and methodological extensions of PR averaging have been proposed:
- Geometric (Exponentially Weighted) PR Averaging: By weighting iterates with geometrically decaying weights, the averaging induces implicit regularization, with the limiting estimator asymptotically matching ridge regression (Neu et al., 2018). For linear regression with step-size $\gamma$ and decay parameter $\lambda \in (0, 1)$, the geometric average
$$\bar{\theta}_n \propto \sum_{k=0}^{n} \lambda^{\,n-k}\, \theta_k$$
converges to the Tikhonov (ridge) solution, with regularization strength on the order of $(1-\lambda)/\gamma$, as $n \to \infty$.
- Batch Means and Multiplier Bootstrap: For inference (confidence regions) on , batch means and multiplier bootstrap procedures exploit the process-level functional CLT for PR averages (Zhu et al., 2019, Samsonov et al., 26 May 2024, Sheshukova et al., 10 Feb 2025). The batch means method splits the trajectory into batches, empirically cancels the unknown asymptotic covariance, and produces regions with prescribed asymptotic coverage. The multiplier bootstrap runs LSA/SGD recursions with random weights to simulate the distribution of the error in a computationally efficient online fashion.
- Distributed and Decentralized Averaging: In multi-agent and distributed optimization, dual-based Nesterov-accelerated algorithms coupled with PR averaging achieve bias that decays exponentially in the network parameters together with optimal $O(1/n)$ error, outperforming primal schemes especially on poorly connected networks (Zhang et al., 2022).
- Streaming and Time-Dependent Data: PR averaging is particularly effective when combined with time-varying mini-batches, as it stabilizes the variance reduction obtained by large batches while keeping the asymptotic (Cramér–Rao) constant optimal (Godichon-Baggioni et al., 2021, Godichon-Baggioni et al., 2022).
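The batch-means construction can be sketched as follows (toy averaged-SGD trajectory around $\theta^\star = 1$; the batch count, step-size constants, and seed are illustrative choices):

```python
import numpy as np

def batch_means_ci(iterates, n_batches=20, z=1.96):
    """Batch-means confidence interval for the Polyak-Ruppert average: the spread
    of contiguous batch means estimates the unknown asymptotic variance, so no
    closed-form covariance is needed."""
    iterates = np.asarray(iterates, dtype=float)
    m = len(iterates) // n_batches
    batch_means = iterates[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    center = batch_means.mean()
    se = batch_means.std(ddof=1) / np.sqrt(n_batches)
    return center - z * se, center + z * se

# toy SGD trajectory for f(theta) = (theta - 1)^2 / 2 with gradient noise
rng = np.random.default_rng(7)
theta, traj = 0.0, []
for n in range(1, 20001):
    gamma = 0.5 * n ** -0.75
    theta -= gamma * ((theta - 1.0) + rng.standard_normal())
    traj.append(theta)
lo, hi = batch_means_ci(traj)   # interval for theta* = 1
```

The mean of the batch means equals the overall PR average, so the interval is centered at the usual estimator.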
5. Bias, Step-Size, and Statistical Efficiency
The regularization and statistical optimality of PR averaging depend crucially on the interplay of bias, variance, and step-size:
- Bias–Variance Decomposition: For linear stochastic approximation with step-size $\gamma_n \propto n^{-\rho}$, the averaged error decomposes as a bias term of order $O(n^{-\rho})$ plus a variance term of order $O(n^{-1/2})$ (in root-mean-square). When $\rho > 1/2$, the bias decays sufficiently fast and does not dominate; otherwise, it may preclude fast convergence unless the bias constant vanishes (Lauand et al., 28 May 2024).
- Persistent Bias under Constant Step-Size and Markov Noise: In LSA with Markovian noise, even after averaging, an $O(\gamma)$ bias may remain. Richardson–Romberg extrapolation (running the algorithm at two step-sizes, e.g. $\gamma$ and $2\gamma$, and combining $2\bar{\theta}^{(\gamma)}_n - \bar{\theta}^{(2\gamma)}_n$) cancels this bias, restoring optimal rates (Levin et al., 7 Aug 2025).
- High-Order Corrections: Nonasymptotic bounds include higher-order terms of order $O(n^{-r})$, where $r > 1$ depends on the decay rate of the step-size, allowing practitioners to trade sharpness of error for step-size simplicity (Gadat et al., 2017, Sheshukova et al., 10 Feb 2025).
- Variance Constants: The leading constant in the asymptotic covariance is problem-dependent but independent of detailed step-size tuning, provided conditions are met for averaging to be effective; for linear LSA, this constant matches the minimax lower bound (Durmus et al., 2022, Samsonov et al., 26 May 2024).
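The Richardson–Romberg fix above can be sketched on a toy scalar recursion; the saturating quadratic drift is an artificial choice that keeps the dynamics globally stable while giving the averaged iterate a bias of roughly $-\gamma\sigma^2/2$ (all constants illustrative):

```python
import numpy as np

def averaged_sa(gamma, n_steps, seed):
    """Constant-step SA for the root theta* = 0 of h(theta) = theta + theta^2/(1+theta^2).
    The curvature of h makes the (tail-)averaged iterate biased by about -gamma/2."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(n_steps)
    theta, avg, count = 0.0, 0.0, 0
    burn = n_steps // 10
    for n in range(n_steps):
        theta -= gamma * (theta + theta**2 / (1.0 + theta**2) + xi[n])
        if n >= burn:                       # average only after a short burn-in
            count += 1
            avg += (theta - avg) / count
    return avg

n = 400000
avg_g  = averaged_sa(0.05, n, seed=3)   # bias roughly -0.025
avg_2g = averaged_sa(0.10, n, seed=4)   # bias roughly -0.05
rr = 2.0 * avg_g - avg_2g               # Richardson-Romberg: leading O(gamma) bias cancels
```

Doubling the step-size doubles the leading bias term, so the linear combination removes it up to $O(\gamma^2)$.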
6. Applications and Empirical Evidence
PR averaging underpins many core learning and control procedures:
- Reinforcement Learning: PR averaging enables tabular and function-approximation TD-learning, GTD, and Q-learning to efficiently estimate value functions and optimal Q-values with minimax-optimal error scaling, robust confidence intervals, and sample complexity matching known lower bounds (Lakshminarayanan et al., 2017, Li et al., 2021, Durmus et al., 2022, Samsonov et al., 26 May 2024).
- Stochastic Order-Oracles and Black-Box Optimization: In scenarios relying only on order comparisons (rather than exact function values), PR averaging reduces the asymptotic covariance of the estimator, narrowing its distribution around the true minimizer and eliminating dependence on unknown problem constants (Smirnov et al., 24 Nov 2024).
- Extremum Seeking and Gradient-Free Optimization: In deterministic quasi-stochastic settings, PR averaging can accelerate the MSE well beyond the $O(1/n)$ stochastic rate, surpassing vanilla quasi-Monte Carlo and unaveraged QSA, conditional on the design of appropriate probing signals and bias cancellation (Lauand et al., 2022).
- Two-Time-Scale Methods: When fast and slow variables are present (e.g., actor–critic), PR averaging yields tight expected-error guarantees for both sets of variables, surpassing prior finite-sample guarantees (Kong et al., 14 Feb 2025).
Empirical observations confirm that PR averaging stabilizes iterates, lowers MSE faster than unaveraged methods, and is robust to bias and dependence in data streams, thereby delivering nearly optimal statistical and computational performance (Godichon-Baggioni et al., 2022, Durmus et al., 2022).
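In the reinforcement-learning setting, the simplest instance is tabular TD(0) with a tail-averaged value table; the two-state chain below is a toy check against the closed-form solution of the Bellman equation (all constants illustrative):

```python
import numpy as np

# Tabular TD(0) with a tail-averaged value table on a toy 2-state Markov chain.
P = np.array([[0.9, 0.1], [0.1, 0.9]])    # transition probabilities
r = np.array([1.0, 0.0])                   # deterministic reward per state
disc = 0.5                                 # discount factor
V_true = np.linalg.solve(np.eye(2) - disc * P, r)   # Bellman: V = r + disc * P @ V

rng = np.random.default_rng(0)
V, V_avg = np.zeros(2), np.zeros(2)
s, count = 0, 0
n_steps = 100000
burn = n_steps // 10
for n in range(1, n_steps + 1):
    s_next = rng.choice(2, p=P[s])
    alpha = 0.5 * n ** -0.75                           # decreasing step-size
    V[s] += alpha * (r[s] + disc * V[s_next] - V[s])   # TD(0) update
    if n > burn:                                       # tail-average after burn-in
        count += 1
        V_avg += (V - V_avg) / count
    s = s_next
```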
7. Limitations, Parameter Choices, and Open Directions
Despite its broad applicability and optimality, effective use of PR averaging relies on several key considerations:
- Step-Size Scheduling: Theoretical optimality (MSE, variance minimization) is guaranteed for step-size schedules $\gamma_n \propto n^{-\beta}$ with $\beta \in (1/2, 1)$. Constant step-sizes are effective only under strong mixing (or in certain i.i.d. or additive-noise regimes) and when bias is controlled. Persistent bias under constant step-size (especially with Markovian noise) may require extrapolation schemes (Lauand et al., 28 May 2024, Levin et al., 7 Aug 2025).
- Bias Sensitivity: PR averaging cancels variance optimally but not deterministic bias. In applications with model misspecification, non-vanishing drift, or insufficiently decaying step-size, bias must be controlled via more refined procedures.
- Nonasymptotic Tuning: Although central limit theorems and high-probability bounds are sharp, finite-sample constants, batch parameters, and mini-batch schedules can influence pre-asymptotic performance. Careful calibration (via batch means, multiplier bootstrap, or step-size adaptation heuristics) remains an ongoing area (Zhu et al., 2019, Samsonov et al., 26 May 2024, Khodadadian et al., 27 May 2025).
- Nonconvex and Pathological Settings: While KL-type results give broad coverage, further generalization of PR averaging to non-Euclidean geometries, highly structured nonconvex objectives, or non-Markovian dependencies remains active (Gadat et al., 2017, Dereich et al., 2019).
- Distributional Robustness and Heavy-Tailed Regimes: PR averaging's guarantees are classically established under subgaussian or exponentially tailed noise; behavior (and concentration) under heavy-tailed or adversarial noise is less well understood.
In summary, Polyak-Ruppert averaging is a central tool in stochastic approximation, offering minimax-optimal statistical efficiency, robust nonasymptotic guarantees, and practical robustness to step-size tuning and problem structure. Its current theoretical development encompasses strong/non-strong convexity, bias correction and extrapolation, high-probability deviation bounds, and powerful bootstrap methodologies for statistical inference, all supported by empirical validation in modern large-scale statistical and learning environments.