Polyak-Ruppert Averaged Estimator
- The Polyak-Ruppert averaged estimator is defined as the running average of iterates from stochastic approximation methods, achieving variance reduction and optimal asymptotic covariance.
- It applies to diverse settings—including linear, streaming, and two-timescale frameworks—yielding non-asymptotic error bounds and refined Gaussian approximations.
- The methodology underpins efficiency in reinforcement learning, adaptive preconditioning, and momentum-based algorithms by recovering optimal first-order statistical properties.
The Polyak–Ruppert averaged estimator is the averaged output of a stochastic approximation procedure, most commonly the arithmetic mean of successive iterates of Robbins–Monro or stochastic gradient descent. In its classical form, one seeks to minimize a smooth function $f:\mathbb{R}^d\to\mathbb{R}$ with unique minimizer $\theta^\star$, runs the stochastic recursion
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$
and then replaces the last iterate by
$\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$
The central phenomenon is variance reduction: under the standard conditions of the Polyak–Juditsky theory, the averaged estimator attains the $O(1/n)$ scale with asymptotically optimal covariance, and subsequent work has extended that conclusion to weakly convex, linear, Markovian, reinforcement-learning, multilevel, two-timescale, preconditioned, momentum, streaming, and zeroth-order settings (Gadat et al., 2017, Godichon-Baggioni et al., 2021, Li et al., 2021).
1. Classical formulation and averaging schemes
In the basic stochastic-gradient setting, the Polyak–Ruppert estimator is the running average of raw stochastic approximation iterates. The canonical update is the Robbins–Monro recursion
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$
where $(\Delta M_{n+1})$ is a martingale increment with conditional covariance $S(\theta_n)$, and the averaged estimator is $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$ (Gadat et al., 2017). In linear stochastic approximation, the same principle appears as averaging iterates of a linear recursion for solving $A\theta^\star=b$, sometimes in full-average form and sometimes in tail-average form, for example
$\theta^\star$0
or, with burn-in, $\theta^\star$1 (Samsonov et al., 2024, Durmus et al., 2022).
The averaging operator is not unique. In streaming stochastic approximation with time-varying mini-batches $\theta^\star$2, the averaged estimator is weighted by the arriving batch sizes,
$\theta^\star$3
and this admits a recursive implementation (Godichon-Baggioni et al., 2021). In multilevel stochastic approximation, the Polyak–Ruppert average is written
$\theta^\star$4
with a nondecreasing weight sequence $\theta^\star$5 (Dereich, 2019). In continuous-time quasi-stochastic approximation, the same idea becomes time averaging over a terminal window,
$\theta^\star$6
with $\theta^\star$7 a burn-in time (Lauand et al., 2022).
These variants share the same structural purpose: they suppress the high-variance component of the raw recursion while preserving the first-order statistical information. This suggests that “Polyak–Ruppert averaging” is best understood as a family of averaging operators applied to stochastic approximation trajectories rather than as a single formula.
2. Structural assumptions and asymptotic efficiency
A standard analytical framework assumes $\theta^\star$8, $\theta^\star$9-Lipschitz gradient, positive-definite Hessian $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$0 at the minimizer, and martingale-difference noise with conditional covariance $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$1 converging to $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$2 (Gadat et al., 2017). In the strongly convex case, one also imposes moment bounds such as
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$3
while a weaker alternative replaces global strong convexity by a Kurdyka–Łojasiewicz-type condition
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$4
with bounded $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$5 (Gadat et al., 2017).
Under these assumptions, the asymptotic covariance of the averaged estimator takes the classical sandwich form. In the smooth optimization setting,
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$6
and the leading $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$7 term in the mean-squared error is exactly $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$8, which matches the Cramér–Rao lower bound (Gadat et al., 2017). In streaming stochastic approximation, the corresponding constant is
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n;X_{n+1}),$9
and the leading term $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$0 again attains the Cramér–Rao lower bound (Godichon-Baggioni et al., 2021).
The same efficiency interpretation persists in reinforcement learning. For averaged tabular Q-learning, $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$1 is shown to be a regular asymptotically linear estimator for $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$2 with influence function
$\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$3
and no other regular estimator can attain smaller asymptotic variance than $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$4 (Li et al., 2021). In this sense, Polyak–Ruppert averaging is not merely a variance-reduction heuristic; in the settings above it is an efficiency mechanism that recovers the optimal first-order covariance without prior knowledge of the strong-convexity parameter or Hessian (Gadat et al., 2017).
3. Non-asymptotic theory and the extension beyond strong convexity
A central non-asymptotic result is the bound established for step sizes $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$5 with $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$6. Assuming $\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$7-consistency of the raw iterates,
$\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$8
one obtains
$\bar\theta_n=\frac1n\sum_{k=1}^n\theta_k.$9
Optimizing $O(1/n)$0 yields $O(1/n)$1 and
$O(1/n)$2
so the averaged estimator is non-asymptotically minimax-optimal up to a negligible second-order term (Gadat et al., 2017).
A notable feature of that result is that it does not require uniform strong convexity. The same theorem applies when $O(1/n)$3 satisfies the global Kurdyka–Łojasiewicz-type condition, including pathological examples such as online logistic regression and recursive quantile estimation. For logistic loss with bounded design, the paper verifies the global KL condition with $O(1/n)$4 and derives
$O(1/n)$5
while for recursive quantile estimation it gives
$O(1/n)$6
with the Cramér–Rao-optimal leading term (Gadat et al., 2017).
High-probability analogues are now available in more abstract stochastic approximation models. A general-purpose theorem shows that if one already has concentration bounds for the raw iterates, then the averaged iterate $O(1/n)$7 satisfies a sharp non-asymptotic bound with leading order $O(1/n)$8, and a matching lower-bound example shows that the resulting $O(1/n)$9 scale cannot be improved in order, up to constants (Khodadadian et al., 27 May 2025). For linear stochastic approximation with i.i.d. data or uniformly geometrically ergodic Markov noise, finite-time high-probability bounds recover the Gaussian leading term with covariance $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$0, and all dependence on the dimension $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$1 enters only through $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$2 in the step size and in exponentially small transient terms (Durmus et al., 2022).
4. Linear, streaming, multilevel, and two-timescale variants
Linear stochastic approximation has provided a particularly precise laboratory for Polyak–Ruppert analysis. In the constant-step Hurwitz case, the averaged iterate obeys a CLT with covariance
$\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$3
where $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$4 is an $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$5 correction to the classical Polyak–Ruppert covariance; in the critical case where $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$6 has non-negative real parts and is diagonalizable, the averaged procedure still achieves an $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$7 mean-squared error rate (Mou et al., 2020). This refined picture shows that PR averaging may preserve the optimal $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$8 scale even when the drift is only marginally stable.
In the streaming framework, Polyak–Ruppert averaging is combined with time-varying mini-batches $\theta_{n+1}=\theta_n-\gamma_{n+1}\nabla f(\theta_n)+\gamma_{n+1}\Delta M_{n+1},$9 and learning rates $(\Delta M_{n+1})$0. Under smooth, convex, noise, Hessian-Lipschitz, and score-covariance assumptions, the averaged iterate achieves $(\Delta M_{n+1})$1 in both the constant and time-varying mini-batch regimes, and the schedule $(\Delta M_{n+1})$2 fixes the effective exponent $(\Delta M_{n+1})$3 for all $(\Delta M_{n+1})$4, making the method robust to the arrival rate (Godichon-Baggioni et al., 2021). The paper emphasizes that increasing mini-batches lower the intercept of the error curve while averaging yields the flat slope $(\Delta M_{n+1})$5 on logarithmic scale (Godichon-Baggioni et al., 2021).
Multilevel stochastic approximation combines PR averaging with a hierarchy of biased approximations $(\Delta M_{n+1})$6. Under mild step-size, weight, contractivity, variance, bias, and cost assumptions, the averaged estimator satisfies CLTs both in a slow regime with error $(\Delta M_{n+1})$7, $(\Delta M_{n+1})$8, and in a critical regime with error $(\Delta M_{n+1})$9, with sandwich covariance $S(\theta_n)$0 up to explicit scalar prefactors (Dereich, 2019). The resulting error-versus-cost rates match classical multilevel Monte Carlo rates up to logarithms (Dereich, 2019).
Two-timescale linear stochastic approximation introduces a further distinction between last-iterate and averaged behavior. For the PR average $S(\theta_n)$1 in a scheme with step sizes $S(\theta_n)$2, $S(\theta_n)$3, $S(\theta_n)$4, one obtains the convex-distance normal approximation bound
$S(\theta_n)$5
Optimizing $S(\theta_n)$6 gives rate $S(\theta_n)$7, and, unlike the last iterate, the PR-averaged rate is minimized when $S(\theta_n)$8, so larger timescale separation degrades the averaged regime (Butyrin et al., 11 Aug 2025).
A further variant arises in quasi-stochastic approximation. In that deterministic-noise setting, PR averaging over $S(\theta_n)$9 yields
$\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$0
and if the design ensures $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$1, then $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$2. Taking $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$3 arbitrarily close to $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$4 gives $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$5 for any $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$6 (Lauand et al., 2022).
5. Gaussian approximation, bootstrap, and online inference
The classical asymptotic normality of PR averages has been strengthened into quantitative Gaussian approximation and bootstrap theory. For decreasing-step linear stochastic approximation, a Berry–Esseen bound in convex distance gives the fastest rate $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$7 at $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$8, and a multiplier bootstrap based on perturbed LSA recursions yields non-asymptotically valid confidence intervals with the same $\bar\theta_n=\tfrac1n\sum_{k=1}^n\theta_k$9 scale (Samsonov et al., 2024). A later refinement improves the convex-distance approximation to
$A\theta^\star=b$0
when approximating $A\theta^\star=b$1 by the Gaussian law with Polyak–Juditsky covariance $A\theta^\star=b$2, and establishes bootstrap validity at rate $A\theta^\star=b$3 up to logarithmic and smaller remainder terms (Butyrin et al., 14 Oct 2025).
For nonlinear strongly convex SGD, the averaged estimator admits a decomposition
$A\theta^\star=b$4
where $A\theta^\star=b$5 is a linearized sum of independent increments and $A\theta^\star=b$6 collects initial-condition, linearization, and Lipschitz-noise remainders. Using Gaussian approximation for nonlinear statistics, one obtains convex-distance CLT bounds of order up to $A\theta^\star=b$7, together with a weighted-SGD multiplier bootstrap that avoids plugging in the limiting covariance $A\theta^\star=b$8 directly (Sheshukova et al., 10 Feb 2025). This places Polyak–Ruppert inference within a non-asymptotic framework rather than a purely qualitative CLT regime.
Reinforcement learning provides a second major inference domain. In tabular synchronous Q-learning, the standardized partial-sum process
$A\theta^\star=b$9
converges weakly in $\theta^\star$00 to $\theta^\star$01, and the induced FCLT implies a fully online inference method for reinforcement learning (Li et al., 2021). The same paper proves that $\theta^\star$02 is semiparametric-efficient and gives a nonasymptotic $\theta^\star$03-error bound whose leading term matches the instance-dependent lower bound up to logarithms (Li et al., 2021).
In zeroth-order optimization, the averaged stochastic zeroth-order gradient algorithm $\theta^\star$04 satisfies
$\theta^\star$05
for the Gaussian-smoothed objective $\theta^\star$06, with
$\theta^\star$07
The paper further constructs an online covariance estimator $\theta^\star$08 via overlapping batches and derives asymptotically valid confidence intervals and ellipsoids (Jin et al., 2021).
6. Adaptive preconditioning, momentum, and limits of averaging
One recurrent misconception is that averaging automatically removes all first-order distortions introduced by algorithmic modifications. Recent work shows a more qualified picture. For time-varying preconditioning, an exact decomposition isolates the preconditioner sequence into a dynamic remainder $\theta^\star$09, and the Polyak–Ruppert CLT
$\theta^\star$10
holds when the stabilization rate $\theta^\star$11 of $\theta^\star$12 satisfies
$\theta^\star$13
Within the class of polynomial rate hypotheses used in the upper bound, this threshold is sharp; the paper also verifies $\theta^\star$14 for SA-AdaGrad, SA-RMSProp, and SA-ONS, so these methods preserve the same first-order sandwich covariance, and under bounded inputs they also preserve the $\theta^\star$15 Wasserstein rate at $\theta^\star$16 (An et al., 26 Apr 2026).
Momentum and non-convergent adaptive preconditioning require a different analysis. For SA-Adam with first-moment buffer $\theta^\star$17, diagonal preconditioner $\theta^\star$18, step sizes $\theta^\star$19, and momentum gains $\theta^\star$20 with $\theta^\star$21, the augmented-state analysis yields
$\theta^\star$22
so the iterate-marginal covariance is exactly the plain SGD sandwich and the adaptivity is asymptotically invisible. The paper stresses that this holds for SA-Adam with sub-linearly vanishing momentum gain, not constant-$\theta^\star$23 deployed Adam; with coupled $\theta^\star$24 weight decay, the limit becomes the ridge-penalized sandwich (An et al., 15 Jun 2026).
A different limitation appears in constant-step LSA under Markovian noise. There, PR averaging reduces variance but does not remove the invariant-distribution bias: under stationarity,
$\theta^\star$25
and the paper states explicitly that the $\theta^\star$26 bias cannot be eliminated by PR averaging (Levin et al., 7 Aug 2025). Richardson–Romberg extrapolation,
$\theta^\star$27
cancels that linear bias term while preserving the $\theta^\star$28 variance term with asymptotically optimal covariance $\theta^\star$29 (Levin et al., 7 Aug 2025).
Taken together, these results delineate the scope of the Polyak–Ruppert principle. Averaging reliably recovers the optimal first-order variance in a remarkably broad class of stochastic approximation algorithms, but its interaction with constant step sizes, Markov dependence, timescale separation, dynamic preconditioning, and momentum is structure-dependent rather than automatic.