Multiscale Experience Replay (MER)

Updated 11 January 2026
  • The paper introduces a multi-scale replay schedule that achieves O(1/T) convergence without requiring knowledge of the Markov chain's mixing time.
  • MER is a structured algorithm that employs epoch-based, coarse-to-fine buffer sampling to emulate nearly independent, i.i.d. data performance under Markovian noise.
  • MER outperforms serial stochastic approximation and skip-sampling methods by automatically adapting to the chain’s correlation structure for robust, efficient convergence.

Multiscale Experience Replay (MER) is a provably correct algorithmic framework for solving stochastic variational inequalities (VIs) when sample observations are generated from a Markov chain and stored in a finite replay buffer. MER circumvents the bias and slow convergence rates inherent in standard serial stochastic approximation (SA) under Markovian noise by deploying a multi-scale sampling schedule over the buffer, emulating nearly independent sampling and achieving iteration complexity rates characteristic of i.i.d. scenarios—without requiring any knowledge of the Markov chain’s mixing time (Nakul et al., 4 Jan 2026).

1. Problem Setting: Stochastic VIs with Markovian Data and Buffer Bias

MER addresses the problem of finding $x^* \in X \subset \mathbb{R}^n$ such that the monotone variational inequality

$$\langle F(x^*),\, x - x^* \rangle \ge 0, \quad \forall x \in X$$

is satisfied, where $F$ is assumed $L$-Lipschitz and $\mu$-strongly monotone:

$$\langle F(x) - F(y),\, x - y \rangle \ge \mu \|x - y\|^2, \qquad \forall x, y \in X.$$

Instead of direct access to $F(x)$, only stochastic oracle evaluations $\widetilde{F}(x,\xi)$ with mean $F(x)$ are available, and the samples $\xi_t$ are correlated through a Markov chain with mixing time $t_{\rm mix}$. Naive serial SA iterations,

$$x_{t+1} = \Pi_X\big[x_t - \eta\,\widetilde{F}(x_t, \xi_t)\big],$$

accumulate bias proportional to $\bar{\tau}/t$ (with $\bar{\tau} \approx t_{\rm mix}$), resulting in suboptimal $O(\bar{\tau}/T)$ convergence. Classical skip-sampling (e.g., CTD) can restore the $O(1/T)$ rate given prior knowledge of the mixing time, but is brittle to poor tuning.
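
To make the baseline concrete, here is a minimal sketch of this serial SA iteration in Python; the oracle `F_tilde`, the Markov-sample iterator `chain`, and the ball constraint behind `proj_ball` are illustrative assumptions, not artifacts of the paper:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    """Euclidean projection onto X = {x : ||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def serial_sa(F_tilde, chain, x0, eta, T):
    """Naive serial SA: consume Markov samples xi_t in order.
    Correlation between consecutive samples induces the bias behind
    the O(tau_bar / T) rate discussed above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        xi = next(chain)                 # next (correlated) Markov sample
        x = proj_ball(x - eta * F_tilde(x, xi))
    return x
```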

MER relies solely on standard buffer access (arbitrary selection from a buffer of size $B$ containing recent samples) and deploys a principled multi-epoch, multi-scale usage pattern that automatically adapts to the chain’s correlation structure.

2. MER Algorithm: Multi-Scale Epoch-Based Replay

MER operates in $K \approx \log_2 B$ epochs, indexed by $k$, where epoch $k$ uses a buffer sampling gap $\tau_k = B/2^k$ for exactly $T_k = 2^k$ updates. This geometric progression traverses coarse-to-fine time scales, with early epochs exploiting widely separated samples and later epochs focusing on finer spacings.
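
The schedule is easy to tabulate; a quick sketch, assuming $B$ is a power of two so every gap $\tau_k$ is an integer:

```python
B = 1024                        # buffer size (power of two for clean gaps)
K = B.bit_length() - 1          # K = log2(B) epochs
schedule = [(B // 2**k, 2**k) for k in range(1, K + 1)]
# (tau_k, T_k) pairs: (512, 2), (256, 4), ..., (2, 512), (1, 1024);
# total updates sum to 2^(K+1) - 2 = 2B - 2, i.e. about 2B.
```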

Algorithmic steps per epoch kk:

  • Initialize $x^{(k)}_1 \in X$.
  • For $t = 1, \ldots, T_k$:

    1. Select buffer index $i = t\,\tau_k$ and extract $\xi_i$.
    2. Update:

       $$x_{t+1}^{(k)} = \arg\min_{x \in X} \left\{ \eta_k \langle \widetilde{F}(x_t^{(k)}, \xi_i), x \rangle + \frac{1}{2} \|x_t^{(k)} - x\|^2 \right\}.$$

    3. (Online setting) Replace $\xi_i$ with the next incoming chain sample.
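
Since the objective in step 2 is a linear term plus a quadratic proximal term, the update is exactly a projected gradient step, matching the serial SA iteration in Section 1:

$$x_{t+1}^{(k)} = \Pi_X\!\left[x_t^{(k)} - \eta_k\, \widetilde{F}(x_t^{(k)}, \xi_i)\right].$$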

Pseudo-code:

```
Algorithm MER
Input: buffer {ξ₁, …, ξ_B}, epochs K, step sizes {η_k}
for k = 1 to K:
    set τ_k = B/2^k, T_k = 2^k; re-initialize x₁^(k)
    for t = 1 to T_k:
        pick ξ ← ξ_{t·τ_k}
        x_{t+1}^(k) = argmin_{x∈X} { η_k ⟨F̃(x_t^(k), ξ), x⟩ + ½‖x_t^(k) − x‖² }
        delete ξ_{t·τ_k}, append new incoming sample
    end
end
Output: x_{T_K+1}^(K) or its average
```

If the mixing time were known, a constant gap $\tau_k = t_{\rm mix}$ would suffice (skip-sampling/CTD); MER’s geometric schedule obviates this parameter tuning.
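
A compact runnable rendering of the pseudo-code above, as a sketch under stated assumptions: a Euclidean ball constraint, per-epoch step sizes supplied by the caller, and in-place replacement of used samples standing in for delete-and-append. None of these choices come from the paper.

```python
import numpy as np

def proj_ball(x, radius=1.0):                  # as in the serial SA sketch
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def mer(F_tilde, chain, x_init, B, etas, radius=1.0):
    """Multiscale Experience Replay, following the pseudo-code above.

    F_tilde(x, xi) -- stochastic oracle with mean F(x)
    chain          -- iterator yielding fresh Markov samples
    x_init         -- starting point in X
    B              -- buffer size (a power of two)
    etas           -- mapping k -> step size eta_k, for k = 1..K
    """
    buffer = [next(chain) for _ in range(B)]   # fill buffer with B samples
    K = B.bit_length() - 1                     # K = log2(B) epochs
    for k in range(1, K + 1):
        tau_k, T_k = B // 2**k, 2**k           # coarse-to-fine gap / length
        x = np.asarray(x_init, dtype=float)    # re-initialize each epoch
        for t in range(1, T_k + 1):
            xi = buffer[t * tau_k - 1]         # 0-based slot of xi_{t·tau_k}
            # prox step == projected gradient step (see identity above)
            x = proj_ball(x - etas[k] * F_tilde(x, xi), radius)
            # online variant: refresh the used slot with an incoming sample
            buffer[t * tau_k - 1] = next(chain)
    return x                                   # x_{T_K+1}^{(K)}
```

For the offline variant one would simply skip the buffer refresh; returning a running average of the final epoch's iterates, as the pseudo-code's "or average" suggests, is an equally valid output choice.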

3. Convergence Guarantees and Complexity Bounds

MER’s central theoretical results quantify error and iteration complexity as follows. Let $\tau_M = \frac{\ln(18 C/\mu)}{\ln(1/\rho)}$ (with $\rho < 1$ the chain’s geometric mixing rate), $\alpha_k = \tau_M/\tau_k$, and $\bar L = L + \widetilde L_1$.

Theorem 1 (General Convergence):

If $B = \Omega(\tau_M \log \tau_M)$ and

$$\eta_k \asymp \min \left\{ \frac{\mu}{\bar{L}^2(\alpha_k + 1)},\ \frac{\log T_k}{\mu T_k} \right\},$$

then

$$\mathbb{E}\left[\|x_{T_k + 1} - x^*\|^2\right] \le M \left( 1 + \frac{3\mu^2}{8 (\alpha_k + 1) (\zeta^2 + 16\bar L^2)} \right)^{-T_k}(D^2 + 1) + \frac{20 C_M \rho^{\tau_M+\tau_k-1}}{\mu} + O\left( \frac{\alpha_k+1}{\mu^2 T_k} \right),$$

where the $O((\alpha_k+1)/T_k)$ term matches i.i.d. SA up to logarithmic factors when $\alpha_k = O(1)$.
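
In the notation of the `mer` sketch above, Theorem 1's step-size rule could populate the `etas` mapping as follows; the constants $\mu$, $\bar L$, and $\tau_M$ are assumed known here purely for illustration, and absolute constants are dropped:

```python
import math

def theorem1_step(mu, L_bar, tau_M, B, k):
    """Illustrative eta_k from Theorem 1 (up to absolute constants)."""
    tau_k, T_k = B // 2**k, 2**k          # epoch-k gap and length
    alpha_k = tau_M / tau_k               # mixing-to-gap ratio
    return min(mu / (L_bar**2 * (alpha_k + 1)),
               math.log(T_k) / (mu * T_k))
```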

Theorem 2 (i.i.d. Emulation):

For $\tau_k = \beta \tau_M$ (with $\beta > 1$),

$$\left|\, \|\Delta_{T+1}\| - \|\widetilde{\Delta}_{T+1}\| \,\right| \le 3c_0\, \frac{\widetilde{L}_2}{\widetilde{L}_1}\, \sqrt{T}\, e^{-\beta},$$

so with $\beta = (3/2) \ln T$ we get $\sqrt{T}\, e^{-\beta} = T^{1/2}\, T^{-3/2} = 1/T$, and the i.i.d. path is tracked up to $O(1/T)$ accuracy.

These statements show that MER recovers $O(1/T)$ stochastic error rates in epochs whose replay separation exceeds the mixing time, and that it automatically transitions across scales, matching i.i.d. performance without explicit knowledge of $t_{\rm mix}$.

4. Robustness and Comparative Analysis

MER’s robustness is manifest relative to alternatives:

  • Serial SA (no replay): yields the $O(\bar{\tau}/T)$ rate, with bias induced by the Markovian data.
  • Uniform buffer replay: reduces bias relative to serial SA, but interleaves scales arbitrarily and lacks structured epoch-wise guarantees.
  • Skip-sampling (CTD): requires an accurate skip parameter; under- or overshooting $t_{\rm mix}$ either retains bias or wastes samples.
  • MER: covers all time scales geometrically, harvesting coarse-scale acceleration early ($\tau_k \gg t_{\rm mix}$) and then gracefully exhausting buffer resolution at fine scales.

Empirical comparisons (see Figures 1–4 in (Nakul et al., 4 Jan 2026)) report that MER nearly matches i.i.d. SA in early epochs, surpasses skip-sampling unless the latter's skip parameter is perfectly tuned, and asymptotically outperforms naive serial SA.

| Approach | Parameter dependence | Rate |
|---|---|---|
| Serial SA | Markov chain ($t_{\rm mix}$) | $O(\bar{\tau}/T)$ |
| Skip-sampling (CTD) | requires $t_{\rm mix}$; fragile | $O(1/T)$ if optimally tuned |
| Uniform replay | buffer size; no epoch structure | varies, lacks guarantee |
| MER | buffer size $B$ only; no mixing time | $O(1/T)$ whenever possible |

5. Applications: RL Policy Evaluation and Generalized Linear Models

MER applies to core estimation problems affected by temporal dependence:

(a) Policy Evaluation (TD(0) with MER):

For a Markov reward process $(\mathcal{S}, P, R, \gamma)$, value-function approximation via the projected Bellman VI reduces to finding $\theta$ such that

$$F(\theta) = 0, \quad \text{with} \quad \widetilde F(\theta, (s,s',R)) = \big(\langle \psi(s), \theta \rangle - R - \gamma \langle \psi(s'), \theta \rangle\big)\, \psi(s).$$

Under bounded features and rewards, $F$ is Lipschitz and strongly monotone. Corollary 6.1 (Nakul et al., 4 Jan 2026) gives MER’s iteration complexity as

$$O\left( \max \left\{ \frac{\alpha_k + 1}{(1 - \gamma)^2} \ln \frac{1}{\epsilon},\ \frac{\alpha_k + 1}{(1 - \gamma)^2\, \epsilon} \right\} \right),$$

recapturing i.i.d. SA sample efficiency with no mixing-time dependence.
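
This oracle plugs directly into the `mer` sketch from Section 2; a minimal rendering, where the feature map `psi` and discount `gamma` are assumed inputs and `sample` is a transition tuple:

```python
import numpy as np

def td0_oracle(theta, sample, psi, gamma):
    """F_tilde for TD(0): sample = (s, s_next, R); returns
    (<psi(s), theta> - R - gamma <psi(s'), theta>) psi(s)."""
    s, s_next, R = sample
    phi, phi_next = psi(s), psi(s_next)
    return (phi @ theta - R - gamma * (phi_next @ theta)) * phi
```

Wrapping it as `lambda th, xi: td0_oracle(th, xi, psi, gamma)` and feeding transition tuples $(s_t, s_{t+1}, R_t)$ to the `mer` sketch yields a mixing-agnostic TD(0).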

(b) Generalized Linear Models:

For samples $(a_t, y_t)$ with $y_t = f(a_t^\top x^*) + v_t$, where $a_t$ is a Markov chain and the link $f$ is Lipschitz and strongly monotone,

$$\widetilde F(x, (a,y)) = a f(a^\top x) - a y.$$

Corollary 5.1 (Nakul et al., 4 Jan 2026) bounds MER’s sample complexity by

$$O\left( \max \left\{ \frac{\alpha_k + 1}{\mu_f^2 \kappa^2} \ln \frac{1}{\epsilon},\ \frac{\alpha_k + 1}{\mu_f^2 \kappa^2\, \epsilon} \right\} \right),$$

matching i.i.d. optimality up to logarithmic factors.
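
The GLM oracle is equally direct to instantiate; in this sketch the link `f_link` is a placeholder chosen to be strongly monotone and Lipschitz, not a choice made in the paper:

```python
import numpy as np

def f_link(u):
    """Placeholder link with f'(u) in [1, 1.5]: 1-strongly monotone
    and 1.5-Lipschitz (an illustrative choice, not from the paper)."""
    return u + 0.5 * np.tanh(u)

def glm_oracle(x, sample):
    """F_tilde for the GLM: sample = (a, y); returns a f(a^T x) - a y."""
    a, y = sample
    return a * (f_link(a @ x) - y)
```

With `F_tilde = glm_oracle` and a chain emitting `(a, y)` pairs, the `mer` sketch from Section 2 performs the estimation without any mixing-time input.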

6. Summary of Theoretical and Practical Attributes

MER’s epoch-wise replay schedule yields:

  • $O(1/T)$ stochastic error rates whenever replay separation exceeds the mixing time,
  • Automatic adaptation to mixing dynamics without tuning,
  • Two-sided guarantees bounding the deviation from the i.i.d. trajectory during early epochs,
  • Theoretical and empirical superiority or parity relative to uniform and skip replay,
  • Best-known mixing-agnostic guarantees for policy evaluation and statistical estimation under Markovian sampling.

By integrating experience replay with a structured multi-scale sequence, MER achieves practical robustness and theoretically optimal rates in a Markovian data setting (Nakul et al., 4 Jan 2026).

References

  • Nakul et al., 4 Jan 2026.
