Multiscale Experience Replay (MER)
- The paper introduces a multi-scale replay schedule that achieves O(1/T) convergence without requiring knowledge of the Markov chain's mixing time.
- MER is a structured algorithm that employs epoch-based, coarse-to-fine buffer sampling to emulate the performance of nearly independent (i.i.d.-like) data under Markovian noise.
- MER outperforms serial stochastic approximation and skip-sampling methods by automatically adapting to the chain’s correlation structure for robust, efficient convergence.
Multiscale Experience Replay (MER) is a provably correct algorithmic framework for solving stochastic variational inequalities (VIs) when sample observations are generated from a Markov chain and stored in a finite replay buffer. MER circumvents the bias and slow convergence rates inherent in standard serial stochastic approximation (SA) under Markovian noise by deploying a multi-scale sampling schedule over the buffer, emulating nearly independent sampling and achieving iteration complexity rates characteristic of i.i.d. scenarios—without requiring any knowledge of the Markov chain’s mixing time (Nakul et al., 4 Jan 2026).
1. Problem Setting: Stochastic VIs with Markovian Data and Buffer Bias
MER addresses the problem of finding $x^* \in X$ such that the monotone variational inequality
$\langle F(x^*), x - x^* \rangle \ge 0 \quad \text{for all } x \in X$
is satisfied, where $F$ is assumed $\bar L$–Lipschitz and $\mu$–strongly monotone:
$\langle F(x) - F(y), x - y \rangle \ge \mu \|x - y\|^2 \quad \text{for all } x, y \in X.$
Instead of direct access to $F$, only stochastic oracle evaluations $\widetilde{F}(x, \xi_t)$ with mean $F(x)$ (under the chain's stationary distribution) are available, where the samples $\xi_t$ are correlated through a Markov chain with mixing time $\tau_{\mathrm{mix}}$. Naive serial SA iterations,
$x_{t+1} = \operatorname{argmin}_{x \in X} \big\{ \eta \langle \widetilde{F}(x_t, \xi_t), x \rangle + \tfrac{1}{2}\|x_t - x\|^2 \big\},$
accumulate a bias governed by the chain's mixing time, resulting in suboptimal convergence. Classical skip-sampling (e.g., CTD) can restore $O(1/T)$ rates given prior knowledge of the mixing time, but is brittle to poor tuning.
MER relies solely on standard buffer access (arbitrary selection from a buffer of size $B$ holding the most recent samples) and deploys a principled multi-epoch, multi-scale usage pattern that automatically adapts to the chain's correlation structure.
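To ground the setting, here is a minimal, illustrative sketch (not from the paper) of the serial SA baseline under Markovian sampling; the linear operator, the two-state chain, the noise model, and all constants are assumptions chosen only to make the bias mechanism concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly monotone linear operator F(x) = A x - b:
# the symmetric part of A is positive definite, so F is strongly monotone.
A = np.array([[2.0, 0.5], [-0.5, 1.5]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)            # VI solution when X = R^2

def markov_chain(n, rho=0.95):
    """Slowly mixing two-state chain; its state shifts the oracle noise."""
    s, states = 0, []
    for _ in range(n):
        if rng.random() > rho:            # switch state with probability 1 - rho
            s = 1 - s
        states.append(s)
    return states

def tilde_F(x, state):
    """Noisy oracle whose mean over the stationary distribution equals F(x)."""
    shift = np.array([1.0, 1.0]) if state == 0 else np.array([-1.0, -1.0])
    return A @ x - b + shift + 0.1 * rng.standard_normal(2)

# Naive serial SA: consecutive, highly correlated samples inject a
# mixing-time-dependent bias into the iterates.
x, eta = np.zeros(2), 0.05
for state in markov_chain(5_000):
    x = x - eta * tilde_F(x, state)       # prox step reduces to this on X = R^2
print("serial SA error:", np.linalg.norm(x - x_star))
```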
2. MER Algorithm: Multi-Scale Epoch-Based Replay
MER operates in epochs, indexed by $k = 1, \dots, K$, where epoch $k$ uses a buffer sampling gap $\tau_k = B/2^k$ for exactly $T_k = 2^k$ updates. This geometric progression traverses coarse-to-fine time scales, with early epochs exploiting widely separated samples and later epochs focusing on finer spacings.
Algorithmic steps per epoch $k$:
- Initialize $x_1^{(k)}$.
- For $t = 1, \dots, T_k$:
  1. Select buffer index $t\tau_k$ and extract $\xi_{t\tau_k}$.
  2. Update: $x_{t+1}^{(k)} = \operatorname{argmin}_{x \in X} \big\{ \eta_k \langle \widetilde{F}(x_t^{(k)}, \xi_{t\tau_k}), x \rangle + \tfrac{1}{2}\|x_t^{(k)} - x\|^2 \big\}$.
  3. (Online setting) Replace $\xi_{t\tau_k}$ in the buffer with the next incoming chain sample.
Pseudo-code:
```
Algorithm MER
Input: buffer {ξ₁,…,ξ_B}, epochs K, step sizes {η_k}
for k = 1 to K:
    set τ_k = B/2^k, T_k = 2^k; re-init x₁^(k)
    for t = 1 to T_k:
        pick ξ ← ξ_{t τ_k}
        x_{t+1}^{(k)} = argmin_{x∈X} {η_k⟨\widetilde{F}(x_t^{(k)},ξ), x⟩ + 0.5‖x_t^{(k)}-x‖²}
        delete ξ_{t τ_k}, append new sample
    end
end
Output: x_{T_K + 1}^{(K)} or average
```
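As a complement to the pseudo-code, the following is a minimal NumPy sketch of the MER loop; the warm-start used for "re-init x₁^(k)", the in-place buffer refill, and the unconstrained domain X = R^d are assumptions made for illustration, not the paper's exact choices.

```python
import numpy as np

def mer(tilde_F, buffer, sample_stream, K, step_sizes, x0):
    """Sketch of the MER loop (assumes len(buffer) = B is a power of two and
    K <= log2(B), so the gap tau_k = B / 2**k stays a positive integer)."""
    B = len(buffer)
    x = np.asarray(x0, dtype=float)       # initialized once; epochs warm-start from the
                                          # previous iterate (an assumption, see lead-in)
    for k in range(1, K + 1):
        tau_k, T_k = B // 2**k, 2**k      # coarse-to-fine: gap halves, epoch length doubles
        eta = step_sizes[k - 1]
        for t in range(1, T_k + 1):
            idx = t * tau_k - 1           # buffer index t*tau_k (zero-based)
            xi = buffer[idx]
            # On X = R^d the prox update argmin_x {eta<F~, x> + 0.5||x_t - x||^2}
            # reduces to a plain step; insert a projection for a constrained X.
            x = x - eta * tilde_F(x, xi)
            try:                          # online setting: refill the used slot
                buffer[idx] = next(sample_stream)
            except StopIteration:
                pass                      # offline setting: keep replaying the buffer
    return x
```

With the illustrative `tilde_F` and `markov_chain` from the earlier sketch, `mer(tilde_F, markov_chain(1024), iter(markov_chain(10_000)), K=8, step_sizes=[0.05] * 8, x0=np.zeros(2))` runs eight coarse-to-fine epochs over a buffer of 1024 correlated samples.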
3. Convergence Guarantees and Complexity Bounds
MER’s central theoretical results quantify error and iteration complexity as follows, with the problem constants $\mu$, $\bar L$, $\zeta$, $C_M$, $\rho$, $D$, $M$ and the epoch parameters $\alpha_k$, $\tau_k$, $T_k$ defined as in (Nakul et al., 4 Jan 2026).
Theorem 1 (General Convergence):
Under the step-size and epoch-length conditions stated in (Nakul et al., 4 Jan 2026),
$\mathbb{E} [\|x_{T_k + 1} - x^*\|^2 ] \le M \left( 1 + \frac{3\mu^2}{8 (\alpha_k + 1) (\zeta^2 + 16\bar L^2)} \right)^{-T_k}(D^2 + 1) + \frac{20 C_M \rho^{\tau_M+\tau_k-1}}{\mu} + O\left( \frac{\alpha_k+1}{\mu^2 T_k} \right),$
where the last term matches i.i.d. SA up to logarithmic factors when the replay gap $\tau_k$ exceeds the chain’s mixing time.
Theorem 2 (i.i.d. Emulation):
For epochs whose replay gap $\tau_k$ exceeds the mixing time, the iterates produced by MER remain within a controlled distance of the trajectory that i.i.d. sampling would produce, so the i.i.d. path is tracked up to an exponentially small, mixing-induced error.
These statements show that MER recovers $O(1/T_k)$ stochastic error rates in epochs whose sample separation exceeds the mixing time, and that it automatically transitions across scales, matching i.i.d. performance without explicit knowledge of $\tau_{\mathrm{mix}}$.
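To make the scale-crossing behaviour concrete, the short sketch below (illustrative buffer size and mixing time, not taken from the paper) lists which epochs of the geometric schedule have replay gaps exceeding a given mixing time:

```python
import math

B, tau_mix = 4096, 50                   # illustrative buffer size and mixing time
K = int(math.log2(B))
for k in range(1, K + 1):
    tau_k, T_k = B // 2**k, 2**k
    regime = "gap > tau_mix (near-i.i.d. epoch)" if tau_k > tau_mix else "gap <= tau_mix"
    print(f"epoch {k:2d}: gap {tau_k:5d}, updates {T_k:5d} -> {regime}")
# Epochs with 2**k < B / tau_mix (here k <= 6) behave like i.i.d. SA,
# even though tau_mix never enters the algorithm itself.
```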
4. Robustness and Comparative Analysis
MER’s robustness is evident relative to the alternatives:
- Serial SA (no replay): Its convergence rate is degraded by Markovian data-induced bias tied to the mixing time.
- Uniform buffer replay: Reduces bias relative to serial SA, but interleaves scales arbitrarily and lacks structured epoch-wise guarantees.
- Skip-sampling (CTD): Requires an accurate skip parameter; it is suboptimal if the skip under- or overshoots $\tau_{\mathrm{mix}}$, either retaining bias or wasting samples.
- MER: Covers all time scales geometrically, harvesting the coarse-scale acceleration early (large replay gaps $\tau_k$), then gracefully exhausting buffer resolution at fine scales.
Empirical comparisons (see Figures 1–4 in (Nakul et al., 4 Jan 2026)) report that MER nearly matches i.i.d. SA in early epochs, surpasses skip-sampling unless the latter is perfectly tuned, and asymptotically outperforms naive serial SA.
| Approach | Parameter dependence | Rate |
|---|---|---|
| Serial SA | Markovian chain (mixing time $\tau_{\mathrm{mix}}$) | Degraded by mixing-time bias |
| Skip-sampling (CTD) | Requires $\tau_{\mathrm{mix}}$, fragile to mis-tuning | $O(1/T)$ if the skip is optimal |
| Uniform replay | Buffer size, no epoch structure | Varies, lacks guarantee |
| MER | Buffer size $B$, no mixing-time knowledge | $O(1/T)$ whenever possible |
5. Applications: RL Policy Evaluation and Generalized Linear Models
MER applies to core estimation problems affected by temporal dependence:
(a) Policy Evaluation (TD(0) with MER):
For a Markov reward process with features $\varphi(s)$, discount $\gamma$, and rewards $r(s)$, value function approximation via the projected Bellman VI reduces to finding $\theta^*$ such that
$F(\theta^*) = \mathbb{E}\big[\varphi(s)\big(\varphi(s) - \gamma\,\varphi(s')\big)^{\top}\big]\,\theta^* - \mathbb{E}[r(s)\,\varphi(s)] = 0.$
Under bounded features and rewards, $F$ is Lipschitz and strongly monotone. Corollary 6.1 (Nakul et al., 4 Jan 2026) gives MER’s iteration complexity for this problem, recapturing i.i.d. SA sample efficiency with no mixing-time dependence.
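As an illustration of how policy evaluation plugs into this framework, the stochastic operator below is the standard linear-function-approximation TD(0) form and could be passed to an MER-style loop such as the `mer` sketch above; the feature map, discount factor, and transition-tuple format are assumptions.

```python
import numpy as np

GAMMA = 0.9                              # illustrative discount factor

def td0_operator(theta, transition):
    """Stochastic operator for the projected Bellman VI with linear features.

    transition = (phi_s, reward, phi_next): features of the current and next
    state plus the observed reward, all drawn along the Markov chain.
    """
    phi_s, reward, phi_next = transition
    td_error = phi_s @ theta - (reward + GAMMA * (phi_next @ theta))
    return td_error * phi_s              # its stationary mean is the TD(0) operator F(theta)
```

Buffer entries are then `(phi_s, reward, phi_next)` tuples, and running MER over them performs mixing-time-agnostic policy evaluation.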
(b) Generalized Linear Models:
For GLM parameter estimation from samples $(a_t, b_t)$ whose covariates are generated along a Markov chain, and whose population estimating-equation operator $F$ is Lipschitz and strongly monotone, Corollary 5.1 (Nakul et al., 4 Jan 2026) bounds MER’s sample complexity, matching i.i.d. optimality up to logarithmic factors.
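For the GLM case, a hedged sketch: assuming, purely for illustration, a logistic-link model with estimating-equation operator $F(x) = \mathbb{E}[(\sigma(\langle a, x\rangle) - b)\,a]$ (the paper's exact model may differ), the per-sample operator is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm_operator(x, sample):
    """Per-sample estimating-equation operator for a logistic-link GLM (illustrative).

    sample = (a, b): covariate vector drawn along the Markov chain, observed response.
    """
    a, b = sample
    # Monotone by construction; strongly monotone on a bounded domain under
    # standard covariance conditions (an assumption, not verified here).
    return (sigmoid(a @ x) - b) * a
```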
6. Summary of Theoretical and Practical Attributes
MER’s epoch-wise replay schedule yields:
- $O(1/T)$ stochastic error rates whenever the replay separation exceeds the mixing time,
- Automatic adaptation to mixing dynamics without tuning,
- Two-sided guarantees bounding deviation from i.i.d. trajectory during early epochs,
- Theoretical and empirical superiority or parity relative to uniform and skip replay,
- Best-known mixing-agnostic guarantees for policy evaluation and statistical estimation under Markovian sampling.
By integrating experience replay with a structured multi-scale sequence, MER achieves practical robustness and theoretically optimal rates in a Markovian data setting (Nakul et al., 4 Jan 2026).