Cross-time Replay Mechanisms
- Cross-time replay is a temporal mechanism that integrates past signals and states with current data to enhance decision-making and memory consolidation.
- It spans diverse applications including reinforcement learning, continual learning, transformer pre-training, and debugging, using methods like experience replay and generative replay.
- Techniques such as priority sampling, replay scheduling, and replay-free surrogates demonstrate its role in reducing variance, mitigating forgetting, and improving model efficiency.
Cross-time replay denotes the deliberate reuse of signals, samples, or state extracted at earlier times so that later computation depends on both present and past trajectories. In the broad formulation used in neuroscience and machine learning, replay is the mechanism by which information propagates “across time” from earlier experiences and tasks into later learning, memory consolidation, and forgetting mitigation (Hayes et al., 2021). In reinforcement learning, the canonical instantiation is experience replay: a finite buffer of past transitions is sampled at later updates, converting purely online Markovian learning into stochastic approximation driven by reused past data (Lim et al., 2023).
| Domain | Replayed object | Cross-time role |
|---|---|---|
| Reinforcement learning | Transitions or replay buffers | Decorrelation, variance reduction, bootstrapping |
| Continual learning | Real samples, pseudo-samples, or task schedules | Retention of earlier tasks |
| Transformer pre-training | Corrupted token sequences | Sample-efficiency for the discriminator |
| Long-horizon reasoning | Distilled intermediate insights | Maintain accessibility of critical state |
| Cyber-physical security | Reused measurements | Attack vector requiring detection |
| Debugging systems | Execution events and syscall effects | Deterministic re-execution across runs |
These forms differ in substrate and purpose, but all instantiate the same temporal operation: later updates or decisions are conditioned on artifacts produced earlier rather than only on the current input stream (Liu et al., 2022, Lei et al., 14 May 2026, Porter et al., 2019, O'Callahan et al., 2016).
1. Conceptual scope and variants
Replay is not limited to literal storage of raw examples. In deep learning it includes veridical replay of original inputs, representational replay of internal feature activations, and generative replay of regenerated past-like samples; all three mix earlier information with current data when parameters are updated later in time (Hayes et al., 2021). In continual learning, this appears as rehearsal with a memory buffer or with a generator, typically through an objective of the form
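$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D}_t}\big[\ell(f_\theta(x), y)\big] \;+\; \lambda\, \mathbb{E}_{(x,y)\sim \mathcal{R}_{<t}}\big[\ell(f_\theta(x), y)\big],$$
where $\mathcal{D}_t$ is current-task data, $\mathcal{R}_{<t}$ is a replay distribution over past tasks, $f_\theta$ is the model, $\ell$ is the per-sample loss, and $\lambda$ weights the replay term (Yang et al., 2022).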
A common misconception is that replay is necessarily a storage-heavy mechanism. Several lines of work treat replay more abstractly as preservation of past structure. In long-horizon reasoning, the replayed object can be a compact list of “critical insights” extracted from an earlier segment of a chain-of-thought rather than the full trace (Lei et al., 14 May 2026). In replay-free continual multimodal learning, the stored object can be a compact skill-wise prototype of attention spectra rather than image-question pairs or teacher snapshots (Zhao et al., 22 Jun 2026). Conversely, in cyber-physical systems the same temporal reuse operation becomes adversarial: replay attacks reuse earlier measurements at later times under changed conditions (Porter et al., 2019).
This breadth makes cross-time replay less a single algorithm than a family of temporal reuse operators. The central design questions are therefore: what is replayed, when it is replayed, how it is selected, and what invariances or guarantees the replay mechanism is intended to preserve.
2. Experience replay in reinforcement learning
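In policy evaluation with linear TD(0) under a fixed policy, the environment induces a Markov chain and the value approximation is $V_\theta(s) = \phi(s)^\top \theta$. Standard online TD uses the current transition only, whereas TD with experience replay inserts each new transition into a finite buffer of size $N$, samples a mini-batch of $L$ transitions $(s_i, r_i, s_i')$ from it, forms
$$g_k = \frac{1}{L} \sum_{i=1}^{L} \big(r_i + \gamma\, \phi(s_i')^\top \theta_k - \phi(s_i)^\top \theta_k\big)\, \phi(s_i),$$
and updates
$$\theta_{k+1} = \theta_k + \alpha\, g_k,$$
with discount factor $\gamma$ and constant step size $\alpha$.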
Under the Markovian observation model, the replay update decomposes into a mean operator plus two noise terms: sampling noise within the buffer and bias from the fact that the buffer contents are a correlated sliding window of the Markov chain (Lim et al., 2023).
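The replayed TD(0) update can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation analyzed by Lim et al.: the function name `td0_with_replay`, the sliding-window buffer, and sampling with replacement are assumptions made for the example.

```python
import random
from collections import deque


def td0_with_replay(transitions, phi, dim, gamma=0.9, alpha=0.05,
                    buffer_size=100, batch_size=8, seed=0):
    """Linear TD(0) whose update averages TD directions over a mini-batch
    sampled from a sliding-window replay buffer (illustrative sketch)."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    buffer = deque(maxlen=buffer_size)  # the oldest transition is dropped first

    for s, r, s_next in transitions:
        buffer.append((s, r, s_next))
        # Assumption: uniform sampling with replacement from the buffer.
        batch = [rng.choice(buffer) for _ in range(batch_size)]

        direction = [0.0] * dim
        for bs, br, bs_next in batch:
            f, f_next = phi(bs), phi(bs_next)
            v = sum(a * b for a, b in zip(f, theta))
            v_next = sum(a * b for a, b in zip(f_next, theta))
            delta = br + gamma * v_next - v  # TD error of the replayed transition
            for j in range(dim):
                direction[j] += delta * f[j] / batch_size

        theta = [t + alpha * d for t, d in zip(theta, direction)]
    return theta
```

With one-hot features the sketch reduces to tabular TD(0), which makes it easy to check against a chain whose values are known in closed form.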
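The finite-time contribution of replay is explicit. For both averaged iterates and final iterates, the constant-step-size error can be controlled by the replay buffer size $N$ and the mini-batch size $L$. Schematically, the averaged-iterate error has the form
$$\operatorname{err}(\bar{\theta}_T) \;\lesssim\; \underbrace{\varepsilon_{\text{trans}}(T, \alpha)}_{\text{transient}} \;+\; \underbrace{\varepsilon_{\text{var}}(\alpha, L)}_{\text{sampling variance, decreasing in } L} \;+\; \underbrace{\varepsilon_{\text{bias}}(\alpha, N)}_{\text{buffer bias, decreasing in } N},$$
where $\bar{\theta}_T$ is the averaged iterate after $T$ updates and $\alpha$ is the step size, so a larger mini-batch size $L$ reduces i.i.d.-like sampling variance and a larger buffer size $N$ reduces the Markovian buffer-bias term. When $N$ is on the order of or larger than the mixing time, the dominant rate can match the i.i.d.-style rate up to constants, rather than the more unfavorable mixing-dependent behavior of purely online TD (Lim et al., 2023).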
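A second extension pushes replay beyond a single run. Replay across Experiments (RaE) treats prior experiments as persistent offline data $\mathcal{D}_{\text{prior}}$ and mixes them with online data from the current run $\mathcal{D}_{\text{online}}$, sampling from
$$p_{\text{mix}} = \rho\, p_{\mathcal{D}_{\text{prior}}} + (1 - \rho)\, p_{\mathcal{D}_{\text{online}}}, \qquad \rho \in [0, 1],$$
where $p_{\mathcal{D}}$ denotes the sampling distribution over a data source $\mathcal{D}$ and $\rho$ is the mixing ratio.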
The underlying off-policy RL objectives remain unchanged; only the replay distribution changes. Empirically, this reuses trajectories across seeds, hyperparameter settings, and previous runs to improve exploration and bootstrap learning in locomotion, manipulation, and hard exploration settings from egocentric vision (Tirumala et al., 2023).
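Mixing prior-experiment data with the current run's buffer can be sketched as follows. The function name `sample_mixed_batch`, the `prior_fraction` parameter, and uniform sampling within each source are assumptions made for illustration; only the idea of changing the replay distribution while leaving the learner untouched comes from the text.

```python
import random


def sample_mixed_batch(prior_data, online_data, batch_size,
                       prior_fraction=0.5, rng=None):
    """Draw a batch in which each element comes from prior-experiment data
    with probability `prior_fraction`, otherwise from the current run."""
    if rng is None:
        rng = random.Random()
    if not prior_data and not online_data:
        raise ValueError("both data sources are empty")

    batch = []
    for _ in range(batch_size):
        # Fall back to whichever source is non-empty (e.g. early in a run).
        use_prior = bool(prior_data) and (
            not online_data or rng.random() < prior_fraction
        )
        source = prior_data if use_prior else online_data
        batch.append(rng.choice(source))
    return batch
```

The batch is then passed to the unchanged off-policy update, so the learner never needs to know which run a transition came from.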
Together these results place cross-time replay at the center of finite-time RL analysis: it reshapes the effective noise process, changes the step-size–bias trade-off, and can be extended from “earlier transitions in one trajectory” to “earlier trajectories across research runs.”
3. Continual learning, scheduling, and generative replay
In continual learning, replay is primarily a defense against catastrophic forgetting. A large empirical benchmark comparing replay strategies shows that the choice of what to replay is nontrivial. On simple streams such as MNIST, random replay is often sufficient, but on more complex streams such as CIFAR-10 and MiniImageNet, replaying difficult samples (high entropy, low confidence, low margin, or high Bayesian disagreement) generally improves accuracy and reduces forgetting relative to replaying “simple” samples (Yang et al., 2022). The same benchmark also reports that, when storage of real data is feasible, experience replay is markedly stronger than a VAE-based generative replay baseline on MNIST, with the two compared on accuracy, forgetting, and run time (Yang et al., 2022).
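One of the difficulty criteria named above, predictive entropy, can be sketched as a selection rule. The helper names and the `(sample_id, class_probabilities)` input format are hypothetical; the benchmark's own selection code may differ.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_hard_samples(candidates, k):
    """Return the ids of the k candidates with the highest predictive entropy.

    `candidates` is a list of (sample_id, class_probabilities) pairs.
    """
    ranked = sorted(candidates,
                    key=lambda c: predictive_entropy(c[1]),
                    reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]
```

Swapping the key function for a confidence or margin score gives the other difficulty-based variants without changing the selection loop.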
The temporal allocation of replay is itself a learning problem. Replay scheduling formalizes a schedule as
$$S = (\mathbf{p}_1, \ldots, \mathbf{p}_{T-1}),$$
where $T$ is the number of tasks and each $\mathbf{p}_i$ specifies the proportions of past tasks to sample into the replay buffer at stage $i$. Under a fixed processing budget, Monte Carlo tree search can find schedules that outperform equal replay across tasks, and reinforcement learning can learn policies over these task proportions from state vectors of per-task validation accuracies (Klasson et al., 2022). The resulting schedules are not uniform; the reported visualizations show non-monotonic, spaced-repetition-like allocation patterns in which some tasks are replayed heavily soon after acquisition and later revisited only selectively (Klasson et al., 2022).
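To make the role of task proportions concrete, the sketch below converts one stage's proportions into integer sample counts for a fixed-size replay memory. The function name `allocate_replay_slots` and the largest-remainder rounding rule are assumptions made for illustration, not details taken from the replay-scheduling work.

```python
def allocate_replay_slots(proportions, memory_size):
    """Convert task proportions (summing to 1) into integer sample counts
    that add up exactly to `memory_size` (largest-remainder rounding)."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("task proportions must sum to 1")

    raw = [p * memory_size for p in proportions]
    counts = [int(x) for x in raw]          # floor of each share
    leftover = memory_size - sum(counts)    # slots lost to flooring

    # Hand the remaining slots to the tasks with the largest fractional part.
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts
```

A schedule is then a list of such proportion vectors, one per stage, for example `[[1.0], [0.5, 0.5], [0.25, 0.25, 0.5]]`, and a search or learned policy chooses among candidate vectors at each stage.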
Generative replay becomes essential in strict class-incremental settings where model size must remain constant, no pre-training dataset is allowed, and no memory buffer may store past raw data. In that regime, a $\beta$-VAE-style generator and classifier are trained with a loss whose regularization weights depend on an inferred memory age: the age is derived from the classifier’s predicted label, and the weight schedules decrease regularization for older memories (Hu et al., 2023). This time-aware regularization is explicitly cross-time-sensitive: old classes are not replayed with the same generative pressure as newly learned ones. Under the stated strict constraints, the reported average accuracies improve from BI-R to BI-R + time-aware on MNIST, permMNIST, and CIFAR-100 (Hu et al., 2023).
Cross-time replay in continual learning is therefore governed by three coupled choices: the replay substrate (real or generated samples), the replay policy (random, uncertainty-based, interference-based, or scheduled), and the temporal weighting with which earlier knowledge is reintroduced.
4. Sequence models: transformer memory replay and stateful reasoning
Transformer with Memory Replay (TMR) introduces replay into ELECTRA-style pre-training by storing corrupted examples produced by the generator in a fixed-size buffer and training the discriminator on samples from that buffer rather than only on the generator’s immediate output. Sampling is weight-proportional,
$$P(i) = \frac{w_i}{\sum_{j} w_j},$$
where $w_i$ is the weight of stored example $i$; the memory has a fixed capacity, and the buffer uses priority-based eviction of the lowest-weight example when full (Liu et al., 2022). The stated motivation is distribution drift in the generator: as masked-token prediction improves, the discriminator receives less informative negatives. Replaying older corrupted sequences stabilizes this signal. With the same number of pre-training examples, the reported small-model GLUE average improves for TMR(loss_diff) over baseline ELECTRA (Liu et al., 2022). A key engineering result is that the cheap loss-difference weighting strategy preserves runtime efficiency, as shown by the wall-clock times reported for baseline ELECTRA, TMR(loss_diff), and TMR(grad_norm) on the small model (Liu et al., 2022).
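A weight-proportional buffer with lowest-weight eviction can be sketched as below. The class name `WeightedReplayBuffer` and its methods are hypothetical, and always admitting the incoming example after evicting the lowest-weight stored one is an assumption; the weights themselves (for instance a loss difference) are supplied by the caller.

```python
import random


class WeightedReplayBuffer:
    """Fixed-capacity buffer: sampling is proportional to stored weights,
    and the lowest-weight example is evicted when the buffer is full."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []  # list of (weight, example) pairs
        self.rng = random.Random(seed)

    def add(self, example, weight):
        if len(self.items) >= self.capacity:
            # Assumption: evict the stored minimum, then admit the new example.
            lowest = min(range(len(self.items)),
                         key=lambda i: self.items[i][0])
            self.items.pop(lowest)
        self.items.append((weight, example))

    def sample(self, k):
        """Draw k examples with probability w_i / sum_j w_j (with replacement)."""
        weights = [w for w, _ in self.items]
        examples = [e for _, e in self.items]
        return self.rng.choices(examples, weights=weights, k=k)
```

Eviction here is a linear scan, which is adequate for a sketch; a heap or sorted container would be the usual choice for large buffers.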
A different temporal problem arises in long chain-of-thought (CoT) reasoning. InsightReplay identifies an inverted-U relation between CoT length and accuracy on a fixed problem: longer reasoning helps only up to a point, after which performance declines because earlier critical insights become less accessible as their distance from the active generation frontier grows (Lei et al., 14 May 2026). The method periodically extracts compact “insights” from the reasoning trace and replays them near the current frontier. The theoretical formulation introduces an insight accessibility function $a(d)$, strictly decreasing in the distance $d$ between an insight and the generation frontier, and contrasts standard CoT, in which that distance grows with the length of the trace, with InsightReplay, which keeps insights at a fixed small distance $d_0$ (Lei et al., 14 May 2026). Across a grid of model scales, model families, and benchmarks, 3-round InsightReplay improves accuracy over standard CoT in all reported settings, with the largest gain on R1-Distill-32B’s LiveCodeBench v5 subset (Lei et al., 14 May 2026).
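The extract-and-replay loop can be sketched at the level of prompt construction. Everything here is an assumption made for illustration: `generate_segment` and `extract_insights` are hypothetical callables standing in for model calls, and the "Key insights so far" block is an invented format, not the prompt used by InsightReplay.

```python
def reason_with_insight_replay(question, generate_segment, extract_insights,
                               rounds=3):
    """Alternate between generating a reasoning segment and re-stating the
    accumulated insights at the end of the context, next to the frontier."""
    segments, insights = [], []
    context = question

    for _ in range(rounds):
        segment = generate_segment(context)          # hypothetical model call
        segments.append(segment)
        insights.extend(extract_insights(segment))   # hypothetical distillation

        # Rebuild the context so the insight list is always the last block,
        # i.e. at a small, fixed distance from the next generated tokens.
        replay_block = "Key insights so far:\n" + "\n".join(
            f"- {item}" for item in insights
        )
        context = "\n".join([question, *segments, replay_block])

    return context, insights
```

Because the insight block is rebuilt each round rather than appended, early insights stay adjacent to the frontier however long the trace becomes.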
These two cases illustrate distinct temporal granularities. TMR replays examples across optimization steps in pre-training; InsightReplay replays distilled state within a single test-time trajectory. In both, the common function is to keep earlier high-value information accessible when later computation would otherwise drift away from it.
5. Adversarial and deterministic replay outside learning
Cross-time replay is not inherently benign. In cyber-physical systems, replay attacks reuse earlier measurements at later times to deceive the controller. For discrete-time linear time-varying systems, generalized replay attacks are modeled by injecting measurement signals generated by a false internal state evolving under the same time-varying dynamics, optionally with scaling of the genuine measurement (Porter et al., 2019). Dynamic watermarking counters this by injecting a private excitation signal into the control input and normalizing the measurement residuals with a time-varying factor.
Under no attack, the normalized residual covariance tends to identity and its correlation with past watermarks vanishes; under any generalized replay attack with non-zero asymptotic power, these two conditions cannot both hold (Porter et al., 2019). In this setting, “cross-time replay” names a temporal threat model rather than a learning mechanism.
At the systems level, record-and-replay debugging turns cross-time replay into deterministic re-execution. RR records enough information about one concrete execution of a Linux user-space process group to reproduce it later under debugger control, while remaining entirely in user space on stock hardware and operating systems (O'Callahan et al., 2016). The key technical device is the deterministic hardware counter of retired conditional branches, coupled with the general-purpose register file, so that the point at which each asynchronous event must be delivered is located by the pair (retired-conditional-branch count, general-purpose register values).
RR records system calls, signals, context switches, and a few nondeterministic instructions, then later replays the same user-space state transitions. With syscall buffering and file/block cloning, recording overhead stays low on all reported workloads except the highly parallel make workload; recording slowdowns are reported individually for cp, octane, htmltest, and sambatest (O'Callahan et al., 2016).
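The record-then-replay principle can be illustrated with a toy that logs the results of a nondeterministic input source and feeds the log back on a second run. This is only an analogy for the idea of capturing nondeterminism at a boundary; it does not reflect how RR intercepts system calls or uses hardware counters, and all names are hypothetical.

```python
import random


class RecordingSource:
    """Wraps a nondeterministic source and logs every value it returns."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value


class ReplayingSource:
    """Returns previously logged values in their original order."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def run_program(inputs, steps=5):
    """A deterministic computation whose only nondeterminism is `inputs`."""
    state = 0
    for _ in range(steps):
        state = state * 31 + inputs.next_value()
    return state
```

Running `run_program` once with a `RecordingSource` and again with a `ReplayingSource` built from its log yields the same final state, because every nondeterministic input is reproduced in order.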
The contrast is instructive. In CPS, replay is a form of temporal deception to be detected. In debugging, replay is a fidelity mechanism enabling reverse execution and forensic analysis. Both rely on the same temporal reuse primitive—later use of earlier signals—but attach radically different semantics to it.
6. Replay-free surrogates and unifying principles
A recent replay-free line of work replaces stored past data with compact structural statistics. Attention-Spectrum Regularization (ASR) for continual multimodal LLMs extracts cross-modal attention maps, treats them as signals, computes their Fourier power spectra, and stores only skill-wise prototype distributions of those spectra rather than past image-question pairs, pseudo-examples, or teacher snapshots (Zhao et al., 22 Jun 2026). During later stages, current spectral descriptors are constrained by a skill-weighted Mahalanobis term and an angular symmetric-KL term, and the theory shows that forgetting is bounded by skill-conditioned spectral drift under a spectral sufficiency assumption (Zhao et al., 22 Jun 2026). Empirically, ASR improves final performance and reduces forgetting relative to replay-, regularization-, and adapter-based baselines on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT; for example, on VQA v2 with LLaVA-1.5-7B it is compared against ER on both AP and AF (Zhao et al., 22 Jun 2026).
Across the literature, three recurrent design variables emerge. The first is the replay unit: raw transitions, rehearsal samples, generated pseudo-data, task proportions, corrupted sequences, distilled insights, execution events, or attention prototypes. The second is the selection policy: uniform sampling, difficult-example replay, task scheduling, priority weighting, or skill-weighted prototype matching. The third is the temporal invariance demanded of the replayed object: i.i.d.-like noise reduction in TD, preservation of old class distributions, persistence of informative corrupted examples, accessibility of critical reasoning state, or phase-invariant preservation of cross-modal attention structure (Lim et al., 2023, Yang et al., 2022, Klasson et al., 2022, Lei et al., 14 May 2026).
A second misconception is that more temporal context automatically improves performance. The sequence-model results show that longer CoT can degrade accuracy when critical early insights are not replayed near the frontier, and continual-learning results show that indiscriminate replay is often inferior to difficult-example selection or learned schedules on harder streams (Lei et al., 14 May 2026, Yang et al., 2022). A third misconception is that replay necessarily implies data storage; replay-free structural surrogates such as ASR preserve past skill-conditioned behavior without storing past samples at all (Zhao et al., 22 Jun 2026).
Cross-time replay is therefore best understood as a general temporal control principle rather than a single buffer mechanism. It governs how systems preserve, refresh, verify, exploit, or defend against information that originated earlier but remains consequential later.