
ReGen: Recurrent Generative Replay in Robotics

Updated 4 July 2026
  • REGEN is a continual imitation learning framework that uses a World Action Model (WAM) to generate pseudo-replay trajectories, mitigating catastrophic forgetting in sequential robotic tasks.
  • It constructs replay by recursively feeding generated observations back into the policy, enabling trajectory-level rehearsal without storing original demonstrations.
  • Empirical results show that REGEN significantly reduces forgetting compared to sequential fine-tuning, though challenges remain with long-horizon visual fidelity and action-observation consistency.

Recurrent Generative Replay (REGEN), stylized as ReGen in the originating work, is a continual imitation learning framework that uses a World Action Model (WAM) to synthesize pseudo-replay trajectories for previously learned robot manipulation tasks, allowing rehearsal without storing the original human demonstrations. In REGEN, replay is trajectory-level and recurrent: the model conditions on a prior task instruction and a seed observation, predicts future actions and future observations, and then recursively feeds its own generated observations back as inputs to continue rollout. The framework was introduced for continual robot learning in simulation and on a real robot, where it reduces catastrophic forgetting relative to sequential fine-tuning while approaching replay methods that rely on privileged access to real historical data (Govind et al., 25 Jun 2026).

1. Definition and conceptual scope

REGEN belongs to the broader family of generative replay methods in continual learning, in which a model substitutes stored historical data with synthesized pseudo-examples. Earlier work established the core teacher-chain pattern: a model learned at task $t-1$ generates replay samples for task $t$, and the resulting model becomes the replay source for the next stage (Lesort et al., 2018). In image generation, this pattern appears in Memory Replay GANs, where a frozen previous conditional GAN supplies replay for joint training or replay alignment (Wu et al., 2018).

REGEN specializes that general idea to continual imitation learning. Its replay target is not a static image, a feature vector, or a class-conditional sample, but a pseudo-demonstration trajectory of observation-action pairs. The central novelty is that the replay generator is the policy itself, provided that the policy is a WAM capable of predicting not only actions but also future observations. This differentiates REGEN from replay systems that require a separate VAE, GAN, or exemplar buffer (Govind et al., 25 Jun 2026).

A common misconception is to treat REGEN as equivalent to conventional experience replay. It is not. The framework retains previous task instructions, but not prior task trajectories, and therefore does not satisfy the assumptions of privileged replay baselines that store real demonstrations. Another misconception is to treat it as mere action prediction with hallucinated states added post hoc. In REGEN, the capacity to model future observations is structurally necessary, because recurrent rollout cannot proceed from an action-only policy once real observations are no longer available (Govind et al., 25 Jun 2026).

2. Problem formulation and the role of World Action Models

The continual learning setting is sequential imitation learning over robot tasks $\mathcal{T}^k$, each specified by a language instruction $\ell^k$ and demonstrations

$$\mathcal{D}^k = \{(\ell^k,\tau_i^k)\}_{i=1}^{N_k},$$

with trajectories

$$\tau = \{(\mathbf{o}_t,\mathbf{a}_t)\}_{t=1}^{T}.$$

Each observation contains multi-view RGB images and proprioception,

$$\mathbf{o}_t = \{\mathbf{I}_t^1,\ldots,\mathbf{I}_t^n,\mathbf{q}_t\}.$$

The learner starts from a pretrained policy $\pi_{\theta_0}$ trained on previous tasks

$$\mathcal{T}_{\mathrm{prev}} = \{\mathcal{T}_1,\mathcal{T}_2,\ldots,\mathcal{T}_M\},$$

and then adapts sequentially to a novel task $\mathcal{T}_k$, with access to [symbol missing] and [symbol missing], but for old tasks retaining only the instructions rather than the original demonstrations (Govind et al., 25 Jun 2026).

The WAM interface is the enabling assumption. At each time step, the model predicts an action chunk, a future observation, and a task-progress reward:

[equation missing]

A standard action-only visuomotor policy cannot synthesize replay trajectories from scratch because it has no mechanism for producing the next observation on which later actions depend. A WAM can instead roll itself forward as a learned simulator conditioned on language and sensory context (Govind et al., 25 Jun 2026).
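The interface described above can be sketched as a small Python protocol. This is an illustrative sketch only, not the Cosmos-Policy API: the names `WorldActionModel`, `WAMOutput`, `step`, and `ToyWAM` are hypothetical, and observations are reduced to flat lists of floats instead of multi-view images plus proprioception.

```python
from dataclasses import dataclass
from typing import List, Protocol

Observation = List[float]  # stand-in for multi-view images plus proprioception
Action = List[float]


@dataclass
class WAMOutput:
    action_chunk: List[Action]     # predicted actions for the next few steps
    next_observation: Observation  # predicted future observation
    reward: float                  # predicted task-progress reward


class WorldActionModel(Protocol):
    """Hypothetical interface: one call yields actions, a future observation, a reward."""

    def step(self, instruction: str, observation: Observation) -> WAMOutput:
        ...


class ToyWAM:
    """Hypothetical stand-in that moves every observation entry halfway toward 1.0.

    The instruction is accepted but ignored; a real WAM would condition on it.
    """

    def __init__(self, chunk_size: int = 2, gain: float = 0.5) -> None:
        self.chunk_size = chunk_size
        self.gain = gain

    def step(self, instruction: str, observation: Observation) -> WAMOutput:
        action = [self.gain * (1.0 - x) for x in observation]
        next_obs = [x + a for x, a in zip(observation, action)]
        reward = sum(next_obs) / len(next_obs)
        return WAMOutput([action] * self.chunk_size, next_obs, reward)


out = ToyWAM().step("put carrot in bowl", [0.0, 0.0])
```

The point of the sketch is the return type: because `next_observation` is part of the output, the model can be called again on its own prediction, which an action-only policy cannot do.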

This formulation is closely related to earlier replay literature in which the generator acts as an active memory system. In continual GAN learning, MeRGAN freezes the previous conditional generator and reuses it to represent old classes (Wu et al., 2018). REGEN inherits the same continual-memory logic, but the memory carrier is a multimodal robot policy whose outputs are temporally extended and jointly visuomotor (Govind et al., 25 Jun 2026).

3. Replay construction and recurrent rollout

REGEN constructs replay by recurrently querying the WAM under a previous-task instruction. Rollout begins from a real observation context sampled from the current task data. For the first [symbol missing] steps, the input observation is real; after that, generated observations are recursively fed back:

[equation missing]

The pseudo-trajectory for each previous task is then

[equation missing]

Across all previous tasks, replay data are aggregated as

[equation missing]

and mixed with current-task demonstrations:

[equation missing]

Training then proceeds on this union (Govind et al., 25 Jun 2026).
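The replay construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_wam` and `generate_replay` are hypothetical names, the WAM is a toy function on lists of floats, and a fixed horizon is used with no early termination.

```python
from typing import Callable, List, Tuple

Observation = List[float]
Action = List[float]
Step = Tuple[Observation, Action]
# A WAM call returns (action_chunk, next_observation, reward).
WAM = Callable[[str, Observation], Tuple[List[Action], Observation, float]]


def toy_wam(instruction: str, obs: Observation):
    """Hypothetical WAM: moves each entry halfway toward 1.0, ignores the instruction."""
    action = [0.5 * (1.0 - x) for x in obs]
    next_obs = [x + a for x, a in zip(obs, action)]
    return [action, action], next_obs, sum(next_obs) / len(next_obs)


def generate_replay(wam: WAM, instruction: str, seed: List[Observation],
                    horizon: int) -> List[Step]:
    """Roll the WAM forward, feeding generated observations back as inputs."""
    if not seed:
        raise ValueError("at least one real seed observation is required")
    trajectory: List[Step] = []
    generated: Observation = []
    for t in range(horizon):
        # Real seed observations first, then the model's own predictions.
        obs = seed[t] if t < len(seed) else generated
        chunk, generated, _reward = wam(instruction, obs)
        trajectory.append((obs, chunk[0]))  # keep the first action of the chunk
    return trajectory


old_instructions = ["put carrot in bowl", "put carrot on plate"]
current_demos = [("put eggplant in bowl", [([0.0, 0.0], [0.1, 0.1])])]
seed = [current_demos[0][1][0][0]]  # first real observation of a current-task demo

replay = [(ell, generate_replay(toy_wam, ell, seed, horizon=3))
          for ell in old_instructions]
training_set = current_demos + replay  # union of real and generated trajectories
```

Note that the seed observation comes from the current task while the instruction comes from an old task, which mirrors the conditioning described above. Only instructions, never old trajectories, are carried across stages.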

Several design choices are distinctive. First, replay is conditioned on old instructions but initialized from current-task observations rather than stored old-task states. Second, the stored replay signal consists of observation-action pairs, using the first action of each predicted action chunk. Third, rollout is terminated either at a fixed maximum horizon or early when the reward head indicates task completion; in the reported implementation, termination occurs when predicted reward exceeds [value missing] for three consecutive steps and reaches [value missing] at least once in that window (Govind et al., 25 Jun 2026).
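The termination rule can be expressed as a short check over the predicted reward sequence. The sketch below is an assumption-laden illustration: the function names are hypothetical, and `sustain_threshold` and `peak_threshold` are left as parameters because the reported threshold values are not available in this text. The numbers in the usage example are placeholders, not the paper's settings.

```python
from typing import List


def should_terminate(rewards: List[float], sustain_threshold: float,
                     peak_threshold: float, window: int = 3) -> bool:
    """True when the last `window` rewards all exceed `sustain_threshold`
    and at least one of them reaches `peak_threshold`."""
    if len(rewards) < window:
        return False
    recent = rewards[-window:]
    return (all(r > sustain_threshold for r in recent)
            and any(r >= peak_threshold for r in recent))


def rollout_length(rewards: List[float], max_horizon: int,
                   sustain_threshold: float, peak_threshold: float) -> int:
    """Number of steps kept before early termination or the horizon cap."""
    limit = min(len(rewards), max_horizon)
    for t in range(1, limit + 1):
        if should_terminate(rewards[:t], sustain_threshold, peak_threshold):
            return t
    return limit


# Placeholder rewards and thresholds, chosen only to exercise both branches.
predicted = [0.1, 0.4, 0.85, 0.9, 0.97, 0.99]
early = rollout_length(predicted, max_horizon=6,
                       sustain_threshold=0.8, peak_threshold=0.95)
capped = rollout_length(predicted, max_horizon=4,
                        sustain_threshold=0.8, peak_threshold=0.95)
```

Cutting the rollout once the reward head signals completion keeps the pseudo-trajectory from accumulating further generated frames after the task is effectively done.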

This recurrent construction directly addresses a central limitation of ordinary generative replay in robotics: static sample synthesis is insufficient when old competence is encoded in action-conditioned interaction sequences. Earlier replay systems for images, features, or class-incremental classifiers can preserve category-level information without modeling sequential dependence (Liu et al., 2020), whereas REGEN is explicitly sequence-generating and instruction-conditioned (Govind et al., 25 Jun 2026).

4. Optimization, architecture, and implementation

The continual update objective is behavioral cloning on mixed real and generated data:

[equation missing]

For samples from the current-task demonstrations, the instruction is that of the current task; for replayed samples, the instruction is the corresponding prior-task instruction. REGEN therefore introduces no separate continual-learning regularizer; forgetting mitigation comes from augmenting the imitation dataset with generated old-task trajectories (Govind et al., 25 Jun 2026).
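To make the "no separate regularizer" point concrete, the sketch below computes a behavioral-cloning loss over a mixed batch. It is a simplified stand-in: mean squared error on actions replaces the flow-matching objective used by the reported system, and `bc_loss`, the sample layout, and the constant policy are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

# (instruction, observation, target action)
Sample = Tuple[str, List[float], List[float]]
Policy = Callable[[str, List[float]], List[float]]


def bc_loss(policy: Policy, batch: List[Sample]) -> float:
    """Mean squared error between predicted and target actions (stand-in loss)."""
    total, count = 0.0, 0
    for instruction, obs, target in batch:
        pred = policy(instruction, obs)
        total += sum((p - a) ** 2 for p, a in zip(pred, target))
        count += len(target)
    return total / count


current = [("put eggplant in bowl", [0.0, 1.0], [0.5, 0.5])]   # real demonstration
replayed = [("put carrot in bowl", [1.0, 0.0], [0.0, 1.0])]    # generated replay
mixed = current + replayed  # every sample keeps its own task instruction

rng = random.Random(0)
batch = rng.sample(mixed, k=2)
loss = bc_loss(lambda instruction, obs: [0.5, 0.5], batch)
```

The same loss is applied to real and generated samples; the only thing that distinguishes old-task supervision is the instruction attached to each replayed sample.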

The reported instantiation uses Cosmos-Policy as the WAM, initialized from Cosmos-Predict2-2B. Cosmos-Policy is built on a latent video diffusion model. Visual observations are tokenized with the Wan2.1 spatiotemporal VAE tokenizer, language is encoded with a pretrained T5-XXL encoder, and actions and proprioception are normalized to [range missing] and converted into latent frames. Training uses a flow-matching diffusion objective to jointly denoise action chunk, future observation, and reward latents (Govind et al., 25 Jun 2026).

Implementation details are consequential because replay quality depends on rollout stability. The reported system uses image resolution [value missing], action dimension [value missing], proprioception dimension [value missing], and action chunk size [value missing]. Base policy training runs for 10K iterations; each continual adaptation stage fine-tunes for 2K iterations. Replay generation uses 10 pseudo-trajectories per previous task unless otherwise stated. Optimization uses Adam with peak learning rate [value missing], per-GPU batch size [value missing], 4 GPUs, and gradient accumulation [value missing], with random crop, color jitter, and Gaussian blur as augmentations. Inference uses 5 denoising steps for actions and 1 denoising step for observations and value (Govind et al., 25 Jun 2026).

Relative to prior generative replay systems, REGEN trades data storage for generation compute. This is structurally similar to earlier findings in image replay, where avoiding exemplar buffers requires repeated synthetic sample generation at each task transition (Wu et al., 2018). A plausible implication is that replay cost in robotics is more acute because diffusion-based rollout is substantially heavier than class-conditional image sampling, although the paper itself frames the tradeoff in terms of storage reduction versus generation overhead (Govind et al., 25 Jun 2026).

5. Evaluation protocol and empirical performance

Performance is evaluated with continual-learning metrics over task success rates: forward transfer (FWT),

[equation missing]

negative backward transfer (NBT),

[equation missing]

and area under the curve (AUC),

[equation missing]

Higher FWT and AUC are better, whereas lower NBT indicates less forgetting (Govind et al., 25 Jun 2026).
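As a rough illustration of how such metrics are computed, the sketch below uses simplified definitions in the style commonly used with the LIBERO benchmark. This is an assumption: the paper's exact formulas may differ (for example, forward transfer is often averaged over several training checkpoints, which is omitted here). The function name and the success matrix are hypothetical.

```python
from typing import Dict, List


def continual_metrics(c: List[List[float]]) -> Dict[str, float]:
    """c[i][k] is the success rate on task k after training stage i (used for i >= k).

    Simplified, LIBERO-style definitions (an assumption):
      FWT_k = c[k][k]
      NBT_k = mean over later stages i of (c[k][k] - c[i][k])
      AUC_k = mean of c[i][k] over stages i >= k
    Each metric is then averaged over tasks.
    """
    num_tasks = len(c)
    if num_tasks < 2:
        raise ValueError("need at least two tasks to measure forgetting")

    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs)

    fwt = [c[k][k] for k in range(num_tasks)]
    nbt = [mean([c[k][k] - c[i][k] for i in range(k + 1, num_tasks)])
           for k in range(num_tasks - 1)]  # the last task has no later stage
    auc = [mean([c[i][k] for i in range(k, num_tasks)])
           for k in range(num_tasks)]
    return {"FWT": mean(fwt), "NBT": mean(nbt), "AUC": mean(auc)}


# Toy success matrix (fractions): each task is learned perfectly, then decays.
success = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.5, 1.0],
]
metrics = continual_metrics(success)
```

In this toy matrix every task is learned perfectly when introduced, so FWT is high, while the decay in each column drives NBT up and AUC down.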

The simulation benchmark is LIBERO, using the Spatial, Object, and Goal suites. Each suite contains 10 tasks and 50 human teleoperated demonstrations per task. The continual protocol pretrains on 6 tasks and then introduces the remaining 4 tasks sequentially. Real-world experiments use an xArm7 with a wrist-mounted gripper camera and a third-person RGB-D camera on three tasks learned in sequence: Put carrot in bowl, Put carrot on plate, and Put eggplant in bowl, with 50 teleoperated demonstrations per task at 15 Hz (Govind et al., 25 Jun 2026).

The empirical pattern is consistent across benchmarks: sequential fine-tuning forgets sharply, REGEN substantially reduces that forgetting, and privileged replay methods remain stronger because they use real or simulator-grounded replay.

| Benchmark | Seq-FT FWT | Seq-FT NBT | Seq-FT AUC | ReGen FWT | ReGen NBT | ReGen AUC |
|---|---|---|---|---|---|---|
| LIBERO-Object | 92.7 | 82.6 | 24.9 | 95.3 | 26.1 | 65.5 |
| LIBERO-Goal | 90.6 | 100 | 10.3 | 90.6 | 44.9 | 40.8 |
| LIBERO-Spatial | 87.4 | 99.8 | 10.8 | 87.2 | 17.6 | 76.9 |
| Real world | 50 | 96.3 | 13.8 | 80 | 60.5 | 53.8 |
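The relative reduction in forgetting can be read off the table with a short calculation. The NBT values below are copied from the table; the variable names are illustrative, and "relative reduction" is taken here to mean the drop in NBT divided by the Seq-FT value, which is an assumption about how such a percentage would be defined.

```python
# NBT values copied from the table above: (Seq-FT, ReGen).
nbt = {
    "LIBERO-Object": (82.6, 26.1),
    "LIBERO-Goal": (100.0, 44.9),
    "LIBERO-Spatial": (99.8, 17.6),
    "Real world": (96.3, 60.5),
}

# Relative reduction in NBT, in percent of the Seq-FT value.
relative_reduction = {
    name: 100.0 * (seq_ft - regen) / seq_ft
    for name, (seq_ft, regen) in nbt.items()
}

for name, pct in relative_reduction.items():
    print(f"{name}: NBT reduced by {pct:.1f}% relative to Seq-FT")
```

Under this definition the reductions come to roughly 68% (Object), 55% (Goal), 82% (Spatial), and 37% (real world).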

These results support the paper's headline claim that REGEN reduces catastrophic forgetting by up to [value missing] relative to sequential fine-tuning, and in the LIBERO-Object and LIBERO-Goal suites the reduction in NBT is greater than half (Govind et al., 25 Jun 2026). REGEN does not match experience replay or Rollouts-as-Replay, but it narrows the gap without storing original demonstrations. The paper explicitly treats experience replay as a privileged upper bound because it violates the no-stored-demonstration assumption, and Rollouts-as-Replay as privileged in simulation because it requires simulator interaction with previous-task policies (Govind et al., 25 Jun 2026).

The work also reports representation-level evidence. Mean [norm missing] drift of action latent centroids reaches up to [value missing] under sequential fine-tuning, [value missing] under ReGen, and [value missing] under experience replay, indicating that generated replay partially preserves internal action geometry. On LIBERO-Object, 5 replays per task yield FWT [value missing], NBT [value missing], AUC [value missing], while 10 replays per task yield FWT [value missing], NBT [value missing], AUC [value missing], suggesting modest benefit from additional replay diversity (Govind et al., 25 Jun 2026).
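A centroid-drift measurement of this kind can be sketched as below. The details are assumptions: Euclidean distance is used as the norm, one centroid is computed per task, and drift is averaged over tasks. The function names and the toy latents are hypothetical.

```python
import math
from typing import Dict, List

Latents = Dict[str, List[List[float]]]  # task name -> list of action latent vectors


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_centroid_drift(before: Latents, after: Latents) -> float:
    """Average Euclidean distance between per-task centroids before and after adaptation."""
    drifts = [math.dist(centroid(before[task]), centroid(after[task]))
              for task in before]
    return sum(drifts) / len(drifts)


# Toy latents: task_a moves from centroid (1, 0) to (4, 4); task_b does not move.
before = {"task_a": [[0.0, 0.0], [2.0, 0.0]], "task_b": [[1.0, 1.0]]}
after = {"task_a": [[4.0, 4.0], [4.0, 4.0]], "task_b": [[1.0, 1.0]]}
drift = mean_centroid_drift(before, after)
```

A smaller value means the old-task action representations stayed closer to where they were before the new task was learned.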

6. Relation to prior replay paradigms, limitations, and open questions

REGEN sits at the intersection of several replay lineages. Like classic generative replay, it replaces stored data with pseudo-examples generated from a learned model (Lesort et al., 2018). Like MeRGAN, it uses a generator-like object as a continually updated memory that must preserve previous distributions while absorbing new ones (Wu et al., 2018). Unlike feature-space replay methods, however, it does not replay high-level embeddings or penultimate activations; its replay signal is executable observation-action trajectory data (Liu et al., 2020). Unlike offline self-recovery systems that feed generated samples back through a VAE-classifier architecture after training, REGEN performs replay during continual adaptation itself and does so in a recurrent visuomotor rollout regime rather than in a purely static generative setting (Zhou et al., 2023).

The principal limitations identified in the REGEN study are long-horizon visual degradation and action-observation inconsistency. Generated observations become progressively blurrier and more artifact-ridden over rollout horizon and across continual stages, and because those observations are recursively fed back, errors compound. The paper reports that goal-reward termination improves replay fidelity relative to fixed-horizon rollout: PSNR is [value missing] for fixed horizon [value missing], [value missing] for [value missing], and [value missing] under goal-reward termination (Govind et al., 25 Jun 2026). This diagnosis echoes earlier continual-generation findings that replay errors can accumulate and snowball over task sequences when the generator itself is imperfect (Lesort et al., 2018).
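For reference, PSNR between a real frame and a generated frame follows the standard formula 10 * log10(MAX^2 / MSE). The sketch below applies it to flattened pixel lists. The function name and toy pixel values are illustrative, and how the paper aggregates PSNR over frames and views is not specified here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy frames with pixel values in [0, 1]; every pixel is off by 0.1.
real_frame = [0.0, 0.5, 1.0, 0.5]
generated_frame = [0.1, 0.6, 0.9, 0.4]
score = psnr(real_frame, generated_frame)
```

Higher PSNR means the generated frame is closer to the reference, so a drop in PSNR along the rollout is one way to quantify the compounding degradation described above.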

Action-observation inconsistency is a distinct bottleneck. On LIBERO-Goal Stage 1, imagined success rate is [value missing], whereas grounded success rate obtained by executing the predicted actions in the simulator is [value missing]. The model can therefore generate visually plausible future frames while producing actions that would not physically realize those futures (Govind et al., 25 Jun 2026). This distinguishes REGEN from conventional sample replay failures, where the central issue is often realism or diversity alone; in robot imitation learning, replay must also preserve action-state consistency.

The broader significance of REGEN is that it turns a generative robot policy into an implicit memory system. Prior replay work already suggested that the generator can serve as memory instead of a stored dataset (Wu et al., 2018). REGEN extends that principle to embodied sequential control, where the memory is not just a class-conditional data model but a world-conditioned action model capable of synthesizing trajectories from instructions and sensory context. The remaining performance gap to privileged replay suggests that future progress depends less on the continual-learning wrapper than on stronger WAMs: specifically, models with improved long-horizon visual fidelity, tighter alignment between imagined observations and executable actions, and more stable recurrent rollout dynamics (Govind et al., 25 Jun 2026).
