Recurrent Generative Replay (ReGen)

Updated 2 July 2026
  • Recurrent Generative Replay (ReGen) is a continual imitation learning framework that uses pretrained World Action Models to generate pseudo-replay trajectories without storing original demonstrations.
  • It employs a recurrent rollout mechanism conditioned on language instructions and real observations to recreate prior task trajectories, thus addressing the challenge of catastrophic forgetting.
  • Empirical evaluations reveal that ReGen significantly reduces negative backward transfer and approaches the performance of methods using real demonstration data in both simulation and real-robot tasks.

Recurrent Generative Replay (ReGen) is a continual imitation learning framework that leverages the generative capabilities of pretrained World Action Models (WAMs) to synthesize pseudo-replay trajectories, enabling robot policies to rehearse previously learned tasks without the need to store original human demonstrations. By employing WAMs as implicit replay buffers, ReGen addresses the challenge of catastrophic forgetting inherent in sequential Behavioral Cloning (BC) and circumvents the impracticality of storing large or proprietary datasets. The framework enables continual adaptation by recursively generating trajectory rollouts for previous tasks conditioned only on task instructions and current observations. Empirical evaluations demonstrate substantial reductions in forgetting relative to sequential fine-tuning, with performance approaching that of privileged experience replay methods that rely on real demonstration data (Govind et al., 25 Jun 2026).

1. Formal Definition and Continual Imitation Learning Paradigm

In standard continual imitation learning, a policy is optimized on a stream of tasks via BC, but suffers catastrophic forgetting due to the absence of prior-task data. Experience replay methods directly interleave stored demonstrations from all tasks, which is often infeasible at scale. ReGen utilizes pretrained WAMs—models that jointly predict future observations, actions, and scalar rewards—to generate pseudo-demonstrations for all prior tasks without storing ground-truth data.

The WAM policy is given by

$$\pi_\theta : (o_t,\,\ell) \to (a_{t:t+H},\,o_{t+H},\,r_t),$$

where $o_t$ is the current observation (visual and proprioceptive), $\ell$ is a language instruction, $a_{t:t+H}$ is an $H$-step action chunk, $o_{t+H}$ is a predicted future observation, and $r_t \in [0,1]$ is task progress. During adaptation to a new task $T_k$, ReGen operates as follows:

  • For each prior task $i$, the WAM is conditioned on instruction $\ell_i$.
  • A real current-task observation seeds the recurrent rollout.
  • The model recursively generates its own next observation and feeds it back as input for the next rollout step, yielding full pseudo-trajectories for replay.
  • Pseudo-demonstrations for all prior tasks are combined with current-task demonstrations, and BC is performed using this replay-augmented set.

Through this process, ReGen approximates the trajectory distributions of all previous tasks without storing any original demonstrations (Govind et al., 25 Jun 2026).
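The adaptation steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_replay_augmented_set`, `toy_rollout`, and the float-valued observations and actions are hypothetical stand-ins, and the default of 5 rollouts per prior task simply mirrors the replay-sample count mentioned in the ablations.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real system would use images, proprioception,
# and multi-dimensional action chunks instead of plain floats.
Observation = List[float]
Step = Tuple[Observation, str, float]  # (observation, instruction, action)
RolloutFn = Callable[[Observation, str], List[Step]]


def build_replay_augmented_set(
    current_demos: List[Step],
    prior_instructions: List[str],
    seed_observation: Observation,
    generate_rollout: RolloutFn,
    rollouts_per_task: int = 5,
) -> List[Step]:
    """Combine real current-task demos with generated prior-task replays."""
    dataset = list(current_demos)
    for instruction in prior_instructions:
        for _ in range(rollouts_per_task):
            # Each rollout is seeded by a real current-task observation and
            # conditioned only on the prior task's language instruction.
            dataset.extend(generate_rollout(seed_observation, instruction))
    return dataset


def toy_rollout(observation: Observation, instruction: str) -> List[Step]:
    """Placeholder for a WAM rollout: always returns two fake steps."""
    return [(observation, instruction, 0.0), (observation, instruction, 1.0)]


current_demos = [([0.0], "pick up the mug", 0.5)]
augmented = build_replay_augmented_set(
    current_demos,
    prior_instructions=["open the drawer", "stack the blocks"],
    seed_observation=[0.0],
    generate_rollout=toy_rollout,
    rollouts_per_task=2,
)
print(len(augmented))  # 1 real step + 2 tasks * 2 rollouts * 2 steps
```

Note that no prior-task demonstration appears anywhere in the inputs: only the instructions for earlier tasks are passed in.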

2. World Action Model Architecture and Generative Rollout Mechanism

The WAM instantiation in ReGen utilizes Cosmos-Policy, based on Cosmos-Predict2-2B, structured as follows:

  • Visual Encoder: Wan2.1 spatiotemporal VAE tokenizer encodes RGB images from third-person and wrist cameras, as well as proprioceptive state, into a latent representation.
  • Language Encoder: T5-XXL transformer converts the instruction $\ell$ into a fixed-dimensional embedding.
  • Recurrent Diffusion Module: A U-Net-style diffusion model processes the observation latents and the instruction embedding, denoising into latent actions, a latent code for the predicted observation, and a scalar reward $r_t$.
  • Decoders: Mirror the encoder stages to reconstruct pixel-level images from latent codes and decode latent actions into joint-space commands.

Inference proceeds in two phases:

  • Initialization (first step): Real observations are used to initialize the rollout and generate the first action chunk.
  • Recurrent Generation (subsequent steps): The model's predicted observation $o_{t+H}$ is used as the next input, recursively rolling out future trajectory steps.

This structure enables the generation of pseudo-trajectories for arbitrary prior-task instructions, supporting implicit replay without data storage (Govind et al., 25 Jun 2026).
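The two inference phases can be illustrated with a small loop. This is a sketch under assumptions, not Cosmos-Policy code: `recurrent_rollout` and `toy_wam` are hypothetical names, the WAM is modelled as a plain callable returning `(action_chunk, predicted_observation, reward)`, observations are floats, and only the first action of each chunk is kept. The reward threshold of 0.9 is an illustrative value.

```python
from typing import Callable, List, Tuple

# (action_chunk, predicted_observation, progress_reward); all hypothetical types.
WamOutput = Tuple[List[float], float, float]
Wam = Callable[[float, str], WamOutput]


def recurrent_rollout(
    wam: Wam,
    real_observation: float,
    instruction: str,
    max_chunks: int = 20,
    reward_threshold: float = 0.9,
) -> List[Tuple[float, float]]:
    """Roll out a pseudo-trajectory of (observation, first_action) pairs."""
    trajectory = []
    observation = real_observation  # Phase 1: seed with a real observation.
    for _ in range(max_chunks):
        action_chunk, predicted_obs, reward = wam(observation, instruction)
        trajectory.append((observation, action_chunk[0]))
        if reward > reward_threshold:
            break  # Predicted task progress exceeds the threshold.
        observation = predicted_obs  # Phase 2: feed the prediction back in.
    return trajectory


def toy_wam(observation: float, instruction: str) -> WamOutput:
    """Placeholder model: progress grows by 0.25 per chunk."""
    predicted = observation + 0.25
    return [0.1, 0.1, 0.1], predicted, min(1.0, predicted)


trajectory = recurrent_rollout(toy_wam, 0.0, "open the drawer")
print([obs for obs, _ in trajectory])  # [0.0, 0.25, 0.5, 0.75]
```

Only the first entry of the trajectory contains a real observation; every later observation is the model's own prediction, which is why visual errors can compound over long rollouts.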

3. Mathematical Formulation and Training Objectives

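The joint generative formulation for pseudo-replay uses:

  • For WAM inference: at each rollout step the WAM is queried with the current input observation and the prior-task instruction $\ell_i$,

$$(a_{t:t+H},\, o_{t+H},\, r_t) = \pi_\theta(o_t,\, \ell_i),$$

where the input $o_t$ is the real seed observation at the first step and the model's own predicted observation at every later step, and pseudo-trajectory construction yields a sequence of observation-action pairs

$$\{(o_t,\, a_t)\}_t,$$

where $a_t$ denotes the first action in each chunk.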

  • Replay-Augmented Behavioral Cloning:

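Let $\mathcal{D}_k$ denote the real demos for the current task $T_k$ and $\tilde{\mathcal{D}}_{<k}$ the generated replays for prior tasks, with combined dataset $\mathcal{D} = \mathcal{D}_k \cup \tilde{\mathcal{D}}_{<k}$. The BC objective is minimized:

$$\min_\theta \; \mathbb{E}_{(o_t,\,\ell,\,a_{t:t+H}) \sim \mathcal{D}} \left[ \mathcal{L}\big(\pi_\theta(o_t, \ell),\, a_{t:t+H}\big) \right],$$

where $\mathcal{L}$, applied to the policy's predicted actions, is mean-squared error (continuous actions) or cross-entropy (discrete).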

  • Pretraining Objective: Jointly optimizes a loss on actions, a diffusion-based flow-matching loss for observation latents, and a regression loss for rewards.

Pseudocode Summary: At each continual stage $k$, pseudo-replays for all previous tasks are generated by WAM rollouts and combined with current-task data, and the policy is fine-tuned using BC. Replay generation terminates either at a maximum horizon or when the WAM's predicted reward exceeds a preset threshold.
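The replay-augmented BC objective for continuous actions can be sketched as a mean-squared error averaged over the union of real and generated samples. The names `bc_loss`, `mse`, and `toy_policy` are hypothetical, and the policy is reduced to a function returning only an action chunk; a real WAM would also emit observation and reward predictions.

```python
from typing import Callable, List, Sequence, Tuple

# (observation, instruction, target action chunk); hypothetical flat types.
Sample = Tuple[Sequence[float], str, Sequence[float]]
Policy = Callable[[Sequence[float], str], List[float]]


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean-squared error between two equal-length action chunks."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def bc_loss(
    policy: Policy,
    real_demos: List[Sample],
    generated_replays: List[Sample],
) -> float:
    """Average action loss over real demos plus generated pseudo-replays."""
    combined = list(real_demos) + list(generated_replays)
    return sum(mse(policy(o, l), a) for o, l, a in combined) / len(combined)


def toy_policy(observation: Sequence[float], instruction: str) -> List[float]:
    """Placeholder policy: doubles each observation entry."""
    return [2.0 * x for x in observation]


real_demos = [([1.0, 2.0], "pick up the mug", [2.0, 4.0])]      # zero error
replays = [([1.0, 1.0], "open the drawer", [2.0, 0.0])]          # MSE of 2.0
loss = bc_loss(toy_policy, real_demos, replays)
print(loss)  # 1.0
```

Real and generated samples are weighted equally here; any other mixing ratio would be a design choice that the text above does not specify.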

4. Experimental Protocols and Baseline Comparisons

ReGen was evaluated across diverse imitation learning settings:

  • Simulation (LIBERO):
    • Three suites (Spatial, Object, Goal) with 10 tasks and 50 demos each.
    • Offline pretraining on first 6 tasks; remaining introduced sequentially with 2K BC steps per stage.
    • Baselines: Sequential Fine-Tuning (Seq-FT), LoRA adapters (Seq-LoRA), EWC, PackNet, Experience Replay (ER; privileged real-demos), and RAR (rollout of frozen policies in simulator).
    • Metrics: Forward Transfer (FWT), Negative Backward Transfer (NBT), and AUC on success rates.
  • Real-Robot (xArm7):
    • Three sequential pick-and-place tasks (50 teleop demos each).
    • Each stage: 10 randomized evals per task; scores based on contact and correct placement.
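The three metrics can be computed from a matrix of success rates recorded after every stage. The formulas are not spelled out above, so this sketch assumes simplified definitions in the style of the LIBERO benchmark: FWT is the success on a task right after it is learned, NBT is the average drop on that task over later stages, and AUC averages the task's success over its own stage and all later ones. `continual_metrics` is a hypothetical helper name.

```python
from typing import Dict, List


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def continual_metrics(success: List[List[float]]) -> Dict[str, float]:
    """success[s][k] is the success rate on task k after stage s (s >= k)."""
    num_tasks = len(success)
    fwt, nbt, auc = [], [], []
    for k in range(num_tasks):
        just_learned = success[k][k]
        later = [success[s][k] for s in range(k + 1, num_tasks)]
        fwt.append(just_learned)
        if later:  # The last task has no later stages, so no NBT term.
            nbt.append(_mean([just_learned - c for c in later]))
        auc.append((just_learned + sum(later)) / (len(later) + 1))
    return {"FWT": _mean(fwt), "NBT": _mean(nbt), "AUC": _mean(auc)}


# Toy example: three tasks, each learned perfectly, then partly forgotten.
success = [
    [1.0],
    [0.5, 1.0],
    [0.0, 0.5, 1.0],
]
metrics = continual_metrics(success)
print(metrics)
```

In the toy matrix every task is solved when first learned (FWT of 1.0), yet earlier tasks degrade at later stages, which shows up as positive NBT and a reduced AUC.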

5. Empirical Results and Analysis

Results demonstrate efficacy of ReGen in mitigating catastrophic forgetting:

  • Simulation (LIBERO Object/Goal):
    • Seq-FT exhibits near-complete forgetting (NBT ≈ 80–100%).
    • ER achieves NBT < 10%, AUC > 90%.
    • ReGen reduces NBT by >50% vs Seq-FT (e.g., NBT from 82.6→26.1 on Object), AUC up to ~65.5%, approaching ER.
  • Spatial Tasks (LIBERO):
    • ReGen† (with object re-placement) attains AUC ≈ 76.9 vs ER 87.8.
  • Real-Robot:
    • Seq-FT: FWT = 50, NBT = 96.3, AUC = 13.8
    • ReGen: FWT = 80, NBT = 60.5, AUC = 53.8
  • Representation Stability: Action-latent drift after Stage 1: Seq-FT ≈ 0.30, ReGen ≈ 0.12, ER ≈ 0.04.
  • Trajectory Fidelity: Hallucinated ReGen trajectories closely match ground truth in XY-space; Seq-FT trajectories diverge substantially.

6. Algorithmic Limitations and Ablation Studies

Observed bottlenecks include:

  • Long-Horizon Visual Degradation: PSNR of hallucinated frames drops monotonically over continual learning stages, compounding recursive blur and correlating with increased NBT. This suggests visual fidelity of pseudo-replay is a limiting factor over longer horizons.
  • Action-Observation Inconsistency: The gap between high imagined success rates (from the predicted reward $r_t$) and much lower grounded success (from the generated actions $a_{t:t+H}$ executed in simulation) exposes cases where visual rollouts misrepresent action efficacy.
  • Ablation Findings:
    • Doubling replay samples per task (5→10) slightly reduces NBT with comparable FWT/AUC.
    • Reward-based trajectory termination yields higher PSNR (20.3 dB) compared to fixed horizons (19.5/18.4 dB), indicating better quality of hallucinated data.
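PSNR, the fidelity measure used in these ablations, follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below applies it to flat lists of pixel intensities in $[0, 1]$; the `psnr` helper and the toy pixel values are illustrative assumptions, not data from the evaluation.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    generated: Sequence[float],
    max_value: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length images."""
    squared_errors = [(r - g) ** 2 for r, g in zip(reference, generated)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # Identical images: PSNR is unbounded.
    return 10.0 * math.log10(max_value ** 2 / mse)


reference_frame = [0.5, 0.5, 0.5, 0.5]
hallucinated_frame = [0.6, 0.4, 0.6, 0.4]  # every pixel is off by 0.1
score = psnr(reference_frame, hallucinated_frame)
print(round(score, 2))  # roughly 20 dB for a uniform error of 0.1
```

Because the scale is logarithmic, the reported difference between 20.3 dB and 18.4 dB corresponds to a noticeably larger mean-squared pixel error in the fixed-horizon rollouts.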

7. Practical Deployment Considerations

Several factors affect real-world utility:

  • Scalability: Cosmos-Predict2-2B contains ~2B parameters and requires multi-GPU compute (reported setup: 4 GPUs, batch size 40, gradient accumulation 12).
  • Inference Efficiency: Diffusion steps are reduced to 5 for actions and 1 for observations to balance fidelity against latency; generating 10 trajectories of length $O(100)$ is feasible offline.
  • Data Security and Storage: Only the language instructions $\ell_i$ need to be stored, removing the need to retain images or joint-state logs, which alleviates security and storage constraints.
  • Integration Paradigm: Pseudo-trajectories are generated off-robot and policy updates occur between deployment stages.
  • Assumptions: Success depends on WAM generative capacity, with future progress in diffusion fidelity and action-observation alignment expected to further approach real-data replay performance.

In summary, ReGen leverages the generative modeling capabilities of WAMs to function as implicit memory, offering a scalable and secure approach to continual imitation learning without the need to store demonstrations, achieving substantial reductions in catastrophic forgetting (Govind et al., 25 Jun 2026).
