Recurrent Generative Replay (ReGen)
- Recurrent Generative Replay (ReGen) is a continual imitation learning framework that uses pretrained World Action Models to generate pseudo-replay trajectories without storing original demonstrations.
- It employs a recurrent rollout mechanism conditioned on language instructions and real observations to recreate prior task trajectories, thus addressing the challenge of catastrophic forgetting.
- Empirical evaluations reveal that ReGen significantly reduces negative backward transfer and approaches the performance of methods using real demonstration data in both simulation and real-robot tasks.
Recurrent Generative Replay (ReGen) is a continual imitation learning framework that leverages the generative capabilities of pretrained World Action Models (WAMs) to synthesize pseudo-replay trajectories, enabling robot policies to rehearse previously learned tasks without the need to store original human demonstrations. By employing WAMs as implicit replay buffers, ReGen addresses the challenge of catastrophic forgetting inherent in sequential Behavioral Cloning (BC) and circumvents the impracticality of storing large or proprietary datasets. The framework enables continual adaptation by recursively generating trajectory rollouts for previous tasks conditioned only on task instructions and current observations. Empirical evaluations demonstrate substantial reductions in forgetting relative to sequential fine-tuning, with performance approaching that of privileged experience replay methods that rely on real demonstration data (Govind et al., 25 Jun 2026).
1. Formal Definition and Continual Imitation Learning Paradigm
In standard continual imitation learning, a policy is optimized on a stream of tasks via BC, but suffers catastrophic forgetting due to the absence of prior-task data. Experience replay methods directly interleave stored demonstrations from all tasks, which is often infeasible at scale. ReGen utilizes pretrained WAMs—models that jointly predict future observations, actions, and scalar rewards—to generate pseudo-demonstrations for all prior tasks without storing ground-truth data.
The WAM policy is given by
$$\pi_\theta\left(\hat{a}_{t:t+H},\ \hat{o}_{t+H},\ \hat{r}_t \mid o_t,\ \ell\right)$$
where $o_t$ is the current observation (visual and proprioceptive), $\ell$ is a language instruction, $\hat{a}_{t:t+H}$ is an $H$-step action chunk, $\hat{o}_{t+H}$ is a predicted future observation, and $\hat{r}_t$ is task progress. During adaptation to a new task $k$, ReGen operates as follows:
- For each prior task $i < k$, the WAM is conditioned on instruction $\ell_i$.
- A real current-task observation $o_0$ seeds the recurrent rollout.
- The model recursively generates its own next observation, using this as input for further rollout, to yield full pseudo-trajectories $\hat{\tau}_i$ for replay.
- Pseudo-demonstrations for all prior tasks are combined with current-task demonstrations, and BC is performed using this replay-augmented set.
Through this process, ReGen approximates the trajectory distributions of all previous tasks without storing any original demonstrations (Govind et al., 25 Jun 2026).
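The steps above can be sketched as a small dataset-assembly routine. This is a minimal illustration rather than the authors' implementation: `build_replay_augmented_set`, the `generate_pseudo_trajectory` callback, and the `(observation, instruction, action)` tuple layout are all hypothetical names chosen for the example, and the default of 5 rollouts per task is only a placeholder.

```python
import random
from typing import Callable, List, Optional, Tuple

# A transition is (observation, instruction, action); the layout is illustrative.
Transition = Tuple[list, str, list]


def build_replay_augmented_set(
    current_demos: List[Transition],
    prior_instructions: List[str],
    seed_observations: List[list],
    generate_pseudo_trajectory: Callable[[list, str], List[Transition]],
    rollouts_per_task: int = 5,
    rng: Optional[random.Random] = None,
) -> List[Transition]:
    """Combine current-task demos with generated pseudo-replay for prior tasks."""
    rng = rng or random.Random(0)
    dataset = list(current_demos)
    for instruction in prior_instructions:
        for _ in range(rollouts_per_task):
            # Each rollout is seeded with a real observation from the current task.
            seed_obs = rng.choice(seed_observations)
            dataset.extend(generate_pseudo_trajectory(seed_obs, instruction))
    # Shuffle so BC minibatches mix real and generated samples.
    rng.shuffle(dataset)
    return dataset
```

Only the prior-task instructions and a generator callback are needed here; no prior-task demonstrations appear among the inputs, which is the property the framework relies on.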
2. World Action Model Architecture and Generative Rollout Mechanism
The WAM instantiation in ReGen utilizes Cosmos-Policy, based on Cosmos-Predict2-2B, structured as follows:
- Visual Encoder: Wan2.1 spatiotemporal VAE tokenizer encodes RGB images from third-person and wrist cameras, as well as proprioceptive state, into a latent representation $z_t$.
- Language Encoder: T5-XXL transformer converts the instruction $\ell$ into a fixed-dimensional embedding $e_\ell$.
- Recurrent Diffusion Module: A U-Net-style diffusion model processes inputs $(z_t, e_\ell)$, denoising into latent actions $z^{a}_{t}$, a latent code for the predicted observation $z^{o}_{t+H}$, and a scalar reward $\hat{r}_t$.
- Decoders: Mirror the encoder stages to reconstruct pixel-level images from latent codes and decode latent actions into joint-space commands.
Inference proceeds in two phases:
- Initialization (for $t = 0$): Real observations $o_0$ are used to initialize and generate the first action chunk.
- Recurrent Generation ($t > 0$): The model's predicted observation $\hat{o}_t$ is used as the next input, recursively rolling out future trajectory steps.
This structure enables the generation of pseudo-trajectories for arbitrary prior-task instructions, supporting implicit replay without data storage (Govind et al., 25 Jun 2026).
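The two inference phases can be illustrated with a short sketch. It assumes a hypothetical `wam_step(observation, instruction)` callable that returns an action chunk, a predicted observation, and a progress estimate; `toy_wam_step` is a stand-in with hand-written dynamics, not a learned model or the Cosmos-Policy API.

```python
from typing import Callable, List, Tuple

# Hypothetical WAM interface:
# (observation, instruction) -> (action_chunk, predicted_observation, progress)
WamStep = Callable[[list, str], Tuple[List[list], list, float]]


def recurrent_rollout(
    wam_step: WamStep, real_obs: list, instruction: str, num_chunks: int
) -> List[Tuple[list, List[list], float]]:
    """Roll the model forward, feeding its own predicted observation back in."""
    trajectory = []
    obs = real_obs  # Initialization: the first step uses a real observation.
    for _ in range(num_chunks):
        action_chunk, predicted_obs, progress = wam_step(obs, instruction)
        trajectory.append((obs, action_chunk, progress))
        obs = predicted_obs  # Recurrent generation: reuse the prediction as input.
    return trajectory


def toy_wam_step(obs: list, instruction: str) -> Tuple[List[list], list, float]:
    """Stand-in dynamics for illustration only."""
    action_chunk = [[0.1], [0.1]]
    predicted_obs = [obs[0] + 0.2]
    progress = min(1.0, predicted_obs[0])
    return action_chunk, predicted_obs, progress


rollout = recurrent_rollout(toy_wam_step, [0.0], "pick up the bowl", num_chunks=3)
```

Only the first entry of `rollout` contains a real observation; every later observation is the model's own prediction, which is why prediction errors can compound over long horizons.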
3. Mathematical Formulation and Training Objectives
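The joint generative formulation for pseudo-replay uses:
- For WAM inference:
$$\left(\hat{a}_{t:t+H},\ \hat{o}_{t+H},\ \hat{r}_t\right) \sim \pi_\theta\left(\cdot \mid \tilde{o}_t,\ \ell_i\right)$$
where
$$\tilde{o}_t = \begin{cases} o_0, & t = 0 \\ \hat{o}_t, & t > 0 \end{cases}$$
and pseudo-trajectory construction yields
$$\hat{\tau}_i = \left\{\left(\tilde{o}_t,\ \ell_i,\ \hat{a}_t\right)\right\}_{t = 0, H, 2H, \dots}$$
where $\hat{a}_t$ denotes the first action in each chunk.
- Replay-Augmented Behavioral Cloning:
Let $\mathcal{D}_k$ (real demos) and $\hat{\mathcal{D}}_{<k} = \bigcup_{i<k} \{\hat{\tau}_i\}$ (generated replays), with combined dataset $\mathcal{D}_k^{+} = \mathcal{D}_k \cup \hat{\mathcal{D}}_{<k}$. The BC objective is minimized:
$$\min_\theta\ \mathbb{E}_{(o,\ \ell,\ a) \sim \mathcal{D}_k^{+}} \left[\mathcal{L}\left(\pi_\theta(o, \ell),\ a\right)\right]$$
where $\mathcal{L}$ is mean-squared error (continuous actions) or cross-entropy (discrete).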
- Pretraining Objective: Jointly optimizes an action-prediction loss, a diffusion-based flow-matching loss for observation latents, and a regression loss for rewards.
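For the continuous-action case, the replay-augmented BC loss reduces to a mean-squared error averaged over real and generated samples alike. The sketch below is illustrative only: `bc_mse_loss` and `linear_policy` are hypothetical names, the data are made up, and language conditioning is omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple

# (observation features, target action); illustrative layout.
Sample = Tuple[Sequence[float], Sequence[float]]


def bc_mse_loss(
    policy: Callable[[Sequence[float]], Sequence[float]], dataset: List[Sample]
) -> float:
    """Mean-squared error between policy outputs and target actions."""
    total, count = 0.0, 0
    for obs, target in dataset:
        predicted = policy(obs)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count


# Made-up samples: real demos for the current task plus generated replay.
real_demos = [([1.0, 0.0], [0.5]), ([0.0, 1.0], [-0.5])]
pseudo_replay = [([1.0, 1.0], [0.0])]
combined = real_demos + pseudo_replay


def linear_policy(obs: Sequence[float]) -> List[float]:
    return [0.5 * obs[0] - 0.5 * obs[1]]


loss = bc_mse_loss(linear_policy, combined)
```

The loss treats both sources identically; any weighting between real and generated samples would be an additional design choice not described in the text.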
Pseudocode Summary: At each continual stage $k$, pseudo-replays for all previous tasks are generated by WAM rollouts, combined with current-task data, and used to fine-tune the policy with BC. Replay generation terminates either at a maximum horizon or when the WAM's predicted reward exceeds a preset threshold.
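The termination rule can be expressed as a small truncation helper. This is a sketch under assumptions: `truncate_rollout` is a hypothetical name, the threshold value in the example is arbitrary, and keeping the step that crosses the threshold (rather than dropping it) is a choice made here for illustration.

```python
from typing import List, Sequence, TypeVar

Step = TypeVar("Step")


def truncate_rollout(
    steps: Sequence[Step],
    predicted_progress: Sequence[float],
    threshold: float,
    max_horizon: int,
) -> List[Step]:
    """Keep steps until predicted progress exceeds the threshold or the horizon is hit."""
    kept: List[Step] = []
    for step, progress in zip(steps[:max_horizon], predicted_progress):
        kept.append(step)
        if progress > threshold:
            break  # The model predicts the task is complete; stop generating.
    return kept


steps = ["s0", "s1", "s2", "s3"]
progress = [0.10, 0.50, 0.95, 0.99]
by_reward = truncate_rollout(steps, progress, threshold=0.9, max_horizon=10)
by_horizon = truncate_rollout(steps, progress, threshold=0.9, max_horizon=2)
```

Here `by_reward` stops at the third step because its predicted progress crosses the threshold, while `by_horizon` stops after two steps because the horizon cap is reached first.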
4. Experimental Protocols and Baseline Comparisons
ReGen was evaluated across diverse imitation learning settings:
- Simulation (LIBERO):
  - Three suites (Spatial, Object, Goal), each with 10 tasks and 50 demos per task.
  - Offline pretraining on the first 6 tasks; the remaining 4 are introduced sequentially with 2K BC steps per stage.
  - Baselines: Sequential Fine-Tuning (Seq-FT), LoRA adapters (Seq-LoRA), EWC, PackNet, Experience Replay (ER; privileged real-demos), and RAR (rollout of frozen policies in simulator).
  - Metrics: Forward Transfer (FWT), Negative Backward Transfer (NBT), and AUC on success rates.
- Real-Robot (xArm7):
  - Three sequential pick-and-place tasks (50 teleop demos each).
  - Each stage: 10 randomized evals per task; scores based on contact and correct placement.
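The three metrics listed for the simulation experiments can be computed from a matrix of per-stage success rates. The sketch below assumes LIBERO-style definitions, with FWT simplified to the success rate measured immediately after a task is learned; the exact formulas used in the evaluation may differ, and the numbers in the example are invented.

```python
from typing import List, Tuple


def continual_metrics(success: List[List[float]]) -> Tuple[float, float, float]:
    """success[s][k] is the success rate on task k after training stage s (s >= k).

    Returns (FWT, NBT, AUC), each averaged over tasks.
    """
    num_tasks = len(success)
    fwt, nbt, auc = [], [], []
    for k in range(num_tasks):
        after_learning = success[k][k]
        later = [success[s][k] for s in range(k + 1, num_tasks)]
        fwt.append(after_learning)
        if later:
            # Average drop relative to the success rate right after learning task k.
            nbt.append(sum(after_learning - c for c in later) / len(later))
        auc.append((after_learning + sum(later)) / (len(later) + 1))

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    return mean(fwt), mean(nbt), mean(auc)


# Invented success rates for three tasks; entries above the diagonal are unused.
success = [
    [0.8, 0.0, 0.0],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 1.0],
]
fwt, nbt, auc = continual_metrics(success)
```

For this matrix the averages work out to roughly 0.80 (FWT), 0.40 (NBT), and 0.64 (AUC) as fractions; multiplying by 100 gives the percentage scale used in the reported results.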
5. Empirical Results and Analysis
Results demonstrate efficacy of ReGen in mitigating catastrophic forgetting:
- Simulation (LIBERO Object/Goal):
  - Seq-FT exhibits near-complete forgetting (NBT ≈ 80–100%).
  - ER achieves NBT < 10%, AUC > 90%.
  - ReGen reduces NBT by more than 50% relative to Seq-FT (e.g., from 82.6 to 26.1 on Object), with AUC up to ~65.5%, narrowing but not closing the gap to ER.
- Spatial Tasks (LIBERO):
  - ReGen† (with object re-placement) attains AUC ≈ 76.9 vs. 87.8 for ER.
- Real-Robot:
  - Seq-FT: FWT = 50, NBT = 96.3, AUC = 13.8
  - ReGen: FWT = 80, NBT = 60.5, AUC = 53.8
- Representation Stability: Action-latent drift after Stage 1: Seq-FT ≈ 0.30, ReGen ≈ 0.12, ER ≈ 0.04.
- Trajectory Fidelity: Hallucinated ReGen trajectories closely match ground truth in XY-space; Seq-FT trajectories diverge substantially.
6. Algorithmic Limitations and Ablation Studies
Observed bottlenecks include:
- Long-Horizon Visual Degradation: PSNR of hallucinated frames drops monotonically over continual learning stages, compounding recursive blur and correlating with increased NBT. This suggests visual fidelity of pseudo-replay is a limiting factor over longer horizons.
- Action-Observation Inconsistency: A discrepancy between high imagined success rates (judged from the WAM's own predictions) and much lower grounded success (when the generated actions are executed in simulation) exposes cases where visual rollouts misrepresent action efficacy.
- Ablation Findings:
  - Doubling replay samples per task (5→10) slightly reduces NBT with comparable FWT/AUC.
  - Reward-based trajectory termination yields higher PSNR (20.3 dB) compared to fixed horizons (19.5/18.4 dB), indicating better quality of hallucinated data.
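PSNR, the fidelity measure used in the analysis above, follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below applies it to images given as flat pixel lists; the `psnr` helper and the 8-bit default for `max_value` are assumptions made for the example, and the pixel values are invented.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float], generated: Sequence[float], max_value: float = 255.0
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # Identical images have unbounded PSNR.
    return 10.0 * math.log10(max_value ** 2 / mse)


real_frame = [100.0, 100.0, 100.0, 100.0]
hallucinated_frame = [110.0, 90.0, 110.0, 90.0]
score = psnr(real_frame, hallucinated_frame)
```

Because the scale is logarithmic, a gap of about 1 dB, like the one between reward-based termination and the fixed horizons, corresponds to roughly a 20% reduction in mean-squared pixel error.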
7. Practical Deployment Considerations
Several factors affect real-world utility:
- Scalability: Cosmos-Predict2-2B contains ~2B parameters and requires multi-GPU compute; the reported configuration uses 4 GPUs with batch size 40 and gradient accumulation 12.
- Inference Efficiency: Diffusion steps are reduced to 5 for actions and 1 for observations to trade fidelity against latency; generating 10 trajectories of O(100) steps each is feasible offline.
- Data Security and Storage: Only the language instruction tokens $\ell_i$ of prior tasks need to be stored, removing the need to retain images or joint-state logs, which alleviates security and storage constraints.
- Integration Paradigm: Pseudo-trajectories are generated off-robot and policy updates occur between deployment stages.
- Assumptions: Success depends on WAM generative capacity, with future progress in diffusion fidelity and action-observation alignment expected to further approach real-data replay performance.
In summary, ReGen leverages the generative modeling capabilities of WAMs to function as implicit memory, offering a scalable and secure approach to continual imitation learning without the need to store demonstrations, achieving substantial reductions in catastrophic forgetting (Govind et al., 25 Jun 2026).