This paper introduces DREAMGEN, a 4-stage pipeline for training robot policies that generalize across behaviors and environments by using "neural trajectories": synthetic robot data generated by video world models (VWMs). The core problem DREAMGEN addresses is the high cost and labor of collecting teleoperation data for every new task and environment in robot learning, together with the limitations of traditional simulation-based synthetic data (e.g., the sim2real gap).
The DREAMGEN pipeline consists of:
- Video World Model Fine-tuning: A pre-trained video generative model (e.g., WAN2.1) is fine-tuned on a small set of human-teleoperated robot trajectories for a specific robot embodiment. This adapts the model to the robot's kinematics and dynamics, often using Low-Rank Adaptation (LoRA) to retain prior knowledge (a minimal LoRA fine-tuning sketch follows this list).
- Video World Model Rollout: The fine-tuned VWM is prompted with an initial image frame and a language instruction (e.g., "Water the flowers") to generate photorealistic videos of the robot performing the instructed task, potentially in new environments or exhibiting novel behaviors not seen in the fine-tuning data.
- Pseudo Action Labeling: Since the generated videos lack action labels, pseudo-actions are inferred. Two methods are explored:
- Inverse Dynamics Model (IDM): A diffusion transformer model (trained on the same data as the VWM) predicts action chunks given two consecutive image frames from the synthetic video (see the pseudo-labeling sketch after this list).
- Latent Action Pretraining from Videos (LAPA): A transformer encoder-decoder model trained with a VQ-VAE objective extracts latent actions representing the visual delta between frames. This can be trained on diverse robot/human videos without needing ground-truth actions for the target robot.
- Visuomotor Policy Training: The generated video-action sequences (neural trajectories) are used to train visuomotor policies (e.g., Diffusion Policy, π0, GR00T N1). These policies can be co-trained with real trajectories (see the co-training sketch after this list) or, if using IDM actions, trained solely on neural trajectories.
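A minimal sketch of the LoRA fine-tuning step (stage 1), assuming a generic PyTorch video diffusion transformer. The loader `load_pretrained_vwm`, the dataset `RobotClipDataset`, and the `denoising_loss` method are hypothetical placeholders rather than the paper's actual code, and the rank/alpha values are illustrative.

```python
# Sketch: inject trainable low-rank adapters into a frozen pre-trained video world
# model, then fine-tune only the adapter weights on a small set of teleoperated clips.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapters start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(model: nn.Module, rank: int = 16) -> nn.Module:
    """Replace every nn.Linear in the model with a LoRA-wrapped copy (frozen base + low-rank update)."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank=rank))
        else:
            add_lora(child, rank=rank)
    return model


# Hypothetical usage (loader, dataset, and loss method are placeholders):
# vwm = add_lora(load_pretrained_vwm("WAN2.1"))
# optimizer = torch.optim.AdamW([p for p in vwm.parameters() if p.requires_grad], lr=1e-4)
# for video, text in RobotClipDataset("teleop_clips/"):
#     loss = vwm.denoising_loss(video, text)   # standard diffusion training objective
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```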
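A minimal sketch of the IDM-based pseudo-action labeling step (stage 3). The paper's IDM is a diffusion transformer that outputs action chunks; the toy convolutional regressor below is a simplified stand-in meant only to show the interface (frame pair in, action chunk out) and how a generated video is labeled frame by frame. The architecture, action dimension, and chunk horizon are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SimpleIDM(nn.Module):
    """Predicts a chunk of `horizon` actions from a pair of consecutive frames."""

    def __init__(self, action_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.encoder = nn.Sequential(                 # toy encoder for 2 stacked RGB frames
            nn.Conv2d(6, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, horizon * action_dim)

    def forward(self, frame_t, frame_t1):
        feat = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        return self.head(feat).view(-1, self.horizon, self.action_dim)


@torch.no_grad()
def label_neural_trajectory(idm: SimpleIDM, video: torch.Tensor) -> torch.Tensor:
    """video: (T, 3, H, W) frames from a VWM rollout -> (T-1, horizon, action_dim) pseudo-actions."""
    chunks = [idm(video[t:t + 1], video[t + 1:t + 2]) for t in range(video.shape[0] - 1)]
    return torch.cat(chunks, dim=0)


# Example on a dummy 17-frame synthetic video:
idm = SimpleIDM()
pseudo_actions = label_neural_trajectory(idm, torch.rand(17, 3, 224, 224))
print(pseudo_actions.shape)  # torch.Size([16, 16, 7])
```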
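A minimal sketch of the co-training step (stage 4): the policy sees a data stream that mixes the small real teleoperation set with the much larger pool of neural trajectories. The tiny `TensorDataset` stand-ins, the 50/50 sampling ratio, and the sample counts are illustrative assumptions, not the paper's reported settings.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for the two sources: a handful of real trajectories and a large pool of
# neural trajectories (VWM frames + pseudo-action chunks). Shapes are illustrative.
real = TensorDataset(torch.rand(13, 3, 64, 64), torch.rand(13, 16, 7))
neural = TensorDataset(torch.rand(400, 3, 64, 64), torch.rand(400, 16, 7))

# Sample each source with equal probability per batch element, regardless of pool size;
# the 50/50 ratio is an illustrative choice.
weights = torch.cat([
    torch.full((len(real),), 0.5 / len(real)),
    torch.full((len(neural),), 0.5 / len(neural)),
])
loader = DataLoader(
    ConcatDataset([real, neural]),
    batch_size=64,
    sampler=WeightedRandomSampler(weights, num_samples=2_000, replacement=True),
)

for obs, action_chunk in loader:
    # policy.loss(obs, action_chunk) would go here; the policy (e.g., Diffusion Policy,
    # GR00T N1) is trained exactly as with real data, just on the mixed stream.
    pass
```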
Key Experiments and Results:
- Training Data Augmentation:
- Simulation (RoboCasa): Co-training with neural trajectories (up to 333x the original data) showed log-linear improvements in policy performance across different ground-truth data regimes (low, mid, high). Training solely on neural trajectories (IDM actions) achieved a non-trivial 20.6% average success rate across 24 tasks.
- Real-world: On 9 diverse tasks across the Fourier GR1 humanoid, Franka Emika, and SO-100 robots (e.g., folding towels, wiping, hammering, scooping M&Ms), co-training with neural trajectories significantly improved success rates while using only 10-13 real-world trajectories per task. For example, average success rates rose from 37% to 46.4% on GR1 tasks, from 23% to 37% on Franka tasks, and from 21% to 45.5% on SO-100 tasks.
- Unlocking Generalization (GR1 Humanoid):
- Behavior Generalization: A GR1 humanoid with teleoperation data for only a single pick-and-place task (in one environment) was able to perform 22 novel behaviors (e.g., pouring, opening articulated objects, tool use) when its policy was trained solely on neural trajectories for these new tasks. Average success rates improved from 11.2% (the baseline pick-and-place policy, which earns some partial credit) to 43.2% (the DREAMGEN-trained policy) on these new behaviors in seen environments.
- Environment Generalization: The VWM, fine-tuned on data from a single environment, generated videos for 10 new, unseen environments. Policies trained solely on these neural trajectories achieved a 28.5% success rate on both seen behaviors (pick-and-place variants) and novel behaviors in these completely unseen environments (baseline: 0%), demonstrating zero-shot transfer to new environments.
- DREAMGEN BENCH:
- A new video generation benchmark was introduced to evaluate how well VWMs adapt to robot embodiments and generalize to new objects, behaviors, and environments while respecting physics.
- It measures "Instruction Following" (IF, scored with Qwen2.5-VL) and "Physics Alignment" (PA, scored with VideoCon-Physics and Qwen2.5-VL); a scoring sketch follows this list.
- Four VWMs (Hunyuan, CogVideoX, WAN2.1, Cosmos) were evaluated in both zero-shot and fine-tuned settings.
- Crucially, models with higher scores on DREAMGEN BENCH yielded stronger downstream robot policy performance when their generated neural trajectories were used for training, suggesting the benchmark is a good proxy for robotics applicability.
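A minimal sketch of how the Instruction Following metric could be computed: subsample frames from each generated video, ask a VLM judge whether the instructed task was completed, and average the verdicts. The `vlm_judge` callable is a placeholder for an actual Qwen2.5-VL inference call, and the prompt wording and binary scoring are illustrative assumptions rather than the benchmark's exact protocol.

```python
from typing import Callable, List
import numpy as np


def score_instruction_following(
    videos: List[np.ndarray],          # each video: (T, H, W, 3) uint8 frames
    instructions: List[str],
    vlm_judge: Callable[[List[np.ndarray], str], str],
    frames_per_video: int = 8,
) -> float:
    verdicts = []
    for video, instruction in zip(videos, instructions):
        # Uniformly subsample frames so the judge sees the whole rollout.
        idx = np.linspace(0, len(video) - 1, frames_per_video).astype(int)
        prompt = (
            f"The robot was instructed to: '{instruction}'. "
            "Based on these frames, did the robot complete the task? Answer yes or no."
        )
        answer = vlm_judge([video[i] for i in idx], prompt)
        verdicts.append(1.0 if answer.strip().lower().startswith("yes") else 0.0)
    return float(np.mean(verdicts))    # IF score in [0, 1] over the benchmark set


# Usage with a stub judge (replace with a Qwen2.5-VL inference call):
stub = lambda frames, prompt: "yes"
print(score_instruction_following([np.zeros((16, 224, 224, 3), np.uint8)], ["Water the flowers"], stub))
```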
Limitations:
- Tasks are relatively simple.
- Significant computational resources are needed for video generation (e.g., 240k RoboCasa samples took 54 hours on 1500 NVIDIA L40 GPUs).
- Initial frames for video generation are currently manually provided.
- The automatic evaluators in DREAMGEN BENCH can sometimes hallucinate.
Conclusion:
DREAMGEN presents a promising approach to scale robot learning beyond manual data collection by leveraging state-of-the-art video generative models as synthetic data generators. It significantly enhances data augmentation capabilities and, more importantly, unlocks strong generalization to novel behaviors and environments with minimal real-world data. The introduced DREAMGEN BENCH provides a tool to connect video model research with robotics.