DreamGen: Synthetic Data for Robotic Policies

Updated 2 July 2026
  • DreamGen is a four-stage synthetic-data pipeline that uses fine-tuned video world models to generate photorealistic video trajectories for robot policy learning.
  • It couples pseudo-action inference—via inverse dynamics and latent-action techniques—with synthetic roll-outs to enable zero-shot transfer across diverse tasks and environments.
  • Empirical results show significant improvements in behavior, environment generalization, and data efficiency, reducing reliance on extensive manual data collection.

DreamGen is a four-stage synthetic-data pipeline for training generalizable robot policies using video world models—large-scale image-to-video diffusion models adapted for robotics through lightweight fine-tuning. By coupling photorealistic video generation with pseudo-action inference, DreamGen enables efficient policy learning across a combinatorial space of behaviors and environments, substantially reducing reliance on manual data collection. Empirical evidence demonstrates strong performance in behavior and environment generalization, including zero-shot policy transfer to unseen tasks and settings, as well as significant data efficiency benefits compared to conventional approaches (Jang et al., 19 May 2025).

1. Pipeline Structure and Methodology

DreamGen operates via a four-stage pipeline:

Stage 1: Video World Model Fine-tuning

  • Begins with a pre-trained image-to-video diffusion model (e.g., WAN 2.1, Cosmos, CogVideoX, HunyuanVideo) that is conditioned on an initial frame and a language instruction.
  • Fine-tunes only the LoRA parameters ($\Delta\varphi$) while keeping the backbone ($\varphi$) frozen, using a small teleoperated dataset $D_0$ composed of $T$ demonstrations for a single robot behavior in a single environment.
  • The fine-tuning objective is

$$L_\text{fine} = \mathbb{E}_{(x_{1:H}, \tau) \sim D_0}\left[ L_\text{diffusion}(\varphi+\Delta \varphi;\, x_{1:H} \mid \tau) \right]$$

where $\tau$ is the task's language caption and $x_{1:H}$ are the demonstration frames.

  • Instruction-following and physics-following metrics are monitored to prevent overfitting.
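To make the "frozen backbone plus LoRA update" idea in Stage 1 concrete, here is a minimal sketch of a single low-rank-adapted linear layer in plain Python. It is only an illustration of the general LoRA technique, not DreamGen's training code: the class name `LoRALinear`, the helper `matvec`, the `alpha / rank` scaling, and the zero initialization of `B` are assumptions borrowed from common LoRA practice.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W = W  # frozen backbone weight, never updated
        # A gets a small random init, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # backbone path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# Hypothetical 4x4 backbone weight and input.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 4.0]]
x = [1.0, 1.0, 1.0, 1.0]
layer = LoRALinear(W, rank=1)
print(layer.forward(x))        # identical to the frozen layer before any training
print(layer.num_trainable())   # 8 trainable values versus 16 in the full matrix
```

Because `B` starts at zero, the adapted layer reproduces the pre-trained behavior exactly until training moves the low-rank factors, which is one reason this style of fine-tuning is considered lightweight.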

Stage 2: Synthetic Video Roll-outs

  • At deployment, sampling starts from an initial RGB frame $x_0$ and a language instruction $\tau$ (either known or novel).
  • The fine-tuned model generates a photorealistic video trajectory $x_0,\ldots,x_H \sim p_\varphi(x_{1:H} \mid x_0, \tau)$.
  • Repeated sampling covers diverse objects, tasks, and environments, thus enabling large-scale data augmentation.
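The combinatorial coverage in Stage 2 can be pictured as a job grid over initial frames and instructions. The sketch below only enumerates generation jobs; it does not call any video model. The function name `build_rollout_jobs`, the file names, and the `samples_per_pair` parameter are hypothetical, and the instructions are taken from the examples quoted later in this article.

```python
from itertools import product

def build_rollout_jobs(initial_frames, instructions, samples_per_pair=1):
    """Enumerate every (frame, instruction, sample index) combination."""
    return [
        {"frame": frame, "instruction": text, "sample": k}
        for frame, text, k in product(
            initial_frames, instructions, range(samples_per_pair)
        )
    ]

# Hypothetical initial frames; one image per environment.
frames = ["kitchen_01.png", "lab_02.png"]
instructions = ["water the flowers", "open the microwave", "hit the tambourine"]

jobs = build_rollout_jobs(frames, instructions, samples_per_pair=2)
print(len(jobs))  # 2 frames * 3 instructions * 2 samples = 12
```

Each job would then be handed to the fine-tuned world model as one $(x_0, \tau)$ conditioning pair, so the dataset size grows multiplicatively with new frames or new instructions.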

Stage 3: Pseudo-Action Labeling

  • Since only videos are produced, DreamGen infers action sequences $a_0,\ldots,a_{H-1}$ via one of two methods:
    • Inverse Dynamics Model (IDM):
      • $f_\text{IDM}: \mathbb{R}^{W \times H \times 3} \times \mathbb{R}^{W \times H \times 3} \rightarrow \mathbb{R}^d$ maps a pair of RGB frames to a $d$-dimensional action.
      • The IDM parameters are trained with supervision on real demonstrations.
      • Windowed inference yields executable controls for simulators and physical hardware.
    • Latent-Action Model (LAPA):
      • Each frame pair is encoded into a discrete latent action via a VQ-VAE encoder and transformer.
      • The latent action model is trained with reconstruction and commitment losses.
      • This model allows adaptation to novel robot embodiments without requiring real action labels.
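The windowed IDM labeling from Stage 3 can be sketched as a loop that slides over the generated frames and concatenates the predicted action chunks. Everything here is a toy under stated assumptions: frames are reduced to scalar positions, `toy_idm` stands in for a learned network, and the non-overlapping stride is a simplification that may differ from the windowing used in the paper.

```python
def toy_idm(frame_a, frame_b, n_steps):
    """Stand-in for a learned IDM.

    Frames are scalar positions here, and the predicted action chunk
    splits the displacement between them evenly over n_steps.
    """
    step = (frame_b - frame_a) / n_steps
    return [step] * n_steps

def label_video(frames, window=2, idm=toy_idm):
    """Slide a window over the frames and concatenate the action chunks."""
    actions = []
    t = 0
    last = len(frames) - 1
    while t < last:
        end = min(t + window, last)  # the final window may be shorter
        actions.extend(idm(frames[t], frames[end], end - t))
        t = end
    return actions

# Hypothetical 6-frame "video" (H = 5 transitions).
frames = [0.0, 1.0, 2.0, 4.0, 6.0, 7.0]
actions = label_video(frames, window=2)
print(actions)  # [1.0, 1.0, 2.0, 2.0, 1.0]
```

The result has one pseudo-action per frame transition, which is the shape a downstream policy expects when the video is treated as a demonstration.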

Stage 4: Policy Learning on Neural Trajectories

  • Neural trajectories, i.e., generated videos paired with their pseudo-actions, are used to train a visuomotor policy.

  • The policy is trained with a supervised loss on the pseudo-action labels.

  • Policy architectures include Diffusion Policy, $\pi_0$, and GR00T N1.

  • Both pure synthetic and mixed real/synthetic co-training regimes are supported (e.g., 1:1 sampling between real and neural trajectories).
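A 1:1 co-training regime like the one mentioned in Stage 4 can be sketched as a batch sampler that draws half of each batch from the real demonstrations and half from the neural trajectories. The function `mixed_batches` and the integer stand-ins for trajectories are hypothetical; sampling with replacement is an assumption that lets a small real dataset keep pace with a much larger synthetic one.

```python
import random

def mixed_batches(real, neural, batch_size, num_batches, seed=0):
    """Yield batches with an equal share of real and neural samples."""
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        # Sample with replacement so the smaller real set is upsampled.
        batch = [("real", rng.choice(real)) for _ in range(half)]
        batch += [("neural", rng.choice(neural)) for _ in range(batch_size - half)]
        rng.shuffle(batch)
        yield batch

# Hypothetical trajectory ids: 10 real demos, 300 neural trajectories.
real = list(range(10))
neural = list(range(100, 400))

batches = list(mixed_batches(real, neural, batch_size=8, num_batches=3))
for batch in batches:
    n_real = sum(1 for source, _ in batch if source == "real")
    print(n_real, len(batch) - n_real)  # 4 4 for every batch
```

Fixing the ratio per batch, rather than concatenating the two datasets, keeps the real data from being drowned out when the synthetic set is orders of magnitude larger.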

2. DREAMGEN BENCH: Video-Generation Benchmark

DreamGen introduces DREAMGEN BENCH, a two-axis evaluation suite for video world models, designed to correlate video quality with downstream policy effectiveness:

Metric | Evaluation method | Correlation with policy success
--- | --- | ---
Instruction-Following (IF) | VLMs (GPT-4o, Qwen2.5-VL), human-calibrated (0/1 scores) | Pearson correlation with human judgment
Physics Alignment (PA) | VideoCon-Physics + VLM prompt (0/1 scores) | Bench Score correlation with RoboCasa policy success

Bench Scores are computed by aggregating IF and PA over 1000 held-out synthetic robot videos. A higher Bench Score strongly predicts a higher success rate for robot policies trained on the corresponding synthetic data.
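As a rough illustration of how binary IF and PA ratings could be turned into a single score and then compared against policy success, consider the sketch below. The aggregation rule (an unweighted mean of the two pass rates) is an assumption, since the text above does not specify it, and the ratings are made-up toy values rather than benchmark data.

```python
from math import sqrt

def pass_rate(binary_scores):
    """Fraction of videos rated 1."""
    return sum(binary_scores) / len(binary_scores)

def bench_score(if_scores, pa_scores):
    """Assumed aggregation: unweighted mean of the IF and PA pass rates."""
    return 0.5 * (pass_rate(if_scores) + pass_rate(pa_scores))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy 0/1 ratings for four generated videos (illustrative only).
if_scores = [1, 1, 0, 1]
pa_scores = [1, 0, 0, 1]
score = bench_score(if_scores, pa_scores)
print(score)  # 0.625

# Perfectly linear toy data gives a correlation of 1.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In practice one would compute `bench_score` per video world model and pass those scores, together with the matching policy success rates, to `pearson`.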

3. Behavior, Environment, and Data Generalization

Empirical evaluations on RoboCasa and several real robot platforms underline the method's scalability, generalization, and efficiency:

  • Data Augmentation in Simulation:

    • Co-training with up to 240 K neural trajectories improves success rates log-linearly in the number of neural trajectories.
    • With only 7 K real trajectories: baseline average success is 17.4%; with neural trajectories, success improves to 39.9% (IDM) and 57.6% (latent).
    • Training solely on IDM-generated trajectories yields 20.6% success.
  • Real-World Robot Tasks:
    • On nine tasks across the GR1 humanoid (hammering, wiping, folding, stacking), a Franka arm, and an SO-100 arm, co-training with 100–300 neural trajectories per task (alongside only 10–13 real demos) raises success rates: GR1 from 37% to 46.4%, Franka from 23% to 37%, and SO-100 from 21% to 45.5%.
  • Behavior Generalization:
    • Starting from a single pick-and-place teleoperation dataset (2,884 demos), DreamGen prompts for tasks like "water the flowers," "hit the tambourine," "open the microwave"—a total of 14 novel verbs and 22 behaviors.
    • Zero-shot policies trained solely on neural trajectories achieve 43.2% average success (vs. 11.2% baseline) on these held-out behaviors.
  • Environment Generalization:
    • Initial RGB images from 10 previously unseen rooms yield consistent synthetic videos for known and novel behaviors.
    • Policies trained purely on synthetic data reach 28.5% success on both seen-behavior/new-environment and novel-behavior/novel-environment benchmarks (baseline 0%).
  • Data Efficiency:
    • Generalization across both behaviors and environments is enabled by training on only a single teleoperated task in one setting; all others are generated and pseudo-labeled by the pipeline.
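To compare the results above on a common footing, the short script below computes the absolute gain in percentage points for each reported baseline and DreamGen pair. The numbers are copied from the bullets in this section; the dictionary keys are informal labels introduced here for readability.

```python
# (baseline %, with neural trajectories %) as reported in this section.
reported = {
    "RoboCasa co-training (IDM)": (17.4, 39.9),
    "RoboCasa co-training (latent)": (17.4, 57.6),
    "GR1 humanoid": (37.0, 46.4),
    "Franka arm": (23.0, 37.0),
    "SO-100 arm": (21.0, 45.5),
    "Novel behaviors (zero-shot)": (11.2, 43.2),
    "Novel environments (zero-shot)": (0.0, 28.5),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {
    name: round(after - before, 1)
    for name, (before, after) in reported.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

Absolute gains are used rather than ratios because the environment-generalization baseline is 0%, which makes a relative improvement undefined.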

4. Pseudo-Action Labeling Techniques

DreamGen’s action inference methods enable policy training without ground-truth robot actions for every generated video. The IDM approach produces physically executable control sequences by learning from real demonstrations, while the latent-action model permits transfer to uninstrumented robots and datasets through visual representation learning. The windowed application of IDM during inference ensures temporally coherent and executable torque or position controls for both simulators and hardware (Jang et al., 19 May 2025). The VQ-VAE-based latent-action model captures fine-grained visual deltas as discrete actions, enhancing transferability across robotic platforms and tasks when traditional control interfaces are unavailable.
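The discrete latent action described above relies on vector quantization: a continuous encoding of a frame pair is snapped to its nearest codebook entry, and the index of that entry serves as the action token. The sketch below shows that step and the associated distance term in plain Python. The codebook, the encoding `z`, and the commitment weight `beta=0.25` are illustrative assumptions, not values from the paper, and the reconstruction loss is omitted because it requires a decoder.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z."""
    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, codebook[idx]

def commitment_loss(z, code, beta=0.25):
    """Penalty that pulls the encoder output toward its chosen code.

    In a full VQ-VAE this term is computed with a stop-gradient on the
    code; without autograd, only its value is shown here.
    """
    return beta * sq_dist(z, code)

# Hypothetical 2-D codebook with three latent actions.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.2]  # assumed encoder output for one frame pair

idx, code = quantize(z, codebook)
loss = commitment_loss(z, code)
print(idx, code)  # 1 [1.0, 0.0]
```

Because the token is defined purely by visual change between frames, the same codebook can in principle label videos from robots for which no action logs exist.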

5. Limitations and Future Directions

Limitations of DreamGen, as evidenced by experimental and practical constraints, include:

  • Computational Cost: Generating extensive synthetic datasets (e.g., 240 K RoboCasa videos) required approximately 54 hours on 1,500 NVIDIA L40 GPUs. Efficient video sampling and reduced compute demands remain open challenges.
  • Manual Initial Frames: Each new environment currently requires a manually-captured initial RGB frame. Automation via image-to-image diffusion or viewpoint randomization is proposed as a future extension.
  • Task Complexity: Current results cover mid-level manipulation. Tasks requiring dexterous, contact-rich behaviors (such as in-hand manipulation or fine tool use) are not yet supported and represent an open direction.
  • Evaluation Robustness: While DREAMGEN BENCH achieves high correlation with human judgments, automatic instruction-following and physics-alignment raters occasionally produce spurious outputs. Improved VLMs or learned video encoders are needed for further refinement.
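The compute figure quoted under Computational Cost can be converted into GPU-hours and a per-video cost with simple arithmetic. This assumes that all 1,500 GPUs were busy for the full 54 hours, which the text above does not state explicitly, so the results are rough upper-bound estimates.

```python
wall_clock_hours = 54   # approximate duration reported above
num_gpus = 1500         # NVIDIA L40 GPUs
num_videos = 240_000    # RoboCasa neural trajectories

# Total budget, assuming full utilization of every GPU.
gpu_hours = wall_clock_hours * num_gpus

# Average cost of one generated video, in GPU-minutes.
gpu_minutes_per_video = gpu_hours * 60 / num_videos

print(gpu_hours)              # 81000
print(gpu_minutes_per_video)  # 20.25
```

At roughly 20 GPU-minutes per video under these assumptions, sampling cost rather than data collection becomes the dominant expense of scaling the pipeline.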

6. Significance and Implications

DreamGen demonstrates that fine-tuned video world models, originally conceived for prediction and planning, can serve as a scalable synthetic data engine for robot policy learning. The ability to generate, pseudo-label, and utilize vast numbers of photorealistic neural trajectories from a single teleoperated behavior accelerates zero-shot generalization to new tasks and environments, reducing the cost and effort associated with hand-collected robot datasets. Strong empirical correlations between DREAMGEN BENCH scores and real-world policy outcomes position benchmarked video world models as a crucial foundation for scalable, generalizable robot learning (Jang et al., 19 May 2025).
