
DreamDojo: Foundation World Model for Robotics

Updated 9 February 2026
  • DreamDojo is a foundation world model for robotics that learns physical interactions and dexterous controls from a massive 44,000-hour egocentric human video dataset.
  • It utilizes a latent video diffusion architecture with continuous latent actions from a VAE, enabling effective bridging between human video and robotic control signals.
  • Robot-specific post-training and model distillation deliver real-time inference and strong generalization, enhancing model-based planning, policy evaluation, and teleoperation.

DreamDojo is a foundation world model for robotics that leverages large-scale egocentric human video to learn physics, diverse object interactions, and dexterous controls, and subsequently bridges to real robot domains via robot-specific post-training and model distillation. It addresses the challenges of world modeling for generalist agents, especially when action labels are scarce or unavailable, by introducing unified continuous latent actions and operating on an unprecedented 44,000-hour human video corpus. The resulting system achieves strong generalization, real-time inference, and applicability to model-based planning, policy evaluation, and teleoperation tasks (Gao et al., 6 Feb 2026).

1. Model Architecture and Action Representation

DreamDojo is built atop Cosmos-Predict2.5, utilizing a latent video diffusion backbone. The core building blocks are as follows:

  • Latent Encoding and Diffusion: Four-frame video chunks are encoded into discrete latents via WAN2.2. These latents, along with conditioning signals (actions, text, previous frames), are processed by a modified DiT (Diffusion Transformer), which produces the next-step latent through denoising diffusion. The transition model is expressed as:

p(z_{t+1} \mid z_t, a_t)

where z_t is the world latent and a_t is the action chunk.

  • Continuous Latent Actions: A self-supervised variational autoencoder (VAE) ingests frame pairs (f^t, f^{t+1}) and outputs a continuous low-dimensional action embedding \hat{a}_t \in \mathbb{R}^d. The encoder and decoder are spatiotemporal Transformers. The VAE loss combines reconstruction and KL divergence (a minimal code sketch follows this list):

\mathcal{L}^{\rm pred}_{\theta,\phi} = \mathbb{E}_{q_\phi(\hat{a} \mid f^t, f^{t+1})}\left[-\log p_\theta(f^{t+1} \mid f^t, \hat{a})\right] + \beta\, D_{KL}\left(q_\phi(\hat{a} \mid f^t, f^{t+1}) \,\|\, p(\hat{a})\right)

This action representation serves as a unified proxy across diverse, action-unlabeled human video domains. Actions are projected and incorporated into the model via an MLP within each adaptive layer norm block.

  • Training Losses: The world model pretraining objective combines diffusion flow matching with a temporal consistency term (a combined loss sketch follows this list):
    • Flow matching:

    \mathcal{L}_{\rm flow} = \mathbb{E}\left\|\mathbf{u}(\mathbf{x}_t, t, \mathbf{c}; \theta) - (\epsilon - \mathbf{x})\right\|^2

    • Temporal consistency:

    \mathcal{L}_{\rm temporal} = \mathbb{E}\left[\sum_{i=1}^{K-1}\left\|(z^{i+1} - z^i) - (v^{i+1} - v^i)\right\|^2\right]

    • Total pretraining loss:

    \mathcal{L}_{\rm world} = \mathcal{L}_{\rm flow} + \lambda\, \mathcal{L}_{\rm temporal}, \quad \lambda = 0.1
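As a concrete reference for the latent-action objective, below is a minimal PyTorch sketch of the VAE loss. The `encoder` and `decoder` callables stand in for the paper's spatiotemporal Transformers, and the MSE reconstruction term (a Gaussian log-likelihood up to constants) and the β value are illustrative assumptions, not the published implementation.

```python
# Hypothetical sketch of the latent-action VAE loss: reconstruction + beta-KL.
import torch
import torch.nn.functional as F

def latent_action_vae_loss(encoder, decoder, f_t, f_t1, beta=0.01):
    """encoder: frame pair -> (mu, logvar) of the latent action a_hat.
    decoder: (f_t, a_hat) -> reconstruction of f_t1.
    beta: KL weight (illustrative value)."""
    mu, logvar = encoder(f_t, f_t1)                           # q_phi(a_hat | f^t, f^{t+1})
    a_hat = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
    recon = decoder(f_t, a_hat)                               # p_theta(f^{t+1} | f^t, a_hat)
    recon_loss = F.mse_loss(recon, f_t1)                      # -log-likelihood up to constants
    # KL(q || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return recon_loss + beta * kl
```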
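And a sketch of the combined pretraining objective L_world = L_flow + 0.1 · L_temporal. The velocity network `u_net` is a placeholder, the linear noise interpolation follows standard flow matching, and reading z as predicted latents and v as reference latents in the temporal term is an assumption about the paper's notation.

```python
# Illustrative sketch of L_world = L_flow + lambda * L_temporal (lambda = 0.1).
import torch

def world_model_loss(u_net, x0, cond, z_pred, z_ref, lam=0.1):
    """x0: clean latent chunk (B, ...); cond: conditioning (actions, text, history).
    z_pred / z_ref: per-step predicted and reference latents, shape (K, ...)."""
    # Flow matching: regress the velocity toward (eps - x0) at a random timestep.
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))  # broadcastable t in [0, 1)
    x_t = (1 - t) * x0 + t * eps
    flow_loss = ((u_net(x_t, t, cond) - (eps - x0)) ** 2).mean()
    # Temporal consistency: match step-to-step differences of predicted vs. reference latents.
    temporal_loss = (((z_pred[1:] - z_pred[:-1]) - (z_ref[1:] - z_ref[:-1])) ** 2).mean()
    return flow_loss + lam * temporal_loss
```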

2. Data Scale, Preparation, and Self-Supervision

DreamDojo is pretrained on what the authors describe as the largest video corpus yet leveraged for world-model pretraining:

  • Dataset Composition: The DreamDojo-HV dataset provides 43,800 hours of egocentric human video, supplemented with In-lab and EgoDex datasets, summing to ~44,000 hours. This encompasses approximately 6,000 distinct skills, 10,000 scenes, and over 43,000 objects, surpassing prior robot datasets by an order of magnitude (Table 1, Fig. 2).

  • Data Processing: Videos are temporally downsampled (by a factor of 1–4), center-cropped to 320×240, and upsampled to 640×480 for world-modeling input (see the preprocessing sketch after this list). Language annotations are discarded for vision-only modeling.

  • Self-Supervised Learning Paradigm: As most human video lacks explicit action labels, the latent-action VAE provides a portable action signal. Proxy actions are extracted across all human datasets at ~10 Hz, facilitating knowledge transfer without manual action mapping. Empirically, conditioning on these latent actions yields nearly oracle performance relative to ground-truth action sources (Table 2).
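A minimal sketch of the preprocessing described above (temporal downsampling, center crop to 320×240, upsampling to 640×480). The (T, C, H, W) tensor layout and the bilinear interpolation mode are assumptions for illustration.

```python
# Hedged preprocessing sketch; assumes frames arrive as a (T, C, H, W) float tensor.
import torch
import torch.nn.functional as F

def preprocess_clip(frames: torch.Tensor, stride: int = 2) -> torch.Tensor:
    frames = frames[::stride]                   # temporal downsample (factor 1-4)
    _, _, H, W = frames.shape
    top, left = (H - 240) // 2, (W - 320) // 2
    frames = frames[:, :, top:top + 240, left:left + 320]  # center crop to 320x240
    # Upsample to 640x480 (H=480, W=640) for world-model input.
    return F.interpolate(frames, size=(480, 640), mode="bilinear", align_corners=False)
```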

3. Robot Post-Training and Model Distillation

After pretraining, DreamDojo undergoes adaptation to target robotic domains and is distilled for real-time use.

  • Robot-Specific Post-Training: Raw joint trajectories from robot data (e.g., GR-1, AgiBot) are rebased into relative action sequences over 4 frames (a rebasing sketch follows this list), and the action-MLP is reinitialized. Full model finetuning is performed over ~25,000 robot trajectories. This process bridges the gap between human physics and robot kinematics, enabling zero-shot generalization to novel scenes after approximately 30,000 steps (Table 3).

  • Distillation Pipeline (Self-Forcing):

    • Warmup Stage: The student model G_{\rm student} mimics teacher trajectories with teacher forcing, minimizing

    \mathcal{L}_{\rm warmup} = \mathbb{E}_{x,t}\left\|G_{\rm student}(x_t, t) - x_0\right\|^2

    • Main Distillation: The student then minimizes the KL divergence between teacher and student, whose gradient reduces to a score-difference form:

    \mathcal{L}_{\rm distill} = D_{KL}(p_{\rm teacher} \,\|\, p_{\rm student}) \longrightarrow \mathbb{E}_{z,t}\left[(s_{\rm real} - s_{\rm fake})\, \frac{\partial G_{\rm student}}{\partial \theta}\right]

    Rollouts are unrolled beyond the teacher horizon, and losses are applied only to teacher-length sliding windows (see the sketch after this list).
    • Result: The distilled model achieves 10.81 FPS (vs. 2.72 FPS for the teacher) with only minor degradation in generation quality (Table 5).
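A sketch of the relative-action rebasing mentioned in the post-training bullet above. Taking each chunk's deltas with respect to its first frame is an assumption about the chunking convention; the function name is hypothetical.

```python
# Rebase absolute joint trajectories into relative 4-frame action chunks (assumed convention).
import torch

def rebase_to_relative_chunks(joints: torch.Tensor, chunk: int = 4) -> torch.Tensor:
    """joints: (T, D) absolute joint positions, T > chunk.
    Returns (T - chunk, chunk, D) relative actions, each delta taken w.r.t. the chunk start."""
    T = joints.shape[0]
    return torch.stack([joints[i + 1:i + 1 + chunk] - joints[i] for i in range(T - chunk)])
```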
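The warmup objective and the sliding-window loss masking can likewise be sketched; everything here (the `student` callable, dense teacher targets, a simple MSE inside each window) is a simplified placeholder rather than the exact self-forcing recipe.

```python
# Simplified sketch of warmup teacher forcing and teacher-length sliding windows.
import torch

def warmup_loss(student, x_t, t, x0):
    # Teacher forcing: the student denoises teacher-provided noisy latents toward x0.
    return ((student(x_t, t) - x0) ** 2).mean()

def sliding_window_loss(rollout, target, horizon):
    """rollout: (T, ...) student rollout with T > horizon; target: aligned reference latents.
    The loss is accumulated only over teacher-length windows of the longer rollout."""
    windows = [((rollout[i:i + horizon] - target[i:i + horizon]) ** 2).mean()
               for i in range(rollout.shape[0] - horizon + 1)]
    return torch.stack(windows).mean()
```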

4. Empirical Evaluation and Downstream Applications

DreamDojo is systematically validated on out-of-distribution (OOD) benchmarks and multiple downstream tasks.

  • Out-of-Distribution Benchmarks: Six OOD evaluations are conducted (In-lab, EgoDex, DreamDojo-HV, Counterfactual, and two novel-background variants). Metrics include PSNR, SSIM, and LPIPS. Latent-action pretraining markedly outperforms both the no-pretraining and no-action-conditioning ablations, and closely approaches ground-truth action conditioning (Table 2).

  • Scaling Trends: Larger data scale (adding EgoDex and DreamDojo-HV) yields monotonic improvements in OOD and counterfactual scores (Table 3). Model scaling from 2B to 14B parameters yields a 0.3 dB PSNR improvement.

  • Human Preference: On EgoDex-novel and DreamDojo-HV-novel sets, DreamDojo is favored over Cosmos-Predict2.5 for both physical realism and action fidelity in more than 60% of assessments (Table 4).

  • Downstream Robotics Tasks:

    • Policy Evaluation: For AgiBot fruit packing (20 scenes), DreamDojo’s simulated task success rates correlate with real-world rates at r = 0.995 (MMRV = 0.003) (Fig. 7a); a correlation sketch follows this list.
    • Model-Based Planning: Ensembled policy checkpoints produce stepwise proposals via DreamDojo, scored using a DINOv2-based value network. Planning improves task success by up to 17 percentage points (nearly 2×) over random proposals (Fig. 7b); a planning-loop sketch also follows below.
    • Live Teleoperation: The distilled model paired with a PICO VR controller provides 10.8 FPS real-time teleoperation on G1 robots in previously unseen environments (Fig. 8).
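To make the policy-evaluation metric concrete, the Pearson correlation between simulated and real success rates can be computed per scene as below; the rates shown are made-up placeholders, not the paper's data.

```python
# Pearson correlation between world-model and real-world success rates (placeholder data).
import numpy as np

sim_success = np.array([0.90, 0.70, 0.50, 0.30])   # hypothetical per-scene simulated rates
real_success = np.array([0.88, 0.72, 0.48, 0.35])  # hypothetical per-scene real rates
r = np.corrcoef(sim_success, real_success)[0, 1]
print(f"Pearson r = {r:.3f}")
```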
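And a hedged sketch of the planning loop: sample stepwise proposals from an ensemble of policy checkpoints, imagine each outcome with the world model, score the imagined futures with a value network, and execute the best. `policies`, `world_model`, and `value_net` are placeholder callables, not the paper's interfaces.

```python
# Placeholder planning loop: propose, imagine, score, pick the best action.
import torch

def plan_step(policies, world_model, value_net, obs):
    proposals = [policy(obs) for policy in policies]   # stepwise action proposals
    scores = []
    for action in proposals:
        imagined = world_model(obs, action)            # predicted future latents/frames
        scores.append(value_net(imagined))             # e.g., a DINOv2-feature value head
    best = torch.stack(scores).argmax().item()
    return proposals[best]
```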

5. Insights, Limitations, and Future Directions

Key takeaways from DreamDojo include:

  • Cross-Embodiment Transfer: Training on massive-scale human video confers a generalized understanding of physical principles—such as object permanence, friction, and collision—that carries over to robot control.
  • Enhanced Controllability: Continuous latent actions enable fine-grained and counterfactual control inputs, extending beyond trajectories found in expert demonstrations.
  • Limitations:
    • Imperfect modeling of rare human actions (e.g., slapping, rapid waving).
    • DreamDojo may estimate success rates too optimistically relative to real robot deployments.
    • The current implementation supports only single-view inputs; multi-view world modeling is not yet realized.
    • Some pretrained knowledge may be lost during robot post-training. Mitigation strategies such as PaLM-style adapters or LoRA warrant further study.
  • Future Directions: Expansion to larger human video priors, integration of policy rollouts, further inference optimizations (quantization, pruning), and research into multi-camera support are proposed avenues (Gao et al., 6 Feb 2026).

6. Context and Significance

DreamDojo represents a shift in robot world modeling, demonstrating that large-scale, self-supervised learning on egocentric human video can provide broad physics and interaction priors for dexterous robotic tasks. The use of continuous latent actions addresses the perennial problem of action label scarcity in human datasets and enables more scalable proxy action conditioning. The methodological advances—particularly in transfer, real-time distillation, and OOD generalization—set a new benchmark for generalist robot world modeling and open new directions for foundation models in robotics (Gao et al., 6 Feb 2026).

References

  • Gao et al., "DreamDojo: Foundation World Model for Robotics," 6 Feb 2026.
