GigaWorld-0: World Models as Data Engine to Empower Embodied AI (2511.19861v1)

Published 25 Nov 2025 in cs.CV and cs.RO

Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

Summary

  • The paper presents a unified world model framework that uses synthetic data generation for Vision-Language-Action policy training in embodied AI.
  • It details dual modules for photorealistic video generation and physically plausible 3D scene simulation to ensure multi-view and temporal consistency.
  • The framework achieves high task success and zero-shot generalization in real-world robot deployments by addressing sim2real gaps with efficient augmentation.

GigaWorld-0: Unified World Models as Synthetic Data Engines for Embodied AI

Introduction

The GigaWorld-0 framework (2511.19861) advances the paradigm of world models as scalable synthetic data engines specifically structured for Vision-Language-Action (VLA) policy learning in embodied AI domains. By integrating large-scale photorealistic video generation with physically grounded 3D scene simulation, GigaWorld-0 addresses critical bottlenecks in real-world data collection and offers instruction-aligned, diverse training signals for robot learning across manipulation, locomotion, and multi-modal environments.

Framework Overview

GigaWorld-0 combines two synergistic modules:

  • GigaWorld-0-Video facilitates the generation of temporally consistent, texture-rich, and controllable videos, enabling manipulation of scene appearance, camera viewpoints, and action semantics.
  • GigaWorld-0-3D enforces geometric and physical realism via 3D asset generation, scene reconstruction using 3D Gaussian Splatting, differentiable system identification, and executable, collision-free motion planning.

This unified architecture enables joint synthesis of spatially coherent, physically plausible, and visually diverse datasets suitable for training VLA models without extensive real-world robot interaction.

Figure 1: The framework of GigaWorld-0-Video-Dreamer.

Video Foundation Models and Controllable Augmentation

GigaWorld-0-Video-Dreamer

The flagship model, GigaWorld-0-Video-Dreamer, achieves image-text-to-video (IT2V) generation using a sparse-attention DiT backbone with MoE FFN blocks and FP8-precision training. The architecture leverages a flow-matching generative process with 3D-VAE video latents, T5-based text conditioning, and loss-balanced expert routing to dynamically specialize video regions. This design yields superior capacity-to-efficiency trade-offs compared to parameter-heavy baselines.
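To make these ingredients concrete, here is a minimal PyTorch sketch, not the authors' implementation, of a loss-balanced mixture-of-experts FFN combined with a flow-matching (rectified-flow) objective over video latents. The `TinyDreamerBlock`, all dimensions, and the 0.01 balance weight are illustrative assumptions standing in for the sparse-attention DiT backbone described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Top-1 routed mixture-of-experts FFN with a simple usage-balance penalty."""
    def __init__(self, dim=256, hidden=1024, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (batch, tokens, dim)
        probs = self.router(x).softmax(dim=-1)     # (B, T, num_experts)
        idx = probs.argmax(dim=-1)                 # hard top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_mask = idx == e
            if token_mask.any():
                out[token_mask] = expert(x[token_mask])
        balance = probs.mean(dim=(0, 1)).var()     # penalize uneven expert usage
        return out, balance

class TinyDreamerBlock(nn.Module):
    """Toy stand-in for one DiT block: attention + MoE FFN + velocity head."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ffn = MoEFFN(dim)
        self.t_embed = nn.Linear(1, dim)           # timestep conditioning
        self.head = nn.Linear(dim, dim)            # predicts the velocity field

    def forward(self, x, t, text_emb):
        h = x + self.t_embed(t) + text_emb         # add timestep and text conditioning
        h = h + self.attn(h, h, h)[0]
        f, balance = self.ffn(h)
        return self.head(h + f), balance

def flow_matching_loss(model, x1, text_emb):
    """Rectified flow: regress the constant velocity x1 - x0 along a linear path."""
    x0 = torch.randn_like(x1)                            # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1                         # point on the interpolation path
    v_pred, balance = model(xt, t, text_emb)
    return F.mse_loss(v_pred, x1 - x0) + 0.01 * balance

model = TinyDreamerBlock()
latents = torch.randn(2, 16, 256)                        # (batch, 3D-VAE latent tokens, dim)
text = torch.randn(2, 1, 256)                            # pooled text embedding (e.g. from T5)
flow_matching_loss(model, latents, text).backward()
```

Routing is hard top-1 here for brevity; a production MoE backbone would use capacity constraints and fused expert kernels rather than a Python loop.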

GigaWorld-0-Video-Dreamer enables high-throughput generation of synthetic embodied trajectories, which are temporally aligned with predicted joint actions inferred via the GigaWorld-0-IDM inverse dynamics network. Masked training over arm regions mitigates background clutter, boosting robustness and alignment fidelity.

Figure 2: Qualitative comparison of action inference on the test set.
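The inverse-dynamics step can be illustrated with a hedged sketch: pairs of frames are masked to the arm region (the mask is assumed to come from an off-the-shelf segmenter) before a small CNN regresses the joint-space action. The network, mask source, and 7-dimensional action space are assumptions, not the GigaWorld-0-IDM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsModel(nn.Module):
    """Toy IDM: predict a joint-space action from two arm-masked frames."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(              # takes a pair of RGB frames (6 channels)
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)      # predicted joint-space delta

    def forward(self, frame_t, frame_tp1, arm_mask):
        # Mask out the background so the model focuses on the manipulator.
        x = torch.cat([frame_t * arm_mask, frame_tp1 * arm_mask], dim=1)
        return self.head(self.encoder(x))

idm = InverseDynamicsModel()
f_t   = torch.rand(4, 3, 128, 128)                          # frame at time t
f_tp1 = torch.rand(4, 3, 128, 128)                          # frame at time t+1
mask  = (torch.rand(4, 1, 128, 128) > 0.5).float()          # placeholder arm segmentation
target_action = torch.randn(4, 7)                           # ground-truth joint deltas
loss = F.mse_loss(idm(f_t, f_tp1, mask), target_action)
loss.backward()
```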

Controllable Post-Training: Appearance, Viewpoint, and Embodiment Transfer

Three dedicated post-trained branches facilitate further domain augmentation:

  • AppearanceTransfer enables editable scene appearance (texture, material, illumination) via text-driven prompts, narrowing sim2real gaps by leveraging parameter-efficient control layers rather than duplicative ControlNet heads (see the adapter sketch below).
  • ViewTransfer synthesizes novel camera viewpoints and adapts corresponding robot action trajectories using dual-condition control branches, with double-reprojection self-supervision to ensure background-robot geometric consistency.
  • MimicTransfer translates first-person human manipulation videos into robot-executable trajectories, handling the hand-to-robot mapping via background masking and inverse-kinematics simulation.

Figure 3: Training data pair of GigaWorld-0-Video-ViewTransfer.

Figure 4: Training data pair of GigaWorld-0-Video-MimicTransfer.

These data engines enable large-scale, diverse augmentation across appearance, viewpoint, and embodiment, directly boosting generalization and policy robustness.
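The parameter-efficient control layers mentioned for AppearanceTransfer can be pictured as small, zero-initialized adapters grafted onto a frozen backbone projection, so the post-trained branch injects conditioning without duplicating backbone weights. The sketch below is an illustrative assumption, not the paper's exact design; `ControlAdapter`, its rank, and the conditioning width are invented for the example.

```python
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    """Frozen base projection plus a low-rank, zero-initialized control branch."""
    def __init__(self, base: nn.Linear, cond_dim=64, rank=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # backbone stays frozen
        self.down = nn.Linear(base.in_features + cond_dim, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # starts as an identity-preserving no-op

    def forward(self, x, cond):
        # cond: per-token control signal (e.g. edited appearance or target-view embedding)
        delta = self.up(self.down(torch.cat([x, cond], dim=-1)))
        return self.base(x) + delta

base = nn.Linear(256, 256)
layer = ControlAdapter(base)
x = torch.randn(2, 16, 256)                       # backbone tokens
cond = torch.randn(2, 16, 64)                     # conditioning tokens
out = layer(x, cond)                              # same shape as base(x)
```

Because the adapter output starts at zero, post-training begins from the unmodified backbone and only gradually learns to apply the appearance or viewpoint edit.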

Multi-View Consistency and Generation Acceleration

GigaWorld-0-Video supports multi-view generation via panoramic concatenation and in-context learning, facilitating robust geometric and spatial reasoning during VLA training. Denoising step distillation and FP8 inference enable a 50× speedup over standard diffusion models, supporting real-time synthesis.

Figure 5: GigaWorld can generate multi-view consistent videos, thereby enabling 3D-aware training and improving spatial reasoning in downstream tasks.
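Step distillation pays off because a velocity-prediction model can be sampled with a handful of Euler steps instead of dozens. The sketch below shows a few-step sampler under the same flow-matching interface as the earlier `TinyDreamerBlock` stand-in; the interface, step count, and `ConstantVelocity` test model are assumptions, and FP8 inference is not shown.

```python
import torch

class ConstantVelocity(torch.nn.Module):
    """Trivial stand-in model so the sampler below runs end to end."""
    def forward(self, x, t, text_emb):
        return torch.zeros_like(x), 0.0

@torch.no_grad()
def sample_latents(model, text_emb, shape, num_steps=4, device="cpu"):
    x = torch.randn(shape, device=device)            # Gaussian noise at t = 0
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(shape[0], 1, 1)             # broadcastable timestep
        v, _ = model(x, t, text_emb)                 # predicted velocity toward the data
        x = x + (ts[i + 1] - ts[i]) * v              # one Euler step along the flow
    return x                                         # decode with the 3D-VAE afterwards

latents = sample_latents(ConstantVelocity(), text_emb=None, shape=(1, 16, 256))
```

A distilled student is trained so that this coarse trajectory still matches the teacher's many-step output, which is where most of the inference speedup comes from.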

Quality control pipelines score each video for geometric consistency, coherence, instruction alignment, and physical plausibility, gating suitability for downstream pre-training or fine-tuning.
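A minimal sketch of such a gate, with invented score names and thresholds, might look like the following: each clip receives per-dimension scores from assumed scorer models and is admitted to the training pool only if every score clears its threshold.

```python
from dataclasses import dataclass

@dataclass
class ClipScores:
    geometric_consistency: float
    temporal_coherence: float
    instruction_alignment: float
    physical_plausibility: float

# Illustrative thresholds; the paper does not publish its gating values.
THRESHOLDS = {
    "geometric_consistency": 0.7,
    "temporal_coherence": 0.7,
    "instruction_alignment": 0.8,
    "physical_plausibility": 0.6,
}

def passes_quality_gate(scores: ClipScores) -> bool:
    return all(getattr(scores, name) >= t for name, t in THRESHOLDS.items())

def filter_clips(clips_with_scores):
    """Keep only clips whose scores clear every threshold."""
    return [clip for clip, s in clips_with_scores if passes_quality_gate(s)]

# Example: one clip passes, one fails on instruction alignment.
batch = [
    ("clip_0001.mp4", ClipScores(0.9, 0.85, 0.92, 0.8)),
    ("clip_0002.mp4", ClipScores(0.9, 0.85, 0.55, 0.8)),
]
print(filter_clips(batch))   # -> ['clip_0001.mp4']
```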

Physically Realistic 3D Scene Construction

Foreground and Background Generation

GigaWorld-0-3D-FG synthesizes high-quality manipulable assets using state-of-the-art generative 3D models (Trellis, Clay), enhanced by aesthetic and segmentation quality control. Only assets passing MeshGeoChecker validation enter the URDF catalog.

GigaWorld-0-3D-BG leverages 3DGS/3DGRUT for scene reconstruction, augmented by generative view restoration to mitigate sparse-view artifacts. Poisson-based meshing yields watertight, simulation-ready backgrounds.

Figure 6: Overall pipeline of GigaWorld-0-3D-FG.

Figure 7: Visualization of novel view synthesis before and after view restoration.
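The Poisson meshing step mentioned above can be sketched with Open3D, which is an assumed tool choice rather than the paper's pipeline: reconstructed points (for example, Gaussian-splat centers) are given normals, meshed with Poisson surface reconstruction, and trimmed of low-density vertices.

```python
import numpy as np
import open3d as o3d

def poisson_mesh_from_points(points: np.ndarray, depth: int = 9) -> o3d.geometry.TriangleMesh:
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.estimate_normals()                         # Poisson needs oriented normals
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=depth
    )
    # Trim low-density vertices that tend to hallucinate surface far from the data.
    keep = np.asarray(densities) > np.quantile(np.asarray(densities), 0.05)
    mesh.remove_vertices_by_mask(~keep)
    return mesh

# Example with a random point set standing in for reconstructed scene points.
mesh = poisson_mesh_from_points(np.random.rand(2000, 3))
```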

Differentiable Physics and Action Synthesis

GigaWorld-0-3D-Phys employs PINN-based differentiable system identification for robot arms (friction, stiffness, damping), optimizing physical parameters via surrogate MSE minimization. For objects, Qwen3-VL-based agents infer modal properties from multi-view orthographic projections, supporting both rigid and deformable asset simulation.
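As a hedged, much-simplified analogue of the identification procedure, the sketch below fits stiffness, damping, and friction of a toy one-joint model by gradient descent on the MSE between simulated and observed trajectories; the dynamics, data, and optimizer settings are illustrative, not the paper's PINN formulation.

```python
import torch

def rollout(theta0, omega0, torque, k, c, mu, dt=0.01, steps=200):
    """Explicit-Euler rollout of theta'' = tau - k*theta - c*theta' - mu*sign(theta')."""
    theta, omega, traj = theta0, omega0, []
    for t in range(steps):
        alpha = torque[t] - k * theta - c * omega - mu * torch.tanh(10.0 * omega)
        omega = omega + dt * alpha                 # unit inertia for simplicity
        theta = theta + dt * omega
        traj.append(theta)
    return torch.stack(traj)

# Synthetic "observed" trajectory from ground-truth parameters.
torque = torch.sin(torch.linspace(0, 4, 200))
with torch.no_grad():
    observed = rollout(torch.tensor(0.0), torch.tensor(0.0), torque, k=2.0, c=0.3, mu=0.05)

# Learnable physical parameters (log-parameterized to stay positive).
log_params = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([log_params], lr=0.05)
for step in range(300):
    k, c, mu = torch.exp(log_params)
    sim = rollout(torch.tensor(0.0), torch.tensor(0.0), torque, k, c, mu)
    loss = torch.nn.functional.mse_loss(sim, observed)   # surrogate MSE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.exp(log_params).detach())   # should move toward [2.0, 0.3, 0.05]
```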

GigaWorld-0-3D-Act generates manipulation trajectories via MimicGen-based geometric augmentation for simple scenarios and RL-based policy bootstrapping for complex tasks.

Figure 8: The learning pipeline of the differentiable physics network in GigaWorld-0-3D-Phys.

Figure 9: The overall pipeline of GigaWorld-0-3D-Act.
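The MimicGen-style geometric augmentation can be illustrated as re-expressing a recorded end-effector trajectory relative to a new object pose, which yields an additional demonstration of the same skill. The sketch below uses synthetic 4x4 homogeneous transforms and is an assumption about the general technique, not GigaWorld-0-3D-Act's exact scheme.

```python
import numpy as np

def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def retarget_trajectory(ee_poses, src_obj_pose, tgt_obj_pose):
    """Map each end-effector pose through the object-frame change src -> tgt."""
    delta = tgt_obj_pose @ np.linalg.inv(src_obj_pose)
    return [delta @ T for T in ee_poses]

# Source demo: a short approach toward an object at the origin.
src_obj = make_pose(np.eye(3), np.zeros(3))
demo = [make_pose(np.eye(3), np.array([0.0, 0.0, 0.3 - 0.05 * i])) for i in range(5)]

# New scene: the object is shifted and rotated 90 degrees about z.
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
tgt_obj = make_pose(Rz, np.array([0.2, 0.1, 0.0]))

augmented = retarget_trajectory(demo, src_obj, tgt_obj)
print(augmented[0][:3, 3])   # first waypoint expressed relative to the new object pose
```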

Integration of 3D asset generation, physically plausible dynamics, and scalable action synthesis forms a robust, simulation-ready platform for embodied policy training.

Experimental Results and Evaluation

Quantitative Benchmarks

GigaWorld-0-Video-Dreamer achieves the highest overall scores on both PBench and DreamGen Bench, outperforming Cosmos-Predict2, Wan2.2, and other large-scale video generation models—even when activating only 2B parameters. Metrics span background/object/behavior instruction-following fidelity (Qwen-IF, GPT-IF), physical authenticity (PA), and domain generalization, reflecting strong multi-dimensional performance.

Qualitative Visualization

Generated trajectories demonstrate semantic diversity and instruction adherence (Figure 10), multi-view spatial coherence (Figure 11), and high-fidelity appearance transfer across real and simulated domains (Figure 12). ViewTransfer outputs produce physically consistent action adaptation under novel viewpoints (Figure 13), while MimicTransfer enables accurate cross-embodiment translation from human to robot trajectories (Figure 14). 3D-generated scenes display geometric and dynamic realism (Figure 15).

Figure 10: Visualization results of GigaWorld-0-Video-Dreamer conditioned on the same initial frame but different text prompts, demonstrating its ability to produce diverse, semantically consistent future trajectories.

Figure 11: Multi-view visualization results of GigaWorld-0-Video-Dreamer conditioned on the same initial frame but different text prompts.

Figure 12: Visualization results of GigaWorld-0-Video-AppearanceTransfer, which enables photorealistic editing of texture, material, and lighting in real-world or simulation-acquired videos while preserving scene geometry, object semantics, and temporal coherence.

Figure 13: Visualization results of GigaWorld-0-Video-ViewTransfer, which synthesizes photorealistic videos from arbitrary camera viewpoints while simultaneously adapting robot arm trajectories to maintain physical plausibility and action consistency, enabling the generation of diverse embodied manipulation data.

Figure 14: Visualization results of GigaWorld-0-Video-MimicTransfer, which translates first-person human demonstration videos into robot-executable manipulation trajectories, enabling scalable synthesis of cross-embodiment training data for VLA models.

Figure 15: Visualization results of GigaWorld-0-3D, showcasing geometrically consistent rendering and physically realistic robot actions.

Downstream Policy Performance

Policies trained solely on GigaWorld-0-generated data with the GigaBrain-0 VLA model achieve high task success rates in real robot deployments across laundry folding, paper towel preparation, table bussing, juice preparation, and basket and box movement (Figures 16–21). This demonstrates cross-domain, zero-shot generalization and robust performance without real-world interaction during training.

Figure 16: Deployment of GigaBrain-0 on the G1 humanoid robot for real-world laundry folding.

Figure 17: Deployment of GigaBrain-0 on the PiPER arms for real-world paper towel preparation.

Figure 18: Deployment of GigaBrain-0 on PiPER arms for real-world table bussing.

Figure 19: Deployment of GigaBrain-0 on G1 humanoid robot for real-world juice preparation.

Figure 20: Deployment of GigaBrain-0 on the G1 humanoid robot for real-world paper towel preparation.

Figure 21: Deployment of GigaBrain-0 on the PiPER arms for real-world laundry basket moving.

Implications and Future Directions

GigaWorld-0 establishes world models as core data infrastructure for embodied AI, enabling high-throughput generation of instruction-aligned, physically realistic, and geometrically coherent synthetic datasets. The capacity for appearance, viewpoint, and embodiment generalization addresses longstanding sim2real transfer challenges and enables scalable policy training for heterogeneous robotic platforms.

Potential future avenues include deployment of GigaWorld-0 as an interactive policy environment for closed-loop model-based RL, leveraging learned physical and semantic priors for active policy proposal and task decomposition, and enabling self-improving pipelines through continual real-world and synthetic data co-training.

Conclusion

GigaWorld-0 demonstrates state-of-the-art performance in synthetic data generation for embodied AI and VLA policy training, with empirically validated improvements in generalization, robustness, and scalability. Its unified modeling and large-scale training infrastructure provide a solid foundation for advancing universal robot learning and embodied intelligence research, moving toward integrated simulation-policy loops and lifelong adaptation via synthetic world models.
