
GigaWorld-0: Scalable VLA Data Engine Framework

Updated 18 December 2025
  • GigaWorld-0 is a unified world model framework that integrates scalable synthetic video generation and 3D scene synthesis for embodied VLA learning.
  • It combines advanced generative techniques with hardware-optimized distributed training to yield high-fidelity, physically grounded data.
  • Models trained on GigaWorld-0 data achieve a 15–30% improvement in robotic task success and exhibit robust cross-domain generalization.

GigaWorld-0 is a unified world model framework designed as a scalable data engine for Vision-Language-Action (VLA) learning in embodied AI. By integrating large-scale synthetic video generation with physically grounded 3D scene synthesis, GigaWorld-0 enables the creation of diverse, visually and physically plausible data for downstream policy training. It introduces two principal modules—GigaWorld-0-Video and GigaWorld-0-3D—jointly optimized through a multi-term loss, and is underpinned by the high-efficiency GigaTrain distributed training framework with hardware-oriented optimizations. Models trained exclusively on GigaWorld-0–generated data demonstrate significant improvements in real-world robotic generalization and task success without real-world interaction during training (Team et al., 25 Nov 2025).

1. System Architecture and Data-Engine Pipeline

GigaWorld-0 consists of three tightly-coupled subsystems: GigaWorld-0-Video, GigaWorld-0-3D, and GigaTrain.

  • GigaWorld-0-Video serves as a foundation for large-scale, controllable image-text-to-video (IT2V) generation via four model variants:
    • Dreamer: IT2V foundation model based on mixture-of-experts (MoE) and sparse attention.
    • AppearanceTransfer: Text-driven foreground/background editing over texture, material, and lighting.
    • ViewTransfer: Novel camera-view synthesis with action remapping and pose transformations.
    • MimicTransfer: Human-to-robot manipulation translation via video-to-video mapping.
    • All models support multi-view generation, single-step distillation, and FP8-accelerated inference.
  • GigaWorld-0-3D generates simulation-ready 3D scenes and physically realistic trajectories through four modules:
    • 3DGS-FG: Foreground object generation from single images or text.
    • 3DGS-BG: Background reconstruction from real video.
    • 3DGS-Phys: Differentiable system identification of physical parameters.
    • 3DGS-Act: Action and trajectory synthesis from teleoperation demonstrations.
  • GigaTrain is a distributed training framework based on DeepSpeed ZeRO and FSDP, supporting mixed-precision (FP16/BF16/FP8), sparse attention (NATTEN), activation checkpointing, and efficient gradient accumulation for both large-scale pretraining and post-training phases.

The overall data-engine loop proceeds from textual and visual inputs to synthetic videos, action trajectories, 3D scenes, physically plausible rollouts, fine-grained rendering, and on to VLA model pretraining and deployment in real robots. This pipeline enables scalable synthesis and filtering of high-fidelity data for embodied learning.
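The flow of this loop can be pictured as a short orchestration routine; every component and method name below is a placeholder illustrating the described pipeline, not an actual GigaWorld-0 API:

```python
def data_engine_pass(video_model, scene_builder, simulator, quality_filter,
                     prompts, seed_images, demos):
    """One pass of the data-engine loop sketched above. All component and
    method names are hypothetical placeholders for the described stages."""
    videos = video_model.generate(prompts, seed_images)         # controllable IT2V synthesis
    scenes = scene_builder.build(seed_images)                   # 3DGS foreground + background
    trajectories = simulator.synthesize_actions(demos, scenes)  # physically grounded actions
    rollouts = simulator.rollout(scenes, trajectories)          # physics-consistent rollouts
    renders = simulator.render(rollouts)                        # fine-grained multi-view rendering
    return quality_filter(videos + renders)                     # filtered data for VLA pretraining
```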

2. GigaWorld-0-Video: Foundation and Control

2.1 Dreamer Foundation Model

At the core, Dreamer uses latent diffusion with flow-matching within a 3D-VAE latent space:

$$\frac{d\mathbf z_t}{dt} = \mathbf v_\theta(\mathbf z_t, t, \mathbf c)$$

where $\mathbf z_t \in \mathbb R^{16 \times H' \times W'}$ denotes the latent and $\mathbf c$ is the joint image/text conditioning. The objective is:

$$\mathcal L_{\mathrm{flow}} = \mathbb E_{t,\mathbf z_0,\mathbf c}\left\|\mathbf v_\theta(\mathbf z_t, t, \mathbf c) - \dot{\mathbf z}_t^*\right\|^2$$

The model employs DiT Transformers with sparse neighborhood attention (NATTEN), MoE FFN layers (4 experts with top-2 routing), and 3D-RoPE embeddings. MoE load balancing is regulated by:

$$\mathcal L_{\mathrm{Load}} = \alpha \sum_{i=1}^{N_r} f_i\, P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is its mean router probability.
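A minimal PyTorch-style sketch of these two training terms is given below; the linear noise-to-data interpolation path, the velocity-network signature, and top-1 token counting are assumptions of the sketch, not details taken from the paper:

```python
import torch

def flow_matching_loss(v_theta, z0, cond):
    """Flow-matching objective: regress the predicted velocity onto the
    straight-line target velocity between a noise sample and the clean latent."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)
    t_b = t.view(b, *([1] * (z0.dim() - 1)))               # broadcastable time
    noise = torch.randn_like(z0)
    z_t = (1.0 - t_b) * noise + t_b * z0                    # interpolated latent z_t
    target_velocity = z0 - noise                            # d z_t / dt along this path
    pred = v_theta(z_t, t, cond)
    return ((pred - target_velocity) ** 2).mean()

def moe_load_balance_loss(router_probs, expert_index, num_experts, alpha=0.01):
    """Load-balancing term L_Load = alpha * sum_i f_i * P_i, with f_i the fraction
    of tokens routed to expert i and P_i its mean router probability."""
    # router_probs: (tokens, num_experts) softmax outputs
    # expert_index: (tokens,) long tensor with the chosen expert per token
    f = torch.bincount(expert_index, minlength=num_experts).float() / expert_index.numel()
    P = router_probs.mean(dim=0)
    return alpha * torch.sum(f * P)
```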

2.2 Controllable Generation Branches

  • AppearanceTransfer encodes depth and normal controls into the 3D-VAE latent, allowing text-guided modulation of foreground and background properties.
  • ViewTransfer remaps camera views through depth-based reprojection and pose transformations, increasing novelty and alignment (a minimal remapping sketch follows this list):

    $$K_t = \left(T^{\mathrm{base} \to W_B}\right)^{-1} T^{\mathrm{base} \to W_A}\, T_t^{ee \to \mathrm{base}}$$

    where $T^{\mathrm{base} \to W_A}$ and $T^{\mathrm{base} \to W_B}$ are the base-to-world transforms of the source and target setups and $T_t^{ee \to \mathrm{base}}$ is the end-effector pose in the base frame at time $t$.

  • MimicTransfer reconstructs full robot manipulations by blending real masked-out videos with synthetic robot trajectories mapped from human demonstrations.
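The ViewTransfer remapping above can be written as a few homogeneous-transform multiplications; the argument names below are placeholders and the 4×4 matrix representation is an assumption of this sketch:

```python
import numpy as np

def remap_ee_trajectory(T_base_to_WA, T_base_to_WB, T_ee_to_base_seq):
    """Remap an end-effector trajectory recorded in setup A into the base frame of
    setup B: K_t = inv(T^{base->W_B}) @ T^{base->W_A} @ T_t^{ee->base}.
    All arguments are 4x4 homogeneous transforms (rotation + translation)."""
    T_WB_to_base = np.linalg.inv(T_base_to_WB)
    return [T_WB_to_base @ T_base_to_WA @ T_t for T_t in T_ee_to_base_seq]
```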

2.3 Temporal Coherence and Diversity

Temporal coherence emerges from the ODE-based latent transitions, while diversity across prompts results from the entropy properties of diffusion. Auxiliary losses can reinforce smoothness and diversity:

$$\mathcal L_{\mathrm{smooth}} = \lambda_{\mathrm{tc}} \sum_{t=2}^{T} \|z_t - z_{t-1}\|^2$$

$$\mathcal L_{\mathrm{div}} = -\lambda_{\mathrm{div}}\, \mathrm{Cov}\!\left(z^{(i)}, z^{(j)}\right)_{i \neq j}$$
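A minimal sketch of how these two auxiliary terms could be computed on latent tensors; the tensor layouts and the normalization of the cross-sample covariance are assumptions:

```python
import torch

def temporal_smoothness_loss(z, lambda_tc=1e-3):
    """L_smooth: sum of squared frame-to-frame latent differences.
    z: (B, T, ...) latent sequence."""
    diffs = z[:, 1:] - z[:, :-1]
    return lambda_tc * (diffs ** 2).sum() / z.shape[0]

def diversity_loss(z_samples, lambda_div=1e-3):
    """L_div: negatively weighted cross-sample covariance, following
    -lambda_div * Cov(z^(i), z^(j))_{i != j}.
    z_samples: (N, D) one flattened latent per generated sample."""
    n, d = z_samples.shape
    z_centered = z_samples - z_samples.mean(dim=0, keepdim=True)
    gram = z_centered @ z_centered.T / d                  # (N, N) pairwise covariance estimate
    off_diag = gram - torch.diag(torch.diag(gram))
    return -lambda_div * off_diag.sum() / (n * (n - 1))
```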

3. GigaWorld-0-3D: 3D Synthesis and Physical Realism

3.1 Foreground Generation (3DGS-FG)

Single-image or text inputs are processed through Trellis-based latent diffusion to yield a 3D mesh with a Gaussian splat representation. Automated quality gates using aesthetic, segmentation, and geometry checkers (e.g., Aesthetic-Checker, ImageSegChecker, MeshGeoChecker) ensure output fidelity.
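The gating logic amounts to a short filter chain; the checker names mirror those in the text, but the callable interface sketched here is an assumption:

```python
def passes_quality_gates(asset, checkers):
    """Keep a generated 3D asset only if every quality checker accepts it.
    Each checker is assumed to be a callable returning (ok, reason)."""
    for checker in checkers:
        ok, reason = checker(asset)
        if not ok:
            return False, f"rejected by {getattr(checker, '__name__', repr(checker))}: {reason}"
    return True, "accepted"

# Hypothetical usage with the checkers named above:
# keep, info = passes_quality_gates(asset, [aesthetic_checker, image_seg_checker, mesh_geo_checker])
```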

3.2 Background Reconstruction (3DGS-BG)

Backgrounds are reconstructed in two stages: sparse-view 3DGRUT fitting on real video with rolling-shutter corrections, followed by densification and novel-view hallucination using Diffusion-NVS. Meshes are generated via Poisson surface reconstruction.
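The final meshing step can be reproduced with off-the-shelf Poisson surface reconstruction; the sketch below uses Open3D on an oriented point cloud of splat centers, which is a simplification of the actual pipeline:

```python
import open3d as o3d

def mesh_from_gaussians(points, normals, depth=9):
    """Build a background mesh from reconstructed 3DGS centers via Poisson
    surface reconstruction. points/normals: (N, 3) float arrays."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.normals = o3d.utility.Vector3dVector(normals)    # Poisson needs oriented normals
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
    return mesh
```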

3.3 Differentiable System Identification (3DGS-Phys)

Surrogate models $\mathcal M_\phi$ are trained to simulate system dynamics:

$$\mathcal L_{\mathrm{dyn}} = \sum_t \left\|\tilde s_t - \mathcal M_\phi(s_{t-1}, a_{t-1}, f, p, d)\right\|^2$$

and the physical parameters $(f, p, d)$ are then tuned so that simulated rollouts match real observations:

$$\mathcal L_{\mathrm{iden}} = \sum_t \left\|\mathcal M_\phi(s_{t-1}, a_{t-1}, f, p, d) - s_t^{\mathrm{real}}\right\|^2$$

For deformable objects, spring–mass parameters are inferred from video via a CNN.
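A minimal PyTorch-style sketch of these two stages; the (B, T, …) tensor layout, the surrogate's call signature, and the use of Adam for the identification step are assumptions, not details from the paper:

```python
import torch

def dynamics_loss(surrogate, states, actions, f, p, d):
    """L_dyn: train the surrogate M_phi to predict the next state from
    (s_{t-1}, a_{t-1}) and physical parameters (f, p, d)."""
    pred = surrogate(states[:, :-1], actions[:, :-1], f, p, d)
    return ((pred - states[:, 1:]) ** 2).sum()

def identify_parameters(surrogate, states_real, actions, f, p, d, steps=200, lr=1e-2):
    """L_iden: keep the surrogate weights fixed and only update (f, p, d) so that
    simulated rollouts match the real state sequence."""
    params = [f, p, d]
    for q in params:
        q.requires_grad_(True)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        pred = surrogate(states_real[:, :-1], actions[:, :-1], f, p, d)
        loss = ((pred - states_real[:, 1:]) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return f.detach(), p.detach(), d.detach()
```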

3.4 Action Synthesis (3DGS-Act)

Teleoperation demonstrations are expanded using MimicGen, with generated trajectories smoothed by minimizing frame-to-frame joint displacement subject to task and joint-limit constraints:

$$\min_{\tau} \sum_t \|\tau_{t+1} - \tau_t\|^2 \quad \text{s.t.}\quad C_{\mathrm{task}}(\tau) = 0,\; q_{\min} \le q \le q_{\max}$$

For complex tasks, cold-start RL (RLPD) provides trajectories.
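The trajectory-smoothing step above can be prototyped with a generic constrained optimizer; the SLSQP solver, scalar joint limits, and the task-constraint callback below are assumptions of this sketch rather than the method actually used:

```python
import numpy as np
from scipy.optimize import minimize

def smooth_trajectory(tau_init, q_min, q_max, task_constraint):
    """Minimize sum_t ||tau_{t+1} - tau_t||^2 subject to joint limits and a
    task equality constraint C_task(tau) = 0.  tau_init: (T, DoF) seed trajectory."""
    T, dof = tau_init.shape

    def objective(x):
        tau = x.reshape(T, dof)
        return np.sum((tau[1:] - tau[:-1]) ** 2)

    constraints = [{"type": "eq", "fun": lambda x: task_constraint(x.reshape(T, dof))}]
    bounds = [(q_min, q_max)] * (T * dof)                 # element-wise joint limits
    result = minimize(objective, tau_init.ravel(), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x.reshape(T, dof)
```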

4. GigaTrain: Distributed and Efficient Learning

4.1 FP8 Quantization

Weights and activations are quantized to an 8-bit floating-point format (E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits):

$$\hat w = \mathrm{Sign}(w) \times 2^{\mathrm{clamp}\left(\lfloor \log_2 |w| \rfloor,\, E_{\min},\, E_{\max}\right)} \left(1 + \mathrm{frac}(w) \times 2^{-2}\right)$$

FP8 yields approximately 15–25% memory saving and 10–20% speedup.
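A minimal NumPy simulation of the E5M2 quantizer above; it ignores subnormals and hardware rounding modes, and the exponent bounds are the standard E5M2 values rather than figures from the paper:

```python
import numpy as np

def quantize_e5m2(w, e_min=-14, e_max=15):
    """Simulate FP8 (1 sign, 5 exponent, 2 mantissa bits): clamp the exponent
    range and round the fractional mantissa to 2 bits."""
    sign = np.sign(w)
    mag = np.abs(w)
    mag = np.where(mag == 0, np.finfo(np.float32).tiny, mag)   # avoid log2(0)
    exp = np.clip(np.floor(np.log2(mag)), e_min, e_max)
    frac = mag / 2.0 ** exp - 1.0                              # mantissa fraction in [0, 1)
    frac_q = np.round(frac * 4) / 4                            # keep 2 mantissa bits
    return sign * 2.0 ** exp * (1.0 + frac_q)
```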

4.2 Sparse Neighborhood Attention (NATTEN)

Attention computation is localized to a fixed neighborhood:

$$A = \mathrm{softmax}\!\left(\frac{Q K^\top + M}{\sqrt{d_k}}\right) V$$

where $M_{ij} = -\infty$ if $|i - j| > W$ and $0$ otherwise. This reduces complexity from $\mathcal O(N^2)$ to $\mathcal O(NW)$ and yields roughly a 15% throughput gain.
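A dense reference implementation of this masking rule makes the locality explicit; note that real NATTEN kernels never materialize the full N×N matrix, so this sketch is O(N²) and for illustration only:

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(q, k, v, window=7):
    """Token i may only attend to tokens j with |i - j| <= window.
    q, k, v: (..., N, d) tensors over a 1D token sequence."""
    n, d = q.shape[-2], q.shape[-1]
    idx = torch.arange(n, device=q.device)
    mask = (idx[None, :] - idx[:, None]).abs() > window      # True where attention is forbidden
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```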

4.3 Distributed Training Strategies

ZeRO-2 and FSDP-2 are compared for memory efficiency (FSDP-2: ∼74 GB, ZeRO-2: ∼77 GB) and step time, with activation checkpointing on MoE FFNs facilitating training of 2B-parameter networks on 8×H20 GPUs.
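A minimal sketch of this setup using the PyTorch FSDP API, with activation checkpointing restricted to MoE FFN modules; the block and FFN class names are placeholders, and the use of the FSDP-1 wrapper (rather than the newer fully_shard API) is an assumption of the sketch:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

def shard_model(model, transformer_block_cls, moe_ffn_cls):
    """Shard transformer blocks with FSDP (BF16 mixed precision), then apply
    activation checkpointing to the MoE FFN sub-modules."""
    model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({transformer_block_cls}),
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16),
        device_id=torch.cuda.current_device(),
    )
    apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, moe_ffn_cls))
    return model
```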

5. Joint Optimization and Multitask Loss

The three major subsystems are trained jointly with a multi-term objective:

$$\mathcal L_{\mathrm{total}} = \lambda_{\mathrm{flow}} \mathcal L_{\mathrm{flow}} + \lambda_{\mathrm{MoE}} \mathcal L_{\mathrm{Load}} + \lambda_{\mathrm{view}} \mathcal L_{\mathrm{view}} + \lambda_{\mathrm{app}} \mathcal L_{\mathrm{appear}} + \lambda_{\mathrm{iden}} \mathcal L_{\mathrm{iden}} + \lambda_{\mathrm{phys}} \mathcal L_{\mathrm{dyn}} + \lambda_{\mathrm{cons}} \mathcal L_{\mathrm{cons}} + R$$

with $\mathcal L_{\mathrm{flow}}$ (flow-matching/diffusion), $\mathcal L_{\mathrm{view}}$ and $\mathcal L_{\mathrm{appear}}$ (reconstruction), $\mathcal L_{\mathrm{iden}}$ and $\mathcal L_{\mathrm{dyn}}$ (physics), $\mathcal L_{\mathrm{cons}}$ (rendering consistency), and a regularization term $R$.
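In code, the combination reduces to a weighted sum over the per-module terms; the dict-based interface below is illustrative only:

```python
def total_loss(losses, weights, regularizer=0.0):
    """Weighted sum of per-module losses, matching L_total above.
    losses / weights: dicts keyed by term name, e.g.
    'flow', 'load', 'view', 'appear', 'iden', 'dyn', 'cons'."""
    return sum(weights[name] * losses[name] for name in losses) + regularizer
```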

This joint schedule enforces photorealism, geometric and 3D alignment, and physical-dynamics realism in the generated data.

6. Evaluation and Downstream Performance

6.1 Synthetic Data Quality

  • PBench (Robot Set) overall quality rises to 88.2, outperforming other action models at comparable (∼2B-parameter) scale.
  • DreamGenBench (GR1-Env/Obj/Beh) shows instruction-following improvements of 2–5% over previous baselines at similar scale.
  • Visual evaluations confirm the system's ability to generate diverse, multi-view coherent, and physically plausible manipulations across textures, lighting, and camera angles.

6.2 Impact on VLA Model Performance

VLA agents (e.g., GigaBrain-0) trained exclusively on GigaWorld-0 data demonstrate increased success on real-robot benchmarks—including laundry folding, paper towel preparation, table bussing, juice mixing, and box/basket moving—with success rates enhanced by approximately 15–30% relative to agents trained solely on real or simulated data. These results include robust zero-shot generalization to novel objects, viewpoints, and lighting, indicating pronounced cross-domain efficacy.


GigaWorld-0 establishes a scalable methodology for embodied AI data synthesis by unifying controllable video generation, structured 3D scene and physics modeling, and resource-efficient training protocols. Its synthetic corpus enables state-of-the-art downstream policy learning for real-world robotics without real-world training data (Team et al., 25 Nov 2025).
