
GigaWorld-0: Scalable VLA Data Engine Framework

Updated 18 December 2025
  • GigaWorld-0 is a unified world model framework that integrates scalable synthetic video generation and 3D scene synthesis for embodied VLA learning.
  • It combines advanced generative techniques with hardware-optimized distributed training to yield high-fidelity, physically grounded data.
  • Models trained on GigaWorld-0 data achieve a 15–30% improvement in robotic task success and exhibit robust cross-domain generalization.

GigaWorld-0 is a unified world model framework designed as a scalable data engine for Vision-Language-Action (VLA) learning in embodied AI. By integrating large-scale synthetic video generation with physically grounded 3D scene synthesis, GigaWorld-0 enables the creation of diverse, visually and physically plausible data for downstream policy training. It introduces two principal modules—GigaWorld-0-Video and GigaWorld-0-3D—jointly optimized through a multi-term loss, and is underpinned by the high-efficiency GigaTrain distributed training framework with hardware-oriented optimizations. Models trained exclusively on GigaWorld-0–generated data demonstrate significant improvements in real-world robotic generalization and task success without real-world interaction during training (Team et al., 25 Nov 2025).

1. System Architecture and Data-Engine Pipeline

GigaWorld-0 consists of three tightly-coupled subsystems: GigaWorld-0-Video, GigaWorld-0-3D, and GigaTrain.

  • GigaWorld-0-Video serves as a foundation for large-scale, controllable image-text-to-video (IT2V) generation via four model variants:
    • Dreamer: IT2V foundation model based on mixture-of-experts (MoE) and sparse attention.
    • AppearanceTransfer: Text-driven foreground/background editing over texture, material, and lighting.
    • ViewTransfer: Novel camera-view synthesis with action remapping and pose transformations.
    • MimicTransfer: Human-to-robot manipulation translation via video-to-video mapping.
    • All models support multi-view generation, single-step distillation, and FP8-accelerated inference.
  • GigaWorld-0-3D generates simulation-ready 3D scenes and physically realistic trajectories through four modules:
    • 3DGS-FG: Foreground object generation from single images or text.
    • 3DGS-BG: Background reconstruction from real video.
    • 3DGS-Phys: Differentiable system identification of physical parameters.
    • 3DGS-Act: Action and trajectory synthesis from teleoperation demonstrations.
  • GigaTrain is a distributed training framework based on DeepSpeed ZeRO and FSDP, supporting mixed-precision (FP16/BF16/FP8), sparse attention (NATTEN), activation checkpointing, and efficient gradient accumulation for both large-scale pretraining and post-training phases.

The overall data-engine loop proceeds from textual and visual inputs to synthetic videos, action trajectories, 3D scenes, physically plausible rollouts, fine-grained rendering, and on to VLA model pretraining and deployment in real robots. This pipeline enables scalable synthesis and filtering of high-fidelity data for embodied learning.
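The flow of this loop can be pictured as a short orchestration routine; every component and method name below is a placeholder illustrating the described pipeline, not an actual GigaWorld-0 API:

```python
def data_engine_pass(video_model, scene_builder, simulator, quality_filter,
                     prompts, seed_images, demos):
    """One pass of the data-engine loop sketched above. All component and
    method names are hypothetical placeholders for the described stages."""
    videos = video_model.generate(prompts, seed_images)         # controllable IT2V synthesis
    scenes = scene_builder.build(seed_images)                   # 3DGS foreground + background
    trajectories = simulator.synthesize_actions(demos, scenes)  # physically grounded actions
    rollouts = simulator.rollout(scenes, trajectories)          # physics-consistent rollouts
    renders = simulator.render(rollouts)                        # fine-grained multi-view rendering
    return quality_filter(videos + renders)                     # filtered data for VLA pretraining
```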

2. GigaWorld-0-Video: Foundation and Control

2.1 Dreamer Foundation Model

At the core, Dreamer uses latent diffusion with flow-matching within a 3D-VAE latent space:

$$\frac{d\mathbf z_t}{dt} = \mathbf v_\theta(\mathbf z_t, t, \mathbf c)$$

where $\mathbf z_t \in \mathbb R^{16 \times H' \times W'}$ denotes the latent and $\mathbf c$ is the joint image/text conditioning. The objective is:

$$\mathcal L_{\mathrm{flow}} = \mathbb E_{t,\mathbf z_0,\mathbf c}\left\|\mathbf v_\theta(\mathbf z_t, t, \mathbf c) - \dot{\mathbf z}_t^*\right\|^2$$

The model employs DiT Transformers with sparse neighborhood attention (NATTEN), MoE FFN layers (4 experts with top-2 routing), and 3D-RoPE embeddings. MoE load balancing is regulated by:

$$\mathcal L_{\mathrm{Load}} = \alpha \sum_{i=1}^{N_r} f_i\, P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is its mean router probability.
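A minimal PyTorch-style sketch of these two training terms is given below; the linear noise-to-data interpolation path, the velocity-network signature, and top-1 token counting are assumptions of the sketch, not details taken from the paper:

```python
import torch

def flow_matching_loss(v_theta, z0, cond):
    """Flow-matching objective: regress the predicted velocity onto the
    straight-line target velocity between a noise sample and the clean latent."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)
    t_b = t.view(b, *([1] * (z0.dim() - 1)))               # broadcastable time
    noise = torch.randn_like(z0)
    z_t = (1.0 - t_b) * noise + t_b * z0                    # interpolated latent z_t
    target_velocity = z0 - noise                            # d z_t / dt along this path
    pred = v_theta(z_t, t, cond)
    return ((pred - target_velocity) ** 2).mean()

def moe_load_balance_loss(router_probs, expert_index, num_experts, alpha=0.01):
    """Load-balancing term L_Load = alpha * sum_i f_i * P_i, with f_i the fraction
    of tokens routed to expert i and P_i its mean router probability."""
    # router_probs: (tokens, num_experts) softmax outputs
    # expert_index: (tokens,) long tensor with the chosen expert per token
    f = torch.bincount(expert_index, minlength=num_experts).float() / expert_index.numel()
    P = router_probs.mean(dim=0)
    return alpha * torch.sum(f * P)
```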

2.2 Controllable Generation Branches

  • AppearanceTransfer encodes depth and normal controls into the 3D-VAE latent, allowing text-guided modulation of foreground and background properties.
  • ViewTransfer remaps camera views through depth-based reprojection and pose transformations, increasing novelty and alignment (a minimal remapping sketch follows this list):

    $$K_t = \left(T^{\mathrm{base} \to W_B}\right)^{-1} T^{\mathrm{base} \to W_A}\, T_t^{ee \to \mathrm{base}}$$

    where $T^{\mathrm{base} \to W_A}$ and $T^{\mathrm{base} \to W_B}$ are the base-to-world transforms of the source and target setups and $T_t^{ee \to \mathrm{base}}$ is the end-effector pose in the base frame at time $t$.

  • MimicTransfer reconstructs full robot manipulations by blending real masked-out videos with synthetic robot trajectories mapped from human demonstrations.
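The ViewTransfer remapping above can be written as a few homogeneous-transform multiplications; the argument names below are placeholders and the 4×4 matrix representation is an assumption of this sketch:

```python
import numpy as np

def remap_ee_trajectory(T_base_to_WA, T_base_to_WB, T_ee_to_base_seq):
    """Remap an end-effector trajectory recorded in setup A into the base frame of
    setup B: K_t = inv(T^{base->W_B}) @ T^{base->W_A} @ T_t^{ee->base}.
    All arguments are 4x4 homogeneous transforms (rotation + translation)."""
    T_WB_to_base = np.linalg.inv(T_base_to_WB)
    return [T_WB_to_base @ T_base_to_WA @ T_t for T_t in T_ee_to_base_seq]
```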

2.3 Temporal Coherence and Diversity

Temporal coherence emerges from the ODE-based latent transitions, while diversity across prompts results from the entropy properties of diffusion. Auxiliary losses can reinforce smoothness and diversity:

$$\mathcal L_{\mathrm{smooth}} = \lambda_{\mathrm{tc}} \sum_{t=2}^{T} \|z_t - z_{t-1}\|^2$$

$$\mathcal L_{\mathrm{div}} = -\lambda_{\mathrm{div}}\, \mathrm{Cov}\!\left(z^{(i)}, z^{(j)}\right)_{i \neq j}$$
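A minimal sketch of how these two auxiliary terms could be computed on latent tensors; the tensor layouts and the normalization of the cross-sample covariance are assumptions:

```python
import torch

def temporal_smoothness_loss(z, lambda_tc=1e-3):
    """L_smooth: sum of squared frame-to-frame latent differences.
    z: (B, T, ...) latent sequence."""
    diffs = z[:, 1:] - z[:, :-1]
    return lambda_tc * (diffs ** 2).sum() / z.shape[0]

def diversity_loss(z_samples, lambda_div=1e-3):
    """L_div: negatively weighted cross-sample covariance, following
    -lambda_div * Cov(z^(i), z^(j))_{i != j}.
    z_samples: (N, D) one flattened latent per generated sample."""
    n, d = z_samples.shape
    z_centered = z_samples - z_samples.mean(dim=0, keepdim=True)
    gram = z_centered @ z_centered.T / d                  # (N, N) pairwise covariance estimate
    off_diag = gram - torch.diag(torch.diag(gram))
    return -lambda_div * off_diag.sum() / (n * (n - 1))
```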

3. GigaWorld-0-3D: 3D Synthesis and Physical Realism

3.1 Foreground Generation (3DGS-FG)

Single-image or text inputs are processed through Trellis-based latent diffusion to yield a 3D mesh with a Gaussian splat representation. Automated quality gates using aesthetic, segmentation, and geometry checkers (e.g., Aesthetic-Checker, ImageSegChecker, MeshGeoChecker) ensure output fidelity.
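The gating logic amounts to a short filter chain; the checker names mirror those in the text, but the callable interface sketched here is an assumption:

```python
def passes_quality_gates(asset, checkers):
    """Keep a generated 3D asset only if every quality checker accepts it.
    Each checker is assumed to be a callable returning (ok, reason)."""
    for checker in checkers:
        ok, reason = checker(asset)
        if not ok:
            return False, f"rejected by {getattr(checker, '__name__', repr(checker))}: {reason}"
    return True, "accepted"

# Hypothetical usage with the checkers named above:
# keep, info = passes_quality_gates(asset, [aesthetic_checker, image_seg_checker, mesh_geo_checker])
```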

3.2 Background Reconstruction (3DGS-BG)

Backgrounds are reconstructed in two stages: sparse-view 3DGRUT fitting on real video with rolling-shutter corrections, followed by densification and novel-view hallucination using Diffusion-NVS. Meshes are generated via Poisson surface reconstruction.
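The final meshing step can be reproduced with off-the-shelf Poisson surface reconstruction; the sketch below uses Open3D on an oriented point cloud of splat centers, which is a simplification of the actual pipeline:

```python
import open3d as o3d

def mesh_from_gaussians(points, normals, depth=9):
    """Build a background mesh from reconstructed 3DGS centers via Poisson
    surface reconstruction. points/normals: (N, 3) float arrays."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.normals = o3d.utility.Vector3dVector(normals)    # Poisson needs oriented normals
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
    return mesh
```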

3.3 Differentiable System Identification (3DGS-Phys)

Surrogate models $\mathcal M_\phi$ are trained to simulate system dynamics:

$$\mathcal L_{\mathrm{dyn}} = \sum_t \left\|\tilde s_t - \mathcal M_\phi(s_{t-1}, a_{t-1}, f, p, d)\right\|^2$$

and the physical parameters $(f, p, d)$ are then tuned so that simulated rollouts match real observations:

$$\mathcal L_{\mathrm{iden}} = \sum_t \left\|\mathcal M_\phi(s_{t-1}, a_{t-1}, f, p, d) - s_t^{\mathrm{real}}\right\|^2$$

For deformable objects, spring–mass parameters are inferred from video via a CNN.
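A minimal PyTorch-style sketch of these two stages; the (B, T, …) tensor layout, the surrogate's call signature, and the use of Adam for the identification step are assumptions, not details from the paper:

```python
import torch

def dynamics_loss(surrogate, states, actions, f, p, d):
    """L_dyn: train the surrogate M_phi to predict the next state from
    (s_{t-1}, a_{t-1}) and physical parameters (f, p, d)."""
    pred = surrogate(states[:, :-1], actions[:, :-1], f, p, d)
    return ((pred - states[:, 1:]) ** 2).sum()

def identify_parameters(surrogate, states_real, actions, f, p, d, steps=200, lr=1e-2):
    """L_iden: keep the surrogate weights fixed and only update (f, p, d) so that
    simulated rollouts match the real state sequence."""
    params = [f, p, d]
    for q in params:
        q.requires_grad_(True)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        pred = surrogate(states_real[:, :-1], actions[:, :-1], f, p, d)
        loss = ((pred - states_real[:, 1:]) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return f.detach(), p.detach(), d.detach()
```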

3.4 Action Synthesis (3DGS-Act)

Teleoperation demonstrations are expanded using MimicGen, with generated trajectories smoothed by minimizing frame-to-frame joint displacement subject to task and joint-limit constraints:

$$\min_{\tau} \sum_t \|\tau_{t+1} - \tau_t\|^2 \quad \text{s.t.}\quad C_{\mathrm{task}}(\tau) = 0,\; q_{\min} \le q \le q_{\max}$$

For complex tasks, cold-start RL (RLPD) provides trajectories.
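The trajectory-smoothing step above can be prototyped with a generic constrained optimizer; the SLSQP solver, scalar joint limits, and the task-constraint callback below are assumptions of this sketch rather than the method actually used:

```python
import numpy as np
from scipy.optimize import minimize

def smooth_trajectory(tau_init, q_min, q_max, task_constraint):
    """Minimize sum_t ||tau_{t+1} - tau_t||^2 subject to joint limits and a
    task equality constraint C_task(tau) = 0.  tau_init: (T, DoF) seed trajectory."""
    T, dof = tau_init.shape

    def objective(x):
        tau = x.reshape(T, dof)
        return np.sum((tau[1:] - tau[:-1]) ** 2)

    constraints = [{"type": "eq", "fun": lambda x: task_constraint(x.reshape(T, dof))}]
    bounds = [(q_min, q_max)] * (T * dof)                 # element-wise joint limits
    result = minimize(objective, tau_init.ravel(), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x.reshape(T, dof)
```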

4. GigaTrain: Distributed and Efficient Learning

4.1 FP8 Quantization

Weights and activations are quantized to an 8-bit floating-point format (E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits):

$$\hat w = \mathrm{Sign}(w) \times 2^{\mathrm{clamp}\left(\lfloor \log_2 |w| \rfloor,\, E_{\min},\, E_{\max}\right)} \left(1 + \mathrm{frac}(w) \times 2^{-2}\right)$$

FP8 yields approximately 15–25% memory saving and 10–20% speedup.
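A minimal NumPy simulation of the E5M2 quantizer above; it ignores subnormals and hardware rounding modes, and the exponent bounds are the standard E5M2 values rather than figures from the paper:

```python
import numpy as np

def quantize_e5m2(w, e_min=-14, e_max=15):
    """Simulate FP8 (1 sign, 5 exponent, 2 mantissa bits): clamp the exponent
    range and round the fractional mantissa to 2 bits."""
    sign = np.sign(w)
    mag = np.abs(w)
    mag = np.where(mag == 0, np.finfo(np.float32).tiny, mag)   # avoid log2(0)
    exp = np.clip(np.floor(np.log2(mag)), e_min, e_max)
    frac = mag / 2.0 ** exp - 1.0                              # mantissa fraction in [0, 1)
    frac_q = np.round(frac * 4) / 4                            # keep 2 mantissa bits
    return sign * 2.0 ** exp * (1.0 + frac_q)
```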

4.2 Sparse Neighborhood Attention (NATTEN)

Attention computation is localized to a fixed neighborhood:

$$A = \mathrm{softmax}\!\left(\frac{Q K^\top + M}{\sqrt{d_k}}\right) V$$

where $M_{ij} = -\infty$ if $|i - j| > W$ and $0$ otherwise. This reduces complexity from $\mathcal O(N^2)$ to $\mathcal O(NW)$ and yields roughly a 15% throughput gain.
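A dense reference implementation of this masking rule makes the locality explicit; note that real NATTEN kernels never materialize the full N×N matrix, so this sketch is O(N²) and for illustration only:

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(q, k, v, window=7):
    """Token i may only attend to tokens j with |i - j| <= window.
    q, k, v: (..., N, d) tensors over a 1D token sequence."""
    n, d = q.shape[-2], q.shape[-1]
    idx = torch.arange(n, device=q.device)
    mask = (idx[None, :] - idx[:, None]).abs() > window      # True where attention is forbidden
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```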

4.3 Distributed Training Strategies

ZeRO-2 and FSDP-2 are compared for memory efficiency (FSDP-2: ∼74 GB, ZeRO-2: ∼77 GB) and step time, with activation checkpointing on MoE FFNs facilitating training of 2B-parameter networks on 8×H20 GPUs.
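A minimal sketch of this setup using the PyTorch FSDP API, with activation checkpointing restricted to MoE FFN modules; the block and FFN class names are placeholders, and the use of the FSDP-1 wrapper (rather than the newer fully_shard API) is an assumption of the sketch:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

def shard_model(model, transformer_block_cls, moe_ffn_cls):
    """Shard transformer blocks with FSDP (BF16 mixed precision), then apply
    activation checkpointing to the MoE FFN sub-modules."""
    model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({transformer_block_cls}),
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16),
        device_id=torch.cuda.current_device(),
    )
    apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, moe_ffn_cls))
    return model
```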

5. Joint Optimization and Multitask Loss

The three major subsystems are trained jointly with a multi-term objective:

$$\mathcal L_{\mathrm{total}} = \lambda_{\mathrm{flow}} \mathcal L_{\mathrm{flow}} + \lambda_{\mathrm{MoE}} \mathcal L_{\mathrm{Load}} + \lambda_{\mathrm{view}} \mathcal L_{\mathrm{view}} + \lambda_{\mathrm{app}} \mathcal L_{\mathrm{appear}} + \lambda_{\mathrm{iden}} \mathcal L_{\mathrm{iden}} + \lambda_{\mathrm{phys}} \mathcal L_{\mathrm{dyn}} + \lambda_{\mathrm{cons}} \mathcal L_{\mathrm{cons}} + R$$

with $\mathcal L_{\mathrm{flow}}$ (flow-matching/diffusion), $\mathcal L_{\mathrm{view}}$ and $\mathcal L_{\mathrm{appear}}$ (reconstruction), $\mathcal L_{\mathrm{iden}}$ and $\mathcal L_{\mathrm{dyn}}$ (physics), $\mathcal L_{\mathrm{cons}}$ (rendering consistency), and a regularization term $R$.
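In code, the combination reduces to a weighted sum over the per-module terms; the dict-based interface below is illustrative only:

```python
def total_loss(losses, weights, regularizer=0.0):
    """Weighted sum of per-module losses, matching L_total above.
    losses / weights: dicts keyed by term name, e.g.
    'flow', 'load', 'view', 'appear', 'iden', 'dyn', 'cons'."""
    return sum(weights[name] * losses[name] for name in losses) + regularizer
```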

This joint schedule enforces photorealism, geometric and 3D alignment, and physical-dynamics realism in the generated data.

6. Evaluation and Downstream Performance

6.1 Synthetic Data Quality

  • PBench (Robot Set) overall quality rises to 88.2, outperforming other action models at comparable (∼2B-parameter) scale.
  • DreamGenBench (GR1-Env/Obj/Beh) shows instruction-following improvements of 2–5% over previous baselines at similar scale.
  • Visual evaluations confirm the system's ability to generate diverse, multi-view coherent, and physically plausible manipulations across textures, lighting, and camera angles.

6.2 Impact on VLA Model Performance

VLA agents (e.g., GigaBrain-0) trained exclusively on GigaWorld-0 data demonstrate increased success on real-robot benchmarks—including laundry folding, paper towel preparation, table bussing, juice mixing, and box/basket moving—with success rates enhanced by approximately 15–30% relative to agents trained solely on real or simulated data. These results include robust zero-shot generalization to novel objects, viewpoints, and lighting, indicating pronounced cross-domain efficacy.


GigaWorld-0 establishes a scalable methodology for embodied AI data synthesis by unifying controllable video generation, structured 3D scene and physics modeling, and resource-efficient training protocols. Its synthetic corpus enables state-of-the-art downstream policy learning for real-world robotics without real-world training data (Team et al., 25 Nov 2025).
