GigaBrain-0-Small: Efficient VLA for Edge
- GigaBrain-0-Small is a lightweight VLA model designed for efficient deployment on resource-constrained edge platforms, supporting robust robotic manipulation.
- The model achieves significant computational efficiency with 840 GFLOPs and a 0.13s inference latency while maintaining an 80% task success rate.
- Integration of world model–generated data pipelines ensures robust generalization across diverse appearances, object placements, and camera viewpoints.
GigaBrain-0-Small is an optimized, lightweight vision-language-action (VLA) foundation model engineered for efficient deployment on resource-constrained edge platforms such as the NVIDIA Jetson AGX Orin. It is a streamlined variant of GigaBrain‑0, designed explicitly to retain high generalization and performance in dexterous, long-horizon, and mobile robot manipulation tasks, while dramatically reducing computational overhead through architectural compression and system-level optimizations. A hallmark of GigaBrain‑0‑Small is its exploitation of world model–generated data pipelines, enabling robust policy generalization across variations in appearance, object placements, and camera viewpoints.
1. Architectural Design and Training Paradigms
GigaBrain‑0‑Small adopts the mixture-of-transformers approach of the full GigaBrain‑0, which pairs a pretrained vision-language model (VLM), specifically PaliGemma2, with an action diffusion transformer (DiT). In the compact variant, the VLM is replaced by the reduced-capacity SmolVLM2 and the action expert is sized at approximately 100M parameters, bringing the total parameter count to around 402M. This design substantially decreases both FLOPs and memory consumption.
The unified end-to-end training leverages RGB(D) inputs and high-level language instructions, performing multi-modal prediction of intermediate embodied chain-of-thought (CoT) tokens—including subgoal language, discrete actions, and manipulation trajectories—and flow-matched continuous action chunks. The training objective combines a masked cross-entropy term over the reasoning tokens with a flow-matching regression over action chunks:

$$\mathcal{L}(\theta) = \mathbb{E}_{(o,\ell,c,A)\sim\mathcal{D}}\Big[\, m \odot \mathcal{L}_{\mathrm{CE}}(\hat{c}, c) + \lambda\, \mathbb{E}_{\tau,\epsilon}\big\| v_\theta(A^{\tau}, \tau \mid o, \ell) - (A - \epsilon) \big\|^2 \Big], \qquad A^{\tau} = \tau A + (1-\tau)\,\epsilon,$$

where $m$ is a mask for the reasoning tokens, $\tau$ is the flow-matching timestep, $\epsilon \sim \mathcal{N}(0, I)$ is injected Gaussian noise, $\lambda$ weights the trajectory regression, and $\mathcal{D}$ denotes the training data over observations $o$, instructions $\ell$, CoT tokens $c$, and action chunks $A$. Knowledge Insulation, implemented during training, ensures non-interference between action learning and semantic processing in the VLM stream. This training promotes reasoning about spatial geometry, object states, and long-horizon dependencies for robust embodied planning.
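The two-term objective can be made concrete with a short PyTorch sketch, assuming the standard linear-interpolation flow-matching path; the interfaces here (`vlm_logits`, `action_expert`) are illustrative assumptions rather than the released API.

```python
import torch
import torch.nn.functional as F

def training_loss(vlm_logits, cot_targets, cot_mask, action_expert, actions, lam=1.0):
    """Masked CE over embodied CoT tokens plus flow-matching regression
    on continuous action chunks (a sketch, not the official implementation)."""
    # Cross-entropy over reasoning tokens, restricted by the CoT mask m.
    ce = F.cross_entropy(vlm_logits.transpose(1, 2), cot_targets, reduction="none")
    ce_loss = (ce * cot_mask).sum() / cot_mask.sum().clamp(min=1)

    # Flow matching: interpolate noise -> actions at timestep tau, regress the velocity.
    tau = torch.rand(actions.shape[0], 1, 1)        # flow-matching timestep per sample
    eps = torch.randn_like(actions)                  # injected Gaussian noise
    noisy = tau * actions + (1.0 - tau) * eps        # linear interpolation path A^tau
    pred_velocity = action_expert(noisy, tau)        # hypothetical action-expert call
    fm_loss = F.mse_loss(pred_velocity, actions - eps)  # target velocity d(A^tau)/d(tau)

    return ce_loss + lam * fm_loss
```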
2. Computational Efficiency and Edge Deployment
Distinct from its full-scale predecessor, GigaBrain‑0‑Small is architecturally and operationally refined for edge deployment, most notably on the NVIDIA Jetson AGX Orin. Key system-level optimizations include the following (sketched in code after this list):
- SmolVLM2 replaces PaliGemma2, shrinking VRAM and computation demands.
- Redundant CPU–GPU transfers and dtype conversions are eliminated.
- Inference is accelerated via automatic mixed precision (e.g., torch.autocast).
- Rotary Position Embedding (RoPE) sine and cosine look-up tables are precomputed and cached.
- torch.compile is applied for static graph execution in both denoising and VLM forward passes.
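A minimal PyTorch sketch of the last three optimizations, using a toy stand-in module (`TinyPolicy` and its dimensions are assumptions, not the deployed model):

```python
import torch

def precompute_rope_cache(head_dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute RoPE sin/cos look-up tables once, rather than per forward pass."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
    return angles.sin(), angles.cos()  # cached and reused at every inference step

class TinyPolicy(torch.nn.Module):
    """Toy stand-in for the VLM + action-expert forward pass."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = torch.compile(TinyPolicy().to(device).eval())  # static-graph execution
sin_lut, cos_lut = precompute_rope_cache(head_dim=64, max_seq_len=512)

obs = torch.randn(1, 64, device=device)
with torch.inference_mode(), torch.autocast(device_type=device, dtype=torch.bfloat16):
    action = policy(obs)  # mixed-precision inference, avoiding redundant dtype casts
```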
Empirical profiling on Orin demonstrates the reduction in inference cost: GigaBrain‑0‑Small uses 840 GFLOPs, 402M parameters, and 1.9 GB of VRAM, and attains an inference latency of 0.13s, versus a baseline at 4,400 GFLOPs, 3.2B parameters, 17.5 GB of VRAM, and 1.28s latency. This is roughly a 10× reduction in latency and a 9× reduction in VRAM, with negligible sacrifice in task success rate (80% for table bussing).
| Model | Parameters | GFLOPs | VRAM (GB) | Inference Latency (s) | Success Rate |
|---|---|---|---|---|---|
| Baseline | 3.2B | 4,400 | 17.5 | 1.28 | ~80% |
| GigaBrain-0-Small | 402M | 840 | 1.9 | 0.13 | ~80% |
3. Integration of World Model–Generated Data
An essential innovation in both GigaBrain‑0 and GigaBrain‑0‑Small is the utilization of world model–generated data to achieve broad generalization and policy robustness. The data pipeline comprises:
- Real2Real Transfer: Real-world robot trajectories re-rendered with variations in texture, color, and lighting while preserving spatial relationships.
- View Transfer: Multi-view synthesis to instill viewpoint invariance.
- Sim2Real Transfer: Photorealistic augmentation of simulation data to bridge domain gaps.
- Human Transfer & Video Generation: Inpainting human demonstrations as feasible robot trajectories.
Sampling strategies, controlled by a mixing probability $p$, blend synthetic and real data during training; increasing $p$ correlates with substantial improvements in robustness to environmental variability. In viewpoint generalization tasks, for instance, higher settings of $p$ yield success rates exceeding 80%, as illustrated in the sketch below.
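A minimal sketch of this per-sample mixing, assuming episode lists as the dataset structure (all names below are illustrative, not from the release):

```python
import random

def sample_batch(real_episodes, synthetic_episodes, p, batch_size):
    """Draw each training sample from world-model-generated data with
    probability p, and from real robot data otherwise (illustrative sketch)."""
    batch = []
    for _ in range(batch_size):
        pool = synthetic_episodes if random.random() < p else real_episodes
        batch.append(random.choice(pool))
    return batch

# Example: leaning heavily on synthetic data for viewpoint robustness.
batch = sample_batch(real_episodes=["real_ep_0", "real_ep_1"],
                     synthetic_episodes=["wm_ep_0", "wm_ep_1", "wm_ep_2"],
                     p=0.7, batch_size=8)
```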
4. Experimental Validation and Real-World Performance
Comprehensive experiments across dexterous, long-horizon, and mobile manipulation tasks established the effectiveness of the GigaBrain‑0‑Small variant:
- Fine-tuning on 1,000 episodes of table bussing on Orin achieved a comparable success rate (80%) at 12.5% of the parameter footprint and substantially lower latency.
- Appearance variation (garment textures), object placement, and viewpoint tests revealed clear monotonic gains as the share of world model–generated data increased.
- The unified architecture supports CoT reasoning, allowing the agent to adapt to previously unseen configurations.
5. Comparison with Full Model and Prior Systems
GigaBrain‑0‑Small maintains the architectural principles and generalist reasoning capabilities of GigaBrain‑0 but with vastly improved computational frugality. The reduction in parameter count (from billions to hundreds of millions), aggressive system-level optimizations, and targeted use of synthetic data allow deployment on tight-resource devices without requiring extensive real robot data or powerful server GPUs.
Relative to prior VLA models and baselines, GigaBrain‑0‑Small achieves comparable or better real-world task success rates, orders-of-magnitude lower inference time, and operational VRAM well within the constraints of edge platforms.
6. Significance and Application Scope
GigaBrain‑0‑Small’s design brings state-of-the-art VLA reasoning capabilities to edge devices, facilitating generalist robotics in physically diverse and perceptually challenging real-world environments. By leveraging large-scale world model data generation, the model is less reliant on costly physical robot datasets and better insulated against domain shifts. This enables practical deployments for object manipulation, mobile robotics, and potentially other embedded VLM domains where efficiency and robustness are required.
A plausible implication is that further advances in world model data generation and system-level optimization may push the operating limits of on-device VLA models, democratizing high-performance embodied intelligence for a wide spectrum of real-world robotic settings (Team et al., 22 Oct 2025).