- The paper introduces a novel 3D-aware VLM that enriches non-robotic images with 3D cues to efficiently learn robot control policies.
- The paper employs a staged training pipeline combining VLM pretraining, 3D-aware fine-tuning, and action expert training to significantly reduce reliance on extensive robot demonstration data.
- The paper demonstrates state-of-the-art performance in both simulation and real-world tests, achieving over 10% higher success rates than prior VLAs while using drastically fewer robot training frames.
SPEAR-1: Data-Efficient Robotic Foundation Models via 3D-Aware Vision-Language Backbones
Motivation and Problem Definition
The proliferation of vision-language-action (VLA) models has drawn attention to their limited ability to generalize control policies across unseen robot embodiments, environments, and tasks. Central to this bottleneck is an over-reliance on internet-pretrained vision-language models (VLMs), which predominantly operate on 2D image-language pairs and lack grounded 3D spatial reasoning, an essential prerequisite for robotic manipulation in real-world environments. Directly bridging this gap with large-scale robot demonstration data is both costly and logistically unscalable, particularly across diverse scene distributions and camera intrinsics. SPEAR-1 addresses this gap by introducing explicit 3D awareness into the VLM backbone using comparatively accessible non-robotic image datasets enriched with 3D spatial annotations, then bootstrapping these representations for efficient robot action policy learning.
Staged Training Pipeline and Architectural Innovations
The SPEAR-1 methodology consists of a progressive, three-stage pipeline that enables high transferability and zero-shot policy deployment in real-world robotic settings:
- General VLM Pretraining: SPEAR-1 inherits common-sense semantics from large-scale internet VLMs (e.g., PaliGemma).
- 3D-Aware VLM Fine-Tuning: By integrating a monocular geometry encoder (MoGe) alongside SigLIP, the model is trained to solve embodied VQA problems, such as 3D keypoint localization and object-to-object 3D distance estimation, on 2D images enriched with automatically generated 3D cues (a minimal sketch of how such samples can be constructed follows Figure 1).
- Action Expert Training: The model is further augmented with an action expert, trained on ∼45M frames from diverse Open X-Embodiment (OXE) datasets, to learn vision-language-action mappings via flow matching in mixed translation (R3) and rotation (S3 quaternion) spaces.
Figure 1: SPEAR-1's staged training schema learns first from internet-scale VLM corpora, then augments with depth cues and embodied VQA, and finally transitions to robot demonstration data for control policy learning.
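For concreteness, the snippet below is a minimal sketch of how such embodied VQA pairs could be constructed from a single 2D image once per-pixel 3D points and instance masks are available. The helper names, prompt templates, and answer formats are illustrative assumptions, not the paper's exact annotation protocol.

```python
import numpy as np

def make_distance_vqa_sample(point_map, masks, names):
    """Build one object-to-object 3D distance VQA pair from a 2D image.

    point_map : (H, W, 3) per-pixel 3D points in the camera frame,
                e.g. as predicted by a monocular geometry model such as MoGe.
    masks     : dict name -> (H, W) boolean instance mask (e.g. from SAM2).
    names     : two object names to compare, e.g. ("mug", "bowl").
    """
    a, b = names
    # Object centroids in 3D, taken as the mean of each masked point cloud.
    center_a = point_map[masks[a]].mean(axis=0)
    center_b = point_map[masks[b]].mean(axis=0)
    dist_m = float(np.linalg.norm(center_a - center_b))
    return {
        "question": f"What is the 3D distance between the {a} and the {b}?",
        "answer": f"{dist_m:.2f} meters",
    }

def make_keypoint_vqa_sample(point_map, masks, name):
    """3D keypoint localization: ask for an object's 3D centroid."""
    center = point_map[masks[name]].mean(axis=0)
    return {
        "question": f"Where is the {name} located in 3D camera coordinates?",
        "answer": "({:.2f}, {:.2f}, {:.2f})".format(*center),
    }
```

Because the supervision is derived automatically from predicted geometry, any ordinary 2D image collection can be turned into 3D-grounded question-answer pairs without collecting robot data.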
This architecture allows SPEAR-VLM—SPEAR-1's backbone—to encode explicit geometric representations by fusing SigLIP and MoGe features and introducing 1024 dedicated 3D tokens into the language embedding space.
Figure 2: SPEAR-VLM architecture overview with explicit fusion of visual and depth features and embedding expansion for 3D tokenization.
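A rough PyTorch sketch of the fusion idea follows, under the assumption that SigLIP and MoGe both emit per-patch features that are concatenated, projected into the language model's embedding space, and prepended to the text tokens, with the embedding table extended by 1024 dedicated 3D tokens. All layer sizes below are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpearVLMFusion(nn.Module):
    """Illustrative fusion head: concatenate SigLIP patch features with
    MoGe geometry features, project them into the LLM embedding space,
    and reserve 1024 extra embedding rows for dedicated 3D tokens."""

    def __init__(self, siglip_dim=1152, moge_dim=1024, llm_dim=2048,
                 base_vocab=257152, num_3d_tokens=1024):
        super().__init__()
        self.fuse = nn.Linear(siglip_dim + moge_dim, llm_dim)
        # Extended embedding table: base vocabulary plus the new 3D tokens.
        self.tok_embed = nn.Embedding(base_vocab + num_3d_tokens, llm_dim)

    def forward(self, siglip_feats, moge_feats, text_ids):
        # siglip_feats, moge_feats: (B, N_patches, D); text_ids: (B, T)
        visual = self.fuse(torch.cat([siglip_feats, moge_feats], dim=-1))
        text = self.tok_embed(text_ids)
        # Prefix the fused visual tokens to the text sequence for the LLM.
        return torch.cat([visual, text], dim=1)
```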
3D VLM Enhancement and Data Annotation
Unlike prior attempts at spatially aware robot policies, SPEAR-1 implements a rigorous VQA-driven pretraining protocol that explicitly targets control-relevant spatial reasoning. The annotation pipeline for 3D pretraining data leverages Gemini for 2D bounding box generation, SAM2 for instance segmentation, and MoGe for 3D point cloud prediction, so that 3D supervision can be derived even when only 2D images are available. Oriented bounding boxes, inter-object distances, and structured spatial prompts support both generalization and robustness to environmental diversity.
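The sketch below illustrates one way such an annotation pipeline could be wired together. Here `detect_boxes`, `segment`, and `predict_points` are hypothetical stand-ins for the Gemini, SAM2, and MoGe calls, and the PCA-based oriented bounding box is standard geometry rather than the paper's exact recipe.

```python
import numpy as np

def oriented_bbox(points):
    """PCA-based oriented bounding box of an object's 3D points (illustrative).

    Returns the box center, a 3x3 matrix whose columns are the principal
    axes, and the box extents along those axes."""
    center = points.mean(axis=0)
    centered = points - center
    _, _, vt = np.linalg.svd(np.cov(centered.T))   # rows of vt = principal axes
    local = centered @ vt.T                        # points expressed in the box frame
    extents = local.max(axis=0) - local.min(axis=0)
    return center, vt.T, extents

def annotate_image(image, detect_boxes, segment, predict_points):
    """Hypothetical pipeline wiring: `detect_boxes`, `segment`, and
    `predict_points` stand in for the Gemini, SAM2, and MoGe calls."""
    point_map = predict_points(image)              # (H, W, 3) per-pixel 3D points
    annotations = {}
    for name, box2d in detect_boxes(image).items():
        mask = segment(image, box2d)               # (H, W) boolean instance mask
        annotations[name] = oriented_bbox(point_map[mask])
    return annotations                             # name -> (center, axes, extents)
```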
Experimental ablations confirm that object-level 3D VQA in SPEAR-VLM pretraining, rather than low-level pixel-based 3D cues, is the critical determinant of improved downstream VLA control performance. Models initialized with SPEAR-VLM significantly outperform those built on vanilla PaliGemma when evaluated on both simulated and real distributions under substantial scene shift. Architecture-level ablations further indicate that training both the SigLIP and MoGe encoders and then freezing the MoGe weights during VLA fine-tuning yields the highest success rates.
Figure 3: 3D ablation environments showcase training evaluation setups used to stress test zero-shot generalization over diverse backgrounds and object distributions.
Action Expert Design: Flow Matching on R3 and S3
The action expert leverages conditional flow matching across translation and rotation action components, with rotation modeled on the unit quaternion manifold for stability and robustness. This choice, in conjunction with global quantile normalization of control targets and action chunking at 5 Hz, discourages per-dataset memorization and encourages shared policy learning across embodiments. Careful experimental design, including deterministic CUDA operations and EMA checkpointing, stabilizes evaluation outcomes.
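As a minimal sketch of what conditional flow matching over a mixed R3 and S3 action space can look like, the snippet below builds the interpolated states and velocity targets that a denoising network would regress. The scalar-first quaternion convention, the uniform random unit quaternion used as rotation "noise", and the tangent-space regression target are assumptions for illustration, not the paper's exact parameterization; sign handling for the quaternion double cover is also omitted.

```python
import torch
import torch.nn.functional as F

def quat_conj(q):                       # q: (..., 4), scalar-first (w, x, y, z)
    return q * torch.tensor([1.0, -1.0, -1.0, -1.0], device=q.device)

def quat_mul(a, b):
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw], dim=-1)

def quat_log(q, eps=1e-8):
    """Log map on S3: unit quaternion -> rotation vector (axis * angle)."""
    w, v = q[..., :1], q[..., 1:]
    norm_v = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    angle = 2.0 * torch.atan2(norm_v, w)
    return angle * v / norm_v

def slerp(q0, q1, t):
    """Geodesic interpolation between unit quaternions, t in [0, 1]."""
    rotvec = t * quat_log(quat_mul(quat_conj(q0), q1))
    half = 0.5 * rotvec.norm(dim=-1, keepdim=True)
    axis = F.normalize(rotvec, dim=-1, eps=1e-8)
    step = torch.cat([half.cos(), half.sin() * axis], dim=-1)
    return quat_mul(q0, step)

def flow_matching_targets(trans_1, quat_1):
    """One conditional flow-matching step on R3 and S3 (illustrative).

    trans_1: (B, 3) quantile-normalized translation targets.
    quat_1 : (B, 4) unit quaternion rotation targets.
    Returns interpolated states and the velocities a network would regress."""
    B = trans_1.shape[0]
    t = torch.rand(B, 1)
    # R3 branch: straight-line path from Gaussian noise to the target.
    trans_0 = torch.randn_like(trans_1)
    trans_t = (1 - t) * trans_0 + t * trans_1
    trans_vel = trans_1 - trans_0
    # S3 branch: geodesic path from a random unit quaternion to the target;
    # the regression target is the constant angular velocity of that geodesic.
    quat_0 = F.normalize(torch.randn(B, 4), dim=-1)
    quat_t = slerp(quat_0, quat_1, t)
    ang_vel = quat_log(quat_mul(quat_conj(quat_0), quat_1))
    return trans_t, quat_t, trans_vel, ang_vel
```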
Empirical Evaluation: Simulation and Real-World Robotic Manipulation
SPEAR-1 is extensively validated both in simulation (SIMPLER/WidowX) and on real hardware (WidowX, Franka Research 3), using scoring rubrics that award partial credit for sub-goal achievement across manipulation tasks. In simulation, SPEAR-1 achieves an average success rate over 10% higher than SpatialVLA and OpenVLA. Real-world results are more striking: on WidowX, SPEAR-1 delivers average task progress roughly 10% higher than OpenVLA; on Franka, without any environment-specific fine-tuning, SPEAR-1 matches or noticeably outperforms both the π0-FAST and π0.5 policies, despite being trained on 20× fewer robot demonstration frames and in less diverse environments.
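As an illustration of partial-credit scoring, a task can be expressed as an ordered rubric of sub-goals and an episode scored by the fraction it completes; the rubric below is a hypothetical example, not the paper's actual scoring sheet.

```python
def task_progress(subgoals_achieved, rubric):
    """Partial-credit scoring (illustrative): an episode earns credit for
    each consecutive sub-goal it completes, normalized by rubric length."""
    score = 0
    for goal in rubric:
        if goal in subgoals_achieved:
            score += 1
        else:
            break                      # later sub-goals require earlier ones
    return score / len(rubric)

# Example: a pick-and-place episode that grasps and lifts but fails to place.
print(task_progress({"reach", "grasp", "lift"},
                    ["reach", "grasp", "lift", "place"]))  # 0.75
```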
Figure 4: SPEAR-1's real-world performance on WidowX, exhibiting consistently higher task progress across a variety of manipulation challenges relative to OpenVLA.
Figure 5: On Franka, SPEAR-1 matches π0.5 zero-shot performance and exceeds π0-FAST, with no target environment fine-tuning, demonstrating superior data efficiency.
Implications and Future Directions
SPEAR-1 provides compelling empirical evidence that substantial generalization in embodied robotic policies is achievable with dramatically less robot demonstration data when suitably enriched VLM backbones are used. The findings suggest that explicit 3D pretraining on non-robotic image datasets is a scalable alternative to large-scale robot data collection, particularly for diverse or out-of-distribution deployment scenarios. The approach motivates future work on architectures capable of handling deformable or non-rigid objects and on better metrics for VLM-derived 3D tokenization. Open questions remain around scaling laws that link VLM pretraining data to transfer efficacy in robot control, as well as strategies for eliminating fine-tuning on the target embodiment.
Conclusion
SPEAR-1 establishes that generalist robot policies can be robustly deployed across diverse platforms and environments using an architecture that explicitly encodes 3D spatial reasoning in its vision-language backbone. Empirical results demonstrate state-of-the-art performance with an order of magnitude less robot demonstration data, validating the hypothesis that 3D annotation of non-robotic images is a scalable proxy for acquiring control-relevant embodied knowledge. This work strongly advocates that future robotic foundation models prioritize architectural and data-centric enhancements in the VLM pretraining stage to achieve broad generalization and data efficiency in embodied AI.
Reference: SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding (2511.17411)