SPEAR-1: 3D-Aware Vision–Language Robot Control
- SPEAR-1 is a two-stage vision–language–action foundation model that augments 2D pretrained VLMs with explicit 3D spatial reasoning to enable effective robot policy learning.
- It leverages the SPEAR-VLM architecture by integrating a monocular depth encoder with a vision–language model, thus extracting 3D information from single RGB images.
- Its design significantly reduces reliance on large-scale robot demonstrations while achieving competitive zero-shot control across diverse embodiments and unseen environments.
SPEAR-1 is a two-stage vision–language–action foundation model for robot control that injects explicit 3D spatial reasoning into an off-the-shelf vision–language model (VLM) before learning embodied action. Introduced in “SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding” (Nikolov et al., 21 Nov 2025), it is built by first training a 3D-aware VLM, termed SPEAR-VLM, from monocular RGB images with 3D annotations, and then attaching a flow-matching action expert, following the $\pi_0$ architecture, for end-to-end robot-policy learning. The central claim is that enriching pretrained VLMs with grounded 3D understanding can reduce dependence on large-scale robot demonstrations while preserving or improving zero-shot control performance across embodiments and unseen environments (Nikolov et al., 21 Nov 2025).
1. Conceptual basis and scope
SPEAR-1 is presented as a response to a specific limitation in Robotic Foundation Models (RFMs): most RFMs are obtained by fine-tuning internet-pretrained vision–language models (VLMs), yet those VLMs are trained on 2D image-language tasks and therefore lack the 3D spatial reasoning required for embodied control in the physical world (Nikolov et al., 21 Nov 2025). The proposed remedy is not to scale robotic data collection directly, but to enhance a pretrained VLM with 3D understanding using easy-to-collect non-robotic image data annotated in 3D, and then transfer that capability into robot learning.
The model is explicitly organized into two stages. In Stage 1, PaliGemma, described as a 3 billion-parameter late-fusion VLM, is augmented with a monocular depth encoder, MoGe, to form SPEAR-VLM. In Stage 2, a Flow-Matching action expert is attached on top of SPEAR-VLM and trained end-to-end on approximately 45 million frames from 24 Open X-Embodiment datasets, producing SPEAR-1 (Nikolov et al., 21 Nov 2025).
This design suggests a division of labor between perception-language grounding and control. A plausible implication is that the architecture treats 3D grounding as a reusable substrate rather than as a capability that must be relearned from robot demonstrations alone.
2. SPEAR-VLM: 3D-aware vision–language foundation
The Stage 1 model, SPEAR-VLM, starts from PaliGemma, whose base components are specified as a SigLIP vision encoder, a linear projector, and a Gemma LLM (Nikolov et al., 21 Nov 2025). To introduce 3D awareness, SPEAR-VLM adds the MoGe monocular depth encoder. From MoGe’s last four ViT layers, intermediate features are extracted, concatenated along the channel dimension, and projected via a learned linear layer into the same embedding dimension as SigLIP’s projector. The two projected feature streams, SigLIP and MoGe, are then averaged token-wise and fed into the Gemma LLM together with text tokens (Nikolov et al., 21 Nov 2025).
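The fusion step described above (concatenate four MoGe layers along channels, project linearly, average with the SigLIP stream token by token) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy dimensions, and the assumption that SigLIP and MoGe produce the same number of tokens are all hypothetical.

```python
import random

def linear(x, weight, bias):
    """Return W x + b for one token vector (plain-Python matrix-vector product)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def fuse_tokens(siglip_tokens, moge_layers, weight, bias):
    """Fuse projected SigLIP tokens with MoGe features, one token at a time.

    siglip_tokens: [num_tokens][d_model], SigLIP tokens after their projector.
    moge_layers:   four feature maps, each [num_tokens][d_moge] (assumed aligned
                   with the SigLIP token grid; this alignment is an assumption).
    weight, bias:  learned linear projection from 4 * d_moge to d_model.
    """
    fused = []
    for i, sig in enumerate(siglip_tokens):
        # Concatenate the four MoGe layer features along the channel dimension.
        concat = [v for layer in moge_layers for v in layer[i]]
        depth = linear(concat, weight, bias)
        # Token-wise average of the two projected streams.
        fused.append([(s + d) / 2.0 for s, d in zip(sig, depth)])
    return fused

# Toy shapes (hypothetical): 3 tokens, d_moge = 2 per layer, d_model = 4.
rng = random.Random(0)
num_tokens, d_moge, d_model = 3, 2, 4
siglip_tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]
moge_layers = [[[rng.gauss(0, 1) for _ in range(d_moge)] for _ in range(num_tokens)]
               for _ in range(4)]
weight = [[rng.gauss(0, 0.02) for _ in range(4 * d_moge)] for _ in range(d_model)]
bias = [0.0] * d_model
fused = fuse_tokens(siglip_tokens, moge_layers, weight, bias)
print(len(fused), len(fused[0]))  # 3 4
```

Averaging keeps the token count and embedding width unchanged, so the language model sees a sequence of the same shape as before the depth stream was added.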
A key representational modification is the introduction of new “depth” tokens, each representing one quantized scalar within a depth range computed from the 1st and 99th depth quantiles. Their embeddings are randomly initialized from the pretrained token distribution (Nikolov et al., 21 Nov 2025). This converts 3D coordinate prediction into an autoregressive token-generation problem within the LLM.
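A quantile-bounded token vocabulary of this kind can be illustrated with a short sketch. The bin count, the uniform bin spacing, and the function names below are assumptions made for illustration; the source only states that the range comes from the 1st and 99th depth quantiles.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac

def depth_range(depths):
    """Robust range from the 1st and 99th quantiles of observed depths."""
    return quantile(depths, 0.01), quantile(depths, 0.99)

def encode(value, d_min, d_max, num_tokens):
    """Map a scalar to a token index, clipping values outside the range."""
    clipped = min(max(value, d_min), d_max)
    return round((clipped - d_min) / (d_max - d_min) * (num_tokens - 1))

def decode(index, d_min, d_max, num_tokens):
    """Map a token index back to the scalar at the center of its bin."""
    return d_min + index / (num_tokens - 1) * (d_max - d_min)

# Hypothetical setup: 256 uniformly spaced depth tokens.
depths = [i / 100 for i in range(101)]
d_min, d_max = depth_range(depths)
token = encode(0.5, d_min, d_max, 256)
recovered = decode(token, d_min, d_max, 256)
```

Values outside the quantile range are clipped to the first or last token, which is one way to see why predictions outside the calibrated depth range lose accuracy.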
The resulting model is described as answering control-relevant 3D questions from a single RGB image. Denoting an image by $I$, the model effectively implements a regression function
$$f_\theta : I \mapsto \hat{p} \in \mathbb{R}^3$$
by generating strings of 3D tokens corresponding to quantized real-valued coordinates. During VQA pretraining, each example provides a ground-truth point $p^{\ast} \in \mathbb{R}^3$, such as the center of an oriented 3D bounding box. The continuous proxy loss is written as
$$\mathcal{L}_{\text{3D}} = \lVert f_\theta(I) - p^{\ast} \rVert,$$
although optimization is performed through next-token prediction over the 3D tokens (Nikolov et al., 21 Nov 2025).
The prompt format explicitly binds semantic queries to geometric outputs. An example prompt is “Output the 3D bounding box vertices of object X.” The sequence is structured as [<image-tokens>], “⟨IMG⟩”, question tokens, <3D-token-slots>…, and the model attends jointly over the SigLIP+MoGe image tokens and the question tokens. The 3D-token slots are then filled autoregressively by Gemma’s next-token head (Nikolov et al., 21 Nov 2025). This establishes a language-mediated interface to 3D coordinate space rather than a separate geometric head detached from the LLM.
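The sequence layout above can be made concrete with a small sketch. Everything named here is hypothetical: the `<img_i>`, `<3d_slot>`, and `<depth_k>` spellings, the whitespace tokenizer standing in for the real one, and the `decode` callback that maps a bin index to a real value.

```python
import re

IMG_MARKER = "⟨IMG⟩"                         # separator shown in the prompt layout
SLOT = "<3d_slot>"                           # hypothetical unfilled 3D-token slot
DEPTH_TOKEN = re.compile(r"<depth_(\d+)>")   # hypothetical depth-token spelling

def build_sequence(num_image_tokens, question, num_slots):
    """Lay out image tokens, the marker, question tokens, then empty 3D slots."""
    image = [f"<img_{i}>" for i in range(num_image_tokens)]
    return image + [IMG_MARKER] + question.split() + [SLOT] * num_slots

def parse_points(generated, decode):
    """Turn generated depth tokens into (x, y, z) tuples, three tokens per point."""
    if len(generated) % 3 != 0:
        raise ValueError("expected a multiple of three coordinate tokens")
    values = []
    for token in generated:
        match = DEPTH_TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"not a depth token: {token!r}")
        values.append(decode(int(match.group(1))))
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]

# Eight box vertices need 8 * 3 = 24 coordinate slots.
sequence = build_sequence(4, "Output the 3D bounding box vertices of object X.", 24)
```

In this sketch the slots are filled one token at a time, which mirrors the autoregressive decoding described in the text.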
3. Data construction and Stage 1 training
Stage 1 training uses approximately 230 k images: 200 k from EgoExo4D cooking and bike-repair, and 30 k frames from Bridge-V2 demonstrations (Nikolov et al., 21 Nov 2025). These images are converted into 3D training data through a semi-automatic annotation pipeline consisting of Gemini for 2D bounding boxes and labels, SAM2 for instance masks, and MoGe for a dense point cloud. From the masked point cloud, oriented 3D bounding boxes and distances are computed (Nikolov et al., 21 Nov 2025).
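The last stage of this pipeline, turning a masked point cloud into box and distance labels, can be sketched as follows. As a deliberate simplification, this sketch computes an axis-aligned box rather than the oriented box used in the source, and all function names are hypothetical.

```python
import math

def masked_points(points, mask):
    """Keep only the points whose pixel belongs to the instance mask."""
    return [p for p, keep in zip(points, mask) if keep]

def aabb(points):
    """Axis-aligned bounding box as (min_corner, max_corner).

    Simplification: the source computes oriented boxes; an axis-aligned box
    is used here only to keep the example short.
    """
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def box_center(box):
    lo, hi = box
    return tuple((a + b) / 2 for a, b in zip(lo, hi))

def box_vertices(box):
    """The eight corner points of the box."""
    lo, hi = box
    return [(x, y, z) for x in (lo[0], hi[0]) for y in (lo[1], hi[1]) for z in (lo[2], hi[2])]

def distance(a, b):
    """Scalar Euclidean distance between two 3D points."""
    return math.dist(a, b)

# Toy point cloud: four points on the object, one background point.
points = [(0, 0, 1), (1, 0, 1), (0, 2, 1), (1, 2, 3), (9, 9, 9)]
mask = [True, True, True, True, False]
box = aabb(masked_points(points, mask))
center = box_center(box)
```

The center and vertices computed this way correspond to the keypoint and bounding-box targets, and `distance` between two object centers corresponds to the scalar object-to-object distance target.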
The VQA supervision includes several task families: predicting 3D keypoints, 3D bounding-box vertices, object-to-object distances in scalar and xyz form, 2D back-projections of 3D vertices, and simple chain-of-thought comparisons such as “Which object is closer?” (Nikolov et al., 21 Nov 2025). These tasks define the 3D grounding prior that later supports robot control.
Training proceeds in two phases. In the first phase, lasting 2 k steps on 16 H200s, the SigLIP-to-projector and MoGe-to-projector weights are initialized randomly, the Gemma and MoGe encoders are frozen, and only the newly initialized layers are trained. In the second phase, lasting 10 k steps, the SigLIP and MoGe encoders remain frozen, the expanded Gemma is fine-tuned, and the cross-entropy on 3D tokens is upweighted relative to ordinary text tokens (Nikolov et al., 21 Nov 2025).
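Upweighting the loss on 3D tokens can be illustrated with a weighted token-level cross-entropy. This is only a sketch: the weight value, the normalization by the sum of weights, and the function name are assumptions, since the source does not spell them out here.

```python
import math

def weighted_token_ce(log_probs, targets, is_3d_token, weight_3d):
    """Weighted mean of per-token negative log-likelihoods.

    log_probs:   per position, a list of log-probabilities over the vocabulary.
    targets:     per position, the index of the correct token.
    is_3d_token: per position, True if the target is a 3D token.
    weight_3d:   hypothetical upweighting factor applied to 3D-token positions.
    """
    total, norm = 0.0, 0.0
    for lp, target, is_3d in zip(log_probs, targets, is_3d_token):
        w = weight_3d if is_3d else 1.0
        total += -w * lp[target]
        norm += w
    # Normalizing by the weight sum (an assumption) keeps the loss scale stable.
    return total / norm

# One text position and one 3D position, with a hypothetical weight of 3.
log_probs = [[math.log(0.5), math.log(0.5)], [math.log(0.25), math.log(0.75)]]
loss = weighted_token_ce(log_probs, [0, 0], [False, True], weight_3d=3.0)
```

With this weighting, an error on a coordinate token moves the loss more than an equally likely error on a text token.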
These choices indicate that the 3D augmentation is inserted with limited disturbance to the underlying pretrained visual and depth encoders. This suggests that SPEAR-VLM is intended as a lightweight adaptation of an existing VLM into a 3D-aware VLM rather than a fully rederived multimodal architecture.
4. Policy architecture and robot-learning stage
SPEAR-1 uses the $\pi_0$ architecture in its action stage (Nikolov et al., 21 Nov 2025). The VLM, now SPEAR-VLM, processes two camera views and a language instruction $\ell$, while an action expert, described as a 12-layer Gemma-sized transformer of approximately 300 M parameters, processes the robot proprioceptive state $s_t$.
At time step $t$, the observation is
$$o_t = \left(I_t^{\text{ext}},\, I_t^{\text{wrist}},\, \ell,\, s_t\right).$$
The action expert attends to SPEAR-VLM’s intermediate key/value pairs via cross-attention. If $z_t$ denotes the attended 3D-grounded visual–language features and $e_\ell$ the language embedding, the per-time-step policy is written abstractly as
$$a_t \sim \pi_\theta\left(\,\cdot \mid z_t,\, e_\ell,\, s_t\right)$$
and is trained via conditional flow matching (Nikolov et al., 21 Nov 2025).
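The cross-attention pattern mentioned above, in which action-expert queries read from the VLM's keys and values, can be sketched with single-head scaled dot-product attention. This is a generic illustration under simplifying assumptions (one head, no learned projections, no masking), not the model's actual attention layout.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: [num_action_tokens][d], produced by the action expert.
    keys:    [num_vlm_tokens][d], taken from the VLM.
    values:  [num_vlm_tokens][d_v], taken from the VLM.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# One action token attending over two VLM tokens with identical keys:
# the output is the plain average of the two value vectors.
attended = cross_attention([[1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]], [[2.0, 0.0], [4.0, 2.0]])
```

Because the keys and values come from the VLM, the action tokens can read 3D-grounded features without the VLM's own token stream being modified.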
Stage 2 uses purely robot demonstrations; the non-robotic 3D-annotated images appear only in Stage 1 (Nikolov et al., 21 Nov 2025). The pretraining mixture contains 45 M frames from 24 Open X-Embodiment datasets. The camera configuration uses fixed resolutions for the external and wrist views, with center-cropping and padding performed without aspect distortion so that MoGe depth estimation remains consistent (Nikolov et al., 21 Nov 2025).
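Cropping and padding without aspect distortion comes down to simple geometry, sketched below. The target size of 224×224 is a hypothetical placeholder, and the function names are illustrative; the actual preprocessing code may differ.

```python
def center_crop_box(width, height, target_w, target_h):
    """Largest centered crop with aspect ratio target_w:target_h.

    Returns (left, top, crop_w, crop_h) in source pixels.
    """
    if width * target_h > height * target_w:   # source is wider than the target
        crop_h = height
        crop_w = height * target_w // target_h
    else:                                      # source is taller, or equal
        crop_w = width
        crop_h = width * target_h // target_w
    return (width - crop_w) // 2, (height - crop_h) // 2, crop_w, crop_h

def letterbox(width, height, target_w, target_h):
    """Uniform scale plus symmetric padding.

    Returns (new_w, new_h, pad_left, pad_top); both axes share one scale
    factor, so the image content is never stretched.
    """
    scale = min(target_w / width, target_h / height)
    new_w, new_h = round(width * scale), round(height * scale)
    return new_w, new_h, (target_w - new_w) // 2, (target_h - new_h) // 2

# Hypothetical 640x480 camera frame and 224x224 model input.
crop = center_crop_box(640, 480, 224, 224)
padded = letterbox(640, 480, 224, 224)
```

Either route keeps horizontal and vertical scale equal, which matters because a monocular depth estimator relies on undistorted image geometry.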
This pipeline makes the role of Stage 1 precise: 3D VQA supervision is not jointly optimized with robot-policy loss in Stage 2. Instead, the robot policy inherits language grounding from SPEAR-VLM.
5. Flow-matching formulation and optimization details
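The Stage 2 action-training objective is conditional flow matching. Each action consists of a translation component $x \in \mathbb{R}^3$ and a rotation component expressed as a unit quaternion $q \in S^3$. It is noised at a random flow step $\tau \in [0, 1]$ using linear interpolation for translation and spherical linear interpolation on $S^3$ for rotation. In generic notation, with $(x_0, q_0)$ the noise sample and $(x_1, q_1)$ the demonstrated action:
$$x_\tau = (1 - \tau)\,x_0 + \tau\,x_1$$
$$q_\tau = \operatorname{slerp}(q_0, q_1; \tau) = \frac{\sin\big((1-\tau)\Omega\big)}{\sin\Omega}\,q_0 + \frac{\sin(\tau\Omega)}{\sin\Omega}\,q_1, \qquad \cos\Omega = \langle q_0, q_1 \rangle$$
The model predicts a denoising vector field $v_\theta = (v_\theta^{x}, v_\theta^{q})$, evaluated at $(x_\tau, q_\tau, \tau)$ and conditioned on the observation (Nikolov et al., 21 Nov 2025).
The loss decomposes into Euclidean and quaternion terms, which compare the predicted field with the target velocities $u_\tau^{x}$ and $u_\tau^{q}$ of the two interpolants:
$$\mathcal{L}_{\text{trans}} = \mathbb{E}\left[\lVert v_\theta^{x} - u_\tau^{x} \rVert^2\right]$$
and
$$\mathcal{L}_{\text{rot}} = \mathbb{E}\left[\lVert v_\theta^{q} - u_\tau^{q} \rVert^2\right]$$
The combined loss is
$$\mathcal{L} = \mathcal{L}_{\text{trans}} + \mathcal{L}_{\text{rot}}$$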
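The noising step described above, linear interpolation for translation and spherical linear interpolation for rotation, can be sketched as follows. The convention that flow step 0 is pure noise and flow step 1 is the demonstrated action, the Gaussian noise for translation, and the function names are all assumptions made for illustration.

```python
import math
import random

def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def lerp(a, b, tau):
    """Linear interpolation between two vectors."""
    return tuple((1 - tau) * x + tau * y for x, y in zip(a, b))

def slerp(q0, q1, tau):
    """Spherical linear interpolation between unit quaternions."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; take the shorter arc.
        q1, dot = tuple(-c for c in q1), -dot
    omega = math.acos(min(dot, 1.0))
    if omega < 1e-6:
        # Nearly identical rotations: fall back to normalized lerp.
        return normalize(lerp(q0, q1, tau))
    s = math.sin(omega)
    w0, w1 = math.sin((1 - tau) * omega) / s, math.sin(tau * omega) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))

def noised_action(translation, quaternion, rng):
    """Sample a flow step and return the interpolated action and target."""
    tau = rng.random()
    x0 = tuple(rng.gauss(0, 1) for _ in range(3))            # translation noise
    q0 = normalize(tuple(rng.gauss(0, 1) for _ in range(4)))  # random rotation
    x_tau = lerp(x0, translation, tau)
    q_tau = slerp(q0, quaternion, tau)
    # For linear interpolation the target velocity is constant: x1 - x0.
    u_x = tuple(b - a for a, b in zip(x0, translation))
    return tau, x0, x_tau, q_tau, u_x

rng = random.Random(0)
q_90x = (math.cos(math.pi / 4), math.sin(math.pi / 4), 0.0, 0.0)  # 90 deg about x
tau, x0, x_tau, q_tau, u_x = noised_action((0.1, 0.2, 0.3), q_90x, rng)
```

Interpolating on the sphere keeps every intermediate rotation a valid unit quaternion, which plain linear interpolation of quaternion components does not guarantee.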
The training specification adds several implementation constraints. There is no additional language loss in Stage 2; grounding is inherited from SPEAR-VLM. Training uses action chunks at 5 Hz, global min-max normalization on translation deltas, and half-space quaternions, which restrict each rotation to one of its two equivalent representations $q$ and $-q$ to avoid sign ambiguity (Nikolov et al., 21 Nov 2025).
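Both preprocessing choices can be sketched in a few lines. The target interval of [-1, 1], the scalar-first quaternion layout with a non-negative scalar part, and the function names are assumptions; the source names the techniques but this sketch fills in common conventions.

```python
def minmax_fit(deltas):
    """Per-dimension global minimum and maximum over all translation deltas."""
    dims = list(zip(*deltas))
    return [min(d) for d in dims], [max(d) for d in dims]

def minmax_normalize(delta, lo, hi):
    """Map each dimension from [lo, hi] to [-1, 1] (constant dims map to 0)."""
    return tuple(2 * (v - l) / (h - l) - 1 if h > l else 0.0
                 for v, l, h in zip(delta, lo, hi))

def minmax_denormalize(norm, lo, hi):
    """Inverse of minmax_normalize for non-constant dimensions."""
    return tuple((v + 1) / 2 * (h - l) + l for v, l, h in zip(norm, lo, hi))

def to_half_space(q):
    """Pick the representative with non-negative scalar part.

    q is (w, x, y, z); q and -q describe the same rotation, so flipping the
    sign changes the numbers but not the rotation.
    """
    return q if q[0] >= 0 else tuple(-c for c in q)

deltas = [(0.0, -1.0, 2.0), (1.0, 1.0, 4.0), (0.5, 0.0, 3.0)]
lo, hi = minmax_fit(deltas)
normalized = [minmax_normalize(d, lo, hi) for d in deltas]
canonical = to_half_space((-0.5, 0.5, 0.5, 0.5))
```

Without the sign convention, two numerically opposite targets could describe the same rotation, which would give the regression target two valid answers for one pose.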
The reported ablations state that $S^3$ flow matching is better than linear flow on the quaternion, and that min-max translation normalization is better than mean-std normalization (Nikolov et al., 21 Nov 2025). Within the scope of the reported experiments, these ablations identify the rotation parameterization and translation normalization as nontrivial contributors to policy performance.
6. Empirical results, ablations, and comparison points
The principal empirical claim is that SPEAR-1 outperforms or matches state-of-the-art models such as $\pi_0$-FAST and $\pi_{0.5}$ while using 20× fewer robot demonstrations (Nikolov et al., 21 Nov 2025). More specifically, on Franka (DROID) real-world zero-shot evaluation in unseen environments, SPEAR-1 outperforms $\pi_0$-FAST and matches $\pi_{0.5}$ despite being trained on 20× fewer robot-demonstration frames, namely 45 M (Nikolov et al., 21 Nov 2025). On WidowX real-world evaluation, SPEAR-1 is reported to achieve an average task-progress improvement of +10% over OpenVLA (Nikolov et al., 21 Nov 2025).
The ablation studies isolate the contribution of Stage 1 3D grounding. On SIMPLER WidowX, using a single kitchen sink for training and unseen environments for evaluation, the following VLM-stage 3D ablation is reported (Nikolov et al., 21 Nov 2025):
| Variant | Result |
|---|---|
| No 3D tasks (PaliGemma backbone) | 20.8% success |
| SPEAR-VLM without object-level tasks | 20.8% |
| SPEAR-VLM with object-level 3D prompts (SigLIP+MoGe train at VLM, freeze MoGe at VLA) | 35.4% |
A separate VLM ablation with Franka (DROID) pretraining reports 34% average progress for the policy built on PaliGemma and 46% for the policy built on SPEAR-VLM, a gain of 12 points (Nikolov et al., 21 Nov 2025). The details further state that “Fine-tune SigLIP at VLM, freeze MoGe at VLA” is the best among the compared VLA design variants (Nikolov et al., 21 Nov 2025).
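For scale, the two reported gains can be expressed both as absolute differences and as relative improvements. The relative figures below are derived here from the reported numbers and are not themselves reported in the source:

$$35.4\% - 20.8\% = 14.6 \text{ points}, \qquad \frac{14.6}{20.8} \approx 0.70$$

$$46\% - 34\% = 12 \text{ points}, \qquad \frac{12}{34} \approx 0.35$$

That is, roughly a 70% relative improvement in success on SIMPLER WidowX and roughly a 35% relative improvement in average progress on Franka (DROID).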
Taken together, these ablations indicate that not all forms of 3D supervision are equally beneficial. The identical 20.8% success of the “No 3D tasks” and “SPEAR-VLM without object-level tasks” conditions suggests that object-level 3D prompting, rather than generic 3D augmentation alone, is a decisive component in the reported gains. By implication, semantically grounded 3D queries matter more for downstream manipulation than depth features without explicit object-level supervision.
7. Limitations, deployment considerations, and significance
Several limitations are stated explicitly. SPEAR-VLM’s 3D tokens represent affine-invariant depths rather than metric depth, so deformable or highly non-rigid objects are challenging. Generalization degrades for objects outside the quantile-computed depth range. Fine-tuning on the target embodiment still helps. Scaling laws for 3D VLM data remain to be studied (Nikolov et al., 21 Nov 2025). These caveats define the boundaries of the method’s current generality.
Practical deployment details are also specified. Training SPEAR-VLM requires approximately 18 hours on 16 NVIDIA H200 GPUs, while SPEAR-1 Stage 2 requires approximately 6 days on 32 H200 GPUs for the 300 M action expert plus the 3 B VLM. Inference runs in real time at 5 Hz on a single H200 with mixed precision (Nikolov et al., 21 Nov 2025). For adaptation to new robots or environments, the prescribed procedure is to reuse the Stage 1 SPEAR-VLM as is and fine-tune only the action expert on a small set of demonstrations of about 50 k frames. The accompanying recommendations are to maintain the same camera resolutions, avoid aspect-ratio distortion, freeze MoGe during robot training to preserve depth priors, recompute the 3D-token quantiles if the workspace depth range shifts significantly, and use EMA checkpointing together with deterministic GPU settings to stabilize fine-tuning (Nikolov et al., 21 Nov 2025).
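EMA checkpointing, mentioned in the adaptation procedure, keeps a smoothed copy of the weights alongside the live ones. The sketch below shows the standard update rule; the decay value and the dictionary-of-floats parameter format are hypothetical simplifications.

```python
def ema_update(ema_params, params, decay):
    """One exponential-moving-average step over a dict of scalar parameters.

    ema <- decay * ema + (1 - decay) * current
    """
    return {name: decay * ema_params[name] + (1 - decay) * params[name]
            for name in params}

# Hypothetical decay of 0.9 (real runs typically use values much closer to 1).
live = {"w": 1.0}
ema = {"w": 0.0}
for _ in range(2):
    ema = ema_update(ema, live, decay=0.9)
```

Saving the averaged weights rather than the live ones damps step-to-step noise, which is useful when fine-tuning on a small demonstration set.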
The associated dataset and release plan are part of the system’s intended research impact. The 3D-annotated VLM data comprise 230 k images from EgoExo4D and Bridge-V2 annotated with labels, masks, MoGe point clouds, and 3D bounding boxes; the robot data come from 24 OXE datasets; and the model weights and 3D-annotated datasets are to be publicly released, with standard JSON for masks, point clouds, and coordinate files accompanying images (Nikolov et al., 21 Nov 2025).
In the reported framing, SPEAR-1 demonstrates that explicit 3D spatial reasoning can be injected into a VLM backbone through non-robotic 3D-annotated images and then transferred into robot policy learning, reducing reliance on robot demonstrations while sustaining strong zero-shot generalization across embodiments and unseen environments (Nikolov et al., 21 Nov 2025). A plausible implication is that the work situates 3D-aware VLM pretraining as an intermediate scaling axis for robotic foundation models, complementary to direct scaling of teleoperation data rather than a replacement for it.