
SPEAR-VLM: 3D Robotics & Space Perception

Updated 3 July 2026
  • SPEAR-VLM is a dual-purpose vision-language approach that integrates 3D geometric reasoning for robotic control and pseudo-labeling pipelines for spacecraft imagery.
  • In robotics, it extends PaliGemma with a monocular depth encoder and 3D token embeddings to boost object localization from 2D images.
  • For spacecraft perception, it refines VLM-derived pseudo-labels via test-time augmentation, weighted fusion, and distillation for real-time lightweight inference.

In current arXiv usage, SPEAR-VLM denotes two distinct vision-language systems with different problem settings and objectives. In robotics, SPEAR-VLM is a 3D-aware VLM introduced as the perceptual backbone of SPEAR-1, where it extends PaliGemma with a monocular depth encoder, 3D token embeddings, and VQA-style 3D supervision so that the backbone can infer object coordinates in 3D space from a single 2D image (Nikolov et al., 21 Nov 2025). In space-domain perception, SPEAR-VLM denotes an annotation-free detection and segmentation pipeline that uses a pre-trained VLM to generate pseudo-labels for spacecraft imagery, refines those labels with test-time augmentation and Weighted Boxes Fusion, and distills them into lightweight student models for inference (Hicsonmez et al., 4 Feb 2026). The shared designation reflects a common reliance on VLMs, but the two systems differ fundamentally in architecture, supervision, downstream tasks, and deployment assumptions.

1. Terminological scope and research context

The robotics usage emerges from the argument that most vision-language models are pretrained on 2D web image-language data, making them strong at semantics but weak at the 3D spatial reasoning required for embodied control (Nikolov et al., 21 Nov 2025). Within that framing, SPEAR-VLM is not itself the final robot policy; rather, it is the 3D-aware perceptual substrate used inside SPEAR-1. The stated motivation is to “replace” some robot data with non-robotic images enriched with 3D annotations, thereby reducing dependence on expensive robot demonstrations.

The space-domain usage addresses a different bottleneck: manual annotation is expensive and difficult for spacecraft imagery because targets are often small, low-visibility, partially occluded, and embedded in cluttered planetary or laboratory backgrounds (Hicsonmez et al., 4 Feb 2026). Here, SPEAR-VLM is not a 3D-aware backbone for control, but an annotation-free pseudo-labeling, refinement, and distillation pipeline for spacecraft detection and segmentation.

This naming overlap can lead to a common misconception that SPEAR-VLM designates a single unified model family. The available evidence instead indicates two independent usages. A plausible implication is that the shared acronym should be interpreted locally, in relation to each paper’s domain and objective, rather than as a stable cross-domain benchmark family.

2. SPEAR-VLM in robotics: architectural definition

In the robotics formulation, SPEAR-VLM extends PaliGemma into a 3D-aware VLM (Nikolov et al., 21 Nov 2025). PaliGemma is described as having three components: a SigLIP image encoder, a linear projector into language space, and a Gemma LLM. SPEAR-VLM adds a monocular depth encoder, namely MoGe, extra 3D token embeddings in the tokenizer, and training tasks that require explicit 3D reasoning.

The fusion mechanism is specified in implementation detail. The model keeps SigLIP and MoGe as separate encoders, extracts SigLIP last-layer visual tokens, extracts MoGe features from the last 4 layers of the MoGe ViT encoder, concatenates the MoGe features along the channel dimension, projects both SigLIP and MoGe features into the LLM embedding space, and averages the outputs of the SigLIP and MoGe projectors before feeding them to the LLM (Nikolov et al., 21 Nov 2025). The tokenizer is extended with $N = 1024$ 3D tokens representing quantized distance values.
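The fusion step described above can be illustrated with a minimal NumPy sketch. All sizes, the random features, and the single-matrix linear projectors are hypothetical stand-ins chosen for illustration; the sketch also assumes both encoders yield the same number of visual tokens, which averaging requires but which the text does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_TOKENS = 16   # visual tokens per image (assumed equal for both encoders)
D_SIGLIP = 32     # SigLIP feature width
D_MOGE = 24       # width of a single MoGe ViT layer
D_LLM = 48        # LLM embedding width


def fuse_visual_tokens(siglip_tokens, moge_layers, w_siglip, w_moge):
    """Project both feature streams into LLM space and average them."""
    # Concatenate the last four MoGe layers along the channel dimension.
    moge_cat = np.concatenate(moge_layers[-4:], axis=-1)
    siglip_proj = siglip_tokens @ w_siglip  # (tokens, D_LLM)
    moge_proj = moge_cat @ w_moge           # (tokens, D_LLM)
    return 0.5 * (siglip_proj + moge_proj)


siglip_tokens = rng.normal(size=(NUM_TOKENS, D_SIGLIP))
moge_layers = [rng.normal(size=(NUM_TOKENS, D_MOGE)) for _ in range(6)]
w_siglip = rng.normal(size=(D_SIGLIP, D_LLM))
w_moge = rng.normal(size=(4 * D_MOGE, D_LLM))

fused = fuse_visual_tokens(siglip_tokens, moge_layers, w_siglip, w_moge)
print(fused.shape)
```

Because the two streams are averaged rather than concatenated, the number of visual tokens passed to the LLM is unchanged relative to the base VLM.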

MoGe is selected because it is a monocular geometry estimator whose predictions are affine-invariant, which the paper associates with better generalization across cameras with different intrinsics. This is significant in robotics because camera variation is pervasive across embodiments and environments. The architecture therefore combines semantic visual tokens from SigLIP with dense geometric priors from MoGe, while preserving the autoregressive language-model interface of the base VLM.

A central claim in the paper is that object-centric 3D pretraining teaches the backbone camera-to-object geometry, spatial relations between objects, 3D localization, and viewpoint robustness (Nikolov et al., 21 Nov 2025). This suggests that the architectural additions are intended not as generic multimodal scaling, but as a targeted intervention on the geometric deficiencies of 2D-only VLM pretraining.

3. Robotics pretraining pipeline and 3D objectives

Because open datasets with explicit 3D object annotations are scarce, the robotics paper constructs a semi-automatic annotation pipeline using only 2D images plus off-the-shelf models (Nikolov et al., 21 Nov 2025). For each image, Gemini detects 2D bounding boxes and semantic labels, SAM2 is prompted with these boxes to generate instance masks, and MoGe predicts a dense 3D point cloud. The point cloud is then filtered with the instance mask, an oriented 3D bounding box is computed around the resulting object points, and VQA-style question-answer pairs are created from that result.
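The last two steps of this pipeline (mask filtering and box fitting) can be sketched as follows. The source does not say how the oriented box is fitted, so the PCA-based fit here is an assumption, and the function name and toy data are hypothetical.

```python
import numpy as np


def oriented_box_from_mask(points, mask):
    """Fit a PCA-aligned 3D box to the masked points.

    points: (H, W, 3) dense point map; mask: (H, W) boolean instance mask.
    Returns the 8 box corners in the camera frame. The PCA fit is an
    assumed method, not one confirmed by the source.
    """
    obj = points[mask]                      # (N, 3) object points only
    center = obj.mean(axis=0)
    # Rows of vt are the principal axes of the object points.
    _, _, vt = np.linalg.svd(obj - center, full_matrices=False)
    local = (obj - center) @ vt.T           # coordinates in the box frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    corners_local = np.array([[x, y, z]
                              for x in (lo[0], hi[0])
                              for y in (lo[1], hi[1])
                              for z in (lo[2], hi[2])])
    return corners_local @ vt + center      # back to the camera frame


# Toy point map: eight cuboid points plus one far-away background point.
cuboid = np.array([[x, y, z]
                   for x in (0.0, 4.0) for y in (0.0, 2.0) for z in (1.0, 2.0)])
points = np.vstack([cuboid, [[50.0, 50.0, 50.0]]]).reshape(3, 3, 3)
mask = np.ones((3, 3), dtype=bool)
mask[2, 2] = False                          # exclude the background point

corners = oriented_box_from_mask(points, mask)
print(corners.shape)
```

The resulting corner coordinates are the kind of quantity that the VQA-style question-answer pairs would then ask the model to output.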

The pretraining data consists of ~200k images from the cooking and bike repair subsets of EgoExo4D and ~30k frames from Bridge-V2 robot demonstrations, for a total of ~230k images (Nikolov et al., 21 Nov 2025). The paper emphasizes that this is a relatively small amount of 3D-annotated image data.

SPEAR-VLM is trained as a VQA-style autoregressive model that predicts 3D-related textual outputs. The listed task families are 3D keypoints prediction, 3D bounding box prediction, object-to-object distance prediction, object-to-object bounding box distance, backprojection tasks, and chain-of-thought comparison tasks about which object is closer to the camera (Nikolov et al., 21 Nov 2025). Example prompts include “Output the vertices of the 3D bounding box of object X” and “Output the $xyz$ components of the distance between object X and object Y.”

The 3D coordinates are represented using 1024 quantized tokens, each corresponding to a distance bin in a range defined by the 1st and 99th percentiles of the 3D point-cloud coordinate distribution (Nikolov et al., 21 Nov 2025). The training objective remains next-token prediction, but the loss for 3D tokens is scaled by $\lambda = 2$. The model is explicitly not described as directly regressing continuous coordinates with a separate head; instead, it generates 3D token sequences corresponding to quantized coordinates.
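A minimal sketch of this kind of coordinate quantization is shown below. The 1024 bins and the 1st/99th percentile range come from the text; uniform bin widths, clipping of out-of-range values, and decoding to bin centers are assumptions, and the function names are hypothetical.

```python
import numpy as np

NUM_BINS = 1024


def fit_range(coords):
    """Range spanned by the 1st and 99th percentiles of the coordinates."""
    lo, hi = np.percentile(coords, [1, 99])
    return float(lo), float(hi)


def quantize(values, lo, hi, num_bins=NUM_BINS):
    """Map continuous values to integer bin indices (assumed uniform bins)."""
    clipped = np.clip(values, lo, hi)             # out-of-range values saturate
    scaled = (clipped - lo) / (hi - lo)           # now in [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)


def dequantize(token_ids, lo, hi, num_bins=NUM_BINS):
    """Decode bin indices to bin-center values (an assumed decoding rule)."""
    return lo + (token_ids + 0.5) * (hi - lo) / num_bins


coords = np.linspace(-1.0, 3.0, 1001)             # toy coordinate distribution
lo, hi = fit_range(coords)
ids = quantize(np.array([0.5, 10.0, -10.0]), lo, hi)
recovered = dequantize(ids, lo, hi)
print(ids, recovered)
```

Under these assumptions the round-trip error for in-range values is bounded by half a bin width, which is the resolution limit imposed by a token-based representation.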

Training follows two stages, similar to LLaVA-style VLM training. In Stage 1, the model initializes from pretrained PaliGemma and pretrained MoGe, randomly initializes the MoGe projector, the new 3D token embeddings, and the SigLIP projector, trains only the randomly initialized weights and SigLIP projector, and keeps the rest frozen. In Stage 2, only the SigLIP and MoGe encoders are frozen, training continues for the remaining components, and the 3D-token loss weight is set to $\lambda = 2$ (Nikolov et al., 21 Nov 2025).
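The weighted next-token objective can be sketched as a cross-entropy in which positions whose target is a 3D token receive weight 2 and all other positions receive weight 1. The normalization (a plain mean over positions) and the function name are assumptions; the source only states that the 3D-token loss is scaled.

```python
import numpy as np


def weighted_next_token_loss(logits, targets, is_3d_token, lam=2.0):
    """Cross-entropy with 3D-token positions scaled by lam.

    logits: (T, V) scores; targets: (T,) token ids;
    is_3d_token: (T,) boolean mask marking 3D-token targets.
    Averaging over all T positions is an assumed normalization.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    weights = np.where(is_3d_token, lam, 1.0)
    return float((weights * nll).mean())


logits = np.zeros((4, 8))                       # uniform predictions, vocab of 8
targets = np.array([0, 1, 2, 3])
is_3d = np.array([False, False, True, True])
loss = weighted_next_token_loss(logits, targets, is_3d)
print(loss)
```

With uniform predictions every position has the same per-token loss, so the weighted result makes the effect of the scaling easy to see.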

4. Integration into SPEAR-1 and embodied-control performance

SPEAR-1 is described as a VLA / flow-matching action policy built on top of SPEAR-VLM (Nikolov et al., 21 Nov 2025). Its high-level structure follows the general $\pi_0$-style design: the VLM processes image and language inputs, an action expert predicts robot actions, and that action expert attends to the VLM’s intermediate key-value representations.

The observation at time $t$ is

$$\mathbf{o}_t = [\mathbf{I}_t^1, \dots, \mathbf{I}_t^n, \mathbf{p}_t, \mathbf{l}_t]$$

where $\mathbf{I}_t^i$ are camera images, $\mathbf{p}_t$ is proprioception or robot state, and $\mathbf{l}_t$ is the language instruction (Nikolov et al., 21 Nov 2025). The policy outputs a fixed-horizon sequence of actions, with each action decomposed into translation, rotation, and gripper-state components.

The action expert is reported to have about 300M parameters, the same as $\pi_0$, and to use shared attention with the VLM transformer (Nikolov et al., 21 Nov 2025). Tokens are organized into three blocks with block-wise causal attention. The paper’s key claim is that SPEAR-VLM provides a stronger geometric backbone, reducing the need for the action model to infer 3D structure from scratch.
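Block-wise causal attention can be illustrated with a small mask-construction sketch. It assumes the usual reading of the term: tokens attend to every token in their own block and in earlier blocks, but not to later blocks. The block sizes are hypothetical and do not reflect the actual token counts.

```python
import numpy as np


def blockwise_causal_mask(block_sizes):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Attention is bidirectional inside a block and causal across blocks.
    """
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]


mask = blockwise_causal_mask([3, 1, 2])   # hypothetical block sizes
print(mask.astype(int))
```

Under this reading, tokens in the first block never see later blocks, so their key-value representations can be computed once and reused while later blocks are updated.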

SPEAR-1 is trained on 24 Open X-Embodiment datasets comprising about 45M frames in total (Nikolov et al., 21 Nov 2025). The most heavily weighted datasets are listed as droid: 35.0, bridge: 18.0, and fractal20220817_data: 12.0. Actions are resampled to a common 5 Hz control frequency.

The reported headline comparison is that SPEAR-1 outperforms $\pi_0$-FAST and matches $\pi_{0.5}$ while using 20× fewer robot demonstrations (Nikolov et al., 21 Nov 2025). On challenging unseen Franka environments in a DROID-like setup, it beats $\pi_0$-FAST, matches $\pi_{0.5}$, and does so without fine-tuning on the target environment. The paper further states that SPEAR-1 can reach about 5× higher performance than $\pi_0$-FAST in the zero-shot Franka setting. On WidowX, it achieves ~10% higher average task progress than OpenVLA. On SIMPLER WidowX tasks, reported average success is 1.0% for OpenVLA, 42.7% for SpatialVLA, and 57.3% for SPEAR-1 (Nikolov et al., 21 Nov 2025).

Ablations attribute these gains specifically to object-level 3D reasoning. In a Bridge V2 subset / SIMPLER setup, baseline PaliGemma and SPEAR-VLM without object-level 3D tasks both obtain 20.8% average success, whereas SPEAR-VLM with object-level 3D tasks reaches 35.4% (Nikolov et al., 21 Nov 2025). On Franka tasks, the PaliGemma-backbone variant (DROID) achieves 34% average task progress, while the SPEAR-VLM-backbone variant (DROID) reaches 46%. The paper interprets this as evidence that the benefit comes from task-relevant 3D supervision, not merely from adding another encoder.

5. SPEAR-VLM in spacecraft perception: annotation-free pipeline

In the spacecraft-perception formulation, SPEAR-VLM denotes an annotation-free detection and segmentation pipeline for space targets using VLMs (Hicsonmez et al., 4 Feb 2026). The pipeline has four stages: pseudo-labeling, label refinement, label distillation, and inference.

Given an image and a text prompt, the VLM produces two sets of outputs: instance segmentation predictions and bounding box predictions (Hicsonmez et al., 4 Feb 2026). Because the spacecraft datasets used in the paper mostly contain single-instance images, the method keeps only the top detection and/or segmentation prediction for each image, yielding a single pseudo box and pseudo mask. The fixed prompt is “spacecraft.”

The paper evaluates three open-source VLMs as teachers: SEEM, OpenSEED, and GroundedSAM-2 (Hicsonmez et al., 4 Feb 2026). Their zero-shot predictions are first evaluated directly; later, only GroundedSAM-2 is carried forward, because it performs best.

Label refinement combines test-time augmentation (TTA), Weighted Boxes Fusion (WBF), and confidence filtering. Given a set of augmentations, each augmented image is processed by the VLM, box predictions are mapped back through the inverse augmentation, and all predictions are collected into a common set before fusion (Hicsonmez et al., 4 Feb 2026). WBF groups boxes whose IoU exceeds a fixed overlap threshold. For a cluster of boxes $\mathbf{b}_i$ with confidence scores $s_i$, the fused box is the confidence-weighted average

$$\hat{\mathbf{b}} = \frac{\sum_i s_i \mathbf{b}_i}{\sum_i s_i}$$

The fused class label is chosen by majority voting or weighted voting based on scores (Hicsonmez et al., 4 Feb 2026).
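The refinement step can be sketched for the single-class case as a vertical-flip inverse mapping followed by a simplified fusion. This is not the reference WBF implementation: the greedy clustering, the plain mean used for the fused score, and the `iou_thr=0.5` value are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def unflip_vertical(box, height):
    """Map a box predicted on a vertically flipped image back to the original."""
    x1, y1, x2, y2 = box
    return [x1, height - y2, x2, height - y1]


def fuse_boxes(boxes, scores, iou_thr):
    """Greedy IoU clustering, then score-weighted averaging per cluster."""
    clusters, fused = [], []
    for i in np.argsort(scores)[::-1]:          # highest score first
        for c, members in enumerate(clusters):
            if iou(boxes[i], fused[c]) > iou_thr:
                members.append(i)
                w = np.array([scores[m] for m in members])
                b = np.array([boxes[m] for m in members], dtype=float)
                fused[c] = (w[:, None] * b).sum(axis=0) / w.sum()
                break
        else:                                    # no cluster matched
            clusters.append([i])
            fused.append(np.array(boxes[i], dtype=float))
    fused_scores = [float(np.mean([scores[m] for m in ms])) for ms in clusters]
    return fused, fused_scores


HEIGHT = 100                                     # toy image height in pixels
original_pred = [10.0, 20.0, 50.0, 60.0]         # prediction on the original image
flipped_pred = [12.0, 42.0, 52.0, 82.0]          # prediction on the flipped image
boxes = [original_pred, unflip_vertical(flipped_pred, HEIGHT)]
scores = [0.9, 0.6]

fused, fused_scores = fuse_boxes(boxes, scores, iou_thr=0.5)
print(fused, fused_scores)
```

A subsequent confidence filter would simply drop any fused box whose score falls below the chosen threshold.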

Confidence filtering then retains only fused boxes whose score exceeds a dataset-specific threshold, with separate values reported for SPARK-2024, TANGO, and SPEED+ (Hicsonmez et al., 4 Feb 2026). This step is presented as necessary because fused confidence may remain low when predictions across augmentations disagree strongly.

The distillation stage treats the refined pseudo-labels as teacher outputs and trains a shallow student model. Given the refined pseudo boxes, the student predicts boxes and class labels for each training image. The supervised loss compares the teacher box and class predictions with the corresponding student outputs (Hicsonmez et al., 4 Feb 2026). The method also uses a single round of iterative distillation: train a student on pseudo-labels, use that student to relabel the training images, and train a new student from scratch.
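The control flow of this train, relabel, retrain procedure can be sketched generically. The `train_student` and `predict` callables are hypothetical placeholders, and the toy line-fitting "student" exists only to make the sketch runnable; it is not a detector and says nothing about the actual training setup.

```python
import numpy as np


def distill_with_relabeling(images, pseudo_labels, train_student, predict, rounds=1):
    """Train a student, relabel the training set with it, then retrain.

    train_student(images, labels) must return a freshly initialized,
    trained student; predict(student, image) returns that student's label.
    """
    student = train_student(images, pseudo_labels)
    for _ in range(rounds):
        relabeled = [predict(student, image) for image in images]
        student = train_student(images, relabeled)   # new student from scratch
    return student


# Toy stand-ins: "images" are scalars and the "student" is a fitted line.
def train_line(xs, ys):
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope, intercept


def predict_line(student, x):
    slope, intercept = student
    return slope * x + intercept


xs = [0.0, 1.0, 2.0, 3.0]
noisy_labels = [0.1, 0.9, 2.2, 2.8]
final_student = distill_with_relabeling(xs, noisy_labels, train_line, predict_line)
print(final_student)
```

In this toy the second student reproduces the first, since a line refits its own predictions exactly; with a real detector the relabeled set generally differs from the original pseudo-labels.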

At inference time, only the student model is used. The stated rationale is that the student is lightweight enough for real-time or near-real-time deployment onboard spacecraft (Hicsonmez et al., 4 Feb 2026).

6. Spacecraft datasets, models, and empirical results

The spacecraft paper evaluates on SPARK-2024, SPEED+, and TANGO (Hicsonmez et al., 4 Feb 2026). SPARK-2024 is used for detection only, because it contains bounding box annotations but no segmentation annotations. The paper uses only test sequences, with 500 images randomly selected as a training split and the remaining 1600 for evaluation. For SPEED+, the authors create custom splits of 500 images for training, with 6200 Lightbox images and 2200 Sunlamp images for testing. For TANGO, the paper uses the test split as-is together with 500 randomly selected training images.

Evaluation uses COCO-style Average Precision (AP), averaged over IoU thresholds from 0.50 to 0.95 with step size 0.05, together with AP$_{50}$ and AP$_{75}$ (Hicsonmez et al., 4 Feb 2026). The distilled students are Efficient-Det for object detection and YOLOv11 for object detection and segmentation. Both are described as having around 7M parameters, running in real time at more than 60 FPS, trained with official implementations for 300 epochs, and using images resized to 640×640.
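For intuition, COCO-style AP can be sketched for the simplified single-instance setting, where each image has one ground-truth box and at most one scored prediction. This is a teaching sketch rather than a replacement for the official COCO evaluation tooling; the function names and toy numbers are hypothetical, and at least one prediction is assumed.

```python
import numpy as np

IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)   # 0.50, 0.55, ..., 0.95


def ap_at_threshold(ious, scores, num_gt, thr):
    """AP at one IoU threshold using 101-point interpolated precision.

    ious[i] is the IoU between prediction i and its image's single
    ground-truth box, so it alone decides true vs. false positive.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = (np.asarray(ious, dtype=float)[order] >= thr).astype(float)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / num_gt
    # Make precision non-increasing as recall grows.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    grid = np.linspace(0.0, 1.0, 101)
    idx = np.searchsorted(recall, grid, side="left")
    reachable = idx < len(precision)             # recall levels actually attained
    safe_idx = np.minimum(idx, len(precision) - 1)
    return float(np.where(reachable, precision[safe_idx], 0.0).mean())


def coco_style_ap(ious, scores, num_gt):
    """AP averaged over all thresholds, plus AP50 and AP75."""
    per_thr = [ap_at_threshold(ious, scores, num_gt, t) for t in IOU_THRESHOLDS]
    return {"AP": float(np.mean(per_thr)), "AP50": per_thr[0], "AP75": per_thr[5]}


result = coco_style_ap(ious=[0.9, 0.8, 0.6, 0.3], scores=[0.9, 0.8, 0.7, 0.6], num_gt=4)
print(result)
```

The averaged AP rewards tight localization, which is why a model can have a high AP50 and a much lower overall AP when its boxes are roughly right but loose.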

The zero-shot VLM baseline results identify GroundedSAM-2 as the strongest teacher. On SPARK-2024 detection, SEEM obtains AP 17.7, AP$_{50}$ 49.8, AP$_{75}$ 2.6; OpenSEED obtains AP 8.5, AP$_{50}$ 28.0, AP$_{75}$ 0.8; and GroundedSAM-2 obtains AP 53.3, AP$_{50}$ 97.9, AP$_{75}$ 61.7 (Hicsonmez et al., 4 Feb 2026). For SPEED+ Sunlamp, GroundedSAM-2 reaches detection AP 73.9, AP$_{50}$ 97.9, AP$_{75}$ 92.1, and segmentation AP 72.5, AP$_{50}$ 94.7, AP$_{75}$ 85.0. For SPEED+ Lightbox, it achieves detection AP 65.8, AP$_{50}$ 91.4, AP$_{75}$ 74.0, and segmentation AP 66.2, AP$_{50}$ 83.9, AP$_{75}$ 73.5. For TANGO, it obtains detection AP 58.2, AP$_{50}$ 91.8, AP$_{75}$ 75.7, and segmentation AP 58.2, AP$_{50}$ 85.9, AP$_{75}$ 69.0.

The paper reports that TTA and WBF improve zero-shot predictions modestly, with the best result obtained using only vertical flip; adding more augmentations tended to decrease performance (Hicsonmez et al., 4 Feb 2026). After TTA and WBF, example results include SPARK AP 54.4 with AP$_{75}$ 64.0, Sunlamp detection AP 74.6 with AP$_{75}$ 93.7 and segmentation AP 73.2, Lightbox detection AP 66.9 with AP$_{75}$ 75.6 and segmentation AP 66.8, and TANGO detection AP 58.5 with AP$_{75}$ 76.3 and segmentation AP 58.7.

The principal contribution is the improvement after distillation. On SPARK-2024, the paper reports Efficient-Det AP 65.0 and YOLOv11 AP 68.9 under TTA+WBF, YOLOv11 AP 70.8 after confidence filtering, and YOLOv11 AP 76.4 after iterative relabeling, alongside a fixed-IoU-threshold AP of 98.2 (Hicsonmez et al., 4 Feb 2026). On SPEED+ Sunlamp, the distilled TTA+WBF+CF+TR result is detection AP 75.0 and segmentation AP 74.0. On SPEED+ Lightbox, the corresponding final result is detection AP 71.0 and segmentation AP 63.5. On TANGO, the distilled pipeline reaches detection AP 86.5 and segmentation AP 70.0, and the authors explicitly state that, compared with the initial VLM predictions, the method improves TANGO segmentation AP by more than 10 points.

The paper’s explanation is that the student acts as a denoising regularizer: it can ignore some teacher noise through limited capacity, generalize better with less overfitting, and learn a task-specific decision boundary adapted to the space domain (Hicsonmez et al., 4 Feb 2026). This is a domain-adaptation argument rather than a claim about the student exceeding the teacher in general open-vocabulary capability.

7. Comparative significance, limitations, and recurrent themes

Despite their different domains, the two SPEAR-VLM usages share a methodological pattern: each treats a VLM not only as a semantic recognizer, but as a platform that can be specialized by injecting task-relevant structure. In robotics, that structure is explicit 3D reasoning grounded in quantized geometric tokens and object-centric VQA supervision (Nikolov et al., 21 Nov 2025). In spacecraft perception, it is annotation-free pseudo-supervision refined through TTA, WBF, confidence filtering, and teacher–student distillation (Hicsonmez et al., 4 Feb 2026).

The most important distinction is where the additional structure enters the learning system. In the robotics paper, SPEAR-VLM modifies the backbone representation itself by augmenting PaliGemma with MoGe, 3D token embeddings, and 3D pretraining tasks. In the spacecraft paper, SPEAR-VLM primarily modifies the training pipeline around an existing teacher VLM, then transfers its predictions into compact students. This suggests two broad interpretations of VLM specialization: one centered on backbone geometric enrichment, the other on pseudo-label-mediated downstream adaptation.

Both formulations also expose clear limitations. The robotics paper notes that gains depend on the quality of object-level 3D supervision and on model-training choices such as whether MoGe is frozen during VLA training; it also states that most robot foundation models (RFMs) remain limited in their ability to generalize across new environments, tasks, and embodiments (Nikolov et al., 21 Nov 2025). The spacecraft paper states that performance still depends on the quality of the initial VLM predictions, that gains are smaller on some easier or cleaner datasets, that iterative relabeling can occasionally hurt performance slightly due to overfitting, and that validation is restricted to datasets with mostly single spacecraft instances and similar target categories (Hicsonmez et al., 4 Feb 2026).

A common misconception would be to read either system as evidence that generic 2D web-scale VLM pretraining alone is sufficient for specialized perception. Both papers argue the opposite in domain-specific ways. The robotics work contends that 2D-only pretraining lacks the spatial priors required for manipulation, while the spacecraft work shows that direct zero-shot VLM inference can be substantially improved by refinement and distillation. A plausible implication is that VLMs, in these settings, function most effectively when coupled with domain-aligned inductive bias: 3D grounding for embodied control, and label-refinement plus compact adaptation for space-domain detection and segmentation.
