VLM3: Native 3D Vision-Language Learning

Updated 4 July 2026

VLM3 is a family of vision-language models that natively learn 3D geometry by integrating spatial cues through focal length unification and text-based pixel references.
The approach leverages standard VLMs with innovative data mixture and scaling strategies to achieve competitive results in depth estimation, pixel correspondence, and camera pose tasks.
Variants of VLM3 extend its applications to streaming spatial understanding, autonomous driving, and medical imaging, unifying diverse 3D tasks under a common text-generation framework.

VLM3 denotes a family of vision-language models that natively reason in 3D, and also the specific framework introduced in “VLM3: Vision LLMs Are Native 3D Learners,” which argues that standard VLMs can master diverse 3D tasks through focal length unification, text-based pixel or region reference, and careful data mixture and scaling rather than specialized 3D architectures or regression heads (Cai et al., 28 May 2026). In adjacent work, the label is also used more broadly for 3D-enhanced VLM systems in streaming spatial understanding, autonomous driving, dense occupancy prediction, and volumetric medical image reporting, where language modeling is coupled to geometry priors, voxel reasoning, or 3D medical encoders (Yu et al., 5 Jun 2026, Lübberstedt et al., 30 Apr 2025, Doruk et al., 3 Mar 2026, Li et al., 2024, Doruk et al., 15 May 2026).

1. Concept and scope

The central claim of VLM3 is that vision-language models are “native 3D learners”: 3D understanding need not be delegated to expert models with complex task-specific designs if camera ambiguity, spatial reference, and data composition are handled correctly (Cai et al., 28 May 2026). In this formulation, the model remains a standard VLM, and 3D tasks are expressed as language tasks with visual conditioning. Depth, correspondence, pose, and object-level geometry are all reduced to prompting plus next-token prediction.

A complementary definition appears in the streaming literature, where VLM3 is described as “a vision-LLM that natively reasons in 3D” and is extended from offline clips or complete scans to real-time operation with incremental 3D geometry priors, streaming control, and efficient visual-token compression (Yu et al., 5 Jun 2026). In autonomous driving and medical imaging, related systems adopt the term to denote models that inject 3D spatial structure into LVLMs or LLaMA-family models through textualized scene geometry, voxel-space fusion, or volumetric encoders (Lübberstedt et al., 30 Apr 2025, Li et al., 2024).

This usage suggests that VLM3 is not restricted to one architecture. Rather, it denotes a research program in which language-conditioned models internalize geometry, camera structure, and spatial relations as first-class reasoning variables.

2. Core methodological principles

The canonical VLM3 method is built on three ingredients: focal length unification, text-based pixel or region reference, and data mixture with scaling (Cai et al., 28 May 2026). The paper argues that architectural changes, large models, heavy data augmentation, and complex losses (including regression formulations) are not necessary conditions for effective 3D learning.

Focal length unification addresses the camera ambiguity of single-view metric tasks. Under the pinhole model,

$u = f_x X/Z + c_x,\qquad v = f_y Y/Z + c_y,$

with

$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.$

VLM3 rescales each image so that the focal length becomes a fixed target of 1000 pixels. If intrinsics are missing, they are estimated with AnyCalib; then the image is resized by

$s = 1000/\bar f,\qquad K' = \operatorname{diag}(s,s,1)K.$

After resizing, pixel coordinates are normalized to the fixed text range $[0,2000)$, which stabilizes token distributions and makes coordinates comparable across datasets (Cai et al., 28 May 2026).
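The rescaling step above can be illustrated with a minimal sketch. It assumes that $\bar f$ is the mean of $f_x$ and $f_y$ and that coordinates are normalized linearly by image size; neither detail is stated in the text, and the function names are hypothetical.

```python
def unify_focal_length(fx, fy, cx, cy, width, height, target_f=1000.0):
    """Rescale intrinsics so the mean focal length becomes target_f pixels."""
    f_bar = 0.5 * (fx + fy)  # assumption: f-bar is the mean of fx and fy
    s = target_f / f_bar
    # K' = diag(s, s, 1) K scales the first two rows of K by s.
    new_k = [
        [s * fx, 0.0, s * cx],
        [0.0, s * fy, s * cy],
        [0.0, 0.0, 1.0],
    ]
    new_size = (round(s * width), round(s * height))
    return s, new_k, new_size


def to_text_coord(u, v, width, height, bins=2000):
    """Map a pixel of the resized image to integer coordinates in [0, bins)."""
    # assumption: linear normalization by image width and height
    x = min(int(u / width * bins), bins - 1)
    y = min(int(v / height * bins), bins - 1)
    return x, y


# Example: a 640x480 image with a 500-pixel focal length is upscaled by 2.
scale, k_new, size = unify_focal_length(500.0, 500.0, 320.0, 240.0, 640, 480)
center = to_text_coord(640, 480, size[0], size[1])
```

Because the scale factor is applied to the image and to $K$ together, the projected geometry is unchanged; only the pixel units are made uniform across cameras.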

Text-based pixel reference replaces rendered markers with direct textual references such as a queried pixel, correspondence target, or normalized bounding box. The learning objective remains standard autoregressive language modeling,

$L = -\sum_t \log p(y_t \mid x, y_{<t}),$

where $x$ includes images and textual questions, and $y$ is the textual answer (Cai et al., 28 May 2026). Even continuous outputs such as depth, translation distance, or yaw–pitch–roll are emitted as text tokens rather than regressed by a dedicated head.
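As a small illustration of this objective, the sketch below sums per-token negative log-likelihoods for a textual depth answer. The per-token probabilities and the two-decimal answer format are hypothetical placeholders, not values or conventions from the paper.

```python
import math


def sequence_nll(token_probs):
    """L = -sum_t log p(y_t | x, y_<t), given the probability of each target token."""
    return -sum(math.log(p) for p in token_probs)


def format_depth_answer(depth_m):
    """Render a metric depth as plain text (assumed format: two decimals)."""
    return f"{depth_m:.2f}"


answer = format_depth_answer(2.347)      # the numeric answer is ordinary text
hypothetical_probs = [0.9, 0.8, 0.7, 0.6]  # one made-up probability per answer token
loss = sequence_nll(hypothetical_probs)
```

The point of the sketch is that a continuous quantity incurs the same token-level loss as any other text, with no separate regression term.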

Data mixture and scaling are treated as decisive. For depth, the base mixture spans 26M images, while training uses 32M samples with 10 labeled pixels per sample, yielding 320M pixel labels (Cai et al., 28 May 2026). The backbone is Qwen3-VL-4B, trained with standard SFT, AdamW, cosine learning rate with warmup, minimal augmentations, FlashAttention-2, and no architectural modifications. An important negative result is that larger models do not help at the reported data scale: 32B reaches 0.873, 8B reaches 0.880, 4B with 64M+10 QA reaches 0.880, while the best configuration is 4B with 32M+10 QA at 0.904 $\delta_1$ on depth (Cai et al., 28 May 2026).

3. Task formulations and learning interface

VLM3 frames 3D perception as a unified text-generation problem over visual inputs. This yields a common interface across tasks that are traditionally handled by distinct expert pipelines (Cai et al., 28 May 2026).

For depth estimation, the model is asked for the metric distance from the camera to a queried pixel in normalized coordinates. Performance is reported with Absolute Relative Error and $\delta$ thresholds, where $\delta_1$ uses the condition $\max(\hat d/d,\ d/\hat d) < 1.25$ for predicted depth $\hat d$ and ground-truth depth $d$. VLM3-4B raises average $\delta_1$ from 0.838 for DepthLM-7B to 0.904, with per-dataset values including 0.970 on NuScenes, 0.960 on iBims1, 0.867 on sunRGBD, and 0.810 on ETH3D (Cai et al., 28 May 2026).
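The two depth metrics can be written down directly from their standard definitions. The sketch below is a plain-Python illustration with made-up sample values; it is not the paper's evaluation code.

```python
def abs_rel(pred, gt):
    """Absolute Relative Error: mean of |pred - gt| / gt."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of predictions with max(pred/gt, gt/pred) < threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


# Hypothetical predicted and ground-truth depths in meters.
pred_depths = [1.0, 2.0, 4.0]
gt_depths = [1.0, 2.2, 2.0]
err = abs_rel(pred_depths, gt_depths)
acc = delta_accuracy(pred_depths, gt_depths)
```

Lower AbsRel is better, while higher $\delta_1$ is better, which is why the two are usually reported together.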

For pixel correspondence, the source pixel in image 1 is given textually and the model outputs the corresponding pixel in image 2, again in normalized coordinates. The evaluation metric is End-Point Error. VLM3-4B improves average EPE from 153.28 for the base Qwen3-VL-4B to 15.37, with 15.18 on ETH3D, 10.71 on DTU, and 20.21 on TA-WB; it outperforms DKM and RoMa but trails UFM (Cai et al., 28 May 2026).
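End-Point Error is the mean Euclidean distance between predicted and ground-truth target pixels. The sketch below assumes both are given in the same coordinate system; whether the reported numbers use normalized or raw pixel units is not specified here.

```python
import math


def end_point_error(pred_pts, gt_pts):
    """Mean Euclidean distance between matched predicted and true points."""
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_pts, gt_pts)
    )
    return total / len(gt_pts)


# Hypothetical correspondences: one is off by (3, 4), the other is exact.
epe = end_point_error([(103, 204), (50, 60)], [(100, 200), (50, 60)])
```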

For camera pose estimation, the model outputs three textual components: translation distance in meters, translation direction as a unit vector in the first camera’s local axes, and yaw–pitch–roll with specified conventions. Rotation is evaluated with the geodesic distance on $SO(3)$,

$d(R_1, R_2) = \arccos\!\left(\frac{\operatorname{tr}(R_1^\top R_2) - 1}{2}\right),$

and aggregate performance is summarized by AUC@30°. VLM3 reaches 93.3 on ETH3D and 94.7 on ScanNet++, for an average of 94.0, surpassing VGGT at 88.0 and matching DA3-Giant at 94.7 (Cai et al., 28 May 2026).
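The geodesic rotation error used for pose evaluation can be computed from the trace of the relative rotation. The sketch below uses rotations about a single axis for the example, purely for illustration; it does not implement the paper's yaw–pitch–roll convention or the AUC aggregation.

```python
import math


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_deg(r1, r2):
    """Angle in degrees of the relative rotation r1^T r2."""
    # tr(r1^T r2) equals the element-wise inner product of r1 and r2.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))


angle = geodesic_deg(rot_z(10.0), rot_z(40.0))
```

The clamp guards against floating-point values slightly outside $[-1, 1]$, which would otherwise make `acos` raise an error.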

For object-level 3D understanding, object references are normalized bounding boxes and outputs include qualitative relations such as Below or Above, Left or Right, Big or Small, Tall or Short, Wide or Thin, and Behind or Front, together with quantitative attributes such as direct, horizontal, and vertical distances, width, height, and direction. On SpatialRGPT-Bench, VLM3-4B reaches 91.35 overall qualitative accuracy versus 89.80 for SpatialRGPT-8B, with overall quantitative accuracy 58.51 versus 58.33, overall AbsRel 0.35 versus 0.37, and direction accuracy 95.42 with 10.5° average error versus 95.3 with 17.1° (Cai et al., 28 May 2026).

Ablation results emphasize that the unified language interface is not merely a convenience layer. Text-based reference performs at least as well as visual prompting in the reported setting, and careful dataset weighting is necessary: uniform weighting at large scale yields 0.842 $\delta_1$, size-based weighting yields 0.884, and tuned VLM3 weighting yields 0.904 (Cai et al., 28 May 2026).
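The difference between the weighting schemes can be made concrete with a small sketch of sampling probabilities. The dataset names, sizes, and tuned weights below are hypothetical; the actual tuned mixture is not given in this article.

```python
def mixture_weights(sizes, scheme="uniform", tuned=None):
    """Return per-dataset sampling probabilities for a weighting scheme."""
    if scheme == "uniform":
        raw = {name: 1.0 for name in sizes}
    elif scheme == "size":
        raw = {name: float(count) for name, count in sizes.items()}
    elif scheme == "tuned":
        raw = dict(tuned)  # hand-chosen weights, supplied by the caller
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical dataset sizes in images.
sizes = {"indoor_a": 1_000_000, "driving_b": 3_000_000}
uniform = mixture_weights(sizes, "uniform")
by_size = mixture_weights(sizes, "size")
tuned = mixture_weights(sizes, "tuned", tuned={"indoor_a": 2.0, "driving_b": 3.0})
```

Uniform weighting oversamples small datasets and size-based weighting lets the largest ones dominate, which is one plausible reason a tuned mixture sits between the two.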

4. Architectural variants across domains

The broader VLM3 literature includes both minimalist and geometry-explicit realizations.

System	Domain	Distinguishing mechanism
VLM3 (Cai et al., 28 May 2026)	General 3D vision-language learning	Focal length unification, text-based pixel/region reference, data mixture and scaling
Stream3D-VLM (Yu et al., 5 Jun 2026)	Streaming 3D scene understanding	VSFI, GAVC, and next-token streaming control
V3LMA (Lübberstedt et al., 30 Apr 2025)	Autonomous driving VQA	Monocular 3D preprocessing and late LVLM/LLM feature fusion
VLMFusionOcc3D (Doruk et al., 3 Mar 2026)	3D semantic occupancy	InstVLM, WeathFusion, and DAGA
WeatherOcc3D (Doruk et al., 15 May 2026)	Adverse-weather occupancy	CLIP-conditioned gating over camera and LiDAR voxels
ViT3D Alignment of LLaMA3 (Li et al., 2024)	3D medical imaging	ViT3D projector into Asclepius-Llama3-8B

V3LMA is a two-branch autonomous driving system built from frozen pre-trained components: a video-capable LVLM receives the driving video and query, while a matching base LLM receives the query plus a long textual description generated from 3D object detections and tracking (Lübberstedt et al., 30 Apr 2025). The preprocessing pipeline uses Grounded SAM, YOLOv5, MiDaS monocular depth, a specialized YOLO model for traffic lights, and CLIP fine-tuned on GTSRB for traffic sign retrieval. Internal token features are merged late in the decoder by weighted sums, with the LLM branch emitting the answer. V3LMA-Q reaches 0.56 on LingoQA without fine-tuning, improving by about 15 percentage points over the best isolated zero-shot LVLM or LLM prompt score of 0.419 and outperforming the LingoQA model without fine-tuning at 0.33 by about 23 percentage points (Lübberstedt et al., 30 Apr 2025).

“ViT3D Alignment of LLaMA3” applies the VLM3 idea to volumetric CT (Li et al., 2024). A ViT3D from M3D-CLIP processes 3D scans, a Spatial Pooling layer compacts the token sequence, and a 59M-parameter connector projects visual features into the Asclepius-Llama3-8B token space. The LLM is tuned with LoRA, with 1.1B trainable LoRA parameters reported, while ViT3D is fully fine-tuned. On AMOS-MM validation, the system reaches average GREEN 0.30 for medical report generation and VQA accuracy 0.61, improving over a baseline of 0.25 GREEN and 0.46 VQA accuracy (Li et al., 2024).

For dense 3D semantic occupancy, VLMFusionOcc3D and WeatherOcc3D use CLIP-derived language priors in voxel space. VLMFusionOcc3D introduces Instance-driven VLM Attention, Weather-Aware Adaptive Fusion, and the Depth-Aware Geometric Alignment loss in a camera-LiDAR pipeline. On nuScenes validation with OccMamba, IoU improves from 34.7 to 37.0 and mIoU from 25.2 to 26.6; on SemanticKITTI, OccMamba rises from 24.6% to 26.4% mIoU (Doruk et al., 3 Mar 2026). WeatherOcc3D instead decomposes environmental uncertainty into visibility and illumination, derives a CLIP text prompt, and uses channel-wise gates plus a global fusion scalar to weight camera and LiDAR voxel features.

On nuScenes-OpenOccupancy, OccMamba rises from 25.2 to 26.3 mIoU and M-CONet from 20.1 to 21.1, with rainy conditions improving from 24.1 to 27.3 and night from 11.8 to 15.7 (Doruk et al., 15 May 2026).

Taken together, these systems show two coexisting interpretations of VLM3: one treats 3D as a text-native capability of standard VLMs, while the other treats language priors as a conditioning signal for explicit geometric modules.

5. Streaming and online spatial understanding

Stream3D-VLM is a VLM3 specifically designed for online or streaming spatial understanding, advancing prior 3D LMMs from offline clips or complete scans to real-time operation (Yu et al., 5 Jun 2026). Its architecture combines a Qwen2.5-VL-3B or 7B backbone with StreamVGGT-1B, a causal streaming 3D reconstruction model that outputs per-frame geometry tokens, a camera token, predicted depth, and camera intrinsics and extrinsics.

The first key component is streaming control modeling. The query is followed by a sequence of <img> tokens for each new frame, and at every frame the LLM predicts either <SEP> to continue ingesting frames or <END> to trigger response generation. The control policy is learned under the same autoregressive objective as text generation, with an empirically selected trade-off setting (Yu et al., 5 Jun 2026).
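The per-frame control decision can be pictured as a simple loop. In the sketch below, `predict_control` is a hypothetical stand-in for the LLM's next-token prediction after each frame, and the keyword-based policy exists only to make the example runnable.

```python
def stream_until_end(frames, predict_control):
    """Feed frames one at a time until the policy emits <END>.

    Returns the index of the frame that triggered the answer (or None if the
    stream ended first) together with the frames consumed so far.
    """
    consumed = []
    for index, frame in enumerate(frames):
        consumed.append(frame)
        token = predict_control(consumed)  # stand-in for the model's control token
        if token == "<END>":
            return index, consumed
        # "<SEP>" means: keep ingesting frames.
    return None, consumed


def toy_policy(consumed):
    """Hypothetical policy: answer as soon as a frame mentions a door."""
    return "<END>" if "door" in consumed[-1] else "<SEP>"


stop_index, seen = stream_until_end(["hall", "stairs", "door", "kitchen"], toy_policy)
```

Because the control token is just another vocabulary item, answer timing is learned with the same next-token loss as the answer text.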

The second component is Visual-Spatial Feature Integration. Per-frame 2D visual tokens are fused with projected geometry and camera tokens: the visual tokens are updated by cross-attention over the projected spatial tokens, and the result is added back through a residual connection (Yu et al., 5 Jun 2026). This yields incrementally enriched visual features aligned to the incoming frame without re-encoding history.
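A minimal single-head sketch of this fusion pattern follows. It assumes that visual tokens act as queries and that geometry and camera tokens act as keys and values, and it omits the learned projection layers entirely; the actual layer design in the paper may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def fuse_frame(visual_tokens, geometry_tokens, camera_token):
    """Attend from visual tokens to spatial tokens, then add the residual."""
    spatial = geometry_tokens + [camera_token]  # concatenate along the token axis
    attended = cross_attention(visual_tokens, spatial, spatial)
    return [[v + a for v, a in zip(vt, at)] for vt, at in zip(visual_tokens, attended)]


fused = fuse_frame([[0.5, 0.5], [1.0, 0.0]], [], [1.0, 2.0])
```

The residual form means a frame's original visual features are preserved and the spatial information is added on top, so only the newest frame needs processing at each step.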

The third component is Geometry-Adaptive Voxel Compression. Each patch token is back-projected to 3D using predicted depth and camera parameters, lifted to voxel tokens with 3D positional encoding, clustered by K-Means, and aggregated by dual attention. Compression shortens the visual-token sequence passed to the decoder and lowers decoding cost accordingly (Yu et al., 5 Jun 2026). The paper reports substantial latency and memory gains with minimal accuracy loss even at 25–50% retention.
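The geometric part of this step can be sketched with standard pinhole back-projection followed by voxel grouping. Note the substitutions: the sketch stays in the camera frame (extrinsics are ignored), treats depth as distance along the optical axis, and replaces K-Means clustering and dual attention with simple per-voxel averaging, so it illustrates the idea rather than the paper's method.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth `depth` to a camera-frame 3D point."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def voxel_pool(points, features, voxel_size):
    """Average the features of all points that fall into the same voxel."""
    buckets = {}
    for point, feat in zip(points, features):
        key = tuple(int(math.floor(c / voxel_size)) for c in point)
        buckets.setdefault(key, []).append(feat)
    return {
        key: [sum(col) / len(feats) for col in zip(*feats)]
        for key, feats in buckets.items()
    }


# Hypothetical patch centers, depths, and 2-dimensional token features.
pts = [(0.1, 0.1, 2.0), (0.2, 0.3, 2.1), (1.5, 0.0, 2.0)]
feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
pooled = voxel_pool(pts, feats, voxel_size=1.0)
```

In this toy case three patch tokens collapse to two voxel tokens, which shows how spatially redundant patches reduce the sequence length seen by the decoder.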

Training is supported by a scalable data-generation pipeline over ScanNet, ScanNet++, and ARKitScenes, producing 1,003,203 online spatio-temporal 3D QA pairs across 5,154 scans. The distribution emphasizes Ego-Motion at about 50%, Object-Camera at about 25%, Chronology at about 14%, Environment at about 8%, and Attributes at about 2%, with temporal modes of Backward memory at about 46%, Forward monitoring at about 34%, and Realtime observation at about 20% (Yu et al., 5 Jun 2026). Evaluation uses Stream3D-Bench, a 29-task benchmark with 10,037 samples and 518 videos, which includes an Answer-Timing Accuracy (ATA) metric.

Experimentally, Stream3D-VLM-8B reaches average accuracy 58.8 across modes with Answer-Timing Accuracy 86.7%, TTFT 62 ms, end-to-end latency 0.39 s, and memory about 36.6 GB at 504×392 resolution; the 4B model reaches 54.6 average accuracy with ATA 75.4%, TTFT 43 ms, end-to-end latency 0.24 s, and memory about 20.7 GB (Yu et al., 5 Jun 2026). On the offline VSI-Bench, the 8B model reaches 65.9 average, and on ScanQA, ScanRefer, and Scan2Cap it reports ScanQA BLEU-4 17.8, ROUGE 50.2, CIDEr 104.5, EM 30.9; ScanRefer Acc@0.25 58.4 and Acc@0.5 52.5; and Scan2Cap BLEU-4 42.8, METEOR 31.0, ROUGE 64.2, CIDEr 81.2 (Yu et al., 5 Jun 2026).

6. Evaluation, limitations, and open questions

The VLM3 literature contains a visible methodological tension. The general VLM3 study argues that specialized 3D layers, complex losses, and regression formulations are not necessary conditions for strong 3D performance (Cai et al., 28 May 2026). Yet many domain-specific systems continue to introduce explicit geometry modules, including StreamVGGT priors, VSFI, and GAVC in streaming scenes; monocular 3D preprocessing and late feature fusion in driving VQA; CLIP-conditioned voxel attention and weather gates in occupancy; and a ViT3D projector for clinical CT (Yu et al., 5 Jun 2026, Lübberstedt et al., 30 Apr 2025, Doruk et al., 3 Mar 2026, Li et al., 2024, Doruk et al., 15 May 2026). This suggests that “native 3D learning” and “explicit 3D injection” are currently complementary rather than mutually exclusive positions.

The reported failure modes are also domain-specific. In VLM3 proper, performance depends on intrinsics quality, can degrade in textureless or reflective areas, and is sensitive to inconsistent coordinate normalization; larger backbones appear to overfit at the current data scale (Cai et al., 28 May 2026). Stream3D-VLM reports degradation under severe occlusions, rapid motion, motion blur, low-texture regions, sparse geometry, and miscalibrated camera parameters (Yu et al., 5 Jun 2026). V3LMA notes that monocular depth is relative and can be affected by shadows, and that fusion parameters exhibit interdependencies (Lübberstedt et al., 30 Apr 2025). VLMFusionOcc3D identifies prompt reliance, recursive prompting error propagation, CLIP text-space domain shift, depth discretization mismatch, and non-zero latency and memory overhead (Doruk et al., 3 Mar 2026). WeatherOcc3D remains dependent on correct visibility and illumination estimation, is vulnerable to prompt ambiguity and domain shift, and cannot recover signal when both modalities fail severely (Doruk et al., 15 May 2026). In medical imaging, GREEN and VQA accuracy are acknowledged as incomplete proxies for clinical validity, with hallucinations, domain shift, interpretability, and safety still unresolved (Li et al., 2024).

A recurrent misconception is that VLM3 refers only to a single benchmark number or only to one architecture. The literature instead presents it as a broader design space: standard VLM backbones can be trained to output geometric quantities as text, but the same principle is also being adapted to streaming control, voxel occupancy, autonomous driving reasoning, and volumetric report generation. A plausible implication is that VLM3 is best understood as a unification strategy for 3D perception and language rather than a closed model family.

Across papers, the empirical pattern is consistent. Language modeling can represent metric depth, correspondence, pose, scene occupancy, online timing decisions, and volumetric clinical findings, but robust deployment still depends on geometry fidelity, calibration quality, data mixture, and domain-specific evaluation. That combination of unification and residual specialization presently defines the state of VLM3 research (Cai et al., 28 May 2026, Yu et al., 5 Jun 2026).