VLM3: Vision Language Models Are Native 3D Learners

Published 28 May 2026 in cs.CV and cs.AI | (2605.30561v1)

Abstract: Vision LLMs (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper demonstrates that with minimal modifications, vision language models can be repurposed for high-fidelity 3D tasks.
It introduces focal length unification and text-based pixel reference to overcome camera ambiguity and enhance geometric understanding.
Experimental results reveal that VLM3 outperforms or matches expert 3D models in depth, object-level, correspondence, and pose estimation tasks.

Vision LLMs Are Native 3D Learners: An Expert Analysis of VLM3 (2605.30561)

Motivation and Problem Setting

The paper "VLM3: Vision LLMs Are Native 3D Learners" addresses the longstanding dichotomy between expert 3D vision models—characterized by specialized architectures, complex losses, and heavy augmentations—and generalist vision LLMs (VLMs). Historically, VLMs have excelled in semantic tasks but have lagged dramatically in 3D geometric understanding, particularly at fine-grained levels such as depth metrics, correspondence, and pose. The central claim of the paper is that standard VLMs, equipped with proper task formulation and minimal design, are intrinsically capable of high-fidelity 3D understanding across a wide spectrum of tasks.

Methodology: Minimal Design, Maximal Generality

The VLM3 framework relies on three key simplifications:

Focal Length Unification: By resizing every input image to a fixed focal length (1000 pixels), VLM3 eliminates camera intrinsic ambiguity and enables homogenous data mixing. This removes the need for any architectural changes or external modules for handling intrinsics.
Figure 1: VLM3 pipeline showing focal length unification to solve camera ambiguity and text-based pixel reference with normalized axes.
Text-Based Pixel/Object Reference: Pixels and object regions are referred to directly in text, normalized to $[0, 2000)$ for both axes, circumventing previous reliance on visual markers or extra encoder modules. This drastically simplifies input/output handling and scaling, enabling batched QA for a single image without data duplication.
Data Mixture and Scaling: Intelligent data mixture—weighting datasets based on size and difficulty—was demonstrated to be more critical than architectural or augmentative complexity. The authors systematically analyze mixture ratios, showing substantial performance gains from proper weighting.

The entire training pipeline retains vanilla VLM architectures and text-supervised finetuning (SFT) without regression losses, special decoders, or augmentation-heavy regimes.

Experimental Evaluation: Diverse 3D Tasks

VLM3 is evaluated across four mainstream 3D vision tasks:

Metric Depth Estimation: The model surpasses prior SOTA VLMs (e.g., DepthLM-7B) in $\delta_1$ accuracy, achieving 0.90 vs. 0.84 with half the parameters. Results approach those of expert models like UnidepthV2 and MoGe-2, with new SOTA on key benchmarks.
Object-Level 3D Understanding: VLM3 matches or exceeds SpatialRGPT-8B, outperforming it in both qualitative and quantitative spatial reasoning, despite using a smaller model and eschewing region-mask encoders.
Pixel Correspondence Estimation: EPE is reduced by an order of magnitude relative to baseline VLMs. VLM3 demonstrates competitive performance with expert models DKM and RoMa, with only minor gap to full SOTA (UFM), indicating strong geometric generalization capacity.
Camera Pose Estimation: AUC30 is improved from 5% to 94%, exceeding models like VGGT and matching DA3-Giant, even though the problem is formulated purely as text QA without regression.
Figure 2: VLM3 achieves high accuracy across varied 3D tasks using a simple, scalable design, matching or surpassing expert baseline models.

Qualitative Results and Visualizations

Empirical results are reinforced by visualizations demonstrating reliable outputs across depth, spatial reasoning, correspondence, and pose, both in single and multi-view settings, for indoor and outdoor images. Notably, dense point clouds are generated via text prompting alone, and object/pose queries are resolved without architectural modifications or auxiliary markers.

Figure 3: Representative outputs for depth, object-level, correspondence, and pose estimation tasks; raw inputs suffice for robust predictions.

Analysis: Design Choices and Scaling Laws

Text vs. Visual Pixel Reference: Text-based pixel reference performs equivalently to visual prompting when normalization is applied, offering greater efficiency and scalability for large batch QA.
Data Mixture Weighting: Dataset size-based weights outperform uniform weighting, preventing saturation or regression performance when scaling to tens of millions of samples.
Model Size: Increasing model size does not yield accuracy gains at current data scales; 4B models suffice for SOTA results, and overfitting emerges with larger models or higher data volume unless mixture is optimized.

Theoretical Implications

The results overturn several axioms in 3D vision:

Regression Formulation Is Unnecessary: Treating 3D outputs as text QA achieves parity with regression-based expert models for depth, correspondence, and pose.
Task-Specific Design Is Not Required: Standard VLM architectures, with focal length unification and normalized text queries, master diverse 3D tasks.
Scalability Is Strongly Data-Limited: Proper data mixture, rather than architectural innovation, is the decisive factor for generalist model performance in 3D vision.

Practical Implications and Future Directions

VLM3 demonstrates that highly generalist VLMs are practical for 3D vision deployment, simplifying both training and inference pipelines. Foundation models with text-based supervision can now match expert accuracy in dense 3D tasks without custom architectures, losses, or augmentations. The findings invite new paradigms in vision-language training, suggesting that further scaling in data and cross-task QA packing could unlock even broader capabilities.

Research avenues include exploring larger models at escalated data scales, optimizing mixture ratios, and applying the VLM3 approach to new modalities (e.g., video, point cloud). Theoretical work may revisit the nature of geometric inductive biases and their necessity in multi-modal model learning.

Conclusion

VLM3 establishes that vision LLMs, when configured with minimal design and intelligent dataset weighting, are native 3D learners capable of mastering fine-grained geometric tasks across depth, objects, correspondence, and pose. The study significantly reduces the complexity of designing 3D vision models, shifting focus towards scalable, unified QA-based frameworks and reinforcing the primacy of data mixture over traditional architectural innovations. The work represents a fundamental step toward generalist 3D foundation models and provides a rigorous baseline for future AI research in multi-modal geometric learning.

Markdown Report Issue