DepthLM: VLM for Metric Depth Estimation
- DepthLM is a vision-language model for per-pixel metric depth estimation that leverages text-based supervision and visual markers to accurately determine depth.
- It employs intrinsic-conditioned augmentation to harmonize camera scales across datasets, addressing scale ambiguity without modifying the VLM architecture.
- The approach achieves state-of-the-art accuracy with sparse supervision and small model sizes while naturally extending to various 3D reasoning tasks.
DepthLM is a vision-language model (VLM) approach for per-pixel metric depth estimation from 2D images. Unlike specialized pure vision models that rely on custom architectures and loss designs, DepthLM leverages standard VLM infrastructure and text-based supervised fine-tuning, complemented by visual prompting for pixel reference and intrinsic-conditioned image augmentation. This enables VLMs to achieve expert-level accuracy on metric depth tasks, surpassing the performance of advanced multimodal models without requiring architectural or loss modifications. DepthLM demonstrates state-of-the-art results even with sparse supervision and small model sizes, and its framework generalizes naturally to a wide range of 3D reasoning tasks.
1. Motivation and Design Principles
DepthLM addresses two major obstacles for VLMs in metric depth estimation: difficulty with pixel reference and scale ambiguity caused by varying camera intrinsics. While expert vision models attain super-human metric depth accuracy, they do so at the expense of bespoke dense prediction heads and complex losses. DepthLM asks whether a standard VLM, trained only with sparse text-based supervision and a standard language-modeling loss, can reach comparable levels. The core design choices are:
- No modifications to the VLM architecture or loss.
- Sparse supervised fine-tuning using natural language metric supervision (e.g., "How many meters is this point from the camera?").
- Explicit strategies for referencing pixels (visual prompting) and harmonizing metric scale across datasets (intrinsic-conditioned augmentation).
2. Methodological Innovations
a) Visual Prompting for Pixel Reference
DepthLM renders discrete visual markers (such as small arrows) directly onto images to indicate the query pixel position. This marker-based approach replaces coordinate-based textual prompts used in previous works, which VLMs struggle to interpret. The model is queried using standard language, dramatically improving depth estimate reliability, especially in cluttered scenes.
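The sketch below illustrates this marker-based querying pattern; it is a minimal example assuming a Pillow image and a placeholder `query_vlm` function, and the marker shape, color, and prompt wording are illustrative rather than the paper's exact choices.

```python
from PIL import Image, ImageDraw

def add_visual_marker(image: Image.Image, x: int, y: int, radius: int = 8) -> Image.Image:
    """Draw a small circular marker at the query pixel (x, y).

    Marker-based visual prompting: the pixel of interest is indicated
    directly in the image instead of through textual coordinates.
    """
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                 outline="red", width=3)
    return marked

# Hypothetical usage with a generic VLM interface (query_vlm is a placeholder).
image = Image.open("scene.jpg")
marked = add_visual_marker(image, x=412, y=305)
prompt = "How many meters is the marked point from the camera?"
# answer = query_vlm(marked, prompt)   # expected to be a metric string, e.g. "3.1 meters"
```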
b) Intrinsic-Conditioned Augmentation
Cross-dataset training introduces scale ambiguity because differences in camera intrinsics (e.g., focal length) mean that similar pixel locations can correspond to very different metric depths. DepthLM resolves this ambiguity via a pre-processing step. For an input image of width $W$ and height $H$, the image is resized so that both axes correspond to a unified focal length $f_u$, using:

$$W' = W \cdot \frac{f_u}{f_x}, \qquad H' = H \cdot \frac{f_u}{f_y}$$

Here, $f_x$ and $f_y$ are the horizontal and vertical focal lengths of the source camera, and $f_u$ is typically chosen as 1000 pixels. This augmentation aligns metric scale and enables a single model to generalize accurately across datasets without additional network modification.
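A minimal sketch of this resizing step, directly implementing the formula above; the interpolation mode and rounding are assumptions rather than details specified here.

```python
from PIL import Image

def intrinsic_conditioned_resize(image: Image.Image,
                                 fx: float, fy: float,
                                 f_unified: float = 1000.0) -> Image.Image:
    """Resize an image so its effective focal length becomes f_unified.

    Scaling width by f_unified / fx and height by f_unified / fy makes a
    given pixel footprint correspond to a consistent metric scale across
    datasets captured with different camera intrinsics.
    """
    w, h = image.size
    new_w = round(w * f_unified / fx)
    new_h = round(h * f_unified / fy)
    return image.resize((new_w, new_h), Image.BILINEAR)
```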
c) Sparse Supervised Fine-Tuning via Language
A standard text-based next-token prediction loss (cross-entropy) is applied using sparse depth labels, often as few as one annotated pixel per image. The target output is a descriptive metric statement (e.g., "3.1 meters"), with no explicit regression or regularization losses and no dense output heads. Experiments demonstrate that diversity of imagery matters more than dense labeling. Reinforcement learning alternatives (e.g., GRPO with a negative depth-error reward) were also explored, but classic SFT proved more efficient at comparable accuracy.
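As a concrete illustration, a training example might pair a marker-annotated image with a short metric answer, as in the sketch below; the field names, prompt wording, and rounding convention are illustrative assumptions, not the paper's exact data format.

```python
def build_sft_sample(marked_image, depth_meters: float) -> dict:
    """Pair a marker-annotated image with a natural-language depth target.

    The cross-entropy loss is applied only to the target tokens, so a
    single annotated pixel per image is sufficient supervision.
    """
    return {
        "image": marked_image,  # image with the query pixel already marked
        "prompt": "How many meters is the marked point from the camera?",
        "target": f"{depth_meters:.1f} meters",  # e.g., "3.1 meters"
    }
```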
3. Performance and Comparative Analysis
DepthLM convincingly surpasses existing vision-LLMs and is comparable with pure vision baselines:
- On standard benchmarks, a 3B-parameter DepthLM model achieves a δ1 accuracy (fraction of pixels within 25% relative error) more than 2× that of advanced VLMs such as GPT-5 and Seed1.5-VL (which typically score below 0.4), and is competitive with pure vision models.
- Even at a vastly smaller model size (e.g., 3B parameters versus Qwen2.5-VL at 72B), DepthLM delivers precise metric depth.
- When independently querying every pixel, DepthLM produces point clouds whose metric scale matches ground truth for both indoor and outdoor scenes (see the unprojection sketch after this list).
- Without explicit boundary regularization, DepthLM naturally avoids over-smoothness and flying points at object boundaries, a common problem in pure vision approaches.
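A sketch of how per-pixel depth answers could be lifted to a metric point cloud with the standard pinhole model; the per-pixel querying itself is assumed to happen elsewhere, and in practice batching or caching would be needed.

```python
import numpy as np

def unproject_depth_map(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Lift a per-pixel metric depth map to an (H*W, 3) point cloud.

    depth[v, u] holds the metric depth of pixel (u, v), e.g. obtained by
    querying the VLM once per pixel with a visual marker.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```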
4. Technical Mechanisms
The success of DepthLM is rooted in meticulous input encoding and pre-processing:
- Pixel reference uses visual markers, not textual coordinates, based on empirical evidence of superior VLM response.
- Intrinsic-conditioned resizing is derived from the pinhole camera model, normalizing scale and improving cross-dataset generalization; performance plateaus once the unified focal length reaches roughly 1000 pixels (a brief derivation follows this list).
- Supervised fine-tuning exploits standard next-token prediction, with no auxiliary regression or dense heads.
- Training is efficient—sparse labeling de-emphasizes annotation density in favor of dataset diversity.
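As a worked step behind the second point, the pinhole relation below (using the same symbols as in Section 2b) shows why equalizing focal lengths removes the cross-dataset ambiguity; this is a sketch based on the standard pinhole model rather than a derivation quoted from the paper.

```latex
% Pinhole projection of a 3D point (X, Y, Z) to pixel coordinate u:
%   u - c_x = f_x X / Z
% Cameras with different f_x map the same metric depth Z to different pixel
% offsets, so pixel evidence alone is ambiguous about scale. Resizing the
% image horizontally by s_x = f_u / f_x (and vertically by s_y = f_u / f_y)
% rescales pixel coordinates, giving every image the same effective focal length:
\[
  u' - c_x' \;=\; s_x\,(u - c_x) \;=\; \frac{f_u}{f_x}\, f_x \frac{X}{Z} \;=\; f_u \frac{X}{Z},
\]
% so after augmentation the pixel-offset-to-depth relationship depends only on f_u.
```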
5. Multi-Task Extensions
DepthLM's unified framework extends naturally to other 3D vision tasks beyond metric depth:
- Principal axis depth estimation (front–back positioning)
- Speed and time estimation (e.g., predicting time to destination given estimated speed and distance)
- Two-point metric distance evaluation
- Multi-image metric camera pose estimation
This suggests a model-agnostic approach wherein a single VLM can flexibly handle a suite of metric 3D reasoning tasks using consistent architectural and training protocols.
6. Future Directions
Potential research directions inspired by DepthLM include:
- Improved dataset curation/filtering pipelines to enrich diversity and generalization.
- Fine-grained intrinsic normalization for even subtler camera differences, pushing VLMs beyond pure vision model accuracy.
- Multi-task continuous training to elevate unified 3D reasoning capability across segmentation, scene flow, pose, and related spatial inference domains.
These avenues reflect the paper's hypothesis that careful visual prompting and intrinsic augmentation, supported by sparse supervised fine-tuning, can unlock expert 3D understanding from standard VLMs without the burden of custom architectures or loss designs.
DepthLM marks a paradigm shift in leveraging VLMs for metric depth estimation. Through visual prompting, intrinsic-conditioned scale harmonization, and sparse supervised fine-tuning, it reaches expert-level accuracy previously reserved for purpose-built vision models, while retaining architectural generality and extensibility across a range of 3D vision tasks (Cai et al., 29 Sep 2025).