BenchDepth: Depth Model Benchmarking

Updated 3 July 2026

BenchDepth is a benchmarking paradigm that evaluates depth foundation models via task-driven metrics focused on practical downstream utility.
It integrates evaluations across diverse tasks such as depth completion, stereo matching, 3D scene reconstruction, SLAM, and vision-language spatial understanding.
Results indicate that affine-invariant disparity models often outperform metric depth models, offering actionable insights for future model improvements.

BenchDepth is a benchmarking paradigm for evaluating depth-oriented models—particularly Depth Foundation Models (DFMs)—via task-driven metrics that capture practical downstream utility rather than conventional alignment-based comparisons. It encapsulates a family of benchmarks and methodologies unified by the objective of measuring “depth” across different domains, ranging from computer vision and information retrieval to logic and reasoning. The defining feature of the BenchDepth approach is the focus on real-world tasks as the basis for model assessment, in contrast to per-pixel or per-step alignment with ground truth, thereby enabling more robust and application-relevant model comparisons (Li et al., 21 Jul 2025).

1. Alignment-Free Evaluation in Depth Foundation Models

Traditional evaluation protocols for DFMs, particularly in monocular depth estimation, are built around aligning a model’s predicted depth map $\hat D$ to ground-truth $D_{gt}$ through affine transformations (scale and shift), often using least-squares minimization. This introduces biases that favor models with specific output representations and complicates cross-model comparisons. BenchDepth, by contrast, bypasses all per-pixel or affine alignment: models are plugged directly into downstream pipelines for five application-driven proxy tasks, and are scored by their effect on real systems, not by similarity to ground truth at the depth-map level (Li et al., 21 Jul 2025).
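The conventional scale-and-shift alignment described above can be sketched as follows. This is a minimal illustration, not the code of any particular benchmark: the function names `align_scale_shift` and `aligned_rmse` are hypothetical, depth maps are treated as flat lists of valid pixels, and the closed-form solution is ordinary one-variable least squares.

```python
def align_scale_shift(pred, gt):
    """Return (s, t) minimizing sum((s * pred_i + t - gt_i) ** 2)."""
    n = len(pred)
    if n != len(gt) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("prediction is constant; scale is undefined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    t = mean_g - s * mean_p
    return s, t


def aligned_rmse(pred, gt):
    """RMSE after fitting scale and shift to the ground truth."""
    s, t = align_scale_shift(pred, gt)
    squared = [(s * p + t - g) ** 2 for p, g in zip(pred, gt)]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical toy data: the prediction is off by an affine transform only.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [2.0 * p + 0.5 for p in pred]
print(align_scale_shift(pred, gt))  # (2.0, 0.5)
```

In this toy case the aligned error is zero even though the raw prediction is wrong in both scale and offset. Because the fit is done against ground truth, the score says little about how the unaligned output would behave inside a downstream system, which is the gap BenchDepth targets.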

2. Proxy Task Suite and Evaluation Metrics

BenchDepth’s evaluation involves five proxy tasks reflecting distinct modes of depth utility:

Depth Completion: Given sparse depth samples and an RGB image, the DFM’s dense depth output guides the generation of a completed depth map. Metrics: RMSE, MAE on NYU Depth V2.
Stereo Matching: DFM outputs condition a stereo matcher, improving disparity estimation. Metrics: EPE and large-error pixel rates on SceneFlow and zero-shot sets.
Monocular Feed-Forward 3D Scene Reconstruction (3DGS): DFM depth initializes geometry for photorealistic novel-view synthesis. Metrics: PSNR, SSIM, LPIPS on RealEstate10K.
Simultaneous Localization and Mapping (SLAM): DFM depth assists camera tracking and mapping in monocular SLAM pipelines. Metrics: accuracy (Acc) and completeness (Comp) on Replica.
Vision–Language Spatial Understanding: Depth is provided to vision-language models (VLMs) for answering 3D spatial queries on SpatialBench. Metric: question-answering accuracy by category.

This suite encompasses sensor fusion, geometric reasoning, high-fidelity rendering, robotics, and vision-language integration, and uses only the raw depth outputs of each foundation model as inputs to each pipeline (Li et al., 21 Jul 2025).
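For reference, several of the task metrics named above have simple standard definitions. The sketch below uses flat lists of per-pixel values; the function names, the 1-pixel threshold for the large-error rate, and the `max_value` default are illustrative assumptions rather than settings taken from BenchDepth.

```python
import math


def rmse(pred, target):
    """Root mean squared error (depth completion)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def mae(pred, target):
    """Mean absolute error (depth completion)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def epe(pred_disp, gt_disp):
    """End-point error for stereo: mean absolute disparity error."""
    return mae(pred_disp, gt_disp)


def bad_pixel_rate(pred_disp, gt_disp, threshold=1.0):
    """Fraction of pixels whose disparity error exceeds the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    errors = [abs(p - g) for p, g in zip(pred_disp, gt_disp)]
    return sum(e > threshold for e in errors) / len(errors)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB (novel-view synthesis)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

RMSE, MAE, EPE, and the large-error rate are lower-is-better, while PSNR is higher-is-better, so any aggregation across tasks has to account for metric direction.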

3. Models, Data Splits, and Benchmarking Protocols

Eight representative state-of-the-art monocular DFMs are benchmarked, including affine-invariant disparity, metric depth, and affine-invariant point-map representations. All public releases are used with minimal changes (e.g. unified feature extractors, direct pointmap-to-depth conversion). Within each task, the dataset and its splits are held constant across all models (e.g. NYU Depth V2 for completion, SceneFlow for stereo, RealEstate10K for 3DGS, Replica for SLAM), so that per-task rankings are comparable and can be averaged across tasks:

Model	Representation Type	Avg. Rank (↓)
DAV2-Rel	affine-inv. disparity	1.50
GenPercept	affine-inv. depth	3.25
DAV2-Met	metric depth	3.75
Midas	affine-inv. disparity	4.25
UniDepth	metric depth	5.25
Marigold	affine-inv. depth	5.50
MoGe	affine-inv. point map	5.75
Metric3DV2	metric depth	6.33

Proxy-task-specific improvements are reported as percentage gains over a “no-depth” baseline for each metric and dataset (Li et al., 21 Jul 2025).
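One way to compute a percentage gain over a no-depth baseline, and to turn per-task gains into an average rank, is sketched below. The source does not spell out the exact formulas, so the sign convention, the tie handling (ties are broken by input order), and the names `relative_gain` and `average_ranks` are assumptions made for illustration. The numbers in the example are invented.

```python
def relative_gain(baseline, with_depth, lower_is_better=True):
    """Percentage improvement of a depth-assisted pipeline over the baseline.

    Positive values mean the depth model helped, whatever the metric direction.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - with_depth) if lower_is_better else (with_depth - baseline)
    return 100.0 * change / abs(baseline)


def average_ranks(gains_by_task):
    """Average each model's rank over the tasks in which it appears.

    gains_by_task maps task name -> {model name: gain}; rank 1 is the
    largest gain. Models missing from a task are averaged over fewer tasks.
    """
    totals, counts = {}, {}
    for gains in gains_by_task.values():
        ordered = sorted(gains, key=gains.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


# Invented example values, not results from the benchmark.
gains = {
    "completion": {"model_a": 5.0, "model_b": 2.0, "model_c": 1.0},
    "stereo": {"model_a": 4.0, "model_b": 1.0, "model_c": 3.0},
}
print(average_ranks(gains))
```

Normalizing every metric to "positive means better" before ranking keeps error-style metrics (RMSE, EPE) and score-style metrics (PSNR, accuracy) on a common footing.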

4. Quantitative Outcomes and Key Findings

DAV2-Rel, an affine-invariant disparity DFM, leads overall—providing the highest average gains in depth completion (+9.26% MAE/RMSE improvement), stereo matching (+5.77% EPE improvement), and SLAM (+10.0% Acc+Comp improvement). Metric-depth models close the gap in metric-sensitive tasks (e.g. DAV2-Met in 3DGS, UniDepth in SLAM). Notably, all models score comparably in VLM spatial understanding, with no depth model conferring a decisive advantage.

Task	Best Model(s)	Notable Finding
Completion	DAV2-Rel	Large MAE/RMSE gains
Stereo	DAV2-Rel, MoGe	Affine-inv. best, MoGe promising
3DGS	Midas, DAV2-Met	Metric models competitive
SLAM	DAV2-Rel, UniDepth	Diverse representation utility
VLMs	– (All similar)	No clear improvement via depth

Observations:

Affine-invariant disparity models outperform metric-depth on most low- and mid-level tasks.
Fine-tuned metric models are closing gaps in metric-sensitive settings (suggesting more diverse synthetic data is needed).
Diffusion-based models (Marigold, GenPercept) scale with effective fine-tuning; further broadening of training regimes is suggested.
Depth signals do not yet substantially enhance VLM reasoning, indicating current VLM architectures may not exploit spatial cues effectively (Li et al., 21 Jul 2025).

5. Methodological Innovations and Avoidance of Alignment Bias

BenchDepth systematically eliminates sources of benchmark bias by:

Disallowing any post-hoc per-pixel alignment between predicted and reference depths;
Using “zero convolution” layers or direct pipeline replacement for integrating DFM predictions, thereby avoiding implicit normalization to specific coordinate spaces;
Employing task-level utility metrics (e.g., RMSE, LPIPS, Acc+Comp), which reflect end-user or system-level impact rather than agreement with ground truth on an intermediate depth map (Li et al., 21 Jul 2025).

This methodology ensures that DFMs are evaluated by their true effect on heterogeneous downstream systems rather than their fit to laboratory-specific representations.
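The "zero convolution" idea mentioned above is commonly understood as a 1x1 convolution whose weights and bias are initialized to zero, so that the injected signal has no effect until training moves the weights. The sketch below illustrates that general idea in plain Python; it is an assumption about the mechanism, not BenchDepth's implementation, and `ZeroConv1x1` and `inject` are hypothetical names. Feature maps are represented as lists of pixels, each pixel a list of channel values.

```python
class ZeroConv1x1:
    """1x1 convolution whose weights and bias start at zero."""

    def __init__(self, in_channels, out_channels):
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, pixel):
        """Map one pixel's input channels to output channels."""
        return [
            sum(w * x for w, x in zip(row, pixel)) + b
            for row, b in zip(self.weight, self.bias)
        ]


def inject(base_features, depth_features, zero_conv):
    """Add the zero-conv projection of depth features to the base features."""
    return [
        [b + d for b, d in zip(base_px, zero_conv(depth_px))]
        for base_px, depth_px in zip(base_features, depth_features)
    ]


# Hypothetical toy features: two pixels, two base channels, one depth channel.
base = [[1.0, 2.0], [3.0, 4.0]]
depth = [[0.7], [0.2]]
zero_conv = ZeroConv1x1(in_channels=1, out_channels=2)
print(inject(base, depth, zero_conv) == base)  # True
```

At initialization the downstream pipeline behaves exactly as it would without depth, and the learned weights then decide how the depth signal is scaled, which avoids imposing a fixed normalization on the model's output.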

6. Recommendations, Limitations, and Research Directions

BenchDepth results motivate several directions:

Synthetic Data Scaling: More and varied synthetic data are critical for closing performance gaps, especially in metric-sensitive tasks.
Representation Selection: Affine-invariant disparity emerges as presently optimal for general-purpose utility; model architecture should be aligned with intended deployment scenarios.
Co-training for Vision–Language Models: VLMs show minimal gain from depth, suggesting that either depth-aware pretraining or new multi-modal architectures may be needed.
Efficiency and Integration: Models such as MoGe present integration challenges, suggesting the need for further work on architectural adaptation.
Extension to Additional Pipelines: Incorporating further real-world proxies can enhance the ecological validity of depth model evaluations (Li et al., 21 Jul 2025).

A plausible implication is that universal DFM evaluation may require modular, application-specific suites such as BenchDepth, with clear demarcation between representation, integration method, and task-driven metric.

7. BenchDepth and the Broader Benchmarking Ecosystem

BenchDepth represents a shift from dataset-centric, alignment-normalized evaluation to real-system-oriented benchmarking. Its design mirrors recent developments in adjacent domains (e.g., sequence reasoning with seqBench (Ramezanali et al., 21 Sep 2025), variable-depth pooling in IR (Ganguly et al., 2023)), all focused on moving beyond shallow, one-dimensional metrics toward more authentic representations of task complexity and practical utility. BenchDepth thus offers a scalable, reproducible, and rigorously application-aligned evaluation protocol, setting a new standard for depth model comparison and fostering progress in both computer vision and multi-modal AI (Li et al., 21 Jul 2025).