Scale-Recovered Monocular Depth Estimator

Updated 16 November 2025
  • The paper introduces a novel modular approach that decomposes depth into a global scale factor and a relative depth map, enabling accurate metric predictions.
  • It combines language semantics, geometric cues, and self-supervised losses to resolve global scale ambiguity and enhance cross-domain robustness.
  • The method supports applications such as 3D reconstruction, robot navigation, and SLAM while addressing challenges like cue dependency and computational complexity.

A scale-recovered monocular depth estimator refers to a computational model or algorithm that, given a single RGB image (or a temporal sequence), predicts dense depth maps in metric units, i.e., with the global scale ambiguity resolved. This enables monocular depth estimation to support downstream tasks such as metric 3D reconstruction, robot navigation, or mapping, which demand not only up-to-scale geometry but true real-world scale.

1. Theoretical Foundations and Scale Ambiguity

Monocular depth estimation’s ill-posed nature stems from projective ambiguity: the mapping from a 3D scene to a 2D image discards all absolute scale information, so any depth map $D(x)$ is only determined up to a global affine transformation:

$$D_\text{metric}(x) \approx a\,D_\text{rel}(x) + b$$

where $D_\text{rel}$ is the (normalized, relative, or inverse) depth and $(a, b)$ are undetermined. Classical supervised models collapse this ambiguity by training on metric datasets, but these approaches often perform poorly out-of-domain due to scale bias.

Two principal approaches exist for resolving scale:

  • External cues or priors: geometric features (known object size, camera height, or the 3D pose of scene elements)
  • Global statistical or semantic inference: leveraging language descriptions, scene semantics, or learned priors to predict the appropriate transformation

Recent methods employ a wide variety of strategies—ranging from multi-modal fusion and geometric regression to semantic/linguistic conditioning and explicit geometric modeling—to predict either the affine parameters $(a, b)$ or directly recover metric scale within the network.
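To make the ambiguity concrete, the sketch below (Python/NumPy; array names are illustrative) computes the "oracle" global $(a, b)$ by least squares against ground truth, the linear fit that later sections use as an upper bound on alignment quality. A scale-recovered estimator must predict these parameters, or the scale directly, without access to ground truth.

```python
import numpy as np

def fit_global_affine(d_rel, d_metric, valid=None):
    """Least-squares fit of D_metric ~ a * D_rel + b over valid pixels.

    d_rel    : (H, W) relative (up-to-affine) depth prediction
    d_metric : (H, W) metric ground-truth depth
    valid    : optional (H, W) boolean mask of pixels with valid ground truth
    Returns (a, b) and the metric-aligned prediction.
    """
    if valid is None:
        valid = d_metric > 0
    x = d_rel[valid].ravel()
    y = d_metric[valid].ravel()
    # Solve [x, 1] @ [a, b]^T = y in the least-squares sense.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b, a * d_rel + b
```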

2. Decomposition and Modular Architectures

A prominent trend is the explicit decomposition of metric depth into two factors:

$$D(x) = s \cdot R(x)$$

where $s$ is a global scene scale (scalar, image-level) and $R(x)$ is a normalized relative depth map. This modular design, as exemplified by ScaleDepth (Zhu et al., 11 Jul 2024), decouples "what is farther/closer" from "how big is the scene." Architectures following this paradigm typically comprise the following (a minimal sketch follows the list):

  • Scale-prediction module: infers $s$ using image-level features, often leveraging global structure or scene semantics; semantic-aware variants may inject CLIP-derived or text-supervised features.
  • Relative-depth module: predicts $R(x)$ using dense local features, often via adaptive binning or masked transformers to capture the scene’s ordinal structure.
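A minimal PyTorch-style sketch of this decomposition is given below; the module names, feature dimensions, and activation choices are illustrative assumptions, not the ScaleDepth architecture itself.

```python
import torch
import torch.nn as nn

class DecomposedDepthHead(nn.Module):
    """Illustrative decomposition D(x) = s * R(x): a global scale head on
    pooled image features and a dense relative-depth head on per-pixel features."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Scale head: image-level features -> one positive scalar per image.
        self.scale_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1), nn.Softplus(),          # enforce s > 0
        )
        # Relative-depth head: dense features -> normalized map in (0, 1).
        self.rel_head = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),        # R(x) in (0, 1)
        )

    def forward(self, feats):                          # feats: (B, C, H, W)
        s = self.scale_head(feats)                     # (B, 1)
        r = self.rel_head(feats)                       # (B, 1, H, W)
        return s.view(-1, 1, 1, 1) * r                 # metric depth (B, 1, H, W)
```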

Other approaches, such as Depth Map Decomposition (Jun et al., 2022), model the entire metric map as a recombination of a normalized map $N(x)$ and scale parameters (mean/variance), learned via multi-headed decoders.

A distinct class uses non-parametric postprocessing: relative depths are produced by the base network and then globally aligned to metric with parameters inferred from auxiliary inputs (e.g., language, sparse depth priors, or geometric primitives).

3. Methods for Scale Recovery

3.1 Language and Semantics-Driven Approaches

RSA (Zeng et al., 3 Oct 2024) and VGLD (Wu et al., 5 May 2025) pioneer methods that use language descriptions to infer scale for monocular depth maps. A text encoder (e.g., frozen CLIP) processes a semantically rich caption, and a compact regression network predicts affine transformation parameters that operate globally on the output of a relative depth backbone:

$$\hat d_\text{metric}(i,j) = \frac{1}{k\,x(i,j) + b}$$

where $x(i,j)$ is the relative (inverse) depth produced by the backbone and $(k, b)$ are the predicted alignment parameters.

VGLD fuses both CLIP-image and CLIP-text embeddings and employs a routing mechanism to specialize predictions for indoor/outdoor, yielding reduced variance and improved transfer compared to text-only approaches (RSA). Both models function as universal metric-alignment modules, are robust under cross-domain/zero-shot evaluation, and support out-of-the-box integration with any relative-depth backbone.
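A hedged sketch of this language-driven alignment follows: a small regression head (hypothetical dimensions) maps a frozen CLIP embedding to the global $(k, b)$ and applies them to the backbone's relative inverse depth. The actual RSA/VGLD heads, embedding fusion, and indoor/outdoor routing are more elaborate.

```python
import torch
import torch.nn as nn

class LanguageScaleHead(nn.Module):
    """Maps a text (or fused text+image) embedding to global alignment
    parameters (k, b) for metric depth = 1 / (k * x + b)."""

    def __init__(self, embed_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2),
        )

    def forward(self, text_embedding, rel_inv_depth):
        # text_embedding: (B, embed_dim), e.g. a frozen CLIP caption embedding
        # rel_inv_depth : (B, 1, H, W) relative inverse depth from the backbone
        k, b = self.mlp(text_embedding).chunk(2, dim=-1)      # each (B, 1)
        k = k.view(-1, 1, 1, 1)
        b = b.view(-1, 1, 1, 1)
        # Clamp keeps the denominator positive in this simplified sketch.
        metric_depth = 1.0 / (k * rel_inv_depth + b).clamp(min=1e-6)
        return metric_depth
```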

3.2 Geometric and Sparse Priors

Multiple works address scale with geometric, visual, or sparse depth priors:

  • Sparse Anchor Alignment: Local or global least-squares regression aligns the predicted depth to sparse metric anchor points, often from SLAM, stereo, or LiDAR (Guizilini et al., 2019, Zhang et al., 29 Oct 2025). As few as 4–100 LiDAR points or synthetic stereo correspondences suffice, e.g.,

$$\min_{s, t} \sum_{i} \big(s\,z(i) + t - v_i\big)^2$$

where $z(i)$ is the predicted depth at anchor location $i$ and $v_i$ the corresponding metric measurement.

  • Instrument-based Geometric Modeling: In specialized domains such as endoscopy (Wei et al., 14 Aug 2024), absolute scale is anchored by reconstructing the 3D pose of a tool with known geometry in the image; Plücker line geometry provides closed-form surface correspondences to compute the affine scale/shift between predicted and real depth.
  • Camera Height or Known Baseline: In automotive contexts, a known camera height above ground can be used to anchor plane distance, supplying an observed scale anchor that propagates to arbitrary scene points via planar-parallax or ground-plane reasoning (Wagstaff et al., 2020, Elazab et al., 29 Nov 2024); see the sketch after this list.
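For the camera-height cue, the sketch below (NumPy; ground-pixel selection and plane fitting are simplified assumptions) recovers a global scale as the ratio between the known mounting height and the camera height implied by the up-to-scale depth of ground-plane pixels. Multiplying the predicted depth map by the returned factor yields metric depth under the flat-ground assumption.

```python
import numpy as np

def recover_scale_from_camera_height(depth, ground_mask, K, known_height):
    """Global scale from a known camera mounting height (simplified sketch).

    depth        : (H, W) up-to-scale depth prediction
    ground_mask  : (H, W) boolean mask of pixels assumed to lie on the ground plane
    K            : (3, 3) camera intrinsics
    known_height : true camera height above the ground, in metres
    """
    v, u = np.nonzero(ground_mask)
    z = depth[v, u]
    # Back-project ground pixels into the (up-to-scale) camera frame.
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)
    pts = (rays * z).T                              # (N, 3) 3D points
    # Fit a plane through the ground points (least squares via SVD).
    centroid = pts.mean(axis=0)
    _, _, vh = np.linalg.svd(pts - centroid)
    n = vh[-1]                                      # unit plane normal
    predicted_height = abs(centroid @ n)            # camera-to-plane distance, up to scale
    return known_height / predicted_height          # multiply depth by this scale
```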

3.3 Self-supervision and Video Consistency

Self-supervised monocular depth estimation benefits from geometric consistency losses to enforce scale alignment across frames in video:

  • Geometry consistency loss (Bian et al., 2021): enforces depth maps from adjacent frames to be mutually consistent after warping according to predicted pose (a minimal sketch follows this list).
  • Planar-parallax teacher–student frameworks (Elazab et al., 29 Nov 2024): a multi-frame “teacher” with access to planar geometry and mounting height supervises a monocular “student” to distill metric knowledge, enabling robust self-supervised metric prediction.
  • SLAM-based teacher–student frameworks (Choi et al., 2022): the teacher, a large pre-trained relative-depth model, distills high-frequency structure to a smaller student, while SLAM provides extrinsic scale via metric camera pose; the student network is fine-tuned with compositional photometric, distillation, and scale-alignment losses.
  • Relative/Metric Frame Consistency: Models such as Self-Supervised Scale Recovery (Wagstaff et al., 2020) augment photometric and smoothness losses with scale-recovery terms, e.g., ensuring the predicted camera height from ground-plane pixels matches the physical mounting height.
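A hedged PyTorch sketch of a geometry consistency term in the spirit of Bian et al. (2021): the depth of frame a is back-projected, moved by the predicted relative pose, reprojected into frame b, and compared with the interpolated depth predicted for frame b. Out-of-view masking and the exact normalization are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def geometry_consistency_loss(depth_a, depth_b, T_ab, K):
    """Simplified depth-consistency term between two frames.

    depth_a, depth_b : (B, 1, H, W) predicted depths for frames a and b
    T_ab             : (B, 4, 4) predicted pose mapping points from frame a to frame b
    K                : (B, 3, 3) camera intrinsics
    """
    B, _, H, W = depth_a.shape
    device = depth_a.device

    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project frame a and transform the points into frame b.
    cam_a = torch.linalg.inv(K) @ pix * depth_a.view(B, 1, -1)            # (B, 3, H*W)
    cam_a_h = torch.cat([cam_a, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_b = (T_ab @ cam_a_h)[:, :3]                                        # (B, 3, H*W)
    z_computed = cam_b[:, 2:3].clamp(min=1e-6)

    # Project into frame b and sample its predicted depth there.
    uv = (K @ cam_b)[:, :2] / z_computed
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    z_sampled = F.grid_sample(depth_b, grid, align_corners=True)

    # Symmetric relative inconsistency; in practice, out-of-view pixels are masked.
    diff = (z_computed.view(B, 1, H, W) - z_sampled).abs()
    return (diff / (z_computed.view(B, 1, H, W) + z_sampled)).mean()
```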

4. Loss Functions and Training Protocols

Losses for scale-recovered monocular depth estimation cluster into the following classes:

  • Scale-invariant losses: Penalize per-pixel depth errors in a manner that is invariant to global scaling, e.g.,

$$L_{\text{SI}} = \alpha\,\sqrt{\operatorname{Var}[\delta(x)] + \lambda\,\big(\mathbb{E}[\delta(x)]\big)^2}$$

as in Eigen et al., adopted by ScaleDepth (Zhu et al., 11 Jul 2024) and others, where $\delta(x)$ denotes the per-pixel log-depth error (a minimal implementation follows this list).

  • Direct supervised loss on metric depth: $L_1$ or $L_2$ between the predicted (scale-aligned) map and ground truth.
  • Geometric consistency: Measures pixelwise inconsistency when warping the depth map under predicted pose; encourages temporal coherence and stabilizes the learned scale (Bian et al., 2021).
  • Semantic cross-entropy on scale-classification: Used when scale is inferred via semantic alignment to predefined scene classes with CLIP (Zhu et al., 11 Jul 2024).
  • Auxiliary terms: Edge-aware smoothness, multi-scale gradients, normal-based geometry loss, and domain-router classification (VGLD).
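A minimal PyTorch version of the scale-invariant term above (the constant $\alpha$ and the weight $\lambda$ are illustrative; conventions differ between papers) could look like:

```python
import torch

def scale_invariant_loss(pred, target, valid, alpha=10.0, lam=0.15):
    """L_SI = alpha * sqrt(Var[delta] + lam * E[delta]^2),
    with delta = log(pred) - log(target) over valid pixels."""
    delta = torch.log(pred[valid]) - torch.log(target[valid])
    return alpha * torch.sqrt(delta.var(unbiased=False) + lam * delta.mean() ** 2)
```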

Training with mixed supervision calls for distinct strategies: to jointly exploit relative and metric data, modular decoders may be trained only on samples with metric ground truth, or weakly supervised losses (e.g., image-level normalized regression (Yin et al., 2022)) may be employed to remove range bias.
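A hedged sketch of such an image-level normalized regression term is shown below; the median/mean-absolute-deviation normalizer is one common choice, and the published loss of Yin et al. (2022) additionally trims outliers.

```python
import torch

def image_level_normalized_regression(pred, target, valid, eps=1e-6):
    """Per-image normalization of both prediction and ground truth before an
    L1 penalty, removing range/scale bias from the supervision signal."""
    def normalize(x):
        med = x.median()
        return (x - med) / ((x - med).abs().mean() + eps)
    p, t = pred[valid], target[valid]
    return (normalize(p) - normalize(t)).abs().mean()
```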

5. Quantitative Performance and Experimental Findings

State-of-the-art methods demonstrate that robust scale recovery is possible, often approaching the performance of oracle linear fits to ground truth:

  • Global scene scale estimation: ScaleDepth (Zhu et al., 11 Jul 2024) achieves AbsRel=0.074, RMSE=0.267 m, δ₁=0.957 (NYUv2) and AbsRel=0.048, RMSE=1.987 m, δ₁=0.980 (KITTI), setting the benchmark on both indoor and outdoor domains without test-time tuning or manual scale alignment.
  • Language-driven alignment: VGLD (Wu et al., 5 May 2025) and RSA (Zeng et al., 3 Oct 2024) reduce AbsRel to within ~0.02–0.04 of the best possible linear fit (NYUv2: VGLD-TCI AbsRel ≈ 0.12, LM-fit ≈ 0.056).
  • Sparse priors and geometric cues: Robust scale recovery is demonstrated in challenging underwater (Zhang et al., 29 Oct 2025) and medical/endoscopic (Wei et al., 14 Aug 2024) scenes by leveraging a handful of metric points or simple geometric primitives; even 4 LiDAR beams suffice to constrain global scale (Guizilini et al., 2019).
  • Zero-shot and cross-domain generalization: Multiple frameworks report strong zero-shot transfer: e.g., ScaleDepth outperforms cross-domain baselines by up to 23.1% relative AbsRel, and VGLD achieves robust alignment on unseen sets (SUN-RGBD, DDAD) without tuning.
  • Self-supervised and video-based: MonoPP (Elazab et al., 29 Nov 2024) directly estimates metric depth in automotive video using only the camera mounting height, outperforming prior camera-height-constrained methods on KITTI without explicit GT depth.

A summary of typical experimental metrics and backbone/model combinations is useful for comparative analysis:

| Method | Indoor (NYUv2) | Outdoor (KITTI) | Cross-Domain / Zero-Shot |
|---|---|---|---|
| ScaleDepth | AbsRel=0.074 | AbsRel=0.048 | Top-1 on 8 unseen sets |
| VGLD (TCI/MiDaS) | AbsRel≈0.12 | AbsRel≈0.12–0.15 | Robust ZS on SUN-RGBD, DDAD |
| SPADE (DAT, FLSea) | — | — | AbsRel=0.042 (FLSea); SOTA on underwater SfM |
| MonoPP | — | AbsRel=0.107, RMSE=4.658 | SOTA on Cityscapes |
| DMD (Jun et al., 2022) | AbsRel=0.098 | — | Robust in small-data regimes |
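For reference, the metrics quoted above follow the standard definitions; a NumPy sketch (mask handling simplified) is:

```python
import numpy as np

def depth_metrics(pred, gt, valid=None):
    """AbsRel, RMSE, and delta_1 over valid ground-truth pixels."""
    if valid is None:
        valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return {"AbsRel": abs_rel, "RMSE": rmse, "delta_1": delta1}
```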

6. Limitations and Open Challenges

Despite significant advances, several limitations persist:

  • Semantic generalization: Language or category-driven methods remain sensitive to out-of-vocabulary scenes or atypical compositions. Scale prediction via fixed scene prompts may fail on rare or ambiguous images (Zhu et al., 11 Jul 2024, Zeng et al., 3 Oct 2024).
  • Global affine assumption: Most alignment methods model the depth error as a global affine transformation. When underlying biases are spatially non-uniform, regional adaptation or per-pixel correction may be necessary (Zeng et al., 3 Oct 2024).
  • Dependency on external cues: Methods leveraging sparse metric priors or geometric markers require those cues to be available at runtime. In some domains, robust feature detection (e.g., tool shaft in surgery) cannot be guaranteed (Wei et al., 14 Aug 2024, Zhang et al., 29 Oct 2025).
  • Resource and complexity constraints: Incorporation of CLIP or large vision backbones inflates computational cost. Lighter backbones or depth-specific pretraining may alleviate deployment barriers (Zhu et al., 11 Jul 2024).
  • Scene and camera dependence: Geometric or calibration-based strategies hinge on accurate camera intrinsics or the presence of canonical planes (e.g., ground). Performance degrades under miscalibration or when such planar structure is scarce (Guizilini et al., 2023).

Potential future research directions include open-vocabulary scale prediction, multi-modal cue fusion (e.g., combining language, geometric and sparse sensory cues), regionalized or per-object scaling, and further improvements in explicit domain adaptation.

7. Applications and Integrated Pipelines

Scale-recovered monocular depth estimation underpins a wide spectrum of applications:

  • 3D scene reconstruction: Combining scale-consistent per-frame prediction with geometry-based pipelines enables accurate metric scene reconstructions from ordinary RGB video (Xu et al., 2022, Yin et al., 2022).
  • Robotics and navigation: Metric depth is critical for robot localization, path planning, and obstacle avoidance—especially when only a monocular camera is feasible (Wagstaff et al., 2020, Choi et al., 2022).
  • SLAM and mapping: Dense monocular depth may be fused with ORB-SLAM2 or similar multi-view systems for globally metric mapping and improved trajectory accuracy (Bian et al., 2021, Choi et al., 2022).
  • Medical and scientific imaging: Scale-aware depth estimation in endoscopic/underwater scenes enables advanced navigation, intervention planning, and integrated robotic control (Wei et al., 14 Aug 2024, Zhang et al., 29 Oct 2025).
  • Zero-shot and domain-adaptive inference: Language- and geometry-driven approaches, robust to dataset shifts, support open-world deployment scenarios.

In summary, scale-recovered monocular depth estimators integrate advances in semantic inference, geometric reasoning, and self-supervision to produce dense, metric-consistent depth maps from a single frame or monocular video. This capacity enables a broad range of downstream applications and marks a significant milestone in both the generalization and practical relevance of monocular depth estimation methods.
