Monocular Depth Estimation Overview

Updated 10 December 2025
  • Monocular depth estimation is the process of predicting a dense depth map from a single image, despite inherent scale and projection ambiguities.
  • It leverages deep convolutional, transformer, and hybrid architectures to fuse multiscale, semantic, and geometric cues for accurate scene reconstruction.
  • Emerging methods integrate supervised, self-supervised, and physics-based strategies to tackle challenges in dynamic scenes, domain adaptation, and uncertainty estimation.

Monocular depth estimation is the task of predicting a dense depth map from a single RGB image, recovering for each pixel the distance to the scene surface via a process that is fundamentally ill-posed due to perspective projection ambiguities and the loss of scale information. This problem is central to robotic vision, autonomous driving, 3D scene understanding, and many downstream applications including AR/VR, semantic reconstruction, and SLAM. The field has progressed rapidly, with deep convolutional and transformer-based architectures driving advances in both supervised and self-supervised paradigms. This article provides a comprehensive summary of the scientific foundations, architectural innovations, evaluation methodologies, remaining challenges, and future directions in monocular depth estimation.

1. Problem Formulation and Ill-Posedness

The goal of monocular depth estimation (MDE) is, given a single RGB image $I \in \mathbb{R}^{H \times W \times 3}$, to predict a dense per-pixel depth map $D \in \mathbb{R}^{H \times W}$ (Bhoi, 2019). Ambiguity arises because infinitely many 3D scene configurations are consistent with any given 2D projection, producing scale, viewpoint, and shape ambiguities.

Classical geometric approaches rely on multi-view or stereo cues to resolve these ambiguities. In monocular form, human or machine vision must instead exploit learned priors, local and global scene context, semantic cues, object size knowledge, and surface orientation assumptions. Formally, the mapping $\Phi: \mathcal{I} \to \mathcal{D}$ is learned from a dataset $\{(I_i, D_i)\}_{i=1}^N$, typically by minimizing a loss such as the scale-invariant log-MSE:

$$D(y, y^*) = \frac{1}{2n} \sum_{i=1}^{n} \left( \log y_i - \log y^*_i + \alpha \right)^2, \qquad \alpha = \frac{1}{n} \sum_i \left( \log y^*_i - \log y_i \right)$$

This loss compensates for global scale ambiguity and is widely used in direct regression schemes (Bhoi, 2019).
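A minimal sketch of this scale-invariant log-MSE loss in PyTorch is shown below; the function name, tensor shapes, and validity mask are illustrative assumptions rather than any cited implementation.

```python
import torch

def scale_invariant_log_loss(pred: torch.Tensor,
                             target: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    """Scale-invariant log-space MSE over valid (positive-depth) pixels."""
    d = torch.log(pred[mask]) - torch.log(target[mask])  # per-pixel log ratio
    # alpha = mean(log target - log pred) cancels any global scale offset,
    # so only relative depth errors are penalized.
    alpha = -d.mean()
    return 0.5 * ((d + alpha) ** 2).mean()

# Usage with random positive depths (hypothetical data):
pred = torch.rand(1, 1, 8, 8) + 0.1
target = torch.rand(1, 1, 8, 8) + 0.1
print(scale_invariant_log_loss(pred, target, target > 0))
```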

Ambiguity in absolute scale is a critical issue: monocular pipelines can only recover depth up to an unknown factor unless external cues (object size priors, inertial measurements, or semantic anchors) are incorporated (Wei et al., 2022, Choi et al., 2022, Wofk et al., 2023, Zhang et al., 18 Mar 2025).

2. Network Architectures and Representation Learning

Current MDE models leverage various deep learning architectures and output parameterizations to encode multiscale, semantic, and geometric information.
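As a concrete illustration of the encoder-decoder template that most of these architectures refine, the sketch below shows a deliberately small convolutional depth network with skip connections; the class name, layer widths, and bounded sigmoid output are illustrative assumptions, not a model from the cited literature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: downsample for context, upsample with skips for detail."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)    # full resolution features
        self.enc2 = conv_block(32, 64)   # 1/2 resolution
        self.enc3 = conv_block(64, 128)  # 1/4 resolution (global context)
        self.dec2 = conv_block(128 + 64, 64)
        self.dec1 = conv_block(64 + 32, 32)
        self.head = nn.Conv2d(32, 1, 3, padding=1)  # one depth value per pixel

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(F.max_pool2d(f1, 2))
        f3 = self.enc3(F.max_pool2d(f2, 2))
        u2 = F.interpolate(f3, scale_factor=2, mode="bilinear", align_corners=False)
        d2 = self.dec2(torch.cat([u2, f2], dim=1))
        u1 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([u1, f1], dim=1))
        # Sigmoid bounds the output; real systems map this to a metric range
        # or predict inverse depth / disparity instead.
        return torch.sigmoid(self.head(d1))

depth = TinyDepthNet()(torch.randn(1, 3, 64, 64))
print(depth.shape)  # torch.Size([1, 1, 64, 64])
```

Transformer and hybrid variants replace the convolutional encoder with attention-based backbones, but the overall predict-per-pixel decoder structure is shared.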

3. Training Paradigms: Supervision, Self-Supervision, and Cues

  • Supervised Learning: Early and contemporary approaches rely on per-pixel metric ground truth from depth sensors or synthetic environments (Li et al., 2017, Gurram et al., 2021). Loss functions include scale-invariant log-space MSE, absolute/relative errors, cross-entropy for classification, and SSIM-augmented regression.
  • Self-Supervision via Photometric Consistency: Leveraging multi-view sequences, pose estimation, and differentiable warping, models are trained with photometric reconstruction losses. These methods perform explicit view synthesis using predicted depth and relative camera poses:

$$\bar{p}_s = K\, T_{t \rightarrow s}\, D_t(p_t)\, K^{-1} \bar{p}_t \ \text{(rigid)}, \qquad K\, (L_s L_t^{-1})\, D_t(p_t)\, K^{-1} \bar{p}_t \ \text{(dynamic)}$$

where $K$ is the intrinsic matrix, $T_{t \rightarrow s}$ the relative camera pose, and $L$ the 3D object pose (Wei et al., 2022). A minimal differentiable-warping sketch is given after this list.

  • Modeling Dynamic Scenes: Explicit identification of dynamic/static pixels via panoptic segmentation and monocular 3D detection enables separate geometric treatment of moving objects, mitigating violations of the static-world assumption (Wei et al., 2022).
  • Eliminating Scale Ambiguity: Supervision from 3D cuboid detections with real-world size, physics-based priors, semantic size embeddings, or external metric cues anchors the depth scale without external sensors (Wei et al., 2022, Auty et al., 2022, Zhang et al., 18 Mar 2025).
  • Self-Supervision from Simulators and SLAM: Combinations of perfect virtual-world supervision and real-world self-supervision, augmented by domain adaptation, are used to bridge synthetic-to-real gaps (Gurram et al., 2021). Visual-inertial pipelines and teacher-student strategies exploit metric pose estimates from SLAM or proprioceptive sensors to metrically calibrate the network (Choi et al., 2022, Wofk et al., 2023).
  • Ranking-Based Formulations: Plackett-Luce listwise ranking enables training on ordinal relation data alone, with metric depth recovered up to an affine transformation (Lienen et al., 2020); a listwise-loss sketch also follows this list.
  • Diffusion Models: Conditioning denoising diffusion models on RGB and noisy/incomplete depth enables robust handling of sparse or ambiguous regions, uncertainty quantification, and depth inpainting (Saxena et al., 2023).
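The sketch below illustrates the rigid view-synthesis step used for photometric self-supervision: back-project target pixels with the predicted depth, apply the relative pose, re-project with the intrinsics, and bilinearly sample the source image. Tensor shapes, the function name, and the CPU-only tensor creation are assumptions for illustration, not code from the cited papers.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(img_s, depth_t, K, T_t_to_s):
    """img_s: (B,3,H,W), depth_t: (B,1,H,W), K: (B,3,3), T_t_to_s: (B,4,4)."""
    B, _, H, W = depth_t.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project: X_t = D_t(p) * K^{-1} p_bar
    cam_t = torch.linalg.inv(K) @ pix * depth_t.view(B, 1, -1)
    cam_t_h = torch.cat([cam_t, torch.ones(B, 1, H * W)], dim=1)  # homogeneous coords
    cam_s = (T_t_to_s @ cam_t_h)[:, :3]                           # rigid transform into source frame
    proj = K @ cam_s                                              # re-project with intrinsics
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                # perspective divide

    # Normalize to [-1, 1] and bilinearly sample the source image.
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_s, grid, align_corners=True, padding_mode="border")

# Photometric reconstruction loss against the real target image:
# loss = (warp_source_to_target(img_s, depth_t, K, T) - img_t).abs().mean()
```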
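For the ranking-based formulation, the following is a sketch of a Plackett-Luce listwise negative log-likelihood over sampled pixels; it assumes (as an illustration, not the cited paper's code) that the network's per-pixel scores have already been gathered and sorted by the ground-truth ordinal depth.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """scores: (B, K) predicted scores sorted by ground-truth rank (best first)."""
    # log P(ranking) = sum_i [ s_i - logsumexp(s_i, ..., s_K) ]
    # Reverse, cumulative-logsumexp, reverse back gives the suffix logsumexp.
    suffix_lse = torch.flip(
        torch.logcumsumexp(torch.flip(scores, dims=(-1,)), dim=-1), dims=(-1,))
    return -(scores - suffix_lse).sum(dim=-1).mean()

# Usage: sample K pixels per image, order them by ground-truth ordinal depth,
# gather the network's scores at those pixels, and minimize the NLL.
print(plackett_luce_nll(torch.randn(2, 16)))
```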

4. Evaluation Metrics, Results, and Ablation Studies

MDE models are compared primarily on indoor (NYU Depth V2, SUN RGB-D) and outdoor (KITTI, Make3D) datasets using the following metrics (an illustrative implementation sketch follows the list):

  • Error Metrics: Absolute/Squared Relative Error (AbsRel/SqRel), RMSE, RMSE log, log10 error
  • Accuracy Metrics: $\delta < t$, the percentage of pixels with $\max(\hat{D}_i/D_i, D_i/\hat{D}_i) < t$, commonly for $t = 1.25, 1.25^2, 1.25^3$
  • Edge/Boundary Metrics: Boundary precision/recall (F1), especially for sharpness-oriented methods (Yang et al., 2021)
  • Scale-Invariance: Many evaluations align predictions via per-image or dataset medians to compensate for global scale ambiguity in standard self-supervised methods (Bhoi, 2019)
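The sketch below computes these standard error and accuracy metrics over valid pixels, with optional per-image median scaling for scale-ambiguous predictions; the function name, flat array inputs, and synthetic example data are assumptions for illustration.

```python
import numpy as np

def depth_metrics(pred, gt, median_scale=False):
    """pred, gt: 1-D arrays of valid (positive) depths in metres."""
    if median_scale:  # align global scale, as done for self-supervised models
        pred = pred * np.median(gt) / np.median(pred)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = {f"delta<1.25^{k}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse,
            "RMSE_log": rmse_log, "log10": log10, **deltas}

# Hypothetical evaluation on pixels with valid ground truth:
gt = np.random.uniform(0.5, 10.0, 1000)
pred = gt * np.random.uniform(0.9, 1.1, 1000)
print(depth_metrics(pred, gt, median_scale=True))
```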

State-of-the-art supervised models achieve RMSE < 0.2 m on NYU Depth V2 and < 2 m on KITTI (Ma et al., 2022, Shao et al., 2023). Self-supervised and domain-adaptive approaches yield similar performance when equipped with physics or semantic priors (Wei et al., 2022, Gurram et al., 2021). Ablations consistently demonstrate that explicit handling of dynamic objects, semantic guidance, and multi-scale context lead to significant quantitative improvements (Wei et al., 2022, Yang et al., 2021, Shao et al., 2023).

Diffusion-based models achieve REL = 0.074 on NYU and are competitive on KITTI, with added benefits in uncertainty estimation and text-to-3D reconstruction (Saxena et al., 2023).

5. Key Innovations and Theoretical Advances

  • Explicit Dynamic Modeling: Separating object and background pixels via 3D detection and warping resolves photometric artifacts and depth errors in scenes with moving agents (Wei et al., 2022).
  • Soft-Weighted-Sum and Ordinal Inference: Interpreting depth prediction as a probabilistic dense labeling task with soft inference bridges classification and regression, reducing quantization error (Li et al., 2017, Lienen et al., 2020); see the inference sketch after this list.
  • Plane-Aware Decoders: Hybrid architectures that decompose scenes into piecewise planes via normal-distance heads, while retaining data-driven heads, achieve robust results across both planar and non-planar regions (Shao et al., 2023).
  • Vision-Language and Physics Priors: Integration of camera geometry-based depth priors and language-based semantic cues further constrains monocular inference, enabling metric estimation in challenging road environments (Zhang et al., 18 Mar 2025, Auty et al., 2022).
  • Domain Adaptation: Feature-level adversarial alignment allows mixing simulated and real data or semantic and depth supervision from independent datasets (Gurram et al., 2018, Gurram et al., 2021).
  • Uncertainty and Multimodal Outputs: Models such as diffusion nets generate multimodal predictive posteriors, yielding explicit depth uncertainty and enabling downstream tasks such as inpainting and text-to-3D (Saxena et al., 2023).
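The soft-weighted-sum idea can be sketched as follows: the network predicts per-pixel logits over discretized depth bins, and the final depth is the probability-weighted mean of the bin centres. The log-spaced bins, depth range, and function name here are illustrative assumptions rather than the exact parameterization of the cited works.

```python
import math
import torch

def soft_weighted_sum_depth(logits: torch.Tensor,
                            d_min: float = 0.5,
                            d_max: float = 10.0) -> torch.Tensor:
    """logits: (B, num_bins, H, W) per-pixel scores over discretized depth bins."""
    num_bins = logits.shape[1]
    # Bin centres spaced uniformly in log-depth (a common discretization choice).
    centres = torch.exp(torch.linspace(math.log(d_min), math.log(d_max), num_bins))
    probs = torch.softmax(logits, dim=1)                    # per-pixel distribution over bins
    return (probs * centres.view(1, -1, 1, 1)).sum(dim=1)   # (B, H, W) expected depth

print(soft_weighted_sum_depth(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 8, 8])
```

Compared with a hard arg-max over bins, the expectation varies smoothly with the predicted distribution and avoids quantization artifacts.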

6. Open Challenges and Future Directions

Open research challenges include:

  • Resolving Global Scale Ambiguity: Although progress has been made using metric cues and object size priors, robust scale estimation in novel environments remains difficult, especially in the absence of known semantic anchors or inertial signals (Wei et al., 2022, Wofk et al., 2023).
  • Handling Non-Static and Non-Rigid Scenes: Scenes with independently moving or deformable objects challenge existing self-supervised and photometric pipelines (Wei et al., 2022).
  • Domain Generalization: Methods must close the performance gap between synthetic and real data, and self-supervised adaptation for generalizing across indoor/outdoor and day/night domains is an active area (Gurram et al., 2021, Wofk et al., 2023).
  • Temporal Consistency: Temporal models or regularization enforcing geometric consistency across frames could further stabilize predictions and improve 3D reconstruction (Xu et al., 2018).
  • Model Efficiency and Scalability: Balancing model size and computational demands for real-time deployment in edge devices remains a key consideration, with compact architectures and lightweight decoders showing strong potential (Ma et al., 2022).
  • Uncertainty, Multi-Task Learning, and Multimodality: Integrating uncertainty quantification, probabilistic outputs, and joint learning of related tasks (normals, semantics, flow) can enhance interpretability and fusion with other perception systems (Saxena et al., 2023).

Continued progress in monocular depth estimation will likely require hybrid approaches combining geometric, semantic, linguistic, and physics-based priors, leveraging both vast simulated data and carefully designed real-world cues, while scaling efficiently to the requirements of modern robotic and vision systems.
