Monocular Depth Estimation Overview
- Monocular depth estimation is the process of predicting a dense depth map from a single image, despite inherent scale and projection ambiguities.
- It leverages deep convolutional, transformer, and hybrid architectures to fuse multiscale, semantic, and geometric cues for accurate scene reconstruction.
- Emerging methods integrate supervised, self-supervised, and physics-based strategies to tackle challenges in dynamic scenes, domain adaptation, and uncertainty estimation.
Monocular depth estimation is the task of predicting a dense depth map from a single RGB image, recovering for each pixel the distance to the scene surface. The problem is fundamentally ill-posed due to perspective projection ambiguities and the loss of scale information. It is central to robotic vision, autonomous driving, 3D scene understanding, and many downstream applications including AR/VR, semantic reconstruction, and SLAM. The field has progressed rapidly, with deep convolutional and transformer-based architectures driving advances in both supervised and self-supervised paradigms. This article provides a comprehensive summary of the scientific foundations, architectural innovations, evaluation methodologies, remaining challenges, and future directions in monocular depth estimation.
1. Problem Formulation and Ill-Posedness
The goal of monocular depth estimation (MDE) is, given a single RGB image $I \in \mathbb{R}^{H \times W \times 3}$, to predict a dense per-pixel depth map $D \in \mathbb{R}^{H \times W}$ (Bhoi, 2019). Ambiguity arises because infinitely many 3D scene configurations are consistent with any given 2D projection, leading to scale, viewpoint, and shape ambiguities.
Classical geometric approaches rely on multi-view or stereo cues to resolve these ambiguities. In monocular form, human or machine vision must instead exploit learned priors, local and global scene context, semantic cues, object size knowledge, and surface orientation assumptions. Formally, the mapping $f_\theta: I \mapsto D$ is learned from a dataset $\{(I_k, D_k^*)\}_{k=1}^{N}$, typically by minimizing a loss such as the scale-invariant log-MSE

$$ L(D, D^*) = \frac{1}{n} \sum_{i} d_i^2 - \frac{\lambda}{n^2} \Big( \sum_{i} d_i \Big)^2, \qquad d_i = \log D_i - \log D_i^*, $$

with the sums taken over the $n$ valid pixels. This loss compensates for global scale ambiguity and is widely used in direct regression schemes (Bhoi, 2019).
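This loss admits a compact implementation. A minimal PyTorch-style sketch follows (the function name, masking convention, and default $\lambda = 0.5$ are illustrative assumptions, not taken from any cited work); setting lam=1 makes the loss fully invariant to a global multiplicative scale on depth:

```python
import torch

def scale_invariant_log_mse(pred, target, mask, lam=0.5, eps=1e-6):
    """Scale-invariant log-MSE over valid pixels.

    pred, target: (B, H, W) depth maps in metres
    mask:         (B, H, W) boolean mask of pixels with valid ground truth
    lam:          weight of the variance-reducing term (lam=1: fully scale-invariant)
    """
    d = torch.log(pred.clamp(min=eps)) - torch.log(target.clamp(min=eps))
    d = d[mask]                      # keep only supervised pixels
    n = d.numel()
    return (d ** 2).mean() - lam * (d.sum() ** 2) / (n ** 2)
```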
Ambiguity in absolute scale is a critical issue: monocular pipelines can only recover depth up to an unknown factor unless external cues (object size priors, inertial measurements, or semantic anchors) are incorporated (Wei et al., 2022, Choi et al., 2022, Wofk et al., 2023, Zhang et al., 18 Mar 2025).
2. Network Architectures and Representation Learning
Current MDE models leverage various deep learning architectures and output parameterizations to encode multiscale, semantic, and geometric information.
- Backbones: Standard encoder-decoder or U-Net topologies with ResNet (Li et al., 2017, Wei et al., 2022, Sagar, 2020), DenseNet (Ma et al., 2022), HRNet (Gurram et al., 2021), SE-Net (Yang et al., 2021), Swin (Shao et al., 2023), and transformers (Zhang et al., 18 Mar 2025) are common.
- Multiscale and Hierarchical Fusion: Exploiting multi-resolution side-outputs with hierarchical or CRF-based fusion yields improved local-global reasoning (Li et al., 2017, Xu et al., 2018). Dilated convolutions and skip connections further enlarge the receptive field and preserve spatial detail.
- Non-Regression Outputs: Casting depth as multi-category dense labeling with soft-weighted-sum inference mitigates quantization errors (Li et al., 2017; see the sketch after this list). Ordinal regression and probabilistic ranking leverage ordering relations and scale-agnostic supervision (Lienen et al., 2020).
- Specialized Modules: Scene understanding (SU) and scale transform modules explicitly aggregate global and boundary features for sharpness (Yang et al., 2021). Physics-inspired heads predict surface normals and signed distances for plane-consistent depth (Shao et al., 2023). Vision-language fusion encodes geometric priors and scene semantics (Zhang et al., 18 Mar 2025).
- Domain Fusion: Hybrid pipelines integrate feature-level or task-level transfer from simulated to real domains, often using domain adaptation modules such as gradient reversal layers (Gurram et al., 2021).
- Lightweight Inference: Efficient models replace heavyweight decoders with convolutional upsampling and compact backbones to facilitate edge deployment (Ma et al., 2022).
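As referenced in the non-regression bullet above, soft-weighted-sum inference turns per-pixel class scores over discretized depth bins into a continuous estimate by taking the expectation over bin centres. A minimal sketch, assuming log-spaced bins and illustrative names (not the exact discretization of Li et al., 2017):

```python
import math
import torch

def soft_weighted_sum_depth(logits, d_min=0.5, d_max=10.0):
    """Convert per-pixel logits over K depth bins into continuous depth.

    logits: (B, K, H, W) scores for K depth bins
    Bins are log-spaced between d_min and d_max (an illustrative choice).
    """
    K = logits.shape[1]
    centres = torch.exp(torch.linspace(math.log(d_min), math.log(d_max), K,
                                       device=logits.device))
    probs = torch.softmax(logits, dim=1)                   # per-pixel bin probabilities
    depth = (probs * centres.view(1, K, 1, 1)).sum(dim=1)  # expectation over bin centres
    return depth                                           # (B, H, W)
```

Because the output is a probability-weighted average rather than the arg-max bin, quantization error shrinks as the predicted distribution concentrates between adjacent bins.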
3. Training Paradigms: Supervision, Self-Supervision, and Cues
- Supervised Learning: Early and contemporary approaches rely on per-pixel metric ground truth from depth sensors or synthetic environments (Li et al., 2017, Gurram et al., 2021). Loss functions include scale-invariant log-space MSE, absolute/relative errors, cross-entropy for classification, and SSIM-augmented regression.
- Self-Supervision via Photometric Consistency: Leveraging multi-view sequences, pose estimation, and differentiable warping, models are trained with photometric reconstruction losses. These methods perform explicit view synthesis using predicted depth and relative camera poses, $p_s \sim K\, T_{t \to s}\, D_t(p_t)\, K^{-1} \tilde{p}_t$, where $K$ is the intrinsic matrix, $T_{t \to s}$ the relative camera pose, $D_t$ the predicted target-frame depth, and, for dynamic pixels, an additional rigid transform accounts for the 3D object pose (Wei et al., 2022). A sketch of the corresponding inverse-warping step follows this list.
- Modeling Dynamic Scenes: Explicit identification of dynamic/static pixels via panoptic segmentation and monocular 3D detection enables separate geometric treatment of moving objects, mitigating violations of the static-world assumption (Wei et al., 2022).
- Eliminating Scale Ambiguity: Supervision from 3D cuboid detections with real-world size, physics-based priors, semantic size embeddings, or external metric cues anchors the depth scale without external sensors (Wei et al., 2022, Auty et al., 2022, Zhang et al., 18 Mar 2025).
- Self-Supervision from Simulators and SLAM: Combinations of perfect virtual-world supervision and real-world self-supervision, augmented by domain adaptation, are used to bridge synthetic-to-real gaps (Gurram et al., 2021). Visual-inertial pipelines and teacher-student strategies exploit metric pose estimates from SLAM or proprioceptive sensors to metrically calibrate the network (Choi et al., 2022, Wofk et al., 2023).
- Ranking-Based Formulations: Plackett-Luce listwise ranking enables training on ordinal relation data alone, with metric depth recovered up to an affine transformation (Lienen et al., 2020).
- Diffusion Models: Conditioning denoising diffusion models on RGB and noisy/incomplete depth enables robust handling of sparse or ambiguous regions, uncertainty quantification, and depth inpainting (Saxena et al., 2023).
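A sketch of the differentiable view synthesis behind photometric self-supervision, as referenced in the bullet above: the target-frame depth back-projects pixels to 3D, the relative pose moves them into the source camera, and the intrinsics reproject them so the source image can be sampled. Tensor layouts and the helper name are assumptions, not the pipeline of any cited work; dynamic pixels would additionally receive per-object transforms.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, depth_t, K, K_inv, T_t_to_s):
    """Synthesize the target view by inverse-warping the source image.

    src_img:   (B, 3, H, W) source frame
    depth_t:   (B, 1, H, W) predicted depth of the target frame
    K, K_inv:  (B, 3, 3) camera intrinsics and their inverse
    T_t_to_s:  (B, 4, 4) relative pose from target to source camera
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device
    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project to 3D points in the target camera frame.
    cam_t = (K_inv @ pix) * depth_t.view(B, 1, -1)
    cam_t = torch.cat([cam_t, torch.ones(B, 1, H * W, device=device)], dim=1)
    # Move into the source camera frame and project with the intrinsics.
    cam_s = (T_t_to_s @ cam_t)[:, :3]
    proj = K @ cam_s
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # Normalize coordinates to [-1, 1] and sample the source image.
    grid_u = 2 * uv[:, 0] / (W - 1) - 1
    grid_v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([grid_u, grid_v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)
```

The photometric loss then compares the warped source against the true target image, typically via an L1 + SSIM mixture, masking pixels that violate the static-scene assumption.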
4. Evaluation Metrics, Results, and Ablation Studies
MDE models are compared primarily on indoor (NYU Depth V2, SUN RGB-D) and outdoor (KITTI, Make3D) datasets using the following metrics:
- Error Metrics: Absolute/Squared Relative Error (AbsRel/SqRel), RMSE, RMSE log, log10 error
- Accuracy Metrics: threshold accuracy $\delta_k$, the percentage of pixels with $\max(D_i / D_i^*,\, D_i^* / D_i) < 1.25^k$, commonly reported for $k \in \{1, 2, 3\}$ (a computation sketch follows this list)
- Edge/Boundary Metrics: Boundary precision/recall (F1), especially for sharpness-oriented methods (Yang et al., 2021)
- Scale-Invariance: Many evaluations align predictions via per-image or dataset medians to compensate for global scale ambiguity in standard self-supervised methods (Bhoi, 2019)
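As noted in the accuracy-metrics bullet, these quantities are straightforward to compute once predictions are aligned with ground truth. A minimal NumPy sketch with optional per-image median scaling (function and key names are illustrative):

```python
import numpy as np

def depth_metrics(pred, gt, median_scale=True, eps=1e-6):
    """Standard MDE metrics over pixels with valid ground truth.

    pred, gt: 1D arrays of depths (metres) at valid pixels.
    median_scale: align the prediction to ground truth via the ratio of medians,
                  as commonly done for scale-ambiguous self-supervised models.
    """
    if median_scale:
        pred = pred * np.median(gt) / np.median(pred)
    pred = np.clip(pred, eps, None)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse,
            "RMSElog": rmse_log, "delta1": d1, "delta2": d2, "delta3": d3}
```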
State-of-the-art supervised models achieve RMSE < 0.2 m on NYU Depth V2 and < 2 m on KITTI (Ma et al., 2022, Shao et al., 2023). Self-supervised and domain-adaptive approaches yield similar performance when equipped with physics or semantic priors (Wei et al., 2022, Gurram et al., 2021). Ablations consistently demonstrate that explicit handling of dynamic objects, semantic guidance, and multi-scale context lead to significant quantitative improvements (Wei et al., 2022, Yang et al., 2021, Shao et al., 2023).
Diffusion-based models achieve REL = 0.074 on NYU and are competitive on KITTI, with added benefits in uncertainty estimation and text-to-3D reconstruction (Saxena et al., 2023).
5. Key Innovations and Theoretical Advances
- Explicit Dynamic Modeling: Separating object and background pixels via 3D detection and warping resolves photometric artifacts and depth errors in scenes with moving agents (Wei et al., 2022).
- Soft-Weighted-Sum and Ordinal Inference: Interpreting depth prediction as a probabilistic dense labeling task with soft inference bridges classification and regression, reducing quantization error (Li et al., 2017, Lienen et al., 2020).
- Plane-Aware Decoders: Hybrid architectures that decompose scenes into piecewise planes via normal-distance heads, while retaining data-driven heads, achieve robust results across both planar and non-planar regions (Shao et al., 2023); the underlying depth-from-plane relation is given after this list.
- Vision-Language and Physics Priors: Integration of camera geometry-based depth priors and language-based semantic cues further constrains monocular inference, enabling metric estimation in challenging road environments (Zhang et al., 18 Mar 2025, Auty et al., 2022).
- Domain Adaptation: Feature-level adversarial alignment allows mixing simulated and real data or semantic and depth supervision from independent datasets (Gurram et al., 2018, Gurram et al., 2021).
- Uncertainty and Multimodal Outputs: Models such as diffusion nets generate multimodal predictive posteriors, yielding explicit depth uncertainty and enabling downstream tasks such as inpainting and text-to-3D (Saxena et al., 2023).
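The depth-from-plane relation behind normal-distance heads follows from projective geometry (notation here is generic rather than that of Shao et al., 2023): if a pixel with homogeneous coordinates $\tilde{p}$ lies on a plane with unit normal $n$ and origin distance $d$, intersecting its viewing ray $D\, K^{-1} \tilde{p}$ with the plane $n^{\top} X = d$ gives

$$ D(\tilde{p}) = \frac{d}{n^{\top} K^{-1} \tilde{p}}, $$

so a decoder that predicts $(n, d)$ per pixel can be converted to depth with the camera intrinsics, producing exactly planar depth within each predicted planar region.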
6. Open Challenges and Future Directions
Open research challenges include:
- Resolving Global Scale Ambiguity: Although progress has been made using metric cues and object size priors, robust scale estimation in novel environments remains difficult, especially in the absence of known semantic anchors or inertial signals (Wei et al., 2022, Wofk et al., 2023).
- Handling Non-Static and Non-Rigid Scenes: Scenes with independently moving or deformable objects challenge existing self-supervised and photometric pipelines (Wei et al., 2022).
- Domain Generalization: Methods must close the performance gap between synthetic and real data, and self-supervised adaptation for generalizing across indoor/outdoor and day/night domains is an active area (Gurram et al., 2021, Wofk et al., 2023).
- Temporal Consistency: Temporal models or regularization enforcing geometric consistency across frames could further stabilize predictions and improve 3D reconstruction (Xu et al., 2018).
- Model Efficiency and Scalability: Balancing model size and computational demands for real-time deployment in edge devices remains a key consideration, with compact architectures and lightweight decoders showing strong potential (Ma et al., 2022).
- Uncertainty, Multi-Task Learning, and Multimodality: Integrating uncertainty quantification, probabilistic outputs, and joint learning of related tasks (normals, semantics, flow) can enhance interpretability and fusion with other perception systems (Saxena et al., 2023).
Continued progress in monocular depth estimation will likely require hybrid approaches combining geometric, semantic, linguistic, and physics-based priors, leveraging both vast simulated data and carefully designed real-world cues, while scaling efficiently to the requirements of modern robotic and vision systems.