Monocular Depth Estimation Models

Updated 22 December 2025
  • Monocular depth estimation (MDE) is the task of predicting dense depth maps from a single RGB image, despite its inherently ill-posed nature.
  • Recent advancements span encoder–decoder CNNs, vision transformer hybrids, and multimodal frameworks that integrate geometric consistency, self-supervision, and domain adaptation.
  • Practical implementations include lightweight on-device models, robust error diagnosis with uncertainty quantification, and adversarial training to enhance real-world deployment.

Monocular depth estimation (MDE) refers to the prediction of a dense depth map from a single RGB image. Unlike stereo or multi-view depth methods, MDE lacks explicit geometric constraints from multiple viewpoints, making it a fundamentally ill-posed inverse problem. Recent research has developed a diverse ecosystem of learning-based models, spanning fully supervised, self-supervised, hybrid, and multimodal approaches. Progress is driven by innovations in architecture, loss formulations, pretraining strategies, geometric consistency enforcement, on-device deployment, uncertainty quantification, robustness, domain adaptation, and training data regimes.

1. Architectural Families and Learning Paradigms

State-of-the-art MDE models fall into several broad architectural and training categories:

Encoder–Decoder CNN and ViT Architectures

  • ResNet-Based CNNs: Standard backbone for early and mid-generation MDE, e.g., DORN, BTS, and Monodepth2, often featuring skip connections and multi-scale decoders (Liu et al., 2019, Chawla et al., 2021, Gurram et al., 2021); a minimal sketch of this encoder–decoder pattern follows this list.
  • Vision Transformer Hybrids: Recent models such as METER integrate MobileNetV2-style convolutional encoders with transformer-based “METER blocks,” yielding strong accuracy–latency tradeoffs on embedded platforms via transformer-enhanced spatial reasoning (Papa et al., 13 Mar 2024).
  • Pure ViTs and “MetaFormer” Paradigms: Lightweight “pooling-mixer” transformers follow the MetaFormer abstraction, replacing self-attention with inexpensive pooling-based token mixing to maximize efficiency (Cirillo et al., 19 Sep 2025).
  • Full-Transformer Decoders: Models such as PixelFormer maximize global context, allowing each patch to attend to all others, at the cost of quadratic complexity (Cirillo et al., 19 Sep 2025).
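
The ResNet-based encoder–decoder pattern referenced in the first bullet can be made concrete with a minimal sketch. This is an illustrative toy model (names, channel widths, and output resolution are arbitrary assumptions), not the architecture of DORN, BTS, or Monodepth2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class TinyDepthNet(nn.Module):
    """Hypothetical ResNet-18 encoder with a skip-connected upsampling decoder."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)                  # 1/2, 64 ch
        self.layer1 = nn.Sequential(r.maxpool, r.layer1)                   # 1/4, 64 ch
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4
        self.up4 = nn.Conv2d(512 + 256, 256, 3, padding=1)
        self.up3 = nn.Conv2d(256 + 128, 128, 3, padding=1)
        self.up2 = nn.Conv2d(128 + 64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        s1 = self.stem(x)                                                  # 1/2
        s2 = self.layer1(s1)                                               # 1/4
        s3 = self.layer2(s2)                                               # 1/8
        s4 = self.layer3(s3)                                               # 1/16
        s5 = self.layer4(s4)                                               # 1/32
        d = F.relu(self.up4(torch.cat([F.interpolate(s5, scale_factor=2), s4], 1)))
        d = F.relu(self.up3(torch.cat([F.interpolate(d, scale_factor=2), s3], 1)))
        d = F.relu(self.up2(torch.cat([F.interpolate(d, scale_factor=2), s2], 1)))
        return torch.sigmoid(self.head(d))      # normalized inverse depth at 1/4 resolution

depth = TinyDepthNet()(torch.randn(1, 3, 192, 640))    # -> (1, 1, 48, 160)
```

Real systems add multi-scale prediction heads and more careful upsampling, but the skip-connected decoder is the common core.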

Multimodal and Cross-Modal Frameworks

  • LLM-MDE: Cross-modal reprogramming aligns ViT patch features with LLM text prototypes, leveraging frozen LLMs for few-shot and zero-shot MDE via adaptive prompt design; this enables dense “vision-as-language” inference with minimal supervision (Xia et al., 2 Sep 2024).
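
The cross-modal reprogramming idea can be sketched as a small trainable adapter between a frozen ViT and a frozen LLM. The module below is a hypothetical illustration (the name CrossModalReprogramming, the prototype count, and all dimensions are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class CrossModalReprogramming(nn.Module):
    """Sketch: re-express frozen ViT patch features as mixtures of learned text prototypes."""
    def __init__(self, vit_dim=768, llm_dim=4096, num_prototypes=32):
        super().__init__()
        # Learned prototypes standing in for embeddings of depth-related words/phrases.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, llm_dim) * 0.02)
        self.project = nn.Linear(vit_dim, llm_dim)     # only this small adapter is trained

    def forward(self, patch_feats):                    # (B, N_patches, vit_dim)
        q = self.project(patch_feats)                  # (B, N, llm_dim)
        attn = torch.softmax(q @ self.prototypes.T / q.shape[-1] ** 0.5, dim=-1)
        # Each patch becomes a soft mixture of text prototypes, which can then be fed
        # to the frozen LLM alongside an image-driven instruction prompt.
        return attn @ self.prototypes                  # (B, N, llm_dim)

tokens = CrossModalReprogramming()(torch.randn(2, 196, 768))
```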

Classical and Biologically Inspired Cue Fusion

  • Semantic and Size Priors: Incorporating semantic segmentation, language embeddings (e.g., GloVe vectors), and real-world object-size priors mimics biological vision cues (relative size, familiar size, absolute scale), systematically improving accuracy (Auty et al., 2022).

2. Geometric Consistency, Self-Supervision, and Adversarial Learning

View-Consistent Supervision

  • AVCL (Adversarial View-Consistent Learning): Predicts depth such that, after differentiable SE(3) warping to multiple adversarially-sampled poses, the warped prediction remains consistent with ground-truth geometry across all views, enforced through a combination of source and view-consistency losses (Liu et al., 2019).
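
The differentiable SE(3) warping underlying such view-consistency losses is the standard back-project, transform, and re-project primitive. The helper below is a generic sketch (not AVCL's exact loss), assuming a pinhole camera with intrinsics K and a relative pose T:

```python
import torch

def warp_with_depth(depth, K, T):
    """Back-project depth (B,1,H,W) with intrinsics K (B,3,3), apply an SE(3) pose
    T (B,4,4), and return reprojected pixel coordinates in [-1, 1] for grid_sample."""
    B, _, H, W = depth.shape
    dev = depth.device
    ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                            torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()            # (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                             # (B,3,HW)
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)                 # (B,3,HW)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], dim=1)   # (B,4,HW)
    proj = K @ (T @ cam_h)[:, :3]                                          # (B,3,HW)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    u = uv[:, 0] / (W - 1) * 2 - 1                                         # normalize to [-1,1]
    v = uv[:, 1] / (H - 1) * 2 - 1
    return torch.stack([u, v], dim=-1).view(B, H, W, 2)

# Usage sketch:
#   grid = warp_with_depth(pred_depth, K, sampled_pose)
#   warped = torch.nn.functional.grid_sample(other_view, grid, align_corners=True)
```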

Self-Supervised and SfM-Enhanced Pipelines

  • Structure-from-Motion (SfM) and View Synthesis: Self-supervised MDE exploits view synthesis using photometric reconstruction losses, often paired with pose networks, automasking for degenerate regions, and edge-aware smoothness penalties (a sketch of this photometric objective follows this list). MonoDEVSNet augments real-world SfM with high-fidelity synthetic supervision and domain adaptation via gradient reversal (Gurram et al., 2021).
  • Directional Consistency and Stereo-Temporal Fusion: Jointly enforcing stereo and structure-from-motion cues (i.e., temporal/pose and binocular disparity) via differentiable warping yields stronger geometric constraints (Truetsch et al., 2019).
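
A minimal version of that photometric objective (SSIM plus L1 with Monodepth2-style per-pixel-minimum automasking) looks roughly like this; it assumes the source views have already been warped into the target frame, e.g. with the warping helper sketched above:

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified SSIM over 3x3 windows, as commonly used in self-supervised MDE losses."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sxy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(pred, target, alpha=0.85):
    return alpha * ssim(pred, target).mean(1, keepdim=True) + \
           (1 - alpha) * (pred - target).abs().mean(1, keepdim=True)

def reprojection_loss(warped_views, unwarped_views, target):
    """Per-pixel minimum over reprojection errors; including the errors of the *unwarped*
    source frames implements Monodepth2-style automasking of static/degenerate pixels."""
    errs = [photometric_error(w, target) for w in warped_views]
    errs += [photometric_error(s, target) for s in unwarped_views]   # identity reprojection
    return torch.cat(errs, dim=1).min(dim=1, keepdim=True).values.mean()
```

An edge-aware smoothness penalty on the predicted disparity is usually added on top of this term.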

Synthetic and Domain-Adapted Pretraining

  • Virtual-World Datasets: Large-scale synthetic pretraining (e.g., MineNavi, Virtual KITTI) combined with careful domain gap minimization closes much of the gap to fully supervised approaches and accelerates convergence in real-world fine-tuning (Wang et al., 2020, Gurram et al., 2021).
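
The gradient-reversal domain adaptation used in such synthetic-to-real pipelines can be sketched with a custom autograd function and a small domain classifier; the discriminator architecture here is purely illustrative:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainDiscriminator(nn.Module):
    """Illustrative classifier predicting synthetic vs. real from pooled encoder features."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feats, lam=1.0):             # feats: (B, dim), globally pooled
        return self.net(GradReverse.apply(feats, lam))

# Training the encoder through this head pushes it toward domain-invariant features,
# because the reversed gradients encourage it to *confuse* the discriminator.
logits = DomainDiscriminator()(torch.randn(4, 512))
```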

3. Model Efficiency, On-Device Adaptation, and Real-World Deployment

Lightweight and Embedded Architectures

  • METER Family: Real-time MDE solutions for microcontroller and embedded GPU platforms, supporting dynamic adjustment of the trade-off between network depth, speed, and memory footprint (e.g., METER S/XS/XXS variants) (Papa et al., 13 Mar 2024).
  • μPyD-Net and On-Device Learning (ODL): Tiny (≤0.1M parameter) models enable on-device retraining using ultra-low power MCUs, compensating for domain shift by sparsely updating only the final decoder layers with pseudo-labels from an auxiliary depth sensor (“memory-driven sparse update”) (Nadalini et al., 26 Nov 2025).
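
A hypothetical version of such a memory-driven sparse update is shown below: the backbone stays frozen and only the last decoder layers are optimized against sparse pseudo-labels from an auxiliary depth sensor. Function and argument names are assumptions, not the paper's API:

```python
import torch

def sparse_decoder_update(model, loader, last_decoder_layers, lr=1e-4, steps=100):
    """On-device adaptation sketch: freeze the whole network, update only the final
    decoder layers against pseudo-labels (sparse depth + validity mask) from a sensor."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = []
    for layer in last_decoder_layers:              # e.g. [model.up2, model.head]
        for p in layer.parameters():
            p.requires_grad_(True)
            trainable.append(p)
    opt = torch.optim.SGD(trainable, lr=lr)        # SGD keeps optimizer state tiny on an MCU
    for _, (image, sparse_depth, valid) in zip(range(steps), loader):
        pred = model(image)                        # assumes output matches label resolution
        loss = ((pred - sparse_depth).abs() * valid).sum() / valid.sum().clamp(min=1)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```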

Data Efficiency and Auxiliary Tasks

  • Multi-Source Auxiliary Supervision: Training with auxiliary segmentation/classification datasets using alternating-step schemes and a shared decoder, especially with multi-label dense classification (MLDC) as an auxiliary task, boosts accuracy and data efficiency by ~11–22%, often reducing depth label requirements by 80–99% (Quercia et al., 22 Jan 2025).
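
One plausible form of such an alternating-step scheme is sketched below: a shared encoder and decoder with task-specific heads, trained on one depth batch and one auxiliary batch per iteration. The function signature is hypothetical:

```python
import torch

def alternating_training_step(encoder, decoder, depth_head, aux_head,
                              depth_batch, aux_batch, optimizer,
                              depth_loss_fn, aux_loss_fn):
    """One pair of alternating steps: a depth step on labeled depth data, then an
    auxiliary step (e.g. multi-label dense classification) through the same shared
    encoder/decoder, so the shared features benefit from both label sources."""
    # Step A: depth supervision.
    img, depth_gt = depth_batch
    loss_d = depth_loss_fn(depth_head(decoder(encoder(img))), depth_gt)
    optimizer.zero_grad()
    loss_d.backward()
    optimizer.step()

    # Step B: auxiliary dense-classification supervision on a different dataset.
    img_aux, labels = aux_batch
    loss_a = aux_loss_fn(aux_head(decoder(encoder(img_aux))), labels)
    optimizer.zero_grad()
    loss_a.backward()
    optimizer.step()
    return loss_d.item(), loss_a.item()
```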

4. Uncertainty Quantification and Error Diagnosis

Deterministic Depth-Probability Volumes

  • Entropy-Based Uncertainty: Viewing MDE as per-pixel depth classification yields a probability volume from which uncertainty can be extracted via (scaled) Shannon entropy, with ordinal- and uncertainty-aware regularization to ensure uncertainty is correlated with true errors. Spearman rank correlation is advocated as the primary metric, decoupling accuracy from reliability (Xiang et al., 2023).
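
Treating MDE as per-pixel classification over depth bins makes the entropy-based uncertainty straightforward to compute; the sketch below shows the generic recipe (bin layout and scaling are assumptions):

```python
import torch

def depth_and_uncertainty(logits, depth_bins):
    """Treat MDE as per-pixel classification over discrete depth bins.
    logits: (B, K, H, W); depth_bins: (K,) bin centers.
    Returns expected depth and normalized Shannon entropy as uncertainty."""
    probs = torch.softmax(logits, dim=1)                               # (B,K,H,W)
    depth = (probs * depth_bins.view(1, -1, 1, 1)).sum(dim=1)          # soft-argmax depth
    entropy = -(probs * torch.log(probs.clamp(min=1e-8))).sum(dim=1)   # Shannon entropy
    uncertainty = entropy / torch.log(torch.tensor(float(len(depth_bins))))   # scale to [0,1]
    return depth, uncertainty

# Reliability check: rank-correlate per-pixel uncertainty with absolute depth error,
# e.g. with scipy.stats.spearmanr on the flattened maps.
d, u = depth_and_uncertainty(torch.randn(1, 64, 96, 320), torch.linspace(1.0, 80.0, 64))
```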

Error Detection and Correction

  • DEDN/DECN: Depth Error Detection Networks produce spatial error maps (under/over/correct) per pixel, providing actionable diagnostics for robotics and AR. Simple post-hoc correction networks incrementally adjust estimated depths where confident error is detected, yielding systematic improvements in structured scene errors (e.g., plane boundaries) (Chawla et al., 2021).
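
The detection task itself reduces to per-pixel three-way classification; the head and label construction below are an illustrative sketch, not the paper's exact DEDN architecture:

```python
import torch
import torch.nn as nn

class ErrorDetectionHead(nn.Module):
    """Illustrative per-pixel classifier over {underestimated, correct, overestimated},
    consuming the RGB image and the estimated depth map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 1))                          # 3 classes per pixel

    def forward(self, image, est_depth):
        return self.net(torch.cat([image, est_depth], dim=1))

def error_targets(est_depth, gt_depth, tol=0.1):
    """Label a pixel under/over-estimated when the relative error exceeds `tol`."""
    rel = (est_depth - gt_depth) / gt_depth.clamp(min=1e-6)
    target = torch.ones_like(rel, dtype=torch.long)       # 1 = correct
    target[rel < -tol] = 0                                # 0 = underestimated
    target[rel > tol] = 2                                 # 2 = overestimated
    return target.squeeze(1)                              # (B, H, W)

# Train with nn.CrossEntropyLoss()(head(image, est_depth), error_targets(est_depth, gt_depth))
```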

5. Robustness and Security under Adversarial Threats

Adversarial Attacks

  • 3D²Fool Physical Attacks: Optimization of object-wide, viewpoint-robust adversarial 3D textures dramatically outperforms classical 2D patches, causing up to 12.75m mean depth error and affecting up to half the vehicle pixels under arbitrary viewing/weather conditions, with replicable >10m physical errors in real camera captures (Zheng et al., 26 Mar 2024).

Self-Supervised Adversarial Hardening

  • View-Synthesis-Based Defense: Directly embedding adversarial training (using L0-norm-constrained patches) in the view-synthesis self-supervision loop trains the model to restore photometric/geometric consistency even under physical attack, achieving >90% reduction in adversarial error with negligible loss of benign accuracy, and substantially outperforming generic contrastive (SimSiam) or supervised-pseudo approaches in both white-box and transfer/physical regimes (Cheng et al., 9 Jun 2024).
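
A simplified version of embedding patch-based adversarial training in the self-supervised loop is sketched below: an inner loop maximizes the view-synthesis loss over a perturbation restricted to a small pixel region (a stand-in for the L0 constraint), and the outer step trains the model on the patched images. The selfsup_loss callable and all hyperparameters are assumptions:

```python
import torch

def adversarial_selfsup_step(model, selfsup_loss, images, patch_mask,
                             opt, attack_steps=5, eps=0.1):
    """Inner loop: craft an additive perturbation confined to `patch_mask` that maximizes
    the self-supervised view-synthesis loss. Outer step: train the model on the result."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(attack_steps):                              # inner maximization (PGD-like)
        loss = selfsup_loss(model, (images + delta * patch_mask).clamp(0, 1))
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + eps * grad.sign()).clamp(-1, 1).detach().requires_grad_(True)
    # Outer minimization: restore photometric/geometric consistency under attack.
    adv_images = (images + delta.detach() * patch_mask).clamp(0, 1)
    loss = selfsup_loss(model, adv_images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```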

6. Foundation Models, Knowledge Distillation, and Generalization

Large-Scale Distillation

  • Cross-Context Distillation: Integrating both global and local (crop-based) pseudo-labels from teacher models during training (e.g., combining full-image scene consistency and fine-grained patch detail) yields stronger students than either regime alone. Multi-teacher distillation, including diffusion-based and encoder–decoder teachers, further enhances quality and reduces teacher bias (He et al., 26 Feb 2025).
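
The global/local combination can be sketched as two distillation terms, one against the teacher's full-image prediction and one against its prediction on a random crop. This toy version omits the scale/shift alignment that affine-invariant depth distillation typically requires, and assumes both networks predict at input resolution:

```python
import torch
import torch.nn.functional as F

def cross_context_distillation_loss(student, teacher, image, crop_size=256):
    """Distill from the teacher's full-image pseudo-label (global context) and from its
    prediction on a random crop (local detail), combining both losses."""
    with torch.no_grad():
        global_pseudo = teacher(image)                          # full-image pseudo-label
    pred = student(image)
    loss_global = F.l1_loss(pred, global_pseudo)

    # Local branch: the teacher sees only the crop, which sharpens fine detail there.
    B, _, H, W = image.shape
    top = torch.randint(0, H - crop_size + 1, (1,)).item()
    left = torch.randint(0, W - crop_size + 1, (1,)).item()
    crop = image[:, :, top:top + crop_size, left:left + crop_size]
    with torch.no_grad():
        local_pseudo = teacher(crop)
    loss_local = F.l1_loss(pred[:, :, top:top + crop_size, left:left + crop_size],
                           local_pseudo)
    return loss_global + loss_local
```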

Multimodal Prompting and Cross-Modal Alignment

  • LLM-MDE: Frozen LLMs, equipped with lightweight cross-modal adapters and image-driven prompts, achieve competitive performance in few- and zero-shot MDE, with >95% parameter freezing and rapid adaptability across new domains/scenes (Xia et al., 2 Sep 2024).

7. Explainability, XAI, and Model Interpretation

Feature Attribution and Attribution Fidelity

  • Saliency and Integrated Gradients: Saliency maps provide robust global attribution in lightweight networks, while integrated gradients are more discriminative in deep transformer models. Attention rollout methods, though effective in classifier ViTs, fail to rank critical pixels in MDE. Attribution Fidelity (AF), a normalized difference of perturbation sensitivity between top- and bottom-ranked pixels, reliably diagnoses when an attribution method provides meaningful explanations, in contrast to classical AE or Faithfulness Estimate metrics (Cirillo et al., 19 Sep 2025).
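
One plausible reading of the Attribution Fidelity definition above is sketched below: perturb the top-ranked and bottom-ranked pixels separately and compare how strongly the model's output responds. The noise model, ranking fraction, and normalization are assumptions; the paper's exact formulation may differ:

```python
import torch

def attribution_fidelity(model, image, attribution, frac=0.05):
    """AF-style score: `attribution` is a per-pixel map broadcastable against `image`.
    Returns ~1 when perturbing the top-attributed pixels changes the depth output far
    more than perturbing the bottom-attributed ones."""
    with torch.no_grad():
        base = model(image)
        flat = attribution.flatten()
        k = max(1, int(frac * flat.numel()))
        top_idx = flat.topk(k).indices
        bot_idx = (-flat).topk(k).indices

        def perturb(idx):
            mask = torch.zeros_like(flat)
            mask[idx] = 1.0
            noisy = image + 0.1 * torch.randn_like(image) * mask.view_as(attribution)
            return (model(noisy) - base).abs().mean()

        d_top, d_bot = perturb(top_idx), perturb(bot_idx)
        return (d_top - d_bot) / (d_top + d_bot + 1e-8)
```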
