Single-Camera Depth Estimation
- Single-camera depth estimation is the process of inferring dense scene depth from a single image using visual priors, geometric cues, and optics-based methods.
- Techniques span supervised, unsupervised, and hybrid frameworks that integrate calibrated camera parameters, defocus cues, and coded apertures to address scale ambiguity.
- Practical applications include mobile AR, robotics, and medical imaging, while challenges remain in generalization, textureless surfaces, and computational efficiency.
Single-camera depth estimation refers to the inference of dense or sparse scene depth (distance from the camera to every pixel or spatial location) from a single image or video sequence captured by one camera. The problem is fundamentally ill-posed: a monocular 2D projection does not uniquely determine the underlying 3D scene geometry without additional cues or strong statistical priors. Recent advances leverage deep neural networks, geometric constraints, optics-inspired cues, and hybrid learning frameworks to recover per-pixel depth or disparity maps from single-camera data—enabling applications across robotics, augmented reality, mobile photography, and scientific imaging.
1. Problem Formulation and Fundamental Ambiguities
In single-image depth estimation, the goal is to recover a dense, per-pixel depth map $D:\Omega\to\mathbb{R}_{+}$ from an observed image $I$ defined over the pixel domain $\Omega$, i.e., a depth value $D(\mathbf{x})$ for all $\mathbf{x}\in\Omega$ (Mertan et al., 2021). This is severely ill-posed, as multiple 3D scenes can lead to the same image under the perspective projection model. Methods therefore rely on a combination of learned visual priors, geometric cues, physical constraints, parametric models of the camera and optics, or auxiliary signals (e.g., motion, coded illumination).
A crucial ambiguity is the scale problem: metric depth cannot be uniquely recovered from a single view without knowledge of camera calibration (e.g., focal length) or extrinsic scale cues. For example, increasing the focal length and scaling the scene depth by the same factor leaves the projected image unchanged, as shown explicitly by the projective camera equations and demonstrated quantitatively in (He et al., 2018).
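Concretely, under the pinhole projection model the ambiguity can be written out directly (a standard illustration rather than a result of any single cited work): a scene point $(X, Y, Z)$ with depth $Z$ projects to

$$u = f\,\frac{X}{Z}, \qquad v = f\,\frac{Y}{Z},$$

so substituting $f \mapsto \alpha f$ and $Z \mapsto \alpha Z$ for any $\alpha > 0$ leaves $(u, v)$ unchanged, and the image alone constrains depth only up to this unknown scale.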
2. Architectures and Learning Paradigms
Single-camera depth estimation methods can be classified as follows:
- Classical, non-deep approaches: Utilize hand-crafted features (SIFT, texture gradients, vanishing lines), shape-from-shading, or defocus, combined with Markov random fields or global optimization (Mertan et al., 2021). These generally lack robustness across diverse scenes and camera configurations.
- Supervised deep learning: Encoder–decoder architectures (ResNet/UNet backbones, Vision Transformers, spatial pyramid pooling) are trained with ground-truth metric depths from RGB-D datasets or LiDAR (Aleotti et al., 2020, Ignatov et al., 2021, Zhang et al., 2022, He et al., 2018). Common losses include L1/L2 on depth, the scale-invariant log loss of Eigen et al. (see the sketch after this list), and gradient/normal consistency terms.
- Unsupervised/self-supervised approaches: Exploit photometric consistency across time (in video) or stereo views, enforcing that estimated depths predict how the scene would appear from nearby poses (Hermann et al., 2020, Kopf et al., 2020). Depth and pose estimators are trained jointly to minimize reprojection error and spatial smoothness, often with auto-masking for occlusions and edge-aware regularization.
- Hybrid or physics-informed approaches: Incorporate optics models (defocus blur, coded aperture, lensless imaging, dual-pixel cues, plenoptic light fields) or explicitly embed camera parameters or self-calibration networks (Garg et al., 2019, Wijayasingha et al., 2023, Lopez et al., 2023, Lasheras-Hernandez et al., 3 Dec 2024, Facil et al., 2019, Chanduri et al., 2021). Methods may learn from defocus cue images (Gur et al., 2020, Wijayasingha et al., 2023), coded-aperture or colored-coded images (Lopez et al., 2023), or directly infer metric depth from novel sensors (e.g., plenoptic cameras (Lasheras-Hernandez et al., 3 Dec 2024), FlatCam (Asif, 2017)).
- Uncertainty-aware and multi-task learning: Recent models estimate pixelwise aleatoric and epistemic uncertainty, and combine depth estimation with semantic segmentation or surface normal predictions for mutual regularization (Rodriguez-Puigvert, 20 Jun 2024, Mertan et al., 2021, Chanduri et al., 2021).
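As a concrete example of one standard objective from the supervised family above, the scale-invariant log loss of Eigen et al. can be written as a few lines of PyTorch-style code. This is a minimal sketch; the function name, masking convention, and the value of the balancing weight `lam` are illustrative rather than taken from any single cited implementation.

```python
import torch

def scale_invariant_log_loss(pred_depth, gt_depth, mask, lam=0.5):
    """Scale-invariant log loss (Eigen et al.): penalizes log-depth errors
    while discounting a global scale offset between prediction and ground truth."""
    d = torch.log(pred_depth[mask]) - torch.log(gt_depth[mask])  # per-pixel log-depth residuals
    n = d.numel()
    return (d ** 2).mean() - lam * (d.sum() ** 2) / (n ** 2)
```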
3. Handling Camera Parameters and Generalization
A major challenge is the failure of single-view networks to generalize across cameras with differing intrinsics, focal lengths, and sensor geometries. Several strategies have been developed:
- Embedding camera parameters as explicit input: Approaches such as CAM-Convs concatenate per-pixel coordinate maps, centered coordinates, local field-of-view angles, and focal length information to feature tensors at all scales, allowing the network to learn calibration-aware filters and generalize to varying intrinsics K (see the sketch after this list) (Facil et al., 2019, He et al., 2018).
- Implicit or learned intrinsics: Networks can include micro-architectures to predict intrinsic parameters (focal lengths, principal points) end-to-end, jointly with depth and pose (Chanduri et al., 2021).
- Hybrid two-stage architectures: For relative-to-metric conversion, frameworks like the Versatile Depth Estimator (VDE) decouple camera-agnostic relative depth estimation from a lightweight camera-specific converter tuned per device, yielding both generalization and parameter efficiency (Jun et al., 2023).
- Camera-independent correction for optical cues: In physics-informed defocus-based depth estimation, a one-time calibration enables transfer of the model to new cameras by normalizing measured blur maps with a camera-specific scalar, obviating retraining (Wijayasingha et al., 2023).
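To make the camera-parameter-embedding idea concrete, the sketch below builds per-pixel centered-coordinate and field-of-view channels from the intrinsics and concatenates them to a feature map, in the spirit of CAM-Convs. The exact channel set and normalization used in the published method may differ; this is an illustrative assumption.

```python
import torch

def camera_aware_channels(features, fx, fy, cx, cy):
    """Concatenate calibration-aware channels to a feature map of shape (B, C, H, W).

    Illustrative channels (in the spirit of CAM-Convs): focal-normalized centered
    coordinates and per-pixel field-of-view angles. Intrinsics are assumed to be
    expressed at the feature map's resolution.
    """
    b, _, h, w = features.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=features.dtype, device=features.device),
        torch.arange(w, dtype=features.dtype, device=features.device),
        indexing="ij",
    )
    ccx = (xs - cx) / fx              # centered x coordinates, normalized by focal length
    ccy = (ys - cy) / fy              # centered y coordinates, normalized by focal length
    fovx = torch.atan(ccx)            # horizontal viewing angle of each pixel
    fovy = torch.atan(ccy)            # vertical viewing angle of each pixel
    extra = torch.stack([ccx, ccy, fovx, fovy]).unsqueeze(0).expand(b, -1, -1, -1)
    return torch.cat([features, extra], dim=1)
```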
4. Physical and Coded-Optics Cues
Certain approaches exploit lens or sensor physics to encode depth cues directly into the image:
- Defocus-based methods: Depth-from-defocus models leverage thin-lens equations to relate blur patterns (PSF width) to scene depth, with differentiable layers or two-stage networks learning to regress per-pixel depth from measured blur, after compensating for camera parameters and sensor blur (see the thin-lens sketch after this list) (Gur et al., 2020, Wijayasingha et al., 2023).
- Coded/color-coded apertures: Depth is encoded through chromatic aberrations or spatially varying coded patterns, learned or hand-designed, and a dedicated network is trained to invert the image formation (Lopez et al., 2023). Richer color coding (multiple spectral primaries) is shown to significantly improve performance, and low-cost photographic film implementations are feasible.
- FlatCam/lensless imaging: Mask-based lensless cameras encode scene depth as multiplexed modulation patterns. Recovery involves physics-based forward modeling with a combination of linear operators and greedy pursuit algorithms or optimization to reconstruct per-pixel depth/intensity, requiring careful calibration (Asif, 2017).
- Dual-pixel autofocus sensors: Utilize sub-aperture differences to extract local stereo and defocus cues. Depth estimation from dual-pixel cues involves accounting for affine ambiguity in inverse depth, either with calibration or by learning affine-invariant mappings (Garg et al., 2019).
- Plenoptic/light-field cameras: Intrinsically encode metric disparity at the microlens scale. Hybrid approaches combine sparse metric estimates from microlens-based stereo with dense relative depth from foundation models, aligning the two via robust estimators to output per-pixel metric depth (see the alignment sketch after this list) (Lasheras-Hernandez et al., 3 Dec 2024).
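For the defocus-based family above, the thin-lens relation between depth and blur can be sketched as follows. This is the standard geometric-optics circle-of-confusion formula with illustrative symbol names; the cited methods wrap such a relation in learned or differentiable layers and compensate for sensor-specific blur.

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Blur-spot (circle of confusion) diameter for a thin lens, in the same
    units as focal_len. `depth` is the object distance and `focus_dist` the
    distance at which the lens is focused (both measured from the lens)."""
    aperture = focal_len / f_number                      # aperture diameter A = f / N
    return aperture * (focal_len / (focus_dist - focal_len)) \
                    * abs(depth - focus_dist) / depth
```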
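The plenoptic hybrid above aligns dense relative depth with sparse metric anchors. A minimal version of such an alignment is a least-squares fit of a scale and shift at the sparse points; the cited work uses more robust estimators, so the sketch below is illustrative only.

```python
import numpy as np

def align_relative_to_metric(rel_depth, metric_depth, valid):
    """Fit metric ≈ s * rel + t at the sparse valid pixels (least squares),
    then apply the fitted scale and shift to the dense relative map."""
    r = rel_depth[valid]
    m = metric_depth[valid]
    A = np.stack([r, np.ones_like(r)], axis=1)      # design matrix [rel, 1]
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)  # closed-form scale/shift fit
    return s * rel_depth + t
```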
5. Learning Objectives, Regularization, and Evaluation
The diversity of depth cues and supplemental hardware information motivates specialized loss functions and regularizers:
- Supervised settings: Per-pixel L1/L2, scale-invariant log loss, and normal or gradient consistency losses are standard (Mertan et al., 2021, Zhang et al., 2022, Jun et al., 2023).
- Unsupervised/self-supervised: Photometric reconstruction losses, smoothness regularizers, auto-masking for dynamic/static scenes, and left-right or multi-view consistency are widely used (see the photometric sketch after this list) (Hermann et al., 2020, Chanduri et al., 2021, Kopf et al., 2020).
- Uncertainty-aware learning: Prediction heads for aleatoric and epistemic uncertainty enable uncertainty-weighted loss terms in both teacher-student self-supervision and Bayesian frameworks (Rodriguez-Puigvert, 20 Jun 2024, Chanduri et al., 2021).
- Coded-optics: Networks are supervised by the ability to both reconstruct known depth cues from coded images and, where available, match PSF or chromatic features to ground truth (Lopez et al., 2023, Gur et al., 2020, Wijayasingha et al., 2023).
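A simplified version of the self-supervised objectives listed above (photometric reconstruction plus edge-aware smoothness) is sketched below. Practical systems such as those cited typically add SSIM terms, per-pixel auto-masking, and multi-scale handling, so these two functions are illustrative rather than a faithful reproduction of any one method.

```python
import torch

def photometric_l1(target, reconstructed):
    """Mean absolute photometric error between the target frame and the view
    synthesized from a neighboring frame using the predicted depth and pose."""
    return (target - reconstructed).abs().mean()

def edge_aware_smoothness(disp, image):
    """Edge-aware smoothness: penalize disparity gradients, down-weighted where
    the image itself has strong gradients (likely true depth edges)."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)       # mean-normalize disparity
    dx_d = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy_d = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    dx_i = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```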
Evaluation is performed on indoor/outdoor datasets (NYU Depth V2, KITTI, Make3D, custom light-field/defocus sets) using RMSE, absolute/squared relative error, log RMSE, and thresholded accuracy (e.g., the fraction of pixels whose predicted-to-ground-truth depth ratio falls within 1.25×) (Mertan et al., 2021, Aleotti et al., 2020, Ignatov et al., 2021). Platform-specific benchmarks (e.g., Raspberry Pi, iPhone XS) are reported for real-time deployment (Ignatov et al., 2021, Aleotti et al., 2020).
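These standard metrics can be computed as follows; this is a minimal NumPy sketch that omits dataset-specific evaluation crops and depth capping, which vary between benchmarks.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-image depth metrics on valid (gt > 0) pixels:
    AbsRel, SqRel, RMSE, log-RMSE, and threshold accuracies delta < 1.25**k."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": np.mean(np.abs(p - g) / g),
        "sq_rel": np.mean((p - g) ** 2 / g),
        "rmse": np.sqrt(np.mean((p - g) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```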
6. Practical Deployments and Special Cases
Single-camera depth estimation underpins real-time mobile applications (AR occlusion, bokeh), mapping and navigation in robotics and UAVs, scientific/medical imaging, and lensless/miniaturized cameras.
- Mobile/embedded devices: Lightweight encoder–decoder architectures (e.g., MobileNet, EfficientNet, FastDepth, PyDNet) with knowledge distillation, low-compute feature-fusion, and careful operator selection achieve real-time performance with <4 MB memory and ~10 FPS or better on Raspberry Pi and smartphone hardware (Ignatov et al., 2021, Aleotti et al., 2020).
- Medical endoscopy: Domain-adapted pipelines use Bayesian DNNs for uncertainty-aware depth estimation, synthetic-to-real transfer via uncertainty-weighted teacher-student learning, and illumination-based self-supervision exploiting known light decay in endoscopes (Rodriguez-Puigvert, 20 Jun 2024).
- Lensless and coded cameras: Mask-based or colored-coded-aperture (CCA) systems enable thin, flexible, or multi-spectral depth imaging with specialized inversion or hybrid learning algorithms (Asif, 2017, Lopez et al., 2023).
7. Limitations and Open Challenges
Despite substantial progress, key challenges remain:
- Scale ambiguity and calibration dependence: Absolute metric depth recovery requires knowledge of camera intrinsics or external scale cues. Data-driven methods without such information recover only relative (up-to-scale) depth (Mertan et al., 2021, He et al., 2018).
- Textureless and reflective surfaces: Regions devoid of visual or defocus structure, as well as transparent/specular objects, remain problematic for both learning-based and physics-based pipelines (Kopf et al., 2020, Mertan et al., 2021).
- Generalization to novel environments: Domain shift, varying sensor characteristics, and distributional mismatches degrade accuracy. Hybrid frameworks that decouple common relative depth estimation from camera-specific relative-to-metric conversion (Jun et al., 2023), camera-parameter embedding (Facil et al., 2019, Chanduri et al., 2021), and uncertainty modeling (Rodriguez-Puigvert, 20 Jun 2024) are promising countermeasures.
- Computational complexity in physically-based inversion: Coded/lensless approaches and optimization-based multi-view fusion incur significant computation and memory requirements, limiting applicability for long sequences or high-resolution video (Kopf et al., 2020, Asif, 2017).
- Data demands for coded/deep-optics methods: Jointly learning optical codes and depth networks may require spectral/depth datasets, which are scarce or costly to capture (Lopez et al., 2023).
Future research directions include tighter integration of physical and semantic priors, self-adaptive optical coding, improved domain adaptation and cross-camera generalization, real-time uncertainty quantification, and fusion with active sensors where possible.
Selected References
- "Single Image Depth Estimation: An Overview" (Mertan et al., 2021)
- "Robust Consistent Video Depth Estimation" (Kopf et al., 2020)
- "CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth" (Facil et al., 2019)
- "Versatile Depth Estimator Based on Common Relative Depth Estimation and Camera-Specific Relative-to-Metric Depth Conversion" (Jun et al., 2023)
- "Uncertainty and Self-Supervision in Single-View Depth" (Rodriguez-Puigvert, 20 Jun 2024)
- "Camera-Independent Single Image Depth Estimation from Defocus Blur" (Wijayasingha et al., 2023)
- "Depth Estimation from a Single Optical Encoded Image using a Learned Colored-Coded Aperture" (Lopez et al., 2023)
- "Learning Single Camera Depth Estimation using Dual-Pixels" (Garg et al., 2019)
- "Single-Shot Metric Depth from Focused Plenoptic Cameras" (Lasheras-Hernandez et al., 3 Dec 2024)
- "Single Image Depth Estimation Trained via Depth from Defocus Cues" (Gur et al., 2020)
- "Toward Depth Estimation Using Mask-Based Lensless Cameras" (Asif, 2017)
- "Fast and Accurate Single-Image Depth Estimation on Mobile Devices" (Ignatov et al., 2021)
- "Real-time single image depth perception in the wild with handheld devices" (Aleotti et al., 2020)