
Single-Camera Depth Estimation

Updated 1 December 2025
  • Single-camera depth estimation is the process of inferring dense scene depth from a single image using visual priors, geometric cues, and optics-based methods.
  • Techniques span supervised, unsupervised, and hybrid frameworks that integrate calibrated camera parameters, defocus cues, and coded apertures to address scale ambiguity.
  • Practical applications include mobile AR, robotics, and medical imaging, while challenges remain in generalization, textureless surfaces, and computational efficiency.

Single-camera depth estimation refers to the inference of dense or sparse scene depth (distance from the camera to every pixel or spatial location) from a single image or video sequence captured by one camera. The problem is fundamentally ill-posed: a monocular 2D projection does not uniquely determine the underlying 3D scene geometry without additional cues or strong statistical priors. Recent advances leverage deep neural networks, geometric constraints, optics-inspired cues, and hybrid learning frameworks to recover per-pixel depth or disparity maps from single-camera data—enabling applications across robotics, augmented reality, mobile photography, and scientific imaging.

1. Problem Formulation and Fundamental Ambiguities

In single-image depth estimation, the goal is to recover a dense, per-pixel depth map $D: \Omega \to \mathbb{R}^+$ from an observed image $I$ over the domain $\Omega \subset \mathbb{R}^2$, i.e., $\hat{D}(x) = f_\theta(I; x)$ for all $x \in \Omega$ (Mertan et al., 2021). This is severely ill-posed, as multiple 3D scenes can lead to the same image under the perspective projection model. Methods therefore rely on a combination of learned visual priors, geometric cues, physical constraints, parametric models of the camera and optics, or auxiliary signals (e.g., motion, coded illumination).
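As a minimal illustration of the learned mapping $f_\theta$, the toy PyTorch sketch below (not any specific published architecture; the `max_depth` bound is an assumption for indoor-scale scenes) regresses a bounded per-pixel depth map directly from an RGB image.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder f_theta: RGB image -> dense per-pixel depth map."""

    def __init__(self, max_depth=10.0):
        super().__init__()
        self.max_depth = max_depth  # assumed depth-range bound (illustrative)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, image):
        # bounded positive output: sigmoid in (0, 1) scaled to (0, max_depth)
        return torch.sigmoid(self.decoder(self.encoder(image))) * self.max_depth

# depth = TinyDepthNet()(torch.rand(1, 3, 192, 256))   # -> (1, 1, 192, 256)
```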

A crucial ambiguity is the scale problem: metric depth cannot be uniquely recovered from a single view without knowledge of camera calibration (e.g., focal length) or extrinsic scale cues. For example, increasing the focal length and scaling the scene depth by the same factor leaves the projected image unchanged, as shown explicitly by the projective camera equations and demonstrated quantitatively in (He et al., 2018).
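Written out with the pinhole projection model (a standard illustration consistent with the equations referenced above), a 3D point $(X, Y, Z)$ imaged with focal length $f$ projects to

$$u = f\,\frac{X}{Z}, \qquad v = f\,\frac{Y}{Z}, \qquad \text{so} \qquad (f, Z) \mapsto (\alpha f, \alpha Z) \;\Rightarrow\; u = \frac{\alpha f\, X}{\alpha Z} = f\,\frac{X}{Z},$$

meaning the same image is consistent with a continuum of focal-length/depth combinations unless $f$ or another metric cue is known.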

2. Architectures and Learning Paradigms

Single-camera depth estimation methods can be broadly classified by their supervision and by the cues they exploit:

  • Supervised regression: networks trained on images paired with ground-truth depth (e.g., from LiDAR or RGB-D sensors) to directly regress per-pixel depth or disparity.
  • Self-supervised/unsupervised learning: photometric reconstruction objectives over monocular video or stereo pairs replace explicit depth labels, at the cost of recovering only up-to-scale depth (a minimal sketch of this objective follows the list).
  • Hybrid and physics-informed frameworks: models that additionally condition on calibrated camera parameters or exploit optical cues such as defocus and coded apertures (see Sections 3 and 4).
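The self-supervised paradigm can be summarized in a few lines: warp a source frame into the target view using the predicted depth, then penalize the photometric difference. The sketch below assumes known intrinsics $K$ and a relative pose $T_{t \to s}$ (in practice often predicted by a jointly trained pose network); variable names are illustrative, not taken from any cited codebase.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift target-frame pixels to 3D camera coordinates. depth: (B,1,H,W), K_inv: (B,3,3)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1).expand(B, -1, -1)
    return (K_inv @ pix) * depth.reshape(B, 1, -1)            # (B, 3, H*W)

def project(points, K, T):
    """Transform target-frame points into the source view and project to pixels."""
    B, _, N = points.shape
    hom = torch.cat([points, torch.ones(B, 1, N, dtype=points.dtype, device=points.device)], dim=1)
    cam_src = (T @ hom)[:, :3]                                # (B, 3, H*W), T: (B, 4, 4)
    pix = K @ cam_src
    return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)           # (B, 2, H*W)

def photometric_loss(I_t, I_s, depth_t, K, K_inv, T_ts):
    """Warp the source image into the target view via predicted depth and compare (L1)."""
    B, _, H, W = I_t.shape
    pix = project(backproject(depth_t, K_inv), K, T_ts).reshape(B, 2, H, W)
    grid = torch.stack([2 * pix[:, 0] / (W - 1) - 1,          # normalize to [-1, 1] for grid_sample
                        2 * pix[:, 1] / (H - 1) - 1], dim=-1)
    I_s_warped = F.grid_sample(I_s, grid, padding_mode="border", align_corners=True)
    return (I_t - I_s_warped).abs().mean()
```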

3. Handling Camera Parameters and Generalization

A major challenge is the failure of single-view networks to generalize across cameras with differing intrinsics, focal lengths, and sensor geometries. Several strategies have been developed:

  • Embedding camera parameters as explicit input: Approaches such as CAM-Convs concatenate per-pixel coordinate maps, centered coordinates, local field-of-view angles, and focal length information to feature tensors at all scales, allowing the network to learn calibration-aware filters and generalize to varying K (Facil et al., 2019, He et al., 2018); a simplified sketch of such calibration maps follows this list.
  • Implicit or learned intrinsics: Networks can include micro-architectures to predict intrinsic parameters (focal lengths, principal points) end-to-end, jointly with depth and pose (Chanduri et al., 2021).
  • Hybrid two-stage architectures: For relative-to-metric conversion, frameworks like the Versatile Depth Estimator (VDE) decouple camera-agnostic relative depth estimation from a lightweight camera-specific converter tuned per device, yielding both generalization and parameter efficiency (Jun et al., 2023).
  • Camera-independent correction for optical cues: In physics-informed defocus-based depth estimation, a one-time calibration enables transfer of the model to new cameras by normalizing measured blur maps with a camera-specific scalar, obviating retraining (Wijayasingha et al., 2023).
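To make the calibration-map idea above concrete, the sketch below builds principal-point-centered coordinate and viewing-angle channels from an intrinsics matrix. It follows the spirit of CAM-Convs only; the published method uses a richer channel set and normalization, and `K` here must be expressed at the resolution of the feature map it is concatenated to.

```python
import torch

def camera_coordinate_maps(K, height, width):
    """Per-pixel calibration maps: centered coordinates and local viewing angles.

    K is a 3x3 intrinsics matrix expressed at (height, width) resolution.
    Returns a (4, height, width) tensor to concatenate to feature tensors,
    letting convolutions become aware of the camera geometry.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    grid_y, grid_x = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    cc_x, cc_y = grid_x - cx, grid_y - cy                         # offsets from the principal point (px)
    fov_x, fov_y = torch.atan(cc_x / fx), torch.atan(cc_y / fy)   # local viewing angles (rad)
    return torch.stack([cc_x, cc_y, fov_x, fov_y], dim=0)

# Example: append calibration channels to a (1, 64, 48, 64) feature map
# K_feat = torch.tensor([[50., 0., 32.], [0., 50., 24.], [0., 0., 1.]])   # intrinsics at 48x64
# feats = torch.rand(1, 64, 48, 64)
# feats = torch.cat([feats, camera_coordinate_maps(K_feat, 48, 64).unsqueeze(0)], dim=1)  # (1, 68, 48, 64)
```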

4. Physical and Coded-Optics Cues

Certain approaches exploit lens or sensor physics to encode depth cues directly into the image:

  • Defocus-based methods: Depth from defocus models leverage thin-lens equations to relate blur patterns (PSF width) to scene depth, with differentiable layers or two-stage networks learning to regress per-pixel depth from measured blur, after compensating for camera parameters and sensor blur (Gur et al., 2020, Wijayasingha et al., 2023); a minimal thin-lens blur model is sketched after this list.
  • Coded/color-coded apertures: Depth is encoded through chromatic aberrations or spatially varying coded patterns, learned or hand-designed, and a dedicated network is trained to invert the image formation (Lopez et al., 2023). Richer color coding (multiple spectral primaries) is shown to significantly improve performance, and low-cost photographic film implementations are feasible.
  • FlatCam/lensless imaging: Mask-based lensless cameras encode scene depth as multiplexed modulation patterns. Recovery involves physics-based forward modeling with a combination of linear operators and greedy pursuit algorithms or optimization to reconstruct per-pixel depth/intensity, requiring careful calibration (Asif, 2017).
  • Dual-pixel autofocus sensors: Utilize sub-aperture differences to extract local stereo and defocus cues. Depth estimation from dual-pixel cues involves accounting for affine ambiguity in inverse depth, either with calibration or by learning affine-invariant mappings (Garg et al., 2019).
  • Plenoptic/light-field cameras: Intrinsically encode metric disparity at microlens scale. Hybrid approaches combine sparse metric estimates from microlens-based stereo with dense relative depth from foundation models, aligning these via robust estimators to output per-pixel metric depth (Lasheras-Hernandez et al., 3 Dec 2024).
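The thin-lens relation underlying the defocus cue is compact enough to state directly; the sketch below is the standard circle-of-confusion forward model, given as a worked illustration rather than the calibrated pipelines of the cited papers.

```python
import numpy as np

def blur_diameter(depth_m, focus_dist_m, focal_length_m, f_number):
    """Thin-lens circle-of-confusion diameter on the sensor (meters).

    An object at depth_m, with the lens focused at focus_dist_m, produces a
    blur disc whose diameter grows with the focus mismatch. Depth-from-defocus
    inverts this relation per pixel after estimating the local blur.
    """
    aperture = focal_length_m / f_number                   # aperture diameter A = f / N
    return (aperture
            * np.abs(depth_m - focus_dist_m) / depth_m
            * focal_length_m / (focus_dist_m - focal_length_m))

# Example: 50 mm lens at f/2 focused at 2 m, object at 4 m
# blur_diameter(4.0, 2.0, 0.050, 2.0)   # ~3.2e-4 m (0.32 mm) blur diameter
```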

5. Learning Objectives, Regularization, and Evaluation

The diversity of depth cues and supplemental hardware information motivates specialized loss functions and regularizers, including scale- and affine-invariant depth losses, photometric reconstruction terms for self-supervision, and uncertainty-weighted objectives.

Evaluation is performed on indoor/outdoor datasets (NYU Depth V2, KITTI, Make3D, custom light-field/defocus sets) using RMSE, absolute/squared relative error, log RMSE, and thresholded accuracy (e.g., the fraction of pixels whose ratio of predicted to ground-truth depth lies within a factor of 1.25), as sketched below (Mertan et al., 2021, Aleotti et al., 2020, Ignatov et al., 2021). Platform-specific benchmarks (e.g., Raspberry Pi, iPhone XS) are reported for real-time deployment (Ignatov et al., 2021, Aleotti et al., 2020).
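These metrics follow standard definitions and are straightforward to reproduce; a NumPy sketch, evaluated over valid ground-truth pixels, is:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    mask = gt > eps
    pred, gt = pred[mask], gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),          # absolute relative error
        "sq_rel": np.mean((pred - gt) ** 2 / gt),            # squared relative error
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),                    # thresholded accuracy
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }
```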

6. Practical Deployments and Special Cases

Single-camera depth estimation underpins real-time mobile applications (AR occlusion, bokeh), mapping and navigation in robotics and UAVs, scientific/medical imaging, and lensless/miniaturized cameras.

  • Mobile/embedded devices: Lightweight encoder–decoder architectures (e.g., MobileNet, EfficientNet, FastDepth, PyDNet) with knowledge distillation, low-compute feature fusion, and careful operator selection achieve real-time performance with <4 MB memory and ~10 FPS or better on Raspberry Pi and smartphone hardware (Ignatov et al., 2021, Aleotti et al., 2020); a generic distillation objective is sketched after this list.
  • Medical endoscopy: Domain-adapted pipelines use Bayesian DNNs for uncertainty-aware depth estimation, synthetic-to-real transfer via uncertainty-weighted teacher-student learning, and illumination-based self-supervision exploiting known light decay in endoscopes (Rodriguez-Puigvert, 20 Jun 2024).
  • Lensless and coded cameras: Mask-based or CCA-based systems enable thin, flexible, or multi-spectral depth imaging with specialized inversion or hybrid learning algorithms (Asif, 2017, Lopez et al., 2023).
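As a generic illustration of the knowledge-distillation term mentioned for mobile deployments (a hypothetical objective, not the specific losses of the cited systems), a lightweight student can be regressed toward a frozen teacher's dense prediction, optionally mixed with a supervised term where ground truth exists:

```python
import torch.nn.functional as F

def distillation_loss(student_depth, teacher_depth, gt_depth=None, alpha=0.5):
    """Hypothetical distillation objective for a lightweight monocular depth student.

    student_depth / teacher_depth: (B, 1, H, W); gt_depth may be sparse (0 = invalid).
    """
    loss = F.l1_loss(student_depth, teacher_depth.detach())      # mimic the large teacher
    if gt_depth is not None:
        valid = gt_depth > 0
        loss = alpha * loss + (1 - alpha) * F.l1_loss(student_depth[valid], gt_depth[valid])
    return loss
```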

7. Limitations and Open Challenges

Despite substantial progress, key challenges remain:

  • Scale ambiguity and calibration dependence: Absolute metric depth recovery requires knowledge of camera intrinsics or external scale cues. Data-driven methods without such information recover only relative (up-to-scale) depth (Mertan et al., 2021, He et al., 2018).
  • Textureless and reflective surfaces: Regions devoid of visual or defocus structure, as well as transparent/specular objects, remain problematic for both learning-based and physics-based pipelines (Kopf et al., 2020, Mertan et al., 2021).
  • Generalization to novel environments: Domain shift, varying sensor characteristics, and distributional mismatches lead to degradation in accuracy. Hybrid CRDE/R2MC frameworks (Jun et al., 2023), camera-parameter embedding (Facil et al., 2019, Chanduri et al., 2021), and uncertainty modeling (Rodriguez-Puigvert, 20 Jun 2024) are promising countermeasures.
  • Computational complexity in physically-based inversion: Coded/lensless approaches and optimization-based multi-view fusion incur significant computation and memory requirements, limiting applicability for long sequences or high-resolution video (Kopf et al., 2020, Asif, 2017).
  • Data demands for coded/deep-optics methods: Jointly learning optical codes and depth networks may require spectral/depth datasets, which are scarce or costly to capture (Lopez et al., 2023).

Future research directions include tighter integration of physical and semantic priors, self-adaptive optical coding, improved domain adaptation and cross-camera generalization, real-time uncertainty quantification, and fusion with active sensors where possible.


Selected References
