HybridDepth: Robust Metric Depth Fusion by Leveraging Depth from Focus and Single-Image Priors (2407.18443v2)

Published 26 Jul 2024 in cs.CV

Abstract: We propose HYBRIDDEPTH, a robust depth estimation pipeline that addresses key challenges in depth estimation, including scale ambiguity, hardware heterogeneity, and generalizability. HYBRIDDEPTH leverages the focal stack, data conveniently accessible in common mobile devices, to produce accurate metric depth maps. By incorporating depth priors afforded by recent advances in single-image depth estimation, our model achieves a higher level of structural detail compared to existing methods. We test our pipeline as an end-to-end system, with a newly developed mobile client to capture focal stacks, which are then sent to a GPU-powered server for depth estimation. Comprehensive quantitative and qualitative analyses demonstrate that HYBRIDDEPTH outperforms state-of-the-art (SOTA) models on common datasets such as DDFF12 and NYU Depth V2. HYBRIDDEPTH also shows strong zero-shot generalization. When trained on NYU Depth V2, HYBRIDDEPTH surpasses SOTA models in zero-shot performance on ARKitScenes and delivers more structurally accurate depth maps on Mobile Depth.

Summary

  • The paper introduces an end-to-end pipeline that fuses focal stack and single-image depth cues to produce robust metric depth maps on mobile devices.
  • The method employs least-squares fitting and a deep refinement layer, achieving a 13% improvement in RMSE and AbsRel on datasets like NYU Depth v2.
  • The approach offers efficient inference and superior zero-shot performance, underlining its potential for real-time mobile AR applications.

HybridDepth: Robust Depth Fusion for Mobile AR by Leveraging Depth from Focus and Single-Image Priors

Introduction

HybridDepth presents a robust pipeline for depth estimation explicitly designed for mobile augmented reality (AR) applications. The system addresses typical challenges inherent in depth estimation for mobile AR, including scale ambiguity, hardware heterogeneity, and generalizability. By leveraging both depth from focus (DFF) and single-image depth priors, HybridDepth achieves accurate and generalizable depth maps, outperforming state-of-the-art (SOTA) models across multiple benchmarks.

Key Contributions

HybridDepth's primary contributions are threefold:

  1. End-to-End Pipeline: HybridDepth integrates focal stack information with relative depth estimation to achieve robust metric depth maps, ensuring it can operate within the hardware constraints of mobile devices.
  2. State-of-the-Art Performance: HybridDepth outperforms existing SOTA methods, including recent advancements such as DepthAnything, on datasets like NYU Depth v2, DDFF12, and ARKitScenes.
  3. Generalization: The model demonstrates superior zero-shot performance, particularly on AR-specific datasets, highlighting its robustness across diverse and unforeseen environments.

Methodology

HybridDepth employs a three-phase approach:

  1. Capture Relative and Metric Depth: Two modules are used: a single-image relative depth estimator and a DFF metric depth estimator. The single-image model lays the structural foundation, while the DFF estimator provides the necessary metric depth.
  2. Least-Squares Fitting: This phase aligns the scale of the relative depth estimates with the metric depths derived from the DFF model using least-squares fitting.
  3. Refinement Layer: A final deep learning-based refinement layer corrects and fine-tunes the intermediate depth map using a locally adaptive scale map derived from the DFF branch and the globally scaled depth map; a minimal sketch of the alignment and refinement steps follows this list.
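
To make the scale alignment and refinement concrete, below is a minimal NumPy sketch of the idea: a closed-form least-squares fit of scale and shift that maps the relative depth onto the DFF metric depth over confidently focused pixels, followed by a simple locally adaptive correction. The confidence mask, variable names, and the hand-written blend are illustrative assumptions; the actual pipeline learns the refinement with a neural network.

```python
import numpy as np

def fit_scale_shift(rel, dff, conf_mask):
    """Least-squares fit of (scale, shift) so that scale * rel + shift ~= dff
    on pixels where the DFF estimate is trusted (illustrative sketch)."""
    r = rel[conf_mask]
    d = dff[conf_mask]
    A = np.stack([r, np.ones_like(r)], axis=1)      # design matrix [rel, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, d, rcond=None)
    return scale, shift

def hybrid_depth_sketch(rel, dff, conf_mask):
    """Globally scale the relative depth, then correct it toward the DFF metric
    depth where DFF is confident, via a locally adaptive scale map. This blend
    stands in for the paper's learned refinement layer."""
    scale, shift = fit_scale_shift(rel, dff, conf_mask)
    global_metric = scale * rel + shift             # globally scaled depth map
    # Locally adaptive scale map: ratio of DFF depth to the globally scaled
    # depth at confident pixels, 1.0 elsewhere (i.e., no local correction).
    local_scale = np.ones_like(global_metric)
    safe = conf_mask & (global_metric > 1e-6)
    local_scale[safe] = dff[safe] / global_metric[safe]
    return global_metric * local_scale
```

Here `rel` is the relative depth map from the single-image branch, `dff` is the metric depth from the focal-stack branch, and `conf_mask` marks pixels with reliable focus cues; all three are hypothetical inputs for illustration.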

Results and Evaluation

NYU Depth v2 and DDFF12: HybridDepth achieves a 13% improvement in RMSE and AbsRel metrics on the NYU Depth v2 dataset compared to recent approaches. On DDFF12, it demonstrates superior performance, showcasing its utility even on data with large texture-less areas.
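
For reference, the two reported error metrics have standard definitions, sketched below under the usual convention of evaluating only pixels with valid ground truth; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard monocular depth error metrics (illustrative definitions).
    pred, gt: metric depth maps of the same shape; only gt > 0 is evaluated."""
    valid = gt > eps
    p, g = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))   # root-mean-square error
    absrel = np.mean(np.abs(p - g) / g)     # mean absolute relative error
    return {"RMSE": rmse, "AbsRel": absrel}
```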

ARKitScenes: In both zero-shot and trained evaluations, HybridDepth performs strongly, achieving an RMSE of 0.367 and 0.254 in the zero-shot and trained settings, respectively. This highlights the model's ability to handle complex, AR-specific scenarios effectively.

Model Efficiency: When compared to SOTA models like ZoeDepth and DepthAnything, HybridDepth offers significantly shorter inference times and a smaller model size, making it highly suitable for real-time mobile applications.

Implications and Future Work

Practical Implications: HybridDepth's main practical implication is its deployment feasibility on mobile devices: it operates with only standard cameras, obviating the need for specialized hardware such as LiDAR or ToF sensors and broadening its applicability across varied device ecosystems.

Theoretical Implications: By combining DFF and single-image priors, HybridDepth presents a novel methodological framework that can be extended to other domains requiring robust depth estimation. It sets a benchmark for integrating multi-source depth cues while maintaining computational efficiency and generalizability.

Future Work: Although HybridDepth shows great promise, further improvements can target the DFF branch to mitigate scaling errors, particularly for pixels lacking optimal focus data. Selectively extracting depth values in the DFF process could enhance overall accuracy and reliability.

Conclusion

HybridDepth marks a significant advancement in depth estimation for mobile AR, leveraging the strengths of DFF and relative depth estimation to produce robust, accurate, and efficient depth maps. Its strong numerical results across multiple datasets and its superior performance in real-world applications underscore its potential as a practical and scalable solution for the future of mobile AR experiences.