- The paper introduces an end-to-end pipeline that fuses focal stack and single-image depth cues to produce robust metric depth maps on mobile devices.
- The method combines least-squares scale alignment with a deep refinement layer, achieving a 13% improvement in RMSE and AbsRel on NYU Depth v2.
- The approach offers efficient inference and superior zero-shot performance, underlining its potential for real-time mobile AR applications.
HybridDepth: Robust Depth Fusion for Mobile AR by Leveraging Depth from Focus and Single-Image Priors
Introduction
HybridDepth presents a robust depth-estimation pipeline designed specifically for mobile augmented reality (AR) applications. The system addresses the challenges inherent in depth estimation on mobile devices, including scale ambiguity, hardware heterogeneity, and generalizability. By leveraging both depth from focus (DFF) and single-image depth priors, HybridDepth produces accurate and generalizable metric depth maps, outperforming state-of-the-art (SOTA) models across multiple benchmarks.
Key Contributions
HybridDepth's primary contributions are threefold:
- End-to-End Pipeline: HybridDepth integrates focal stack information with relative depth estimation to produce robust metric depth maps, and it is designed to operate within the limited hardware capabilities of mobile devices.
- State-of-the-Art Performance: HybridDepth outperforms existing SOTA methods, including recent advancements such as DepthAnything, on datasets like NYU Depth v2, DDFF12, and ARKitScenes.
- Generalization: The model demonstrates superior zero-shot performance, particularly on AR-specific datasets, highlighting its robustness across diverse and unforeseen environments.
Methodology
HybridDepth employs a three-phase approach:
- Capture Relative and Metric Depth: Two modules are used: a single-image relative depth estimator and a DFF metric depth estimator. The single-image model supplies the scene's relative structure, while the DFF estimator provides the metric depth needed to anchor the scale.
- Least-Squares Fitting: This phase aligns the scale of the relative depth estimates with the metric depths produced by the DFF model via least-squares fitting (a minimal sketch of this step follows the list).
- Refinement Layer: A final learned refinement layer corrects the intermediate depth map using a locally adaptive scale map derived from the DFF branch and the globally scaled depth map.
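The global alignment in the second phase reduces to an ordinary least-squares problem. Below is a minimal sketch, assuming the fit solves for a single global scale (and, here, also a shift, which is an assumption) over pixels where the DFF branch yields a valid estimate; all function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of the global least-squares alignment step (Phase 2).
# Assumption: a global scale s and shift t are fit so that s * d_rel + t
# best matches the DFF metric depth on valid pixels.
import numpy as np

def align_relative_to_metric(d_rel: np.ndarray,
                             d_dff: np.ndarray,
                             valid: np.ndarray) -> np.ndarray:
    """Scale (and shift) a relative depth map so it agrees, in the
    least-squares sense, with the DFF metric depth on valid pixels."""
    x = d_rel[valid].ravel()
    y = d_dff[valid].ravel()
    # Closed-form least squares for y = s*x + t.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_rel + t

# Example with stand-in data (random relative depth, synthetic "metric" depth):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_rel = rng.uniform(0.0, 1.0, size=(480, 640))                 # relative, unitless
    d_dff = 2.5 * d_rel + 0.3 + rng.normal(0, 0.05, d_rel.shape)   # meters
    valid = d_dff > 0                                              # usable DFF pixels
    d_metric = align_relative_to_metric(d_rel, d_dff, valid)
```

The globally aligned map is then passed, together with the DFF output, to the refinement layer described above.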
Results and Evaluation
NYU Depth v2 and DDFF12: HybridDepth achieves a 13% improvement in RMSE and AbsRel on NYU Depth v2 compared with recent approaches. On DDFF12 it likewise outperforms prior methods, performing well even on scenes with large texture-less regions.
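For reference, the two headline metrics are standard depth-estimation measures: RMSE is the root-mean-square error between predicted and ground-truth depth, and AbsRel is the mean absolute error normalized by the ground-truth depth. A generic sketch of both (not the paper's evaluation code):

```python
# Standard depth-estimation metrics used in the reported comparisons
# (generic definitions, not the paper's evaluation code).
import numpy as np

def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root-mean-square error, in the same units as the depth maps (e.g., meters)."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error relative to the ground-truth depth (unitless)."""
    return float(np.mean(np.abs(pred - gt) / gt))
```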
ARKitScenes: In both zero-shot and trained evaluations, HybridDepth achieves state-of-the-art results, with an RMSE of 0.367 in the zero-shot setting and 0.254 when trained on the dataset. This highlights the model's ability to handle complex, AR-specific scenarios effectively.
Model Efficiency: When compared to SOTA models like ZoeDepth and DepthAnything, HybridDepth offers significantly shorter inference times and a smaller model size, making it highly suitable for real-time mobile applications.
Implications and Future Work
Practical Implications: Because HybridDepth operates with only a standard camera, it can be deployed across a wide range of mobile devices without specialized hardware such as LiDAR or ToF sensors, broadening its applicability across varied device ecosystems.
Theoretical Implications: By combining DFF and single-image priors, HybridDepth presents a novel methodological framework that can be extended to other domains requiring robust depth estimation. It sets a benchmark for integrating multi-source depth cues while maintaining computational efficiency and generalizability.
Future Work: Although HybridDepth shows great promise, further improvements could target the DFF branch to mitigate scaling errors, particularly for pixels lacking optimal focus data. More selective extraction of depth values in the DFF process could improve overall accuracy and reliability (an illustrative sketch of one such gating strategy follows).
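One way to picture such selective extraction: gate each pixel's DFF depth hypothesis by a focus-confidence score, so that pixels that never come into sharp focus anywhere in the stack are excluded. The sketch below is purely illustrative of that idea; the focus measure (a windowed Laplacian-energy score) and the threshold are assumed choices, not the paper's method.

```python
# Hypothetical illustration of confidence-gated depth-from-focus extraction.
# Focus measure and threshold are illustrative choices, not from the paper.
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def dff_depth_with_confidence(focal_stack: np.ndarray,
                              focus_dists: np.ndarray,
                              window: int = 9,
                              threshold: float = 1e-3):
    """focal_stack: (N, H, W) grayscale slices; focus_dists: (N,) focus distances (m).
    Returns (depth, valid): per-pixel depth of the sharpest slice and a mask of
    pixels whose peak focus response clears the confidence threshold."""
    # Local sharpness of each slice: windowed mean of the squared Laplacian response.
    lap = np.stack([laplace(s.astype(np.float64)) for s in focal_stack])
    sharpness = uniform_filter(lap ** 2, size=(1, window, window))
    best = sharpness.argmax(axis=0)            # index of the sharpest slice per pixel
    depth = focus_dists[best]                  # depth = focus distance of that slice
    valid = sharpness.max(axis=0) > threshold  # keep only confidently focused pixels
    return depth, valid
```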
Conclusion
HybridDepth marks a significant advancement in depth estimation for mobile AR, leveraging the strengths of DFF and relative depth estimation to produce robust, accurate, and efficient depth maps. Its strong numerical results across multiple datasets and its superior performance in real-world applications underscore its potential as a practical and scalable solution for the future of mobile AR experiences.