Towards Zero-Shot Scale-Aware Monocular Depth Estimation (2306.17253v1)

Published 29 Jun 2023 in cs.CV and cs.LG

Abstract: Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models will be geometry-specific, with learned scales that cannot be directly transferred across domains. Because of that, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages, via a variational latent representation that is conditioned on single frame information. We evaluated ZeroDepth targeting both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieved a new state-of-the-art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.
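The abstract's first ingredient, input-level geometric embeddings, can be illustrated with a minimal sketch. One common realization (assumed here for illustration; the paper's actual embedding may differ) maps each pixel to its unit viewing ray derived from the camera intrinsics, so the network receives the camera geometry alongside the image and can learn a metric scale prior that transfers across cameras. All names below are illustrative:

```python
import numpy as np

def ray_embeddings(K, height, width):
    """Per-pixel unit viewing rays from a 3x3 intrinsics matrix K.

    A sketch of one plausible input-level geometric embedding:
    each pixel is back-projected to its camera ray, making the
    image resolution and focal length explicit to the network.
    """
    # Pixel-center grid in homogeneous image coordinates
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project through the inverse intrinsics to camera-frame rays
    rays = pix @ np.linalg.inv(K).T
    # Normalize to unit length so the embedding itself is scale-free
    rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays  # (H, W, 3), concatenated with the RGB input channels

# Example intrinsics (focal length 500 px, principal point at center)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
rays = ray_embeddings(K, height=480, width=640)
```

Because the rays encode field of view rather than absolute scale, the same pre-trained network can consume images from cameras with different intrinsics, which is what enables the zero-shot cross-domain transfer the abstract describes.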

Citations (51)

