Towards Zero-Shot Scale-Aware Monocular Depth Estimation (2306.17253v1)
Abstract: Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models are geometry-specific, with learned scales that cannot be directly transferred across domains. Consequently, recent works instead focus on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages via a variational latent representation conditioned on single-frame information. We evaluated ZeroDepth on both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, achieving a new state of the art in both settings with the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.
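One common way to realize "input-level geometric embeddings" is to attach per-pixel viewing-ray directions, computed from the camera intrinsics, to the network input, so the model can relate image content to metric geometry. The sketch below illustrates that idea under stated assumptions; it is a minimal illustration, not the paper's actual embedding (which may encode rays differently, e.g. with additional positional encodings).

```python
import numpy as np

def ray_embeddings(K, h, w):
    """Per-pixel unit viewing rays from a 3x3 intrinsics matrix K.

    A hypothetical sketch of input-level geometric embeddings: each pixel
    is back-projected through K^{-1} to a direction in the camera frame,
    giving the network explicit knowledge of the camera geometry.
    """
    # Pixel centers in homogeneous image coordinates, shape (h, w, 3).
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project to camera-frame directions and normalize to unit length.
    rays = pix @ np.linalg.inv(K).T
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays  # (h, w, 3) unit ray directions

# Example: a small image with hypothetical intrinsics.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0,   0.0,  1.0]])
emb = ray_embeddings(K, 96, 128)
```

Because the rays depend only on the intrinsics, such an embedding lets a single network condition on arbitrary camera parameters at test time, which is one plausible mechanism behind the cross-camera metric transfer described above.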