IEBins: Iterative Elastic Bins for Monocular Depth Estimation (2309.14137v1)
Abstract: Monocular depth estimation (MDE) is a fundamental topic in geometric computer vision and a core technique for many downstream applications. Recently, several methods have reframed MDE as a classification-regression problem, in which depth is predicted as a linear combination of bin centers weighted by a predicted probability distribution. In this paper, we propose a novel concept of iterative elastic bins (IEBins) for classification-regression-based MDE. IEBins searches for high-quality depth by progressively optimizing the search range over multiple stages, where each stage performs a finer-grained depth search within the target bin selected by the previous stage. To alleviate possible error accumulation during the iterative process, we replace the original target bin with a novel elastic target bin whose width is adjusted elastically based on the depth uncertainty. Furthermore, we develop a dedicated framework composed of a feature extractor and an iterative optimizer, whose GRU-based architecture provides powerful temporal context modeling capabilities. Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is publicly available at https://github.com/ShuweiShao/IEBins.
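The search procedure described in the abstract can be illustrated for a single pixel with a minimal sketch. It is not the released implementation. Several details are assumptions made only for illustration: bins are uniform within the current range, the depth uncertainty is taken to be the standard deviation of the predicted distribution, the elastic target bin is taken to be the predicted depth plus or minus that uncertainty, and `score_fn` is a hypothetical stand-in for the network that maps bin centers to logits.

```python
import math

def uniform_bins(lo, hi, n):
    """Split [lo, hi] into n equal-width bins and return the bin centers."""
    width = (hi - lo) / n
    return [lo + (i + 0.5) * width for i in range(n)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(probs, centers):
    """Classification-regression depth: probability-weighted sum of bin centers."""
    return sum(p * c for p, c in zip(probs, centers))

def depth_std(probs, centers, depth):
    """Assumed uncertainty measure: standard deviation of the bin distribution."""
    return math.sqrt(sum(p * (c - depth) ** 2 for p, c in zip(probs, centers)))

def refine(lo, hi, n, score_fn, stages):
    """Iteratively narrow the search range around the predicted depth.

    score_fn(centers) -> logits is a hypothetical placeholder for the network.
    Returns one (depth, lo, hi) tuple per stage.
    """
    history = []
    for _ in range(stages):
        centers = uniform_bins(lo, hi, n)
        probs = softmax(score_fn(centers))
        depth = expected_depth(probs, centers)
        sigma = depth_std(probs, centers, depth)
        history.append((depth, lo, hi))
        # Assumed elastic target bin: depth +/- sigma, never narrower than
        # one current bin, and clipped to the current search range.
        half = max(sigma, (hi - lo) / (2 * n))
        lo, hi = max(lo, depth - half), min(hi, depth + half)
    return history

if __name__ == "__main__":
    true_depth = 3.7  # toy ground truth, used only by the toy scorer below
    toy_score = lambda centers: [-(c - true_depth) ** 2 for c in centers]
    for depth, lo, hi in refine(0.0, 10.0, 16, toy_score, stages=3):
        print(f"range=[{lo:.3f}, {hi:.3f}]  depth={depth:.3f}")
```

In this sketch, a confident (peaked) distribution yields a narrow next-stage range and an uncertain one yields a wider range, which is the intuition behind adjusting the target bin width by depth uncertainty.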