Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are Better Than One (2111.08313v2)
Abstract: Depth estimation attracts widespread attention in the computer vision community. However, recovering an accurate depth map from a single RGB image remains difficult. We observe that existing methods tend to fail in different cases, owing to differences in network architecture, loss function, and so on. In this work, we investigate this phenomenon and propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor, which is critical for many real-world applications, e.g., 3D reconstruction. Specifically, we construct multiple base (weak) depth predictors using different Transformer-based and convolutional neural network (CNN)-based architectures. Transformers establish long-range correlations, while CNNs, owing to their spatial inductive bias, preserve local information that Transformers ignore. The coupling of Transformer and CNN therefore yields complementary depth estimates, which are essential for a comprehensive depth predictor. We then design mixers that learn from the multiple weak predictions and adaptively fuse them into a strong depth estimate. We refer to the resultant model as Transformer-assisted depth ensembles (TEDepth). On the standard NYU-Depth-v2 and KITTI datasets, we thoroughly explore how neural ensembles affect depth estimation and demonstrate that TEDepth achieves better results than previous state-of-the-art approaches. To validate generalizability across cameras, we directly apply the models trained on NYU-Depth-v2 to the SUN RGB-D dataset without any fine-tuning; the superior results confirm the strong generalizability of our method.
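The fusion idea in the abstract — several weak depth predictions combined into one strong estimate by a mixer with per-pixel adaptive weights — can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names, the use of softmax-normalized confidence scores, and the list-of-lists map layout are all assumptions for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over one pixel's K mixer scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mix_pixel(depths, logits):
    """Fuse K depth hypotheses for one pixel with softmax confidence weights."""
    w = softmax(logits)
    return sum(wi * di for wi, di in zip(w, depths))

def mix_maps(depth_maps, logit_maps):
    """depth_maps, logit_maps: K maps, each H x W (lists of lists).
    Returns one H x W fused map, the confidence-weighted average per pixel."""
    K = len(depth_maps)
    H, W = len(depth_maps[0]), len(depth_maps[0][0])
    return [[mix_pixel([depth_maps[k][i][j] for k in range(K)],
                       [logit_maps[k][i][j] for k in range(K)])
             for j in range(W)] for i in range(H)]

# Toy check: with uniform mixer scores, the fusion reduces to a plain average
# of the weak predictions (here, two 1x2 depth maps of 1.0 and 3.0).
d = [[[1.0, 1.0]], [[3.0, 3.0]]]
z = [[[0.0, 0.0]], [[0.0, 0.0]]]
print(mix_maps(d, z))  # [[2.0, 2.0]]
```

In the paper's setting the mixer weights would themselves be learned from the weak predictions rather than given; the sketch only shows the fusion step once those scores exist.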
Authors: Shuwei Shao, Ran Li, Zhongcai Pei, Zhong Liu, Weihai Chen, Wentao Zhu, Xingming Wu, Baochang Zhang