MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation
Abstract: Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. Self-supervised learning (SSL), for instance, is particularly effective because it alleviates the need for large datasets with dense ground-truth depth maps. Despite these improvements, our study reveals that the later layers of the SOTA SSL method are suboptimal. By examining the layer-wise representations, we show that these later layers change significantly during fine-tuning, which indicates that their pre-trained features are ineffective for depth estimation. To address these limitations, we propose MeSa, a comprehensive framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. MeSa thus benefits not only from general-purpose representations learnt via masked pre-training but also from specialized depth-specific features acquired via geometric and supervised pre-training. Our centered kernel alignment (CKA) layer-wise analysis confirms that this pre-training strategy produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, experiments on the NYUv2 and IBims-1 datasets demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
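To make the layer-wise comparison concrete, the following is a minimal sketch of linear CKA between two sets of activations for the same layer, for example before and after fine-tuning. The abstract does not specify which CKA variant, layers, or inputs the analysis uses, so the linear formulation, the function name `linear_cka`, and the random arrays standing in for real activations are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Hypothetical helper: the rows of x and y must correspond to the same inputs,
    but the feature dimensions may differ.
    """
    # Center each feature so that the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 normalised by ||X^T X||_F * ||Y^T Y||_F.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for one layer's activations (assumption: not real model features).
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(256, 64))

# A layer that barely moves during fine-tuning: small perturbation of the original.
finetuned_similar = pretrained + 0.05 * rng.normal(size=(256, 64))

# A layer that changes substantially: unrelated activations.
finetuned_changed = rng.normal(size=(256, 64))

cka_similar = linear_cka(pretrained, finetuned_similar)
cka_changed = linear_cka(pretrained, finetuned_changed)
print(f"CKA, layer mostly preserved: {cka_similar:.3f}")
print(f"CKA, layer heavily changed:  {cka_changed:.3f}")
```

Under this reading, a low CKA score between the pre-trained and fine-tuned versions of a layer signals that fine-tuning had to rewrite that layer's features, which is the kind of evidence the abstract describes for the later layers.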