MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation

Published 6 Oct 2023 in cs.CV | (2310.04551v1)

Abstract: Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective because it alleviates the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate that these later layers change significantly during fine-tuning, indicating that their pre-trained features are ineffective for depth estimation. To address these limitations, we propose MeSa, a comprehensive framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. MeSa thus benefits not only from general-purpose representations learnt via masked pre-training but also from specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
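The layer-wise comparison mentioned in the abstract relies on CKA (centered kernel alignment), a similarity score between two sets of activations computed on the same inputs. The following is a minimal sketch of the linear variant, assuming activations are flattened to matrices of shape (samples, features). The abstract does not say which CKA variant the authors use, so the linear form, the function name `linear_cka`, and the synthetic arrays are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Both matrices must be computed on the same n_samples inputs; the feature
    dimensions may differ.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for one layer's activations before and after fine-tuning.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(256, 64))
finetuned_similar = pretrained + 0.1 * rng.normal(size=(256, 64))  # layer barely moved
finetuned_changed = rng.normal(size=(256, 64))                     # layer rewritten

cka_similar = linear_cka(pretrained, finetuned_similar)
cka_changed = linear_cka(pretrained, finetuned_changed)
```

Under this reading, a low score between a layer's pre-trained and fine-tuned activations would indicate that fine-tuning had to change that layer substantially, which is the kind of evidence the abstract describes for the later layers.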
