Probing the 3D Awareness of Visual Foundation Models (2404.08636v1)

Published 12 Apr 2024 in cs.CV

Abstract: Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.


Summary

  • The paper demonstrates that self-supervised models exhibit strong 3D surface property encoding, surpassing even some models trained for dedicated 3D tasks.
  • The paper reveals significant variability across models in depth and surface normal estimation and underscores that reliable multiview consistency remains a limitation of current models.
  • The paper highlights the need for improved training strategies and model architectures to enhance 3D representation without relying solely on direct 3D supervision.

Probing the Depths: Unveiling the 3D Awareness of Visual Foundation Models

Introduction to 3D Awareness in Visual Models

Recent advances in visual foundation models have demonstrated remarkable capabilities across a spectrum of tasks, including image classification, segmentation, and generation. A crucial aspect of understanding these models is evaluating how well they represent 3D properties: the underlying structure of the 3D world that images depict. Despite their strong generalization, the ability of these models to encode and understand 3D geometry remains relatively underexplored. Our investigation probes the 3D awareness of several large-scale pretrained visual models, analyzing their ability to support single-view surface reconstruction and to represent surfaces consistently across views.

Evaluating 3D-Aware Visual Representations

Understanding 3D structure from 2D images is a complex problem that has been studied extensively in both psychophysics and computer vision. Inspired by human perception, which encodes 3D properties such as depth and surface orientation, we define 3D-aware representations as those that encode these basic 3D properties and remain consistent across views. We therefore probe models on their capacity for depth estimation, surface normal estimation, and accurate correspondence across views, covering both the single-image and multiview aspects of 3D understanding.
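
To make this probing setup concrete, the sketch below shows a minimal dense probe over frozen features. It assumes a generic `backbone` module that maps an image to a grid of patch features; the probes actually used in the paper (e.g., DPT-style heads) may be more elaborate, so treat this as an illustration of the idea rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseDepthProbe(nn.Module):
    """Minimal dense probe: a 1x1 convolution over frozen patch features.

    `backbone` is assumed to map images (B, 3, H, W) to a feature grid
    (B, C, H/p, W/p); only the probe head is trained.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, patch_size: int = 14):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # the foundation model stays frozen
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # per-patch depth readout
        self.patch_size = patch_size

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(images)  # (B, C, h, w) frozen features
        depth = self.head(feats)           # (B, 1, h, w) linear readout
        # upsample the per-patch prediction back to pixel resolution
        return F.interpolate(depth, scale_factor=self.patch_size,
                             mode="bilinear", align_corners=False)
```

Training only the head with a standard depth loss while the backbone stays frozen isolates what the pretrained features already encode, which is the point of the probing methodology.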

Experimental Setup and Analysis

Our empirical analysis spans a variety of pretraining objectives, including models trained with classification, language supervision, and self-supervision, among others. We evaluate these models on well-established 3D benchmarks, measuring how well their frozen features support monocular depth estimation, surface normal estimation, and correspondence across views. Our findings reveal striking differences in the 3D awareness of these models. For instance, self-supervised models such as DINOv2 encode depth and surface normals well, whereas models trained with vision-language objectives perform significantly worse on these tasks.

Monocular Depth and Surface Normal Estimation

Probing for single-view 3D understanding reveals substantial variability in how well models represent depth and surface normals. Notably, DINOv2 and Stable Diffusion perform strongly, suggesting that their features capture surface properties effectively. Surprisingly, models trained specifically for depth estimation did not outperform these generalist self-supervised models, indicating that strong encoding of 3D surface properties does not require a training objective aimed at 3D tasks.
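
For reference, the snippet below sketches two metrics commonly used to score such probes: the delta-1 depth accuracy and the per-pixel angular error for surface normals. These are standard choices for these tasks; the paper's exact evaluation protocol may differ.

```python
import torch
import torch.nn.functional as F

def delta1_accuracy(pred_depth: torch.Tensor, gt_depth: torch.Tensor,
                    thresh: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) below `thresh`."""
    valid = gt_depth > 0
    ratio = torch.maximum(pred_depth[valid] / gt_depth[valid],
                          gt_depth[valid] / pred_depth[valid])
    return (ratio < thresh).float().mean().item()

def normal_angular_error(pred_n: torch.Tensor, gt_n: torch.Tensor) -> torch.Tensor:
    """Per-pixel angular error in degrees between normal maps of shape (..., 3)."""
    cos = F.cosine_similarity(pred_n, gt_n, dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))
```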

Multiview Consistency

In evaluating multiview consistency, we observed a marked performance degradation across models as the variation in viewpoints increased. This degradation suggests that while models can encode some aspects of 3D structure from single images, their representations often lack the consistency required for accurate 3D correspondence across different views. Models exhibiting strong single-view 3D understanding did not necessarily perform well in multiview consistency, highlighting a gap in current models' 3D awareness and representation capabilities.
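
A common way to test multiview consistency with frozen features is zero-shot matching by mutual nearest neighbours, sketched below under the assumption of L2-normalized patch descriptors; this illustrates the general recipe rather than the paper's exact procedure.

```python
import torch

def mutual_nn_matches(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """Zero-shot correspondence between two views from frozen patch features.

    feat_a (N, C) and feat_b (M, C) are L2-normalized descriptors; returns
    index pairs (i, j) that are mutual nearest neighbours under cosine
    similarity.
    """
    sim = feat_a @ feat_b.t()       # (N, M) cosine similarities
    nn_ab = sim.argmax(dim=1)       # best match in B for each patch in A
    nn_ba = sim.argmax(dim=0)       # best match in A for each patch in B
    idx_a = torch.arange(feat_a.shape[0], device=feat_a.device)
    mutual = nn_ba[nn_ab] == idx_a  # keep only mutually agreeing pairs
    return idx_a[mutual], nn_ab[mutual]
```

Scoring these matches against ground-truth geometry, and binning image pairs by the size of the viewpoint change, exposes the degradation described above.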

Theoretical and Practical Implications

Our findings underscore the nuanced nature of 3D awareness in visual foundation models and suggest several directions for future research. The variability in 3D representation quality across models trained with different objectives raises questions about how each training paradigm shapes 3D understanding. Furthermore, the difficulty of achieving multiview consistency points to improvements in model architectures and training strategies that could enhance 3D awareness without direct 3D supervision.

Conclusion and Future Directions

This paper presents a comprehensive evaluation of the 3D awareness of visual foundation models, highlighting significant variability in their ability to encode and understand 3D properties. Our findings suggest that despite their impressive capabilities in other domains, current models still face challenges in achieving true 3D awareness, particularly in representing consistent 3D geometry across views. These insights contribute to a deeper understanding of the capabilities and limitations of visual foundation models, paving the way for future research aimed at enhancing their 3D representation abilities.
