
Probing the 3D Awareness of Visual Foundation Models

Published 12 Apr 2024 in cs.CV (arXiv:2404.08636v1)

Abstract: Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.

References (102)
  1. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2021.
  2. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In ICCV, 2021.
  3. Shape matching and object recognition using low distortion correspondences. In CVPR, pages 26–33. Citeseer, 2005.
  4. Adabins: Depth estimation using adaptive bins. In CVPR, 2021.
  5. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  6. Stylegan knows normal, depth, albedo, and more. arXiv preprint arXiv:2306.00987, 2023.
  7. Thomas O. Binford. Visual perception by computer. In Proceedings of the IEEE Conference on Systems and Control, 1971.
  8. Rodney A Brooks. Symbolic reasoning among 3-d models and 2-d images. Artificial intelligence, 17(1-3):285–348, 1981.
  9. Emerging properties in self-supervised vision transformers. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
  10. Muse: Text-to-image generation via masked generative transformers. In ICML, 2023.
  11. Return of the devil in the details: delving deep into convolutional nets. In BMVC, 2014.
  12. Beyond surface statistics: Scene representations in a latent diffusion model. arXiv preprint arXiv:2306.05720, 2023.
  13. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  14. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
  15. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Represent., 2020.
  16. Generative models: What do they know? do they know things? let’s find out! arXiv, 2023.
  17. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  18. Bootstrap Your Own Correspondences. In ICCV, 2021.
  19. UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7129–7139, 2021.
  20. Learning Visual Representations via Language-Guided Sampling. In CVPR, 2023.
  21. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  22. David Fouhey. Factoring Scenes into 3D Structure and Style. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2016.
  23. Single image 3D without a single 3D image. In ICCV, 2015.
  24. Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks. In NeurIPS Datasets and Benchmarks Track, 2023.
  25. Zero-shot category-level object pose estimation. In ECCV, 2022.
  26. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
  27. Asic: Aligning sparse in-the-wild image collections. arXiv preprint arXiv:2303.16201, 2023.
  28. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
  29. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  30. Unsupervised semantic correspondence using stable diffusion. arXiv preprint arXiv:2305.15581, 2023.
  31. OpenCLIP, 2021.
  32. Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations. In NeurIPS Datasets and Benchmarks Track, 2023.
  33. Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145, 2023.
  34. Dag: Depth-aware guidance with denoising diffusion probabilistic models. arXiv preprint arXiv:2212.08861, 2022.
  35. Adam: A method for stochastic optimization. In ICLR, 2015.
  36. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  37. The singularities of the visual mapping. Biological cybernetics, 24(1):51–59, 1976.
  38. The internal representation of solid shape with respect to vision. Biological cybernetics, 32(4):211–216, 1979.
  39. Surface shape and curvature scales. Image and vision computing, 10(8):557–564, 1992.
  40. Pictorial surface attitude and local depth comparisons. Perception & Psychophysics, 58(2):163–173, 1996.
  41. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2661–2671, 2019.
  42. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.
  43. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
  44. Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022.
  45. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2206–2217, 2023.
  46. Align before fuse: Vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems, 2021.
  47. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
  48. Localization vs. semantics: Visual representations in unimodal and multimodal models. In EACL, 2024.
  49. On the variance of the adaptive learning rate and beyond. In ICLR, 2020.
  50. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  51. A convnet for the 2020s. In CVPR, 2022.
  52. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  53. David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  54. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. arXiv, 2023.
  55. A computational theory of human stereo vision. Royal Society of London, 1979.
  56. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
  57. Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019.
  58. Slip: Self-supervision meets language-image pre-training. In Eur. Conf. Comput. Vis., 2022.
  59. Visual discrimination of local surface structure: Slant, tilt, and curvedness. Vision research, 46(6-7):1057–1069, 2006.
  60. DINOv2: Learning Robust Visual Features without Supervision, 2023.
  61. idisc: Internal discretization for monocular depth estimation. In CVPR, 2023.
  62. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2022.
  63. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  64. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  65. Learning transferable visual models from natural language supervision. In Int. Conf. Machine Learning, 2021.
  66. Dreambooth3d: Subject-driven text-to-3d generation. ICCV, 2023.
  67. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2020.
  68. Vision transformers for dense prediction. In ICCV, 2021.
  69. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  70. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  71. Shadows don’t lie and lines can’t bend! generative models don’t know projective geometry…for now. 2023.
  72. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
  73. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  74. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vis., 2017.
  75. Second-order isomorphism of internal representations: Shapes of states. Cognitive psychology, 1(1):1–17, 1970.
  76. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.
  77. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  78. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  79. Beyond core knowledge: Natural geometry. Cognitive science, 34(5):863–884, 2010.
  80. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
  81. Reclip: A strong zero-shot baseline for referring expression comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
  82. LoFTR: Detector-free local feature matching with transformers. CVPR, 2021.
  83. Emergent correspondence from image diffusion. arXiv preprint arXiv:2306.03881, 2023.
  84. What do single-view 3d reconstruction networks learn? In CVPR, 2019.
  85. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022.
  86. DeiT III: Revenge of the ViT. In ECCV, 2022.
  87. Teaching matters: Investigating the role of supervision in vision transformers. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.
  88. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
  89. Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  90. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.
  91. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In CVPR, 2023.
  92. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
  93. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. In CVPR, 2023.
  94. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023.
  95. ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. In CVPR, 2023.
  96. Metric3d: Towards zero-shot metric 3d prediction from a single image. In ICCV, 2023.
  97. Sigmoid loss for language image pre-training. ICCV, 2023.
  98. What does stable diffusion know about the 3d scene?, 2023.
  99. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. In NeurIPS, 2023.
  100. Self-supervised geometric correspondence for category-level 6d object pose estimation in the wild. In ICLR, 2023.
  101. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023.
  102. ibot: Image bert pre-training with online tokenizer. International Conference on Learning Representations (ICLR), 2022.

Summary

  • The paper’s main contribution is probing the 3D awareness of pretrained visual models using task-specific probes and zero-shot inference.
  • It employs depth estimation, surface normal estimation, and correspondence evaluations to reveal strengths and limitations, with models such as DINOv2 excelling at capturing fine detail.
  • The study highlights that while models capture surface properties, they generally struggle to maintain true 3D consistency across varying viewpoints.

3D Awareness in Visual Foundation Models

The paper "Probing the 3D Awareness of Visual Foundation Models" (2404.08636) investigates the extent to which visual foundation models, pretrained on large-scale image datasets, capture the 3D structure of scenes and objects. The central hypothesis is that 3D awareness manifests in two key capabilities: the ability to reconstruct the 3D geometry of a scene from a single view and the consistency of representations across different views. The study employs task-specific probes and zero-shot inference procedures on frozen features to evaluate a diverse range of models.

Evaluation Methodology

The research evaluates models on their ability to estimate depth, surface normals, and 3D correspondence. These tasks are assessed at both the scene level, using the NYUv2 dataset [silberman2012indoor], and the object level, using the NAVI dataset [jampani2023navi], to provide a comprehensive analysis. Models include those trained via classification, language supervision, self-supervision, text-conditioned image generation, depth estimation, and class-agnostic segmentation. The models' aggregated performance on single-image and multiview tasks is shown in Figure 1.

Figure 1: Are current visual foundation models 3D aware? We probe the 3D awareness of the learned representations by evaluating their ability to encode the 3D structure of the visible surface and their consistency across views.

Rather than evaluating transferability, the paper probes frozen representations through task-specific probes or zero-shot inference methods. This evaluates the pretrained representations directly, instead of measuring how well the pretrained weights transfer when fine-tuned. The single-image surface reconstruction probes and the zero-shot multiview consistency evaluation are described in the following sections.
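
As a rough illustration of this setup, here is a minimal sketch of probing frozen features, assuming a DINOv2 backbone loaded from torch.hub and a toy single-layer depth probe. The paper's actual probes are dense multiscale heads, so the probe architecture, optimizer, and loss below are illustrative only.

```python
# Minimal frozen-feature probing sketch (assumed details: backbone choice,
# single-layer probe, AdamW, L1 loss). Only the probe is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                     # features stay frozen

probe = nn.Conv2d(768, 1, kernel_size=1)        # toy dense probe: per-patch depth
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)

def training_step(images, depth_targets):
    # images: (B, 3, H, W) with H, W multiples of the 14-pixel patch size
    with torch.no_grad():                       # no gradients flow into the backbone
        feats = backbone.get_intermediate_layers(images, n=1, reshape=True)[0]
    pred = probe(feats)                         # (B, 1, H/14, W/14)
    target = F.interpolate(depth_targets, size=pred.shape[-2:])
    loss = F.l1_loss(pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```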

Single-View 3D Understanding

Single-view 3D understanding is assessed through monocular depth estimation and surface normal estimation. The former predicts the depth for each pixel in an image, while the latter predicts the orientation of the surface at each pixel.

Figure 2: Depth Estimation Results. While pretrained representations exhibit large variation in their ability to represent depth, their performance is consistent across objects and scenes. CLIP and MAE features do not encode depth and appear to instead capture rough priors such as "floor pixels are close". Most models appear to capture the rough structure of the scene and vary in the degree to which they capture details. DINOv2 performs best and accurately captures fine details, e.g., the cow's ear, desk chair, and coffee table.

A dense multiscale probe, similar to the DPT decoder [ranftl2021dpt], is used to map features from multiple layers to depth or surface normals. This approach deviates from linear probing to account for the potentially non-linear encoding of 3D properties across different network layers. Root-mean-square prediction error and recall at different thresholds are the primary metrics. The ability of models to encode depth varies widely: DINOv2 and StableDiffusion produce detailed depth maps, while CLIP and MAE generate blurry estimates. Similarly, the surface normal probes reveal that some models capture fine details, while others rely on coarse priors. Qualitative surface normal examples are shown in Figure 3.

Figure 3: Surface Normal Qualitative Examples. With the exception of CLIP, models can capture the rough orientation of object and scene surfaces, e.g., floors, walls, and ceilings. The main distinction seems to be in how well they capture finer details. As with the depth results, we find that DINOv2 and StableDiffusion perform best and can capture fine details such as the edges of the toy car and the white seat. Surprisingly, we find that SAM's predictions are not as detailed despite its ability to predict accurate segmentation boundaries.
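
For concreteness, here is a sketch of the metric definitions referenced above: root-mean-square error and threshold recall for depth, and angular error for surface normals. The specific threshold values are the conventional ones and are an assumption here, not necessarily those used in the paper.

```python
# Sketch of standard depth and surface-normal metrics (threshold values assumed).
import torch

def depth_metrics(pred, gt, thresholds=(1.25, 1.25**2, 1.25**3)):
    """pred, gt: (N,) positive depths at valid pixels."""
    rmse = torch.sqrt(torch.mean((pred - gt) ** 2))
    ratio = torch.max(pred / gt, gt / pred)            # scale-symmetric ratio
    recall = {t: (ratio < t).float().mean().item() for t in thresholds}
    return rmse.item(), recall

def normal_angular_error(pred, gt):
    """pred, gt: (N, 3) unit normals; per-pixel angular error in degrees."""
    cos = (pred * gt).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))
```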

Performance is strongly correlated across domains and tasks (Figure 4), suggesting that the probes measure a single underlying capability. Discriminative self-supervised models perform best, followed by StableDiffusion, while language-supervised models perform poorly; this could be attributed to vision-language models struggling with spatial relations and compositionality [lewis2022does, subramanian2022reclip, li2024localizationvssemantics].

Figure 4: Single view performance correlation. Depth and surface normal performance is highly correlated across domains.

Multiview Consistency Assessment

Multiview consistency is evaluated using correspondence estimation, where the goal is to identify image patches across views that depict the same 3D point. This is performed using Paired ScanNet [dai2017scannet, sarlin2020superglue] for scenes and the NAVI wild set for objects. Rather than training a probe, the approach computes correspondence between dense feature maps to evaluate representation consistency directly.

Figure 5: Correspondence Estimation Qualitative Results. We observe that models can estimate accurate correspondence for small viewpoint changes, but struggle with large viewpoint changes. This is true even if the change is an in-plane rotation as shown with the eagle. This pattern is consistent for both objects and scenes, although performance is not well correlated: SAM and StableDiffusion perform better for scenes, while DeiT and DINOv2 are more consistent for objects. Correspondence color-coded for accuracy.
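
A minimal sketch of the kind of zero-shot matching this implies: dense patch features from two views are L2-normalized and matched by nearest neighbor, here with an additional mutual-consistency check. The mutual nearest-neighbor criterion is an illustrative assumption; the paper's exact matching protocol may differ.

```python
# Zero-shot correspondence from frozen features via mutual nearest neighbors.
import torch
import torch.nn.functional as F

def mutual_nn_correspondence(feat_a, feat_b):
    """feat_a: (Na, C), feat_b: (Nb, C) dense patch features from two views."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.t()                       # cosine similarity between all patch pairs
    nn_ab = sim.argmax(dim=1)             # best match in B for each patch in A
    nn_ba = sim.argmax(dim=0)             # best match in A for each patch in B
    idx_a = torch.arange(a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a        # keep matches that agree in both directions
    return idx_a[mutual], nn_ab[mutual]   # index pairs of mutually consistent matches
```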

While models can estimate accurate correspondence for small viewpoint changes, performance deteriorates rapidly for larger changes, as seen in Figure 6, which bins multiview performance by the magnitude of the viewpoint change. StableDiffusion and SAM experience sharp performance drops, while DINOv2 and DeiT remain more consistent across a wider range of baselines. The results suggest that current models are not 3D consistent, despite encoding surface properties.

Figure 6: While all models experience performance drops with larger viewpoint changes, some experience sharper drops suggesting a lack of 3D awareness.
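
The binning used in Figure 6 amounts to a small amount of bookkeeping over image pairs; in this sketch the bin edges and the correctness threshold are assumptions for illustration.

```python
# Recall binned by relative viewpoint change (bin edges and threshold assumed).
import numpy as np

def binned_recall(angle_deg, error_px, bins=(0, 15, 30, 60, 90, 180), thresh_px=10):
    """angle_deg: relative viewpoint change per image pair; error_px: matching error."""
    angle_deg, error_px = np.asarray(angle_deg), np.asarray(error_px)
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (angle_deg >= lo) & (angle_deg < hi)
        out[f"{lo}-{hi} deg"] = float((error_px[mask] < thresh_px).mean()) if mask.any() else None
    return out
```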

The paper highlights a distinction between semantic and geometric correspondence. While models excel at semantic correspondence [amir2021deep, zhang2023tale, tang2023dift], as shown in Figure 7, they exhibit systematic errors when viewing objects from different viewpoints, suggesting that they encode a combination of semantic and 2D location information rather than 3D structure.

Figure 7: Semantic Correspondence. StableDiffusion represents semantics well but lacks 3D consistency. This results in accurate correspondence for objects viewed from similar angles and systematic errors when viewing objects from different viewpoints.

Cross-Task Analysis

The study computes correlations between models' aggregated performance across multiple tasks to understand the relationships between different tasks and training objectives. As shown in Figure 8, performance on single-view tasks is strongly correlated with itself and with semantic correspondence, but the correlation drops for scene-level correspondence estimation and for correspondence estimation under large viewpoint variations. This further supports the claim that semantic correspondence is not a reliable measure of 3D consistency.

Figure 8: Cross-task performance correlation. Performance on single-view tasks is strongly correlated with itself as well as with semantic correspondence, but correlation drops for scene-level correspondence estimation and for correspondence estimation with large viewpoint variation.
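
A sketch of the kind of cross-task analysis this describes: per-model scores on each task are rank-correlated against one another to produce a Figure 8-style matrix. The task layout and scores below are placeholders, not results from the paper.

```python
# Cross-task rank correlation over per-model scores (placeholder data).
import numpy as np
from scipy.stats import spearmanr

# rows = models, columns = tasks (e.g. depth, normals, correspondence)
scores = np.array([
    [0.8, 0.7, 0.5],
    [0.6, 0.6, 0.2],
    [0.4, 0.5, 0.3],
])
n_tasks = scores.shape[1]
corr = np.zeros((n_tasks, n_tasks))
for i in range(n_tasks):
    for j in range(n_tasks):
        rho, _ = spearmanr(scores[:, i], scores[:, j])   # rank correlation of model scores
        corr[i, j] = rho
print(corr)   # cross-task correlation matrix
```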

Conclusion

The paper concludes that, with the exception of vision-language models, visual foundation models learn representations that encode properties of the visible surface. However, the models struggle with multiview consistency, suggesting their representations are view-dependent rather than 3D-consistent. This could be because the models learn genuinely view-dependent features, or because current models are simply good "image models" for which strong discriminative features are sufficient for 2.5D understanding. Future research could investigate more complex and higher-order tasks related to 3D awareness. Overall, the findings underscore the importance of considering 3D awareness in the design and evaluation of visual representation learning approaches.
