
Probing the 3D Awareness of Visual Foundation Models

Published 12 Apr 2024 in cs.CV (arXiv:2404.08636v1)

Abstract: Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.

References (102)
  1. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2021.
  2. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In ICCV, 2021.
  3. Shape matching and object recognition using low distortion correspondences. In CVPR, pages 26–33. Citeseer, 2005.
  4. Adabins: Depth estimation using adaptive bins. In CVPR, 2021.
  5. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  6. Stylegan knows normal, depth, albedo, and more. arXiv preprint arXiv:2306.00987, 2023.
  7. Thomas O. Binford. Visual perception by computer. In Proceedings of the IEEE Conference on Systems and Control, 1971.
  8. Rodney A Brooks. Symbolic reasoning among 3-d models and 2-d images. Artificial intelligence, 17(1-3):285–348, 1981.
  9. Emerging properties in self-supervised vision transformers. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
  10. Muse: Text-to-image generation via masked generative transformers. In ICML, 2023.
  11. Return of the devil in the details: delving deep into convolutional nets. In BMVC, 2014.
  12. Beyond surface statistics: Scene representations in a latent diffusion model. arXiv preprint arXiv:2306.05720, 2023.
  13. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  14. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
  15. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Represent., 2020.
  16. Generative models: What do they know? do they know things? let’s find out! arXiv, 2023.
  17. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  18. Bootstrap Your Own Correspondences. In ICCV, 2021.
  19. UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7129–7139, 2021.
  20. Learning Visual Representations via Language-Guided Sampling. In CVPR, 2023.
  21. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  22. David Fouhey. Factoring Scenes into 3D Structure and Style. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2016.
  23. Single image 3D without a single 3D image. In ICCV, 2015.
  24. Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks. In NeurIPS Datasets and Benchmarks Track, 2023.
  25. Zero-shot category-level object pose estimation. In ECCV, 2022.
  26. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
  27. Asic: Aligning sparse in-the-wild image collections. arXiv preprint arXiv:2303.16201, 2023.
  28. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
  29. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  30. Unsupervised semantic correspondence using stable diffusion. arXiv preprint arXiv:2305.15581, 2023.
  31. OpenCLIP, 2021.
  32. Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations. In NeurIPS Datasets and Benchmarks Track, 2023.
  33. Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145, 2023.
  34. Dag: Depth-aware guidance with denoising diffusion probabilistic models. arXiv preprint arXiv:2212.08861, 2022.
  35. Adam: A method for stochastic optimization. In ICLR, 2015.
  36. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  37. The singularities of the visual mapping. Biological cybernetics, 24(1):51–59, 1976.
  38. The internal representation of solid shape with respect to vision. Biological cybernetics, 32(4):211–216, 1979.
  39. Surface shape and curvature scales. Image and vision computing, 10(8):557–564, 1992.
  40. Pictorial surface attitude and local depth comparisons. Perception & Psychophysics, 58(2):163–173, 1996.
  41. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2661–2671, 2019.
  42. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.
  43. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
  44. Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022.
  45. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2206–2217, 2023.
  46. Align before fuse: Vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems, 2021.
  47. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
  48. Localization vs. semantics: Visual representations in unimodal and multimodal models. In EACL, 2024.
  49. On the variance of the adaptive learning rate and beyond. In ICLR, 2020.
  50. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  51. A convnet for the 2020s. In CVPR, 2022.
  52. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  53. David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  54. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. arXiv, 2023.
  55. A computational theory of human stereo vision. Royal Society of London, 1979.
  56. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
  57. Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019.
  58. Slip: Self-supervision meets language-image pre-training. In Eur. Conf. Comput. Vis., 2022.
  59. Visual discrimination of local surface structure: Slant, tilt, and curvedness. Vision research, 46(6-7):1057–1069, 2006.
  60. DINOv2: Learning Robust Visual Features without Supervision, 2023.
  61. idisc: Internal discretization for monocular depth estimation. In CVPR, 2023.
  62. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2022.
  63. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  64. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  65. Learning transferable visual models from natural language supervision. In Int. Conf. Machine Learning, 2021.
  66. Dreambooth3d: Subject-driven text-to-3d generation. ICCV, 2023.
  67. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2020.
  68. Vision transformers for dense prediction. In ICCV, 2021.
  69. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  70. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  71. Shadows don’t lie and lines can’t bend! generative models don’t know projective geometry…for now. 2023.
  72. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
  73. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  74. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vis., 2017.
  75. Second-order isomorphism of internal representations: Shapes of states. Cognitive psychology, 1(1):1–17, 1970.
  76. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.
  77. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  78. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  79. Beyond core knowledge: Natural geometry. Cognitive science, 34(5):863–884, 2010.
  80. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
  81. Reclip: A strong zero-shot baseline for referring expression comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
  82. LoFTR: Detector-free local feature matching with transformers. CVPR, 2021.
  83. Emergent correspondence from image diffusion. arXiv preprint arXiv:2306.03881, 2023.
  84. What do single-view 3d reconstruction networks learn? In CVPR, 2019.
  85. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022.
  86. DeiT III: Revenge of the ViT. In ECCV, 2022.
  87. Teaching matters: Investigating the role of supervision in vision transformers. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.
  88. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
  89. Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  90. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.
  91. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In CVPR, 2023.
  92. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
  93. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. In CVPR, 2023.
  94. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023.
  95. ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. In CVPR, 2023.
  96. Metric3d: Towards zero-shot metric 3d prediction from a single image. In ICCV, 2023.
  97. Sigmoid loss for language image pre-training. ICCV, 2023.
  98. What does stable diffusion know about the 3d scene?, 2023.
  99. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. In NeurIPS, 2023.
  100. Self-supervised geometric correspondence for category-level 6d object pose estimation in the wild. In ICLR, 2023.
  101. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023.
  102. ibot: Image bert pre-training with online tokenizer. International Conference on Learning Representations (ICLR), 2022.

Summary

  • The paper’s main contribution is probing the 3D awareness of pretrained visual models using task-specific probes and zero-shot inference.
  • It employs depth estimation, surface normal estimation, and correspondence evaluations to reveal strengths and limitations, with models such as DINOv2 excelling at capturing fine detail.
  • The study highlights that while models capture surface properties, they generally struggle to maintain true 3D consistency across varying viewpoints.

3D Awareness in Visual Foundation Models

The paper "Probing the 3D Awareness of Visual Foundation Models" (2404.08636) investigates the extent to which visual foundation models, pretrained on large-scale image datasets, capture the 3D structure of scenes and objects. The central hypothesis is that 3D awareness manifests in two key capabilities: the ability to reconstruct the 3D geometry of a scene from a single view and the consistency of representations across different views. The study employs task-specific probes and zero-shot inference procedures on frozen features to evaluate a diverse range of models.

Evaluation Methodology

The research evaluates models on their ability to estimate depth, surface normals, and 3D correspondence. These tasks are assessed at both the scene level, using the NYUv2 dataset [silberman2012indoor], and the object level, using the NAVI dataset [jampani2023navi], to provide a comprehensive analysis. Models include those trained via classification, language supervision, self-supervision, text-conditioned image generation, depth estimation, and class-agnostic segmentation. The models' aggregated performance on single-image and multiview tasks is shown in Figure 1.

Figure 1: Are current visual foundation models 3D aware? We probe the 3D awareness of the learned representations by evaluating their ability to encode the 3D structure of the visible surface and their consistency across views.

Rather than evaluating transferability, the paper probes frozen representations through task-specific probes or zero-shot inference methods. This evaluates the pretrained representations directly, instead of measuring how well the pretrained weights transfer when fine-tuned. The single-image surface reconstruction probes and the zero-shot multiview consistency evaluation are described in the following sections.
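
As a rough illustration of this setup, here is a minimal sketch of probing frozen features, assuming a DINOv2 backbone loaded from torch.hub and a toy single-layer depth probe. The paper's actual probes are dense multiscale heads, so the probe architecture, optimizer, and loss below are illustrative only.

```python
# Minimal frozen-feature probing sketch (assumed details: backbone choice,
# single-layer probe, AdamW, L1 loss). Only the probe is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                     # features stay frozen

probe = nn.Conv2d(768, 1, kernel_size=1)        # toy dense probe: per-patch depth
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)

def training_step(images, depth_targets):
    # images: (B, 3, H, W) with H, W multiples of the 14-pixel patch size
    with torch.no_grad():                       # no gradients flow into the backbone
        feats = backbone.get_intermediate_layers(images, n=1, reshape=True)[0]
    pred = probe(feats)                         # (B, 1, H/14, W/14)
    target = F.interpolate(depth_targets, size=pred.shape[-2:])
    loss = F.l1_loss(pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```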

Single-View 3D Understanding

Single-view 3D understanding is assessed through monocular depth estimation and surface normal estimation. The former predicts the depth for each pixel in an image, while the latter predicts the orientation of the surface at each pixel.

Figure 2: Depth Estimation Results. While pretrained representations exhibit large variation in their ability to represent depth, their performance is consistent across objects and scenes. CLIP and MAE features do not encode depth and appear to instead capture rough priors such as "floor pixels are close". Most models appear to capture the rough structure of the scene and vary in the degree to which they capture details. DINOv2 performs best and accurately captures fine details, e.g., the cow's ear, desk chair, and coffee table.

A dense multiscale probe, similar to the DPT decoder [ranftl2021dpt], is used to map features from multiple layers to depth or surface normals. This approach deviates from linear probing to account for the potentially non-linear encoding of 3D properties across different network layers. Root-mean-square prediction error and recall at different thresholds are the primary metrics. The ability of models to encode depth varies widely: DINOv2 and StableDiffusion produce detailed depth maps, while CLIP and MAE generate blurry estimates. Similarly, the surface normal probes reveal that some models capture fine details, while others rely on coarse priors. Qualitative surface normal examples are shown in Figure 3.

Figure 3: Surface Normal Qualitative Examples. With the exception of CLIP, models can capture the rough orientation of object and scene surfaces, e.g., floors, walls, and ceilings. The main distinction seems to be in how well they capture finer details. As with the depth results, we find that DINOv2 and StableDiffusion perform best and can capture fine details such as the edges of the toy car and the white seat. Surprisingly, we find that SAM's predictions are not as detailed despite its ability to predict accurate segmentation boundaries.
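
For concreteness, here is a sketch of the metric definitions referenced above: root-mean-square error and threshold recall for depth, and angular error for surface normals. The specific threshold values are the conventional ones and are an assumption here, not necessarily those used in the paper.

```python
# Sketch of standard depth and surface-normal metrics (threshold values assumed).
import torch

def depth_metrics(pred, gt, thresholds=(1.25, 1.25**2, 1.25**3)):
    """pred, gt: (N,) positive depths at valid pixels."""
    rmse = torch.sqrt(torch.mean((pred - gt) ** 2))
    ratio = torch.max(pred / gt, gt / pred)            # scale-symmetric ratio
    recall = {t: (ratio < t).float().mean().item() for t in thresholds}
    return rmse.item(), recall

def normal_angular_error(pred, gt):
    """pred, gt: (N, 3) unit normals; per-pixel angular error in degrees."""
    cos = (pred * gt).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))
```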

Performance is strongly correlated across domains and tasks (Figure 4), suggesting that the probes measure a single underlying capability. Discriminative self-supervised models perform best, followed by StableDiffusion, while language-supervised models perform poorly; this could be attributed to vision-language models struggling with spatial relations and compositionality [lewis2022does, subramanian2022reclip, li2024localizationvssemantics].

Figure 4: Single view performance correlation. Depth and surface normal performance is highly correlated across domains.

Multiview Consistency Assessment

Multiview consistency is evaluated using correspondence estimation, where the goal is to identify image patches across views that depict the same 3D point. This is performed using Paired ScanNet [dai2017scannet, sarlin2020superglue] for scenes and the NAVI wild set for objects. Rather than training a probe, the approach computes correspondence between dense feature maps to evaluate representation consistency directly.

Figure 5: Correspondence Estimation Qualitative Results. We observe that models can estimate accurate correspondence for small viewpoint changes, but struggle with large viewpoint changes. This is true even if the change is an in-plane rotation as shown with the eagle. This pattern is consistent for both objects and scenes, although performance is not well correlated: SAM and StableDiffusion perform better for scenes, while DeiT and DINOv2 are more consistent for objects. Correspondence color-coded for accuracy.
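
A minimal sketch of the kind of zero-shot matching this implies: dense patch features from two views are L2-normalized and matched by nearest neighbor, here with an additional mutual-consistency check. The mutual nearest-neighbor criterion is an illustrative assumption; the paper's exact matching protocol may differ.

```python
# Zero-shot correspondence from frozen features via mutual nearest neighbors.
import torch
import torch.nn.functional as F

def mutual_nn_correspondence(feat_a, feat_b):
    """feat_a: (Na, C), feat_b: (Nb, C) dense patch features from two views."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.t()                       # cosine similarity between all patch pairs
    nn_ab = sim.argmax(dim=1)             # best match in B for each patch in A
    nn_ba = sim.argmax(dim=0)             # best match in A for each patch in B
    idx_a = torch.arange(a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a        # keep matches that agree in both directions
    return idx_a[mutual], nn_ab[mutual]   # index pairs of mutually consistent matches
```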

While models can estimate accurate correspondence for small viewpoint changes, performance deteriorates rapidly for larger changes, as seen in Figure 6, which bins multiview performance by the magnitude of the viewpoint change. StableDiffusion and SAM experience sharp performance drops, while DINOv2 and DeiT remain more consistent across a wider range of baselines. The results suggest that current models are not 3D consistent, despite encoding surface properties.

Figure 6: While all models experience performance drops with larger viewpoint changes, some experience sharper drops suggesting a lack of 3D awareness.
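
The binning used in Figure 6 amounts to a small amount of bookkeeping over image pairs; in this sketch the bin edges and the correctness threshold are assumptions for illustration.

```python
# Recall binned by relative viewpoint change (bin edges and threshold assumed).
import numpy as np

def binned_recall(angle_deg, error_px, bins=(0, 15, 30, 60, 90, 180), thresh_px=10):
    """angle_deg: relative viewpoint change per image pair; error_px: matching error."""
    angle_deg, error_px = np.asarray(angle_deg), np.asarray(error_px)
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (angle_deg >= lo) & (angle_deg < hi)
        out[f"{lo}-{hi} deg"] = float((error_px[mask] < thresh_px).mean()) if mask.any() else None
    return out
```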

The paper highlights a distinction between semantic and geometric correspondence. While models excel at semantic correspondence [amir2021deep, zhang2023tale, tang2023dift], as shown in Figure 7, they exhibit systematic errors when viewing objects from different viewpoints, suggesting that they encode a combination of semantic and 2D location information rather than 3D structure.

Figure 7: Semantic Correspondence. StableDiffusion represents semantics well but lacks 3D consistency. This results in accurate correspondence for objects viewed from similar angles and systematic errors when viewing objects from different viewpoints.

Cross-Task Analysis

The study computes correlations between models' aggregated performance across multiple tasks to understand the relationships between different tasks and training objectives. As shown in Figure 8, performance on single-view tasks is strongly correlated with itself and with semantic correspondence, but the correlation drops for scene-level correspondence estimation and for correspondence estimation under large viewpoint variations. This further supports the claim that semantic correspondence is not a reliable measure of 3D consistency.

Figure 8: Cross-task performance correlation. Performance on single-view tasks is strongly correlated with itself as well as with semantic correspondence, but correlation drops for scene-level correspondence estimation and for correspondence estimation with large viewpoint variation.
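
A sketch of the kind of cross-task analysis this describes: per-model scores on each task are rank-correlated against one another to produce a Figure 8-style matrix. The task layout and scores below are placeholders, not results from the paper.

```python
# Cross-task rank correlation over per-model scores (placeholder data).
import numpy as np
from scipy.stats import spearmanr

# rows = models, columns = tasks (e.g. depth, normals, correspondence)
scores = np.array([
    [0.8, 0.7, 0.5],
    [0.6, 0.6, 0.2],
    [0.4, 0.5, 0.3],
])
n_tasks = scores.shape[1]
corr = np.zeros((n_tasks, n_tasks))
for i in range(n_tasks):
    for j in range(n_tasks):
        rho, _ = spearmanr(scores[:, i], scores[:, j])   # rank correlation of model scores
        corr[i, j] = rho
print(corr)   # cross-task correlation matrix
```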

Conclusion

The paper concludes that, with the exception of vision-language models, visual foundation models learn representations that encode properties of the visible surface. However, the models struggle with multiview consistency, suggesting their representations are view-dependent rather than 3D-consistent. This could be because the models learn genuinely view-dependent features, or because current models are simply good "image models" for which strong discriminative features are sufficient for 2.5D understanding. Future research could investigate more complex and higher-order tasks related to 3D awareness. Overall, the findings underscore the importance of considering 3D awareness in the design and evaluation of visual representation learning approaches.
