
Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis (2512.11574v1)

Published 12 Dec 2025 in cs.CV

Abstract: Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images from objects in specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval .

Summary

  • The paper introduces a memory-based multi-view segmentation benchmark to evaluate 3D scene understanding with mean intersection-over-union as the core metric.
  • It demonstrates that self-supervised ViT encoders, especially DINOv3 and DINOv2, maintain robust spatial consistency under significant angular shifts.
  • The study reveals key memory-performance trade-offs and exposes limitations in geometry-grounded models like VGGT when applied to multi-view settings.

Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis

Introduction and Motivation

The paper targets the underexplored issue of viewpoint robustness and 3D scene understanding in vision foundation models, especially those relying on ViT-based self-supervised and multimodal encoders. Conventional benchmarks mainly assess single-view recognition or dense view synthesis, neglecting pixelwise semantic consistency as the viewpoint shifts. Catastrophic failures in recognition due to angular deviations are well-documented, yet standard evaluation rarely isolates this geometric vulnerability. The authors' major contribution is a systematic, in-context, memory-based segmentation benchmark on Multi-View ImageNet (MVImgNet), extending the Hummingbird paradigm of non-parametric, prompt-based scene understanding from 2D to structured multi-view 3D scenarios. This approach bypasses finetuning and decoders, directly probing encoder-intrinsic spatial robustness under controlled angular displacements.

Experimental Methodology

The experimental setup leverages the Hummingbird retrieval framework in combination with COLMAP-binned MVImgNet object categories, assigning images to angular bins spanning 0° to 90° in 15° steps. Model feature memory banks store reference-angled support images with dense masks; query masks are inferred by k-NN patch feature retrieval under varied angular separations. Models evaluated include self-supervised (DINO, DINOv2, DINOv3), multimodal (CLIP, SigLIP2), mixed-supervised (TIPS, C-RADIOv2), and an explicitly geometry-grounded transformer (VGGT). Memory sizes are scaled to examine sample efficiency and computational trade-offs.
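The retrieval step at the heart of this evaluation is non-parametric: dense patch features of the query image are matched against a memory bank of support-image patch features, and the neighbours' mask labels are propagated to each query patch. Below is a minimal sketch of such a softmax-weighted k-NN vote; the tensor shapes, the `encoder` callable, and the specific weighting scheme are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_label_propagation(encoder, query_img, memory_feats, memory_labels,
                          k=30, temperature=0.02):
    """Infer a query segmentation by k-NN retrieval over a patch-feature memory bank.

    encoder       : frozen model mapping (B, 3, H, W) images to (B, P, D) patch features
    query_img     : (1, 3, H, W) query image tensor
    memory_feats  : (M, D) L2-normalized patch features from the support views
    memory_labels : (M, C) one-hot (or soft) class labels for each memory patch
    """
    q = F.normalize(encoder(query_img).squeeze(0), dim=-1)      # (P, D)

    sims = q @ memory_feats.T                                   # (P, M) cosine similarity
    topk_sims, topk_idx = sims.topk(k, dim=-1)                  # (P, k)

    # Softmax-weighted vote over the retrieved neighbours' labels.
    weights = (topk_sims / temperature).softmax(dim=-1)         # (P, k)
    neighbour_labels = memory_labels[topk_idx]                  # (P, k, C)
    patch_probs = (weights.unsqueeze(-1) * neighbour_labels).sum(dim=1)  # (P, C)

    return patch_probs.argmax(dim=-1)                           # (P,) class per patch
```

The per-patch predictions would then be upsampled to pixel resolution and scored against the query mask; no weights are updated anywhere in the pipeline, so the resulting mIoU directly reflects how viewpoint-consistent the frozen encoder's dense features are.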

Difficulty levels for segmentation tasks are systematically constructed by varying the density and distribution of support-bin angles, creating Easy, Medium, Hard, and Extreme regimes. The primary metric is mean intersection-over-union (mIoU), measured as a function of the query (validation) bin's angular offset relative to the support bins.
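As a concrete illustration of the metric, the sketch below tabulates mIoU as a function of the angular offset between a query bin and a single support bin; the `predict_fn` interface and the per-class averaging details are assumptions for illustration, not the paper's released evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def miou_by_offset(predict_fn, samples, support_bin_deg, num_classes):
    """Group mIoU by angular offset between each query bin and the support bin.

    predict_fn      : callable mapping an image to a predicted label map
    samples         : iterable of (image, gt_mask, query_bin_deg) triples
    support_bin_deg : angle (in degrees) of the bin held in the memory bank
    """
    scores = {}
    for image, gt_mask, query_bin_deg in samples:
        offset = abs(query_bin_deg - support_bin_deg)
        scores.setdefault(offset, []).append(
            mean_iou(predict_fn(image), gt_mask, num_classes))
    return {off: float(np.mean(vals)) for off, vals in sorted(scores.items())}
```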

The object category selection enforces full angular coverage and manageable memory footprints, resulting in 15 diverse classes (Figure 1, Figure 2).

Figure 1: Multi-view categories. Each MVImgNet category is shown across all viewpoint bins (0°–90°); example visualizations are shown for 4 of the 15 selected classes.

Results and Analysis

Cross-Viewpoint Generalization

Systematic evaluation (per Table 1 and Figure 3) reveals four key findings:

  • DINOv3 consistently yields the highest mIoU across all angular difficulties and memory bank sizes, followed by DINOv2 and DINO.
  • DINO-based features are remarkably robust under large viewpoint shifts, showing smooth and gradual degradation in performance as angular separation from the support bin increases.
  • Multimodal models like CLIP and SigLIP2 exhibit moderate cross-view robustness, outperforming mixed-supervision (TIPS, C-RADIOv2) models, which suffer from sharper mIoU declines.
  • VGGT, though designed for 3D geometric tasks, performs substantially worse in this segmentation-retrieval setting, producing nearly background-only predictions. This is ascribed to an architectural mismatch: VGGT's multi-view fusion aggregator is rendered inert with a single-view memory bank.

    Figure 3: Segmentation performance across viewpoint bins and difficulty levels. Bars: mIoU on unseen bins; dots: reference bin performance. Self-supervised transformers maintain more stable generalization as angular distance increases.

Breaking Point Analysis

Normalized mIoU curves (Figure 4) demonstrate that DINOv2, DINOv3, DINO, and CLIP degrade smoothly, showing no sudden breaking point up to a 90° reference-query angle. TIPS and VGGT, by contrast, exhibit significant performance drops at 30°, failing to maintain spatial consistency, which confirms the self-supervised ViT encoders' clear advantage in geometric feature stability under viewpoint perturbation.

Figure 4: Normalized mIoU under viewpoint shifts. DINO-like models degrade more smoothly, while TIPS, C-RADIOv2, and VGGT drop sharply at small angular offsets.
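The normalization behind Figure 4 divides each offset's mIoU by the model's own reference-bin score, so absolute accuracy differences between encoders are factored out and only the shape of the degradation curve remains. A minimal sketch, assuming per-offset scores computed as above and hypothetical example numbers:

```python
def normalize_by_reference(miou_per_offset, reference_offset=0):
    """Express each offset's mIoU as a fraction of the reference-bin mIoU."""
    ref = miou_per_offset[reference_offset]
    return {off: score / ref for off, score in miou_per_offset.items()}

# Hypothetical example: a model that degrades gradually with viewpoint shift.
raw = {0: 0.62, 15: 0.58, 30: 0.51, 45: 0.44, 60: 0.39, 75: 0.35, 90: 0.33}
print(normalize_by_reference(raw))   # ~1.00 at 0 degrees, ~0.53 at 90 degrees
```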

Memory Size Robustness

Increasing the support memory size systematically boosts mIoU, particularly for weaker encoders (see Table 2). Gains for DINO-family models are marginal past 640k entries, while SigLIP2, TIPS, and C-RADIOv2 see larger, more sustained improvements, indicating that learning robust features is more effective than compensating with a larger external memory. This reflects the typical memory-performance trade-off of nonparametric inference pipelines under descriptor uncertainty.
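Memory-size ablations of this kind can be run by subsampling the support patch features before retrieval. The sketch below uses uniform random subsampling with a fixed seed, which is an assumed protocol for illustration rather than the paper's exact sampling strategy.

```python
import torch

def subsample_memory(memory_feats, memory_labels, max_entries, seed=0):
    """Randomly keep at most `max_entries` patch features and their labels."""
    n = memory_feats.shape[0]
    if n <= max_entries:
        return memory_feats, memory_labels
    g = torch.Generator().manual_seed(seed)
    keep = torch.randperm(n, generator=g)[:max_entries]
    return memory_feats[keep], memory_labels[keep]

# Example: sweep memory budgets (e.g. 10k, 160k, 640k entries) with the same
# frozen encoder and re-run retrieval to trace the memory-performance curve.
```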

Per-Class and Qualitative Insights

Some classes (e.g., "sofa", "broccoli") maintain high generalization, while thin-structured or partially annotated categories (e.g., "coat rack", "bed") show sharply reduced mIoU; in the case of "bed", annotation errors outweigh true model failures (see Figure 5, Figure 6).

Qualitative comparisons (Figure 7) show that DINO-predicted masks often trace visible object boundaries more closely than the ground-truth annotations do, a strong indication of local shape preservation even in the presence of label noise.

Figure 7: Qualitative segmentation using DINO. Model predictions closely align with visual boundaries, sometimes outperforming coarse ground truth in terms of appearance-consistent segmentation.

Implications and Future Directions

This work definitively demonstrates that self-supervised ViT encoders, particularly DINOv3 and DINOv2, provide robust, viewpoint-consistent representations for 3D-aware segmentation tasks without task-specific finetuning or explicit 3D alignment in pretraining. In practical terms, this finding is highly consequential for robotics, autonomous driving, and embodied intelligence, where agents must recognize objects or scenes from arbitrary, previously unseen viewpoints using only in-context memory.

The failure modes of explicitly geometry-grounded models in this setting accentuate the architectural and pretraining objective mismatch that persists in dense correspondence and retrieval-based inference regimes. Robustness to geometric perturbation can—at present—be more reliably obtained via appropriate self-supervised objectives than explicitly multi-view model design, unless memory architectures are also co-adapted.

Theoretically, these findings highlight the emergence of 3D-aware part consistency in scalable self-supervised ViTs, suggesting further pretraining objectives that enforce equivariance or invariance to 3D transformations may be fruitful. The evaluation paradigm also points the way towards nonparametric, memory-based generalization tests that better match downstream applications in open-world and few-shot scenarios.

Open research directions include:

  • Extending the evaluation to multi-object segmentation, occlusion, and compound 3D rotations.
  • Benchmarking with synthesized as well as real-world multiview data and extending the angular range beyond 90°.
  • Investigating how architectural elements (e.g., register tokens in DINOv3) interact with retrieval robustness.
  • Mixing parametrized and nonparametric inference to further capitalize on robust self-supervised spatial features.

Conclusion

The paper offers a rigorous, controlled, memory-retrieval-driven evaluation of current foundation vision models’ capacity for viewpoint-consistent 3D understanding. The evidence unambiguously positions DINO-based self-supervised ViTs as the state-of-the-art for geometry-aware, in-context segmentation among major foundation encoders, with their advantage amplified as viewpoint variability increases. Multimodal and geometry-focused pretraining, as currently implemented, lag behind unless model and inference procedures are jointly adapted. This benchmark sets a new standard for evaluating 3D spatial consistency in frozen visual encoders and raises foundational questions about the interplay between pretraining objectives, architectural priors, and in-context generalization.

