Optimal Vision Foundation Model backbone for correspondence estimation

Determine the optimal Vision Foundation Model, considering both 2D-pretrained and 3D-pretrained architectures, to serve as the feature-extractor backbone for dense correspondence estimation between image pairs, with accuracy and robustness evaluated across domains and viewpoints.
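
The accuracy criterion in this question is commonly measured with PCK (percentage of correct keypoints): the fraction of predicted matches that land within a pixel threshold of the ground-truth location. The sketch below is illustrative only; the threshold value, tensor shapes, and the function name `pck` are assumptions, not the paper's evaluation protocol.

```python
import torch

def pck(pred_pts: torch.Tensor, gt_pts: torch.Tensor, thresh_px: float = 3.0) -> float:
    """Fraction of predicted matches within thresh_px pixels of ground truth.

    pred_pts, gt_pts: (N, 2) pixel coordinates of matched points in the
    second image of the pair.
    """
    err = torch.linalg.norm(pred_pts - gt_pts, dim=-1)  # per-match pixel error
    return (err <= thresh_px).float().mean().item()

# Example: 100 noisy predictions against ground-truth correspondences.
gt = torch.rand(100, 2) * 224          # ground-truth pixel locations
pred = gt + torch.randn(100, 2)        # predictions with ~1 px noise
print(pck(pred, gt, thresh_px=3.0))    # high fraction, since noise is small
```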

Background

The paper examines using Vision Foundation Models (VFMs) as shared feature extractors for dense image correspondence. Prior work has successfully used both 2D-pretrained (e.g., DINOv2/DINOv3) and 3D-pretrained (e.g., DUSt3R, MASt3R, Aerial-MASt3R, VGGT) backbones for matching, but there is no consensus on which backbone is best.

To investigate, the authors conduct linear probing across multiple 2D and 3D VFMs and find that decoders from 3D VFMs often yield superior spatial alignment. Despite these observations, they explicitly state that deciding the optimal VFM backbone for correspondence estimation remains an open question.
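
The paper's exact probing protocol is not reproduced here, but the general recipe (freeze the VFM, train only a linear head on its patch features, and match by nearest neighbour in the probed feature space) can be sketched as follows. This is a minimal sketch assuming a frozen DINOv2 ViT-S/14 backbone loaded via torch.hub; the probe width (256), the cosine nearest-neighbour matcher, and the omission of the probe's training loop are illustrative assumptions rather than the authors' setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen 2D VFM backbone (DINOv2 ViT-S/14 via torch.hub).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# Linear probe: the only trainable component. Training it (e.g., with a
# contrastive loss on ground-truth correspondences) is omitted here.
probe = nn.Linear(384, 256)  # 384 = ViT-S feature dim; 256 is an assumed width

def patch_features(img: torch.Tensor) -> torch.Tensor:
    """Probed, L2-normalized per-patch features, shape (B, N_patches, C)."""
    with torch.no_grad():
        feats = backbone.forward_features(img)["x_norm_patchtokens"]
    return F.normalize(probe(feats), dim=-1)

def dense_match(img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour patch correspondences from A to B via cosine similarity."""
    fa, fb = patch_features(img_a), patch_features(img_b)
    sim = fa @ fb.transpose(1, 2)  # (B, N_a, N_b) similarity matrix
    return sim.argmax(dim=-1)      # for each patch in A, best patch index in B

# Example: a 224x224 pair yields 16x16 = 256 patch tokens per image.
img_a = torch.randn(1, 3, 224, 224)
img_b = torch.randn(1, 3, 224, 224)
matches = dense_match(img_a, img_b)  # shape (1, 256)
```

Swapping the frozen backbone for a 3D VFM's encoder or decoder features while keeping the same probe and matcher is the kind of controlled comparison the linear-probing study describes.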

References

While previous methods have shown that both 2D and 3D VFMs are helpful as feature extractor backbones for correspondence estimation, the optimal VFM choice remains an open question.

SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration (arXiv:2511.17750, Shao et al., 21 Nov 2025), Section 3.1 (Multi-Scale 3D Vision Foundation Models)