Handling Ambiguity in Cross-View, Cross-Modality Correspondence

Develop algorithmic strategies that robustly handle ambiguity when predicting pixel-level correspondences between ground-level photographs and floor plans in cross-view, cross-modality settings, particularly where photos provide minimal contextual cues or where structural symmetry in the scene layout admits multiple plausible alignments.
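
One generic family of strategies is to predict several correspondence hypotheses and supervise only the best one, so the model can represent multiple plausible alignments instead of averaging them. The sketch below is a minimal winner-takes-all loss in PyTorch; the function name, tensor shapes, and number of hypotheses K are illustrative assumptions, not the paper's C3Po training objective.

```python
# Minimal sketch (PyTorch) of a winner-takes-all loss over K correspondence
# hypotheses. All names and tensor shapes are illustrative assumptions,
# not the paper's C3Po training objective.
import torch

def winner_takes_all_loss(pred, target, mask):
    """pred:   (B, K, H, W, 2) -- K candidate floor-plan coords per photo pixel
    target: (B, H, W, 2)    -- annotated floor-plan coords
    mask:   (B, H, W)       -- 1 where a ground-truth correspondence exists
    """
    # Per-pixel endpoint error of each hypothesis against the single annotation.
    err = torch.linalg.norm(pred - target.unsqueeze(1), dim=-1)   # (B, K, H, W)
    # Mean error over valid pixels, per hypothesis.
    err = (err * mask.unsqueeze(1)).sum(dim=(2, 3))               # (B, K)
    err = err / mask.sum(dim=(1, 2)).clamp(min=1).unsqueeze(1)
    # Back-propagate only through the best hypothesis, so symmetric layouts
    # can keep distinct modes alive rather than collapsing to their mean.
    return err.min(dim=1).values.mean()
```

A downside of this choice is that only one head receives gradient per example; relaxed variants (e.g., a softmin over hypothesis errors) trade mode diversity for training stability.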

Background

The paper introduces C3, a dataset of pixel-level correspondences between floor plans and ground-level photos, and evaluates state-of-the-art methods including DUSt3R and MASt3R, finding that existing models struggle with cross-view, cross-modality matching. The authors fine-tune DUSt3R to obtain their method, C3Po, which achieves notable improvements yet still exhibits significant errors compared to classical correspondence problems.

Through error analysis, the authors identify that ambiguity arises when ground-level photos lack sufficient global context and when structures exhibit symmetries, leading to multiple plausible correspondence configurations. Addressing this ambiguity is highlighted as an outstanding challenge for future research.
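
Because a symmetric layout makes several alignments equally consistent with the photo, one diagnostic is to score a prediction against the nearest symmetry-equivalent version of the ground truth, which separates symmetry-induced error from genuine mismatches. The NumPy sketch below assumes each scene's floor-plan self-symmetries are available as 2x3 affine transforms; symmetry_aware_error and its inputs are hypothetical, not a metric defined in the paper.

```python
# Hypothetical NumPy sketch: score a prediction against the nearest
# symmetry-equivalent version of the ground truth. Illustrative only;
# not a metric defined in the paper.
import numpy as np

def symmetry_aware_error(pred_xy, gt_xy, sym_transforms):
    """pred_xy, gt_xy: (N, 2) floor-plan coordinates of N matched pixels.
    sym_transforms: iterable of (2, 3) affines mapping the plan onto itself
    (the identity plus, e.g., the mirror of a bilaterally symmetric layout).
    """
    errors = []
    for T in sym_transforms:
        gt_t = gt_xy @ T[:, :2].T + T[:, 2]   # symmetry-transformed ground truth
        errors.append(np.linalg.norm(pred_xy - gt_t, axis=1).mean())
    # Credit whichever symmetry-equivalent alignment the prediction chose.
    return min(errors)
```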

References

We analyze the remaining errors and find multiple challenges that are particular to this cross-view, cross-modal problem: often, ground-level photos do not provide enough context of the overall scene, and this problem is exacerbated when symmetries in the structure make the problem ambiguous. Handling this ambiguity is an open problem deserving of future research.

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction (arXiv:2511.18559, Huang et al., 23 Nov 2025), Section 1 (Introduction)