Interpretability of Correspondence Structures Under Probing

Determine how interpretable the correspondence-related structures encoded in the representation space of the Visual Geometry Grounded Transformer (VGGT) are when they are assessed by probing camera tokens at intermediate layers.

Background

The authors train simple two-layer MLP probes on VGGT’s camera tokens to recover the fundamental matrix, observing that epipolar geometry becomes decodable in the middle layers.
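To make the probe's regression target concrete, the following is a minimal sketch of how a fundamental matrix can be built from a known relative pose and checked against the epipolar constraint x2^T F x1 = 0. It is not the authors' code. The pose convention (X2 = R X1 + t), the pinhole intrinsics, the unit-Frobenius-norm scaling, and all helper names are assumptions made for illustration.

```python
import math

def matmul(A, B):
    # Plain 3x3 (or general) matrix product on nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def skew(t):
    # Cross-product matrix [t]_x, so that skew(t) @ v == t x v.
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def inv_intrinsics(K):
    # Closed-form inverse of a zero-skew pinhole matrix (assumed form).
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]

def fundamental_from_pose(K1, K2, R, t):
    # Assumed convention: a point X1 in camera 1 maps to X2 = R X1 + t.
    # Then E = [t]_x R and F = K2^{-T} E K1^{-1}.
    E = matmul(skew(t), R)
    F = matmul(matmul(transpose(inv_intrinsics(K2)), E), inv_intrinsics(K1))
    # F is defined only up to scale; fix it to unit Frobenius norm.
    norm = math.sqrt(sum(v * v for row in F for v in row))
    return [[v / norm for v in row] for row in F]

def project(K, X):
    # Pinhole projection to homogeneous pixel coordinates.
    x = matvec(K, X)
    return [x[0] / x[2], x[1] / x[2], 1.0]

def epipolar_residual(F, x1, x2):
    # Algebraic residual x2^T F x1; zero for a true correspondence.
    return sum(a * b for a, b in zip(x2, matvec(F, x1)))

# Hypothetical two-view setup: small rotation about the y-axis plus a translation.
angle = 0.1
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [0.5, 0.0, 0.1]
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]

F = fundamental_from_pose(K, K, R, t)

# Project one 3D point into both views and evaluate the constraint.
X1 = [0.3, -0.2, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
residual = epipolar_residual(F, project(K, X1), project(K, X2))
frob_norm = math.sqrt(sum(v * v for row in F for v in row))
```

In a probing setup of this kind, the nine entries of such a normalized F would serve as the target that the MLP regresses from a layer's camera tokens. Because F has no intrinsic scale, some normalization (here the Frobenius norm, chosen as an assumption) is needed before a regression loss is meaningful.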

However, they state that the interpretability of the hypothesized correspondence structures within the representation space is not clear under the probing scheme, motivating further analysis of attention maps as a more direct lens.

References

How interpretable these structures are in the representation space is not yet clear under this probing scheme.

Bratulić et al., "On Geometric Understanding and Learned Data Priors in VGGT" (2512.11508, 12 Dec 2025), Section 4.1 (Does VGGT encode geometry? If so, where?)