Mechanism of Epipolar Geometry Recovery in VGGT

Determine the internal mechanism by which the Visual Geometry Grounded Transformer (VGGT) recovers epipolar geometry in its intermediate layers, that is, how the model organizes its representations so that cross-view relationships consistent with the fundamental matrix emerge.

Background

The paper shows that the fundamental matrix can be recovered from VGGT’s intermediate layers, and that recovery performance improves sharply around the middle layers. This suggests that the model learns epipolar geometry implicitly, without being directly supervised on it.
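"Fundamental matrix–consistent" means the following. For a correct F, every pair of corresponding homogeneous image points x1, x2 satisfies x2ᵀ F x1 = 0. Equivalently, x2 lies on the epipolar line F x1 and x1 lies on Fᵀ x2. One common way to score a recovered F is the symmetric epipolar distance: the average pixel distance of each point to the epipolar line induced by its match. This applies whether F is read out of intermediate features or obtained some other way. The pure-Python sketch below implements that standard metric. It is not the evaluation protocol used in the paper. The helper names and the toy camera setup are illustrative assumptions. The setup uses identity intrinsics and no rotation, so F reduces to the skew matrix [t]_x.

```python
import math

def matvec(M, v):
    # Multiply a 3x3 matrix (list of rows) by a 3-vector.
    return [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(M):
    return [[M[k][i] for k in range(3)] for i in range(3)]

def cross_matrix(t):
    # Skew-symmetric [t]_x, so that matvec(cross_matrix(t), v) == t x v.
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def point_line_distance(x, line):
    # x: homogeneous point (u, v, w) with w != 0; line: (a, b, c) meaning a*u + b*v + c = 0.
    a, b, c = line
    u, v = x[0] / x[2], x[1] / x[2]
    return abs(a * u + b * v + c) / math.hypot(a, b)

def symmetric_epipolar_error(F, x1, x2):
    """Mean distance of x2 to the epipolar line F x1 and of x1 to F^T x2."""
    l2 = matvec(F, x1)             # epipolar line of x1 in view 2
    l1 = matvec(transpose(F), x2)  # epipolar line of x2 in view 1
    return 0.5 * (point_line_distance(x2, l2) + point_line_distance(x1, l1))

# Toy setup (assumption): cameras [I | 0] and [I | t] with identity intrinsics,
# so the fundamental matrix equals the essential matrix [t]_x.
t = (0.3, -0.2, 0.5)
F = cross_matrix(t)
X = (1.0, 2.0, 5.0)                             # 3D point in camera-1 coordinates
x1 = (X[0] / X[2], X[1] / X[2], 1.0)            # its projection in view 1
Xc2 = [X[k] + t[k] for k in range(3)]           # same point in camera-2 coordinates
x2 = (Xc2[0] / Xc2[2], Xc2[1] / Xc2[2], 1.0)    # its projection in view 2
print(symmetric_epipolar_error(F, x1, x2))      # close to 0 for a true match
```

You could average this distance over ground-truth correspondences, using an F read out at each layer. That would give one possible layer-wise curve. The paper's exact metric may differ.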

Although geometric information is observed to emerge, the authors note that the mechanism by which VGGT recovers it is not yet known. They hypothesize that the global attention layers compute cross-view correspondences that enable geometric alignment, and analyze the attention maps to test this hypothesis.
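If the global attention layers do compute correspondences, a patch token in one view should attend most strongly to the token covering the same scene point in another view. A simple way to read matches out of an attention map is a mutual-argmax (mutual nearest neighbor) test, sketched below in pure Python. The sketch makes several assumptions. The relevant cross-view slices are already extracted and averaged over heads: A12 holds view-1 queries over view-2 keys, and A21 is the reverse. Tokens lie on a row-major patch grid. The function names, the min_weight threshold, and the grid width and patch size in the example are hypothetical choices, not the paper's procedure.

```python
def argmax(values):
    # Index of the largest entry (the first one on ties).
    return max(range(len(values)), key=values.__getitem__)

def mutual_attention_matches(A12, A21, min_weight=0.0):
    """Return (i, j) token pairs that are each other's attention argmax.

    Assumption: A12[i][j] is head-averaged attention from view-1 token i to
    view-2 token j, and A21[j][i] is attention from view-2 token j to view-1 token i.
    """
    matches = []
    for i, row in enumerate(A12):
        j = argmax(row)                       # best view-2 key for query i
        if argmax(A21[j]) == i and row[j] >= min_weight:
            matches.append((i, j))            # mutual and confident enough
    return matches

def token_to_pixel(idx, grid_w, patch_size):
    # Center of patch `idx` on a row-major grid that is `grid_w` patches wide.
    r, c = divmod(idx, grid_w)
    return ((c + 0.5) * patch_size, (r + 0.5) * patch_size)

# Toy 2x2-patch views (hypothetical numbers); rows are softmax-normalized.
A12 = [[0.70, 0.10, 0.10, 0.10],
       [0.10, 0.60, 0.20, 0.10],
       [0.10, 0.10, 0.20, 0.60],
       [0.25, 0.25, 0.25, 0.25]]
A21 = [[0.80, 0.10, 0.05, 0.05],
       [0.10, 0.70, 0.10, 0.10],
       [0.30, 0.30, 0.20, 0.20],
       [0.10, 0.10, 0.70, 0.10]]
pairs = mutual_attention_matches(A12, A21)
print(pairs)  # [(0, 0), (1, 1), (2, 3)]
pixel_pairs = [(token_to_pixel(i, 2, 14), token_to_pixel(j, 2, 14)) for i, j in pairs]
```

The resulting pixel pairs could then be scored against ground-truth epipolar geometry. They could also be fed to a standard estimator such as the normalized eight-point algorithm. Either route is one way to test whether attention-derived matches are fundamental matrix–consistent.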

References

Yet, so far, we do not know how the model recovers this information.

On Geometric Understanding and Learned Data Priors in VGGT (2512.11508 - Bratulić et al., 12 Dec 2025) in Section 4.2 (How do attention maps encode correspondences?)