
Open Question: Interpreting the Magnitude of Cross-Modal Alignment Scores

Determine whether a mutual nearest-neighbor alignment score of approximately 0.16 between language and vision model representations reflects strong alignment, with the residual gap attributable to noise, or poor alignment, with substantial representational differences left to explain.


Background

Throughout the paper, the authors measure alignment using mutual nearest-neighbor metrics and observe that cross-modal alignment increases with model competence and scale. However, absolute alignment scores remain modest (e.g., around 0.16), raising the question of how to interpret these magnitudes.

In the limitations section, the authors explicitly raise the question of whether such scores indicate meaningful convergence or sizeable remaining gaps, and they leave its resolution open.
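To make the quantity under discussion concrete, the following is a minimal sketch of a mutual nearest-neighbor alignment score: for each sample, compute its k nearest neighbors in each representation space and average the fraction of shared neighbor indices. This is an illustrative implementation under the stated assumptions, not the paper's exact code; the function name, distance choice (Euclidean), and k are assumptions for the example.

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average fraction of shared k-nearest-neighbor indices between two
    representation spaces over the same samples (illustrative sketch,
    not the paper's exact implementation)."""
    def knn_indices(x, k):
        # Pairwise Euclidean distances between all rows of x.
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude each point as its own neighbor
        return np.argsort(d, axis=1)[:, :k]  # indices of the k closest points

    nn_a = knn_indices(np.asarray(feats_a, dtype=float), k)
    nn_b = knn_indices(np.asarray(feats_b, dtype=float), k)
    overlaps = [len(set(ra) & set(rb)) / k for ra, rb in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Toy usage: identical representations yield a score of 1.0,
# while unrelated random representations yield a score near k/(n-1).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
print(mutual_knn_alignment(x, x.copy(), k=5))
```

On this scale, a score like 0.16 means that, on average, only about 16% of a sample's nearest neighbors coincide across the two spaces, which is well above the chance level for unrelated representations yet far from the ceiling of 1.0; the open question is which of those two reference points matters.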

References

Is a score of $0.16$ indicative of strong alignment with the remaining gap being “noise” or does it signify poor alignment with major differences left to explain? We leave this as an open question.

The Platonic Representation Hypothesis (2405.07987 - Huh et al., 13 May 2024) in Section 6 (Counterexamples and limitations), paragraph “Lots left to explain”