Quantify Complementarity in Interleaved Image–Text Training Data
Determine how much of the interleaved image–text data used to train Vision–Language Models (VLMs) contains genuinely complementary information across modalities, and quantify the extent to which such data forces these models to correlate the visual and textual modalities during training.
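As a minimal, self-contained sketch of one possible quantification (an assumption, not a method from the cited paper), complementarity between an image and its surrounding text could be proxied as one minus the cosine similarity of their encoder embeddings: near-redundant pairs score close to 0, unrelated pairs closer to 1. The toy random embeddings below stand in for real encoder outputs (e.g., from a CLIP-style model):

```python
import numpy as np

def complementarity_score(img_emb, txt_emb):
    """Proxy score: 1 - cosine similarity of paired image/text
    embeddings; higher means the modalities share less information."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return 1.0 - np.sum(img * txt, axis=1)

# Toy embeddings standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
txt = rng.standard_normal((4, 16))
redundant_img = txt + 0.01 * rng.standard_normal((4, 16))   # near-duplicate content
complementary_img = rng.standard_normal((4, 16))            # unrelated content

print(complementarity_score(redundant_img, txt).mean())      # close to 0
print(complementarity_score(complementary_img, txt).mean())  # close to 1
```

Aggregating such a score over a training corpus would give a first, crude estimate of how much interleaved data is truly cross-modally complementary rather than redundant.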
References
While a lot of interleaved image–text data is used during training, it is unclear how much of it is truly complementary, and hence how much the model is forced to correlate the two modalities.
— CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
(2604.01634 - Sung et al., 2 Apr 2026) in Section 1 (Introduction)