Quantify Complementarity in Interleaved Image–Text Training Data

Determine how much of the interleaved image–text data used to train Vision–Language Models contains truly complementary information across modalities, and quantify the extent to which such data forces these models to correlate visual and textual inputs during training.
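The paper poses this as an open measurement problem and does not fix a procedure. One plausible proxy, sketched below, treats a document as complementary when conditioning on its images meaningfully lowers a model's negative log-likelihood (NLL) of the interleaved text. Everything here is an illustrative assumption, not the paper's method: the `InterleavedDoc` structure, the `nll_text_only` and `nll_with_images` scorers (which in practice would wrap a text-only language model and a VLM), and the 0.05 threshold.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class InterleavedDoc:
    """One interleaved training document: its text plus image placeholders."""
    text: str
    image_ids: Sequence[str]

# Hypothetical scorer type: returns the total NLL of the document's text,
# either unconditionally or conditioned on the document's images.
NLLScorer = Callable[[InterleavedDoc], float]

def complementarity_score(doc: InterleavedDoc,
                          nll_text_only: NLLScorer,
                          nll_with_images: NLLScorer) -> float:
    """Proxy score in [0, 1]: relative drop in text NLL once the model also
    sees the images. Near zero suggests the images add little information
    that the text does not already carry."""
    base = nll_text_only(doc)
    conditioned = nll_with_images(doc)
    return max(0.0, base - conditioned) / max(base, 1e-8)

def fraction_complementary(docs: Sequence[InterleavedDoc],
                           nll_text_only: NLLScorer,
                           nll_with_images: NLLScorer,
                           threshold: float = 0.05) -> float:
    """Share of documents whose images cut text NLL by at least `threshold`
    (a hypothetical cutoff that would need tuning on held-out data)."""
    if not docs:
        return 0.0
    hits = sum(
        complementarity_score(d, nll_text_only, nll_with_images) >= threshold
        for d in docs
    )
    return hits / len(docs)

if __name__ == "__main__":
    # Toy demo with fabricated NLL values, just to show the plumbing.
    docs = [InterleavedDoc("a caption that restates the image", ["img0"]),
            InterleavedDoc("text answerable only from the chart", ["img1"])]
    text_nll = {docs[0].text: 40.0, docs[1].text: 55.0}
    cond_nll = {docs[0].text: 39.5, docs[1].text: 41.0}  # large drop => complementary
    print(fraction_complementary(docs,
                                 lambda d: text_nll[d.text],
                                 lambda d: cond_nll[d.text]))  # prints 0.5
```

Under this proxy, the corpus-level statistic from `fraction_complementary` would directly answer "how much of the data is truly complementary," while the per-document scores could be used to filter or reweight interleaved training data.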

Background

The paper argues that many multimodal benchmarks and training corpora do not reliably require integrating complementary information across images and text, which may lead Vision–Language Models to hallucinate or to rely on a single modality. Assessing complementarity in the training data is therefore essential to understanding whether and how models are compelled to connect visual and textual inputs.

The authors motivate their CRIT dataset by noting uncertainty about the degree of complementarity in standard interleaved image–text corpora, and link this uncertainty to poor cross-modal multi-hop reasoning and weak grounding in current models.

References

While a lot of interleaved image–text data is used during training, it is unclear how much of it is truly complementary, and hence how much the model is forced to correlate the two modalities.

CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning (2604.01634 - Sung et al., 2 Apr 2026) in Section 1 (Introduction)