A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation (2506.09427v1)

Published 11 Jun 2025 in cs.CV and cs.AI

Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.

Summary

Overview of "A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation"

The paper "A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation" presents comprehensive research focused on improving multimodal models, specifically targeting interleaved image-text generation tasks. As the capabilities of Large Multimodal Models (LMMs) continue to advance, one of the prevalent challenges they face is maintaining the quality and synergy of simultaneous image-text outputs. The publication addresses this issue by introducing both a novel dataset, InterSyn, and an evaluation model, SynJudge.

InterSyn is characterized by its large scale, comprising 1.8 million single-turn samples and 50,000 multi-turn dialogues spanning diverse topics. The dataset is constructed using the Self-Evaluation with Iterative Refinement (SEIR) method, a pipeline designed to enforce semantic completeness, cross-modal synergy, and contextual relevance with minimal human intervention. SEIR embeds feedback loops at multiple stages (question refinement, answer refinement, and image refinement), markedly enhancing dataset quality compared to an otherwise identical process without refinement.
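To make the staged control flow concrete, here is a minimal sketch of an SEIR-style refinement loop. All names (`self_evaluate`, `refine`, `seir_loop`, `build_sample`), the scoring heuristic, the 0.8 threshold, and the three-round cap are hypothetical illustrations of the feedback loops described above, not the paper's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    answer: str
    image_prompt: str


def self_evaluate(stage: str, content: str) -> tuple[float, str]:
    """Score `content` for this refinement stage and return (score, feedback).

    Stub: a real SEIR pipeline would query an LMM judge here.
    """
    score = min(1.0, len(content) / 100.0)  # dummy stand-in heuristic
    return score, "add more concrete, visually grounded detail"


def refine(stage: str, content: str, feedback: str) -> str:
    """Rewrite `content` using the judge's feedback.

    Stub: a real pipeline would query a generator model here.
    """
    return content + f" [{stage} revised per feedback: {feedback}]"


def seir_loop(stage: str, content: str, threshold: float = 0.8,
              max_rounds: int = 3) -> str:
    """Alternate self-evaluation and refinement until the content passes."""
    for _ in range(max_rounds):
        score, feedback = self_evaluate(stage, content)
        if score >= threshold:
            break
        content = refine(stage, content, feedback)
    return content


def build_sample(raw: Sample) -> Sample:
    """Run the staged feedback loops: question, then answer, then image prompt."""
    return Sample(
        question=seir_loop("question", raw.question),
        answer=seir_loop("answer", raw.answer),
        image_prompt=seir_loop("image", raw.image_prompt),
    )
```

Each stage only advances once its self-evaluation score clears the threshold, which is how iterative refinement can filter low-quality samples with little human oversight.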

SynJudge, the proposed evaluation model, provides a robust tool for assessing multimodal outputs across four dimensions: text content, image content, image quality, and image-text synergy. Unlike traditional metrics that account only for image-text consistency, SynJudge introduces a synergy metric that rewards complementary, non-redundant alignment between the textual and visual modalities. This brings the evaluation process closer to human judgment and preference, an essential consideration for authentic and effective multimodal understanding.
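As a rough illustration, a SynJudge-style score report along these four dimensions could be represented as below. The field names, the 0-to-1 scale, and the unweighted-mean aggregation are assumptions made for the sketch; the paper defines its own scoring protocol:

```python
from dataclasses import dataclass


@dataclass
class SynJudgeScores:
    """Four evaluation dimensions, each assumed here to lie in [0, 1]."""
    text_content: float   # completeness of the textual response
    image_content: float  # completeness of the generated image
    image_quality: float  # visual fidelity of the generated image
    synergy: float        # complementary, non-redundant image-text alignment

    def overall(self) -> float:
        """Unweighted mean of the four dimensions (an assumed aggregation)."""
        return (self.text_content + self.image_content
                + self.image_quality + self.synergy) / 4.0


scores = SynJudgeScores(text_content=0.82, image_content=0.75,
                        image_quality=0.90, synergy=0.68)
print(f"overall: {scores.overall():.2f}")  # prints: overall: 0.79
```

Keeping the synergy score separate from the content and quality scores is what lets an evaluator reward outputs whose text and images complement each other rather than merely repeat each other.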

Key Results and Findings

Experimentally, the SEIR method delivered consistent gains across metrics including text content completeness (TCC), image content completeness (ICC), image quality (IQ), and image-text synergy (ITS). Human evaluation indicates up to a 32% improvement in initial question quality after iterative refinement, and in an integrated experiment, LMMs trained with InterSyn data showed performance gains across all evaluation metrics, most notably a 52.1% improvement in ITS.

SynJudge was rigorously tested against human evaluations and aligned more closely with human ratings than existing methods, reducing the average deviation by up to 5%. This fine-grained evaluation of interleaved multimodal outputs provides a practical framework for rapid, scalable benchmarking in future studies.
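For intuition, the kind of deviation-from-human-ratings comparison described here might be computed as a mean absolute deviation, as in the sketch below. The paper's exact formulation is not reproduced in this summary, and the scores used are made up for illustration:

```python
def mean_abs_deviation(model_scores: list[float],
                       human_scores: list[float]) -> float:
    """Average absolute gap between automatic scores and human ratings."""
    assert len(model_scores) == len(human_scores) and model_scores
    return sum(abs(m - h)
               for m, h in zip(model_scores, human_scores)) / len(model_scores)


# Made-up ratings on a shared 0-to-1 scale; a lower deviation means the
# automatic judge tracks human preferences more closely.
human = [0.80, 0.60, 0.90, 0.70]
judge = [0.78, 0.66, 0.85, 0.74]
print(f"average deviation: {mean_abs_deviation(judge, human):.3f}")
```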

Implications and Future Directions

The construction of InterSyn, along with SynJudge, marks significant progress in the development of unified multimodal models. InterSyn provides an enriched, diverse dataset suitable for training models capable of maintaining robust instruction alignment and reasoning across modalities. The insights from SynJudge will facilitate the advancement of evaluation techniques, paving the way for multimodal systems capable of synergistic generation.

Future research may extend the SEIR method to accommodate multiple images per dialogue turn, enrich the dataset with structured tasks and long-term coherence, and further develop judgment models that handle complex multi-image contexts. These directions will be crucial for building more comprehensive and expressive multimodal systems capable of seamless dialogue and interaction across visual and textual domains.

The paper adds considerable value to the field of multimodal modeling and lays the groundwork for future work on scalable data generation and evaluation methodologies.
