- The paper introduces Img-Diff, a novel dataset that improves multimodal large language models' detection of subtle visual differences between images.
- It employs a multi-stage process using image pair generation, segmentation, and advanced filtering to capture object differences effectively.
- Fine-tuning with Img-Diff achieves state-of-the-art performance on benchmarks like MMVP and Spot-the-Diff, highlighting significant gains in visual recognition.
Img-Diff: Contrastive Data Synthesis for Multimodal LLMs
The paper "Img-Diff: Contrastive Data Synthesis for Multimodal LLMs" introduces Img-Diff, a novel dataset aimed at enhancing fine-grained image recognition in Multimodal LLMs (MLLMs). Drawing on insights from contrastive learning and image difference captioning, the dataset challenges models to identify both matching and distinct components between pairs of similar images. The dataset, alongside the proposed data synthesis methodology, offers valuable insights into multimodal data synthesis and MLLMs' visual capabilities.
Methodology
The Img-Diff dataset is constructed through a multi-stage process: image pair generation, difference area identification, and difference caption generation. The authors utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images with subtle object replacements. The process involves the following stages (each illustrated with a code sketch after the list):
- Image Pairs Generation:
  - Utilizing Vicuna-1.5-13B to perform object replacement in image captions sourced from MSCOCO.
  - Generating image pairs using Stable-Diffusion-XL and image editing techniques.
- Difference Area Generator:
  - Employing FastSAM for image segmentation.
  - Implementing filtering mechanisms (the Image Similarity Filter, Image-text Matching Filter, and Difference Detector) to identify bounding-box regions containing object differences.
- Difference Captions Generator:
  - Using LLaVA-NEXT to generate content captions for bounding-box regions.
  - Filtering content captions through the Image-text Matching Filter and Caption Similarity Filter.
  - Generating difference captions with the aid of LLaVA-NEXT based on highlighted regions within the images.
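As a rough illustration of the image-pair stage, the sketch below renders a caption and an object-swapped variant with SDXL via Hugging Face `diffusers`. The `replace_object` helper is a hypothetical stand-in for the Vicuna-1.5-13B rewriting step, and a shared seed (rather than the paper's image-editing techniques) is what keeps the pair aligned here:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def replace_object(caption: str) -> str:
    # Hypothetical stand-in for the Vicuna-1.5-13B call that swaps a
    # single object in an MSCOCO caption.
    return caption.replace("dog", "cat")

caption = "a dog sitting on a park bench"
edited = replace_object(caption)

# Reusing the same seed keeps composition and style close, so the object
# swap is the dominant difference between the two images.
img_a = pipe(caption, generator=torch.Generator("cuda").manual_seed(0)).images[0]
img_b = pipe(edited, generator=torch.Generator("cuda").manual_seed(0)).images[0]
img_a.save("pair_a.png"); img_b.save("pair_b.png")
```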
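A minimal sketch of the Difference Area Generator, assuming FastSAM's ultralytics packaging for region proposals and a CLIP cosine similarity standing in for the Image Similarity Filter; the Image-text Matching Filter, the Difference Detector, and the paper's actual thresholds are not reproduced:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from ultralytics import FastSAM

device = "cuda" if torch.cuda.is_available() else "cpu"
seg_model = FastSAM("FastSAM-s.pt")
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def region_similarity(img_a, img_b, box):
    # Embed the same crop from both images and compare by cosine similarity.
    crops = torch.stack([preprocess(im.crop(box)) for im in (img_a, img_b)]).to(device)
    with torch.no_grad():
        feats = clip_model.encode_image(crops)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

def difference_boxes(img_a, img_b, threshold=0.85):
    # Propose regions on one image with FastSAM, then keep boxes whose
    # crops differ enough across the pair; the threshold is illustrative.
    boxes = seg_model(img_a)[0].boxes.xyxy.tolist()
    return [tuple(b) for b in boxes
            if region_similarity(img_a, img_b, tuple(b)) < threshold]
```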
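For the Difference Captions Generator, a hedged sketch of captioning one bounding-box region with the Hugging Face LLaVA-NeXT port; the checkpoint name and prompt wording are assumptions, and the downstream Caption Similarity Filter and difference-caption prompting are omitted:

```python
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-vicuna-7b-hf", torch_dtype=torch.float16, device_map="auto")

def caption_region(image, box):
    # Crop the candidate difference region and ask the model to describe it.
    crop = image.crop(box)
    prompt = "USER: <image>\nDescribe the object in this image in one sentence. ASSISTANT:"
    inputs = processor(images=crop, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return processor.decode(out[0], skip_special_tokens=True)
```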
Figures \ref{fig:object_replacement_overview} and \ref{fig:difference_area_generator} in the paper illustrate the full data-generation workflow, including the segmentation, filtering, and captioning processes.
Evaluation
The Img-Diff dataset was used to fine-tune LLaVA-1.5-7B and MGM-7B; the fine-tuned models show significant performance improvements over state-of-the-art models on several benchmarks:
- MMVP Benchmark:
  - Fine-tuned models surpass state-of-the-art models such as GPT-4V and Gemini, indicating a superior ability to identify image differences.
- Spot-the-Diff Benchmark:
  - Models trained with Img-Diff showed improved scores on metrics such as BLEU, METEOR, CIDEr-D, and ROUGE-L, highlighting enhanced proficiency in subtle difference detection (see the scoring sketch after this list).
- Image-Editing Request Benchmark:
  - The fine-tuned models achieved new SOTA scores, with notable improvements in generating transformation descriptions for image pairs.
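To make the caption metrics concrete, the snippet below scores an invented prediction against an invented reference with the Hugging Face `evaluate` library; the strings are illustrative only, and CIDEr-D (usually computed via pycocoevalcap) is omitted:

```python
import evaluate

preds = ["the red car is missing in the second image"]
refs = ["a red car has been removed from the scene"]

# BLEU expects a list of reference lists per prediction.
bleu = evaluate.load("bleu").compute(predictions=preds, references=[[r] for r in refs])
meteor = evaluate.load("meteor").compute(predictions=preds, references=refs)
rouge = evaluate.load("rouge").compute(predictions=preds, references=refs)
print(bleu["bleu"], meteor["meteor"], rouge["rougeL"])
```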
Additionally, the dataset contributed to comprehensive performance gains across numerous MLLM benchmarks, indicating its utility beyond specialized tasks.
Data Quality and Diversity
The paper emphasizes thorough validation of data quality and diversity:
- Data Quality:
  - Manual evaluation showed that a high percentage of samples accurately capture object differences and carry captions that faithfully describe them.
- Data Diversity:
  - Analysis revealed broad coverage of object categories, with common categories appearing frequently, ensuring representative coverage.
Implications and Future Work
The Img-Diff dataset and the proposed contrastive data synthesis methodology present significant implications for the future development of MLLMs:
- Improved fine-grained recognition capabilities can lead to better performance in downstream tasks such as VQA and detailed image analysis.
- The methodologies can inspire further innovations in multimodal data synthesis, leveraging generative models and advanced filtering techniques.
- The insights gained from the construction of Img-Diff can inform the development of more robust and diverse datasets, ultimately advancing the field of MLLMs.
The authors also explored alternative data synthesis methods such as "object removal," with mixed results: it showed potential for improving specific aspects of MLLM performance, but its effectiveness varied across models and benchmarks.
Conclusion
The Img-Diff dataset significantly contributes to enhancing the fine-grained image recognition capabilities of MLLMs. By focusing on contrastive data synthesis, the paper underscores the value of targeted datasets in improving model performance on both specialized and broad tasks. The comprehensive evaluation and rigorous methodological approach suggest a promising direction for future research in multimodal data synthesis. The open-sourcing of the dataset further encourages ongoing research and innovation, potentially catalyzing advancements in the capabilities of MLLMs and their applications.