- The paper introduces MegaPairs, a method that automatically generates 26M high-quality multimodal training instances to enhance retrieval systems.
- It employs a two-step pipeline: mining correlated image pairs with multiple similarity models, then generating open-ended instructions with VLMs and LLMs, to support tasks like composed image retrieval.
- Empirical evaluation shows MMRet models trained on MegaPairs outperform prior methods, achieving state-of-the-art zero-shot and out-of-distribution retrieval performance on multiple benchmarks.
This paper introduces MegaPairs, a novel method for synthesizing massive multimodal training data to improve universal multimodal retrieval systems. The core idea is to leverage existing open-domain image corpora together with powerful vision-language models (VLMs) and large language models (LLMs) to automatically generate large-scale, high-quality training instances. This addresses the critical challenge of data scarcity for training effective multimodal retrievers capable of handling diverse tasks beyond simple image-text matching, such as composed image retrieval (CIR).
The MegaPairs construction pipeline involves two main steps:
- Mining Correlated Image Pairs: From a large image corpus (such as a subset of Recap-DataComp-1B), pairs of images (Iq, It) are sampled based on heterogeneous correlations, captured by multiple similarity models:
- CLIP's image encoder for visual-semantic correlation.
- DINOv2 for visual-pattern correlation.
- CLIP's text encoder for caption correlation (using image captions).
This multi-faceted approach ensures diversity in the relationships captured between image pairs, going beyond simple visual similarity. Images from the retrieved set that are not chosen as the target are kept as hard negative samples for training (see the sketch after this list).
- Generating Open-Ended Instructions: For each sampled image pair, open-source MLLMs (like InternVL2-26B) and LLMs (like Llama-3-8B) are used in a two-step annotation process. The MLLM first generates a detailed description of the common concepts and differences between the query image Iq and the target image It. This description is then refined by the LLM to produce multiple textual instructions Tq→ti describing the transition or relationship between the two images, yielding multimodal triplets (Iq, Tq→ti, It).
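To make the two-step pipeline concrete, here is a minimal sketch of both steps. It assumes precomputed, L2-normalized embedding matrices for the three similarity views and hypothetical `mllm_generate` / `llm_generate` helpers standing in for InternVL2-26B and Llama-3-8B; the retrieval depth, prompt wording, and target/hard-negative split are illustrative rather than the paper's exact recipe.

```python
import numpy as np

def top_k_neighbors(query_vec: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Cosine-similarity nearest neighbours; corpus rows are assumed L2-normalized."""
    scores = corpus @ (query_vec / np.linalg.norm(query_vec))
    return np.argsort(-scores)[:k]

def mine_pairs(idx_q: int, clip_img: np.ndarray, dinov2: np.ndarray,
               clip_txt: np.ndarray, k: int = 6):
    """Step 1: gather heterogeneous neighbours of query image idx_q.
    One retrieved image per similarity view is kept as a target; the remaining
    retrieved images become hard negatives (the split shown is illustrative)."""
    targets, others = [], []
    for emb in (clip_img, dinov2, clip_txt):               # three correlation views
        hits = [int(i) for i in top_k_neighbors(emb[idx_q], emb, k) if i != idx_q]
        targets.append(hits[0])
        others.extend(hits[1:])
    hard_negatives = [i for i in set(others) if i not in set(targets)]
    return targets, hard_negatives

def annotate_pair(img_q, img_t, mllm_generate, llm_generate, n_instructions: int = 3):
    """Step 2: the MLLM describes commonalities and differences of the pair,
    then the LLM rewrites that description into open-ended instructions."""
    description = mllm_generate(
        images=[img_q, img_t],
        prompt="Describe the shared concepts and the differences between the "
               "first (query) image and the second (target) image.")
    instructions = llm_generate(  # assumed to return a list of strings
        prompt=f"Rewrite the comparison below as {n_instructions} short, diverse "
               f"retrieval instructions that lead from the query image to the "
               f"target image:\n{description}")
    return [(img_q, text, img_t) for text in instructions]  # multimodal triplets
```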
Using this method, the authors generated over 26 million training instances. A key finding is the high quality of the synthesized data: in a pilot experiment, a model trained on only 0.5 million MegaPairs instances outperformed the same backbone trained on all 36.7 million instances of the existing MagicLens [ICML 2024] dataset. Ablation studies confirm that hard negatives and the combination of multiple image-pair search strategies both contribute to better performance.
The paper also introduces MMRet, a series of multimodal retriever models trained on the MegaPairs dataset. MMRet models are based on pre-trained VLMs and come in two main architectures:
- CLIP-based MMRet: Utilizes a dual encoder (CLIP's image and text encoders) with a score-fusion strategy for composed image-text embeddings (a minimal fusion sketch follows this list).
- MLLM-based MMRet: Built upon MLLMs like LLaVA-1.6 [arXiv 2401.xxxx], it processes composed inputs as interleaved sequences of image and text tokens, typically including task-specific instructions, and uses the normalized last hidden state of the [EOS] token as the embedding.
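A minimal sketch of the CLIP-based score fusion, assuming an open_clip-style model exposing `encode_image` / `encode_text`; the equal-weight sum is illustrative, since the exact fusion weights are not specified here.

```python
import torch
import torch.nn.functional as F

def composed_embedding(clip_model, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """Fuse the embeddings of a composed (image + instruction) query into one vector."""
    img_emb = F.normalize(clip_model.encode_image(image), dim=-1)
    txt_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    fused = img_emb + txt_emb             # equal-weight score-level fusion (illustrative)
    return F.normalize(fused, dim=-1)     # retrieval uses cosine similarity on this vector
```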
Both MMRet architectures are trained using multimodal contrastive learning with the InfoNCE loss, enabling them to handle various multimodal inputs (image, text, or composed).
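A minimal sketch of this contrastive objective, with in-batch positives on the diagonal and mined hard negatives appended to the candidate pool; the temperature value is a typical placeholder, not the paper's reported setting.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, pos_emb: torch.Tensor,
             hard_neg_emb: Optional[torch.Tensor] = None,
             temperature: float = 0.02) -> torch.Tensor:
    """query_emb, pos_emb: (B, D) L2-normalized embeddings; hard_neg_emb: (N, D) or None."""
    candidates = pos_emb if hard_neg_emb is None else torch.cat([pos_emb, hard_neg_emb], dim=0)
    logits = query_emb @ candidates.T / temperature     # (B, B [+ N]) similarity scores
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)              # positive pairs sit on the diagonal
```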
For implementation, the CLIP-based models (MMRet-Base, MMRet-Large) are initialized from CLIP variants and trained on MegaPairs with specific batch sizes and learning rates. The MLLM-based MMRet (MMRet-MLLM) is initialized from LLaVA-1.6 and fine-tuned using LoRA on both the visual and language backbones. Training details like batch sizes, learning rates, and LoRA rank are provided.
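A minimal sketch of the LoRA setup with Hugging Face `peft`, applied to an already-loaded LLaVA-1.6-style backbone; the rank, alpha, dropout, and target-module names below are illustrative placeholders rather than the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(base_model, rank: int = 16):
    """Attach low-rank adapters to the attention projections of the backbone;
    only the adapter weights receive gradients during fine-tuning."""
    lora_cfg = LoraConfig(
        r=rank,                                # LoRA rank (placeholder value)
        lora_alpha=2 * rank,                   # scaling factor (placeholder)
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    )
    model = get_peft_model(base_model, lora_cfg)
    model.print_trainable_parameters()         # sanity check: small trainable fraction
    return model
```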
The empirical evaluation demonstrates the effectiveness of MMRet trained on MegaPairs:
- Zero-shot CIR: MMRet models achieved state-of-the-art performance on four popular benchmarks: CIRCO [ICCV 2023], CIRR [ICCV 2021], FashionIQ [CVPR 2021], and GeneCIS [CVPR 2023]. MMRet-MLLM, in particular, set new SOTA records on CIRCO, CIRR, and GeneCIS, significantly surpassing previous models including proprietary ones.
- MMEB Performance: Evaluated on the Massive Multimodal Embedding Benchmark (MMEB) [arXiv 2410.05160] across 36 datasets covering Classification, VQA, Retrieval, and Grounding. MMRet-MLLM achieved the highest overall zero-shot performance. When fine-tuned on MMEB training data, MMRet-MLLM further improved, achieving state-of-the-art overall performance and notable gains on out-of-distribution (OOD) datasets, highlighting the strong generalization capabilities imparted by training on MegaPairs.
The paper argues that using open-source models and general image corpora makes the MegaPairs data synthesis method highly scalable and cost-effective, facilitating continuous improvement in multimodal retrieval. The generated dataset, trained models, and the data synthesis pipeline are planned for public release.
A noted limitation is that only three pairing strategies are used; the authors suggest exploring more diverse pairing methods, such as incorporating advanced text retrievers or image-text retrieval strategies. They also provide an ethics statement, describing efforts to filter harmful content from the source image corpus and discouraging the use of the models for sensitive content.