- The paper introduces MegaPairs, a method that automatically generates 26M high-quality multimodal training instances to enhance retrieval systems.
- It employs a two-step pipeline: mining correlated image pairs with multiple similarity models, then generating open-ended instructions with VLMs and LLMs, to support tasks like composed image retrieval.
- Empirical evaluation shows MMRet models trained on MegaPairs outperform prior methods, achieving state-of-the-art zero-shot and out-of-distribution retrieval performance on multiple benchmarks.
This paper introduces MegaPairs, a novel method for synthesizing massive multimodal training data to improve universal multimodal retrieval systems. The core idea is to leverage existing open-domain image corpora together with powerful vision-language models (VLMs) and large language models (LLMs) to automatically generate large-scale, high-quality training instances. This addresses the critical challenge of data scarcity for training effective multimodal retrievers capable of handling diverse tasks beyond simple image-text matching, such as composed image retrieval (CIR).
The MegaPairs construction pipeline involves two main steps:
- Mining Correlated Image Pairs: From a large image corpus (such as a subset of Recap-DataComp-1B), pairs of images (Iq, It) are sampled based on heterogeneous correlations, captured by multiple similarity models:
- CLIP's image encoder for visual-semantic correlation.
- DINOv2 for visual-pattern correlation.
- CLIP's text encoder for caption correlation (using image captions).
This multi-faceted approach ensures diversity in the relationships captured between image pairs, going beyond simple visual similarity. Images from the retrieved set that are not chosen as the target are kept as hard negative samples for training (see the sketch after this list).
- Generating Open-Ended Instructions: For each sampled image pair, open-source MLLMs (like InternVL2-26B) and LLMs (like Llama-3-8B) are used in a two-step annotation process. The MLLM first generates a detailed description of the common concepts and differences between the query image Iq and the target image It. This description is then refined by the LLM to produce multiple textual instructions Tq→ti describing the transition or relationship between the two images, yielding multimodal triplets (Iq, Tq→ti, It).
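To make the two-step pipeline concrete, here is a minimal sketch of both steps. It assumes precomputed, L2-normalized embedding matrices for the three similarity views and hypothetical `mllm_generate` / `llm_generate` helpers standing in for InternVL2-26B and Llama-3-8B; the retrieval depth, prompt wording, and target/hard-negative split are illustrative rather than the paper's exact recipe.

```python
import numpy as np

def top_k_neighbors(query_vec: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Cosine-similarity nearest neighbours; corpus rows are assumed L2-normalized."""
    scores = corpus @ (query_vec / np.linalg.norm(query_vec))
    return np.argsort(-scores)[:k]

def mine_pairs(idx_q: int, clip_img: np.ndarray, dinov2: np.ndarray,
               clip_txt: np.ndarray, k: int = 6):
    """Step 1: gather heterogeneous neighbours of query image idx_q.
    One retrieved image per similarity view is kept as a target; the remaining
    retrieved images become hard negatives (the split shown is illustrative)."""
    targets, others = [], []
    for emb in (clip_img, dinov2, clip_txt):               # three correlation views
        hits = [int(i) for i in top_k_neighbors(emb[idx_q], emb, k) if i != idx_q]
        targets.append(hits[0])
        others.extend(hits[1:])
    hard_negatives = [i for i in set(others) if i not in set(targets)]
    return targets, hard_negatives

def annotate_pair(img_q, img_t, mllm_generate, llm_generate, n_instructions: int = 3):
    """Step 2: the MLLM describes commonalities and differences of the pair,
    then the LLM rewrites that description into open-ended instructions."""
    description = mllm_generate(
        images=[img_q, img_t],
        prompt="Describe the shared concepts and the differences between the "
               "first (query) image and the second (target) image.")
    instructions = llm_generate(  # assumed to return a list of strings
        prompt=f"Rewrite the comparison below as {n_instructions} short, diverse "
               f"retrieval instructions that lead from the query image to the "
               f"target image:\n{description}")
    return [(img_q, text, img_t) for text in instructions]  # multimodal triplets
```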
Using this method, the authors generated over 26 million training instances. A key finding is the high quality of the synthesized data: in a pilot experiment, a model trained on only 0.5 million MegaPairs instances outperformed the same backbone trained on all 36.7 million instances of the existing MagicLens [ICML 2024] dataset. Ablation studies confirm that hard negatives and the combination of multiple image-pair search strategies both contribute to better performance.
The paper also introduces MMRet, a series of multimodal retriever models trained on the MegaPairs dataset. MMRet models are based on pre-trained VLMs and come in two main architectures:
- CLIP-based MMRet: Utilizes a dual encoder (CLIP's image and text encoders) with a score-fusion strategy for composed image-text embeddings (a minimal fusion sketch follows this list).
- MLLM-based MMRet: Built upon MLLMs like LLaVA-1.6 [arXiv 2401.xxxx], it processes composed inputs as interleaved sequences of image and text tokens, typically including task-specific instructions, and uses the normalized last hidden state of the [EOS] token as the embedding.
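A minimal sketch of the CLIP-based score fusion, assuming an open_clip-style model exposing `encode_image` / `encode_text`; the equal-weight sum is illustrative, since the exact fusion weights are not specified here.

```python
import torch
import torch.nn.functional as F

def composed_embedding(clip_model, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """Fuse the embeddings of a composed (image + instruction) query into one vector."""
    img_emb = F.normalize(clip_model.encode_image(image), dim=-1)
    txt_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    fused = img_emb + txt_emb             # equal-weight score-level fusion (illustrative)
    return F.normalize(fused, dim=-1)     # retrieval uses cosine similarity on this vector
```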
Both MMRet architectures are trained using multimodal contrastive learning with the InfoNCE loss, enabling them to handle various multimodal inputs (image, text, or composed).
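A minimal sketch of this contrastive objective, with in-batch positives on the diagonal and mined hard negatives appended to the candidate pool; the temperature value is a typical placeholder, not the paper's reported setting.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, pos_emb: torch.Tensor,
             hard_neg_emb: Optional[torch.Tensor] = None,
             temperature: float = 0.02) -> torch.Tensor:
    """query_emb, pos_emb: (B, D) L2-normalized embeddings; hard_neg_emb: (N, D) or None."""
    candidates = pos_emb if hard_neg_emb is None else torch.cat([pos_emb, hard_neg_emb], dim=0)
    logits = query_emb @ candidates.T / temperature     # (B, B [+ N]) similarity scores
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)              # positive pairs sit on the diagonal
```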
For implementation, the CLIP-based models (MMRet-Base, MMRet-Large) are initialized from CLIP variants and trained on MegaPairs with specific batch sizes and learning rates. The MLLM-based MMRet (MMRet-MLLM) is initialized from LLaVA-1.6 and fine-tuned using LoRA on both the visual and language backbones. Training details like batch sizes, learning rates, and LoRA rank are provided.
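A minimal sketch of the LoRA setup with Hugging Face `peft`, applied to an already-loaded LLaVA-1.6-style backbone; the rank, alpha, dropout, and target-module names below are illustrative placeholders rather than the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(base_model, rank: int = 16):
    """Attach low-rank adapters to the attention projections of the backbone;
    only the adapter weights receive gradients during fine-tuning."""
    lora_cfg = LoraConfig(
        r=rank,                                # LoRA rank (placeholder value)
        lora_alpha=2 * rank,                   # scaling factor (placeholder)
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    )
    model = get_peft_model(base_model, lora_cfg)
    model.print_trainable_parameters()         # sanity check: small trainable fraction
    return model
```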
The empirical evaluation demonstrates the effectiveness of MMRet trained on MegaPairs:
- Zero-shot CIR: MMRet models achieved state-of-the-art performance on four popular benchmarks: CIRCO [ICCV 2023], CIRR [ICCV 2021], FashionIQ [CVPR 2021], and GeneCIS [CVPR 2023]. MMRet-MLLM, in particular, set new SOTA records on CIRCO, CIRR, and GeneCIS, significantly surpassing previous models including proprietary ones.
- MMEB Performance: Evaluated on the Massive Multimodal Embedding Benchmark (MMEB) [arXiv 2410.05160] across 36 datasets covering Classification, VQA, Retrieval, and Grounding. MMRet-MLLM achieved the highest overall zero-shot performance. When fine-tuned on MMEB training data, MMRet-MLLM further improved, achieving state-of-the-art overall performance and notable gains on out-of-distribution (OOD) datasets, highlighting the strong generalization capabilities imparted by training on MegaPairs.
The paper argues that using open-source models and general image corpora makes the MegaPairs data synthesis method highly scalable and cost-effective, facilitating continuous improvement in multimodal retrieval. The generated dataset, trained models, and the data synthesis pipeline are planned for public release.
A noted limitation is that only three pairing strategies are used; the authors suggest exploring more diverse pairing methods, such as incorporating advanced text retrievers or image-text retrieval strategies. They also provide an ethics statement, describing efforts to filter harmful content from the source image corpus and discouraging the use of the models for sensitive content.