Universal Multimodal Retrieval with Multimodal LLMs
In this paper, titled "MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs," the authors address a significant limitation of current state-of-the-art retrieval models: they focus on narrow, single-modal search scenarios. Traditional retrievers are typically built to handle either text or image queries for a fixed retrieval task, such as finding a passage that answers a given question. This work proposes a "universal multimodal retrieval" framework that accommodates diverse retrieval tasks in which both queries and documents may consist of interleaved text and images.
The paper introduces methodologies for fine-tuning multimodal LLMs (MLLMs) to perform retrieval across modalities, culminating in a multimodal retriever named MM-Embed. The study first evaluates MLLM-based bi-encoder retrievers fine-tuned on ten datasets covering sixteen different retrieval tasks. The empirical analysis reveals that, while these retrievers can process complex queries combining text and images, they exhibit modality biases that leave them behind smaller, task-specific models on some tasks.
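As a rough illustration of the bi-encoder setup described above, the sketch below encodes queries and candidate documents with a single shared encoder and ranks candidates by embedding similarity. The `BiEncoder` module and the random feature tensors are placeholders standing in for the MLLM backbone and its multimodal inputs, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class BiEncoder(torch.nn.Module):
    """Shared encoder mapping query/document features into one embedding space."""
    def __init__(self, in_dim: int = 768, emb_dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so a plain dot product equals cosine similarity.
        return F.normalize(self.proj(x), dim=-1)

encoder = BiEncoder()

# Stand-ins for the features an MLLM backbone would produce for
# interleaved text+image queries and for candidate documents.
query_feats = torch.randn(4, 768)
doc_feats = torch.randn(100, 768)

q_emb = encoder(query_feats)   # (4, 256)
d_emb = encoder(doc_feats)     # (100, 256)

scores = q_emb @ d_emb.T       # (4, 100) query-document similarity matrix
top5 = scores.topk(k=5, dim=-1).indices
print(top5.shape)              # torch.Size([4, 5]): top-5 candidates per query
```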
To address the observed modality bias, especially in scenarios where text queries are intended to retrieve image content, the authors introduce a technique termed "modality-aware hard negative mining." It mitigates the tendency of MLLM-based retrievers to favor text candidates even when the task calls for retrieving images. The retriever is then continually fine-tuned on text-to-text retrieval, so that MM-Embed handles diverse multimodal retrieval tasks, advancing the state of the art on the multimodal M-BEIR benchmark, while retaining strong text retrieval performance on MTEB.
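The sketch below illustrates one plausible reading of modality-aware hard negative mining: when a task targets a specific modality, highly ranked candidates from the wrong modality are kept as explicit hard negatives alongside in-modality ones, so the contrastive objective directly penalizes the bias. The `Candidate` structure, the selection rule, and the mixing ratio are assumptions for illustration, not the paper's exact recipe.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    modality: str    # "text" or "image"
    score: float     # retriever score for the current query
    is_positive: bool

def mine_hard_negatives(candidates, target_modality: str, n_neg: int = 4):
    """Pick hard negatives for one query from its top-ranked candidates."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    wrong = [c for c in ranked if not c.is_positive and c.modality != target_modality]
    right = [c for c in ranked if not c.is_positive and c.modality == target_modality]
    # Mix both pools so training teaches (a) fine-grained distinctions within the
    # target modality and (b) down-ranking of off-modality candidates.
    return (wrong[: n_neg // 2] + right)[:n_neg]

cands = [
    Candidate("img_1", "image", 0.91, is_positive=True),
    Candidate("txt_7", "text", 0.90, is_positive=False),  # modality-bias culprit
    Candidate("img_4", "image", 0.74, is_positive=False),
    Candidate("txt_2", "text", 0.70, is_positive=False),
    Candidate("img_9", "image", 0.55, is_positive=False),
]
print([c.doc_id for c in mine_hard_negatives(cands, target_modality="image")])
# ['txt_7', 'txt_2', 'img_4', 'img_9']
```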
A further contribution explores MLLMs as zero-shot rerankers. The authors find that prompting an MLLM in a zero-shot setting to judge query-candidate relevance improves retrieval accuracy on tasks that require understanding interleaved text and image inputs, such as visual question answering and composed image retrieval. In composed image retrieval, for instance, zero-shot reranking raises mean average precision (mAP@5) by more than 7 points over the existing state of the art.
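The following sketch shows how such prompt-based zero-shot reranking might look: each retrieved candidate is scored by asking the MLLM whether it satisfies the combined text-and-image query, and candidates are reordered by the model's confidence in a "yes" answer. The helper `mllm_yes_probability` and the prompt wording are hypothetical stand-ins, not the prompts used in the paper.

```python
def mllm_yes_probability(prompt, images):
    """Placeholder: in practice, read P("yes") from the MLLM's output logits."""
    return 0.5  # dummy value so the sketch runs end to end

def rerank(query_text, query_image, candidates, top_k=5):
    scored = []
    for cand in candidates:
        prompt = (
            "Given the user's query (text and image), does this candidate "
            f"satisfy the query?\nQuery: {query_text}\n"
            f"Candidate: {cand.get('text', '')}\nAnswer yes or no."
        )
        images = [img for img in (query_image, cand.get("image")) if img is not None]
        scored.append((mllm_yes_probability(prompt, images), cand))
    # Keep the candidates the MLLM judges most likely to be relevant.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:top_k]]

# Example: rerank the retriever's top candidates for a composed image query.
top_candidates = [{"text": "a red vintage car parked by the beach", "image": None}]
print(rerank("same car but in blue", query_image=None, candidates=top_candidates))
```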
The implications of this research are significant: MM-Embed is the first universal multimodal retriever to demonstrate competitive performance across varied tasks while maintaining robust capabilities in traditional text retrieval. The paper also opens avenues for future research, particularly in knowledge distillation to smaller retriever models and in improved MLLM-based reranking techniques.
In conclusion, this paper lays the groundwork for retrieval models that are flexible enough to handle diverse and complex queries and practical enough to function in real-world applications. With a strong empirical foundation, MM-Embed can inspire future work on curriculum-based training strategies and on extending retrieval to additional modalities, such as audio and video.