
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Published 4 Nov 2024 in cs.CL, cs.AI, cs.CV, cs.IR, and cs.LG | (2411.02571v2)

Abstract: State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal LLMs (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose continuously fine-tuning the universal multimodal retriever to enhance its text retrieval capability while preserving multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. We also explore prompting the off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that, through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way for advancing universal multimodal retrieval in the future.

Summary

  • The paper introduces MM-Embed, a universal multimodal retrieval framework that processes interleaved text and image inputs for diverse tasks.
  • It fine-tunes multimodal LLMs using a bi-encoder approach across 16 retrieval tasks and employs modality-aware hard negative mining to tackle modality bias.
  • The study shows significant performance gains, including a 7-point mAP@5 boost in composed image retrieval via zero-shot reranking.

Universal Multimodal Retrieval with Multimodal LLMs

In this paper, titled "MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs," the authors address a significant limitation in current state-of-the-art retrieval models that focus primarily on narrow, single-modal search scenarios. Traditional retrieval models are typically designed to handle either text or image queries, with fixed retrieval tasks, such as finding a passage to answer a specific question. This research proposes a "universal multimodal retrieval" framework that accommodates diverse retrieval tasks using queries and documents that consist of interleaved text and images.

The paper introduces methodologies for fine-tuning multimodal LLMs (MLLMs) as retrievers across multiple modalities, culminating in a model named MM-Embed. The authors first fine-tune an MLLM as a bi-encoder retriever on ten datasets covering sixteen retrieval tasks. The empirical analysis shows that the MLLM-based retriever can process complex queries composed of both text and images, yet it exhibits a modality bias that leaves it behind smaller, task-specific models such as CLIP on cross-modal retrieval tasks.
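The bi-encoder setup can be summarized with a short sketch. The snippet below is a minimal, PyTorch-style illustration of contrastive fine-tuning with in-batch negatives; the `encoder.encode` method is a hypothetical stand-in for an MLLM that maps interleaved text/image inputs to embeddings, and this is not the authors' released training code.

```python
# Minimal sketch of bi-encoder contrastive fine-tuning with in-batch negatives.
# `encoder.encode` is a hypothetical method that maps a batch of interleaved
# text/image inputs to a (batch_size, dim) tensor of embeddings.
import torch
import torch.nn.functional as F

def contrastive_step(encoder, queries, documents, temperature=0.05):
    """One training step where queries[i] is paired with documents[i]."""
    q = F.normalize(encoder.encode(queries), dim=-1)    # (B, d) query embeddings
    d = F.normalize(encoder.encode(documents), dim=-1)  # (B, d) document embeddings
    logits = q @ d.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)              # InfoNCE over in-batch negatives
```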

To address the observed modality bias, especially in scenarios where text queries are meant to retrieve image content, the authors introduce modality-aware hard negative mining, which counters the retriever's tendency to return candidates of the wrong modality. They then continually fine-tune the retriever on text-to-text retrieval, strengthening MM-Embed's text retrieval while preserving its multimodal capability; the resulting model achieves state-of-the-art results on the multimodal benchmark M-BEIR and surpasses NV-Embed-v1 on the MTEB retrieval benchmark.
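One plausible reading of modality-aware hard negative mining is sketched below: candidates that the initial retriever ranks highly but that are in the wrong modality for the task are retained as hard negatives, so subsequent training explicitly penalizes the modality bias. The function and field names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of modality-aware hard negative mining. Candidates that
# the initial retriever ranks highly but whose modality does not match the
# task's target modality are kept as hard negatives, alongside ordinary
# same-modality distractors. Field names ('id', 'modality') are assumptions.

def mine_hard_negatives(ranked_candidates, gold_ids, target_modality, k=5):
    """ranked_candidates: dicts sorted by retriever score, each with 'id' and
    'modality' (e.g., 'text', 'image', or 'image,text')."""
    wrong_modality, same_modality = [], []
    for cand in ranked_candidates:
        if cand["id"] in gold_ids:
            continue  # never use a gold document as a negative
        if cand["modality"] != target_modality:
            wrong_modality.append(cand)  # high-ranked wrong-modality candidate
        else:
            same_modality.append(cand)   # ordinary hard negative
    # Prefer wrong-modality distractors so training explicitly penalizes the
    # modality bias, then pad with same-modality negatives up to k.
    return (wrong_modality + same_modality)[:k]
```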

A further contribution explores off-the-shelf MLLMs as zero-shot rerankers. The authors find that prompting an MLLM to re-score retrieved candidates improves accuracy on tasks that require understanding interleaved text and image inputs, such as visual question answering and composed image retrieval. In composed image retrieval, for instance, zero-shot reranking raises mean Average Precision (mAP@5) by over 7 points compared to the existing state of the art.
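A minimal sketch of such zero-shot reranking is given below, assuming a hypothetical `mllm_yes_probability` wrapper around whatever MLLM inference API is available; the prompt wording is illustrative rather than the paper's exact template.

```python
# Sketch of zero-shot reranking with an off-the-shelf MLLM. Each top candidate
# is scored by asking the model whether it satisfies the query, and candidates
# are re-ordered by that score. `mllm_yes_probability` is a hypothetical
# wrapper around an MLLM inference API.

def rerank(query, candidates, mllm_yes_probability, top_k=10):
    """candidates: retriever results ordered by score, each a dict with a
    'content' field holding the candidate's text and/or image reference."""
    rescored = []
    for cand in candidates[:top_k]:
        prompt = (
            "Does the candidate satisfy the query? Answer Yes or No.\n"
            f"Query: {query}\nCandidate: {cand['content']}"
        )
        rescored.append((mllm_yes_probability(prompt), cand))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    # Re-ordered top_k candidates, followed by the untouched tail.
    return [cand for _, cand in rescored] + candidates[top_k:]
```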

The implications of this research are significant: MM-Embed is the first universal multimodal retriever to demonstrate competitive performance across varied tasks while retaining robust capabilities in traditional text retrieval. The study also opens avenues for further research, particularly knowledge distillation into smaller retriever models and improved reranking techniques built on MLLMs.

In conclusion, the paper lays the groundwork for retrieval models that are flexible enough to handle diverse, complex queries and practical enough for real-world applications. With a strong empirical foundation, MM-Embed can inspire future work on curriculum-based training strategies and on extending retrieval to additional modalities such as audio and video.
