Universal Multimodal Retrieval with Multimodal LLMs
In this paper, titled "MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs," the authors address a significant limitation of current state-of-the-art retrieval models: they focus on narrow, single-modal search scenarios. Traditional retrievers are typically built to handle either text or image queries for a fixed retrieval task, such as finding a passage that answers a given question. This work proposes a "universal multimodal retrieval" framework that accommodates diverse retrieval tasks in which both queries and documents may consist of interleaved text and images.
The paper introduces methodologies for fine-tuning multimodal LLMs (MLLMs) to perform retrieval across modalities, culminating in a multimodal retriever named MM-Embed. The study first evaluates MLLM-based bi-encoder retrievers fine-tuned on ten datasets covering sixteen different retrieval tasks. The empirical analysis reveals that, while these retrievers can process complex queries combining text and images, they exhibit modality biases that leave them behind smaller, task-specific models on some tasks.
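As a rough illustration of the bi-encoder setup described above, the sketch below encodes queries and candidate documents with a single shared encoder and ranks candidates by embedding similarity. The `BiEncoder` module and the random feature tensors are placeholders standing in for the MLLM backbone and its multimodal inputs, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class BiEncoder(torch.nn.Module):
    """Shared encoder mapping query/document features into one embedding space."""
    def __init__(self, in_dim: int = 768, emb_dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so a plain dot product equals cosine similarity.
        return F.normalize(self.proj(x), dim=-1)

encoder = BiEncoder()

# Stand-ins for the features an MLLM backbone would produce for
# interleaved text+image queries and for candidate documents.
query_feats = torch.randn(4, 768)
doc_feats = torch.randn(100, 768)

q_emb = encoder(query_feats)   # (4, 256)
d_emb = encoder(doc_feats)     # (100, 256)

scores = q_emb @ d_emb.T       # (4, 100) query-document similarity matrix
top5 = scores.topk(k=5, dim=-1).indices
print(top5.shape)              # torch.Size([4, 5]): top-5 candidates per query
```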
To address the observed modality bias, especially in scenarios where text queries are intended to retrieve image content, the authors introduce a technique termed "modality-aware hard negative mining." It mitigates the tendency of MLLM-based retrievers to favor text candidates even when the task calls for retrieving images. The retriever is then continually fine-tuned on text-to-text retrieval, so that MM-Embed handles diverse multimodal retrieval tasks, advancing the state of the art on the multimodal M-BEIR benchmark, while retaining strong text retrieval performance on MTEB.
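The sketch below illustrates one plausible reading of modality-aware hard negative mining: when a task targets a specific modality, highly ranked candidates from the wrong modality are kept as explicit hard negatives alongside in-modality ones, so the contrastive objective directly penalizes the bias. The `Candidate` structure, the selection rule, and the mixing ratio are assumptions for illustration, not the paper's exact recipe.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    modality: str    # "text" or "image"
    score: float     # retriever score for the current query
    is_positive: bool

def mine_hard_negatives(candidates, target_modality: str, n_neg: int = 4):
    """Pick hard negatives for one query from its top-ranked candidates."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    wrong = [c for c in ranked if not c.is_positive and c.modality != target_modality]
    right = [c for c in ranked if not c.is_positive and c.modality == target_modality]
    # Mix both pools so training teaches (a) fine-grained distinctions within the
    # target modality and (b) down-ranking of off-modality candidates.
    return (wrong[: n_neg // 2] + right)[:n_neg]

cands = [
    Candidate("img_1", "image", 0.91, is_positive=True),
    Candidate("txt_7", "text", 0.90, is_positive=False),  # modality-bias culprit
    Candidate("img_4", "image", 0.74, is_positive=False),
    Candidate("txt_2", "text", 0.70, is_positive=False),
    Candidate("img_9", "image", 0.55, is_positive=False),
]
print([c.doc_id for c in mine_hard_negatives(cands, target_modality="image")])
# ['txt_7', 'txt_2', 'img_4', 'img_9']
```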
A further contribution explores MLLMs as zero-shot rerankers. The authors find that prompting an MLLM in a zero-shot setting to judge query-candidate relevance improves retrieval accuracy on tasks that require understanding interleaved text and image inputs, such as visual question answering and composed image retrieval. In composed image retrieval, for instance, zero-shot reranking raises mean average precision (mAP@5) by more than 7 points over the existing state of the art.
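The following sketch shows how such prompt-based zero-shot reranking might look: each retrieved candidate is scored by asking the MLLM whether it satisfies the combined text-and-image query, and candidates are reordered by the model's confidence in a "yes" answer. The helper `mllm_yes_probability` and the prompt wording are hypothetical stand-ins, not the prompts used in the paper.

```python
def mllm_yes_probability(prompt, images):
    """Placeholder: in practice, read P("yes") from the MLLM's output logits."""
    return 0.5  # dummy value so the sketch runs end to end

def rerank(query_text, query_image, candidates, top_k=5):
    scored = []
    for cand in candidates:
        prompt = (
            "Given the user's query (text and image), does this candidate "
            f"satisfy the query?\nQuery: {query_text}\n"
            f"Candidate: {cand.get('text', '')}\nAnswer yes or no."
        )
        images = [img for img in (query_image, cand.get("image")) if img is not None]
        scored.append((mllm_yes_probability(prompt, images), cand))
    # Keep the candidates the MLLM judges most likely to be relevant.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:top_k]]

# Example: rerank the retriever's top candidates for a composed image query.
top_candidates = [{"text": "a red vintage car parked by the beach", "image": None}]
print(rerank("same car but in blue", query_image=None, candidates=top_candidates))
```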
The implications of this research are significant: MM-Embed is the first universal multimodal retriever to demonstrate competitive performance across varied tasks while maintaining robust capabilities in traditional text retrieval. The paper also opens avenues for future research, particularly in knowledge distillation to smaller retriever models and in improved MLLM-based reranking techniques.
In conclusion, this paper lays the groundwork for retrieval models that are flexible enough to handle diverse and complex queries and practical enough to function in real-world applications. With a strong empirical foundation, MM-Embed can inspire future work on curriculum-based training strategies and on extending retrieval to additional modalities, such as audio and video.