Evaluating Universal Multimodal Retrieval with GME: An In-depth Analysis
The paper "GME: Improving Universal Multimodal Retrieval by Multimodal LLMs" delineates a novel approach to the challenge of Universal Multimodal Retrieval (UMR), aiming to establish a unified retrieval framework accommodating various modalities including text, images, and their combinations. The paper introduces the General Multimodal Embedder (GME), a dense retriever architecture built upon Multimodal LLMs (MLLMs), asserting its efficacy in executing UMR tasks with superior performance.
Key Contributions
- Training Data Synthesis: The paper identifies a critical imbalance in existing multimodal datasets, where certain modalities are underrepresented, limiting how well MLLMs can be trained for UMR. To address this, the authors built a data synthesis pipeline that produces a large-scale fused-modal dataset of over 1.1 million pairs, directly targeting the scarcity of balanced multimodal training data.
- Benchmark Development: To evaluate GME against other methods, the authors compiled the UMR Benchmark (UMRB), which spans text retrieval, multimodal content retrieval, and visual document retrieval. The benchmark provides a common platform for comparing retrieval models across a diverse set of tasks.
- Contrastive Learning Paradigm: GME uses contrastive learning to embed queries and candidates from different modalities into a shared representation space, so that retrieval reduces to similarity search in that space. Training is driven by both text and visual inputs, with carefully selected negatives shaping the objective; a minimal sketch of this objective follows the list.
- Analysis and Scalability Evaluation: The paper analyzes how training strategies and model scale affect performance, methodically examining model scalability, training data configurations, and the impact of hard negatives in contrastive learning (a sketch of hard-negative mining also follows the list). These findings guide how MLLMs are best deployed for UMR.
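To make the training objective concrete, below is a minimal sketch of an InfoNCE-style contrastive loss with hard negatives, written in PyTorch. It illustrates the general technique rather than GME's actual training code; the temperature value and tensor shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """Contrastive (InfoNCE-style) loss over a shared embedding space.

    query_emb: (B, D) embeddings of queries (any modality)
    pos_emb:   (B, D) embeddings of the matching candidates
    neg_emb:   (B, K, D) embeddings of K hard negatives per query
    """
    # Normalize so that dot products become cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)

    # Positive similarity, one per query: shape (B, 1).
    pos_sim = (q * p).sum(dim=-1, keepdim=True)

    # Similarities to the K hard negatives: shape (B, K).
    neg_sim = torch.einsum("bd,bkd->bk", q, n)

    # The positive (index 0) must out-score every negative.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```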
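The hard negatives consumed by a loss like the one above are typically mined by running an initial retriever and keeping top-ranked candidates that are not the gold answer. The sketch below shows that common recipe; it is an assumption for illustration, not necessarily the paper's exact mining strategy.

```python
import numpy as np

def mine_hard_negatives(query_vecs, corpus_vecs, positive_ids, k=8):
    """Select top-ranked non-positive candidates as hard negatives.

    query_vecs:   (B, D) query embeddings from an initial retriever
    corpus_vecs:  (N, D) candidate embeddings
    positive_ids: length-B sequence of each query's gold candidate index
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = q @ c.T  # (B, N) cosine similarities
    hard_negatives = []
    for i, pos in enumerate(positive_ids):
        ranked = np.argsort(-scores[i])             # best-scoring first
        negs = [j for j in ranked if j != pos][:k]  # skip the gold answer
        hard_negatives.append(negs)
    return hard_negatives
```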
Experimental Results
The evaluation shows that GME attains state-of-the-art results against existing UMR methods across the datasets in UMRB. Notably, GME delivered significant accuracy improvements over baselines on visual document retrieval tasks, positioning it as a leading model for broad multimodal retrieval.
- Data Balance: The experiments underscored the importance of balancing training data across modalities, showing that data diversity substantially improves representation learning in UMR models.
- Scalability Insights: Performance improved roughly linearly with more model parameters and longer training, though these gains must be weighed against computational cost.
- Embedding Universality: By aligning embeddings from different modalities, GME maps multimodal data into semantically coherent vectors, enabling robust retrieval across the full spectrum of input types, as illustrated in the sketch after this list.
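Because every modality lands in the same space, retrieval at inference time reduces to a single similarity computation, regardless of whether the query is text, an image, or a fused input. The snippet below is a minimal sketch of that step, assuming the embeddings have already been produced by the model; the function and shapes are illustrative, not GME's API.

```python
import numpy as np

def retrieve(query_vec, corpus_vecs, top_k=5):
    """Rank candidates by cosine similarity in the shared embedding space.

    query_vec:   (D,) embedding of a text, image, or fused query
    corpus_vecs: (N, D) embeddings of candidates from any modality
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity, shape (N,)
    top = np.argsort(-scores)[:top_k]   # highest-scoring candidates first
    return top, scores[top]
```

The same function serves text-to-image, image-to-text, and fused-modal retrieval; only the encoder call that produces the vectors changes.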
Implications and Future Outlook
This research carries significant implications for future work in AI, particularly for extending MLLM frameworks to versatile information retrieval applications. By integrating diverse data types under a unified retrieval paradigm, the work lays the groundwork for advances in domains that demand seamless cross-modal understanding and generation.
Future research could deepen the exploration of multilingual contexts and optimize interleaved retrieval, where text and multiple images intermingle within a single query or document. Extending the work toward interactive applications, such as real-time query systems, could further exploit the strengths of multimodal architectures.
In conclusion, the paper advances multimodal retrieval through its data synthesis techniques, its valuable benchmark, and its demonstration that MLLMs can manage complex retrieval tasks, making a notable contribution to the trajectory of large-scale AI deployment in retrieval systems.