
DocMMIR: A Framework for Document Multi-modal Information Retrieval (2505.19312v2)

Published 25 May 2025 in cs.IR

Abstract: The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This results in a +31% improvement in MRR@10 compared to the zero-shot baseline. All our data and code are released at https://github.com/J1mL1/DocMMIR.

Authors (6)
  1. Zirui Li (43 papers)
  2. Siwei Wu (26 papers)
  3. Xingyu Wang (37 papers)
  4. Yi Zhou (438 papers)
  5. Yizhi Li (43 papers)
  6. Chenghua Lin (127 papers)

Summary

  • The paper introduces DocMMIR, a framework and large-scale cross-domain dataset designed to address the challenge of document-level multi-modal information retrieval.
  • DocMMIR achieves significant retrieval performance gains, demonstrating a +31% improvement in MRR@10 over the zero-shot baseline by extending and fine-tuning MLLMs such as CLIP for document-level retrieval.
  • The framework establishes a new benchmark for future research and offers practical implications for enhancing search engines and knowledge management systems requiring accurate multi-modal document retrieval.

DocMMIR: A Framework for Document Multi-modal Information Retrieval

The paper introduces DocMMIR, a framework designed to address the challenge of multi-modal document-level information retrieval across varied domains. It targets a significant gap in existing research: document-level retrieval in multi-modal contexts has remained largely unexplored because cross-domain datasets at this granularity have been absent. DocMMIR unifies documents of diverse formats and domains, such as Wikipedia articles, scientific papers from arXiv, and presentation slides, within a single retrieval scenario. The accompanying large-scale benchmark of 450K samples integrates textual and visual information, and zero-shot evaluation on it exposes the limitations of current state-of-the-art models (CLIP, BLIP2, SigLIP-2, ALIGN), with only CLIP achieving reasonable performance.

The authors extend dual-encoder multimodal LLM (MLLM) architectures to operate on entire documents. Key insights emerge from a systematic exploration of training paradigms, including different cross-modal fusion strategies and loss functions. Their tailored strategy, fine-tuning the CLIP encoder on document-level tasks, yields a +31% improvement in MRR@10 over the zero-shot baseline.
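To make this kind of training recipe concrete, the sketch below shows one plausible setup: document text and image embeddings from a CLIP-style dual encoder are fused by a simple mean, and the model is trained with a symmetric InfoNCE contrastive loss against query embeddings. The fusion choice, temperature, and dummy tensors are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of document-level contrastive training with a CLIP-style
# dual encoder. The mean fusion and symmetric InfoNCE loss are assumptions
# for illustration, not necessarily the paper's exact recipe.
import torch
import torch.nn.functional as F

def fuse_document(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Fuse per-document text and image embeddings (here: simple mean)."""
    doc = (text_emb + image_emb) / 2
    return F.normalize(doc, dim=-1)

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of aligned (query, document) pairs."""
    query_emb = F.normalize(query_emb, dim=-1)
    logits = query_emb @ doc_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_q2d = F.cross_entropy(logits, targets)             # query -> document
    loss_d2q = F.cross_entropy(logits.t(), targets)         # document -> query
    return (loss_q2d + loss_d2q) / 2

# Usage with dummy tensors standing in for CLIP encoder outputs:
B, D = 8, 512
query = torch.randn(B, D)       # query text embeddings
doc_text = torch.randn(B, D)    # aggregated document text embeddings
doc_image = torch.randn(B, D)   # aggregated document image embeddings
loss = info_nce(query, fuse_document(doc_text, doc_image))
print(loss.item())
```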

Implications

The implications of this paper span practical advances in information retrieval technology and theoretical contributions to multi-modal deep learning. Practically, DocMMIR sets a benchmark for future retrieval frameworks, with relevance to search engines, academic databases, and any system that depends on accurate, efficient document retrieval. Theoretically, the demonstrated training paradigms for MLLMs on document-level multimodal retrieval challenge existing assumptions and open pathways for future work on deeper semantic alignment across modalities.

Performance Analysis

Empirical results show that DocMMIR processes and retrieves complex document-level data more effectively than existing MLLMs applied zero-shot. Notably, the fine-tuned CLIP model demonstrates strong semantic understanding across diverse contexts, achieving substantial gains in MRR@10 on the cross-domain benchmark. These results underscore the importance of modality-aware models tailored to document complexity, rather than reliance on instance-level retrieval benchmarks alone.
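Since MRR@10 is the headline metric, the following sketch shows how it is typically computed, assuming one relevant document per query and a ranked list of retrieved document IDs; the evaluation protocol in the paper's released code may differ in detail.

```python
# Minimal sketch of MRR@10, assuming one relevant document per query.
def mrr_at_10(ranked_ids: list[list[int]], relevant_ids: list[int]) -> float:
    total = 0.0
    for ranking, rel in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id == rel:
                total += 1.0 / rank   # reciprocal rank of the first hit
                break
    return total / len(relevant_ids)

# Example: relevant document found at rank 1, rank 3, and not in the top 10.
print(mrr_at_10([[7, 2, 9], [4, 1, 3], [5, 6, 8]], [7, 3, 0]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```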

Future Directions

Looking forward, the framework invites research into dynamic fusion mechanisms and layout-aware processing methods to further improve multimodal feature representation and retrieval accuracy. The released dataset and models also provide a foundation for real-world applications spanning scientific research, educational content delivery, and enterprise knowledge management.

Conclusion

DocMMIR represents a substantive contribution to the field of document-level multi-modal information retrieval, leveraging cross-domain datasets to bolster retrieval effectiveness with advanced MLLM architectures. This paper not only challenges current models to evolve but also guides ongoing research towards more nuanced and comprehensive retrieval systems capable of understanding and aligning content across diverse modalities.
