- The paper introduces M3DocRAG, a framework combining visual and textual data to overcome multi-page document Q&A challenges.
- It employs a three-stage approach—document embedding, page retrieval, and question answering—to manage both single and multi-document contexts.
- Experimental results confirm that M3DocRAG outperforms text-only methods, achieving state-of-the-art performance on diverse DocVQA benchmarks.
This paper introduces M3DocRAG (Multi-modal Multi-page Multi-Document Retrieval-Augmented Generation), a novel multi-modal RAG framework designed to address the challenges of document visual question answering (DocVQA) in real-world scenarios. M3DocRAG flexibly handles various document contexts, question hops, and evidence modalities by using a multi-modal retriever and a multi-modal LLM (MLM). The framework addresses limitations in existing DocVQA methods, which often struggle with questions requiring information across multiple pages or documents and the interpretation of visual elements like figures and charts.
Here's a breakdown of the key aspects:
- Problem Addressed: Existing DocVQA pipelines are limited by their focus on single-page documents or reliance on text-based RAG that ignores visual information. M3DocRAG overcomes these limitations by efficiently handling single or multiple documents while preserving visual information.
- M3DocRAG Framework: The M3DocRAG framework consists of three stages (a minimal code sketch follows the list below):
1. Document Embedding: Document pages are converted into RGB images, and visual embeddings are extracted using a multi-modal retrieval model such as ColPali.
2. Page Retrieval: Relevant document pages are retrieved with a multi-modal retrieval model (e.g., ColPali) based on the text query. Approximate page indices, such as an inverted file (IVF) index, are used for faster search in open-domain settings.
3. Question Answering: A multi-modal LLM (MLM), such as Qwen2-VL, is used to generate answers from the retrieved pages.
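Below is a minimal Python sketch of how these three stages fit together, assuming hypothetical `retriever` and `mlm` wrappers (e.g., around ColPali and Qwen2-VL); the helper names (`embed_images`, `embed_text`, `generate`) are illustrative placeholders, not the paper's actual API.

```python
import torch

def embed_documents(pages, retriever):
    """Stage 1: encode each page image into a multi-vector embedding of shape (n_v, d)."""
    return [retriever.embed_images(page) for page in pages]   # hypothetical ColPali-style call

def retrieve_pages(query, page_embeddings, retriever, k=4):
    """Stage 2: score every page against the query with MaxSim and keep the top-K pages."""
    q_emb = retriever.embed_text(query)                        # (n_q, d)
    scores = torch.stack([
        (q_emb @ p_emb.T).max(dim=1).values.sum()              # MaxSim score s(q, p), defined below
        for p_emb in page_embeddings
    ])
    return scores.topk(k).indices.tolist()

def answer(query, pages, top_k, mlm):
    """Stage 3: feed the retrieved page images plus the question to a multi-modal LLM."""
    return mlm.generate(images=[pages[i] for i in top_k], prompt=query)
```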
- M3DocVQA Dataset: The paper introduces M3DocVQA (Multi-modal Multi-page Multi-Document Visual Question Answering), a new benchmark for evaluating open-domain DocVQA. The dataset contains 2,441 multi-hop questions spanning 3,368 PDF documents (41,005+ pages). Unlike previous DocVQA datasets that focus on single-document question answering, M3DocVQA requires models to answer questions from a large corpus of documents.
- Experimental Results: The authors evaluate M3DocRAG on three benchmarks: M3DocVQA, MMLongBench-Doc, and MP-DocVQA. Results demonstrate that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance compared to strong baselines, including state-of-the-art performance on MP-DocVQA.
- Key Findings:
- Multi-modal RAG outperforms text RAG, especially on non-text evidence sources.
- Multi-modal RAG boosts the long-document understanding of MLMs.
The problem of question answering is categorized into two settings with different document context sizes:
- Closed-domain question answering requires answering a query $q$ over a single given document $D_i$. The retrieval model outputs the top-$K$ relevant page images $P^{q}_{K}$ from the page images $P_i$ of the document $D_i$.
- Open-domain question answering may require information from one or multiple documents within the entire document corpus $C$. The retrieval model outputs the top-$K$ relevant page images $P^{q}_{K}$ from the entire set of page images $P$.
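The two settings differ only in the pool of candidate pages the retriever scores; here is a tiny sketch of that distinction, with illustrative names that are not taken from the paper:

```python
def candidate_pages(corpus, doc_id=None):
    """Return the page images to score: one document's pages (closed-domain)
    or every page in the corpus C (open-domain)."""
    if doc_id is not None:                      # closed-domain: pages P_i of document D_i
        return corpus[doc_id]
    return [page for pages in corpus.values() for page in pages]   # open-domain: all of P
```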
The relevance between the query $q$ and a page $p$ is computed using the MaxSim score $s(q, p)$:

$s(q, p) = \sum_{i=1}^{n_q} \max_{j \in [n_v]} E^{q}_{i,\cdot} \cdot E^{p}_{j,\cdot}$

where:
- $n_q$ is the number of text tokens in the query.
- $n_v$ is the number of visual tokens per page.
- $E^{q}_{i,\cdot} \in \mathbb{R}^{d}$ denotes the $i$-th row (vector) of the query embedding matrix $E^{q} \in \mathbb{R}^{n_q \times d}$.
- $E^{p}_{j,\cdot} \in \mathbb{R}^{d}$ denotes the $j$-th row (vector) of the page embedding matrix $E^{p} \in \mathbb{R}^{n_v \times d}$.
- $d$ denotes the embedding dimension.
The top $K$ pages (with $K \ll N$, the total number of candidate pages) most relevant to answering the query $q$, denoted $P^{q}_{K}$, are identified using:

$P^{q}_{K} = \{p^{q}_{1}, p^{q}_{2}, \dots, p^{q}_{K}\} = \operatorname*{arg\,top}\text{-}K_{p \in P} \; s(q, p)$
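As a sanity check, here is a toy NumPy computation of $s(q, p)$ and the top-$K$ selection with made-up 2-dimensional embeddings (the values are purely illustrative):

```python
import numpy as np

Eq = np.array([[1.0, 0.0],                # query token 1
               [0.0, 1.0]])               # query token 2  (n_q = 2, d = 2)
pages = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),   # page A: n_v = 2 visual tokens
    np.array([[0.1, 0.1], [0.3, 0.2]]),   # page B
]

def maxsim(Eq, Ep):
    # s(q, p): for each query token, take the max dot product over visual tokens, then sum
    return (Eq @ Ep.T).max(axis=1).sum()

scores = np.array([maxsim(Eq, Ep) for Ep in pages])   # approx. [1.7, 0.5]
K = 1
top_k = np.argsort(-scores)[:K]                       # arg top-K over pages -> [0], i.e. page A
```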
The paper also discusses speed-accuracy tradeoffs using different indexing strategies (FlatIP, IVFFlat, IVFPQ) and compares various multi-modal LMs (Idefics2 8B, Idefics3 8B, InternVL2 8B, and Qwen2-VL 7B) and multi-modal retrieval models (ColPali v1 and ColQwen v0.1) within the M3DocRAG framework. Qualitative examples illustrate the framework's ability to handle questions with answer sources in various modalities and requiring multi-page reasoning. M3DocRAG achieves state-of-the-art results on MP-DocVQA, demonstrating its effectiveness in document understanding.
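For reference, below is a minimal FAISS sketch of the three index types mentioned above (FlatIP, IVFFlat, IVFPQ). The page vectors are random stand-ins, and collapsing ColPali's multi-vector page embeddings into single pooled vectors is an assumption made for brevity here, not necessarily the paper's exact indexing setup.

```python
import faiss
import numpy as np

d = 128                                                     # embedding dimension (illustrative)
page_vecs = np.random.rand(10_000, d).astype("float32")     # stand-in for pooled page embeddings

# FlatIP: exact inner-product search -- most accurate, slowest at corpus scale.
flat = faiss.index_factory(d, "Flat", faiss.METRIC_INNER_PRODUCT)

# IVFFlat: an inverted file of coarse clusters prunes the search to a few cells.
ivf_flat = faiss.index_factory(d, "IVF128,Flat", faiss.METRIC_INNER_PRODUCT)

# IVFPQ: IVF plus product quantization -- smallest and fastest, least accurate.
ivf_pq = faiss.index_factory(d, "IVF128,PQ16", faiss.METRIC_INNER_PRODUCT)

query_vec = np.random.rand(1, d).astype("float32")
for index in (flat, ivf_flat, ivf_pq):
    if not index.is_trained:
        index.train(page_vecs)                              # IVF/PQ indices need a training pass
    index.add(page_vecs)
    scores, page_ids = index.search(query_vec, 4)           # retrieve the 4 highest-scoring pages
```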