
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding (2411.04952v1)

Published 7 Nov 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Document visual question answering (DocVQA) pipelines that answer questions from documents have broad applications. Existing methods focus on handling single-page documents with multi-modal LLMs (MLMs), or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, there are difficulties in applying these methods in real-world scenarios: (a) questions often require information across different pages or documents, where MLMs cannot handle many long documents; (b) documents often have important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so that it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. In three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance than many strong baselines, including state-of-the-art performance in MP-DocVQA. We provide comprehensive analyses of different indexing, MLMs, and retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence only exists in images.

Summary

  • The paper introduces M3DocRAG, a framework combining visual and textual data to overcome multi-page document Q&A challenges.
  • It employs a three-stage approach—document embedding, page retrieval, and question answering—to manage both single and multi-document contexts.
  • Experimental results confirm that M3DocRAG outperforms text-only methods, achieving state-of-the-art performance on diverse DocVQA benchmarks.

This paper introduces M3DocRAG (Multi-modal Multi-page Multi-Document Retrieval-Augmented Generation), a novel multi-modal RAG framework designed to address the challenges of document visual question answering (DocVQA) in real-world scenarios. M3DocRAG flexibly handles various document contexts, question hops, and evidence modalities by using a multi-modal retriever and a multi-modal LLM (MLM). The framework addresses limitations in existing DocVQA methods, which often struggle with questions requiring information across multiple pages or documents and the interpretation of visual elements like figures and charts.

Here's a breakdown of the key aspects:

  • Problem Addressed: Existing DocVQA pipelines are limited by their focus on single-page documents or reliance on text-based RAG that ignores visual information. M3DocRAG overcomes these limitations by efficiently handling single or multiple documents while preserving visual information.
  • M3DocRAG Framework: The M3DocRAG framework consists of three stages:

  1. Document Embedding: Document pages are converted into RGB images, and visual embeddings are extracted using a multi-modal retrieval model such as ColPali.
  2. Page Retrieval: Relevant document pages are retrieved with the same multi-modal retrieval model (e.g., ColPali) based on the text query. Approximate page indices, such as an inverted file index (IVF), are used for faster search in open-domain settings.
  3. Question Answering: A multi-modal LLM (MLM), such as Qwen2-VL, generates the answer from the retrieved pages (a minimal code sketch of this pipeline follows the list of key aspects below).

  • M3DocVQA Dataset: The paper introduces M3DocVQA (Multi-modal Multi-page Multi-Document Visual Question Answering), a new benchmark for evaluating open-domain DocVQA. The dataset contains 2,441 multi-hop questions spanning 3,368 PDF documents (41,005+ pages). Unlike previous DocVQA datasets that focus on single-document question answering, M3DocVQA requires models to answer questions from a large corpus of documents.
  • Experimental Results: The authors evaluate M3DocRAG on three benchmarks: M3DocVQA, MMLongBench-Doc, and MP-DocVQA. Results demonstrate that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance compared to strong baselines, including state-of-the-art performance on MP-DocVQA.
  • Key Findings:
    • Multi-modal RAG outperforms text RAG, especially on non-text evidence sources.
    • Multi-modal RAG boosts long document understanding of MLMs.
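
The three-stage pipeline referenced above can be summarized in code. This is a minimal sketch, assuming hypothetical ColPali-style retriever and Qwen2-VL-style reader wrapper objects (`retriever`, `reader`) with illustrative method names; the paper's actual implementation and APIs may differ.

```python
# Minimal sketch of the three-stage M3DocRAG pipeline (hypothetical wrapper APIs).

def m3docrag_answer(question, pages, retriever, reader, k=4):
    """pages: list of page images (RGB) from one or many PDF documents."""
    # Stage 1: Document embedding -- embed every page image (done once, offline, in practice).
    page_embeddings = [retriever.embed_page(img) for img in pages]

    # Stage 2: Page retrieval -- score all pages against the text query and keep the top-K.
    query_embedding = retriever.embed_query(question)
    scores = [retriever.score(query_embedding, emb) for emb in page_embeddings]
    top_k = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]

    # Stage 3: Question answering -- the multi-modal LLM reads the retrieved page images.
    return reader.generate(question, images=[pages[i] for i in top_k])
```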

The problem of question answering is categorized into two settings with different document context sizes:

  • Closed-domain question answering requires answering a query $q$ from a given single document $D_i$. The retrieval model outputs the top $K$ relevant page images $P^q_K$ from the page images $P_i$ of the document $D_i$.
  • Open-domain question answering may require information from single or multiple documents within the entire document corpus $C$. The retrieval model outputs the top $K$ relevant page images $P^q_K$ from the entire set of page images $P$.
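
The two settings differ only in the candidate page pool handed to the retriever; the short sketch below makes that explicit (function and variable names are illustrative, not from the paper's code).

```python
def candidate_pages(setting, corpus, doc_id=None):
    """Select the page pool the retriever scores for a query.

    corpus: dict mapping document id -> list of page images.
    """
    if setting == "closed":   # answer must come from one given document D_i
        return corpus[doc_id]                                         # P_i
    if setting == "open":     # answer may come from anywhere in the corpus C
        return [page for pages in corpus.values() for page in pages]  # P
    raise ValueError(f"unknown setting: {setting}")
```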

The relevance between the query $q$ and the page $p$ is computed using the MaxSim score $s(q, p)$:

$$s(q, p) = \sum_{i = 1}^{n^q} \max_{j \in [n^v]} E^{q}_{i,\cdot} \cdot E^{p}_{j,\cdot}$$

  • $n^q$ is the number of text tokens in the query.
  • $n^v$ is the number of visual tokens per page.
  • $E^{q}_{i,\cdot} \in \mathbb{R}^d$ denotes the $i$-th row (vector) of the query embedding matrix $E^{q} \in \mathbb{R}^{n^q \times d}$.
  • $E^{p}_{j,\cdot} \in \mathbb{R}^d$ denotes the $j$-th row (vector) of the page embedding matrix $E^{p} \in \mathbb{R}^{n^v \times d}$.
  • $d$ denotes the embedding dimension.
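
The MaxSim score can be computed in a few lines; the PyTorch sketch below follows the definitions above directly (tensor shapes are assumptions consistent with those definitions, not taken from the released code).

```python
import torch

def maxsim_score(E_q: torch.Tensor, E_p: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim score s(q, p).

    E_q: (n_q, d) query token embeddings.
    E_p: (n_v, d) visual token embeddings of one page.
    """
    sim = E_q @ E_p.T                   # (n_q, n_v) pairwise dot products
    return sim.max(dim=1).values.sum()  # best visual match per query token, summed
```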

The top $K$ ($K \ll N$, where $N$ is the total number of candidate pages) pages most relevant to answering the query $q$, denoted $P^{q}_{K}$, are identified via:

$$P^{q}_{K} = \{p^{q}_{1}, p^{q}_{2}, \dots, p^{q}_{K}\} = \operatorname*{arg\,top\text{-}K}_{p \in P}\; s(q, p)$$
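
Exhaustively applying this score to every candidate page and keeping the arg-top-$K$ corresponds to the exact (FlatIP) setting discussed below; here is a sketch reusing the `maxsim_score` function above:

```python
import torch

def retrieve_top_k(E_q, page_embeddings, k=4):
    """Return indices of the K candidate pages most relevant to the query."""
    scores = torch.stack([maxsim_score(E_q, E_p) for E_p in page_embeddings])
    k = min(k, len(page_embeddings))
    return torch.topk(scores, k).indices.tolist()
```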

The paper also discusses speed-accuracy tradeoffs of different indexing strategies (FlatIP, IVFFlat, IVFPQ) and compares various MLMs (Idefics2 8B, Idefics3 8B, InternVL2 8B, and Qwen2-VL 7B) and multi-modal retrieval models (ColPali v1 and ColQwen v0.1) within the M3DocRAG framework. Qualitative examples illustrate the framework's ability to handle questions whose answer evidence appears in different modalities and questions that require multi-page reasoning. M3DocRAG achieves state-of-the-art results on MP-DocVQA, demonstrating its effectiveness in document understanding.
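
As a rough illustration of those indexing strategies, the FAISS sketch below builds the three index types over one pooled vector per page; the pooling is a simplification for the example, and the paper's multi-vector indexing details may differ. The parameter values (`d`, `nlist`, `m`, `nbits`) are assumptions, not the paper's settings.

```python
import faiss
import numpy as np

d = 128                                                   # embedding dimension (assumed)
page_vecs = np.random.rand(40_000, d).astype("float32")   # placeholder pooled page vectors

# FlatIP: exact inner-product search (slowest, most accurate).
flat = faiss.IndexFlatIP(d)
flat.add(page_vecs)

# IVFFlat: inverted-file index over uncompressed vectors (approximate, faster).
nlist = 1024                                              # number of IVF clusters (assumed)
quantizer_flat = faiss.IndexFlatIP(d)
ivf_flat = faiss.IndexIVFFlat(quantizer_flat, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf_flat.train(page_vecs)
ivf_flat.add(page_vecs)

# IVFPQ: inverted-file index with product quantization (fastest, most compressed;
# metric defaults to L2 in this minimal form -- shown only to illustrate construction).
m, nbits = 16, 8                                          # PQ sub-quantizers / bits (assumed)
quantizer_pq = faiss.IndexFlatIP(d)
ivf_pq = faiss.IndexIVFPQ(quantizer_pq, d, nlist, m, nbits)
ivf_pq.train(page_vecs)
ivf_pq.add(page_vecs)

# Searching any of the indices returns (scores, page indices) for a query vector.
query_vec = np.random.rand(1, d).astype("float32")
scores, idx = ivf_flat.search(query_vec, 4)
```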
