
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding (2411.04952v1)

Published 7 Nov 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Document visual question answering (DocVQA) pipelines that answer questions from documents have broad applications. Existing methods focus on handling single-page documents with multi-modal LLMs (MLMs), or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, there are difficulties in applying these methods in real-world scenarios: (a) questions often require information across different pages or documents, where MLMs cannot handle many long documents; (b) documents often have important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so that it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. In three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance than many strong baselines, including state-of-the-art performance in MP-DocVQA. We provide comprehensive analyses of different indexing, MLMs, and retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence only exists in images.

Summary

  • The paper introduces M3DocRAG, a framework combining visual and textual data to overcome multi-page document Q&A challenges.
  • It employs a three-stage approach—document embedding, page retrieval, and question answering—to manage both single and multi-document contexts.
  • Experimental results confirm that M3DocRAG outperforms text-only methods, achieving state-of-the-art performance on diverse DocVQA benchmarks.

This paper introduces M3DocRAG (Multi-modal Multi-page Multi-Document Retrieval-Augmented Generation), a novel multi-modal RAG framework designed to address the challenges of document visual question answering (DocVQA) in real-world scenarios. M3DocRAG flexibly handles various document contexts, question hops, and evidence modalities by using a multi-modal retriever and a multi-modal LLM (MLM). The framework addresses limitations in existing DocVQA methods, which often struggle with questions requiring information across multiple pages or documents and the interpretation of visual elements like figures and charts.

Here's a breakdown of the key aspects:

  • Problem Addressed: Existing DocVQA pipelines are limited by their focus on single-page documents or reliance on text-based RAG that ignores visual information. M3DocRAG overcomes these limitations by efficiently handling single or multiple documents while preserving visual information.
  • M3DocRAG Framework: The M3DocRAG framework consists of three stages:

  1. Document Embedding: Document pages are converted into RGB images, and visual embeddings are extracted using a multi-modal retrieval model such as ColPali.
  2. Page Retrieval: Relevant document pages are retrieved with the same multi-modal retrieval model (e.g., ColPali) based on the text query. Approximate page indices, such as an inverted file index (IVF), are used for faster search in open-domain settings.
  3. Question Answering: A multi-modal LLM (MLM), such as Qwen2-VL, generates the answer from the retrieved pages (a minimal code sketch of this pipeline follows the list of key aspects below).

  • M3DocVQA Dataset: The paper introduces M3DocVQA (Multi-modal Multi-page Multi-Document Visual Question Answering), a new benchmark for evaluating open-domain DocVQA. The dataset contains 2,441 multi-hop questions spanning 3,368 PDF documents (41,005+ pages). Unlike previous DocVQA datasets that focus on single-document question answering, M3DocVQA requires models to answer questions from a large corpus of documents.
  • Experimental Results: The authors evaluate M3DocRAG on three benchmarks: M3DocVQA, MMLongBench-Doc, and MP-DocVQA. Results demonstrate that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance compared to strong baselines, including state-of-the-art performance on MP-DocVQA.
  • Key Findings:
    • Multi-modal RAG outperforms text RAG, especially on non-text evidence sources.
    • Multi-modal RAG boosts long document understanding of MLMs.
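
The three-stage pipeline referenced above can be summarized in code. This is a minimal sketch, assuming hypothetical ColPali-style retriever and Qwen2-VL-style reader wrapper objects (`retriever`, `reader`) with illustrative method names; the paper's actual implementation and APIs may differ.

```python
# Minimal sketch of the three-stage M3DocRAG pipeline (hypothetical wrapper APIs).

def m3docrag_answer(question, pages, retriever, reader, k=4):
    """pages: list of page images (RGB) from one or many PDF documents."""
    # Stage 1: Document embedding -- embed every page image (done once, offline, in practice).
    page_embeddings = [retriever.embed_page(img) for img in pages]

    # Stage 2: Page retrieval -- score all pages against the text query and keep the top-K.
    query_embedding = retriever.embed_query(question)
    scores = [retriever.score(query_embedding, emb) for emb in page_embeddings]
    top_k = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]

    # Stage 3: Question answering -- the multi-modal LLM reads the retrieved page images.
    return reader.generate(question, images=[pages[i] for i in top_k])
```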

The problem of question answering is categorized into two settings with different document context sizes:

  • Closed-domain question answering requires answering a query $q$ from a given single document $D_i$. The retrieval model outputs the top $K$ relevant page images $P^q_K$ from the page images $P_i$ of the document $D_i$.
  • Open-domain question answering may require information from single or multiple documents within the entire document corpus $C$. The retrieval model outputs the top $K$ relevant page images $P^q_K$ from the entire set of page images $P$.
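
The two settings differ only in the candidate page pool handed to the retriever; the short sketch below makes that explicit (function and variable names are illustrative, not from the paper's code).

```python
def candidate_pages(setting, corpus, doc_id=None):
    """Select the page pool the retriever scores for a query.

    corpus: dict mapping document id -> list of page images.
    """
    if setting == "closed":   # answer must come from one given document D_i
        return corpus[doc_id]                                         # P_i
    if setting == "open":     # answer may come from anywhere in the corpus C
        return [page for pages in corpus.values() for page in pages]  # P
    raise ValueError(f"unknown setting: {setting}")
```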

The relevance between the query $q$ and the page $p$ is computed using the MaxSim score $s(q, p)$:

$$s(q, p) = \sum_{i = 1}^{n^q} \max_{j \in [n^v]} E^{q}_{i,\cdot} \cdot E^{p}_{j,\cdot}$$

  • $n^q$ is the number of text tokens in the query.
  • $n^v$ is the number of visual tokens per page.
  • $E^{q}_{i,\cdot} \in \mathbb{R}^d$ denotes the $i$-th row (vector) of the query embedding matrix $E^{q} \in \mathbb{R}^{n^q \times d}$.
  • $E^{p}_{j,\cdot} \in \mathbb{R}^d$ denotes the $j$-th row (vector) of the page embedding matrix $E^{p} \in \mathbb{R}^{n^v \times d}$.
  • $d$ denotes the embedding dimension.
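
The MaxSim score can be computed in a few lines; the PyTorch sketch below follows the definitions above directly (tensor shapes are assumptions consistent with those definitions, not taken from the released code).

```python
import torch

def maxsim_score(E_q: torch.Tensor, E_p: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim score s(q, p).

    E_q: (n_q, d) query token embeddings.
    E_p: (n_v, d) visual token embeddings of one page.
    """
    sim = E_q @ E_p.T                   # (n_q, n_v) pairwise dot products
    return sim.max(dim=1).values.sum()  # best visual match per query token, summed
```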

The top $K$ ($K \ll N$, where $N$ is the total number of candidate pages) pages most relevant to answering the query $q$, denoted $P^{q}_{K}$, are identified via:

$$P^{q}_{K} = \{p^{q}_{1}, p^{q}_{2}, \dots, p^{q}_{K}\} = \operatorname*{arg\,top\text{-}K}_{p \in P}\; s(q, p)$$
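
Exhaustively applying this score to every candidate page and keeping the arg-top-$K$ corresponds to the exact (FlatIP) setting discussed below; here is a sketch reusing the `maxsim_score` function above:

```python
import torch

def retrieve_top_k(E_q, page_embeddings, k=4):
    """Return indices of the K candidate pages most relevant to the query."""
    scores = torch.stack([maxsim_score(E_q, E_p) for E_p in page_embeddings])
    k = min(k, len(page_embeddings))
    return torch.topk(scores, k).indices.tolist()
```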

The paper also discusses speed-accuracy tradeoffs of different indexing strategies (FlatIP, IVFFlat, IVFPQ) and compares various MLMs (Idefics2 8B, Idefics3 8B, InternVL2 8B, and Qwen2-VL 7B) and multi-modal retrieval models (ColPali v1 and ColQwen v0.1) within the M3DocRAG framework. Qualitative examples illustrate the framework's ability to handle questions whose answer evidence appears in different modalities and questions that require multi-page reasoning. M3DocRAG achieves state-of-the-art results on MP-DocVQA, demonstrating its effectiveness in document understanding.
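
As a rough illustration of those indexing strategies, the FAISS sketch below builds the three index types over one pooled vector per page; the pooling is a simplification for the example, and the paper's multi-vector indexing details may differ. The parameter values (`d`, `nlist`, `m`, `nbits`) are assumptions, not the paper's settings.

```python
import faiss
import numpy as np

d = 128                                                   # embedding dimension (assumed)
page_vecs = np.random.rand(40_000, d).astype("float32")   # placeholder pooled page vectors

# FlatIP: exact inner-product search (slowest, most accurate).
flat = faiss.IndexFlatIP(d)
flat.add(page_vecs)

# IVFFlat: inverted-file index over uncompressed vectors (approximate, faster).
nlist = 1024                                              # number of IVF clusters (assumed)
quantizer_flat = faiss.IndexFlatIP(d)
ivf_flat = faiss.IndexIVFFlat(quantizer_flat, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf_flat.train(page_vecs)
ivf_flat.add(page_vecs)

# IVFPQ: inverted-file index with product quantization (fastest, most compressed;
# metric defaults to L2 in this minimal form -- shown only to illustrate construction).
m, nbits = 16, 8                                          # PQ sub-quantizers / bits (assumed)
quantizer_pq = faiss.IndexFlatIP(d)
ivf_pq = faiss.IndexIVFPQ(quantizer_pq, d, nlist, m, nbits)
ivf_pq.train(page_vecs)
ivf_pq.add(page_vecs)

# Searching any of the indices returns (scores, page indices) for a query vector.
query_vec = np.random.rand(1, d).astype("float32")
scores, idx = ivf_flat.search(query_vec, 4)
```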
