MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding (2503.13964v1)

Published 18 Mar 2025 in cs.LG

Abstract: Document Question Answering (DocQA) is a very common task. Existing methods using LLMs or Large Vision-Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, achieving an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.

Summary

  • The paper introduces MDocAgent, a framework combining dual Retrieval Augmented Generation pipelines and a five-agent system for enhanced multi-modal document understanding and question answering.
  • MDocAgent employs separate RAG for text and images, using specialized agents (General, Critical, Text, Image, Summarizing) to collaboratively process information and facilitate detailed cross-modal analysis.
  • Evaluations across five benchmarks demonstrate MDocAgent's effectiveness, achieving an average performance improvement of 12.1% over strong baselines, highlighting its capability for complex document types.

MDocAgent introduces a framework for document question answering (DocQA) designed to overcome the limitations of existing methods, particularly their insufficient integration of multi-modal information (text and vision) and difficulties with complex reasoning over long documents (2503.13964). The system architecture combines Retrieval Augmented Generation (RAG) principles with a specialized multi-agent system to facilitate a more comprehensive understanding by explicitly leveraging both textual content and visual cues inherent in documents.

Architectural Framework

The MDocAgent framework is built upon two core components: a dual RAG pipeline for context retrieval and a five-agent system for collaborative analysis and synthesis.

1. Dual RAG Pipelines: Recognizing the distinct nature of textual and visual information, MDocAgent employs separate retrieval mechanisms for each modality (see the retrieval sketch below):
  • Text-based RAG: Utilizes ColBERTv2 to embed and retrieve the top-k text segments (T_q) from the document that are most relevant to the input query (q). This focuses on semantic textual relevance.
  • Image-based RAG: Employs ColPali (or alternatively ColQwen2-v1.0) to retrieve the top-k document pages as images (I_q) based on visual and textual relevance to the query. This captures information embedded in layout, figures, charts, and tables that might be lost or misrepresented in plain text extraction.

2. Multi-Agent System: The retrieved multi-modal context (T_q, I_q) is processed by a collaborative system of five distinct agents, each with a specific role:
  • General Agent (A_G): Receives both T_q and I_q and performs an initial multi-modal assessment to generate a preliminary answer (a_G). This agent serves as the initial point of information fusion.
  • Critical Agent (A_C): Analyzes the query q, the retrieved contexts T_q and I_q, and the general agent's answer a_G. Its function is to identify and extract the most crucial textual snippets (T_c) and visual elements (represented textually, I_c) required for an accurate response. This step explicitly pinpoints cross-modal evidence.
  • Text Agent (A_T): Focuses exclusively on the textual domain. It processes the retrieved text segments T_q, guided by the critical text information T_c identified by A_C, to generate a refined text-centric answer (a_T).
  • Image Agent (A_I): Specializes in visual analysis. It processes the retrieved images I_q, guided by the critical visual information I_c from A_C, to generate an answer grounded in visual evidence (a_I).
  • Summarizing Agent (A_S): Acts as the final synthesizer. It integrates the preliminary multi-modal answer a_G, the text-focused answer a_T, and the image-focused answer a_I to produce the final, consolidated answer a_S. This agent resolves potential conflicts or inconsistencies between the modality-specific analyses.

This structured, role-differentiated approach allows for specialized processing while ensuring continuous integration and cross-validation of information derived from both text and visual sources.
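
To make the dual retrieval step concrete, the following Python sketch runs the two pipelines side by side and keeps the top-k hits from each. The `text_scorer` and `page_scorer` callables are hypothetical stand-ins for ColBERTv2 and ColPali relevance scoring; the released MDocAgent code may structure this differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class RetrievedContext:
    text_segments: List[str]   # T_q: top-k text chunks
    page_images: List[str]     # I_q: top-k page images (paths or ids)

def top_k(items: Sequence, scores: Sequence[float], k: int) -> list:
    """Return the k items with the highest relevance scores."""
    ranked = sorted(zip(scores, range(len(items))), reverse=True)
    return [items[i] for _, i in ranked[:k]]

def retrieve_dual_context(
    query: str,
    segments: List[str],
    pages: List[str],
    text_scorer: Callable[[str, str], float],   # stand-in for a ColBERTv2 scorer (assumption)
    page_scorer: Callable[[str, str], float],   # stand-in for a ColPali scorer (assumption)
    k: int = 4,
) -> RetrievedContext:
    """Run the text RAG and image RAG pipelines independently and keep top-k from each."""
    text_scores = [text_scorer(query, s) for s in segments]
    page_scores = [page_scorer(query, p) for p in pages]
    return RetrievedContext(
        text_segments=top_k(segments, text_scores, k),
        page_images=top_k(pages, page_scores, k),
    )
```

Keeping the two retrievers independent mirrors the paper's design: neither modality is filtered through the other before the agents see it.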

Multi-Modal Integration Strategy

MDocAgent integrates textual and visual information systematically throughout its processing pipeline:

  1. Pre-processing: Documents are processed to extract both textual content (via OCR or PDF parsing) and visual page representations (images). This parallel representation preserves both semantic content and visual layout/elements.
  2. Multi-Modal Retrieval: The dual RAG system independently retrieves relevant context from both the text corpus and the image set, ensuring that potentially relevant information from either modality is available for downstream processing.
  3. Initial Fusion and Criticality Assessment: The General Agent (A_G) performs the first explicit fusion by considering both T_q and I_q. Subsequently, the Critical Agent (A_C) performs a deeper cross-modal analysis to identify the specific pieces of text (T_c) and visual information (I_c) that are most pertinent to the query, establishing explicit links between modalities.
  4. Guided Specialized Analysis: The Text Agent (A_T) and Image Agent (A_I) conduct modality-specific analysis, but crucially, their focus is guided by the critical information (T_c, I_c) identified by A_C. This prevents siloed reasoning and ensures their specialized analyses contribute to the integrated understanding required by the query.
  5. Final Synthesis: The Summarizing Agent (A_S) explicitly reconciles the preliminary multi-modal perspective (a_G) with the refined, modality-specific answers (a_T, a_I). This final step synthesizes diverse analytical outputs into a single, coherent answer reflecting a holistic understanding of the document content across modalities (sketched in code below).

The iterative process involving initial fusion, critical evidence identification, guided specialized analysis, and final synthesis enables MDocAgent to handle complex queries that require drawing connections between text and visual elements like tables, charts, or diagrams.
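
The five-step flow can be expressed as a short orchestration function. The sketch below assumes a generic `call_agent(role, prompt)` helper for querying the underlying LLM/LVLM backbones; the prompt strings and output handling are illustrative placeholders, not the paper's actual prompts.

```python
from typing import Callable, List

def answer_question(
    query: str,
    text_ctx: List[str],     # T_q from the text RAG pipeline
    image_ctx: List[str],    # I_q from the image RAG pipeline (page references)
    call_agent: Callable[[str, str], str],  # (role, prompt) -> response
) -> str:
    # 1. General agent: preliminary multi-modal answer a_G from T_q and I_q.
    a_g = call_agent("general", f"Question: {query}\nText: {text_ctx}\nPages: {image_ctx}")

    # 2. Critical agent: extract the decisive textual (T_c) and visual (I_c) evidence.
    critical = call_agent(
        "critical",
        f"Question: {query}\nDraft answer: {a_g}\nText: {text_ctx}\nPages: {image_ctx}\n"
        "List the critical textual and visual evidence.",
    )

    # 3. Text and image agents: modality-specific answers guided by the critical evidence.
    a_t = call_agent("text", f"Question: {query}\nText: {text_ctx}\nCritical evidence: {critical}")
    a_i = call_agent("image", f"Question: {query}\nPages: {image_ctx}\nCritical evidence: {critical}")

    # 4. Summarizing agent: reconcile a_G, a_T, a_I into the final answer a_S.
    return call_agent(
        "summarize",
        f"Question: {query}\nGeneral: {a_g}\nText-based: {a_t}\nImage-based: {a_i}\n"
        "Produce one consolidated answer.",
    )
```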

Implementation and Experimental Validation

The practical implementation and effectiveness of MDocAgent were demonstrated through experiments using specific models and benchmark evaluations.

  • Models: Llama-3.1-8B-Instruct served as the backbone for the Text Agent (A_T), while Qwen2-VL-7B-Instruct, a large vision-language model (LVLM), was used for the other four agents (A_G, A_C, A_I, A_S). For retrieval, ColBERTv2 handled text, and ColPali or ColQwen2-v1.0 handled images (see the loading sketch after this list).
  • Retrieval Settings: Experiments were conducted using both top-1 and top-4 retrieval settings for both text and image RAG pipelines to assess performance under varying context lengths.
  • Benchmarks: Evaluation was performed on five diverse datasets: MMLongBench, LongDocURL, PaperTab, PaperText, and FetaTab. These benchmarks cover various document types (webpages, scientific papers), lengths, and query complexities, including those requiring reasoning over text, tables, charts, and figures.
  • Performance Results: MDocAgent demonstrated significant improvements over baseline methods, including state-of-the-art LVLMs (e.g., Qwen2.5-VL, LLaVA) and RAG-based approaches (text-only ColBERTv2+LLaMA-3.1-8B, multi-modal M3DocRAG).
    • Compared to the strongest baseline (M3DocRAG), MDocAgent achieved an average performance improvement of 12.1% across the five benchmarks with top-1 retrieval and 10.9% with top-4 retrieval.
    • Compared to a strong text-only RAG baseline (ColBERTv2+Llama-3.1-8B), the improvement was 6.9% with top-4 retrieval, highlighting the benefit of multi-modal integration.
    • The gains over end-to-end LVLMs were substantial (e.g., 51.9% average improvement over Qwen2.5-VL with top-1 retrieval), underscoring the advantages of the RAG + multi-agent structure for complex DocQA tasks.
  • Ablation Studies: Ablation experiments confirmed the contribution of each component. Removing either the Text Agent or the Image Agent led to performance degradation, validating the need for specialized processing. Removing the General and Critical agents also resulted in lower scores, confirming their roles in initial integration and guiding subsequent analysis.
  • Robustness: The framework showed robustness to the choice of image RAG backbone, achieving similar performance levels when using ColPali versus ColQwen2-v1.0.
  • Qualitative Analysis: A case study illustrated MDocAgent's ability to correctly synthesize information from both text passages and tables within a document to answer a comparative question, a task where baseline RAG methods reportedly failed.
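
For reference, the reported agent backbones could be instantiated with Hugging Face transformers roughly as follows. Model identifiers follow the names given in the paper; the released MDocAgent code may load and wrap these models differently.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    Qwen2VLForConditionalGeneration,
)

# Text agent backbone (text-only LLM).
text_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Shared LVLM backbone for the general, critical, image, and summarizing agents.
lvlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
lvlm_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```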

Significance and Contributions

The MDocAgent framework makes several contributions to the field of document understanding:

  • Enhanced Multi-Modal Reasoning: It provides a structured and effective mechanism for integrating textual and visual information, moving beyond simple concatenation or modality prioritization common in prior work. The multi-agent collaboration explicitly facilitates reasoning across modalities.
  • Improved Handling of Long and Complex Documents: By combining RAG's ability to retrieve relevant context from large documents with the focused analytical capabilities of specialized agents, MDocAgent addresses the challenges of information overload and detailed reasoning often faced by monolithic models.
  • Demonstration of Multi-Agent Systems for DocQA: The work successfully applies a multi-agent paradigm to the complex domain of multi-modal DocQA, showcasing how collaborative specialization can lead to more robust and accurate results.
  • State-of-the-Art Performance: The substantial empirical improvements demonstrated across multiple challenging benchmarks establish MDocAgent as a highly competitive approach for DocQA, particularly for documents containing rich visual elements.

In conclusion, MDocAgent presents a novel and effective architecture for multi-modal document question answering by synergistically combining dual RAG pipelines with a collaborative multi-agent system. Its design facilitates superior integration of textual and visual information, leading to significant performance improvements on complex DocQA tasks involving diverse document types.
