Agentic Document Collection VQA
- The paper introduces an agentic VQA framework leveraging multi-agent architectures to decompose complex queries across diverse document collections.
- It employs specialized agents, such as OCR verifiers and table readers, with dynamic routing to achieve verifiable, visually grounded evidence retrieval.
- Iterative retrieval loops and multi-hop reasoning enable robust evidence fusion, significantly outperforming static retrieval-augmented pipelines in handling multimodal data.
Agentic Document Collection Visual Question Answering (VQA) is a research paradigm and system design approach that employs autonomous, multi-component agent architectures to process, search, and reason over large, heterogeneous document collections containing multimodal information. These agentic systems dynamically decompose complex queries, orchestrate specialized agents for evidence retrieval and inference, and provide verifiable, interpretable answers with explicit references to the relevant document locations. In doing so, they bridge the gap between passive retrieval-augmented pipelines and strategic, iterative, human-like document navigation and composite reasoning (Lassoued et al., 2 Mar 2026, Borchmann et al., 12 Mar 2026, Jin et al., 30 Oct 2025, Jain et al., 16 Jun 2025, Suri et al., 2024).
1. Task Definition and Formal Properties
Agentic Document Collection VQA generalizes single-document VQA to settings in which questions may require locating, synthesizing, and reasoning over content scattered across entire corpora of scanned forms, multipage PDFs, or presentations. The collection is formally denoted as $\mathcal{C} = \{D_1, \dots, D_N\}$, with each document $D_i$ comprising multiple pages, rich visual elements, and machine-printed or handwritten text. Given a natural-language query $q$, the agent must output:
- an answer $a$, where each token is extracted from the underlying documents, and
- a minimal supporting evidence set $E$ of (possibly cross-document) pages or blocks, such that $E$ entails $a$ and $E$ is minimal (Borchmann et al., 12 Mar 2026, Tito et al., 2021).
Key characteristics:
- Extractive and multimodal: Answers must be directly supported by visual/textual document elements (not model priors).
- Multi-hop: Reasoning may require information integration from several locations or modalities.
- Agentic: No single retrieval operation suffices; agents must plan, navigate, and refine.
- Visual grounding: Answers leverage graphic, tabular, and layout cues, not plain OCR text alone.
This task is evaluated using metrics that jointly assess answer correctness, evidence localization (precision/recall over retrieved document sections), and the efficiency of navigation/effort required to reach a correct conclusion (Borchmann et al., 12 Mar 2026, Tito et al., 2021, Suri et al., 2024, Mohammadshirazi et al., 22 Nov 2025).
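The evidence-localization component of such metrics can be illustrated with a set-based precision, recall, and F1 over page identifiers. This is a minimal sketch, not the scoring code of any cited benchmark; the `(document id, page number)` representation and the function name `evidence_prf` are assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation of one evidence unit: (document id, page number).
Page = Tuple[str, int]


def evidence_prf(predicted: Set[Page], gold: Set[Page]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted evidence pages against gold pages."""
    if not predicted and not gold:
        return 1.0, 1.0, 1.0  # nothing to find and nothing returned
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# One extra page is retrieved, so recall is perfect but precision is not.
pred = {("report_a", 3), ("report_a", 4), ("report_b", 1)}
gold = {("report_a", 3), ("report_b", 1)}
precision, recall, f1 = evidence_prf(pred, gold)
```

Because the gold set is meant to be minimal, over-retrieval lowers precision even when every required page is found, which is one way a metric can penalize unfocused navigation.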
2. Multi-Agent Architectures and Orchestration
The agentic paradigm advances over static retrieval-augmented generation (RAG) by implementing explicit multi-agent systems that decompose, route, and recompose reasoning in a modular, interpretable fashion.
ORCA (Lassoued et al., 2 Mar 2026) exemplifies this architecture:
- Reasoning Agent: Decomposes the query over the given documents into a chain of logical steps and emits an initial answer proposal.
- Routing Mechanism: For each sub-task, computes embeddings and applies a learned routing function to select the specialized agents suited to its modality or task.
- Specialized-Agent Dock: Includes agents for layout parsing, OCR/HTR verification, table reading, chart comprehension, form analysis, free-text, image reasoning, binary classification, and miscellaneous tasks. Each agent receives its routed sub-task and outputs a partial answer.
- Orchestrator: Executes agents sequentially according to a dynamic order, incorporating iterative refinement, consistency masking, and final aggregation.
Stress-testing, debate mechanisms, thesis-antithesis adjudication, and format sanity checking further enhance reliability, consistency, and answer verifiability. Empirical ablations confirm the centrality of reasoning decomposition and specialist routing to system performance, with debate-driven verification yielding critical improvements in challenging cases (Lassoued et al., 2 Mar 2026).
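The routing step described above can be sketched as a similarity lookup between a sub-task embedding and one embedding per specialist. This is only an illustration under stated assumptions: ORCA's routing function is learned, whereas the sketch substitutes plain cosine similarity with a fixed threshold, and the agent names, vectors, and `route` helper are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route(
    subtask_vec: Sequence[float],
    agent_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Return agents scoring at or above the threshold, best first.

    Falls back to the single best-scoring agent when none pass.
    Cosine similarity is a stand-in for a learned routing function.
    """
    scored = sorted(
        ((cosine(subtask_vec, vec), name) for name, vec in agent_vecs.items()),
        reverse=True,
    )
    selected = [name for score, name in scored if score >= threshold]
    return selected or [scored[0][1]]


# Toy agent embeddings (assumed values, for illustration only).
AGENTS = {
    "table_reader": [1.0, 0.0, 0.0],
    "chart_reader": [0.0, 1.0, 0.0],
    "ocr_verifier": [0.0, 0.0, 1.0],
}
subtask = [0.9, 0.1, 0.4]
chosen = route(subtask, AGENTS)
chosen_loose = route(subtask, AGENTS, threshold=0.4)
```

Lowering the threshold lets several specialists run on the same sub-task, which is the situation in which an orchestrator has to decide on an execution order and aggregate partial answers.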
SlideAgent implements a hierarchical agentic framework in which global, page, and element-level agents construct a structured, query-agnostic representation that enables coarse-to-fine inference and selective agent activation (Jin et al., 30 Oct 2025).
3. Iterative Retrieval and Evidence Synthesis
Agentic document collection VQA systems rely on sophisticated retrieval loops and memory structures to efficiently gather, filter, and synthesize evidence across complex corpora.
- Dual-cue retrieval in SimpleDoc uses both embedding-based similarity and LLM-driven summary relevance to shortlist and rank candidate document pages. A reasoner agent iteratively invokes this dual retriever, maintaining a working memory of accumulated evidence and rationale notes, repeatedly retrieving and updating its hypotheses until answer sufficiency or abstention is declared (Jain et al., 16 Jun 2025).
- Graph-based memory in G-Reader leverages a dual-evolving architecture: a heterogeneous Content Graph encodes document-native structure (text spans, tables, figures, cross-modal references), while a Planning Graph captures the evolving reasoning plan as a DAG of sub-questions with stepwise, evidence-driven updates (Du et al., 29 Jan 2026).
- Evidence fusion and chain-of-thought (CoT): Modern systems (e.g., VisDoMRAG) process textual and visual evidence in parallel retrieval+CoT chains, enforcing consistency-constrained late fusion for final answer synthesis (Suri et al., 2024).
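The dual-cue idea in the first item above can be sketched as a two-stage ranking: shortlist pages by embedding similarity, then re-rank the shortlist by a summary-relevance score. This is a minimal sketch rather than SimpleDoc's implementation; both score dictionaries are assumed to be precomputed (in a real system the second would come from an LLM judging page summaries), and `dual_cue_rank` and its parameters are hypothetical names.

```python
from typing import Dict, List


def dual_cue_rank(
    embed_scores: Dict[int, float],
    summary_scores: Dict[int, float],
    shortlist_k: int = 3,
    final_k: int = 2,
) -> List[int]:
    """Shortlist pages by embedding score, then re-rank by summary relevance."""
    # Cue 1: keep the pages most similar to the query in embedding space.
    shortlist = sorted(embed_scores, key=embed_scores.get, reverse=True)[:shortlist_k]
    # Cue 2: reorder the shortlist by summary relevance (missing scores count as 0).
    reranked = sorted(shortlist, key=lambda page: summary_scores.get(page, 0.0), reverse=True)
    return reranked[:final_k]


# Assumed toy scores keyed by page number.
embed_scores = {1: 0.82, 2: 0.40, 3: 0.77, 4: 0.75, 5: 0.10}
summary_scores = {1: 0.2, 3: 0.9, 4: 0.6}
top_pages = dual_cue_rank(embed_scores, summary_scores)
```

In this toy case page 1 has the highest embedding score but is dropped after re-ranking, which shows how the second cue can overrule the first.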
This agentic retrieval loop typically features the following workflow:
- Query reception and possible initial decomposition.
- Retrieval of top-$k$ evidence candidates using embedding and/or summary similarity. For multi-document scenarios, both textual pipelines (e.g., BM25, BGE-1.5) and visual pipelines (multi-vector retrievers such as ColQwen2) are used (Suri et al., 2024, Du et al., 29 Jan 2026).
- Evidence curation via LLM prompts to select supporting elements (paragraphs, tables, figures), generate chain-of-thought explanations, and produce intermediate answers.
- Consistency checking and fusion through explicit CoT alignment or debate, triggering re-retrieval or sub-query generation if inconsistency or low confidence arises.
- Final answer generation and rationale trace output.
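The workflow above can be sketched as a bounded loop that alternates retrieval and reasoning while keeping a working memory. This is a toy sketch under stated assumptions: `retrieve` and `reason` are placeholder callables standing in for a real retriever and an LLM reasoner, and the `Memory` class, the stopping rule, and the example corpus are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    evidence: List[str] = field(default_factory=list)  # accumulated evidence snippets
    notes: List[str] = field(default_factory=list)     # follow-up queries, kept as a trace


def agentic_loop(
    query: str,
    retrieve: Callable[[str], List[str]],
    reason: Callable[[str, List[str]], Tuple[Optional[str], str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], Memory]:
    """Retrieve, reason, and re-query until an answer is produced or the budget ends."""
    memory = Memory()
    current_query = query
    for _ in range(max_rounds):
        new_items = [e for e in retrieve(current_query) if e not in memory.evidence]
        if not new_items:
            break  # nothing new was found: stop rather than loop unproductively
        memory.evidence.extend(new_items)
        answer, follow_up = reason(query, memory.evidence)
        if answer is not None:
            return answer, memory
        memory.notes.append(follow_up)
        current_query = follow_up  # sub-query generated by the reasoner
    return None, memory  # abstain


# Toy corpus and components (assumed data, for illustration only).
CORPUS = {
    "revenue 2023": ["p4: Revenue 2023 was 12M (see table on p9 for 2022)"],
    "revenue 2022": ["p9: Revenue 2022 was 10M"],
}


def toy_retrieve(q: str) -> List[str]:
    return CORPUS.get(q, [])


def toy_reason(query: str, evidence: List[str]) -> Tuple[Optional[str], str]:
    text = " ".join(evidence)
    if "12M" in text and "10M" in text:
        return "2M", ""
    return None, "revenue 2022"  # evidence is insufficient: ask for the missing year


answer, memory = agentic_loop("revenue 2023", toy_retrieve, toy_reason)
```

The first round finds only the 2023 figure, so the reasoner issues a follow-up query; the second round supplies the 2022 figure and the loop terminates with an answer and a trace of the sub-queries it issued.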
The following table summarizes the major retrieval and reasoning architectures:
| System | Core Retrieval | Reasoning Paradigm | Fusion/Verification |
|---|---|---|---|
| ORCA (Lassoued et al., 2 Mar 2026) | Routing via page/task similarity | Multi-agent chain, debate | Stress-testing, adjudication |
| SimpleDoc (Jain et al., 16 Jun 2025) | Dual-cue (embedding + summary) | Iterative reasoner, memory | Answer/not-answerable loop |
| G-Reader (Du et al., 29 Jan 2026) | Content Graph readout | Planning Graph (DAG) | Evidence sufficiency, replanning |
| VisDoMRAG (Suri et al., 2024) | Textual + visual RAG | Parallel CoT chains | Consistency-constrained fusion |
4. Evaluation Datasets and Benchmarks
Benchmarking agentic document collection VQA requires datasets with large, diverse, and richly annotated document corpora and challenging question types. Key datasets include:
- MADQA (Borchmann et al., 12 Mar 2026): 2,250 human-authored queries over 800 PDF documents spanning 13 domains; features high proportions of multi-hop (17.3%) and visually grounded (58%) questions. Evaluates answer correctness (semantic and exact match), evidence retrieval (page/doc F1), and effort calibration (accuracy-effort Kuiper statistic), enabling direct comparison to human researchers.
- VisDoMBench (Suri et al., 2024): >2,000 queries over 1,277 documents (tables, charts, slides), with challenging multi-document retrieval, complex cross-modal reasoning, and distractors.
- SlideVQA (Tanaka et al., 2023): 43,738 QA pairs over 17,165 slides; provides evidence indices, arithmetic expressions, and supports both compositional and numerical multi-hop reasoning.
- DocCVQA (Tito et al., 2021): Designed for collection-level VQA, emphasizes extraction and grounded evidence retrieval with metrics for both answer and evidence set precision.
- SlideAgent (Jin et al., 30 Oct 2025): Not a dataset but a system benchmarked on SlideVQA, TechSlides, and FinSlides, with reported overall gains of +7.9 to +9.8 via hierarchical agentic reasoning.
Empirical results across these datasets consistently show that multi-agent and iterative-retrieval pipelines outperform static, end-to-end LLM, and unimodal RAG baselines by 4–13 points, with significant gains in multi-hop, multimodal, and cross-document scenarios (Lassoued et al., 2 Mar 2026, Suri et al., 2024, Jin et al., 30 Oct 2025, Jain et al., 16 Jun 2025, Du et al., 29 Jan 2026, Borchmann et al., 12 Mar 2026).
5. Specialized Agents and Tool Integration
Agentic frameworks exploit modular tool ecosystems, leveraging component specialization for robust performance and interpretability:
- Specialist agents (ORCA): Layout parsers, OCR verifiers, table readers, figure analyzers, form extractors, text/graphic reasoners (Lassoued et al., 2 Mar 2026).
- Element-level agents (SlideAgent): Parse visual elements at the granularity of charts, icons, boxes; infer localized reasoning in compositional fashion (Jin et al., 30 Oct 2025).
- Tool orchestration (ARIAL): Sense–think–act agents coordinating OCR, semantic retrievers (FAISS, MiniLM), answer generators, and grounding modules, with transparent JSON-RPC logging and explicit spatial grounding for each predicted answer (Mohammadshirazi et al., 22 Nov 2025).
- Graph memory buffers: Enable the tracking and reuse of intermediate evidence and reasoning across retrieval rounds or sub-task invocations, supporting long-range dependencies and cross-document inference (Du et al., 29 Jan 2026, Lassoued et al., 2 Mar 2026).
This modularity provides auditability, a critical feature for high-stakes or regulated document analysis workflows. It also enables scalable extension to new modalities, layout domains, and languages by plugging in trained agents or retrievers specialized for particular content (legal, diagrammatic, tabular, etc.) (Jin et al., 30 Oct 2025, Mohammadshirazi et al., 22 Nov 2025).
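The combination of pluggable tools and an auditable trace can be sketched with a small registry that records every call. This is an illustrative sketch, not the ARIAL implementation: the `ToolRegistry` class and the stub tools are hypothetical, and the log entries merely follow the JSON-RPC 2.0 message shape (`jsonrpc`, `method`, `params`, `id`, `result`) without any actual transport.

```python
import json
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Registers named tools and logs each call as JSON-RPC 2.0-shaped records."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.trace: List[str] = []
        self._next_id = 1

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **params: Any) -> Any:
        request = {"jsonrpc": "2.0", "method": name, "params": params, "id": self._next_id}
        result = self._tools[name](**params)
        response = {"jsonrpc": "2.0", "result": result, "id": self._next_id}
        # Store request and response so the full tool chain can be audited later.
        self.trace.append(json.dumps(request))
        self.trace.append(json.dumps(response))
        self._next_id += 1
        return result


registry = ToolRegistry()
# Stub tools standing in for a real OCR engine and a spatial grounding module.
registry.register("ocr", lambda page: f"text of page {page}")
registry.register(
    "ground",
    lambda answer, page: {"page": page, "bbox": [0, 0, 10, 10], "answer": answer},
)

text = registry.call("ocr", page=7)
box = registry.call("ground", answer="42", page=7)
```

Adding a new specialist then amounts to one more `register` call, while the trace keeps a replayable record of which tool produced which intermediate result.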
6. Limitations, Challenges, and Performance Bottlenecks
Despite significant progress, agentic document collection VQA remains fundamentally bottlenecked by the precision and coverage of document evidence retrieval, not by the final language reasoning step. Oracle experiments show that, when provided with exactly the minimal gold evidence, even standard VLMs or LLMs can answer with >99% accuracy, highlighting a ∼17–18% performance gap attributable to imperfect retrieval, navigation, and multi-hop search (Borchmann et al., 12 Mar 2026, Tito et al., 2021, Suri et al., 2024).
Additional considerations:
- Effort calibration and wasted cycles: Non-strategic agents tend to persist in unproductive search loops; accuracy-effort calibration (Kuiper statistic) remains markedly lower for agents than for humans (Borchmann et al., 12 Mar 2026).
- Multi-hop and cross-modal complexity: Retrieval and reasoning chains become brittle as more distant or heterogeneous evidence is required, especially in table/chart reasoning and when visual layout is essential (Suri et al., 2024, Tanaka et al., 2023).
- Robustness to noisy OCR, ambiguous layouts, and incomplete metadata remains an open challenge (Tanaka et al., 2023, Jain et al., 16 Jun 2025).
7. Future Directions and Extensions
The agentic paradigm is being extended along several dimensions:
- Reinforcement learning and policy optimization: Adaptive router/orchestrator agents to optimize agent selection, ordering, and evidence search, potentially using reward shaping over accuracy-effort objectives (Lassoued et al., 2 Mar 2026, Borchmann et al., 12 Mar 2026).
- Graph-augmented and planning-based models: Evolving explicit memory and plan graphs provides robustness to context fragmentation and support for persistent, non-repetitive exploration (Du et al., 29 Jan 2026).
- Plug-and-play specialty agents: Seamless addition of new tools (e.g., legal clause extractors, handwriting recognition, cross-lingual modules) to accommodate new domains and languages (Lassoued et al., 2 Mar 2026, Jin et al., 30 Oct 2025).
- Fully open-source, interpretable frameworks: Modular tool traces, explicit evidence chains, and spatial grounding to achieve explainability and trust in high-stakes environments (Mohammadshirazi et al., 22 Nov 2025).
- Interactive and collaborative regimes: Hybrid human-in-the-loop and agentic pipelines, leveraging the complementary strengths of agent efficiency and human strategic navigation for challenging, non-routine queries (Borchmann et al., 12 Mar 2026).
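The reward-shaping idea in the first item above can be made concrete with a reward that trades answer correctness against search effort. This is a hypothetical formulation, not one taken from the cited papers; the linear step cost, the cap on counted steps, and the default values are all assumptions.

```python
def shaped_reward(correct: bool, steps: int, step_cost: float = 0.05, max_steps: int = 10) -> float:
    """Correctness reward minus a linear effort penalty, capped at max_steps."""
    effort_penalty = step_cost * min(steps, max_steps)
    return (1.0 if correct else 0.0) - effort_penalty


quick_hit = shaped_reward(correct=True, steps=3)     # correct after a short search
slow_hit = shaped_reward(correct=True, steps=9)      # correct after a long search
wasted = shaped_reward(correct=False, steps=20)      # wrong, and the effort cap applies
```

Under such an objective a policy is rewarded for stopping early once the evidence is sufficient, which targets the unproductive search loops noted among the limitations.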
A plausible implication is that continued gains depend on tighter integration of advanced retrieval, agent co-planning, robust multimodal parsing, and interpretable reasoning traces. Comprehensive benchmarks and open-source agentic toolkits will be essential for further progress in this domain.