Doc-Researcher: Multimodal Document Analysis
- Doc-Researcher is a unified system for deep multimodal document research that integrates advanced multimodal parsing and structured layout representation.
- It employs a dynamic retrieval architecture that fuses text-only, vision-only, and hybrid paradigms to maximize evidence recall and efficiency.
- The iterative multi-agent workflow refines and synthesizes evidence across documents, significantly outperforming previous state-of-the-art systems.
Doc-Researcher is a unified system designed for deep research over multimodal document collections, combining advanced multimodal parsing, systematic text/vision/hybrid retrieval with dynamic granularity selection, and iterative multi-agent workflows for evidence accumulation and comprehensive answer synthesis (Dong et al., 24 Oct 2025). Unlike prior deep research architectures, which are fundamentally constrained to web-scale textual data, Doc-Researcher directly addresses the complexities of documents containing rich visual semantics such as figures, tables, charts, and equations. This setting necessitates sophisticated parsing that preserves layout, context-aware chunking at multiple levels, and retrieval mechanisms that span textual and visual modalities across complex document structures.
1. Multimodal Parsing and Representation
Doc-Researcher employs a deep parsing pipeline to transform raw documents (PDFs or scans containing a blend of text blocks, figures, tables, and mathematical expressions) into structured, multi-granular, and searchable representations. The system leverages the MinerU parser to identify and annotate each structural element, preserving its type (e.g., text, table, figure, equation) and normalized bounding box, thus retaining spatial and logical relationships.
Key steps in this parsing pipeline:
- Text blocks are merged into document-aware chunks with boundaries determined by section headers, pagination, or length constraints; this preserves narrative flow and avoids information fragmentation.
- Figures and tables are captioned and summarized using a vision-LLM (e.g., Qwen2.5-VL), providing both high-level and fine-grained descriptions.
- Equations are rendered as LaTeX using models such as UniMERNet, ensuring faithful mathematical interpretability.
- Hierarchy construction yields representations at multiple levels: fine-grained chunk, page, entire document text, and document summary.
The paper presents schematic pseudocode for this pipeline:
```
for each document d_i ∈ 𝒟:
    Extract elements E_i = {e₍i,j,k₎} with bounding boxes and types
    Group contiguous text blocks by section/page
    If token length > threshold, split into chunks
    For visual elements, generate VLM-based descriptions
    Construct representations: chunk, page, full text, summary
```
This layout- and modality-aware approach allows subsequent retrieval and reasoning to access evidence at the appropriate spatial and semantic granularity.
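Concretely, a minimal Python sketch of the document-aware chunking step is shown below. It assumes hypothetical element records with `type`, `text`, `section`, and `page` fields produced by the parser, and approximates token counts by whitespace splitting, so it illustrates the grouping logic rather than the paper's exact implementation.

```python
from dataclasses import dataclass

MAX_TOKENS = 512  # illustrative length threshold, not the paper's exact value


@dataclass
class Element:
    type: str     # "text", "table", "figure", or "equation"
    text: str     # OCR text, or a VLM-generated description for visual elements
    section: str  # section header the element falls under
    page: int


def chunk_elements(elements):
    """Merge contiguous text blocks into document-aware chunks whose
    boundaries follow section/page changes or a token-length threshold."""
    chunks, current, current_key = [], [], None
    for el in elements:
        if el.type != "text":
            continue  # figures/tables/equations go through the VLM captioning path
        key = (el.section, el.page)
        token_count = sum(len(e.text.split()) for e in current) + len(el.text.split())
        if current and (key != current_key or token_count > MAX_TOKENS):
            chunks.append(" ".join(e.text for e in current))
            current = []
        current.append(el)
        current_key = key
    if current:
        chunks.append(" ".join(e.text for e in current))
    return chunks
```

Grouping by (section, page) keys mirrors the boundary rules described above, while the token threshold keeps chunks within an embedding model's context window.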
2. Systematic Retrieval Architecture
Doc-Researcher supports three integrated retrieval paradigms:
- Text-only retrieval: OCR-extracted text and VLM-derived descriptions are encoded using dense embedding models (notably BGE-M3 or Qwen-3 Embedding). This approach excels on semantic textual queries and is efficient for large-scale search.
- Vision-only retrieval: Raw visual slices (either whole pages or cropped elements) are directly embedded and retrieved via visual encoders. This strategy is essential for information unavailable via OCR (e.g., layout-specific cues or images).
- Hybrid retrieval: Text and vision retrieval results are fused, leveraging the complementary informativeness of both modalities (a minimal fusion sketch follows this list).
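The paper does not prescribe a specific fusion rule, so the following is a minimal sketch assuming a simple reciprocal rank fusion (RRF) of the ranked candidate lists returned by the text and vision retrievers; `text_ranking` and `vision_ranking` are hypothetical identifiers, not part of the system's API.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of candidate IDs (e.g., from the text-only
    and vision-only retrievers) into a single ranking via RRF scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical IDs returned by the text and vision retrievers for one sub-query.
text_ranking = ["doc2_p3_chunk1", "doc1_p1_chunk4", "doc2_p7_fig1"]
vision_ranking = ["doc2_p7_fig1", "doc2_p3_chunk1", "doc5_p9_table2"]
fused = reciprocal_rank_fusion([text_ranking, vision_ranking])
```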
A critical feature of the architecture is its support for dynamic retrieval granularity: a dedicated Planner agent analyzes each user query, in the context of the conversation history, to select an optimal granularity and to identify relevant sub-queries. For broad, contextual questions, summary- or document-level retrieval is favored; for fine-grained evidence, chunk- or page-level retrieval is invoked. The Planner also refines the document search space via semantic matching.
This architecture enables performance tuning with respect to recall, coverage, and computational efficiency, adapting retrieval to both the question’s and the corpus’s complexity.
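How the Planner exposes its decision is not specified beyond the paper's description, so the sketch below is an assumed interface: an LLM callable (`llm`, hypothetical) is prompted to return a granularity level and a sub-query decomposition as JSON.

```python
import json


def plan(query: str, history: list[str], llm) -> dict:
    """Ask an LLM (any callable mapping a prompt string to a completion
    string; hypothetical here) for a retrieval granularity and sub-queries."""
    prompt = (
        "Given the conversation history and the current question, choose one "
        "retrieval granularity from [chunk, page, document, summary] and list "
        "the sub-queries needed to answer it. Reply as JSON with keys "
        "'granularity' and 'sub_queries'.\n"
        f"History: {history}\nQuestion: {query}"
    )
    return json.loads(llm(prompt))
```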
3. Iterative Multi-Agent Deep Research Workflow
The research workflow is implemented as an iterative multi-agent pipeline:
- Planner Agent: Given the user query and prior conversation turns, filters the corpus to a focused subset and chooses the appropriate retrieval granularity and sub-query decomposition.
- Searcher and Refiner Agents: For each sub-query, an iterative loop operates: relevant passages are retrieved, then refined and deduplicated, and a sufficiency criterion is evaluated over the accumulated evidence. This process repeats until the sufficiency criterion is met or a maximum iteration limit is reached.
- Reporter Agent: Synthesizes the iteratively accumulated, cross-modal evidence into a comprehensive, verifiable answer, including references to specific document pages, bounding boxes, and modalities as required.
This agent architecture is not only modular (facilitating future improvement and ensemble research) but also critical for decomposing complex multi-hop, multi-document, multi-modal queries into tractable subproblems.
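Under the assumption that each agent can be treated as a callable or object with the interfaces suggested above (hypothetical names, not the paper's API), the overall plan, search, refine, and report loop can be sketched as follows.

```python
def deep_research(query, history, planner, searcher, refiner, reporter, max_iters=5):
    """Sketch of the plan -> search -> refine -> report loop. The planner,
    searcher, refiner, and reporter are assumed callables/objects, not the
    paper's exact interfaces."""
    plan = planner(query, history)               # granularity + sub-query decomposition
    evidence = []
    for sub_query in plan["sub_queries"]:
        for _ in range(max_iters):
            hits = searcher(sub_query, plan["granularity"])  # text/vision/hybrid retrieval
            evidence = refiner.merge(evidence, hits)         # deduplicate, keep relevant spans
            if refiner.sufficient(sub_query, evidence):      # sufficiency check
                break
    return reporter(query, evidence)             # cited answer (pages, bounding boxes, modalities)
```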
4. M4DocBench: Multimodal Deep Research Benchmark
The system is evaluated using the M4DocBench benchmark, which is specifically constructed to test the limits of multi-modal, multi-hop, multi-document, and multi-turn reasoning capabilities. M4DocBench features:
- 158 expert-curated questions spanning 304 documents, with each query paired with complete annotated evidence chains.
- Multi-hop reasoning: Tasks often require at least two evidence sources and explicit reasoning links.
- Multi-modal integration: At least one visual or structural element is necessary for most queries.
- Fine-grained layout annotation: Evidence is labeled by page, bounding box, and modality, supporting granular retrieval evaluation (an illustrative annotation record is sketched after this list).
- Dialog/multi-turn setting: Some questions explicitly require leveraging prior turns or conversation history.
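To make the annotation scheme concrete, a single evidence record might look like the following illustrative Python dictionary; the field names are assumptions rather than M4DocBench's published schema.

```python
# Illustrative evidence annotation; field names are assumed, not the benchmark's exact schema.
evidence_record = {
    "question_id": "q_042",
    "hop": 2,                           # position in the multi-hop evidence chain
    "document": "example_report.pdf",
    "page": 17,
    "bbox": [0.12, 0.40, 0.88, 0.63],   # normalized [x0, y0, x1, y1]
    "modality": "table",                # text | table | figure | equation
}
```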
This benchmark addresses gaps in prior datasets, which focus on single-document, single-turn QA or only textual data.
5. Experimental Results and Comparative Performance
The paper reports rigorous experimental evaluation:
- Doc-Researcher achieves 50.6% accuracy on M4DocBench, a 3.4× improvement over previous state-of-the-art systems such as MDocAgent and M3DocRAG on deep research tasks.
- Hybrid retrieval outperforms unimodal variants: Page-level retrieval recall shows 8–12% improvements using the hybrid approach.
- Ablation studies reveal the importance of the Planner (adaptive granularity selection) and iterative search-refine workflows: removing either leads to a 6–8% accuracy decline.
- Retrieval granularity: While document-level recall is high, page- and layout-level evidence are harder to recover, reinforcing the need for precise layout-aware parsing and multi-granular indexing.
Performance is further validated through precision/recall curves and sufficiency analysis, supporting the benefits of adaptive, iterative evidence gathering.
6. Significance, Impact, and Future Directions
Doc-Researcher represents a decisive advance in automated document research by tightly integrating multimodal parsing, granular adaptive retrieval, and multi-agent evidence synthesis. Potential research directions include:
- Refinement of multimodal embeddings to further improve retrieval efficiency and cross-modal alignment.
- Enhanced agent collaboration via self-learning or reinforcement learning, especially in dynamic or user-interactive settings.
- Wider application to academic, legal, financial, or scientific research domains where deep, multimodal evidence is crucial—ranging from large-scale literature surveys to regulatory compliance analysis.
- Expansion of M4DocBench to encompass more diverse modalities, document types, and real-world document noise characteristics.
The findings indicate that robust, trustworthy document research demands multimodal integrity and iterative agent-based reasoning—characteristics operationalized in Doc-Researcher and now measurable via comprehensive multi-hop, multi-modal evaluation protocols (Dong et al., 24 Oct 2025).