Document MAP (DMAP): Structural Framework

Updated 27 January 2026

Document MAP (DMAP) is a formal framework that encodes multimodal document structure by mapping hierarchical, relational, and spatial elements.
It employs a Structured-Semantic Understanding Agent to extract and organize document elements and a Reflective Reasoning Agent for tri-path evidence retrieval.
Empirical evaluations show that DMAP outperforms baseline systems on multimodal document QA; noted challenges include preprocessing overhead and dependence on accurate section parsing.

A Document MAP (DMAP) is a formal structural representation for multimodal documents that encodes hierarchical, relational, and spatial organization of content such as text, figures, tables, and charts. The DMAP paradigm explicitly models document structure in a human-aligned fashion, supporting evidence-grounded retrieval and reasoning for tasks such as multimodal document question answering (MMDocQA). By transcending flat, chunk-based document representations, DMAPs enable comprehensive navigation of logical, spatial, and semantic dependencies that are vital for complex information extraction and interpretability (Fu et al., 26 Jan 2026).

1. Formal Definition and Core Representation

A DMAP is defined by the tuple

$\mathrm{DMAP} = (V, E, \phi, \psi, \mathcal{T})$

where:

$V$ is the set of atomic document elements, including page content (full-page screenshot and OCR text), figures, tables, and charts. If $D$ consists of pages $P_1, \dots, P_n$, each $P_i$ has elements $e_{i0}, e_{i1}, \dots, e_{i m_i}$, where $e_{i0}$ is the page content, so that $V = \{ e_{ij} \mid 1 \leq i \leq n,\ 0 \leq j \leq m_i \}$.
$E \subseteq V \times V$ encodes relations among elements: hierarchical (section→page, page→element), cross-reference (e.g., “Table 3”→table), and figure–text alignment, so that $E = E_{\mathrm{hier}}\cup E_{\mathrm{ref}}\cup E_{\mathrm{align}}$.
$\phi: V \to \mathbb{R}^d$ is a semantic embedding function assigning each element $e$ a text embedding $\mathbf{v}^T_e$ (e.g., ColBERTv2) and a visual embedding $\mathbf{v}^V_e$ (e.g., ColPali): $\phi(e) = (\mathbf{v}^T_e, \mathbf{v}^V_e)$.
$\psi: V \to \mathbb{R}^4$ is the layout mapping function, giving bounding-box coordinates $(x_{\min},y_{\min},x_{\max},y_{\max})$ for each element.
$\mathcal{T}$ is a hierarchical tree organizing sections, pages, and elements: sections → pages → elements.

This formalism provides comprehensive coverage of both logical (hierarchical, referential) and spatial relationships, mirroring human document navigation (Fu et al., 26 Jan 2026).
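The tuple above can be pictured as a small container type. The following is a minimal sketch, not the authors' implementation: the class names, field names, and the choice to keep section-level hierarchy in a nested dictionary are all assumptions made for illustration, and the embeddings are placeholder vectors.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


class EdgeType(Enum):
    HIER = "hier"    # page -> element (section -> page lives in the tree here)
    REF = "ref"      # textual mention -> referenced table or figure
    ALIGN = "align"  # figure <-> describing text


@dataclass
class Element:
    elem_id: str
    kind: str              # "page", "figure", "table", or "chart"
    page: int
    bbox: BBox             # psi(e): layout coordinates
    text_emb: List[float]  # v^T_e (placeholder vector)
    vis_emb: List[float]   # v^V_e (placeholder vector)


@dataclass
class DMAP:
    # V, together with phi and psi stored on each element
    elements: Dict[str, Element] = field(default_factory=dict)
    # E as typed (source, target, type) triples
    edges: Set[Tuple[str, str, EdgeType]] = field(default_factory=set)
    # T as section -> page number -> element ids (an illustrative encoding)
    tree: Dict[str, Dict[int, List[str]]] = field(default_factory=dict)

    def add_element(self, section: str, elem: Element) -> None:
        self.elements[elem.elem_id] = elem
        self.tree.setdefault(section, {}).setdefault(elem.page, []).append(elem.elem_id)

    def add_edge(self, src: str, dst: str, etype: EdgeType) -> None:
        if src not in self.elements or dst not in self.elements:
            raise KeyError("both endpoints must be elements of V")
        self.edges.add((src, dst, etype))

    def neighbors(self, elem_id: str, etype: Optional[EdgeType] = None) -> List[str]:
        # Distinct targets reachable from elem_id, optionally filtered by edge type
        return sorted({d for s, d, t in self.edges
                       if s == elem_id and (etype is None or t == etype)})


# Toy usage with made-up identifiers and coordinates
dmap = DMAP()
dmap.add_element("1 Introduction", Element("p1", "page", 1, (0, 0, 612, 792), [0.1], [0.2]))
dmap.add_element("1 Introduction", Element("p1_t1", "table", 1, (50, 100, 500, 300), [0.3], [0.4]))
dmap.add_edge("p1", "p1_t1", EdgeType.HIER)
dmap.add_edge("p1", "p1_t1", EdgeType.REF)
print(dmap.neighbors("p1", EdgeType.REF))
```

Keeping edges typed makes it possible to follow only one relation family (for example, cross-references) when expanding evidence later.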

2. Structured-Semantic Understanding Agent

The Structured-Semantic Understanding Agent (SSUA) is responsible for constructing the DMAP from a raw multimodal document through three principal stages:

Element Extraction: The document is split into pages. For each page $P_i$, the page-content element $e_{i0}$ and the elements $e_{i1}, \ldots, e_{i m_i}$ (figures, tables, charts) are extracted via OCR and detection tools (e.g., pdffigures2). Text and visual embeddings are then computed for each element.
Hierarchical Population: SSUA incrementally builds the section→page→element hierarchy. New section headings are detected using font/layout/keyword heuristics, and section numbering is maintained. Elements are linked to their enclosing page and section.
Summary Tree Generation: Summaries for each section and page are generated and stored for fast, structured retrieval. All section parsing is rule-based, using layout and OCR cues, without additional supervised losses.

This approach encodes not only content and layout but also higher-level semantic organization, producing lightweight, navigable summaries for use in downstream retrieval and reasoning (Fu et al., 26 Jan 2026).
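The hierarchical population stage can be illustrated with a toy rule-based pass. This is a sketch under stated assumptions rather than the SSUA's actual rules: the regular expression, the font-size threshold, the length cutoff, and the input format (each page as a list of `(text, font_size)` lines) are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical heuristic: a heading is a short line that starts with section
# numbering (e.g. "2" or "3.1") and is set in a larger font than body text.
HEADING_RE = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def is_heading(text: str, font_size: float, body_size: float = 10.0) -> bool:
    return bool(HEADING_RE.match(text)) and font_size > body_size and len(text) < 80


def build_hierarchy(pages: List[List[Tuple[str, float]]]) -> Dict[str, List[int]]:
    """Map each detected section heading to the 1-based pages it covers."""
    hierarchy: Dict[str, List[int]] = {}
    current = "front matter"  # bucket for lines seen before the first heading
    for page_no, lines in enumerate(pages, start=1):
        for text, font_size in lines:
            if is_heading(text, font_size):
                current = text
            # Attach the page to whichever section is open at this line
            covered = hierarchy.setdefault(current, [])
            if not covered or covered[-1] != page_no:
                covered.append(page_no)
    return hierarchy


pages = [
    [("1 Introduction", 14.0), ("Body text here.", 10.0)],
    [("More body text.", 10.0), ("2 Method", 14.0), ("Details.", 10.0)],
    [("2019 was a good year for OCR.", 10.0)],  # numeric start, but body-sized font
]
hierarchy = build_hierarchy(pages)
print(hierarchy)
```

A page that contains a section boundary is attached to both sections, which matches the section→page links described above; the third page shows why a numbering pattern alone is not enough and a layout cue is combined with it.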

3. Reflective Reasoning Agent and Retrieval Methodology

The Reflective Reasoning Agent (RRA) answers queries using structure-aware, evidence-driven retrieval and generation over DMAP. Its operation proceeds as follows:

3.1 Tri-Path Retrieval

RRA fuses three distinct retrieval signals:

Structured Semantic Retrieval: Using the summary tree, the top-k sections and pages most semantically relevant to the query are selected as candidates.
Textual Feature Retrieval: The query's text embedding is used to retrieve elements with high textual similarity.
Visual Feature Retrieval: The query's embedding in the visual space is used to retrieve elements with high visual similarity.

The union $R = R^{(s)} \cup R^{(t)} \cup R^{(v)}$ yields the full retrieval set, integrating logical, visual, and textual context.
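The union of the three paths can be sketched in a few lines. This is an illustration only: plain cosine similarity over toy vectors stands in for the ColBERTv2 and ColPali scoring mentioned earlier, summary embeddings stand in for whatever mechanism selects sections from the summary tree, and all function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> Set[str]:
    ranked = sorted(index, key=lambda key: cosine(query, index[key]), reverse=True)
    return set(ranked[:k])


def tri_path_retrieve(q_text: Sequence[float], q_vis: Sequence[float],
                      summary_index: Dict[str, Sequence[float]],
                      section_members: Dict[str, List[str]],
                      text_index: Dict[str, Sequence[float]],
                      visual_index: Dict[str, Sequence[float]],
                      k: int = 1) -> Set[str]:
    # R^(s): elements under the top-k sections chosen via their summaries
    r_s: Set[str] = set()
    for section in top_k(q_text, summary_index, k):
        r_s |= set(section_members[section])
    r_t = top_k(q_text, text_index, k)   # R^(t): textual similarity
    r_v = top_k(q_vis, visual_index, k)  # R^(v): visual similarity
    return r_s | r_t | r_v               # R = R^(s) ∪ R^(t) ∪ R^(v)


# Toy indices with made-up vectors
summary_index = {"sec1": [1.0, 0.0], "sec2": [0.0, 1.0], "sec3": [0.7, 0.7]}
section_members = {"sec1": ["p1"], "sec2": ["p2", "p2_t1"], "sec3": ["p3"]}
text_index = {"p1": [0.9, 0.1], "p2": [0.2, 0.8], "p2_t1": [0.1, 0.9], "p3": [0.6, 0.4]}
visual_index = {"p1": [1.0, 0.0], "p2": [0.7, 0.3], "p2_t1": [0.5, 0.5], "p3": [0.3, 0.7]}

retrieved = tri_path_retrieve([0.0, 1.0], [1.0, 0.0], summary_index,
                              section_members, text_index, visual_index, k=1)
print(sorted(retrieved))
```

In this toy run the structured path contributes the whole of `sec2`, the textual path agrees on `p2_t1`, and the visual path adds `p1`, which neither of the other paths would have surfaced.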

3.2 Iterative, Reflective Generation

Leveraging features from retrieved elements, a multimodal generator produces a candidate answer. An LLM-based evaluator assesses answer sufficiency (completeness and consistency). If evidence is insufficient ($\mathrm{done} = \mathtt{no}$), the agent broadens retrieval via DMAP’s structure (e.g., parent/neighbor nodes) and regenerates the answer. Iteration continues until sufficiency is affirmed or a step budget is exhausted.

This strategy prevents premature answer speculation, reduces hallucination, and supports negative acknowledgments when supporting evidence is absent (Fu et al., 26 Jan 2026).
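The control flow of this loop can be sketched as follows. It is a schematic under assumptions, not the RRA itself: `generate` and `is_sufficient` are hypothetical callables standing in for the multimodal generator and the LLM-based evaluator, `neighbors` is a simplified adjacency map over DMAP nodes, and the abstention message is invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

ABSTAIN = "Not answerable from the retrieved evidence."


def reflective_answer(query: str,
                      initial: Iterable[str],
                      neighbors: Dict[str, List[str]],
                      generate: Callable[[str, Set[str]], str],
                      is_sufficient: Callable[[str, str, Set[str]], bool],
                      max_steps: int = 3) -> Tuple[str, Set[str], bool]:
    evidence = set(initial)
    for _ in range(max_steps):
        answer = generate(query, evidence)
        if is_sufficient(query, answer, evidence):  # done = yes
            return answer, evidence, True
        # done = no: broaden retrieval along the structure (parent/neighbor nodes)
        expanded = evidence | {n for e in evidence for n in neighbors.get(e, [])}
        if expanded == evidence:
            break  # nothing new to add, stop early
        evidence = expanded
    # Step budget exhausted or no further evidence: acknowledge instead of guessing
    return ABSTAIN, evidence, False


# Toy stand-ins for the generator and the evaluator
neighbors = {"p2": ["p2_t1"], "p2_t1": ["p2"]}
generate = lambda q, ev: "42" if "p2_t1" in ev else "unknown"
is_sufficient = lambda q, a, ev: a != "unknown"

result = reflective_answer("What is the value in Table 1?", {"p2"},
                           neighbors, generate, is_sufficient)
print(result[0], sorted(result[1]), result[2])
```

Returning the evidence set alongside the answer keeps the output traceable to specific DMAP nodes, and the explicit abstention branch mirrors the negative acknowledgment described above.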

4. Empirical Evaluation and Ablation

The DMAP framework was evaluated on five MMDocQA benchmarks: MMLongBench, LongDocURL, PaperTab, PaperText, and FetaTab. Performance was compared across large vision-language models (LVLMs), RAG-based systems, and DMAP-augmented variants. Results showed that DMAP with tri-path retrieval achieves the highest average accuracy (54.4%), outperforming both LVLMs (e.g., Qwen-2.5-VL-7B-Instruct at 27.4%) and standard RAG systems (e.g., MDocAgent at 48.4%).

Ablation experiments quantified each retrieval path's contribution:

Variant	Text path	Image path	DMAP path	Avg. accuracy
w/o DMAP	✓	✓	✗	44.7%
w/o Image	✓	✗	✓	52.6%
w/o Text	✗	✓	✓	39.4%
Full (all paths)	✓	✓	✓	54.4%

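Removing the DMAP structured path lowers average accuracy by 9.7 percentage points (54.4% to 44.7%), highlighting structural retrieval’s unique value. Removing textual retrieval causes the largest drop (15.0 points, to 39.4%), so text is the most influential single path, yet the full tri-path fusion still outperforms every ablated variant (Fu et al., 26 Jan 2026).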
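The per-path contributions follow directly from the table. A small check of the arithmetic, using the reported accuracies and variable names chosen here for illustration:

```python
# Average accuracies (%) copied from the ablation table
full = 54.4
ablations = {"w/o DMAP": 44.7, "w/o Image": 52.6, "w/o Text": 39.4}

# Percentage-point drop relative to the full tri-path system
drops = {name: round(full - acc, 1) for name, acc in ablations.items()}
largest = max(drops, key=drops.get)

print(drops)    # {'w/o DMAP': 9.7, 'w/o Image': 1.8, 'w/o Text': 15.0}
print(largest)  # w/o Text
```

These are absolute differences in percentage points, not relative improvements.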

Analysis by question type (MMLongBench) revealed DMAP’s particular strength for “Layout,” “Table,” “Chart,” and “Figure” evidence, with relative improvements over MDocAgent ranging from +16.8% (Tables) to +89.4% (Charts).

5. Advantages and Limitations

Advantages

Human-Aligned Representation: DMAP’s hierarchical and relational schema (sections, pages, elements, and cross-references) preserves the document organization strategies humans naturally use.
Tri-Path Retrieval: Combining structured, textual, and visual paths yields higher retrieval precision and more consistent evidence aggregation.
Reflective Generation: Iterative answer refinement helps avoid unsupported answers and supports explicit abstention when evidence is lacking.

Limitations

Preprocessing Overhead: Construction and indexing (summaries, embeddings, layout graphs) require substantial offline effort.
Dependency on Heading Detection: Reliable section parsing is a prerequisite; noisy or nonstandard PDF layouts degrade performance.
Rule-Based Heuristics: Current SSUA uses rule-based section segmentation, which may not generalize to irregular document formats.

6. Future Directions

Several research avenues have been identified:

Learned Structural Parsing: Replacing heuristic-based section segmentation and relation extraction with end-to-end trainable approaches.
Incremental and Dynamic Mapping: Supporting DMAP updates for streaming, evolving, or version-controlled documents.
Cross-Document DMAPs: Extending structures to collections (e.g., legal archives, research corpora) with cross-document links.
Joint Training: End-to-end optimization of SSUA and RRA, aligning structure construction directly with reasoning and QA objectives (Fu et al., 26 Jan 2026).

A plausible implication is that further automation and learning in DMAP construction and utilization may extend its applicability to a broader class of documents and more complex multilingual or cross-domain QA settings.

7. Relation to Broader Multimodal Document QA Paradigms

DMAP addresses limitations inherent in flat, chunk-based or naïve RAG (retrieval-augmented generation) methods that ignore document structure. Other frameworks such as ARIAL have focused on modular tool orchestration and explicit spatial localization for document VQA, but do not construct document-level structural maps akin to DMAP (Mohammadshirazi et al., 22 Nov 2025). The integration of structure-aware reasoning and evidence-driven answer generation in DMAP positions it as a pathway toward more trustworthy and explainable document AI systems.


References

Fu et al. (2026). DMAP: Human-Aligned Structural Document Map for Multimodal Document Understanding.

Mohammadshirazi et al. (2025). ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization.
