Visually Rich Document Understanding
- Visually Rich Document Understanding (VRDU) is a field that combines text, layout, and visual features to extract structured information from complex documents.
- It employs multimodal fusion techniques—including unified Transformers, graph-based, and OCR-free models—to capture spatial and semantic relationships for tasks like entity linking and QA.
- Advances in VRDU are validated on benchmarks such as FUNSD, CORD, and DocVQA, with innovations in layout modeling and reading order correction improving extraction performance.
Visually Rich Document Understanding (VRDU) refers to the suite of computational techniques for extracting, structuring, and reasoning about information from documents whose semantics arise from the joint interplay of text, visual appearance, and two-dimensional layout. Unlike plain-text document understanding, VRDU targets heterogeneous sources such as scanned forms, receipts, invoices, academic articles, magazines, and web pages—domains where meaning is often distributed across tables, forms, multi-column flows, and embedded graphics. VRDU systems must therefore integrate multimodal data (text, layout, image) and handle disparate reading orders, structural hierarchies, and spatial/semantic relations. The task is foundational for automating information extraction, entity linking, question answering, and workflow automation in domains such as finance, healthcare, law, and publishing.
1. Conceptual Foundations and Problem Scope
The VRDU problem formalizes visually rich documents as multi-modal objects:
- Text: sequences of textual tokens (from OCR or PDF parsing), with semantic embeddings.
- Layout: the tokens' corresponding bounding boxes, normalized to a reference frame.
- Image: visual features, either pooled over image regions (CNN/Transformer features) or taken as pixel patches.
The objective is to learn functions mapping the (text, layout, image) triplet to structured representations supporting core tasks:
- Key Information Extraction (KIE): Mapping schema keys to text spans or values, accounting for visual context and layout.
- Entity Recognition & Linking (ER/EL): Entity labeling and relation extraction, including parent–child or spatial/semantic edges.
- Document Classification & QA: Assigning categorical labels or producing natural-language answers based on multi-modal evidence.
A prerequisite is robust handling of layout-induced ambiguity, fragmented reading order, and the alignment (or fusion) of modalities at various granularity levels (Ding et al., 2 Aug 2024, Wang et al., 2022).
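As a concrete sketch of this formalization, the triplet can be represented as a simple data structure. The class and field names below are illustrative assumptions, not from any specific library; the 0..1000 box normalization follows the common LayoutLM-style convention.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VRDToken:
    text: str                           # OCR'd token string (text modality)
    bbox: Tuple[int, int, int, int]     # (x0, y0, x1, y1), normalized (layout modality)

@dataclass
class VRDocument:
    tokens: List[VRDToken]              # textual tokens with layout
    image_patches: List[List[float]]    # visual features (e.g., pooled CNN/ViT patches)

def normalize_bbox(bbox, page_w, page_h):
    """Normalize absolute pixel coordinates to a 0..1000 reference frame,
    the convention used by LayoutLM-style models."""
    x0, y0, x1, y1 = bbox
    return (int(1000 * x0 / page_w), int(1000 * y0 / page_h),
            int(1000 * x1 / page_w), int(1000 * y1 / page_h))

doc = VRDocument(
    tokens=[VRDToken("Invoice", normalize_bbox((50, 40, 210, 70), 850, 1100))],
    image_patches=[],
)
```

Downstream tasks (KIE, ER/EL, QA) are then defined as mappings from such objects to structured outputs.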
2. Modeling Strategies: Architectures, Fusion, and Pretraining
VRDU models are commonly classified by fusion schema and modeling granularity:
- Single-Stream (Unified) Models: LayoutLM/v2/v3, LayoutXLM, and Bi-VLDoc embed text, layout, and image features into a single sequence processed by a multi-modal Transformer with spatial-aware attention. Self-attention is enriched with absolute or relative 2D layout biases and image grid features (Xu et al., 2020, Luo et al., 2022, Li et al., 2023).
- Multi-Stream (Dual/Tri-Stream) Models: LiLT or similar, split input into text and layout streams, merge them via cross-modal attention; sometimes an explicit third visual stream is included (Ding et al., 2 Aug 2024).
- Graph-Based Models: GraphLayoutLM, FormNet, and FS-DAG construct explicit layout graphs (nodes: OCR lines/entities; edges: parent–child, sibling, spatial) and propagate information via masked-attention or GNNs to capture long-range and hierarchical dependencies (Li et al., 2023, Agarwal et al., 22 May 2025).
- OCR-Free and Pixel-Based Models: DUBLIN and OCR-free MLLMs (e.g., Donut, Pix2Struct) process high-resolution images directly, removing the need for external OCR and learning representations via pixel-level masking, bounding box regression, and rendered QA pretraining (Aggarwal et al., 2023).
- Markup-Centric Models: MarkupLM represents digital documents as hierarchical markup trees (e.g., HTML), introducing XPath embeddings in lieu of spatial layout (Li et al., 2021).
Feature fusion strategies include early fusion (direct sum/concat), cross-attention (projected keys/values across modalities), and prompt-based linearization for LLM-based approaches (Ding et al., 14 Jul 2025, Fujitake, 21 Mar 2024).
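Early fusion, the simplest of these strategies, can be sketched as an element-wise sum of text and 2D layout embeddings feeding a single stream. The toy table sizes and random initialization below are illustrative assumptions; real models learn far larger tables jointly with the Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, COORD, DIM = 100, 1001, 32   # toy sizes; real vocabularies/dims are far larger

# Illustrative embedding tables (randomly initialized here; learned in practice).
tok_emb = rng.normal(size=(VOCAB, DIM))
x_emb   = rng.normal(size=(COORD, DIM))   # shared table for x0/x1 coordinates
y_emb   = rng.normal(size=(COORD, DIM))   # shared table for y0/y1 coordinates

def early_fuse(token_ids, bboxes):
    """LayoutLM-style early fusion: sum text and 2D layout embeddings
    element-wise so a single Transformer stream sees both modalities."""
    out = []
    for tid, (x0, y0, x1, y1) in zip(token_ids, bboxes):
        out.append(tok_emb[tid] + x_emb[x0] + x_emb[x1] + y_emb[y0] + y_emb[y1])
    return np.stack(out)

fused = early_fuse([5, 7], [(58, 36, 247, 63), (300, 36, 420, 63)])
print(fused.shape)  # (2, 32)
```

Cross-attention fusion instead keeps modality streams separate and exchanges projected keys/values between them.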
Pretraining objectives are typically:
- Masked Visual-Language Modeling (MVLM): Predict masked spans using all modalities.
- Masked Image Modeling (MIM): Reconstruct masked image patches.
- Text-Image Alignment (TIA), Matching (TIM), or Word-Patch Alignment (WPA): Binary or contrastive objectives to enforce cross-modal coherence.
- Task-specific heads for coordinate regression, QA generation, and node/edge classification (Xu et al., 2020, Peng et al., 2022, Ding et al., 28 Feb 2024).
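The MVLM objective can be sketched as input preparation in which text tokens are masked while their bounding boxes stay visible, forcing the model to lean on layout (and vision) to recover the span. The helper below is a hypothetical illustration, not any model's actual preprocessing code.

```python
import random

MASK = "[MASK]"

def mvlm_mask(tokens, bboxes, mask_prob=0.15, seed=0):
    """Prepare MVLM inputs: mask a fraction of text tokens but keep their
    bounding boxes visible; masked positions become prediction targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok, box in zip(tokens, bboxes):
        if rng.random() < mask_prob:
            masked.append((MASK, box))   # box stays visible to the model
            labels.append(tok)           # prediction target
        else:
            masked.append((tok, box))
            labels.append(None)          # no loss on unmasked positions
    return masked, labels

masked, labels = mvlm_mask(["Total", "Amount", "Due", "$42"],
                           [(0, 0, 1, 1)] * 4, mask_prob=0.5)
```

MIM and the alignment objectives (TIA/TIM/WPA) are structured analogously on the image side or as cross-modal binary/contrastive losses.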
3. Advances in Layout and Structural Reasoning
Proper modeling of reading order, semantic groups, and document structure remains critical:
- Reading Order Modeling: Methods correct noisy OCR order via layout-driven serialization (XYLayoutLM's Augmented XY-Cut, ERNIE-Layout's Document-Parser, ReLayout's 1-LOP), repairing fragmented scan sequences, grouping contiguous tokens, and aligning the input with human reading patterns (Gu et al., 2022, Peng et al., 2022, Jiang et al., 14 Oct 2024).
- Layout Graphs and Relational Attention: GraphLayoutLM constructs explicit layout graphs capturing parent–child and sibling directions, injecting these into self-attention via masks, and leveraging graph reordering to organize the input sequence (Li et al., 2023).
- Semantic Grouping without Manual Annotations: ReLayout addresses real-world scenarios where gold semantic groups are unavailable, combining local order prediction (1-LOP) and self-supervised segment clustering (2-TSC) to recover groups from noisy OCR lines (Jiang et al., 14 Oct 2024).
- Hierarchical and Multi-granular Approaches: 3MVRD and DAViD propose joint-grained models aligning token- and entity-level encodings, multi-teacher knowledge distillation, and domain adaptation with synthetic labels to cover both fine and coarse structure (Ding et al., 28 Feb 2024, Ding et al., 2 Oct 2024).
Ablation studies show that reading-order correction, graph/segment-aware attention, and layout-dependent serialization consistently yield 1–4 point F1 gains over vanilla input pipelines (Li et al., 2023, Peng et al., 2022, Gu et al., 2022, Jiang et al., 14 Oct 2024).
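The classic recursive XY-cut underlying several of these serialization methods can be sketched as follows. This is a minimal illustration of the idea (alternating whitespace cuts on the two axes), not XYLayoutLM's actual Augmented XY-Cut.

```python
def xy_cut(boxes, ids=None):
    """Recursive XY-cut: split a set of (x0, y0, x1, y1) boxes at whitespace
    gaps, trying horizontal cuts before vertical ones, to recover a
    human-like top-to-bottom, left-to-right reading order (as indices)."""
    if ids is None:
        ids = list(range(len(boxes)))
    if len(ids) <= 1:
        return ids

    def split(axis):
        # Project boxes onto the axis and break groups at whitespace gaps.
        lo, hi = (1, 3) if axis == "y" else (0, 2)
        order = sorted(ids, key=lambda i: boxes[i][lo])
        groups, current, max_end = [], [order[0]], boxes[order[0]][hi]
        for i in order[1:]:
            if boxes[i][lo] >= max_end:        # gap found: start a new group
                groups.append(current)
                current = [i]
            else:
                current.append(i)
            max_end = max(max_end, boxes[i][hi])
        groups.append(current)
        return groups

    for axis in ("y", "x"):                    # prefer horizontal cuts first
        groups = split(axis)
        if len(groups) > 1:
            return [j for g in groups for j in xy_cut(boxes, g)]
    # No clean cut on either axis: fall back to top-left ordering.
    return sorted(ids, key=lambda i: (boxes[i][1], boxes[i][0]))
```

The recovered index order can then drive token serialization before tokenization, which is exactly the step the ablations above credit with consistent F1 gains.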
4. Benchmarks, Datasets, and Evaluation Protocols
Progress in VRDU is measured against a set of hierarchical, multi-format benchmarks:
- FUNSD: Form understanding, token-level and entity-level F1; layout-rich with key-value pairs (Adnan et al., 16 Apr 2024, Wang et al., 2022).
- CORD: Receipt extraction with complex, hierarchical entity schema; KIE and RE (Adnan et al., 16 Apr 2024, Li et al., 2023).
- Form-NLU: Digital, printed, handwritten forms; entity-based and end-to-end localization (mean Average Precision), with strong cross-modal and cross-domain challenges (Ding et al., 2 Jun 2025, Dai, 10 Feb 2025).
- Ad-Buy, Registration (VRDU benchmark): Industrial forms focusing on hierarchical and repeated fields, few/zero-shot settings, and template generalization (Wang et al., 2022).
- MosaicDoc: Bilingual (Chinese/English) news/magazine corpus with 72K images and >600K QA pairs, annotated for OCR, reading order, QA (ANLSL), and spatial grounding. Layout complexity, multi-span QA, and complex reading order are emphasized (Chen et al., 13 Nov 2025).
- DocVQA, AI2D, InfographicVQA, WebSRC: Document QA benchmarks evaluated with diverse metrics, including Exact Match (EM), Average Normalized Levenshtein Similarity (ANLS), and F1.
Metrics are standardized for each task—entity/field F1, micro/macro F1, mAP, ANLS, and localization IoU—facilitating rigorous model comparisons (Wang et al., 2022, Chen et al., 13 Nov 2025, Ding et al., 2 Jun 2025).
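The ANLS metric used by DocVQA-style benchmarks can be sketched as follows; this is a minimal illustration of the standard formulation (best normalized Levenshtein similarity over the gold answers, thresholded at τ = 0.5), not any benchmark's official scorer.

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """ANLS: per question, take the best normalized Levenshtein similarity
    against any gold answer; similarities below tau count as 0; average
    over all questions."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for g in golds:
            p, q = pred.strip().lower(), g.strip().lower()
            nl = levenshtein(p, q) / max(len(p), len(q), 1)
            best = max(best, 1.0 - nl)
        total += best if best >= tau else 0.0
    return total / max(len(predictions), 1)
```

The threshold makes the metric forgiving of minor OCR-level typos while scoring substantively wrong answers as zero.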
5. Recent Model Innovations and Empirical Findings
State-of-the-art VRDU models feature several key characteristics:
- Pretrained Multimodal Transformers: LayoutLMv3, ERNIE-Layout, and Bi-VLDoc set strong baselines via deep integration of text, layout, and image, spatially biased attention, and multi-task pretraining (Xu et al., 2020, Peng et al., 2022, Luo et al., 2022).
- Graph/Hierarchical Layout Modeling: GraphLayoutLM demonstrates consistent ~2 pp F1 gains on FUNSD/CORD through layout graph reordering and masking; ReLayout minimizes real-world degradation when gold segments are unavailable, versus the sharp F1 drops seen in segment-dependent models (Li et al., 2023, Jiang et al., 14 Oct 2024).
- Pixel-Based Approaches: DUBLIN achieves SOTA or parity with OCR-based models on multi-benchmark document QA and extraction, outperforming prior pipeline systems on AI2D, DocVQA, and CORD (Aggarwal et al., 2023).
- Few-Shot and Robust Models: FS-DAG attains superior performance (≈99% F1) in five-shot KIE settings, with a sub-1% drop under simulated OCR noise and rapid convergence for industrial applications (<90M parameters) (Agarwal et al., 22 May 2025).
- Multi-Granular and Synthetic Data Approaches: 3MVRD and DAViD achieve SOTA with minimal labeled data by distilling from token- and entity-level teachers and incorporating joint-grained encoding with synthetic annotations, attaining high F1 on noisy forms and producing robust cross-domain models (Ding et al., 28 Feb 2024, Ding et al., 2 Oct 2024).
- Instruction-Tuned LLMs and OCR-Free MLLMs: LayoutLLM, PDF-WuKong, and OCR-free MLLMs fuse layout-aware vision-language encoders with LLM decoders, surpassing traditional modal-specific stacks, especially in extractive and generative QA tasks across diverse benchmark settings (Fujitake, 21 Mar 2024, Ding et al., 14 Jul 2025, Chen et al., 13 Nov 2025).
- Augmentation for Domain Generalization: Document-specific image transforms (ink bleed, letterpress, scanner artifacts) substantially bridge the digital–handwritten performance gap (+4 mAP gains, >27% absolute improvement for domain-pretrained backbones) (Dai, 10 Feb 2025).
6. Open Challenges, Limitations, and Research Directions
Despite major advances, several persistent challenges demarcate the research frontier:
- Layout and Reading Order Generalization: Models struggle with multi-column, non-Manhattan, and hierarchical layouts, as well as with reconstructing complex reading orders; MosaicDoc reveals recall/F1 collapse for paragraph-level reading-order prediction (Chen et al., 13 Nov 2025).
- Robust Cross-Page and Multi-Span Reasoning: Nearly all systems exhibit large drops on multi-span or cross-entity QA tasks, failing to aggregate dispersed evidence, especially across pages or columns (Chen et al., 13 Nov 2025, Ding et al., 2 Aug 2024).
- Domain Adaptation and Annotation Efficiency: Real-world deployment demands strong performance under few-shot labeled settings, high OCR or scan noise, and domain drift (e.g., handwritten or multilingual forms). Techniques such as aggressive data augmentation, synthetic annotation generation, and modular adapter-based architectures are active research tracks (Agarwal et al., 22 May 2025, Ding et al., 2 Oct 2024).
- Interpretability and Structured Prediction: Explicit graph structure, attention visualization, and reason-aware decoding remain under-developed. Current models are largely uninterpretable and sensitive to input corruption (Li et al., 2023, Ding et al., 2 Aug 2024).
- Computational Efficiency: OCR-free architectures, high input resolution, and multi-hop reasoning incur significant computational cost. Efficient visual compressors, sparse attention mechanisms, and distillation approaches are being investigated (Aggarwal et al., 2023, Ding et al., 14 Jul 2025).
Future research directions identified include hierarchical structure modeling, dynamic layout graph learning, efficient parameterization, robust multimodal adapters, and unified pretraining across raw pixels, text, and markup (Ding et al., 14 Jul 2025, Ding et al., 2 Aug 2024, Jiang et al., 14 Oct 2024, Li et al., 2023).
7. Benchmarking, SOTA, and Empirical Performance Trends
Empirical results from recent literature converge on several consensus findings:
| Model/Strategy | FUNSD F1 (%) | CORD F1 (%) | DocVQA ANLS | WildReceipt F1 | Few-Shot F1 | Notes |
|---|---|---|---|---|---|---|
| LayoutLMv3-LARGE | 92.1 | 97.5 | 83.4 | – | – | Multi-modal Transformer |
| GraphLayoutLM-LARGE | 94.4 | 97.7 | – | – | – | Layout graph + reorder+mask |
| 3MVRD (+Sim+Distil) | 90.9 | 98.7† | – | – | – | Token/entity multi-teacher |
| FS-DAG | – | – | – | 93.9 | 98.9 (5-shot) | GNN, <90M params, OCR robust |
| DUBLIN (pixel-based) | 77.8 | 97.1 | 80.7 | – | – | OCR-free generative |
| ReLayout (ReVrDU, OCR-only) | 83.1 | 96.8 | 76.0 | – | – | Gold segment independence |
| LayoutLLM | 95.3 | 98.6 | 86.9 | – | – | Layout+LLM Instruction tuning |
| MosaicDoc (SOTA-VLM) | – | – | 61.3%* | – | – | Multi-span QA, complex layout |
*Gemini on MosaicDoc Magazines (en); † = Form-NLU Printed
Superior performance is achieved by fusing domain-pretrained vision–language backbones with graph/structure reasoning, strong augmentation, and, increasingly, LLM-based instruction/fusion approaches. Performance gaps on multi-span, multi-lingual, and hierarchical benchmarks persist.
In summary, VRDU has evolved into a technically diverse field encompassing spatially-bias Transformers, layout graphs, joint-grained distillation, graph neural networks, and both OCR-dependent and OCR-free frameworks. Despite rapid progress, challenges in structure modeling, few/zero-shot generalization, noisy inputs, interpretability, and scaling to long or dense documents remain open. Benchmark initiatives such as VRDU, MosaicDoc, and VRD-IU are driving both methodological sophistication and empirical rigor, setting the agenda for the next generation of document intelligence systems (Ding et al., 2 Aug 2024, Wang et al., 2022, Chen et al., 13 Nov 2025, Ding et al., 2 Jun 2025, Li et al., 2023, Adnan et al., 16 Apr 2024, Ding et al., 14 Jul 2025).