Visually Rich Documents (VRDs)
- Visually Rich Documents (VRDs) are documents whose meaning depends jointly on text, spatial layout, and visual cues, so that comprehensive content understanding requires all three.
- Understanding them relies on multimodal models, including transformers, graph-based methods, and LLM-based strategies, that capture complex spatial and semantic relationships.
- VRD understanding supports key tasks such as information extraction, layout analysis, and entity relation detection, driving robust document intelligence solutions.
Visually Rich Documents (VRDs) are document artifacts whose interpretation is fundamentally governed by the interplay of text, spatial layout, and diverse visual cues. These documents include invoices, receipts, forms, contracts, scientific articles, and diagrams, in which semantic entities cannot be reliably extracted or understood by analyzing their raw textual sequences alone. Instead, the alignment, grouping, hierarchical organization, and even implicit values must be inferred by integrating two-dimensional structure, font and appearance features, and often cross-page dependencies. The complexity and heterogeneity of VRDs have driven the evolution of specialized multimodal models, evaluation metrics, and domain-adaptive pipelines that bridge vision, language, and geometry for robust document intelligence.
1. Definitions, Taxonomy, and Task Scope
VRDs are characterized by their requirement for joint visual-language-layout modeling. Formally, a VRD page can be defined as D = (T, E, V), where T = {(t_i, b_i)} denotes the text tokens t_i with their bounding boxes b_i, E the higher-level entities such as paragraphs or fields, and V the raw page image or its visual feature map (Ding et al., 2 Aug 2024); a minimal data-structure sketch of this formulation follows the list below. The essential traits distinguishing VRDs from plain-text documents are:
- Semantic interpretation contingent on spatial proximity, alignment, and visual grouping
- Rich heterogeneity in templates, languages, and graphical content
- Combinatorial variation in reading order, often requiring two-dimensional or hierarchical sequence modeling
- Frequent presence of nested, repeated and hierarchical structures (e.g., line items, tables within tables)
- Data sparsity in domain-specific layouts, especially for low-resource domains (Ding et al., 2 Oct 2024)
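As an illustration of the D = (T, E, V) formulation above, the following is a minimal sketch of how a VRD page might be represented in code. The class and field names (Token, Entity, VRDPage) are assumptions for exposition, not part of any cited implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass
class Token:
    """A text token t_i paired with its bounding box b_i."""
    text: str
    bbox: BBox


@dataclass
class Entity:
    """A higher-level semantic entity (field, paragraph, table cell, ...)."""
    label: str                 # e.g. "invoice_number", "total_amount"
    token_ids: List[int]       # indices into VRDPage.tokens
    bbox: BBox                 # union of the member token boxes


@dataclass
class VRDPage:
    """D = (T, E, V): tokens with layout (T), entities (E), and the visual channel (V)."""
    tokens: List[Token]
    entities: List[Entity] = field(default_factory=list)
    image: Optional[np.ndarray] = None   # raw page image or a precomputed feature map

    def entity_text(self, entity: Entity) -> str:
        """Recover an entity's surface string from its member tokens."""
        return " ".join(self.tokens[i].text for i in entity.token_ids)
```

Keeping token indices rather than copied strings inside Entity makes it possible to recover both the surface text and the spatial extent of a field from the same page object, which is exactly the coupling that distinguishes VRDs from plain text.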
Key downstream tasks in VRD Content Understanding (VRD-CU) include:
- Key Information Extraction (KIE): structured field/entity extraction
- Named Entity Recognition (NER), sequence labeling, and span extraction
- Entity Relation Extraction (key-value pairing, linking, dependency parsing)
- Layout analysis (segmentation, reading order prediction)
- Table and diagram structure recognition
- Document Visual Question Answering (DocVQA), both single- and multi-page (Ding et al., 2 Jun 2025, Napolitano et al., 14 Nov 2025, Sourati et al., 8 Oct 2025)
2. Core Modeling Approaches
2.1 Sequence-based, Graph-based, and Transformer Methods
- Sequence-based Models: Linearize text as a 2D grid or sequence (e.g., Chargrid, CUTIE, LASER) and apply CNN/RNN or encoder-decoder architectures with text-only or partially causal attention masks (Ding et al., 2 Aug 2024).
- Graph-based Models: Model the document as a graph G = (V, E), where nodes V are text segments and edges E encode spatial/geometric features; a minimal sketch follows this list. Graph convolution layers incorporate both visual and textual context, allowing the model to integrate relationships such as proximity and alignment that are essential for key-value pairing and disambiguation (Liu et al., 2019, Wei et al., 2020).
- Transformer-based Methods: Architectures such as LayoutLMv2/v3, SelfDoc, UniDoc, and joint-grained models (e.g., StrucText, MGDoc) fuse word/box-level and entity/ROI-level representations, often via multimodal cross-attention and learned spatial biases (Ding et al., 2 Aug 2024, Ding et al., 28 Feb 2024, Ding et al., 2 Oct 2024). Masked language/visual modeling, contrastive learning, and cross-modality alignment objectives underpin state-of-the-art performance.
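The sketch below illustrates the graph-based view: it builds a k-nearest-neighbour spatial graph over segment bounding boxes and applies one mean-aggregation message-passing step. The graph construction, the single-layer update, and the function names are simplifying assumptions, not the architecture of any specific cited model.

```python
import numpy as np


def box_center(bbox):
    """Centre of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = bbox
    return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])


def build_spatial_graph(bboxes, k=4):
    """Connect every text segment to its k nearest neighbours by centre distance."""
    centers = np.stack([box_center(b) for b in bboxes])              # (N, 2)
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                                  # no self-loops
    return [(i, int(j))
            for i in range(len(bboxes))
            for j in np.argsort(dists[i])[:k]]


def message_passing(node_feats, edges, weight):
    """One mean-aggregation step: h_i' = ReLU((mean of neighbour features) @ W)."""
    agg = np.zeros(node_feats.shape, dtype=float)
    deg = np.zeros(node_feats.shape[0])
    for i, j in edges:
        agg[i] += node_feats[j]
        deg[i] += 1
    agg /= np.maximum(deg, 1)[:, None]
    return np.maximum(agg @ weight, 0.0)                             # ReLU
```

Node features would typically concatenate text embeddings with normalized box coordinates, so that the aggregation step mixes textual and geometric context along spatial edges.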
2.2 Hybrid and LLM-based Innovations
- LLM-centric Block-wise Parsing: BLOCKIE demonstrates an LLM-based strategy that decomposes VRDs into localized, self-contained “semantic blocks”, enabling focused, block-level reasoning and improved F1 (up to +3% over prior SOTA) with few-shot layout-matched prompting (Bhattacharyya et al., 18 May 2025).
- Layout-aware Graphs and Evidence Retrieval: LAD-RAG augments neural-retriever pipelines with a symbolic document graph, dynamically guiding LLM queries through both neural semantic similarity and explicit structural queries (see the retrieval sketch after this list). This yields near-perfect retrieval recall on multi-page benchmarks and significant DocVQA accuracy gains (Sourati et al., 8 Oct 2025).
- Knowledge Distillation and Compression: Model efficiency for VRDs is tackled via response-based, feature-based, and classifier-sharing KD (SimKD). While student models (ViT-Tiny, ResNet-50) often retain most document image classification (DIC) and document layout analysis (DLA) performance, a large mAP gap remains versus full-size backbones; downstream DocVQA accuracy increases are measurable but modest (Landeghem et al., 12 Jun 2024).
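The retrieval sketch referenced above combines dense semantic similarity with expansion over a symbolic layout graph. The scoring scheme, the `layout_graph` dictionary format, and the hop-based expansion are assumptions meant to convey the idea of layout-aware retrieval, not the published LAD-RAG pipeline.

```python
import numpy as np


def dense_scores(query_vec, chunk_vecs):
    """Cosine similarity between a query embedding and all chunk embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q


def layout_aware_retrieve(query_vec, chunk_vecs, layout_graph, top_k=5, hops=1):
    """Seed with the top-k dense hits, then expand over a symbolic layout graph.

    layout_graph: dict mapping a chunk id to structurally related chunk ids
    (same section, same table, caption-figure links, cross-page continuations).
    """
    scores = dense_scores(query_vec, chunk_vecs)
    selected = set(int(i) for i in np.argsort(-scores)[:top_k])
    frontier = set(selected)
    for _ in range(hops):
        neighbours = set()
        for i in frontier:
            neighbours.update(layout_graph.get(i, []))
        frontier = neighbours - selected
        selected |= frontier
    # Hand back structurally expanded evidence, best dense score first.
    return sorted(selected, key=lambda i: -scores[i])
```

The structural expansion is what recovers evidence that a purely dense retriever misses, such as a table continued on the next page or a caption far from its figure.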
3. Reading Order, Entity Relations, and Relational Representation
3.1 Reading Order Modeling
The reading order of VRD elements—critical for NER, QA, and relation extraction—is ill-posed under simple permutation-based approaches. Recent works model reading order as a graph of “immediate succession” relations, yielding an acyclic, expressive structure that captures nonlinear, multi-branch reading paths (e.g., tables, sidebars) (Zhang et al., 29 Sep 2024). Relation-based ordering outperforms sequence-based models on both ROP F1 (LayoutLMv3-large: 82.4%) and downstream IE/QA tasks.
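Given succession relations as directed edges over layout elements, one linear reading sequence consistent with the acyclic relation graph can be recovered by topological sorting. The sketch below uses Kahn's algorithm and is an illustrative simplification of relation-based ordering, not the cited model's decoder.

```python
from collections import defaultdict, deque


def linearize_reading_order(num_elements, succession_edges):
    """Derive one linear reading sequence from 'immediate succession' edges (i -> j).

    The edges are assumed to form a directed acyclic graph; elements on parallel
    branches (e.g. a sidebar next to body text) are emitted in the order they
    become available.
    """
    successors = defaultdict(list)
    indegree = [0] * num_elements
    for i, j in succession_edges:
        successors[i].append(j)
        indegree[j] += 1
    ready = deque(i for i in range(num_elements) if indegree[i] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    return order
```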
TPP (Token Path Prediction) reframes NER as token-sequence path prediction in a token graph. This edge-scoring model achieves higher entity F1 on real-world shuffled OCR outputs, is robust to non-contiguous entity mentions, and generalizes to other link-prediction-based tasks (Zhang et al., 2023).
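A simplified decoder for such an edge-scoring formulation is sketched below: pairwise link scores between tokens are pruned to at most one incoming and one outgoing link per token, and the resulting chains are read off as entity mentions. The greedy acceptance rule and the threshold are assumptions for illustration, not the decoding procedure published with TPP.

```python
import numpy as np


def decode_token_paths(link_scores, threshold=0.5):
    """Turn pairwise link scores into entity token paths.

    link_scores[i, j] scores 'token j immediately follows token i within the
    same entity'. Each token keeps at most one incoming and one outgoing link
    above the threshold; the resulting chains are the predicted mentions
    (possibly non-contiguous in the original OCR order). Any accidental cycle
    has no head token and is simply not emitted.
    """
    n = link_scores.shape[0]
    next_tok = [-1] * n
    has_prev = [False] * n
    candidates = [(link_scores[i, j], i, j)
                  for i in range(n) for j in range(n) if i != j]
    for score, i, j in sorted(candidates, reverse=True):
        if score < threshold:
            break
        if next_tok[i] == -1 and not has_prev[j]:
            next_tok[i] = j
            has_prev[j] = True
    paths = []
    for head in range(n):
        if not has_prev[head] and next_tok[head] != -1:
            path, cur = [head], head
            while next_tok[cur] != -1:
                cur = next_tok[cur]
                path.append(cur)
            paths.append(path)
    return paths
```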
3.2 Relational Pretraining and Embedding
DocReL proposes Relational Consistency Modeling (RCM), a self-supervised contrastive task where pairwise and global relational embeddings are required to remain consistent under entity-level positive augmentations. This produces task-agnostic relational embeddings, pushing SOTA on table structure recognition, key-value pairing, and reading order detection (Li et al., 2022).
Dependency-parsing-style entity relation extraction with biaffine scoring further leverages contextual and spatial cues, outperforming multi-head MLPs and providing strong results on both public and real-world VRD datasets (Zhang et al., 2021).
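A minimal biaffine scorer over entity representations follows the standard formulation s_ij = h_i^T U d_j + w^T [h_i; d_j] + b; the shapes and role names below are generic assumptions rather than the cited paper's exact parameterization.

```python
import numpy as np


def biaffine_scores(heads, deps, U, w, b):
    """Biaffine link scores: s_ij = h_i^T U d_j + w^T [h_i; d_j] + b.

    heads: (N, d) entity representations in the head role (e.g. keys)
    deps:  (N, d) entity representations in the dependent role (e.g. values)
    U:     (d, d) bilinear weight, w: (2d,) linear weight, b: scalar bias
    Returns an (N, N) matrix where entry (i, j) scores the relation i -> j.
    """
    d = heads.shape[1]
    bilinear = heads @ U @ deps.T                              # (N, N)
    linear = (heads @ w[:d])[:, None] + (deps @ w[d:])[None, :]
    return bilinear + linear + b
```

Key-value pairing then amounts to taking, for each candidate value entity, the arg-max over head entities (with a "no head" option in practice).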
4. Evaluation Frameworks and Metrics
The evaluation of VRD IE systems now spans three axes:
| Metric Class | Example Measures | Captured Aspect |
|---|---|---|
| Text-based | Exact match, Levenshtein, token-F1 | String/token label fidelity |
| Geometric-based | IoU (Grouped, Constituent) | Spatial alignment |
| Hierarchical/tree | Hierarchical Edit Distance (HED/UHED) | Structural/nested-level recovery |
DI-Metrics unifies these, demonstrating that hierarchical metrics (HED/UHED) are critical for measuring extraction quality on documents with nested or repeated structures, where flat string- or token-level metrics can mask grouping/association errors (DeGange et al., 2022).
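For reference, minimal implementations of the first two metric classes are sketched below (normalized Levenshtein similarity and bounding-box IoU); the hierarchical metrics (HED/UHED) require tree alignment and are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a predicted and a gold field string."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def text_similarity(pred: str, gold: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))


def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0
```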
Benchmark studies reveal that rule-based and learned heuristics for reading order and grouping can outperform human scan path alignment for some downstream models, indicating that model inductive biases and input strategies can dominate over “human-aligned” sequences (Wang et al., 2023).
5. Domain Adaptation, Synthetic Data, and Scalability
Domain adaptation in VRD understanding leverages both synthetic large-scale annotations and minimal guidance sets. DAViD operationalizes joint-grained representation learning, aligning fine-grained SST (token tagging), coarse-grained SIT (entity-level question answering), and SDS (structure alignment) using machine-generated noisy labels, followed by lightweight fine-tuning. This reduces real-label needs by >90% and attains up to +24 points absolute accuracy over prior baselines on challenging handwritten/document-specific tasks (Ding et al., 2 Oct 2024).
Few-shot relational learning in VRDs exploits prototypical rectification and explicit 2D ROI regression to rapidly adapt to novel relation types and layouts. Adding variational rectification and ROI-awareness boosts 1- to 5-shot F1 performance by up to 3 points over strong multimodal prototypical networks, with enhanced generalization to OOD relational classes (Wang et al., 23 Mar 2024).
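The prototypical-network baseline that such few-shot methods build on is easy to sketch: class prototypes are support-set mean embeddings, and queries are assigned to the nearest prototype. Prototype rectification and the 2D ROI regression head are omitted, and the function names below are illustrative.

```python
import numpy as np


def class_prototypes(support_embs, support_labels):
    """Prototype of each class = mean embedding of its support examples."""
    labels = np.asarray(support_labels)
    classes = sorted(set(support_labels))
    protos = np.stack([support_embs[labels == c].mean(axis=0) for c in classes])
    return classes, protos


def classify_queries(query_embs, support_embs, support_labels):
    """Assign each query to the nearest prototype by Euclidean distance."""
    classes, protos = class_prototypes(support_embs, support_labels)
    dists = np.linalg.norm(query_embs[:, None, :] - protos[None, :, :], axis=-1)
    return [classes[k] for k in dists.argmin(axis=1)]
```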
6. Benchmarks, Competitions, and Empirical Performance
The VRD-IU competition (Form-NLU) sets clear benchmarks for both entity-based key information retrieval and direct localization from raw document images. SOTA models exploit hierarchical decomposition, multi-modal feature fusion, and transformer-based token/entity modeling; ensemble object detection frameworks are necessary for robust localization on complex, heterogeneous forms (Ding et al., 2 Jun 2025). The best models attain macro F1 up to 100% on test keys (Track A) and 65.79% mAP (Track B, private leaderboard), but a remaining performance gap for end-to-end detection underscores outstanding challenges.
For VLLMs, VRD-UQA shows that document-level accuracy in detecting unanswerable questions, even on multi-page DocVQA datasets, remains capped below 0.6 on high-capacity models, especially for corrupted questions involving structural entity swaps or high element density (Napolitano et al., 14 Nov 2025).
7. Limitations, Challenges, and Trends
Despite substantial progress, several core challenges persist:
- Generalization to unseen layouts and novel field schemas
- Robustness to extreme OCR/layout noise and non-Latin scripts
- Efficient handling of long/multi-page inputs (context window, retrieval, memory limits)
- Cost of annotation for hierarchical/nested ground truth
- Gaps in compression (KD), especially for semantic DLA and downstream gains (Landeghem et al., 12 Jun 2024)
Emerging research is converging on further integrating LLMs, self-supervised relational learning, layout-aware retrieval, and synthetic plus few-shot domain adaptation. Multi-modal pretraining objectives that conjoin text, layout, and visual streams, especially where relation structure is explicitly encoded or predicted, provide the most robust and adaptable document understanding for VRDs.
References
- "Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments" (Bhattacharyya et al., 18 May 2025)
- "Deep Learning based Visually Rich Document Content Understanding: A Survey" (Ding et al., 2 Aug 2024)
- "VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding" (Ding et al., 2 Jun 2025)
- "Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding" (Zhang et al., 29 Sep 2024)
- "Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction" (Zhang et al., 2023)
- "Relational Representation Learning in Visually-Rich Documents" (Li et al., 2022)
- "Document Intelligence Metrics for Visually Rich Document Evaluation" (DeGange et al., 2022)
- "DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights" (Ding et al., 2 Oct 2024)
- "Towards Human-Like Machine Comprehension: Few-Shot Relational Learning in Visually-Rich Documents" (Wang et al., 23 Mar 2024)
- "3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding" (Ding et al., 28 Feb 2024)
- "LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding" (Sourati et al., 8 Oct 2025)
- "Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents" (Napolitano et al., 14 Nov 2025)
- "DistilDoc: Knowledge Distillation for Visually-Rich Document Applications" (Landeghem et al., 12 Jun 2024)
- "Graph Convolution for Multimodal Information Extraction from Visually Rich Documents" (Liu et al., 2019)
- "Entity Relation Extraction as Dependency Parsing in Visually Rich Documents" (Zhang et al., 2021)
- "A Span Extraction Approach for Information Extraction on Visually-Rich Documents" (Nguyen et al., 2021)
- "DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading" (Wang et al., 2023)