
FUNSD: Form Understanding in Noisy Scanned Documents

  • FUNSD is a benchmark dataset with detailed multi-level annotations for form image analysis, enabling clear evaluation of text extraction and layout understanding.
  • It supports various tasks including word localization, semantic grouping, entity classification, and linking, with metrics like IoU, ARI, and F1 for precise assessment.
  • The dataset’s real-world forms, with challenges such as scanning noise and diverse layouts, drive innovations in modular and end-to-end document information extraction.

FUNSD (Form Understanding in Noisy Scanned Documents) is a benchmark corpus designed to catalyze research in visually rich document understanding, particularly form image analysis. Targeting the extraction and structuring of textual content from real-world scanned forms, FUNSD provides comprehensive, multi-level annotations that support a wide array of tasks, including word localization, word grouping into semantic entities, entity classification and linking, and entity boundary detection. Its design enables both modular and end-to-end system development, serving as a foundational testbed for text, layout, and multimodal approaches to form document information extraction.

1. Dataset Composition and Annotation Schema

FUNSD contains 199 real, fully annotated, one-page form images sourced from the RVL-CDIP collection, encompassing high variability in layout, noise, font, and content domain. By the split statistics below, each form averages roughly 49 semantic entities and 27 relational links.

The dataset is partitioned into:

  • Training set: 149 forms (22,512 words, 7,411 entities, 4,236 links)
  • Test set: 50 forms (8,973 words, 2,332 entities, 1,076 links)
  • No official validation split is provided (Jaume et al., 2019); some later works introduce ad hoc splits (Vu et al., 2020, Davis et al., 2021).

Annotations are stored as one structured JSON file per form, adhering to the following schema:

  • Entity-level annotation: Each entity is a spatial–semantic grouping of one or more words, with a bounding box, text content, and a categorical label:
    • HEADER: section or block headings
    • QUESTION: field prompts (e.g. “Name:”)
    • ANSWER: fill-in regions (user or machine entries)
    • OTHER: logos, decorative text, uninterpretable regions
  • Relational annotation: Directed edges between entities define question → answer (key–value) or higher-level hierarchical links (e.g., header → question).
  • Word-level annotation: Each word is tagged with its own bounding box and text, grouped into entities.
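
For concreteness, the sketch below walks through one annotation file under this schema. The field names (`form`, `id`, `text`, `box`, `label`, `words`, `linking`) follow the released annotation files as described above; the directory path is illustrative and may differ in a local copy.

```python
import json
from pathlib import Path

# Illustrative layout: the public release ships one JSON file per form
# under training_data/annotations/ (and testing_data/annotations/).
ann_path = next(Path("dataset/training_data/annotations").glob("*.json"))

with ann_path.open() as f:
    entities = json.load(f)["form"]  # top-level key "form": list of entities

for entity in entities:
    label = entity["label"]    # header / question / answer / other
    box = entity["box"]        # [x_min, y_min, x_max, y_max] in pixels
    words = entity["words"]    # word-level items: {"text": ..., "box": ...}
    links = entity["linking"]  # directed pairs [from_entity_id, to_entity_id]
    print(entity["id"], label, repr(entity["text"]), len(words), links)
```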

A summary of quantitative statistics is provided below.

| Subset | Header | Question | Answer | Other | Total Entities | Relations |
|--------|-------:|---------:|-------:|------:|---------------:|----------:|
| Train  | 441    | 3,266    | 2,802  | 902   | 7,411          | 4,236     |
| Test   | 122    | 1,077    | 821    | 312   | 2,332          | 1,076     |

All bounding boxes are rectangular; links are explicitly directed and labeled in the annotation files (Jaume et al., 2019, Vu et al., 2020).

2. Task Definitions and Supported Evaluation Metrics

FUNSD provides comprehensive support for end-to-end information extraction and individual pipeline components:

  • Text Detection: Localize word bounding boxes; evaluated at IoU ≥ 0.5.
  • OCR: Recognize the textual content of each detected word; evaluated by string similarity based on normalized Levenshtein distance.
  • Word Grouping: Cluster detected words into entities; Adjusted Rand Index (ARI) is used for agreement with ground truth groupings (Jaume et al., 2019).
  • Entity Classification: Assign each entity a label from {header, question, answer, other}; micro-averaged Precision, Recall, and F1 at the entity (span + label) level.
  • Entity Linking: Predict directed links (key-value or higher-order), evaluated on the accuracy of predicted pairs.
  • Entity Boundary Detection: A prediction is correct only if both the complete entity span and class label match the reference.

Metric formalizations:

  • $\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$
  • $\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$
  • $\text{F}_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
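
The snippet below is a minimal sketch of the matching logic behind these metrics: IoU for word detection and exact span-plus-label comparison for entity-level micro P/R/F1. It is illustrative only; the greedy one-to-one matching is an assumed simplification, not the official evaluation script.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_detections(pred_boxes, gold_boxes, thr=0.5):
    """Greedily match each prediction to an unused reference at IoU >= thr."""
    matched, tp = set(), 0
    for p in pred_boxes:
        cands = [(iou(p, g), j) for j, g in enumerate(gold_boxes)
                 if j not in matched and iou(p, g) >= thr]
        if cands:
            matched.add(max(cands)[1])  # take the highest-IoU free reference
            tp += 1
    return tp  # precision = tp / len(pred_boxes), recall = tp / len(gold_boxes)

def micro_prf(predicted, gold):
    """Micro-averaged P/R/F1 over (span, label) pairs: an entity is a true
    positive only if both its complete span and its class label match."""
    tp = len(set(predicted) & set(gold))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)
```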

For entity linking, some models additionally report mean Average Precision (mAP), mean rank (mRank), and Hit@k scores, especially in the context of pairwise link retrieval (Wang et al., 2020, Villota et al., 2021).
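
As a sketch of these retrieval-style linking metrics, the function below computes Hit@k and the mean rank of the true targets for a single source entity, given model scores over candidate targets; the scores are stand-ins for any pairwise link model.

```python
def link_retrieval_metrics(scores, gold_targets, k=1):
    """scores: {candidate_entity_id: model score}; gold_targets: set of true
    linked entity ids. Returns (Hit@k, mean 1-based rank of gold targets)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    ranks = [ranked.index(g) + 1 for g in gold_targets if g in scores]
    hit_at_k = float(any(r <= k for r in ranks))
    mean_rank = sum(ranks) / len(ranks) if ranks else float("inf")
    return hit_at_k, mean_rank

# Example: three candidate answers for one question; the true target is 7.
print(link_retrieval_metrics({3: 0.1, 7: 0.9, 12: 0.4}, {7}, k=1))  # (1.0, 1.0)
```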

3. Dataset Challenges, Limitations, and Structural Revisions

FUNSD was designed to embody the complexity of real-world form documents:

  • Scanning noise: Low-resolution (≈100 dpi), grayscale, with blur, streaks, and repeated scan–print artifacts.
  • Layout heterogeneity: Wide template diversity across domains and periods, with tabular regions, multi-column layouts, and both typed and handwritten text.
  • Annotation inconsistencies: Early releases exhibited ambiguous usage of “header,” over-/under-segmentation of entity spans, and occasional erroneous key–value links (Vu et al., 2020).
  • Limited sample size: Only 199 forms, increasing the importance of transfer learning and data-efficient algorithms.

A key revision initiative (Vu et al., 2020) eliminated ambiguous “header” labels and enforced strict key–value chain labeling rules, resulting in more consistent ground truth and marginally improved supervised model convergence (+1% mIoU on Q/A/O segmentation). Residual OCR errors and bounding-box artifacts persist.

4. Model Benchmarks and State-of-the-Art Results

FUNSD has become a de facto standard for benchmarking document understanding models, from early MLP and CNN approaches to current multimodal transformers and graph-based mechanisms.

Recent state-of-the-art on FUNSD (entity recognition, F1 metric):

| Model | Modality | F1 (%) | Notes |
|-------|----------|--------|-------|
| GraphLayoutLM_BASE | T+L+Image | 93.43 | Multimodal encoding, MLP head (Li et al., 2024) |
| HGALayoutLM_BASE | T+L+Image | 94.32 | Hypergraph attention, span-boundary modeling |
| GraphLayoutLM_LARGE | T+L+Image | 94.39 | |
| HGALayoutLM_LARGE | T+L+Image | 95.31 | |
| FormNetV2 | T+L+Image | 86.35 | Multimodal, unified GCL pretraining (Lee et al., 2023) |
| MSAU-PAF | Image+Layout | 83.0 | End-to-end image-based hierarchical (Dang et al., 2021) |
| U-Net + CI-Deform | Image+TextMask | 0.73 (mIoU) | Key–value segmentation, layout-only (Vu et al., 2020) |

HGALayoutLM’s hypergraph-attention head jointly scores all start–end token pairs per span and label, enhancing boundary detection over BIO-tag or MLP heads, with negligible additional computation (Li et al., 2024). FormNetV2’s multimodal graph-contrastive pretraining yields a more compact but slightly lower-F1 model relative to hypergraph-attention architectures (Lee et al., 2023).
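
The span-pair idea can be pictured with the schematic PyTorch sketch below, which scores every (start, end) token pair under every label with a per-label bilinear form. The dimensions and the bilinear parameterization are assumptions for illustration, not HGALayoutLM's actual head.

```python
import torch

def score_spans(h, w_start, w_end, label_bilinear):
    """Score all (start, end) token pairs for every label.

    h: (seq_len, d) token encodings from a layout-aware backbone.
    Returns (seq_len, seq_len, n_labels); entry (i, j, c) scores the span
    that starts at token i and ends at token j with label c.
    """
    s = h @ w_start  # (seq_len, d') start representations
    e = h @ w_end    # (seq_len, d') end representations
    # Per-label bilinear interaction: s_i^T B_c e_j for every (i, j, c).
    return torch.einsum("id,cde,je->ijc", s, label_bilinear, e)

seq_len, d, dp, n_labels = 128, 768, 64, 4
scores = score_spans(torch.randn(seq_len, d),
                     torch.randn(d, dp), torch.randn(d, dp),
                     torch.randn(n_labels, dp, dp))
assert scores.shape == (seq_len, seq_len, n_labels)
```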

Entity linking benchmarks demonstrate that BERT-based pairwise classification can attain F1=0.80, outperforming both earlier graph-based models (FUDGE: 0.62) and multi-modal LayoutLM (0.69) (Villota et al., 2021).
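
A hedged sketch of that pairwise formulation: each candidate (question, answer) pair is serialized into a single sequence and classified as link versus no-link. The checkpoint name and preprocessing below are illustrative, not the exact configuration of the cited work.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; the cited system's model and training data differ.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # classes: no-link vs. link

def link_probability(question_text, answer_text):
    """Score one candidate key-value pair as "[CLS] question [SEP] answer"."""
    enc = tok(question_text, answer_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(link_probability("Name:", "John Smith"))  # untrained head: ~0.5
```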

5. Methodological Advancements and Ablation Findings

Research on FUNSD advances both model architecture and learning protocol:

  • Span-centric attention: HGALayoutLM replaces per-token linear heads with span-based multi-head attention. Rotary span position encoding over OCR–node indices and balanced-hyperedge loss mitigate fragmentation at class boundaries, achieving substantial F1 gains (Li et al., 2024).
  • Graph-based modeling: FormNetV2 and DocStruct employ multimodal GCNs, infusing token-level layout and edge-level image features with self-supervised graph contrastive learning, significantly improving localization and label F1 (Lee et al., 2023, Wang et al., 2020).
  • Visual graph editing: Visual FUDGE models the document as an iteratively edited graph of text lines, using GCN-driven edge pruning, vertex merging, and grouping solely from visual features; this enables robustness to degraded or out-of-language forms, although entity labeling lags language-model–based methods (Davis et al., 2021).
  • Curriculum and schedule ablations: Progressive data scheduling (curriculum learning) reduces training time. Significant F1 benefits are observed for BERT (ΔF1 = +0.023, p = 0.022) but not for LayoutLMv3, suggesting that model capacity and inductive bias modulate curriculum effectiveness (Hamdan et al., 2026); a minimal scheduling sketch follows this list.
  • Segmentation with CI-Deformable Convolution: Channel-invariant deformable convolutions in a U-Net backbone accelerate convergence (~10% faster) and modestly improve mIoU over standard U-Net in layout-only key–value segmentation (Vu et al., 2020).
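
A minimal sketch of a progressive data schedule of this kind is shown below, assuming a precomputed per-example difficulty score; both the pacing function and the difficulty proxy are illustrative choices rather than the cited protocol.

```python
import random

def curriculum_batches(examples, difficulty, n_epochs, batch_size=16):
    """Yield (epoch, batch) pairs from a growing, easy-first training pool.

    examples: training samples; difficulty: parallel list of scores, lower
    meaning easier (an assumed proxy, e.g. sequence length or teacher loss).
    """
    order = sorted(range(len(examples)), key=lambda i: difficulty[i])
    for epoch in range(1, n_epochs + 1):
        # Linear pacing: expose a growing easy-first fraction of the data.
        cutoff = max(batch_size, int(len(order) * epoch / n_epochs))
        pool = [examples[i] for i in order[:cutoff]]
        random.shuffle(pool)
        for start in range(0, len(pool), batch_size):
            yield epoch, pool[start:start + batch_size]
```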

6. Downstream Benchmarks, Extensions, and Cross-Lingual Expansion

While the original FUNSD is English-only and entity–link–centered, several works address limitations in granularity and scope:

  • SRFUND augments FUNSD with dense multi-granular annotations (word↔line↔entity↔table↔hierarchy) and aligns protocols across eight languages (EN, ZH, JA, DE, FR, ES, IT, PT), enabling evaluation of full-document tree structure recovery and internationalized models (Ma et al., 2024).
  • Relation modeling: DocStruct and MSAU-PAF both pursue direct modeling of directed document graphs and key–value trees, with evaluation at edge and node levels; performance ceilings persist on cross-entity linking and deep structure recovery (Wang et al., 2020, Dang et al., 2021).
  • Table and complex layout localization: Emerging benchmarks in SRFUND include explicit table region annotation, diverse within-table entity grouping, and document trees averaging 3.049 levels deep, advancing toward realistic administrative workflow automation (Ma et al., 2024).

7. Impact and Outlook in Document Intelligence

FUNSD’s influence is evidenced by its adoption in nearly all recent research on form document understanding and its direct role in catalyzing methodological innovations:

  • Span-based entity prediction, multimodal and visual–graph approaches, graph contrastive pretraining, and curriculum-driven training protocols all rely on and report against FUNSD as a central benchmark.
  • Its design foregrounds boundary-level evaluation and precise entity/linking accuracy, which model ablations have shown to be key to unlocking further improvements (Li et al., 2024, Lee et al., 2023).
  • Persistent limitations—scarcity of document types, English-only base annotations, annotation boundary ambiguities—have driven the creation of extensions such as SRFUND and the pursuit of more robust cross-lingual, layout-agnostic modeling.

A plausible implication is that future progress on form understanding will derive from richer hierarchical annotation protocols, explicit structural modeling, and tightly integrated visual–textual–layout supervision, all of which are being tested or enabled on benchmarks descended from or constructed atop FUNSD (Ma et al., 2024, Lee et al., 2023, Vu et al., 2020).
