Key–Value Extraction
- Key–value extraction is a technique to transform visually and semantically linked entities into structured, machine-readable key–value pairs across diverse document types.
- Modern methods leverage multimodal models using textual, visual, and layout features, including segmentation, transformers, and graph-based approaches to accurately link keys to values.
- Systems are evaluated with metrics like precision, recall, and F1, driving advances in applications such as automated form processing and digital archiving.
Key–value extraction (KVE) is the task of identifying key–value pairs within structured or semi-structured documents, transforming visually and semantically linked entities into explicit, machine-readable tuples. In the context of document analysis, a key is an informational label or field name (e.g., “Invoice Date”), while a value represents the associated content (e.g., “2024-06-10”). The problem arises in both born-digital and scanned document images, and spans invoices, forms, receipts, historical records, and more. Modern KVE systems operate across visual, textual, and layout dimensions, employing deep multimodal models, sophisticated linking strategies, and large-scale annotated datasets to achieve robust extraction under arbitrary layouts.
1. Formal Task Definition and Problem Formulation
Given a document D = {e_1, …, e_n} in which each entity e_i carries textual content t_i and a spatial bounding box b_i, the objective is to predict a set of pairs P = {(k, v)} such that k and v are entities in D, with k functioning as a semantic label for v (Naparstek et al., 2024). The definition extends to allow unkeyed values (values without explicit keys) and unvalued keys (keys without a value). KVE is evaluated by matching predicted and ground-truth pairs according to text similarity (e.g., a normalized edit distance threshold) and spatial overlap (e.g., an Intersection-over-Union threshold).
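The matching criterion above can be sketched in a few lines: a predicted (key, value) pair counts as correct only if both sides match a ground-truth pair on text similarity and on spatial overlap. This is a minimal illustration, not any paper's official scorer; the field names, thresholds, and the use of `SequenceMatcher` as a stand-in for normalized edit distance are assumptions.

```python
from difflib import SequenceMatcher

def norm_edit_sim(a: str, b: str) -> float:
    # SequenceMatcher ratio as a stand-in for 1 - normalized edit distance
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2)
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def pair_matches(pred, gold, text_thr=0.8, iou_thr=0.5):
    # pred/gold: dicts with "key", "value", "key_box", "value_box"
    return (norm_edit_sim(pred["key"], gold["key"]) >= text_thr
            and norm_edit_sim(pred["value"], gold["value"]) >= text_thr
            and iou(pred["key_box"], gold["key_box"]) >= iou_thr
            and iou(pred["value_box"], gold["value_box"]) >= iou_thr)
```

Precision, recall, and F1 then follow from greedily matching each predicted pair to at most one ground-truth pair under this criterion.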
Several architectural and methodological paradigms have been developed:
- Pixel-wise segmentation: Models treat keys/values as categories in a segmentation mask over the document image (Vu et al., 2020).
- Entity-linking via multimodal encoders: Systems utilize Transformer-based architectures with visual, textual, and spatial cues to encode entities and learn explicit key–value relations (Hu et al., 2023, Wei et al., 2023).
- Pointer–pointer or graph approaches: Extraction formulated as typed directed links between entity spans, allowing joint decoding of entities and relations (Wei et al., 2023).
- Instruction-tuned multimodal LLMs: Generative models receive the image and a prompt and output structured key–value lists, leveraging content-aware tokenization and region detection (Nguyen et al., 13 Jul 2025, Naparstek et al., 2024).
- Segmentation-free handwritten document approaches: Joint models learn to extract and transcribe key–value fields without explicit intermediate segmentation steps (Tarride et al., 2023, Tarride et al., 2023).
2. Datasets and Annotation Protocols
Robust KVE requires large, diverse, and accurately annotated datasets tailored for both predefined and open-schema key–value scenarios.
Summary Table: Major KVE Benchmarks
| Dataset | Modality | Size | Annotation | Layout/Schema Diversity |
|---|---|---|---|---|
| FUNSD | Scanned forms | 199 pg | Entity spans, key–value relations | Limited, four classes (“header”, “question”, “answer”, “other”); revised for consistency (Vu et al., 2020) |
| KVP10k | Business docs | 10,707 pg | Rich key–value links, 17 classes | Extreme: invoices, forms, statements, diverse and arbitrary keys (Naparstek et al., 2024) |
| CLEX | Forms (complex) | 5,860 pg | 1,162 unique semantic labels, key–value links | High: hundreds of categories, zero-/few-shot splits (Wei et al., 2023) |
| SIMARA | Handwritten, archival | 5,393 pg | 7 fixed metadata fields (page-level) | Historic, noisy, no spatial localization (Tarride et al., 2023) |
KVP10k specifically bridges KIE (predefined keys) and open or emergent key–value pair extraction. Its annotation schema encompasses bounding boxes, semantic class labels, explicit key–value links, unkeyed values, and section blocks. In contrast, SIMARA and related datasets from (Tarride et al., 2023) focus on segmentation-free, sequence-labeled key–value extraction, particularly from handwritten and historical documents.
3. Model Architectures and Extraction Paradigms
3.1 Segmentation-based Approaches
Segmentation models (e.g., U-Net with channel-invariant deformable convolutions) directly predict per-pixel class assignments ({key, value, other, background}) from document images (and optionally a text mask). These architectures focus on spatial cues, and post-hoc nearest-neighbor linking assigns values to detected keys (Vu et al., 2020).
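The post-hoc linking step can be sketched as a nearest-neighbor assignment: once regions have been classified as keys or values, each value region is linked to the closest key region. Bounding-box center distance is used here as an illustrative criterion; the function names and metric are assumptions, not taken from (Vu et al., 2020).

```python
import math

def center(box):
    # box is (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def link_values_to_keys(key_boxes, value_boxes):
    """Return {value_index: key_index} via nearest-center matching."""
    links = {}
    for vi, vb in enumerate(value_boxes):
        vx, vy = center(vb)
        best = min(range(len(key_boxes)),
                   key=lambda ki: math.dist((vx, vy), center(key_boxes[ki])))
        links[vi] = best
    return links
```

Real systems typically refine this with reading-order constraints or learned compatibility scores, since pure distance fails on dense multi-column forms.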
3.2 Transformer-based, Entity-centric, and Pointer Formulations
Entity-centric methods encode detected text spans with visual, textual, and spatial embeddings (using LayoutLM, LayoutXLM, or similar), then explicitly predict key–value associations:
- KVPFormer implements a QA-style pipeline: keys are selected as “questions,” and a transformer decoder predicts their values via a two-stage coarse-to-fine ranking, enhanced by spatial compatibility attention (Hu et al., 2023). A shallow encoder performs key detection; the decoder then matches each candidate key to value spans using self- and cross-attention augmented with spatial compatibility bias terms.
- PPN (Parallel Pointer-based Network) casts KVE as dense, parallel link prediction between token pairs. Typed links (11 kinds: e.g., QuestionHead→ValueHead) express all relevant entity and relation types. Link logits are scored jointly via a large (N×N) score matrix and optimized with circle-loss to balance sparse positives and dense negatives. All candidate labels are decoded in parallel in a single model pass (Wei et al., 2023).
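The circle-loss idea behind this kind of dense link scoring can be sketched in pure Python: a handful of positive links must be pushed above a zero threshold while the many negative cells of the N×N score matrix are pushed below it, in one joint log-sum-exp term per side. This is a simplified scalar sketch of the objective family, under assumed names; PPN applies it per link type on tensors.

```python
import math

def circle_loss(scores, positives):
    """scores: N x N list of link logits; positives: set of (i, j) index pairs.

    loss = log(1 + sum_neg exp(s_neg)) + log(1 + sum_pos exp(-s_pos))
    Positives are driven above 0 and negatives below 0 jointly, which keeps
    the few true links from being swamped by the dense negatives.
    """
    pos_term = sum(math.exp(-scores[i][j]) for (i, j) in positives)
    neg_term = sum(math.exp(scores[i][j])
                   for i in range(len(scores))
                   for j in range(len(scores[i]))
                   if (i, j) not in positives)
    return math.log1p(neg_term) + math.log1p(pos_term)
```

Because both terms are log-sum-exp aggregates rather than per-cell cross-entropies, the loss stays well-scaled even when negatives outnumber positives by orders of magnitude.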
Generative LLM-based approaches (e.g., Mistral-7B in KVP10k) receive OCR-extracted lines and bounding boxes, and are trained to output a JSON-like serialization of all discovered key–value pairs. This enables handling of dynamic keys/values and supports unkeyed/unvalued extraction (Naparstek et al., 2024).
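The generative interface reduces to two simple steps: serialize the OCR lines (with boxes) into the prompt, then parse the model's JSON output back into pairs, using a null key or value for unkeyed/unvalued entries. The prompt wording and output schema below are assumptions for illustration, not the KVP10k baseline's exact format.

```python
import json

def build_prompt(ocr_lines):
    # ocr_lines: list of (text, (x1, y1, x2, y2)) tuples from an OCR engine
    body = "\n".join(f"{t} @ {b}" for t, b in ocr_lines)
    return ("Extract all key-value pairs from the document lines below as a "
            "JSON list of {\"key\": ..., \"value\": ...} objects.\n" + body)

def parse_output(raw):
    # Missing "key" or "value" fields mark unkeyed values / unvalued keys
    return [(item.get("key"), item.get("value")) for item in json.loads(raw)]
```

In practice the parser also needs to tolerate malformed generations (truncated JSON, duplicate keys), which is one of the main failure modes reported for generative KVE baselines.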
3.3 Layout-aware, Content-aware, and Multimodal LLMs
Instruction-tuned MLLMs such as VDInstruct (Nguyen et al., 13 Jul 2025) use explicit spatial ROI detection and content-aware tokenization to selectively encode only relevant regions of the document image. Key features include:
- Content-aware tokenization: Region proposals (from Faster R-CNN) define ROIs; each receives a minimal, complexity-adaptive set of spatial and semantic tokens. Typical token reduction is ∼3.6× over patch-based tiling.
- Dual vision encoder: Separate spatial and semantic channels, with pooled text-region, vision-region, and global tokens.
- Fusion with LLM: Tokens from the vision backbone are projected and concatenated with user instructions, then passed to a Vicuna (or similar) decoder for generative key–value output.
- Three-stage curriculum: Layout pretraining (ROI detection), feature learning (VQA-style parsing), and final instruction tuning on large, annotated KIE tasks. This results in SOTA zero-shot performance and strong efficiency.
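The token savings of content-aware tokenization can be made concrete with a back-of-envelope comparison: patch tiling spends tokens uniformly over the whole page, while ROI-based encoding spends a small, area-adaptive budget only on detected regions. The budget formula below is purely an assumption for illustration, not VDInstruct's actual allocation.

```python
def patch_tokens(page_w, page_h, patch=32):
    # Uniform tiling: one token per patch, regardless of content
    return (page_w // patch) * (page_h // patch)

def roi_tokens(rois, base=4, per_kilopixel=0.5, cap=64):
    # rois: list of (x1, y1, x2, y2); budget grows with region area, capped
    total = 0
    for x1, y1, x2, y2 in rois:
        area = (x2 - x1) * (y2 - y1)
        total += min(cap, base + int(area / 1000 * per_kilopixel))
    return total
```

On a mostly-empty form page, a handful of ROIs yields a token count far below the uniform tiling budget, which is the effect behind the reported ∼3.6× reduction.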
4. Benchmarking, Metrics, and Error Analysis
Historically, KVE models have been evaluated with precision, recall, and F1 at the key–value pair level, with “match” criteria defined by both text similarity and spatial overlap (Naparstek et al., 2024). For more complex documents, additional grouping structure, e.g., tables or line-items, must be preserved.
Advanced Evaluation: KIEval Metric
KIEval (Khang et al., 7 Mar 2025) introduces application-centric, hierarchical metrics:
- Entity-Level F1: Standard matching of (key, value) pairs.
- Group-Level F1: Evaluates correct grouping of multiple extracted entities (e.g., line-item association).
- Aligned Correction-Cost Variant: Directly measures the minimal number of human corrections (substitutions, additions, deletions) required to reach ground truth, aligning model evaluation with practical RPA settings.
KIEval exposes shortcomings of entity-F1, especially for group-structured information extraction. On datasets with explicit grouping (e.g., CORD), entity-F1 may overstate model utility: group-F1 is significantly lower and correlates tightly with true correction cost.
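The gap between the two levels is easy to demonstrate on a toy example: entity-level F1 scores each (key, value) pair independently, while group-level F1 requires a whole group (e.g., one line-item) to be exactly correct. This is simplified exact-match scoring for illustration, not the official KIEval implementation.

```python
from collections import Counter

def f1(pred, gold):
    # Multiset precision/recall/F1 over hashable items
    pred, gold = Counter(pred), Counter(gold)
    tp = sum((pred & gold).values())
    if tp == 0:
        return 0.0
    p, r = tp / sum(pred.values()), tp / sum(gold.values())
    return 2 * p * r / (p + r)

def entity_f1(pred_groups, gold_groups):
    # Flatten groups and score pairs individually
    flat = lambda gs: [pair for g in gs for pair in g]
    return f1(flat(pred_groups), flat(gold_groups))

def group_f1(pred_groups, gold_groups):
    # A group counts only if its full pair set is reproduced
    freeze = lambda gs: [tuple(sorted(g)) for g in gs]
    return f1(freeze(pred_groups), freeze(gold_groups))
```

Swapping the quantities between two line-items leaves every individual pair correct (entity F1 of 1.0) while every group is wrong (group F1 of 0.0), which is exactly the over-statement KIEval targets.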
5. Empirical Results, Generalization, and Methodological Insights
Recent models achieve strong gains over previous SOTA:
- KVPFormer: On FUNSD, 82.23% entity-linking F1; on XFUND (multi-language), 94.42% average F1—surpassing prior state of the art by 7.2 and 13.2 F1 points, respectively (Hu et al., 2023).
- PPN: On complex, zero-shot CLEX splits, 74.0% F1, outperforming Donut and QA by large margins. Few-shot adaptation produces F1 >83% with just 1 annotated sample per novel layout (Wei et al., 2023).
- VDInstruct: Achieves zero-shot F1 of 57.2% (vs. DocOwl 1.5 at 51.7%), overall F1 71.4% with only ~485 image tokens per page, and is robust to unseen document types (Nguyen et al., 13 Jul 2025).
- KVP10k Mistral-7B baseline: Combined (text+loc) F1 scores by type: regular KVP 0.611, unkeyed 0.584, unvalued 0.588, exposing challenges in variable layout and linkage (Naparstek et al., 2024).
- SIMARA (handwritten, segmentation-free): Micro-F1 ≈95% on most metadata fields; lower for rare categories (“arrangement” field F1=66.3%) (Tarride et al., 2023). Integrated architectures outperform classic two-stage systems, especially in the presence of rich spatial context (Tarride et al., 2023).
Ablation studies consistently show the importance of:
- Multimodal fusion (text, vision, layout)
- Explicit spatial modeling (ROI detection, 2D positional embeddings, spatial attention bias)
- Joint, end-to-end optimization of entity and linking objectives
Limiting factors include small training sets, lack of heavy augmentation for rare layout scenarios, absence of fine-grained spatial supervision (for historical and handwritten records), and sparse tagging degrading key–value-specific model performance (Vu et al., 2020, Tarride et al., 2023).
6. Open Challenges and Future Directions
Key areas for development and research include:
- Generalization across schemas: Enabling models to extract novel key–value types without retraining; supporting schema-free information extraction, as in KVP10k.
- Multimodal, instruction-following models: Further exploration of instruction-tuned MLLMs with content-adaptive image tokenization and explicit cross-region attention (Nguyen et al., 13 Jul 2025).
- Joint entity extraction and relation linking: Avoiding error propagation by fully integrating detection and linking, building on pointer networks and graph-based architectures (Wei et al., 2023).
- Balanced evaluation: Adoption of KIEval for industrial/document-centric pipelines—reporting both entity- and group-level F1, and including correction-cost metrics for deployment decisions (Khang et al., 7 Mar 2025).
- Handwritten and historical documents: Expanding segmentation-free, end-to-end learning with robust architectures and targeted semi-/self-supervised pretraining (Tarride et al., 2023, Tarride et al., 2023).
- Layout robustness: Aggressive synthetic augmentation, graph attention, and hybrid losses to improve resilience to visually or structurally diverse business documents (Naparstek et al., 2024).
- Active learning and rare entity discovery: Optimizing annotation efforts for rare key–value patterns and ambiguous document structures.
- Downstream integration: Evaluating extraction models within full RPA pipelines (for example, form-to-database ingestion), with group-aware and correction-aware metrics.
Overall, key–value extraction—through the synergy of advanced multimodal encoders, adaptive generation strategies, open and richly annotated datasets, and application-centric evaluation—has made substantial strides but continues to present evolving challenges in recognition, linking, generalization, and deployment.