Key Information Extraction (KIE)

Updated 24 November 2025
  • Key Information Extraction (KIE) is the process of automatically identifying, localizing, and structuring key semantic content from unstructured documents.
  • It integrates multimodal techniques such as sequence tagging, graph reasoning, and OCR-free generative models to handle diverse layouts and complex document structures.
  • Recent advances in KIE improve extraction accuracy, mitigate challenges like template bias and OCR errors, and enable scalable automation in document analytics.

Key Information Extraction (KIE) is the process of automatically identifying, localizing, and structuring salient semantic content—typically field values, entities, and their relations—from unstructured or semi-structured documents such as receipts, invoices, contracts, and forms. KIE underpins document automation workflows, enabling downstream analytics and decision support by transforming raw visual or multi-modal document data into structured, machine-readable formats. The field integrates natural language processing, computer vision, and graph learning methodologies, and is challenged by document diversity, complex layouts, label imbalance, domain privacy, and the need for robust generalization.

1. Formal Problem Statement and Core Concepts

KIE aims to extract facts, such as key–value pairs (e.g., {"Invoice No.": "933021"}), spans, and groupings, from a variety of visually-rich document types. The KIE pipeline may begin with OCR (for text localization), but recent advances include OCR-free approaches operating directly on images. Formally, given a document image I (or PDF) and optionally its OCR tokens T = ⟨t_1, ..., t_N⟩ with layout coordinates, the KIE system predicts:

  • For each entity type e in a predefined schema E = {e_1, ..., e_k}, the set of field values V^e as text spans or boxes;
  • (Optional) relation or group assignments G = {g_1, ..., g_m} such that fully-formed information units can be constructed (e.g., line-items with name–qty–price grouping in receipts (Khang et al., 7 Mar 2025)).
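The prediction targets above can be sketched as a simple output schema. This is a minimal illustration, not a standard interface; the entity-type names and field layout are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    """One extracted field value: its entity type, text span, and optional box."""
    entity_type: str              # an entity type e from the schema E, e.g. "invoice_no"
    text: str                     # the extracted value, e.g. "933021"
    box: Optional[tuple] = None   # (x0, y0, x1, y1) in image coordinates

@dataclass
class Group:
    """A fully-formed information unit g, e.g. one receipt line-item."""
    fields: list  # Fields that belong together (name, qty, price, ...)

@dataclass
class KIEOutput:
    fields: list  # all extracted Field objects (the sets V^e, flattened)
    groups: list  # optional group assignments G over those fields

out = KIEOutput(
    fields=[Field("invoice_no", "933021", (120, 40, 210, 60))],
    groups=[],
)
```

A system that only does field extraction leaves `groups` empty; grouping-aware systems (Section 6) populate it with line-item units.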

Methods span discriminative sequence labeling, generative models with prompt-based supervision, and token–box/region classification as well as structured matching and assignment optimization. Several lines of work highlight the need to model both text semantics and physical layout, and in some domains, multimodal visual features are critical for performance.

2. Methodological Approaches and Architectures

A wide spectrum of architectures has been developed for KIE. Major categories include:

  • Sequence Tagging Models: Traditional models tag each token with a field label (BIO/IOB schema), often using pre-trained Transformers (BERT, RoBERTa) enhanced with 2D positional embeddings for layout awareness. LayoutLM, LAMBERT, and LayoutLMv2/v3 (Stanisławek et al., 2021, Townsend et al., 29 Mar 2024) extend classic NLP token classifiers with multimodal cues.
  • Graph-based Methods: KIE is recast as reasoning over a document graph, with nodes as text segments and edges representing spatial/layout or semantic relations. PICK (Yu et al., 2020), GraphRevisedIE (Cao et al., 2 Oct 2024), and SDMG-R (Sun et al., 2021) exemplify approaches that jointly aggregate text, layout, and image features through graph convolution and attention to capture non-local context and improve entity linking.
  • Region-based and Pointer-based Extraction: Formulating KIE as 2D region or pointer prediction, as in RDU (Zhu et al., 2022) and PPN (Wei et al., 2023), allows fine control over non-sequential, table-heavy, or irregular layouts by predicting bounding boxes or pointer links in token graphs. These methods overcome the brittleness of sequence flattening and handle cases where entire table columns or discontinuous regions constitute a single field.
  • Generative and OCR-free Models: Emerging systems like GenKIE (Cao et al., 2023) and STNet (Liu et al., 29 Sep 2024) use sequence-to-sequence architectures that generate output strings conditioned on multimodal document encodings, bypassing explicit token annotation. OCR-free approaches treat the document as an image, leveraging vision transformers and spatial grounding mechanisms. VDInstruct (Nguyen et al., 13 Jul 2025) further decouples layout detection from semantic extraction via content-aware tokenization.
  • Optimization-based and One-shot Methods: Assignment optimization using mixed-integer programming enables KIE under rigid template priors without learning, as in (Cooney et al., 2023). For rapid adaptation to novel layouts or classes, graph-matching and one-shot transfer (DKIE (Yao et al., 2021)) align support and query documents through multi-modal affinity and combinatorial solvers, supporting transfer with minimal labeled data.
  • Rule-based and Hybrid Systems: To address label scarcity and domain priors, rule-based correction is effectively combined with deep learning (“DL+rules”), yielding substantial boosts in accuracy on numerically-grounded fields and in the presence of noisy/OCR-prone data (Arroyo et al., 2022).
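For the sequence-tagging family above, the final step is decoding per-token BIO labels into field values. The sketch below shows that decoding step only (model inference is out of scope); the tag names are hypothetical:

```python
def decode_bio(tokens, tags):
    """Collect BIO-tagged tokens into (entity_type, text) field values.

    A 'B-X' tag opens a new span of type X; 'I-X' continues it;
    'O' (or a type mismatch) closes the current span.
    """
    fields, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                fields.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)
        else:  # 'O', or an I- tag inconsistent with the open span
            if current_type is not None:
                fields.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        fields.append((current_type, " ".join(current_tokens)))
    return fields

tokens = ["Invoice", "No", ":", "933021", "Total", "$", "12.50"]
tags   = ["O", "O", "O", "B-invoice_no", "O", "B-total", "I-total"]
# decode_bio(tokens, tags) → [("invoice_no", "933021"), ("total", "$ 12.50")]
```

The brittleness noted for region- and pointer-based methods stems from exactly this flattening: a table column split across reading order cannot be recovered as one contiguous BIO span.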

3. Datasets, Benchmarks, and Evaluation Protocols

Robust KIE development relies on diverse annotated corpora and appropriate evaluation strategies. Key public benchmarks are:

| Dataset | Type | Size/Stats | Challenge Highlights |
|---|---|---|---|
| SROIE [ICDAR 2019] | Receipts | 626 train / 347 test | Simple fields, heavy template overlap (Laatiri et al., 2023) |
| FUNSD | Forms | 149 train / 50 test | Varied component types |
| CORD | Receipts | 800 train / 100 val / 100 test | Complex tables, multi-label |
| Kleister (NDA/Charity) (Stanisławek et al., 2021) | Long contracts/reports | NDA: 540, Charity: 2,788 | Multi-page, sparse fields |
| RealKIE (Townsend et al., 29 Mar 2024) | Industry docs | 5 domains, variable length | Sparse labels, long context |
| CLEX (Wei et al., 2023) | Complex forms | 5,860 images, 1,162 categories | High layout/entity diversity |
| WildReceipt (Sun et al., 2021) | Receipts | 1,740 images, 25 key/value types | Template-free, wild images |
| Business-License (Cao et al., 2 Oct 2024) | Licenses | 320 real + 500 synthetic | Cross-template variation |

Key findings indicate that template leakage between train/test splits leads to inflated F1 scores; up to 75% template overlap is reported for SROIE (Laatiri et al., 2023). Revised splits by clustering template signatures reveal drops of 10–20 F1 points for non-layout-aware models. It is therefore recommended to audit and restructure splits for robust generalization assessment.
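The split audit described above can be approximated by hashing a coarse layout signature per document and checking how many test signatures also occur in train. This is a deliberately crude sketch (the cited work clusters richer template features); the grid signature here is an illustrative assumption:

```python
def template_signature(boxes, grid=4):
    """Crude layout signature: which cells of a coarse grid contain text.

    boxes: list of (x0, y0, x1, y1) token boxes normalized to [0, 1].
    Documents sharing a template tend to place text in the same cells.
    """
    cells = set()
    for x0, y0, x1, y1 in boxes:
        cells.add((int(min(x0, 0.999) * grid), int(min(y0, 0.999) * grid)))
    return frozenset(cells)

def template_overlap(train_docs, test_docs):
    """Fraction of test documents whose layout signature also occurs in train."""
    train_sigs = {template_signature(d) for d in train_docs}
    shared = sum(template_signature(d) in train_sigs for d in test_docs)
    return shared / max(len(test_docs), 1)
```

A high overlap score flags a split where reported F1 partly measures template memorization rather than generalization, which is the failure mode the revised SROIE splits expose.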

For evaluation, conventional metrics include field/entity-level precision, recall, and F1 (often at token-level). Recent work introduces structure-sensitive metrics, such as KIEval (Khang et al., 7 Mar 2025), which account for correctness of entity groupings and human correction cost, better reflecting industrial requirements.
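The conventional field-level metrics reduce to exact matching of predicted against gold (entity_type, value) pairs; a minimal sketch:

```python
from collections import Counter

def field_prf1(predicted, gold):
    """Entity-level precision/recall/F1 over (entity_type, value) pairs.

    Exact match; duplicate fields are handled via multiset intersection.
    """
    pred_c, gold_c = Counter(predicted), Counter(gold)
    tp = sum((pred_c & gold_c).values())          # true positives
    precision = tp / max(sum(pred_c.values()), 1)
    recall = tp / max(sum(gold_c.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

pred = [("invoice_no", "933021"), ("total", "12.50")]
gold = [("invoice_no", "933021"), ("total", "12.05")]
# field_prf1(pred, gold) → (0.5, 0.5, 0.5): one exact match out of two fields each
```

Note how the off-by-one OCR error on "total" costs the full field, which is why structure-sensitive metrics such as KIEval argue for correction-cost-aware scoring instead.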

4. Key Advances and Empirical Results

Recent systems achieve state-of-the-art (SOTA) KIE performance by integrating multimodal fusion, explicit structure modeling, and robust training protocols. Representative results:

  • Graph and Transformer Hybrids: PICK (Yu et al., 2020) achieves 96.1% entity F1 on SROIE, outperforming LayoutLM baselines.
  • OCR-free Vision Models: STNet (Liu et al., 29 Sep 2024) surpasses prior models on CORD (F1=88.1), SROIE (F1=87.8), and DocVQA, coupling content extraction with precise visual grounding.
  • Content-aware Tokenization: VDInstruct (Nguyen et al., 13 Jul 2025) demonstrates 3.6× reduction in token count and +5.5 F1 zero-shot gains on CORD and FUNSD over DocOwl 1.5, enabling scalable application to dense documents.
  • Generative KIE: GenKIE (Cao et al., 2023) attains 97.40 token F1 on SROIE and retains high accuracy (>94 F1) even with 50% OCR word corruption, leveraging prompt-based weak supervision.
  • Region-based Methods: RDU (Zhu et al., 2022) enables seamless adaptation across document types and improves performance for table–text mixed layouts by direct 2D box prediction.
  • Semi-supervised Learning: CRMSP (Zhang et al., 19 Jul 2024) tackles label imbalance and improves F1 on tail classes by 2–4 points over FixMatch and SimMatch through class-rebalancing and merged semantic label clustering.
  • One-shot & Zero-shot: PPN (Wei et al., 2023) and DKIE (Yao et al., 2021) exhibit high accuracy (up to 99.1%) on novel layouts with minimal support samples, employing pointer networks or partial graph matching.

See the table below for selected representative F1 scores on major benchmarks:

| Model | SROIE F1 | CORD F1 | FUNSD F1 | Notes |
|---|---|---|---|---|
| LayoutLMv2 | 96.25 | 94.95 | 82.76 | Text+layout+image baseline |
| PICK | 96.1 | – | – | Graph-based multimodal |
| GenKIE | 97.40 | 95.75 | 83.45 | Generative, prompt-based |
| STNet (OCR-free) | 87.8 | 88.1 | – | Vision-grounded extraction |
| GraphRevisedIE | 96.42 | 94.26 | 78.41 | GCN revision, light-weight |
| VDInstruct (zero-shot) | – | 57.2 | 57.2 | Content-aware, LLM |

5. Challenges and Open Problems

Despite progress, several challenges persist:

  • Generalization and Template Bias: High template overlap in public splits obscures true model generalization; cross-layout transfer and real-world deployment expose limitations of memorized patterns. Template-agnostic evaluation and adaptive architectures are gaining traction (Laatiri et al., 2023, Nguyen et al., 13 Jul 2025).
  • Sparse Labels and Label Imbalance: In industrial documents (e.g., RealKIE), most tokens are not associated with any field, and tail classes are underrepresented. Semi-supervised and class-rebalancing strategies (e.g., CRMSP (Zhang et al., 19 Jul 2024)) have improved rare field performance, but extreme low-resource classes remain challenging.
  • Table Structure and Grouping: Extraction across multi-row, nested, or irregular table regions remains error-prone. Pointer-based models (PPN (Wei et al., 2023)) and region-based architectures (RDU (Zhu et al., 2022), STNet (Liu et al., 29 Sep 2024)) improve extraction by modeling 2D and group structure directly.
  • Robustness to OCR Errors: Systems relying on OCR pipelines suffer from cascading errors. OCR-free vision models (STNet (Liu et al., 29 Sep 2024)) and generative correction (GenKIE (Cao et al., 2023)) demonstrate resilience to recognition noise. However, layout reconstruction and fine-grained bounding localization can still degrade in low-quality scans.
  • Privacy and Federated Training: With widespread enterprise deployment, privacy-preserving KIE has become a critical topic, though only recently addressed at scale (Saifullah et al., 2023).

6. Evaluation Methodologies and Industrial Alignment

Standard metrics such as exact-match F1 fail to capture the cost and usability of KIE deployment in industrial RPA. KIEval (Khang et al., 7 Mar 2025) introduces a group-aware, correction-cost-aligned metric, incorporating both field accuracy and group extraction correctness. This stepwise metric, fueled by bipartite matching and explicit alignment of predicted and gold groups, bridges the gap between research benchmarks and real-world utility, especially for line-item extraction and post-processing automation.

Practitioners are urged to calibrate automation thresholds with KIEval to optimize the trade-off between automation rate and user correction burden, and to ensure models output structured groupings in addition to per-field predictions.
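A group-aware score in the spirit described here can be sketched as follows. This is an illustration, not the published KIEval metric: predicted and gold groups are aligned one-to-one (brute force, so only suitable for a handful of groups; KIEval uses bipartite matching) and scored by shared (entity_type, value) fields:

```python
from itertools import permutations

def group_aware_f1(pred_groups, gold_groups):
    """Align predicted and gold groups one-to-one to maximize the number of
    shared (entity_type, value) fields, then compute F1 over fields.

    Each group is a set of (entity_type, value) tuples, e.g. one line-item.
    """
    if not pred_groups or not gold_groups:
        return 0.0
    n = max(len(pred_groups), len(gold_groups))
    pred = pred_groups + [set()] * (n - len(pred_groups))  # pad with empty groups
    gold = gold_groups + [set()] * (n - len(gold_groups))
    best_tp = 0
    for perm in permutations(range(n)):  # exhaustive alignment search
        tp = sum(len(pred[i] & gold[j]) for i, j in enumerate(perm))
        best_tp = max(best_tp, tp)
    n_pred = sum(len(g) for g in pred_groups)
    n_gold = sum(len(g) for g in gold_groups)
    precision = best_tp / max(n_pred, 1)
    recall = best_tp / max(n_gold, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

pred = [{("name", "Coffee"), ("price", "3.00")}]
gold = [{("name", "Coffee"), ("price", "3.00")}]
# group_aware_f1(pred, gold) → 1.0
```

Unlike per-field F1, this formulation penalizes a correct field assigned to the wrong line-item, which is exactly the error a human corrector has to move by hand.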

7. Future Directions

Research directions include:

  • Generalizable and Privacy-Preserving KIE: Models trained on foundation architectures under federated or differentially-private settings (Saifullah et al., 2023) supporting both utility and privacy constraints.
  • Instruction-Tuned LLMs for KIE: Integrating explicit instruction-following and content-aware token control (VDInstruct (Nguyen et al., 13 Jul 2025)), enabling zero-shot and task-adaptive document parsing.
  • KIE for Video and Multimodal Streams: Hierarchical KIE across video text (VKIE (An et al., 2023)) and time-variant layouts offers a new complexity dimension rarely addressed in static document KIE.
  • Compositional and Relational Extraction: Moving beyond pairs to structured graphs, relations, and hierarchical units (e.g., table parsing, cross-document coreference).
  • Robustness and Domain Adaptation: Pretraining/fine-tuning on noisy, highly varied, multilingual, and long-context documents as in enterprise and regulatory benchmarks (Townsend et al., 29 Mar 2024).

KIE research continues to evolve rapidly, with new datasets, architectures, and evaluation paradigms driving advances in automation, accuracy, and applicability in real-world, high-variance document environments.
