XFormParser: Unified Multimodal KIE
- XFormParser is a unified multimodal, multilingual framework that jointly performs semantic entity recognition (SER) and relation extraction (RE) using a Transformer-based architecture.
- It integrates spatial, visual, and textual features via 2D embeddings and a Bi-LSTM-enhanced biaffine decoder, significantly improving SER and RE F1 scores over previous state-of-the-art methods.
- The framework supports both monolingual and cross-lingual applications, demonstrating robust scalability, high reproducibility, and industrial-grade performance.
XFormParser is a multimodal, multilingual framework for Key Information Extraction (KIE) from semi-structured form documents, implemented as a unified Transformer-based architecture that jointly addresses Semantic Entity Recognition (SER) and Relation Extraction (RE). Its primary contributions include the integration of spatial and visual features within a robust sequence model, a Bi-LSTM enhanced biaffine decoder for entity relations, and a supervised fine-tuning dataset tailored to industrial use cases. The system delivers substantial empirical performance improvements over previous state-of-the-art (SOTA) approaches in both monolingual and cross-lingual scenarios, with demonstrated scalability and reproducibility (Cheng et al., 2024).
1. Model Architecture and Key Components
XFormParser is constructed atop the LayoutXLM backbone, a multilingual, layout-aware extension of the LayoutLM family. This foundation is augmented with 2D position embeddings that encode token-level spatial coordinates and per-bounding-box image embeddings derived from cropped image patches around OCR-detected words. The input representation for each cell is the tuple (text, bounding box), tokenized and embedded into a feature sequence H = (h₁, …, hₙ), where n is the total number of OCR tokens.
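The fusion of text, 2D position, and image features can be sketched as follows. This is a toy illustration with random stand-in matrices and made-up sizes, not the actual LayoutXLM embedding layer; the 0..1000 coordinate bucketing mirrors LayoutLM-style models.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                       # hidden size (toy; real models use e.g. 768)
vocab, pos_buckets = 100, 1001  # coordinates are bucketed to 0..1000

# Hypothetical lookup tables standing in for learned embedding matrices.
tok_emb = rng.normal(size=(vocab, d))
x_emb   = rng.normal(size=(pos_buckets, d))
y_emb   = rng.normal(size=(pos_buckets, d))

def embed_cell(token_ids, box, page_w, page_h, img_patch_feat):
    """Fuse text, 2D-position, and image embeddings for one OCR cell.

    box = (x1, y1, x2, y2) in pixels; coordinates are normalized to the
    0..1000 grid before lookup, as in LayoutLM-style models.
    """
    x1, y1, x2, y2 = box
    bx1 = int(1000 * x1 / page_w); bx2 = int(1000 * x2 / page_w)
    by1 = int(1000 * y1 / page_h); by2 = int(1000 * y2 / page_h)
    spatial = x_emb[bx1] + x_emb[bx2] + y_emb[by1] + y_emb[by2]
    # Every token in the cell shares the cell's spatial and visual features.
    return np.stack([tok_emb[t] + spatial + img_patch_feat for t in token_ids])

H = embed_cell([5, 17], (40, 12, 220, 40), 1000, 800, rng.normal(size=d))
print(H.shape)  # (2, 16): one fused vector per token
```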
Unlike traditional pipelines that separate SER and RE subsystems, XFormParser employs a unified architecture in which both tasks operate over a shared contextual representation H. SER predictions are obtained by passing H through a dense layer followed by an MLP classifier, yielding per-token label assignments ŷᵢ = softmax(MLP(Hᵢ)).
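A minimal numpy sketch of such a dense + MLP classification head follows; the weight matrices and sizes here are random placeholders, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 5, 8, 5   # tokens, hidden size, entity labels (toy sizes)

# Hypothetical dense + MLP classifier head over the shared representation H.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, k))

def ser_head(H):
    """Per-token label distribution: softmax over an MLP on top of H."""
    z = np.tanh(H @ W1) @ W2
    z -= z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

H = rng.normal(size=(n, d))
probs = ser_head(H)
print(probs.shape, np.allclose(probs.sum(axis=1), 1.0))  # (5, 5) True
```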
For RE, entity embeddings are computed by mean pooling over H′ (a transformed H) for each entity and concatenating with the SER label predictions. These entity representations feed a Bi-LSTM decoder, after which separate MLPs generate "head" and "tail" entity vectors, which a deep biaffine scorer combines into pairwise relation scores s(i, j). The loss function combines cross-entropy objectives for both SER and RE with equal weighting: L = L_SER + L_RE.
The Bi-LSTM decoder is critical for modeling sequential dependencies and context across entities, particularly in multilingual settings with spatially aligned but script-diverse tokens. Ablation experiments show that removing this module significantly degrades RE F1 (from 91.24 to 79.79).
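The pairwise scoring step can be illustrated with the standard deep-biaffine form; the exact parameterization in XFormParser may differ, and the tensor U below is a random stand-in for a learned parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 3   # entity vector dim, relation-label count (toy sizes)

# Hypothetical learned parameter of a deep biaffine scorer.
U = rng.normal(size=(r, d + 1, d + 1))   # bilinear term (+1 for a bias feature)

def biaffine_scores(head_vecs, tail_vecs):
    """Score every (head, tail) entity pair for every relation label.

    s[k, i, j] = [h_i; 1]^T U_k [t_j; 1], the standard deep-biaffine form.
    """
    h = np.concatenate([head_vecs, np.ones((len(head_vecs), 1))], axis=1)
    t = np.concatenate([tail_vecs, np.ones((len(tail_vecs), 1))], axis=1)
    return np.einsum('id,rde,je->rij', h, U, t)

heads = rng.normal(size=(4, d))   # 4 candidate head entities
tails = rng.normal(size=(4, d))
S = biaffine_scores(heads, tails)
print(S.shape)  # (3, 4, 4): one score matrix per relation label
```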
2. Pre-training and Supervised Fine-tuning
LayoutXLM is pre-trained with multiple multimodal objectives:
- Masked Language Modeling (MLM): prediction of masked tokens from the surrounding multimodal (text, layout, image) context.
- Text-Image Alignment: binary prediction of correspondence between tokens and image patches.
- (Optional) Text-Image Matching: discriminative prediction of document consistency.
Fine-tuning leverages the InDFormSFT supervised dataset, comprising 562 semi-structured forms in Chinese and English from 8 industrial verticals (finance, manufacturing, education, government, etc.). Each OCR cell is annotated with a bounding box, text content, entity label (one of QUESTION, ANSWER, SINGLE, TITLE, ANSWERNUM), and key-value relation links. The data is partitioned 422/70/70 for train/dev/test, with the training set containing 6,702 Question and 6,825 Answer entities, and 11,806 one-to-one relations.
Table: InDFormSFT Annotation Schema
| Field | Description | Example Value / Notes |
|---|---|---|
| box | Four-point bounding box | [x₁, y₁, x₂, y₂] |
| text | OCR cell content | 'Invoice No.' |
| label | Entity type | QUESTION, ANSWER, SINGLE, etc. |
| linking | Key–value pair indices | [(question_id, answer_id), …] |
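A record following this schema can be parsed with the standard library alone. The JSON below is a hypothetical InDFormSFT-style example constructed to match the table above (field names `box`, `text`, `label`, `linking`); the actual files in the repository may wrap records differently.

```python
import json

# A hypothetical record mirroring the annotation schema above.
record = json.loads("""
{
  "cells": [
    {"id": 0, "box": [10, 20, 120, 40], "text": "Invoice No.",
     "label": "QUESTION", "linking": [[0, 1]]},
    {"id": 1, "box": [130, 20, 260, 40], "text": "A-2024-0917",
     "label": "ANSWER", "linking": [[0, 1]]}
  ]
}
""")

# Collect entity labels and deduplicated key-value links.
labels = {c["id"]: c["label"] for c in record["cells"]}
links = sorted({tuple(l) for c in record["cells"] for l in c["linking"]})
print(labels, links)  # {0: 'QUESTION', 1: 'ANSWER'} [(0, 1)]
```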
The fine-tuning protocol uses RoBERTa tokenization (maximum 512 tokens), the AdamW optimizer (weight decay 0.1, LR=5e-5), batch size 8, and 100 epochs. Importantly, a soft-label warm-up strategy commences at epoch 30 and runs for 51 epochs; ablation studies validate this timing as optimal for RE F1.
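The warm-up schedule can be pictured as a mixing weight that is zero before the start epoch and then ramps toward one. The linear ramp below is an assumption for illustration; the paper does not fix the exact ramp shape here.

```python
def soft_label_weight(epoch, start=30, duration=51):
    """Hypothetical soft-label warm-up weight.

    Returns 0 before `start`, then ramps linearly to 1 over `duration`
    epochs. The linear shape is an illustrative assumption.
    """
    if epoch < start:
        return 0.0
    return min(1.0, (epoch - start) / duration)

print(soft_label_weight(10), soft_label_weight(30), soft_label_weight(81))
# 0.0 0.0 1.0
```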
3. Evaluation: Benchmarks and Metrics
XFormParser is assessed on standard KIE benchmarks:
- FUNSD: English forms (149 train / 50 test)
- XFUND: Multilingual forms (7 languages × 149 train / 50 test)
- InDFormSFT: Industrial, Chinese + English
SER is evaluated using BIO-tagging F1 and cell-level accuracy (CA), with CA = N_correct / N_total, where N_correct is the count of correctly classified cells and N_total the total. RE is evaluated by relation-level F1 over predicted key-value pairs, computed as F1 = 2PR / (P + R) from relation precision P and recall R.
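Both metrics are straightforward to compute; a minimal sketch over label lists and (head, tail) pair sets:

```python
def cell_accuracy(pred_labels, gold_labels):
    """CA = N_correct / N_total over cells."""
    correct = sum(p == g for p, g in zip(pred_labels, gold_labels))
    return correct / len(gold_labels)

def relation_f1(pred_pairs, gold_pairs):
    """Relation-level F1 over predicted (head, tail) pairs."""
    pred, gold = set(pred_pairs), set(gold_pairs)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(cell_accuracy(["Q", "A", "A"], ["Q", "A", "Q"]))   # 0.666...
print(relation_f1([(0, 1), (2, 3)], [(0, 1), (4, 5)]))   # 0.5
```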
4. Experimental Results and Analysis
Language-specific Fine-tuning
When fine-tuned and evaluated on the same language (FUNSD, XFUND), XFormParser achieves 92.46% SER F1 (LayoutXLM: 79.40%) and 91.24% RE F1 (LayoutXLM: 54.83%), with maximum observed improvements of +6.53 pp SER F1 and +9.64 pp RE F1 over prior SOTA in per-language settings. The paper also highlights a measured 1.79 pp RE F1 gain over the previous best multimodal layout model (Cheng et al., 2024).
Multilingual Fine-tuning
Joint training across all 8 languages (FUNSD English plus the 7 XFUND languages) delivers average SER F1 of 91.67% and RE F1 of 95.89%, both outperforming LiLT by substantial margins (SER: 85.85%, RE: 81.25%). The integration of multilingual signals and the Bi-LSTM decoder is shown to be effective for robust cross-lingual transfer.
Zero-shot Transfer
After monolingual fine-tuning on FUNSD (English), the zero-shot transfer to the remaining 7 XFUND languages yields mean SER F1 of 71.35% (LiLT: 60.61%) and RE F1 of 81.18% (GOSE: 67.29%), signifying substantial resilience for cross-lingual deployment.
Ablation and Error Analysis
- Removing RE task supervision minimally reduces SER F1 (92.46 → 91.89) but disables RE output.
- Removing SER task supervision leaves RE functional with a minor drop (91.24 → 90.90).
- Bi-LSTM decoder removal results in severe RE F1 degradation (91.24 → 79.79).
- Removing soft-label warm-up decreases RE F1 by ≈0.5.
- Varying the soft-label schedule shows that starting warm-up at epoch 30 yields the best RE F1 (93.14).
Qualitative analyses include visualizations (entities as orange boxes; key-value pairs as arrows) confirming accurate detection even in visually complex forms.
5. Implementation and Reproducibility
OCR preprocessing is compatible with any off-the-shelf OCR, yielding token + bounding box inputs. A standard RoBERTa tokenizer (512 tokens max) is employed. Training uses AdamW optimization, a linear learning rate schedule, batch size of 8, and can be completed within 12 hours on a single NVIDIA A100 GPU. The LayoutXLM backbone contains approximately 200 million parameters.
The full codebase, pre-trained models (including LayoutXLM and fine-tuned checkpoints), and datasets are available at https://github.com/zhbuaa0/xformparser. The InDFormSFT dataset, with corresponding JSON annotation files, is accessible in the repository for immediate experimentation and benchmarking (Cheng et al., 2024).
6. Applications, Limitations, and Extensions
XFormParser is designed for robust KIE from semi-structured documents across a broad set of domains, including industry, government, and education. The architectural focus on layout and multimodality, together with the unified SER+RE paradigm, ensures strong recall in both monolingual and multilingual, as well as zero-shot, contexts. The relatively lightweight design and support for efficient inference (including on resource-constrained hardware) enhance its practical deployment potential.
A plausible implication is that the joint modeling of entity and relation tasks, reinforced by layout-centric visual-language representations and context-aware sequence decoders, may be generalizable to other document intelligence settings beyond form parsing. Limitations associated with OCR quality, extreme text density, and rare script variants may warrant further research.
7. Summary
XFormParser operationalizes a lightweight but powerful multimodal architecture for semi-structured document parsing, combining the LayoutXLM transformer with a Bi-LSTM-enhanced, biaffine relation decoder and joint SER+RE training. Its design, validated by the InDFormSFT dataset and extensive benchmarking, yields performance gains of up to +9.64 percentage points in relation extraction F1 over previous SOTA, with strong cross-lingual and zero-shot generalization. Full resources are publicly released for replication and further research (Cheng et al., 2024).