This paper introduces LayoutXLM, a multimodal pre-trained model designed for understanding visually-rich documents across multiple languages. It addresses the limitation of previous models that were either text-only multilingual or multimodal but monolingual (primarily English). LayoutXLM extends the LayoutLMv2 architecture to handle multilingual documents by incorporating text, layout (2D position), and visual (image) information.
Model Architecture and Pre-training
- Architecture: LayoutXLM uses a multimodal Transformer architecture, similar to LayoutLMv2. It takes text embeddings, 2D position embeddings (derived from OCR bounding boxes), and visual embeddings (from a visual backbone such as ResNeXt applied to the document image) as input. These embeddings are combined and fed into a Transformer encoder with spatial-aware self-attention to jointly model the relationships between text, layout, and image features (a rough embedding sketch follows this list).
- Multilingual Adaptation: To handle different languages effectively, character-level bounding boxes are obtained using OCR. Then, for each token produced by a SentencePiece tokenizer, its bounding box is computed by merging the bounding boxes of its constituent characters. This unifies the input processing across languages with varying linguistic units (a box-merging sketch follows this list).
- Pre-training Objectives: LayoutXLM uses three pre-training objectives adapted from LayoutLMv2:
- Multilingual Masked Visual-Language Modeling (MMVLM): Random text tokens are masked, and the model predicts them based on the surrounding text, layout, and image context.
- Text-Image Alignment (TIA): Some text lines are randomly selected, their corresponding image regions are covered (masked) on the document image, and the model predicts whether each text token's corresponding image region is covered. This enforces fine-grained alignment.
- Text-Image Matching (TIM): The model predicts whether a given document image and its corresponding text actually belong to the same document page. This promotes coarse-grained alignment.
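To make the multimodal input described in the architecture bullet more concrete, the sketch below fuses word, 1D position, 2D layout, and visual embeddings into a single sequence. All names and dimensions (hidden size, coordinate grid, visual feature size, projection layer) are illustrative assumptions, not the released LayoutXLM implementation.

```python
import torch
import torch.nn as nn

class MultimodalEmbeddings(nn.Module):
    """Sketch: combine token, 1D position, 2D layout, and visual embeddings.
    Sizes (hidden=768, 1024-step coordinate grid, 2048-dim visual features)
    are illustrative assumptions, not the paper's exact configuration."""

    def __init__(self, vocab_size=250_002, hidden=768, max_seq_len=512, coord_bins=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_seq_len, hidden)
        # Separate lookup tables for the x and y coordinates of each token's box.
        self.x_emb = nn.Embedding(coord_bins, hidden)
        self.y_emb = nn.Embedding(coord_bins, hidden)
        self.visual_proj = nn.Linear(2048, hidden)  # e.g. pooled backbone feature channels

    def forward(self, token_ids, boxes, visual_feats):
        # token_ids: (B, T) int64; boxes: (B, T, 4) int64 coords in [0, coord_bins)
        # visual_feats: (B, V, 2048) features of V image regions
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        text = self.word_emb(token_ids) + self.pos_emb(positions)
        layout = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
                  + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        visual = self.visual_proj(visual_feats)
        # Text tokens carry text + layout information; visual tokens are appended,
        # and the Transformer encoder attends over the concatenated sequence.
        return torch.cat([text + layout, visual], dim=1)
```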
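The character-to-token box merging from the multilingual adaptation step can be sketched as follows. The helper is hypothetical; it assumes the OCR characters and the SentencePiece pieces cover the same text in the same order.

```python
def merge_char_boxes(char_boxes, tokens):
    """Sketch: assign one bounding box per SentencePiece token by taking the
    union of the boxes (x0, y0, x1, y1) of its constituent OCR characters.
    Assumes `tokens`, with the '▁' word-boundary marker stripped, concatenate
    to the same character sequence that `char_boxes` describes."""
    token_boxes, cursor = [], 0
    for token in tokens:
        piece = token.replace("\u2581", "")   # strip SentencePiece's boundary marker
        span = char_boxes[cursor:cursor + len(piece)]
        cursor += len(piece)
        if not span:                          # e.g. a marker-only token with no visible characters
            token_boxes.append((0, 0, 0, 0))
            continue
        token_boxes.append((
            min(b[0] for b in span), min(b[1] for b in span),
            max(b[2] for b in span), max(b[3] for b in span),
        ))
    return token_boxes
```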
Pre-training Data: The model was pre-trained on a large dataset comprising 30 million documents:
- 22 million publicly available, digital-born PDF documents in 53 languages, collected following Common Crawl's principles and policies, parsed with PyMuPDF, and language-detected with BlingFire.
- 8 million scanned English documents from the IIT-CDIP dataset.
- Data was sampled across languages using a strategy similar to XLM and InfoXLM to balance high- and low-resource languages (a sampling sketch follows this list).
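One common way to realize this balancing (used by XLM-style models) is exponential smoothing of each language's share of the data. The sketch below is illustrative; the smoothing exponent and the document counts are assumptions, not values from the paper.

```python
import random

def language_sampling_probs(doc_counts, alpha=0.7):
    """Exponentially smoothed sampling distribution over languages.
    alpha < 1 up-samples low-resource languages relative to their raw share;
    the value 0.7 is illustrative, not taken from the paper."""
    total = sum(doc_counts.values())
    smoothed = {lang: (count / total) ** alpha for lang, count in doc_counts.items()}
    norm = sum(smoothed.values())
    return {lang: p / norm for lang, p in smoothed.items()}

# Example: choose the language of the next pre-training batch (counts are made up).
counts = {"en": 8_000_000, "zh": 3_000_000, "pt": 300_000}
probs = language_sampling_probs(counts)
languages, weights = zip(*probs.items())
next_language = random.choices(languages, weights=weights, k=1)[0]
```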
XFUND Benchmark
- To evaluate multilingual performance, the paper introduces the XFUND benchmark, an extension of the English FUNSD dataset.
- Languages: XFUND includes human-annotated forms in 7 languages: Chinese, Japanese, Spanish, French, Italian, German, and Portuguese.
- Task: The primary task is key-value extraction, divided into:
- Semantic Entity Recognition (SER): Identifying and classifying text segments into predefined categories (e.g., HEADER, QUESTION, ANSWER). This is framed as a sequence labeling task using the BIO format.
- Relation Extraction (RE): Identifying links between semantic entities, specifically key-value relationships. This is treated as a classification problem over entity pairs, using a biaffine attention classifier (a minimal sketch follows this list).
- Data: XFUND contains 1,393 forms (199 per language), split into 149 for training and 50 for testing per language. Templates were collected online, filled with synthetic data (typed or handwritten), scanned, OCR'd using the Microsoft Read API, and manually annotated.
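For the RE task, a biaffine attention classifier scores pairs of entity representations. The module below is a minimal PyTorch sketch of such a scorer; the hidden size, projections, and label set are assumptions, not the exact head used in the paper.

```python
import torch
import torch.nn as nn

class BiaffineRelationClassifier(nn.Module):
    """Sketch of a biaffine scorer for key-value relation extraction.
    Scores each (head, tail) entity pair; label 1 could mean "key-value link"."""

    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.head_proj = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())
        self.tail_proj = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())
        # Bilinear tensor; the extra +1 dimension plays the role of a bias feature.
        self.bilinear = nn.Parameter(torch.empty(hidden_size + 1, num_labels, hidden_size + 1))
        nn.init.xavier_uniform_(self.bilinear)

    def forward(self, head_repr, tail_repr):
        # head_repr, tail_repr: (num_pairs, hidden_size) pooled entity representations
        h = self.head_proj(head_repr)
        t = self.tail_proj(tail_repr)
        ones = h.new_ones(h.size(0), 1)
        h = torch.cat([h, ones], dim=-1)  # (num_pairs, hidden_size + 1)
        t = torch.cat([t, ones], dim=-1)
        # Biaffine score for every label: h^T U_l t  ->  (num_pairs, num_labels)
        return torch.einsum("bi,ilj,bj->bl", h, self.bilinear, t)
```

At inference, a softmax over the label dimension decides whether a candidate (question, answer) pair is linked.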
Experiments and Results
- LayoutXLM (Base and Large versions) was compared against strong multilingual text-only baselines (XLM-RoBERTa, InfoXLM).
- Evaluation Settings:
1. Language-specific fine-tuning: Training and testing on the same target language.
2. Zero-shot transfer: Training only on the English FUNSD dataset and testing on the other XFUND languages.
3. Multitask fine-tuning: Training on all 8 languages (FUNSD + XFUND) simultaneously and testing on each language.
- Results: LayoutXLM significantly outperformed the text-only baselines across all languages and settings for both SER and RE tasks.
- In language-specific fine-tuning, LayoutXLM-Large achieved an average F1 of 82.82% for SER and 72.06% for RE across the 7 XFUND languages, compared to 74.71%/60.02% for InfoXLM-Large.
- Zero-shot results showed strong transfer capabilities, with LayoutXLM-Large achieving 61.15% SER / 54.87% RE average F1, demonstrating its ability to generalize layout understanding across languages.
- Multitask fine-tuning further boosted performance, yielding the best results (e.g., 84.29% SER / 84.58% RE average F1 for LayoutXLM-Large), indicating that the model benefits from shared layout patterns across different languages.
Conclusion
LayoutXLM effectively combines text, layout, and image information for multilingual document understanding. Its pre-training on diverse, multilingual documents allows it to outperform text-only models and generalize well across languages, as demonstrated on the newly introduced XFUND benchmark. The model and dataset were made publicly available to facilitate further research.