ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (2210.06155v2)

Published 12 Oct 2022 in cs.CL and cs.AI

Abstract: Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.

ERNIE-Layout: Enhancing Document Understanding with Layout Knowledge

The paper introduces ERNIE-Layout, a pre-training approach designed to enhance representation learning for visually-rich document understanding (VrDU). The method integrates text, layout, and image features, in contrast to earlier models that make limited use of layout information. The core of ERNIE-Layout is layout knowledge enhancement across the whole workflow, realized through layout-aware serialization and spatial-aware attention mechanisms.

Methodological Insights

ERNIE-Layout restructures the input sequence using layout-based document parsing. Rather than relying on the typical raster-scan serialization, which handles complex document layouts poorly, the model derives a more semantically coherent reading order. This serialization aligns better with human reading patterns and can improve the understanding of structured documents such as tables and forms.
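As a rough illustration of the difference between raster-scan serialization and a layout-aware ordering, the sketch below sorts token bounding boxes two ways. The `region_of` labels stand in for the output of an upstream layout-analysis step and are purely hypothetical; this is not the paper's serialization pipeline.

```python
# Minimal sketch (not the authors' implementation): raster-scan ordering vs. a
# layout-aware ordering driven by hypothetical region labels from a layout parser.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) token bounding box

def raster_scan_order(boxes: List[Box]) -> List[int]:
    """Typical OCR serialization: top-to-bottom, then left-to-right."""
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))

def layout_aware_order(boxes: List[Box], region_of: List[int]) -> List[int]:
    """Serialize by parsed layout region first (column, table cell, paragraph),
    then raster order within each region; `region_of` is assumed to already
    encode the regions' own reading order."""
    return sorted(range(len(boxes)),
                  key=lambda i: (region_of[i], boxes[i][1], boxes[i][0]))
```

For a two-column page, tokens in the left column would share a lower region index than those in the right column, so the layout-aware order finishes one column before starting the next, whereas raster scanning interleaves the two columns line by line.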

The architecture employs a multi-modal transformer equipped with spatial-aware disentangled attention, inspired by the DeBERTa model. This mechanism introduces a novel way to integrate layout dimensions into the attention process, fostering a nuanced cross-modal interaction between text/image data and layout features. By separating content from positional information, this disentangled approach substantially enriches the model’s capability to process spatially complex documents.
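The following is a simplified, single-head sketch of how spatial terms can be folded into DeBERTa-style disentangled attention. The tensor names, the distance-bucketing scheme, and the restriction to horizontal and vertical distances are illustrative assumptions, not the released PaddleNLP implementation.

```python
# Simplified single-head sketch of spatially disentangled attention scores.
# Shapes, bucketing, and parameter names are illustrative assumptions only.
import torch

def spatial_disentangled_scores(Hc, x_pos, y_pos, Wq, Wk, Px, Py, Wqp, Wkp):
    """Hc: (L, d) content states; x_pos, y_pos: (L,) integer box coordinates.
    Px, Py: (2R, d) relative-distance embedding tables; W*: (d, d) projections."""
    L, d = Hc.shape
    R = Px.shape[0] // 2
    Qc, Kc = Hc @ Wq, Hc @ Wk                               # content projections
    # pairwise relative distances, clipped to the embedding-table range
    rel_x = (x_pos[:, None] - x_pos[None, :]).clamp(-R, R - 1) + R   # (L, L)
    rel_y = (y_pos[:, None] - y_pos[None, :]).clamp(-R, R - 1) + R
    Qx, Kx = (Px @ Wqp)[rel_x], (Px @ Wkp)[rel_x]           # (L, L, d) spatial terms
    Qy, Ky = (Py @ Wqp)[rel_y], (Py @ Wkp)[rel_y]
    scores = Qc @ Kc.T                                      # content-to-content
    scores = scores + torch.einsum('ld,lmd->lm', Qc, Kx)    # content-to-position (x)
    scores = scores + torch.einsum('md,lmd->lm', Kc, Qx)    # position-to-content (x)
    scores = scores + torch.einsum('ld,lmd->lm', Qc, Ky)    # content-to-position (y)
    scores = scores + torch.einsum('md,lmd->lm', Kc, Qy)    # position-to-content (y)
    return scores / (5 * d) ** 0.5                          # scale across the five terms
```

Attention weights would then be the row-wise softmax of these scores applied to the content value projections as usual; the point is that content and spatial position contribute separate, disentangled terms to each pairwise score.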

Pre-training Tasks

ERNIE-Layout incorporates four pre-training tasks: masked visual-language modeling (MVLM), text-image alignment (TIA), reading order prediction (ROP), and replaced region prediction (RRP). ROP and RRP are the novel contributions of this work, explicitly targeting the alignment of reading order and the correspondence between replaced image regions and the associated text.

  • Reading Order Prediction (ROP): Uses the model's attention matrix to predict which token follows each token in the reading order, aligning the computed sequence with human reading habits (a loss sketch follows this list).
  • Replaced Region Prediction (RRP): Aims to enhance the model's understanding of the correspondence between image patches and linguistic tokens through direct manipulation of image components.
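To make the reading-order objective concrete, here is a hedged sketch of one way such a loss could be written, treating the head-averaged attention matrix of the final layer as, for each token, a predicted distribution over its successor in the gold reading order. The function name and the `next_index` labeling (with the last token pointing to itself) are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a reading-order-prediction loss; shapes and labels are assumptions.
import torch
import torch.nn.functional as F

def reading_order_loss(attn: torch.Tensor, next_index: torch.Tensor) -> torch.Tensor:
    """attn: (L, L) head-averaged attention probabilities from the final layer.
    next_index: (L,) gold successor of each token in the reading order
    (hypothetical labeling; the final token points to itself)."""
    log_probs = torch.log(attn.clamp_min(1e-9))   # rows as predicted successor distributions
    return F.nll_loss(log_probs, next_index)      # cross-entropy against gold successors
```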

Evaluation and Results

The model was subjected to extensive evaluation across several key VrDU tasks: key information extraction, document question answering, and document image classification. ERNIE-Layout markedly outperforms existing baselines, setting new state-of-the-art results across multiple datasets such as FUNSD, CORD, and Kleister-NDA. Notably, the model achieves substantial improvements in the key information extraction task, with distinct advancements in handling complex document layouts.

Implications and Future Prospects

The ERNIE-Layout model underscores the potential advantages of leveraging layout-centric knowledge in document understanding models. By realigning text and image data with spatial context, this model offers enhanced performance in tasks requiring fine-grained comprehension and interaction across modalities.

In future research, this approach may be extended to more generalized multi-modal tasks, potentially serving as a foundation for developing comprehensive document processing systems capable of nuanced semantic analysis in varied contexts. Further exploration could include scaling this methodology to larger datasets and testing in real-world applications across various industries.

Through its systematic approach to integrating layout knowledge, ERNIE-Layout provides a compelling advancement in VrDU and sets a precedent for leveraging spatial awareness in natural language processing models.

Authors (15)
  1. Qiming Peng (7 papers)
  2. Yinxu Pan (6 papers)
  3. Wenjin Wang (56 papers)
  4. Bin Luo (209 papers)
  5. Zhenyu Zhang (249 papers)
  6. Zhengjie Huang (25 papers)
  7. Teng Hu (26 papers)
  8. Weichong Yin (8 papers)
  9. Yongfeng Chen (1 paper)
  10. Yin Zhang (98 papers)
  11. Shikun Feng (37 papers)
  12. Yu Sun (226 papers)
  13. Hao Tian (146 papers)
  14. Hua Wu (191 papers)
  15. Haifeng Wang (194 papers)
Citations (66)