BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents (2108.04539v5)

Published 10 Aug 2021 in cs.CL

Abstract: Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-trained language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basics: the effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance than previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks: (1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples. It demonstrates the superiority of BROS over previous methods on both. Code is available at https://github.com/clovaai/bros.

Insights into "BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents"

The paper introduces BROS (BERT Relying On Spatiality), a pre-trained language model designed to improve key information extraction (KIE) from documents through a sophisticated understanding of text and layout in 2D space. Rather than incorporating visual features into the model, this research emphasizes the spatial relationships between text elements on a document page.

Key Contributions and Methodology

  1. Spatial encoding: BROS encodes the relative positions between text blocks, in contrast with previous models that relied on absolute 2D positions. Relative encoding better captures spatial dependencies, which is crucial for distinguishing entities that participate in similar key-value relationships (see the first sketch after this list).
  2. Area-masked language modeling: During pre-training, BROS masks entire spatially contiguous regions of a document and predicts the hidden tokens, so the model learns the dependencies among tokens in 2D space. This area-masking objective complements the traditional token-masked language model (TMLM) objective by providing spatially coherent spans as context (see the second sketch after this list).
  3. Performance evaluation: BROS achieves results superior or comparable to state-of-the-art models across four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without using visual image features. Especially noteworthy are its robustness to incorrect text ordering and its effectiveness with few labeled examples, which address two practical challenges in real-world KIE tasks.
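
To make the relative-position idea concrete, here is a minimal sketch in PyTorch of pairwise relative 2D position features between text blocks. The function names (`sinusoidal_embedding`, `relative_position_features`) and the single point per block are illustrative assumptions, not the authors' implementation; BROS feeds such features into its attention computation.

```python
import torch

def sinusoidal_embedding(values, dim):
    # Standard Transformer-style sinusoidal features for a batch of
    # scalar offsets. `values` has any shape; the output gains a `dim` axis.
    half = dim // 2
    freqs = torch.exp(
        -torch.log(torch.tensor(10000.0)) * torch.arange(half).float() / half
    )
    angles = values.unsqueeze(-1) * freqs              # (..., half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def relative_position_features(points, dim=64):
    # points: (n, 2) normalized (x, y) coordinates, one per text block.
    # Returns (n, n, 2 * dim) pairwise relative-position features.
    delta = points.unsqueeze(1) - points.unsqueeze(0)  # (n, n, 2) offsets
    feat_x = sinusoidal_embedding(delta[..., 0], dim)  # (n, n, dim)
    feat_y = sinusoidal_embedding(delta[..., 1], dim)
    return torch.cat([feat_x, feat_y], dim=-1)

# Three blocks: a key at the left, its value to the right, and an
# unrelated block further down the page.
points = torch.tensor([[0.10, 0.20], [0.55, 0.20], [0.10, 0.80]])
features = relative_position_features(points)
print(features.shape)  # torch.Size([3, 3, 128])
```

Because only offsets enter the encoding, translating the whole page leaves the features unchanged, which is what makes relative encoding robust to where a given key-value pair happens to sit on the document.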
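The second sketch illustrates area masking: pick an anchor text block, expand its bounding box, and mask every block that falls inside the expanded region. The fixed expansion ratio, the `MASK_ID` constant, and the block dictionary layout are simplifying assumptions for illustration; the paper samples the region size randomly during pre-training.

```python
import random

MASK_ID = 103  # [MASK] token id in BERT's vocabulary

def area_mask(blocks, expand=1.0, rng=random):
    # blocks: list of dicts {"token_ids": [...], "box": (x0, y0, x1, y1)}
    # with normalized page coordinates. Masks in place and returns the
    # indices of the blocks whose tokens were hidden.
    anchor = rng.choice(blocks)
    x0, y0, x1, y1 = anchor["box"]
    pad_w, pad_h = (x1 - x0) * expand, (y1 - y0) * expand
    region = (x0 - pad_w, y0 - pad_h, x1 + pad_w, y1 + pad_h)

    masked_idx = []
    for i, blk in enumerate(blocks):
        bx0, by0, bx1, by1 = blk["box"]
        if (bx0 >= region[0] and by0 >= region[1]
                and bx1 <= region[2] and by1 <= region[3]):
            blk["token_ids"] = [MASK_ID] * len(blk["token_ids"])
            masked_idx.append(i)
    return masked_idx
```

Masking an entire spatial region forces the model to reconstruct tokens from surrounding blocks rather than from immediate neighbors in reading order, making this the 2D analogue of span masking in text-only pre-training.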

Strong Numerical Results

A comprehensive evaluation shows that BROS consistently outperforms existing pre-trained models, particularly when text-ordering errors are present or training data is limited. For instance, BROS's F1 score on the FUNSD entity extraction (EE) task was 83.05, surpassing the best results of models that incorporate visual features. Moreover, BROS maintained high performance when trained on only 20-30% of the available FUNSD data, a testament to its data efficiency.

Implications and Future Directions

BROS's methodology has significant implications for the development of layout-aware language models. By emphasizing relative spatial encoding and area-based masking, the model paves the way for more resource-efficient text processing in KIE tasks, and it suggests applying similar strategies to other 2D information extraction tasks beyond industrial documents.

Looking ahead, further work on the relative position encoding mechanism could improve generalization across varied document layouts, particularly mixed or complex ones. Integrating BROS with multi-modal approaches that selectively incorporate visual cues could also yield a hybrid strategy for more nuanced document comprehension.

In conclusion, BROS makes a significant contribution to document information extraction by balancing spatial and textual understanding without the computational overhead associated with visual features. Its success highlights the importance of refined spatial modeling in KIE tasks and points to a promising direction for future research and application development in document automation.

Authors (6)
  1. Teakgyu Hong (8 papers)
  2. Donghyun Kim (129 papers)
  3. Mingi Ji (8 papers)
  4. Wonseok Hwang (24 papers)
  5. Daehyun Nam (4 papers)
  6. Sungrae Park (17 papers)
Citations (124)

GitHub

  1. GitHub - clovaai/bros (155 stars)