Vision Grid Transformer for Document Layout Analysis (2308.14978v1)

Published 29 Aug 2023 in cs.CV

Abstract: Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modality but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D$^4$LA, which is so far the most diverse and detailed manually-annotated benchmark for document layout analysis, is curated and released. Experiment results have illustrated that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet ($95.7\% \rightarrow 96.2\%$), DocBank ($79.6\% \rightarrow 84.1\%$), and D$^4$LA ($67.7\% \rightarrow 68.8\%$). The code and models as well as the D$^4$LA dataset will be made publicly available at \url{https://github.com/AlibabaResearch/AdvancedLiterateMachinery}.

Authors (4)
  1. Cheng Da (7 papers)
  2. Chuwei Luo (8 papers)
  3. Qi Zheng (62 papers)
  4. Cong Yao (70 papers)
Citations (15)

Summary

Vision Grid Transformer for Document Layout Analysis

The paper introduces the Vision Grid Transformer (VGT), a model aimed at enhancing Document Layout Analysis (DLA). DLA is crucial for transforming document images into structured formats, facilitating subsequent tasks like information extraction and retrieval. While previous models predominantly focused on either visual or textual features, VGT combines these modalities using a two-stream architecture that integrates a Vision Transformer (ViT) and a novel Grid Transformer (GiT).
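A minimal sketch of this two-stream idea in PyTorch is shown below. The class names, dimensions, the concatenation-based fusion, and the assumption that both streams yield the same number of feature tokens are illustrative choices, not the authors' implementation (the released code is in the repository linked above).

```python
# Hedged sketch of a two-stream vision + grid backbone (not the authors' code).
import torch
import torch.nn as nn


class GridTransformer(nn.Module):
    """Encodes a 2D text grid: each cell holds a token id placed at its
    spatial position on the page, so reading layout is preserved."""

    def __init__(self, vocab_size=30522, dim=768, depth=4, heads=12, grid=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                 # token id -> vector per cell
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))  # learned positional bias
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, token_grid):                 # (B, grid, grid) long tensor of token ids
        x = self.embed(token_grid).flatten(1, 2)   # (B, grid*grid, dim)
        return self.encoder(x + self.pos)          # 2D token-level semantic features


class VGTBackbone(nn.Module):
    """Fuses the visual stream (a ViT over image patches) with the textual
    stream (the GridTransformer) into one feature sequence that a detection
    head (e.g. Cascade R-CNN) could consume."""

    def __init__(self, vit, git, dim=768):
        super().__init__()
        self.vit, self.git = vit, git
        self.fuse = nn.Linear(2 * dim, dim)        # simple concat-and-project fusion

    def forward(self, image, token_grid):
        v = self.vit(image)                        # (B, N, dim) patch features
        g = self.git(token_grid)                   # (B, N, dim) grid features, same N assumed
        return self.fuse(torch.cat([v, g], dim=-1))
```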

Key Contributions

  1. Vision Grid Transformer Architecture: VGT leverages both visual and textual data. The ViT handles visual features, while the GiT processes 2D token-level semantic representations extracted from documents. This integration ensures that the semantic nuances of both modalities are captured effectively.
  2. Pre-training with Multi-Granularity Semantics: The GiT is pre-trained with two novel objectives, Masked Grid Language Modeling (MGLM) and Segment Language Modeling (SLM). MGLM focuses on token-level semantic representations, while SLM aligns segment-level semantic understanding of grid features with existing language models such as BERT or LayoutLM through contrastive learning (a sketch of both objectives follows this list).
  3. Diverse and Detailed Dataset (D$^4$LA): In conjunction with the model, a new dataset, D$^4$LA, is released. It encompasses a wide variety of document types, enhancing the model's applicability in real-world scenarios. The dataset is annotated manually, providing a comprehensive benchmark for DLA with 27 different layout categories across 12 types of documents.
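Below is a hedged sketch of how the two pre-training objectives could be written as losses. The masking ratio, mask token id, and the use of a frozen BERT-style encoder as the segment-level teacher are assumptions for illustration, not the exact recipe from the paper.

```python
# Illustrative losses for the two objectives described above (MGLM + SLM).
import torch
import torch.nn.functional as F


def mglm_loss(git, token_grid, vocab_head, mask_ratio=0.15, mask_id=103):
    """Masked Grid Language Modeling: mask grid cells and predict the
    original token ids from the GiT output at the masked positions."""
    targets = token_grid.flatten(1)                                  # (B, L) original ids
    mask = torch.rand_like(targets, dtype=torch.float) < mask_ratio  # cells to corrupt
    corrupted = token_grid.clone().flatten(1)
    corrupted[mask] = mask_id
    feats = git(corrupted.view_as(token_grid))                       # (B, L, dim)
    logits = vocab_head(feats)                                       # (B, L, vocab)
    return F.cross_entropy(logits[mask], targets[mask])


def slm_loss(seg_feats_git, seg_feats_lm, temperature=0.07):
    """Segment Language Modeling: contrastively align pooled segment-level
    grid features with segment embeddings from a frozen language model."""
    a = F.normalize(seg_feats_git, dim=-1)                           # (S, dim) from GiT
    b = F.normalize(seg_feats_lm, dim=-1)                            # (S, dim) from BERT/LayoutLM
    logits = a @ b.t() / temperature                                 # (S, S) similarities
    labels = torch.arange(a.size(0), device=a.device)                # matched pairs on diagonal
    return F.cross_entropy(logits, labels)
```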

Experimental Evaluation

The experiments demonstrate that VGT achieves state-of-the-art results on existing datasets such as PubLayNet and DocBank, as well as on the new D$^4$LA dataset. Noteworthy improvements in performance metrics such as mAP were observed. For instance, the model improved performance on PubLayNet from 95.7% to 96.2% and on DocBank from 79.6% to 84.1%. These results validate the model's efficacy in leveraging multi-modal information for more accurate document layout analysis.
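For context, layout analysis on these benchmarks is typically scored as COCO-style object detection, so the reported numbers correspond to box mAP averaged over IoU thresholds and layout categories. The snippet below shows how such an evaluation is commonly run with pycocotools; the file names are placeholders, not files from the paper's release.

```python
# Common COCO-style mAP evaluation for document layout detection results.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("annotations/val.json")      # ground-truth layout boxes and categories
dt = gt.loadRes("predictions.json")    # model detections: bbox, category_id, score

evaluator = COCOeval(gt, dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                  # prints AP@[.50:.95] and related metrics

overall_map = evaluator.stats[0]       # mAP averaged over IoU thresholds and classes
print(f"mAP: {overall_map:.3f}")
```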

Implications and Future Directions

The introduction of VGT presents significant implications for both academic research and practical applications. The dual-modality approach could inspire future studies exploring deeper integration of different data modalities. The pre-training strategies described here might influence future transformer-based architectures beyond document analysis, potentially impacting other domains requiring nuanced semantic understanding.

Looking forward, there’s potential for reducing computational complexity while maintaining accuracy, making the model more feasible for deployment in resource-constrained environments. Additionally, expanding the diversity of document types and improving the robustness against document complications such as low-quality scans remain fruitful areas for exploration.

Overall, the Vision Grid Transformer's ability to harmonize visual and textual features marks an advancement in the field of Document AI, with promising applications across numerous document-intensive industries.