
CREPE: Coordinate-Aware End-to-End Document Parser (2405.00260v1)

Published 1 May 2024 in cs.CV

Abstract: In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of that text using a multi-head architecture. Named the Coordinate-aware End-to-end Document Parser (CREPE), our method integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly supervised framework for cost-efficient training that requires only parsing annotations, without high-cost coordinate annotations. Our experimental evaluations show that CREPE achieves state-of-the-art performance on document parsing tasks. Beyond that, CREPE's adaptability is further highlighted by its successful use in other document understanding tasks such as layout analysis, document visual question answering, and so on. CREPE's combined OCR and semantic parsing abilities not only mitigate the error propagation issues of existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.
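
The idea of token-triggered coordinate decoding can be illustrated with a toy greedy decoding loop: a text head emits tokens, and a separate coordinate head is invoked only when a special OCR token appears. This is a minimal sketch under stated assumptions, not the authors' implementation. The trigger token name `</ocr>`, the `decode`, `text_head`, and `coord_head` names, the normalized `(x1, y1, x2, y2)` box format, and greedy decoding are all hypothetical choices for illustration; in a real model both heads would read the decoder's hidden states rather than the list of previously generated tokens used here as a stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]

OCR_END = "</ocr>"  # hypothetical special token that closes an OCR text span
EOS = "<eos>"       # hypothetical end-of-sequence token


@dataclass
class Step:
    token: str
    box: Optional[Box] = None  # set only when the coordinate head was triggered


def decode(
    text_head: Callable[[List[str]], str],
    coord_head: Callable[[List[str]], Box],
    max_len: int = 64,
) -> List[Step]:
    """Greedy decoding in which the coordinate head runs only on the trigger token.

    Both heads are stand-ins: they receive the tokens generated so far,
    whereas a real decoder would feed them shared hidden states.
    """
    history: List[str] = []
    steps: List[Step] = []
    for _ in range(max_len):
        token = text_head(history)
        history.append(token)
        if token == EOS:
            break
        # Token-triggered decoding: predict a box only for the special OCR token.
        box = coord_head(history) if token == OCR_END else None
        steps.append(Step(token, box))
    return steps


# Usage with scripted stub heads (hypothetical parse of a single "name" field).
tokens = ["<s_name>", "ACME", OCR_END, "</s_name>", EOS]
result = decode(lambda h: tokens[len(h)], lambda h: (0.10, 0.05, 0.30, 0.09))
for step in result:
    print(step.token, step.box)
```

In this sketch only the `</ocr>` step carries a box, so the semantic parse (the field tags and text) and the text locations come out of a single generated sequence. Gating the coordinate head on one token keeps it idle for structural tokens such as field tags.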

Authors (8)
  1. Yamato Okamoto
  2. Youngmin Baek
  3. Geewook Kim
  4. Ryota Nakao
  5. Moon Bin Yim
  6. Seunghyun Park
  7. Bado Lee
  8. Donghyun Kim
Citations (1)