PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-end Document Pair Extraction (2401.03472v3)

Published 7 Jan 2024 in cs.CL

Abstract: Document pair extraction aims to identify key and value entities as well as their relationships from visually-rich documents. Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE). However, simply concatenating SER and RE serially can lead to severe error propagation, and it fails to handle cases like multi-line entities in real scenarios. To address these issues, this paper introduces a novel framework, PEneo (Pair Extraction new decoder option), which performs document pair extraction in a unified pipeline, incorporating three concurrent sub-tasks: line extraction, line grouping, and entity linking. This approach alleviates the error accumulation problem and can handle the case of multi-line entities. Furthermore, to better evaluate the model's performance and to facilitate future research on pair extraction, we introduce RFUND, a re-annotated version of the commonly used FUNSD and XFUND datasets, making them more accurate and covering realistic situations. Experiments on various benchmarks demonstrate PEneo's superiority over previous pipelines, boosting performance by a large margin (e.g., 19.89%-22.91% F1 score on RFUND-EN) when combined with various backbones like LiLT and LayoutLMv3, showing its effectiveness and generality. Code and the new annotations are available at https://github.com/ZeningLin/PEneo.
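
To make the three concurrent sub-tasks concrete, here is a minimal sketch of how their outputs could be assembled into key-value pairs. This is an illustration under assumptions, not the authors' implementation: the inputs `roles` (per-line labels from line extraction), `grouping_links` (successor links from line grouping), and `entity_links` (key-to-value links from entity linking), along with the helper `assemble_pairs`, are hypothetical stand-ins for what PEneo's three decoder heads would predict on top of a backbone such as LiLT or LayoutLMv3.

```python
# Hypothetical sketch of PEneo-style pair assembly (not the authors' code).
# The inputs model the three concurrent sub-task outputs named in the abstract.

def assemble_pairs(roles, grouping_links, entity_links, texts):
    """Merge grouped lines into entities, then resolve key -> value pairs."""
    # Line grouping: follow successor links to chain multi-line entities.
    successor = dict(grouping_links)       # line i -> next line j of the same entity
    has_predecessor = set(successor.values())
    entities = {}                          # head line id -> full entity text
    for head in roles:
        if head in has_predecessor or roles[head] == "other":
            continue                       # not the first line of a key/value entity
        chain, cur = [head], head
        while cur in successor:            # walk the grouping chain
            cur = successor[cur]
            chain.append(cur)
        entities[head] = " ".join(texts[i] for i in chain)

    # Entity linking: join the head line of each key entity to its value entity.
    pairs = []
    for key_head, value_head in entity_links:
        if key_head in entities and value_head in entities:
            pairs.append((entities[key_head], entities[value_head]))
    return pairs

# Toy document: "Name:" is a key; "John" + "Smith" form a two-line value entity.
roles = {0: "key", 1: "value", 2: "value"}
texts = {0: "Name:", 1: "John", 2: "Smith"}
print(assemble_pairs(roles, grouping_links=[(1, 2)], entity_links=[(0, 1)], texts=texts))
# -> [('Name:', 'John Smith')]
```

Because grouping and linking are predicted concurrently rather than after a separate SER stage, a single mislabeled line need not cascade through a serial SER-then-RE pipeline, which is the error-propagation problem the abstract highlights.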

References (30)
  1. DocFormer: End-to-end transformer for document understanding. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 973–983. IEEE, 2021.
  2. Query-driven generative network for document information extraction in the wild. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4261–4271, 2022.
  3. Visual FUDGE: Form understanding via dynamic graph editing. In Document Analysis and Recognition (ICDAR 2021), pages 416–431. Springer, 2021.
  4. End-to-end document recognition and understanding with Dessurt. In European Conference on Computer Vision, pages 280–296. Springer, 2022.
  5. UniDoc: Unified pretraining framework for document understanding. Advances in Neural Information Processing Systems, 34:39–50, 2021.
  6. XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4583–4592, 2022.
  7. Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. In ICDAR-OST, 2019.
  8. BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10767–10775, 2022.
  9. A question-answering approach to key value pair extraction from form-like document images. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11):12899–12906, 2023.
  10. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
  11. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019.
  12. Spatial dependency parsing for semi-structured document information extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 330–343, 2021.
  13. OCR-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
  14. Visual information extraction in the wild: practical dataset and end-to-end solution. In International Conference on Document Analysis and Recognition, pages 36–53. Springer, 2023.
  15. SelfDoc: Self-supervised document representation learning. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5648–5656. IEEE, 2021.
  16. StrucTexT: Structured text understanding with multi-modal transformers. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1912–1920, 2021.
  17. DocTr: Document transformer for structured information extraction in documents. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19584–19594, 2023.
  18. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.
  19. GeoLayoutLM: Geometric pre-training for visual information extraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7092–7101, 2023.
  20. CORD: A consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019, 2019.
  21. ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3744–3756, 2022.
  22. Information management system using structure analysis of paper/electronic documents and its applications. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), pages 689–693, 2007.
  23. LiLT: A simple yet effective language-independent layout transformer for structured document understanding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7747–7757, 2022.
  24. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1572–1582, 2020.
  25. Layout recognition of multi-kinds of table-form documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4):432–445, 1995.
  26. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200, 2020.
  27. LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding. arXiv preprint, 2021.
  28. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591, 2021.
  29. XFUND: A benchmark dataset for multilingual visually rich form understanding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3214–3224, 2022.
  30. Modeling entities as semantic points for visual information extraction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15358–15367, 2023.
Authors (7)
  1. Zening Lin (2 papers)
  2. Jiapeng Wang (22 papers)
  3. Teng Li (83 papers)
  4. Wenhui Liao (4 papers)
  5. Dayi Huang (1 paper)
  6. Longfei Xiong (1 paper)
  7. Lianwen Jin (116 papers)
