
The future of document indexing: GPT and Donut revolutionize table of content processing (2403.07553v1)

Published 12 Mar 2024 in cs.IR, cs.AI, and cs.CV

Abstract: Industrial projects rely heavily on lengthy, complex specification documents, and the tedious manual extraction of structured information from them is a major bottleneck. This paper introduces an approach to automate this process using two AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust LLM. The proposed methodology first acquires the tables of contents (ToCs) from construction specification documents and then structures the ToC text into JSON data. In organizing the ToCs, Donut reaches 85% accuracy and GPT-3.5 Turbo reaches 89%. These results represent a significant step forward in document indexing and demonstrate the potential of AI to automate information extraction across diverse document types, improving efficiency and freeing critical resources in various industries.
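
To make the target representation concrete (ToC text structured as JSON), here is a minimal sketch. It is not the paper's method, which relies on Donut and GPT-3.5 Turbo; it is a rule-based stand-in that assumes each ToC line has the hypothetical form "section number, title, dot leaders, page number". The function name `parse_toc`, the regular expression, and the JSON field names are all illustrative assumptions.

```python
import json
import re

# Assumed (hypothetical) line format: "<section number> <title> <dot leaders> <page>"
TOC_LINE = re.compile(
    r"^(?P<number>\d+(?:\.\d+)*)\s+(?P<title>.+?)[\s.]*(?P<page>\d+)$"
)


def parse_toc(text):
    """Turn plain ToC text into a nested list of section dictionaries."""
    root = []
    stack = []  # (depth, node) pairs for the currently open sections
    for raw in text.splitlines():
        match = TOC_LINE.match(raw.strip())
        if not match:
            continue  # skip lines that do not look like ToC entries
        number = match.group("number")
        node = {
            "number": number,
            "title": match.group("title"),
            "page": int(match.group("page")),
            "children": [],
        }
        depth = number.count(".") + 1  # "1" -> 1, "1.2" -> 2, ...
        # Close sections at the same or a deeper level than the new entry.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1]["children"] if stack else root
        parent.append(node)
        stack.append((depth, node))
    return root


def toc_to_json(text):
    """Serialize the parsed ToC as an indented JSON string."""
    return json.dumps(parse_toc(text), indent=2)
```

Calling `toc_to_json` on a few lines such as `1 General ........ 3` followed by `1.1 Summary .... 3` yields a JSON array in which the subsection appears under the `children` key of its parent. Real specification documents are far less regular than this, which is the gap the learned models in the paper are meant to close.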


