OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition (2403.19128v1)

Published 28 Mar 2024 in cs.CV

Abstract: Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative LLMs capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflows. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective (point-conditioned text generation), and the unified input & output representation (prompt & structured sequences). Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performance on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
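
The abstract frames all three tasks as one point-conditioned, prompt-driven sequence-generation problem over a shared encoder-decoder. The minimal PyTorch sketch below illustrates that framing only; the module names, vocabulary layout, dimensions, and prompt format are hypothetical stand-ins, not the paper's actual implementation (see the linked repository for that).

```python
# Illustrative sketch of a unified, point-conditioned encoder-decoder parser.
# All sizes and the token vocabulary layout below are assumptions for demonstration;
# the official OmniParser code lives at
# https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
import torch
import torch.nn as nn

VOCAB_SIZE = 1200   # hypothetical: text tokens + quantized coordinate bins + task/structure tokens
MAX_LEN = 256
D_MODEL = 256

class UnifiedParser(nn.Module):
    """One encoder-decoder shared by text spotting, KIE, and table recognition.
    The task is selected purely by the prompt tokens fed to the decoder."""
    def __init__(self):
        super().__init__()
        # Visual encoder: a tiny patch embedding stands in for the real backbone.
        self.patch_embed = nn.Conv2d(3, D_MODEL, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Decoder: autoregressively emits a structured sequence (points, text, tags).
        self.token_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(MAX_LEN, D_MODEL)
        dec_layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, image, prompt_and_targets):
        # Encode the image into a sequence of visual tokens.
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        memory = self.encoder(feats)
        # Embed the prompt (task token + quantized point coordinates) together with
        # the already-generated part of the structured output sequence.
        B, T = prompt_and_targets.shape
        pos = torch.arange(T, device=prompt_and_targets.device).unsqueeze(0)
        tgt = self.token_embed(prompt_and_targets) + self.pos_embed(pos)
        # Standard causal mask so each position attends only to earlier tokens.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=image.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)  # next-token logits over the shared vocabulary

if __name__ == "__main__":
    model = UnifiedParser()
    image = torch.randn(1, 3, 224, 224)
    # Hypothetical prompt: [TASK_SPOTTING, x_bin, y_bin] followed by target tokens.
    seq = torch.randint(0, VOCAB_SIZE, (1, 32))
    print(model(image, seq).shape)  # torch.Size([1, 32, 1200])
```

Under this formulation, switching between text spotting, key information extraction, and table recognition amounts to changing the prompt tokens and the grammar of the generated structured sequence, while the network and training objective stay the same.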

Authors (9)
  1. Jianqiang Wan
  2. Sibo Song
  3. Wenwen Yu
  4. Yuliang Liu
  5. Wenqing Cheng
  6. Fei Huang
  7. Xiang Bai
  8. Cong Yao
  9. Zhibo Yang
Citations (7)
