OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition (2403.19128v1)
Abstract: Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of generative large language models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflows. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share a unified encoder-decoder architecture, a unified objective (point-conditioned text generation), and a unified input and output representation (prompt and structured sequences). Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performance on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
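The abstract names a shared input and output representation (a task prompt followed by a structured sequence) and a point-conditioned generation objective, but gives no concrete format. The sketch below is only an illustration of how such a representation could be serialized. The token names, the `NUM_BINS` coordinate vocabulary, the `TextInstance` type, and both helper functions are assumptions made for this example and are not taken from the OmniParser code.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumed size of the discrete coordinate vocabulary


@dataclass
class TextInstance:
    """Hypothetical annotation: a center point, its transcription, and an
    optional structural tag (e.g. a KIE entity type or a table cell tag)."""
    center: Tuple[float, float]  # (x, y) in pixels
    text: str
    tag: str = ""


def quantize(value: float, size: int, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to a bin index clamped to [0, bins - 1]."""
    idx = int(value / size * bins)
    return min(max(idx, 0), bins - 1)


def point_tokens(inst: TextInstance, width: int, height: int) -> List[str]:
    """Encode one center point as two discrete coordinate tokens."""
    x = quantize(inst.center[0], width)
    y = quantize(inst.center[1], height)
    return [f"<x_{x}>", f"<y_{y}>"]


def build_sequence(task: str, instances: List[TextInstance],
                   width: int, height: int) -> List[str]:
    """Task prompt token followed by a structured sequence of points.
    Structural tags, when present, wrap the point they belong to."""
    tokens = [f"<{task}>"]
    for inst in instances:
        if inst.tag:
            tokens.append(f"<{inst.tag}>")
        tokens += point_tokens(inst, width, height)
        if inst.tag:
            tokens.append(f"</{inst.tag}>")
    tokens.append("</s>")
    return tokens


def content_targets(instances: List[TextInstance], width: int,
                    height: int) -> List[Tuple[List[str], List[str]]]:
    """Pair each point (the condition) with the characters to generate,
    mirroring the idea of point-conditioned text generation."""
    return [(point_tokens(i, width, height), list(i.text)) for i in instances]


if __name__ == "__main__":
    receipt = [TextInstance((50, 25), "TOTAL", tag="total_key"),
               TextInstance((100, 100), "9.50", tag="total_value")]
    print(build_sequence("kie", receipt, width=100, height=100))
    print(content_targets(receipt, width=100, height=100))
```

Under these assumptions, switching tasks changes only the prompt token and the set of structural tags, while the sequence format and the point-to-text pairing stay the same.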
Authors:
- Jianqiang Wan
- Sibo Song
- Wenwen Yu
- Yuliang Liu
- Wenqing Cheng
- Fei Huang
- Xiang Bai
- Cong Yao
- Zhibo Yang