3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding (2402.17983v3)
Abstract: This paper presents a multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model leverages insights from both fine-grained and coarse-grained levels by establishing a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine the multi-teacher knowledge distillation process, bridging distribution gaps across teachers and yielding a harmonised understanding of form documents. Through a comprehensive evaluation on publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, demonstrating its efficacy in handling the intricate structures and content of visually complex form documents.
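To make the multi-teacher distillation idea concrete, the sketch below shows a standard Hinton-style distillation objective averaged over several teachers: the student's temperature-softened distribution is pulled toward each teacher's. This is a generic, hedged illustration only; the paper's actual inter-grained and cross-grained losses are not specified here, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution.
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Generic multi-teacher distillation loss (illustrative sketch).

    Averages KL(teacher || student) over the teachers, with the usual T^2
    scaling from standard knowledge distillation. Not the paper's
    inter-/cross-grained formulation.
    """
    p_student = softmax(student_logits, temperature)
    total = 0.0
    for t_logits in teacher_logits_list:
        p_teacher = softmax(t_logits, temperature)
        # KL divergence between teacher and student distributions.
        total += np.sum(
            p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12))
        )
    return (temperature ** 2) * total / len(teacher_logits_list)
```

When the student matches a teacher exactly the loss is zero, and it grows as the distributions diverge; a real system would add this term to the task loss with a mixing weight.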