On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval (2311.00693v2)
Abstract: Visually-rich document entity retrieval (VDER), which extracts key information (e.g., date, address) from document images such as invoices and receipts, has become an important topic in industrial NLP applications. The emergence of new document types at a constant pace, each with its unique entity types, presents a unique challenge: many documents contain unseen entity types that occur only a couple of times. Addressing this challenge requires models to have the ability to learn entities in a few-shot manner. However, prior works on few-shot VDER mainly address the problem at the document level with a predefined global entity space, which does not account for the entity-level few-shot scenario: target entity types are locally personalized by each task, and entity occurrences vary significantly among documents. To address this unexplored scenario, this paper studies a novel entity-level few-shot VDER task. The challenges lie in the uniqueness of the label space for each task and the increased complexity of out-of-distribution (OOD) contents. To tackle this novel task, we present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization that distinguishes between in-task and out-of-task distributions. Specifically, we adopt a hierarchical decoder (HC) and employ contrastive learning (ContrastProtoNet) to achieve this goal. Furthermore, we introduce a new dataset, FewVEX, to boost future research in the field of entity-level few-shot VDER. Experimental results demonstrate that our approaches significantly improve the robustness of popular meta-learning baselines.
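To make the high-level idea concrete, the following is a minimal PyTorch sketch of the kind of episode objective the abstract describes: a prototypical-network classifier over per-task entity prototypes combined with a supervised contrastive term that pulls same-entity support tokens together. The synthetic embeddings, the episode sizes, and the 0.5 loss weight are illustrative assumptions; this is not the paper's exact ContrastProtoNet or hierarchical decoder.

```python
# Minimal sketch (assumptions noted above): ProtoNet episode loss plus a
# supervised contrastive term over support-token embeddings.
import torch
import torch.nn.functional as F

def prototypes(support_emb, support_labels, n_classes):
    # Mean embedding per in-task entity class (the task-local label space).
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])

def proto_loss(query_emb, query_labels, protos):
    # Classify query tokens by negative squared distance to each prototype.
    logits = -torch.cdist(query_emb, protos) ** 2
    return F.cross_entropy(logits, query_labels)

def supcon_loss(emb, labels, temperature=0.1):
    # Supervised contrastive loss (Khosla et al., 2020 style): same-entity
    # tokens are pulled together, different entities pushed apart.
    z = F.normalize(emb, dim=-1)
    sim = z @ z.t() / temperature
    mask_self = torch.eye(len(labels))
    mask_pos = (labels[:, None] == labels[None, :]).float() - mask_self
    # Exclude self-similarity from the denominator, then take log-softmax rows.
    log_prob = sim - torch.logsumexp(sim + mask_self * -1e9, dim=1, keepdim=True)
    pos_counts = mask_pos.sum(dim=1).clamp(min=1)
    return -((mask_pos * log_prob).sum(dim=1) / pos_counts).mean()

# Toy episode: 3 in-task entity types, 5 support and 5 query tokens per type.
# The 32-d random embeddings stand in for a multimodal document encoder output.
torch.manual_seed(0)
n_classes, shots, dim = 3, 5, 32
support_emb = torch.randn(n_classes * shots, dim, requires_grad=True)
support_labels = torch.arange(n_classes).repeat_interleave(shots)
query_emb = torch.randn(n_classes * shots, dim)
query_labels = torch.arange(n_classes).repeat_interleave(shots)

protos = prototypes(support_emb, support_labels, n_classes)
loss = proto_loss(query_emb, query_labels, protos) + 0.5 * supcon_loss(support_emb, support_labels)
loss.backward()  # in a real setup, gradients flow back into the encoder
print(f"episode loss: {loss.item():.3f}")
```

In the actual framework the embeddings would come from a multimodal document encoder and the prototypes would be rebuilt per task, so the entity label space stays local to each episode, which matches the task-personalization setting described in the abstract.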