Integrating Large Pre-trained Models into Multimodal Named Entity Recognition with Evidential Fusion (2306.16991v1)
Abstract: Multimodal Named Entity Recognition (MNER) is a crucial task for extracting information from social media platforms such as Twitter. Most current methods rely on attention weights to combine information from text and images, but these weights are often unreliable and lack interpretability. To address this problem, we incorporate uncertainty estimation into the MNER task to produce trustworthy predictions. Our algorithm models each modality's predictive distribution as a Normal-inverse Gamma (NIG) distribution and fuses them into a unified distribution through an evidential fusion mechanism, enabling hierarchical characterization of uncertainty and improving both prediction accuracy and trustworthiness. In addition, we explore the potential of large pre-trained foundation models in MNER and propose an efficient fusion approach that leverages their robust feature representations. Experiments on two datasets demonstrate that our method outperforms the baselines and achieves new state-of-the-art performance.
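To make the fusion step concrete, below is a minimal sketch of how two per-modality Normal-inverse Gamma distributions can be combined into a single NIG distribution. The summation rule used here is the NIG summation known from the evidential deep learning literature (Qian, 2018; Amini et al., 2020); the abstract does not spell out the paper's exact evidential fusion mechanism, so the rule, the `NIG` class, and the `fuse` function are illustrative assumptions rather than the authors' implementation.

```python
# Sketch only: fusing two modality-specific Normal-inverse Gamma (NIG)
# distributions. The summation rule follows Qian's NIG summation; the
# paper's actual evidential fusion mechanism may differ in detail.
from dataclasses import dataclass

@dataclass
class NIG:
    gamma: float  # predicted mean
    nu: float     # virtual evidence count for the mean (nu > 0)
    alpha: float  # shape of the inverse-Gamma component (alpha > 1)
    beta: float   # scale of the inverse-Gamma component (beta > 0)

    @property
    def aleatoric(self) -> float:
        # E[sigma^2] = beta / (alpha - 1): noise inherent in the data
        return self.beta / (self.alpha - 1)

    @property
    def epistemic(self) -> float:
        # Var[mu] = beta / (nu * (alpha - 1)): model uncertainty
        return self.beta / (self.nu * (self.alpha - 1))

def fuse(a: NIG, b: NIG) -> NIG:
    """Combine two per-modality NIG distributions into one."""
    gamma = (a.nu * a.gamma + b.nu * b.gamma) / (a.nu + b.nu)
    nu = a.nu + b.nu
    alpha = a.alpha + b.alpha + 0.5
    beta = (a.beta + b.beta
            + 0.5 * a.nu * (a.gamma - gamma) ** 2
            + 0.5 * b.nu * (b.gamma - gamma) ** 2)
    return NIG(gamma, nu, alpha, beta)

# Example: a confident text branch dominates a noisy image branch.
text = NIG(gamma=0.9, nu=10.0, alpha=5.0, beta=1.0)
image = NIG(gamma=0.2, nu=1.0, alpha=2.0, beta=2.0)
fused = fuse(text, image)
print(fused.gamma, fused.aleatoric, fused.epistemic)
```

The uncertainty decomposition follows deep evidential regression: beta / (alpha - 1) estimates aleatoric (data) noise while beta / (nu * (alpha - 1)) estimates epistemic (model) uncertainty, so the modality carrying more evidence (larger nu and alpha) dominates the fused mean. This is what allows a fused prediction to remain calibrated when one modality, e.g. the image, is uninformative.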