UMIE: Unified Multimodal Information Extraction with Instruction Tuning (2401.03082v1)
Abstract: Multimodal information extraction (MIE) has gained significant attention as multimedia content grows in popularity. However, current MIE methods often rely on task-specific model architectures, which limits generalizability across tasks and underuses knowledge shared between MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor that casts three MIE tasks as a single generation problem via instruction tuning and can effectively extract both textual and visual mentions. Extensive experiments show that a single UMIE model outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong zero-shot generalization, robustness to instruction variants, and interpretability. Our research serves as an initial step toward a unified MIE model and initiates the exploration of both instruction tuning and large language models (LLMs) within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE
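To make the abstract's central idea concrete, here is a minimal sketch of how three MIE tasks can be cast as one instruction-conditioned generation problem. The instruction wording, the image representation, and the `type: span` output schema are illustrative assumptions, not the paper's exact templates.

```python
# Hypothetical sketch: serialize multimodal NER (MNER), relation extraction
# (MRE), and event extraction (MEE) into one shared prompt format, so a
# single seq2seq model can be instruction-tuned on all three tasks.
# All template strings below are assumptions for illustration.

TASK_INSTRUCTIONS = {
    "MNER": "Extract named entities and their types from the text, "
            "grounding them in the image.",
    "MRE": "Extract the relation between the marked entities, "
           "using the image as context.",
    "MEE": "Extract event triggers and their arguments from the text and image.",
}

def build_prompt(task: str, text: str, image_caption: str) -> str:
    """Serialize one example into a single instruction-tuning input string."""
    return (
        f"Instruction: {TASK_INSTRUCTIONS[task]}\n"
        f"Image: {image_caption}\n"
        f"Text: {text}\n"
        f"Output:"
    )

def parse_generation(output: str) -> list[tuple[str, str]]:
    """Parse generated 'label: span' lines back into (label, span) tuples."""
    records = []
    for line in output.strip().splitlines():
        label, _, span = line.partition(":")
        records.append((label.strip(), span.strip()))
    return records
```

Because every task shares the same input and output format, one decoder can serve all three, which is what allows knowledge to be shared across tasks rather than siloed in task-specific heads.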