
UMIE: Unified Multimodal Information Extraction with Instruction Tuning (2401.03082v1)

Published 5 Jan 2024 in cs.AI

Abstract: Multimodal information extraction (MIE) gains significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to using task-specific model structures, which results in limited generalizability across tasks and underutilizes shared knowledge across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor to unify three MIE tasks as a generation problem using instruction tuning, being able to effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and LLMs within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE


Summary

  • The paper introduces a unified framework that transforms multiple MIE tasks into a generation problem through instruction tuning.
  • UMIE adapts to diverse extraction scenarios without task-specific architectures, outperforming state-of-the-art methods across six datasets.
  • The research demonstrates robust zero-shot performance and improved interpretability, opening new avenues for multimodal information extraction.

The paper "UMIE: Unified Multimodal Information Extraction with Instruction Tuning" addresses the challenges faced by existing multimodal information extraction (MIE) methods, primarily their reliance on task-specific model structures. This task specificity often leads to limited generalizability and an underuse of shared knowledge across various MIE tasks.

To overcome these limitations, the authors introduce UMIE, a unified multimodal information extractor. UMIE recasts three core MIE tasks as a single generation problem via instruction tuning, enabling one model to extract both textual and visual mentions without task-specific architectures.
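
To make the unification concrete, the sketch below shows one way the three tasks (assumed here to be multimodal named entity recognition, relation extraction, and event extraction) could be expressed as instruction-conditioned generation: each example pairs an instruction with the text and image inputs, and the structured output is serialized into a target string for a generative decoder. The instruction phrasings and output serializations are illustrative assumptions, not the authors' exact templates, which are defined in their released code and data.

```python
# Minimal sketch: casting three MIE tasks as instruction-conditioned generation.
# Instruction wording and output serialization are assumptions for illustration.

def build_example(task, instruction, text, image_path, target):
    """Pack one training example: the model reads the instruction and text
    (plus the image, via a visual encoder) and is trained to generate `target`."""
    return {
        "task": task,
        "instruction": instruction,
        "text": text,
        "image": image_path,   # consumed by the visual encoder
        "target": target,      # string the decoder learns to emit
    }

examples = [
    build_example(
        "MNER",
        "Extract all named entities and their types from the text and image.",
        "Kevin Durant joins the Warriors.",
        "tweet_001.jpg",
        "person: Kevin Durant | organization: Warriors",
    ),
    build_example(
        "MRE",
        "Identify the relation between the two marked entities.",
        "[Kevin Durant] joins the [Warriors].",
        "tweet_001.jpg",
        "member_of",
    ),
    build_example(
        "MEE",
        "Extract the event trigger, its type, and its arguments.",
        "Protesters clashed with police downtown.",
        "news_042.jpg",
        "trigger: clashed | type: Conflict.Attack | attacker: Protesters | target: police",
    ),
]
```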

Key contributions and findings of the paper include:

  1. Unified Framework: UMIE consolidates multiple MIE tasks under a single unified framework. By framing information extraction as a generative process, the model leverages shared knowledge, enhancing generalizability across tasks (a toy sketch of how generated outputs can be parsed back into structured mentions follows this list).
  2. Instruction Tuning: The authors employ instruction tuning, which enables the model to adapt to different MIE tasks through specific instructions. This method ensures that UMIE can handle diverse MIE scenarios without the need for task-specific architectures.
  3. Performance: Extensive experiments demonstrate that UMIE outperforms various state-of-the-art (SoTA) methods across six different MIE datasets spanning three tasks. The results highlight the effectiveness of UMIE’s unified approach in extracting multimodal information.
  4. Generalization and Robustness: The paper emphasizes UMIE's strong generalization capabilities, particularly in zero-shot settings. This means the model performs well on new, unseen tasks without requiring additional fine-tuning. Additionally, UMIE shows robustness to variations in instructions, which underscores the flexibility of the instruction tuning paradigm.
  5. Interpretability: UMIE provides insight into its decision-making process, making it more interpretable than comparable models. This matters for practical applications where understanding model behavior is crucial.
  6. Initial Exploration: The research represents an initial step towards developing a truly unified MIE model. It also initiates exploration into the use of instruction tuning and LLMs within the MIE domain.
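
As noted in item 1, a generative formulation means the model emits a string rather than structured records, so the output must be parsed back into mentions. Below is a toy parser for the "type: mention | type: mention" serialization assumed in the earlier sketch; the real output format and parsing logic live in the authors' released code.

```python
def parse_generation(output: str) -> list[tuple[str, str]]:
    """Parse a generated string such as
    'person: Kevin Durant | organization: Warriors'
    back into (type, mention) pairs, skipping malformed chunks."""
    pairs = []
    for chunk in output.split("|"):
        if ":" not in chunk:
            continue  # tolerate malformed generations instead of crashing
        etype, mention = chunk.split(":", 1)
        pairs.append((etype.strip(), mention.strip()))
    return pairs

print(parse_generation("person: Kevin Durant | organization: Warriors"))
# -> [('person', 'Kevin Durant'), ('organization', 'Warriors')]
```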

The authors have made their code, data, and model publicly available, fostering further research and development in the field of multimodal information extraction. This openness aims to encourage collaboration and expedite advancements in creating more unified and generalizable models for various MIE tasks.

Overall, UMIE sets a new benchmark in the field of multimodal information extraction by addressing the limitations of task-specific models and demonstrating the power of a unified approach through instruction tuning.
