UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models (2407.16160v2)

Published 23 Jul 2024 in cs.AI and cs.CL

Abstract: Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to the referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods rely heavily on complex mechanisms and extensive model tuning to model the multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale. Moreover, they cannot resolve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade their performance. Fortunately, the advent of LLMs with robust capabilities in text understanding and reasoning, particularly Multimodal LLMs (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge. However, how to design a universally applicable LLM-based MEL approach remains a pressing challenge. To this end, we propose UniMEL, a unified framework that establishes a new paradigm for processing multimodal entity linking tasks with LLMs. In this framework, we employ LLMs to augment the representations of mentions and entities individually by integrating textual and visual information and refining the textual information. Subsequently, we use an embedding-based method to retrieve and re-rank candidate entities. Then, with only ~0.26% of the model parameters fine-tuned, the LLM makes the final selection from the candidate entities. Extensive experiments on three public benchmark datasets demonstrate that our solution achieves state-of-the-art performance, and ablation studies verify the effectiveness of all modules. Our code is available at https://github.com/Javkonline/UniMEL.
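The retrieve-and-rerank step described in the abstract is easy to picture as code. Below is a minimal sketch, not the authors' implementation: it ranks knowledge-base entities by cosine similarity between unit-normalized text embeddings, as an embedding-based retriever would. The `embed` function, its 384-dimensional output, and the example texts are illustrative assumptions; in the actual pipeline a pretrained embedding model would produce the vectors, and a parameter-efficiently fine-tuned LLM would then make the final choice among the retrieved candidates.

```python
import hashlib

import numpy as np


def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in for a pretrained text-embedding model.

    Deterministically hashes each text to a unit-norm 384-d vector so the
    sketch runs without model weights; a real pipeline would call an actual
    embedding model here instead.
    """
    vecs = []
    for text in texts:
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
        v = np.random.default_rng(seed).normal(size=384)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)


def retrieve_and_rerank(mention: str, entity_descriptions: list[str], top_k: int = 5) -> list[int]:
    """Score each KB entity against the (LLM-augmented) mention and keep the
    top-k indices as candidates; a fine-tuned LLM would pick the final entity
    from this shortlist."""
    m = embed([mention])[0]          # (d,) unit-norm mention vector
    e = embed(entity_descriptions)   # (n, d) unit-norm entity vectors
    scores = e @ m                   # dot product == cosine similarity here
    return np.argsort(-scores)[:top_k].tolist()


if __name__ == "__main__":
    mention = "Apple unveiled a new chip at its keynote."
    entities = [
        "Apple Inc., an American consumer-electronics company.",
        "Apple, the round edible fruit of the apple tree.",
        "Apple Records, a label founded by the Beatles.",
    ]
    print(retrieve_and_rerank(mention, entities, top_k=2))
```

Because the vectors are normalized once at embedding time, cosine similarity reduces to a plain dot product, which keeps the scoring step a single matrix-vector multiply over the candidate set.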

Authors (7)
  1. Qi Liu (3 papers)
  2. Yongyi He (1 paper)
  3. Defu Lian (1 paper)
  4. Zhi Zheng (2 papers)
  5. Tong Xu (2 papers)
  6. Che Liu (1 paper)
  7. Enhong Chen (1 paper)