
OVEL: Large Language Model as Memory Manager for Online Video Entity Linking (2403.01411v1)

Published 3 Mar 2024 in cs.CL

Abstract: In recent years, multi-modal entity linking (MEL) has garnered increasing attention in the research community due to its significance in numerous multi-modal applications. Video, as a popular means of information transmission, has become prevalent in people's daily lives. However, most existing MEL methods focus on linking textual and visual mentions, or mentions in offline videos, to entities in multi-modal knowledge bases, with limited effort devoted to linking mentions within online video content. In this paper, we propose a task called Online Video Entity Linking (OVEL), which aims to establish connections between mentions in online videos and a knowledge base with high accuracy and timeliness. To facilitate research on OVEL, we focus on live delivery scenarios and construct a live delivery entity linking dataset called LIVE. In addition, we propose an evaluation metric that accounts for timeliness, robustness, and accuracy. Furthermore, to handle the OVEL task effectively, we maintain a memory block managed by an LLM and retrieve entity candidates from the knowledge base to augment the LLM's performance on memory management. Experimental results demonstrate the effectiveness and efficiency of our method.
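The abstract's core mechanism can be pictured as a streaming loop: as transcript chunks of an online video arrive, an LLM-managed memory block is updated, with entity candidates retrieved from the knowledge base supplied as grounding. The sketch below is a minimal illustration under assumptions of our own (toy lexical retrieval, a placeholder memory-update rule in place of an actual LLM call); all class and method names are hypothetical and do not reflect the paper's implementation.

```python
# Hedged sketch of an OVEL-style pipeline: a bounded memory block is
# updated per streaming transcript chunk, and entity candidates are
# retrieved from a knowledge base to ground each update. In the full
# system an LLM would perform the memory update and the final linking;
# here a simple append-and-truncate stands in for that call.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    entities: dict  # entity name -> textual description

    def retrieve(self, query, k=2):
        # Toy lexical retrieval (assumption): rank entities by word
        # overlap between the query chunk and the entity description.
        def score(name):
            desc_words = set(self.entities[name].lower().split())
            return len(set(query.lower().split()) & desc_words)
        return sorted(self.entities, key=score, reverse=True)[:k]

@dataclass
class MemoryManager:
    kb: KnowledgeBase
    memory: list = field(default_factory=list)
    max_items: int = 5  # keep the memory block bounded, as an LLM context requires

    def step(self, chunk):
        """Process one streaming transcript chunk: retrieve candidate
        entities, then update the memory block (placeholder rule)."""
        candidates = self.kb.retrieve(chunk)
        self.memory.append(chunk)
        self.memory = self.memory[-self.max_items:]
        return candidates
```

A usage example: feeding the chunk "this red lipstick is matte and long lasting" into a knowledge base containing a lipstick and a serum entity would rank the lipstick first, since its description shares the most words with the chunk.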

Citations (1)
