
HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance (2311.18273v1)

Published 30 Nov 2023 in cs.CV and cs.MM

Abstract: Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to select, among a batch of candidate images, the one that best entails the target word's meaning within a limited context. In this paper, we propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models, as well as open knowledge bases and datasets. Our system consists of the following key components: (1) Gloss matching: a pretrained bi-encoder model is used to match contexts with the proper senses of the target words; (2) Prompting: matched glosses and other textual information, such as synonyms, are incorporated using a prompting template; (3) Image retrieval: semantically matching images are retrieved from large open datasets using the prompts as queries; (4) Modality fusion: contextual information from the different modalities is fused and used for prediction. Although our system does not produce the most competitive results at SemEval-2023 Task 1, it still outperforms nearly half of the participating teams. More importantly, our experiments reveal acute insights for the field of Word Sense Disambiguation (WSD) and multi-modal learning. Our code is available on GitHub.
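The four components listed in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for the pretrained bi-encoder and vision-language encoders, images are represented by short captions, and the prompt template, the `alpha` fusion weight, and all function names and example data are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'encoder' (assumption: stands in for a pretrained model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_gloss(context, glosses):
    """(1) Gloss matching: pick the gloss most similar to the context."""
    ctx = embed(context)
    return max(glosses, key=lambda g: cosine(ctx, embed(g)))

def build_prompt(context, gloss, synonyms):
    """(2) Prompting: a hypothetical template joining context, gloss, synonyms."""
    return f"{context} {gloss} {' '.join(synonyms)}"

def rank_candidates(prompt, candidates, retrieved, alpha=0.5):
    """(3)+(4): score each candidate by a weighted sum of its similarity to
    the prompt and its best similarity to any retrieved image (toy fusion)."""
    query = embed(prompt)

    def score(caption):
        cand = embed(caption)
        text_sim = cosine(query, cand)
        vis_sim = max((cosine(embed(r), cand) for r in retrieved), default=0.0)
        return alpha * text_sim + (1 - alpha) * vis_sim

    return max(candidates, key=lambda name: score(candidates[name]))

# Made-up example data; captions stand in for image embeddings.
glosses = [
    "sloping land beside a river subject to erosion",
    "financial institution that accepts deposits",
]
candidates = {
    "img_a": "muddy river with sloping land",
    "img_b": "building of a financial institution",
    "img_c": "pine cone on a table",
}
retrieved = ["river land eroded by water"]  # pretend result of image retrieval

gloss = match_gloss("bank erosion", glosses)
prompt = build_prompt("bank erosion", gloss, ["riverbank"])
best = rank_candidates(prompt, candidates, retrieved)
print(gloss)
print(best)
```

In a real system each `embed` call would be replaced by the corresponding pretrained text or image encoder, and the retrieved images would come from a query against a large open image dataset rather than a hand-written list.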
