Language Models as Knowledge Bases for Visual Word Sense Disambiguation (2310.01960v1)

Published 3 Oct 2023 in cs.CL and cs.AI

Abstract: Visual Word Sense Disambiguation (VWSD) is a novel, challenging task that lies between linguistic sense disambiguation and fine-grained multimodal retrieval. Recent advances in visiolinguistic (VL) transformers offer off-the-shelf implementations with encouraging results, which, we argue, can be further improved. To this end, we propose knowledge-enhancement techniques that improve the retrieval performance of VL transformers by using Large Language Models (LLMs) as Knowledge Bases. More specifically, knowledge stored in LLMs is retrieved with appropriate prompts in a zero-shot manner, yielding performance gains. Moreover, we convert VWSD into a purely textual question-answering (QA) problem by treating generated image captions as multiple-choice candidate answers. Zero-shot and few-shot prompting strategies are leveraged to explore the potential of this transformation, while Chain-of-Thought (CoT) prompting in the zero-shot setting reveals the internal reasoning steps an LLM follows to select the appropriate candidate. Overall, our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve VWSD.
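The QA transformation described in the abstract can be sketched as follows: each candidate image is replaced by a generated caption, and the captions become multiple-choice answers in a zero-shot prompt handed to an LLM. This is a minimal illustrative sketch; the function name, prompt wording, and letter-based answer format are assumptions for illustration, not details taken from the paper.

```python
def build_vwsd_qa_prompt(target_word, context_phrase, candidate_captions):
    """Format a VWSD instance as a textual multiple-choice QA prompt.

    Each candidate image is represented by a generated caption (e.g. from a
    captioning model); the LLM is then asked to pick the caption that best
    matches the intended sense of the ambiguous target word in context.
    Prompt wording here is a hypothetical example, not the paper's template.
    """
    lines = [
        f'The phrase "{context_phrase}" contains the ambiguous word '
        f'"{target_word}".',
        "Which of the following image descriptions best matches the "
        "intended sense of the phrase?",
    ]
    # Label candidates (A), (B), ... so the model can answer with a letter.
    for idx, caption in enumerate(candidate_captions):
        lines.append(f"({chr(ord('A') + idx)}) {caption}")
    lines.append("Answer with the letter of the best description.")
    return "\n".join(lines)


prompt = build_vwsd_qa_prompt(
    "bank",
    "river bank",
    ["a grassy slope beside a stream",
     "the lobby of a financial institution"],
)
```

The resulting prompt string can be sent to any instruction-following LLM in a zero-shot setting; for the few-shot variant, several solved examples formatted the same way would simply be prepended.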
