Naming, Describing, and Quantifying Visual Objects in Humans and LLMs (2403.06935v3)

Published 11 Mar 2024 in cs.CL

Abstract: While human speakers use a variety of different expressions when describing the same object in an image, giving rise to a distribution of plausible labels driven by pragmatic constraints, the extent to which current Vision & Language LLMs (VLLMs) can mimic this crucial feature of language use is an open question. This applies to common, everyday objects, but it is particularly interesting for uncommon or novel objects for which a category label may be lacking or fuzzy. Furthermore, similar patterns of variation are observed among human speakers for highly context-sensitive expressions, such as the quantifiers 'few' or 'most'. In our work, we evaluate VLLMs (FROMAGe, BLIP-2, LLaVA) on three categories (nouns, attributes, and quantifiers) where humans show great subjective variability concerning the distribution over plausible labels, using datasets and resources mostly under-explored in previous work. Our results reveal mixed evidence on the ability of VLLMs to capture human naming preferences at generation time: while some models are good at mimicking human distributions for nouns and attributes, all of them fail to assign quantifiers, a task that requires more accurate, high-level reasoning.
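Because the evaluation hinges on matching a distribution over plausible labels rather than a single gold answer, a distribution-level divergence is the natural yardstick; the reference list includes Lin (1991) on Jensen-Shannon divergence, a standard choice for exactly this comparison. Below is a minimal, illustrative Python sketch of such a comparison. The label set and both probability vectors are made-up placeholders, not data or code from the paper.

```python
import numpy as np

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence (Lin, 1991) between two discrete distributions.

    With base=2 the value is bounded in [0, 1]: 0 means identical
    distributions, 1 means distributions with disjoint support.
    """
    p = np.asarray(p, dtype=float) / np.sum(p)  # normalize, just in case
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        nz = a > 0  # terms with a_i = 0 contribute nothing to the sum
        return np.sum(a[nz] * np.log(a[nz] / b[nz])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical naming distributions for one image over the candidate labels
# ["dog", "dalmatian", "puppy", "animal"] (illustrative numbers only).
human = [0.55, 0.25, 0.15, 0.05]  # relative frequencies across human annotators
model = [0.90, 0.05, 0.05, 0.00]  # e.g. frequencies over sampled VLLM generations

print(f"JSD = {js_divergence(human, model):.3f}")  # low = model mimics human variability
```

A model that always emits the single most frequent human label would score a high divergence here even with perfect top-1 accuracy, which is why this kind of metric, rather than exact-match accuracy, is suited to studying naming variability.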

References (26)
  1. The (un)suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4):861–889.
  2. Roger Brown. 1958. How shall a thing be called? Psychological Review, 65(1):14.
  3. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
  4. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352.
  5. Animal, dog, or dalmatian? Level of abstraction in nominal referring expressions. In CogSci.
  6. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  7. Jessica S. Horst and Michael C. Hout. 2016. The novel object and unusual name (NOUN) database: A collection of novel images for use in experimental research. Behavior Research Methods, 48:1393–1409.
  8. Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.
  9. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar. Association for Computational Linguistics.
  10. Grounding language models to images for multimodal inputs and outputs. In International Conference on Machine Learning, pages 17283–17300. PMLR.
  11. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73.
  12. Willem J. M. Levelt. 1993. Speaking: From intention to articulation. MIT Press.
  13. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020.
  14. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
  15. Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
  16. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
  17. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  18. David R. Olson. 1970. Language and thought: Aspects of a cognitive theory of semantics. Psychological Review, 77(4):257.
  19. Probing the mental representation of quantifiers. Cognition, 181:117–126.
  20. Communicating with cost-based implicature: A game-theoretic approach to ambiguity. In The 16th Workshop on the Semantics and Pragmatics of Dialogue, Paris, September.
  21. Object naming in language and vision: A survey and a new dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5792–5801, Marseille, France. European Language Resources Association.
  22. Humans meet models on object naming: A new dataset and analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1893–1905, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  23. Describing images fast and slow: Quantifying and predicting the variation in human signals during visuo-linguistic processes. arXiv preprint arXiv:2402.01352.
  24. Quantifiers in a multimodal world: Hallucinating vision with language and sound. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 105–116, Minneapolis, Minnesota. Association for Computational Linguistics.
  25. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II, pages 69–85. Springer.
  26. A survey of large language models. arXiv preprint arXiv:2303.18223.
