WinoViz: Probing Visual Properties of Objects Under Different States (2402.13584v1)

Published 21 Feb 2024 in cs.CL

Abstract: Humans perceive and comprehend different visual properties of an object based on specific contexts. For instance, we know that a banana turns brown when it "becomes rotten," whereas it appears "green" when it is "unripe." Previous studies on probing visual commonsense knowledge have primarily focused on examining LLMs' understanding of typical properties (e.g., colors and shapes) of objects. We present WinoViz, a text-only evaluation dataset, consisting of 1,380 examples that probe the reasoning abilities of LLMs regarding variant visual properties of objects under different contexts or states. Our task is challenging since it requires pragmatic reasoning (finding intended meanings) and visual knowledge reasoning. We also present multi-hop data, a more challenging version of our data, which requires multi-step reasoning chains to solve our task. In our experimental analysis, our findings are: a) LLMs such as GPT-4 demonstrate effective performance, but when it comes to multi-hop data, their performance is significantly degraded. b) Large models perform well on pragmatic reasoning, but visual knowledge reasoning is a bottleneck in our task. c) Vision-LLMs outperform their language-model counterparts. d) A model with machine-generated images performs poorly in our task. This is due to the poor quality of the generated images.
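To make the task format concrete, below is a minimal sketch of how a WinoViz-style two-way choice could be represented and scored with a text-only LLM. The field names, prompt wording, example sentences, and the `query_llm` callable are illustrative assumptions for this sketch, not the paper's actual data schema or evaluation code.

```python
# Hypothetical sketch of a WinoViz-style example and a zero-shot
# evaluation loop. All names and formats here are assumptions made
# for illustration; they are not taken from the released dataset.

from dataclasses import dataclass


@dataclass
class WinoVizExample:
    premise: str   # sentence placing an object in a specific state
    option_a: str  # one candidate visual-property statement
    option_b: str  # the alternative visual-property statement
    label: str     # "A" or "B": the option consistent with the premise


# Illustrative item based on the banana example from the abstract.
example = WinoVizExample(
    premise="The banana sat forgotten on the counter until it was rotten.",
    option_a="The banana is brown.",
    option_b="The banana is green.",
    label="A",
)


def build_prompt(ex: WinoVizExample) -> str:
    """Format a two-way multiple-choice prompt for a text-only LLM."""
    return (
        f"Premise: {ex.premise}\n"
        "Which statement is more likely?\n"
        f"A. {ex.option_a}\n"
        f"B. {ex.option_b}\n"
        "Answer with A or B."
    )


def evaluate(examples, query_llm) -> float:
    """Return accuracy of `query_llm` (a str -> str callable) on the examples."""
    correct = 0
    for ex in examples:
        prediction = query_llm(build_prompt(ex)).strip().upper()[:1]
        correct += int(prediction == ex.label)
    return correct / len(examples)


if __name__ == "__main__":
    # Stub model that always answers "A", just to show the evaluation call.
    accuracy = evaluate([example], lambda prompt: "A")
    print(f"Accuracy on the single illustrative item: {accuracy:.2f}")
```

In this framing, the premise supplies the object's state (pragmatic reasoning) and the two options test whether the model can map that state to the correct visual property (visual knowledge reasoning); a multi-hop variant would require chaining an intermediate inference before the property can be selected.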

Authors (4)
  1. Woojeong Jin (17 papers)
  2. Tejas Srinivasan (20 papers)
  3. Jesse Thomason (65 papers)
  4. Xiang Ren (194 papers)
Citations (1)