Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge (2405.16277v3)

Published 25 May 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: LLMs have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' ability in pronoun disambiguation from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.
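The abstract describes the evaluation pipeline only at a high level. As an illustration of the kind of heatmap-based decision such a framework relies on, below is a minimal, hypothetical sketch: given per-word cross-attention heatmaps of the sort a DAAM-style tracer produces over a diffusion run, it scores the overlap between the pronoun's heatmap and each candidate noun's heatmap and picks the stronger match. The function names and the overlap rule here are illustrative assumptions, not the paper's exact criterion.

    import numpy as np

    def resolve_pronoun(pronoun_map: np.ndarray,
                        candidate_maps: dict[str, np.ndarray]) -> str:
        """Pick the candidate noun whose heatmap overlaps the pronoun's
        heatmap most strongly (illustrative decision rule, not the
        paper's scoring function).

        All maps are assumed to be 2-D arrays of identical shape, e.g.
        per-word cross-attention heatmaps from a DAAM-style tracer.
        """
        def normalize(m: np.ndarray) -> np.ndarray:
            m = m.astype(np.float64)
            total = m.sum()
            return m / total if total > 0 else m

        p = normalize(pronoun_map)
        scores = {}
        for noun, cmap in candidate_maps.items():
            c = normalize(cmap)
            # Overlap score: sum of elementwise minima of the two normalized
            # maps (1.0 = identical distributions, 0.0 = disjoint support).
            scores[noun] = float(np.minimum(p, c).sum())
        return max(scores, key=scores.get)

    # Hypothetical usage with random stand-in heatmaps:
    rng = np.random.default_rng(0)
    maps = {"trophy": rng.random((64, 64)), "suitcase": rng.random((64, 64))}
    print(resolve_pronoun(rng.random((64, 64)), maps))

In practice, per-word heatmaps for Stable Diffusion can be obtained with the open-source daam package (the DAAM reference implementation), whose documented interface wraps a diffusers pipeline in a trace context manager and exposes a compute_word_heat_map method on the resulting global heat map; the thresholds and precision computation actually used for WinoVis are specified in the paper itself.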

