Understanding Figurative Meaning through Explainable Visual Entailment (2405.01474v2)

Published 2 May 2024 in cs.CL, cs.AI, and cs.CV

Abstract: Large Vision-Language Models (VLMs) have demonstrated strong capabilities in tasks requiring a fine-grained understanding of literal meaning in images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative meaning, such as metaphors or humor. To close this gap, we propose a new task framing the figurative meaning understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a caption (hypothesis) and justify the predicted label with a textual explanation. The figurative phenomena can be present either in the image, the caption, or both. Utilizing a human-AI collaboration approach, we build the accompanying expert-verified dataset V-FLUTE, containing 6,027 {image, caption, label, explanation} instances spanning five diverse figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. Through automatic evaluation, we find that VLMs struggle to generalize from literal to figurative meaning, particularly when it is present in images. Further, we identify common types of errors in VLM reasoning via human evaluation.
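The abstract describes each V-FLUTE instance as an {image, caption, label, explanation} tuple, with the image serving as the premise and the caption as the hypothesis. A minimal sketch of that record structure, with field names and example values that are illustrative assumptions rather than the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VFLUTEInstance:
    """One explainable visual entailment example (fields per the abstract)."""
    image_path: str    # premise: path to the image
    caption: str       # hypothesis: text that may carry figurative meaning
    label: str         # entailment decision for (image, caption)
    explanation: str   # free-text justification for the label
    phenomenon: str    # one of: metaphor, simile, idiom, sarcasm, humor

# Hypothetical instance; the file name, caption, and explanation are invented.
example = VFLUTEInstance(
    image_path="images/0001.png",
    caption="Her words cut deeper than any knife.",
    label="entailment",
    explanation="The image shows a visibly hurt listener, supporting the "
                "metaphor that the speaker's words caused emotional pain.",
    phenomenon="metaphor",
)
```

A model for this task would take the image and caption as input and produce both the label and the explanation, so evaluation can score the entailment decision and the quality of the justification separately.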
