
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions (2405.19088v2)

Published 29 May 2024 in cs.CL and cs.CV

Abstract: Recent advances in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet these models still struggle to understand the nuances of human humor through juxtaposition, particularly when it involves the nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that together create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capability to recognize and interpret these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial and open-source large (vision) language models, we assess their ability to comprehend the complex interplay of narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations of, and potential improvements for, AI in understanding human creative expression.
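The abstract describes a benchmark whose items pair a two-panel comic with questions at increasing depth, from literal description to deep narrative reasoning. The sketch below is a hypothetical illustration of that task structure, not the paper's actual data format or metrics: the item fields, question wording, and the crude token-overlap score are all assumptions standing in for the benchmark's real protocol and evaluation measures.

```python
# Hypothetical sketch of a YesBut-style evaluation item and scorer.
# Field names, questions, and the metric are illustrative only.
from dataclasses import dataclass


@dataclass
class YesButItem:
    comic_id: str
    literal_q: str          # e.g. "What is shown in each panel?"
    contradiction_q: str    # e.g. "What contradiction creates the humor?"
    reference_answer: str   # human-written gold interpretation


def token_f1(prediction: str, reference: str) -> float:
    """Crude token-level F1, a stand-in for proper text-similarity metrics."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    if not pred or not ref:
        return 0.0
    common = len(pred & ref)
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


item = YesButItem(
    comic_id="demo-001",
    literal_q="What is shown in each panel?",
    contradiction_q="What contradiction creates the humor?",
    reference_answer="he says yes but his actions say no",
)
model_output = "he claims yes but acts like no"
score = token_f1(model_output, item.reference_answer)
```

In this framing, a model's free-text interpretation is scored against a human reference; the paper's finding is that even strong models score well on literal questions yet fall short of humans on the deeper contradiction-reasoning questions.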

