
Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest (2209.06293v2)

Published 13 Sep 2022 in cs.CL and cs.CV

Abstract: Large neural networks can now generate jokes, but do they really "understand" humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny. These tasks encapsulate progressively more sophisticated aspects of "understanding" a cartoon; key elements are the complex, often surprising relationships between images and captions and the frequent inclusion of indirect and playful allusions to human experience and culture. We investigate both multimodal and language-only models: the former are challenged with the cartoon images directly, while the latter are given multifaceted descriptions of the visual scene to simulate human-level visual understanding. We find that both types of models struggle at all three tasks. For example, our best multimodal models fall 30 accuracy points behind human performance on the matching task, and, even when provided ground-truth visual scene descriptors, human-authored explanations are preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in more than 2/3 of cases. We release models, code, leaderboard, and corpus, which includes newly-gathered annotations describing the image's locations/entities, what's unusual in the scene, and an explanation of the joke.


Summary

  • The paper introduces novel tasks to assess AI humor understanding using the New Yorker Caption Contest framework.
  • The study employs both multimodal and language-only models, revealing a 30-point accuracy gap in matching tasks compared to human performance.
  • The research underscores the complexity of humor comprehension and the need for culturally informed, multimodal AI systems.

Humor "Understanding" Benchmarks from The New Yorker Caption Contest

The paper "Do Androids Laugh at Electric Sheep? Humor 'Understanding' Benchmarks from The New Yorker Caption Contest" presents a meticulous analysis of AI systems' ability to comprehend humor. The authors have devised a series of tasks based on the New Yorker Cartoon Caption Contest to test AI's humor understanding capabilities. These tasks involve matching a cartoon with its corresponding caption, assessing the quality of captions, and explaining why certain captions are humorous. The paper reveals significant gaps in AI performance compared to human benchmarks, underscoring the complexity of humor understanding even with advanced AI systems like GPT-4.

Research Design and Methods

The paper employs both language-only and multimodal models to address three specific tasks:

  1. Matching Task: Models must select the caption that actually belongs to a given cartoon from a set that includes distractor captions.
  2. Quality Ranking Task: Models must distinguish higher-quality captions, separating contest winners from non-finalists.
  3. Explanation Task: Models must generate a natural-language explanation of why a winning caption is funny, the most demanding probe of comprehension (a short prompt sketch follows this list).
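
As a concrete illustration of the explanation task in the language-only setting, here is a minimal sketch of a prompt built from the corpus's annotations (a scene description, a note on what is unusual, and the caption). The field wording and the example values below are hypothetical, not the paper's exact prompt.

```python
# Hypothetical prompt construction for the explanation task; the paper's
# actual few-shot prompts may differ in wording and structure.

def build_explanation_prompt(scene: str, unusual: str, caption: str) -> str:
    """Compose a query from the corpus annotations: a description of the
    scene, a note on what is unusual in it, and the contest caption."""
    return (
        f"Scene: {scene}\n"
        f"Unusual: {unusual}\n"
        f"Caption: {caption}\n"
        "Explain why this caption is funny:"
    )

# Example values are invented for illustration only.
prompt = build_explanation_prompt(
    scene="Two office workers chat beside a water cooler.",
    unusual="A live shark is swimming inside the water cooler.",
    caption="Management insists it improves focus.",
)
print(prompt)
```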

For multimodal evaluation, the authors use CLIP and OFA, models that process the cartoon image directly alongside the text. In contrast, language-only models such as T5 and GPT variants are given rich textual descriptions of each cartoon to simulate human-level visual understanding.
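
For intuition, here is a minimal zero-shot sketch of the matching setup, scoring candidate captions against a cartoon image with an off-the-shelf CLIP model. This is an illustration under assumptions: the paper fine-tunes its multimodal models rather than using CLIP zero-shot, and the checkpoint and file names below are placeholders.

```python
# Zero-shot CLIP caption matching (illustrative sketch, not the paper's
# fine-tuned pipeline).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_caption(image_path: str, candidates: list[str]) -> str:
    """Return the candidate caption with the highest image-text similarity."""
    image = Image.open(image_path)
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_candidates)
    return candidates[logits.argmax(dim=-1).item()]

# One true caption hidden among distractors, as in the matching task.
print(pick_caption("cartoon.png", [
    "A caption that belongs to this cartoon.",
    "A distractor caption from another cartoon.",
]))
```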

Key Findings

The authors report that both types of models struggle significantly on all three tasks. The best multimodal models lag human performance by 30 accuracy points on the matching task. On the explanation task, human-authored explanations are preferred over machine-authored ones in more than two-thirds of head-to-head comparisons, even for few-shot GPT-4 supplied with ground-truth scene descriptions.
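
The two-thirds figure corresponds to a simple pairwise win rate over annotator judgments; a minimal sketch of that computation, with invented judgment data, is:

```python
# Head-to-head win rate: each judgment records whether the annotator
# preferred the human-written or the machine-written explanation.
def human_win_rate(judgments: list[str]) -> float:
    """Fraction of pairwise comparisons won by the human explanation."""
    return sum(pick == "human" for pick in judgments) / len(judgments)

# Invented data: 7 of 9 judgments favor the human explanation (~0.78).
print(human_win_rate(["human"] * 7 + ["machine"] * 2))
```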

This discrepancy highlights the difficulty AI systems face in humor comprehension, which rests on subtle and often culturally dependent associations between text and imagery.

Implications and Future Directions

The findings have dual implications. Practically, they suggest that despite considerable advances, AI still struggles with subtle, human-centric tasks such as humor understanding. Theoretically, the research emphasizes the value of multimodal learning for capturing the nuanced interplay between linguistic and visual stimuli that humor often requires.

From a speculative standpoint, closing this gap may require more sophisticated models that incorporate broader cultural and experiential data. That, in turn, points toward a future where AI could contribute substantively to creative tasks requiring deep contextual understanding, such as writing, education, and entertainment.

Conclusion

The research makes a significant contribution to the study of humor understanding in AI, introducing benchmarks grounded in real human judgments from the caption contest. Continued development of such tasks will be vital for advancing AI's ability to comprehend complex, culturally embedded human phenomena and for building AI-human collaborative systems. The paper serves as a foundational step toward AI systems that can one day engage fully with human humor, while emphasizing that humor understanding remains a formidable challenge for intelligent systems.
