- The paper introduces Chumor 2.0, a novel, large-scale Chinese humor dataset with detailed human annotations, addressing a significant gap in non-English humor resources for LLMs.
- Benchmarking state-of-the-art language models on Chumor 2.0 reveals they perform only slightly better than random chance (44.6%-60.3% accuracy), falling well short of human performance (78.3%).
- The findings highlight the critical limitations of current LLMs in grasping culturally specific and linguistically nuanced humor, indicating a need for future AI research to integrate deeper cultural context.
Chumor 2.0: Towards Benchmarking Chinese Humor Understanding
The paper presents "Chumor 2.0," a notable contribution to the computational study of humor through its focus on culturally nuanced, non-English humor. Recognizing the paucity of resources for analyzing humor in languages like Chinese, the authors construct a novel Chinese humor explanation dataset sourced from Ruo Zhi Ba (RZB), a platform comparable to Reddit that is known for culturally and intellectually challenging jokes. The dataset addresses a critical gap: existing humor datasets are predominantly English-centric and lack cultural diversity.
Dataset Construction and Characteristics
The Chumor dataset is distinguished by its scale and by its coverage of humor styles unique to Chinese culture. It spans several joke categories—cultural, situational, pun-based, homophonic, glyph-based, and cross-lingual—illustrating the varied humor mechanisms in Chinese jokes. This categorization aids in analyzing the distinct challenges each mechanism poses to LLMs. The dataset comprises 3,339 instances with manually annotated humor explanations, each labeled as accurate or inaccurate by human annotators, marking a significant advance over previous work in both size and thematic specificity.
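The summary does not specify the dataset's release format, but as a rough illustration, one might represent an instance as in the following minimal Python sketch. All field names here are hypothetical, chosen to mirror the annotation scheme described above:

```python
import json
from dataclasses import dataclass

# Hypothetical record layout for a Chumor instance; the field names are
# illustrative and not the dataset's actual schema.
@dataclass
class ChumorInstance:
    joke: str          # the RZB joke text (Chinese)
    explanation: str   # a candidate humor explanation
    label: bool        # True if annotators judged the explanation accurate
    category: str      # e.g. "pun-based", "homophonic", "glyph-based"

def load_instances(path: str) -> list[ChumorInstance]:
    """Load instances from an assumed JSON-lines file, one record per line."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            instances.append(ChumorInstance(**json.loads(line)))
    return instances
```

Framing each instance as a (joke, explanation, label) triple reflects the paper's setup: models judge whether a given explanation is accurate rather than generating explanations from scratch.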
Evaluation of LLMs
The paper evaluates ten LLMs, both open-source and commercial, on the Chumor dataset using direct and chain-of-thought prompting strategies. The results reveal that all tested models perform only marginally better than random chance, with accuracy ranging from 44.6% to 60.3%, while humans reach 78.3%. This substantial gap underscores the difficulty LLMs have in grasping culturally specific humor. Notably, direct prompting generally outperformed chain-of-thought prompting, suggesting that step-by-step reasoning can introduce over-analysis without clear benefit; a sketch of the two strategies follows below.
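To make the two prompting strategies concrete, here is a minimal evaluation sketch. The prompt wording and the `query_llm` callable are assumptions for illustration, not the paper's actual harness; it reuses the hypothetical `ChumorInstance` fields from the earlier sketch.

```python
# Direct prompting: ask for a yes/no judgment immediately.
DIRECT_TEMPLATE = (
    "Joke: {joke}\n"
    "Explanation: {explanation}\n"
    "Is this explanation of the joke accurate? Answer 'yes' or 'no'."
)

# Chain-of-thought prompting: elicit step-by-step reasoning first.
COT_TEMPLATE = (
    "Joke: {joke}\n"
    "Explanation: {explanation}\n"
    "Think step by step about the joke's wordplay and cultural context, "
    "then answer 'yes' or 'no': is this explanation accurate?"
)

def evaluate(instances, query_llm, template):
    """Return accuracy of the model's yes/no judgments against human labels.

    `query_llm` is any callable mapping a prompt string to a completion
    string (e.g. a thin wrapper around a chat-completion API).
    """
    correct = 0
    for inst in instances:
        prompt = template.format(joke=inst.joke, explanation=inst.explanation)
        answer = query_llm(prompt).strip().lower()
        predicted = answer.startswith("yes")  # crude parse for illustration
        correct += predicted == inst.label
    return correct / len(instances)
```

Under this framing, the paper's observation amounts to `evaluate(data, model, DIRECT_TEMPLATE)` tending to score higher than the `COT_TEMPLATE` variant for the same model.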
Implications and Future Directions
The findings affirm the limitations of current LLMs in interpreting humor grounded in linguistic and cultural intricacies particular to non-English languages. Since humor inherently involves context, wordplay, and cultural cues, the challenges identified here suggest pathways for refining multilingual LLM capabilities. Future research might explore pre-training techniques that integrate cultural context into LLMs, as well as reasoning strategies capable of handling multiple humor mechanisms.
The authors speculate that the dataset and its benchmark may guide the development of culturally aware AI, pushing the boundaries of computational humor understanding. The work also invites further exploration of humor recognition in other less-resourced languages and cultures, contributing to a more inclusive picture of LLMs' capabilities.