- The paper introduces Chumor 2.0, a novel, large-scale Chinese humor dataset with detailed human annotations, addressing a significant gap in non-English humor resources for LLMs.
- Benchmarking state-of-the-art language models on Chumor 2.0 reveals they perform only slightly better than random chance (44.6%-60.3% accuracy), falling well short of human performance (78.3%).
- The findings highlight the critical limitations of current LLMs in grasping culturally specific and linguistically nuanced humor, indicating a need for future AI research to integrate deeper cultural context.
Chumor 2.0: Towards Benchmarking Chinese Humor Understanding
The paper presents "Chumor 2.0," a notable contribution to the computational study of humor through its focus on culturally nuanced, non-English humor. Recognizing the paucity of resources for analyzing humor in languages like Chinese, the authors construct a novel Chinese humor explanation dataset sourced from Ruo Zhi Ba (RZB), a platform comparable to Reddit that is known for culturally and intellectually challenging jokes. The dataset addresses a critical gap: existing humor datasets are predominantly English-centric and lack cultural diversity.
Dataset Construction and Characteristics
The Chumor dataset is distinguished by its scale and by its coverage of humor styles unique to Chinese culture. It spans several joke categories—cultural, situational, pun-based, homophonic, glyph-based, and cross-lingual—illustrating the varied humor mechanisms in Chinese jokes. This categorization aids in analyzing the distinct challenges each mechanism poses to LLMs. The dataset comprises 3,339 instances with manually annotated humor explanations, each labeled as accurate or inaccurate by human annotators, marking a significant advance over previous work in both size and thematic specificity.
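The summary does not specify the dataset's release format, but as a rough illustration, one might represent an instance as in the following minimal Python sketch. All field names here are hypothetical, chosen to mirror the annotation scheme described above:

```python
import json
from dataclasses import dataclass

# Hypothetical record layout for a Chumor instance; the field names are
# illustrative and not the dataset's actual schema.
@dataclass
class ChumorInstance:
    joke: str          # the RZB joke text (Chinese)
    explanation: str   # a candidate humor explanation
    label: bool        # True if annotators judged the explanation accurate
    category: str      # e.g. "pun-based", "homophonic", "glyph-based"

def load_instances(path: str) -> list[ChumorInstance]:
    """Load instances from an assumed JSON-lines file, one record per line."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            instances.append(ChumorInstance(**json.loads(line)))
    return instances
```

Framing each instance as a (joke, explanation, label) triple reflects the paper's setup: models judge whether a given explanation is accurate rather than generating explanations from scratch.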
Evaluation of LLMs
The paper evaluates ten LLMs, both open-source and commercial, on the Chumor dataset using direct and chain-of-thought prompting strategies. The results reveal that all tested models perform only marginally better than random chance, with accuracy ranging from 44.6% to 60.3%, while humans reach 78.3%. This substantial gap underscores the difficulty LLMs have in grasping culturally specific humor. Notably, direct prompting generally outperformed chain-of-thought prompting, suggesting that step-by-step reasoning can introduce over-analysis without clear benefit; a sketch of the two strategies follows below.
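To make the two prompting strategies concrete, here is a minimal evaluation sketch. The prompt wording and the `query_llm` callable are assumptions for illustration, not the paper's actual harness; it reuses the hypothetical `ChumorInstance` fields from the earlier sketch.

```python
# Direct prompting: ask for a yes/no judgment immediately.
DIRECT_TEMPLATE = (
    "Joke: {joke}\n"
    "Explanation: {explanation}\n"
    "Is this explanation of the joke accurate? Answer 'yes' or 'no'."
)

# Chain-of-thought prompting: elicit step-by-step reasoning first.
COT_TEMPLATE = (
    "Joke: {joke}\n"
    "Explanation: {explanation}\n"
    "Think step by step about the joke's wordplay and cultural context, "
    "then answer 'yes' or 'no': is this explanation accurate?"
)

def evaluate(instances, query_llm, template):
    """Return accuracy of the model's yes/no judgments against human labels.

    `query_llm` is any callable mapping a prompt string to a completion
    string (e.g. a thin wrapper around a chat-completion API).
    """
    correct = 0
    for inst in instances:
        prompt = template.format(joke=inst.joke, explanation=inst.explanation)
        answer = query_llm(prompt).strip().lower()
        predicted = answer.startswith("yes")  # crude parse for illustration
        correct += predicted == inst.label
    return correct / len(instances)
```

Under this framing, the paper's observation amounts to `evaluate(data, model, DIRECT_TEMPLATE)` tending to score higher than the `COT_TEMPLATE` variant for the same model.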
Implications and Future Directions
The findings affirm the limitations of current LLMs in interpreting humor grounded in linguistic and cultural intricacies particular to non-English languages. Since humor inherently involves context, wordplay, and cultural cues, the challenges identified here suggest pathways for refining multilingual LLM capabilities. Future research might explore pre-training techniques that integrate cultural context into LLMs, as well as reasoning strategies capable of handling multiple humor mechanisms.
The authors speculate that the dataset and its benchmark may guide the development of culturally aware AI, pushing the boundaries of computational humor understanding. The work also invites further exploration of humor recognition in other less-resourced languages and cultures, contributing to a more inclusive picture of LLMs' capabilities.