Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
This paper introduces "Fann or Flop," a benchmark designed to evaluate how well LLMs understand Arabic poetry, a domain characterized by intricate linguistic features, cultural depth, and historical richness. Renowned for its complex stylistic devices and profound thematic range, Arabic poetry poses unique challenges for computational models aiming at comprehensive language comprehension.
Benchmark Composition
The "Fann or Flop" benchmark encompasses a wide array of Arabic poetic works spanning twelve distinct eras, from the Pre-Islamic period to the contemporary era. It covers 21 core poetic genres and diverse metrical forms, ranging from classical structures to modern free verse. This extensive coverage ensures a robust evaluation framework that scrutinizes semantic understanding, metaphor interpretation, prosody, and cultural context—elements critical for authentic Arabic poetry comprehension.
Evaluation Methodology
To assess LLM performance, the benchmark employs a curated corpus of nearly 7,000 poem-explanation pairs, each verified by native Arabic speakers. This validation process ensures linguistic authenticity and interpretive accuracy, providing a reliable basis for evaluating deep cultural and literary reasoning. The evaluation suite includes automatic metrics such as BLEU, BERTScore, and textual entailment, alongside human judgment focused on interpretive depth and fluency.
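To make the automatic-metric side of such an evaluation concrete, the sketch below computes a simplified sentence-level BLEU between a gold explanation and a model-generated one. This is an illustrative pure-Python implementation, not the paper's actual pipeline; real evaluations would use established libraries (e.g. sacrebleu for BLEU, the bert-score package for BERTScore), and the example poem-explanation pair is hypothetical.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions, times a brevity penalty. Add-one smoothing
    keeps a single missing n-gram order from zeroing the score."""
    ref_toks, cand_toks = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand_toks, n))
        ref_counts = Counter(ngrams(ref_toks, n))
        total = sum(cand_counts.values())
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append((clipped + 1) / (total + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty discourages overly short candidates
    bp = min(1.0, math.exp(1 - len(ref_toks) / max(len(cand_toks), 1)))
    return bp * geo_mean

# Hypothetical poem-explanation pair: gold explanation vs. model output
gold = "the poet mourns the ruins of his beloved's abandoned camp"
model = "the poet mourns the ruins of an abandoned camp"
print(f"BLEU-like score: {sentence_bleu(gold, model):.3f}")
```

As the paper's findings suggest, surface-overlap metrics like this reward lexical similarity and can score a fluent but shallow explanation highly, which is why the benchmark pairs them with human judgment of interpretive depth.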
Key Findings
While state-of-the-art LLMs perform impressively on conventional Arabic NLP tasks, they consistently fall short on Arabic poetry, struggling to achieve the interpretive depth and cultural sensitivity the genre requires. Models such as GPT-4o and Gemini-2.5 Flash score well on automatic metrics yet fail to capture the nuanced metaphorical and thematic dimensions of the verse.
Era-Specific Evaluation
The paper provides a detailed era-wise analysis, highlighting discrepancies in model performance across different historical periods. LLMs tend to perform better on poems from modern eras compared to ancient texts, which often feature complex linguistic structures and cultural references. This gap underscores the importance of temporal generalization and culturally informed training.
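An era-wise analysis of this kind amounts to grouping per-poem scores by historical period and comparing the aggregates. The sketch below shows one minimal way to do that in pure Python; the scores and field names are invented for illustration and do not reproduce the paper's data or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-poem evaluation records; "era" labels follow the
# benchmark's historical periods, but these scores are made up.
results = [
    {"era": "Pre-Islamic", "score": 0.41},
    {"era": "Pre-Islamic", "score": 0.38},
    {"era": "Abbasid",     "score": 0.52},
    {"era": "Modern",      "score": 0.71},
    {"era": "Modern",      "score": 0.67},
]

# Group scores by era, then average within each group.
by_era = defaultdict(list)
for row in results:
    by_era[row["era"]].append(row["score"])

era_means = {era: mean(scores) for era, scores in by_era.items()}

# Print eras from weakest to strongest mean score.
for era, avg in sorted(era_means.items(), key=lambda kv: kv[1]):
    print(f"{era:12s} {avg:.2f}")
```

A per-era breakdown like this is what exposes the modern-vs-ancient performance gap the paper reports, rather than a single corpus-level average that would mask it.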
Implications and Future Directions
The "Fann or Flop" benchmark highlights the need for culturally and historically informed benchmarks in the ongoing evaluation and advancement of LLMs. By illuminating the limitations of current models, it calls for enhanced training protocols that integrate rich cultural and literary datasets. The open-source release of this benchmark serves as a catalyst for future research aimed at refining Arabic LLMs and expanding their capabilities in artistic and culturally significant domains.
Conclusion
The introduction of the "Fann or Flop" benchmark marks a significant step toward closing the gaps in Arabic NLP concerning poetic and cultural comprehension. The paper's assessment framework offers valuable insights into the complexities of Arabic poetic expression and sets the stage for further advances in linguistically and culturally nuanced model development. As LLMs continue to evolve, benchmarks that reflect diverse cultural narratives will be crucial to building truly multilingual and culturally adept models.