Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
This paper introduces "Fann or Flop," a benchmark designed to evaluate how well LLMs understand Arabic poetry, a domain characterized by intricate linguistic features, cultural depth, and historical richness. Renowned for its complex stylistic devices and profound thematic range, Arabic poetry poses unique challenges for computational models aiming at comprehensive language comprehension.
Benchmark Composition
The "Fann or Flop" benchmark encompasses a wide array of Arabic poetic works spanning twelve distinct eras, from the Pre-Islamic period to the contemporary era. It covers 21 core poetic genres and diverse metrical forms, ranging from classical structures to modern free verse. This extensive coverage ensures a robust evaluation framework that scrutinizes semantic understanding, metaphor interpretation, prosody, and cultural context—elements critical for authentic Arabic poetry comprehension.
Evaluation Methodology
To assess LLM performance, the benchmark employs a curated corpus of nearly 7,000 poem-explanation pairs, each verified by native Arabic speakers. This validation process ensures linguistic authenticity and interpretive accuracy, providing a reliable basis for evaluating deep cultural and literary reasoning. The evaluation suite includes automatic metrics such as BLEU, BERTScore, and textual entailment, alongside human judgment focused on interpretive depth and fluency.
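To make the automatic-metric side of such an evaluation concrete, the sketch below computes a simplified sentence-level BLEU between a gold explanation and a model-generated one. This is an illustrative pure-Python implementation, not the paper's actual pipeline; real evaluations would use established libraries (e.g. sacrebleu for BLEU, the bert-score package for BERTScore), and the example poem-explanation pair is hypothetical.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions, times a brevity penalty. Add-one smoothing
    keeps a single missing n-gram order from zeroing the score."""
    ref_toks, cand_toks = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand_toks, n))
        ref_counts = Counter(ngrams(ref_toks, n))
        total = sum(cand_counts.values())
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append((clipped + 1) / (total + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty discourages overly short candidates
    bp = min(1.0, math.exp(1 - len(ref_toks) / max(len(cand_toks), 1)))
    return bp * geo_mean

# Hypothetical poem-explanation pair: gold explanation vs. model output
gold = "the poet mourns the ruins of his beloved's abandoned camp"
model = "the poet mourns the ruins of an abandoned camp"
print(f"BLEU-like score: {sentence_bleu(gold, model):.3f}")
```

As the paper's findings suggest, surface-overlap metrics like this reward lexical similarity and can score a fluent but shallow explanation highly, which is why the benchmark pairs them with human judgment of interpretive depth.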
Key Findings
While state-of-the-art LLMs perform impressively on conventional Arabic NLP tasks, they consistently fall short on Arabic poetry, struggling to achieve the interpretive depth and cultural sensitivity the genre requires. Models such as GPT-4o and Gemini-2.5 Flash score well on automatic metrics yet fail to capture the nuanced metaphorical and thematic dimensions of the verse.
Era-Specific Evaluation
The paper provides a detailed era-wise analysis, highlighting discrepancies in model performance across different historical periods. LLMs tend to perform better on poems from modern eras compared to ancient texts, which often feature complex linguistic structures and cultural references. This gap underscores the importance of temporal generalization and culturally informed training.
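An era-wise analysis of this kind amounts to grouping per-poem scores by historical period and comparing the aggregates. The sketch below shows one minimal way to do that in pure Python; the scores and field names are invented for illustration and do not reproduce the paper's data or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-poem evaluation records; "era" labels follow the
# benchmark's historical periods, but these scores are made up.
results = [
    {"era": "Pre-Islamic", "score": 0.41},
    {"era": "Pre-Islamic", "score": 0.38},
    {"era": "Abbasid",     "score": 0.52},
    {"era": "Modern",      "score": 0.71},
    {"era": "Modern",      "score": 0.67},
]

# Group scores by era, then average within each group.
by_era = defaultdict(list)
for row in results:
    by_era[row["era"]].append(row["score"])

era_means = {era: mean(scores) for era, scores in by_era.items()}

# Print eras from weakest to strongest mean score.
for era, avg in sorted(era_means.items(), key=lambda kv: kv[1]):
    print(f"{era:12s} {avg:.2f}")
```

A per-era breakdown like this is what exposes the modern-vs-ancient performance gap the paper reports, rather than a single corpus-level average that would mask it.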
Implications and Future Directions
The "Fann or Flop" benchmark highlights the need for culturally and historically informed benchmarks in the ongoing evaluation and advancement of LLMs. By illuminating the limitations of current models, it calls for enhanced training protocols that integrate rich cultural and literary datasets. The open-source release of this benchmark serves as a catalyst for future research aimed at refining Arabic LLMs and expanding their capabilities in artistic and culturally significant domains.
Conclusion
The introduction of the "Fann or Flop" benchmark marks a significant step toward closing the gaps in Arabic NLP concerning poetic and cultural comprehension. The paper's assessment framework offers valuable insights into the complexities of Arabic poetic expression and sets the stage for further advances in linguistically and culturally nuanced model development. As LLMs continue to evolve, benchmarks that reflect diverse cultural narratives will be crucial to building truly multilingual and culturally adept models.