MasalBench: Persian Proverbs Evaluation
- MasalBench is a benchmark that assesses multilingual LLMs’ ability to interpret Persian proverbs and map them to English equivalents.
- It features two key tasks: multiple-choice contextual identification of proverbs in dialogues, and binary cross-cultural mapping to semantically aligned English proverbs.
- It employs a hybrid annotation pipeline combining manual curation and LLM-driven generation to ensure robust and culturally accurate evaluations.
MasalBench is a comprehensive evaluation suite designed to probe multilingual LLMs on contextual and cross-cultural understanding of Persian proverbs, a fundamental component of everyday communication in Persian, a low-resource language. Developed to address the research gap in LLM comprehension of non-English figurative and cultural language, MasalBench operationalizes two key tasks: contextual interpretation of proverbs within conversational settings, and cross-lingual mapping to semantically equivalent English proverbs. The benchmark also provides a systematic methodology for dataset construction, rigorous annotation, and empirical evaluation of leading LLMs, setting a precedent for culturally grounded assessment frameworks in under-represented languages (Kalhor et al., 29 Jan 2026).
1. Benchmark Structure and Task Formulation
MasalBench consists of two main evaluation tracks:
Task 1: Contextual Identification of Persian Proverbs
This task focuses on metaphorical reasoning in conversational Persian. Each data point contains a concise two-speaker dialogue where the reply incorporates a Persian proverb. The LLM is provided with four candidate explanations and must select the one that best captures the intended conversational meaning. Distractor explanations are carefully constructed to probe for three specific error types: (i) literal interpretations (word-by-word), (ii) plausible but incorrect inferences, and (iii) irrelevant options.
Task 2: Cross-Cultural Mapping to English Proverbs
This component probes analogical and cultural transfer abilities. For each Persian proverb, LLMs face a binary-choice question pairing two English proverbs—one semantically/functionally equivalent and one form-similar but semantically divergent distractor. The LLM's task is to identify the genuine cultural equivalent.
These dual tasks jointly measure in-language metaphor comprehension and the ability to perform semantic analogy across languages, advancing the assessment of both surface-level and deep cultural understanding in LLMs.
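The two item formats above can be sketched as simple data structures. The field names here are illustrative assumptions for exposition, not the benchmark's actual release schema:

```python
from dataclasses import dataclass

@dataclass
class ContextualItem:
    """Task 1: select the explanation matching the proverb's intended meaning."""
    dialogue: list          # two-speaker exchange; the reply incorporates the proverb
    proverb: str            # the Persian proverb used in the reply
    options: list           # four candidate explanations (order randomized at test time)
    answer_index: int       # index of the correct explanation
    distractor_types: list  # per-option labels, e.g. "literal", "plausible", "irrelevant"

@dataclass
class MappingItem:
    """Task 2: choose the culturally equivalent English proverb."""
    proverb: str       # source Persian proverb
    options: list      # two English proverbs: one equivalent, one form-similar distractor
    answer_index: int  # index of the genuine cultural equivalent
```

A Task 1 item thus carries one gold explanation and three typed distractors, while a Task 2 item is a binary choice.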
2. Dataset Construction and Annotation Pipeline
The benchmark is grounded in authentic idiomatic material, sourced from “Foote Koozegari: Persian Proverbs and Their Stories” (Rahmandoost et al., 2011). Extraction leveraged Gemini 2.5 Pro with OCR-driven prompts (temperature = 1, top-p = 0.95), followed by manual preprocessing for typographical accuracy and a frequency-based selection of 1,000 proverbs by native speakers.
Dialogues and distractors were generated using a hybrid approach: for 100 proverbs, all materials were crafted manually by the authors; for the remaining 900 proverbs, dialogues and distractors were generated via Gemini prompting and subsequently verified and edited for naturalness and fidelity by bilingual annotators. For Task 2, equivalent and distractor English proverbs were proposed by Gemini, with 700 valid pairs retained after manual filtering.
Annotation quality control required that each item (dialogue, explanation, distractor) be independently reviewed by at least two native Persian speakers, ensuring linguistic and contextual authenticity.
Dataset statistics are summarized as follows:
| Task | Proverbs | Items | Distractor Types |
|---|---|---|---|
| Contextual (Task 1) | 1,000 | 1,000 × 4 options | Literal, plausible, irrelevant |
| Cross-cultural (Task 2) | 700 | 700 binary questions | Form-similar divergent |
No explicit train/dev/test splits are included; the entire dataset is intended for zero-shot evaluation. This design emphasizes robust model generalization on real-world, out-of-distribution scenarios.
3. Evaluation Metrics and Formal Protocol
The primary quantitative metric is accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ represent the standard confusion-matrix components for item-level correctness; for these single-answer multiple-choice items this reduces to the fraction of correctly answered items. In Task 1, an additional breakdown by distractor type (literal, plausible, irrelevant) is reported to enable fine-grained error analysis. No additional metrics (such as precision, recall, or F1), nor statistical significance testing, are included in the current release.
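A minimal sketch of this scoring scheme, computing overall accuracy together with the per-distractor error breakdown (the item fields used here are illustrative assumptions):

```python
from collections import Counter

def score(predictions, items):
    """Compute accuracy and an error breakdown by distractor type.

    predictions: chosen option index per item.
    items: dicts with 'answer_index' and 'distractor_types', where
    distractor_types[i] labels option i ('correct' for the gold answer).
    """
    correct = 0
    errors = Counter()  # counts wrong answers by the type of distractor chosen
    for pred, item in zip(predictions, items):
        if pred == item["answer_index"]:
            correct += 1
        else:
            errors[item["distractor_types"][pred]] += 1
    return correct / len(items), dict(errors)
```

The breakdown lets one check, for example, whether errors concentrate on the "plausible but incorrect" distractor, as reported for Task 1.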
Prompting is strictly zero-shot: option order is randomized to mitigate positional bias, no few-shot exemplars are provided, and no fine-tuning is performed. Model outputs are limited to five tokens per response.
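The randomized-option, zero-shot setup can be sketched as follows. The prompt wording and function interface here are assumptions for illustration; the paper's exact template is not reproduced:

```python
import random
import string

def build_prompt(dialogue, options, rng):
    """Assemble a zero-shot multiple-choice prompt with shuffled option order.

    Assumes options[0] is the gold explanation before shuffling; returns the
    prompt text and the letter under which the gold option was placed.
    """
    order = list(range(len(options)))
    rng.shuffle(order)  # randomize option positions to mitigate positional bias
    lines = ["Dialogue:"] + list(dialogue)
    lines += ["", "Which explanation best captures the intended meaning of the reply?"]
    for letter, idx in zip(string.ascii_uppercase, order):
        lines.append(f"{letter}) {options[idx]}")
    lines.append("Answer with a single letter.")
    gold_letter = string.ascii_uppercase[order.index(0)]
    return "\n".join(lines), gold_letter
```

Restricting decoding to a few tokens then suffices to capture the single-letter answer.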
4. Experimental Results and Findings
Eight state-of-the-art LLMs were evaluated, including Llama 4 Scout, Llama 3.3 70B Instruct, Qwen 2.5 72B Instruct, Qwen QwQ 32B, DeepSeek V3.1, DeepSeek R1, GPT-4.1 mini, and GPT-4o mini. The experimental configuration fixed temperature at 0 and top-p at 1.
Task 1 Results (Contextual Understanding):
All models demonstrate strong contextual reasoning over Persian proverbs, with accuracies exceeding 0.90. The highest performance is attained by DeepSeek V3.1 (0.943), while the lowest is Qwen QwQ 32B (0.903). Instruction-tuned variants consistently outperform their base counterparts.
Error Decomposition:
Over 80% of Task 1 errors are attributable to selection of the “plausible but incorrect” distractor. Literal and irrelevant options are almost never selected (approximately 0–2% combined), indicating that even when failing, LLMs' errors largely reflect deep semantic confusion rather than surface or naïve misreading.
Task 2 Results (Cross-Cultural Mapping):
Accuracy drops considerably on this track. The best-performing model, DeepSeek R1, achieves 0.793 accuracy, while the lowest, Llama 4 Scout, achieves 0.647. Gains from instruction tuning persist but are less pronounced, reflecting the additional difficulty of cultural analogy and mapping.
Summary of selected model performances:
| Model | Task 1 (Accuracy) | Task 2 (Accuracy) |
|---|---|---|
| DeepSeek V3.1 | 0.943 | 0.703 |
| DeepSeek R1 | 0.927 | 0.793 |
| Llama 3.3 Inst. | 0.927 | 0.710 |
| GPT-4.1 mini | 0.920 | 0.703 |
| Qwen 2.5 Inst. | 0.911 | 0.691 |
These results indicate that, while LLMs are robust at in-language figurative comprehension when properly tuned, cross-cultural mapping, particularly in low-resource language settings, remains a marked challenge.
5. Contributions to Multilingual and Cultural Language Evaluation
MasalBench is the first large-scale, two-dimensional benchmark that systematically evaluates both in-language contextual understanding and cross-language mapping of proverbs for Persian. Its methodological innovations include a multistage annotation pipeline combining state-of-the-art OCR, LLM-driven generation, and critical native-speaker review. Empirical results illustrate (i) that advanced LLMs, especially when instruction-tuned, already excel in Persian metaphorical reasoning, and (ii) that cultural analogy remains a notable performance bottleneck.
The benchmark offers a reproducible model for extending evaluation to other low-resource languages and for laying bare the limits of current models’ cultural and analogical competencies.
6. Prospective Developments and Research Extensions
Future research directions motivated by MasalBench's release include:
- Extending the framework to additional low-resource languages (e.g., Swahili, Quechua) by replicating the pipeline with domain-specific proverbs.
- Incorporating few-shot or retrieval-augmented paradigms to enhance LLMs’ cross-cultural mapping performance.
- Making available explicit train/dev/test splits and introducing finer-grained metrics (e.g., human–model agreement, error severity), which could facilitate more rigorous comparative analyses.
- Exploring multilingual instruction tuning and targeted cultural knowledge injection, aiming to bridge observed deficiencies in analogical and cultural reasoning.
This suggests that MasalBench is positioned not only as an evaluative instrument but also as a conceptual substrate for advancing culturally grounded NLU, especially in underrepresented language communities.