TaarofBench: Persian Politeness Evaluation
- TaarofBench is a culturally specific evaluation framework that operationalizes the Persian politeness norm taarof through 450 detailed role-play scenarios.
- The framework features rigorous annotation by native Persian experts and utilizes both automatic and human-centric metrics for validation.
- Empirical findings indicate that while LLMs struggle with nuanced taarof expressions, adaptation strategies like supervised fine-tuning and direct preference optimization significantly improve performance.
TaarofBench is an evaluation framework and dataset designed to measure the ability of LLMs to understand and express the culturally specific system of Persian politeness known as taarof. Unlike general politeness benchmarks, TaarofBench operationalizes the implicit, hierarchical, and contextually contingent norms that characterize Persian social interactions, providing a rigorous testbed for cultural competence in language agents. The benchmark assesses models’ responses to 450 role-play scenarios, spans a representative taxonomy of interaction types, and is validated by native speakers. The evaluation methodology includes both automatic and human-centric metrics, compares model performance with humans of varying cultural backgrounds, and empirically investigates adaptation strategies such as supervised fine-tuning and direct preference optimization.
1. Formal Characterization of Taarof
Taarof is a complex, multi-layered communicative norm pervasive in Iranian culture, where overt expressions of deference, indirectness, ritualized offering, and humility govern social exchanges. In practice, taarof encompasses formal exchanges such as repeated invitations that are expected to be initially declined, the ritualized refusal and acceptance of gifts or services, and the minimization of compliments. Taarof interactions diverge fundamentally from direct or Western-style politeness conventions: compliance with taarof often requires non-literal language and interpretation of implicit social hierarchy, relationship, gender, and setting.
LLMs face difficulty with taarof because its correct realization depends on subtle cues embedded in conversational context, role asymmetry, and environment. Models trained on datasets attuned to Western politeness norms frequently generate responses that, while polite by Gricean or Brown-Levinson standards, violate taarof if they are too direct or fail to incorporate requisite ritualized refusal and negotiation behaviors. This disconnect highlights the inadequacy of imported politeness frameworks and underscores the need for a data-driven, culturally grounded benchmark.
2. Structure and Annotation of TaarofBench
TaarofBench operationalizes taarof through 450 meticulously constructed role-play scenarios, each expressed as a tuple (E, R_A, R_B, C, U, Y), where:
- E denotes the environment (e.g., home, workplace, restaurant),
- R_A is the user role (Speaker A),
- R_B is the model’s role (Speaker B),
- C encodes conversational context,
- U is the initiating user utterance,
- Y specifies the expected culturally congruent response pattern.
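To make this structure concrete, the following Python sketch models a single scenario as a dataclass and assembles a zero-shot role-play prompt from it; the field names, example values, and prompt template are illustrative assumptions rather than the dataset’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One TaarofBench-style role-play scenario (field names are illustrative)."""
    environment: str        # E: e.g., "restaurant"
    user_role: str          # R_A: Speaker A, played by the user
    llm_role: str           # R_B: Speaker B, played by the model
    context: str            # C: conversational context
    user_utterance: str     # U: initiating utterance from Speaker A
    expected_response: str  # Y: expected culturally congruent response pattern

def build_roleplay_prompt(s: Scenario) -> str:
    """Assemble a zero-shot role-play prompt from a scenario (hypothetical template)."""
    return (
        f"You are {s.llm_role} in a {s.environment}. {s.context}\n"
        f'{s.user_role} says: "{s.user_utterance}"\n'
        f"Reply as {s.llm_role} would in this situation."
    )

example = Scenario(
    environment="restaurant",
    user_role="a guest",
    llm_role="the host who extended the invitation",
    context="The bill has arrived and both diners reach for it.",
    user_utterance="Please, let me pay this time.",
    expected_response="Insist on paying and politely refuse the guest's offer.",
)
print(build_roleplay_prompt(example))
```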
Scenarios were derived from the academic literature and iteratively augmented using advanced LLMs (GPT-4), then validated for fidelity and diversity by five native Persian-speaking experts with backgrounds in NLP and sociolinguistics. The dataset covers twelve representative interaction topics, including but not limited to:
- Gift-giving and acceptance
- Making and refusing requests
- Food and drink offering
- Compliment exchange
- Payment negotiation
- Invitations

This breadth ensures the dataset reflects the spectrum of contexts where taarof is operative or suppressed, with nuanced annotation of when indirectness, hesitation, or role-dependent deference is required.
Each scenario’s expected response is rigorously annotated to account both for obligatory taarof and for contexts where its deployment is socially inappropriate.
3. Empirical Findings and LLM Performance
Evaluation of five contemporary LLMs—GPT-4o, Claude 3.5 Haiku, Llama 3-8b-instruct, DeepSeek V3, and Dorna (a Llama 3 variant fine-tuned on Persian)—reveals systematic deficits in cultural alignment:
- On taarof-expected scenarios, no model exceeded 42% accuracy, with accuracy rates 40–48% below native Persian speakers.
- On non-taarof scenarios, model accuracy was substantially higher (76–93%), indicating that failures are concentrated in instances requiring nuanced cultural reasoning.
- Prompting in Persian, as opposed to English, significantly increased performance for some models (e.g., DeepSeek V3 improved from 36.6% to 68.6%), demonstrating that culturally relevant linguistic cues activate latent model knowledge.
- Larger models (GPT-4o, Claude 3.5, DeepSeek V3) showed resilience to the omission of explicit “Iran” cues, while smaller models exhibited pronounced performance drops when cultural or geographic context was obfuscated.
Topic-specific analysis showed that “gift” scenarios pose less of a challenge for models, likely due to broader cross-cultural congruence, whereas “making a request” and “compliment” interactions produce higher error rates owing to their greater context-dependence within taarof protocol.
Accuracy rates for human participants, which provide essential context for model evaluation, follow a clear gradient:
- Native Persian speakers: 81.8%
- Heritage speakers: 60.0%
- Non-Iranians: 42.3%

This delineates the complexity of taarof as a learned cultural practice and validates the human-likeness metric employed in benchmarking.
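As an illustration of how such cohort- and condition-level accuracies can be tabulated from scenario-level judgments, the sketch below groups binary correctness labels by responder and by whether taarof is expected; the column names and toy data are hypothetical, not drawn from the released benchmark.

```python
import pandas as pd

# Hypothetical scenario-level judgments: 1 = the response matches the expected
# (taarof or non-taarof) behavior, 0 = it does not. Column names are illustrative.
judgments = pd.DataFrame({
    "responder": ["native", "native", "heritage", "non_iranian", "gpt-4o", "gpt-4o"],
    "taarof_expected": [True, True, True, True, True, False],
    "correct": [1, 1, 0, 0, 0, 1],
})

# Accuracy (%) per responder, split by taarof-expected vs. non-taarof scenarios.
accuracy = (
    judgments
    .groupby(["responder", "taarof_expected"])["correct"]
    .mean()
    .mul(100)
    .round(1)
)
print(accuracy)
```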
4. Model Adaptation Strategies and Quantitative Improvement
TaarofBench empirically tests two principal adaptation methods to improve LLM performance on culturally nuanced tasks:
- Supervised Fine-Tuning (SFT): Directly fine-tunes the model on scenario-response pairs with correct taarof-conforming outputs, optimizing the next-token prediction loss for the proper response distribution.
- Result: an absolute improvement of approximately 21.8% in model alignment with taarof expectations.
- Direct Preference Optimization (DPO): Utilizes preference learning by contrasting positive (taarof-conforming) and negative (taarof-incongruent) responses, without recourse to an externally trained reward model or classifier; a minimal sketch of the objective appears below.
- Result: Improvement of 42.3%, with DPO nearly doubling the F1 score on challenging scenarios and yielding a calibrated response profile that closely approaches native-speaker behavior (up to 79.5% vs. the 81.8% human ceiling).
Both methods present a tractable path for instantiating culturally adaptive LLMs, confirming that with sufficient annotation and proper learning objectives, LLMs can be guided to express contextually coherent indirectness, refusal, and deference.
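For concreteness, the sketch below renders the DPO objective over (prompt, chosen, rejected) triples in plain PyTorch, where chosen responses conform to taarof and rejected ones violate it. This is a generic formulation of the standard DPO loss under assumed inputs, not the paper’s training code; SFT, by contrast, simply minimizes next-token cross-entropy on the taarof-conforming responses.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # scales the implicit reward margin
) -> torch.Tensor:
    """DPO: prefer taarof-conforming over taarof-violating responses relative to a
    frozen reference model, with no separately trained reward model or classifier."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random log-probabilities stand in for summed token-level log-probs
# of each response under the policy and reference models.
batch = 4
loss = dpo_loss(
    policy_chosen_logps=torch.randn(batch),
    policy_rejected_logps=torch.randn(batch),
    ref_chosen_logps=torch.randn(batch),
    ref_rejected_logps=torch.randn(batch),
)
print(loss.item())
```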
5. Human Evaluation Methodology and Insights
The human evaluation protocol comprises a balanced cohort of 33 participants (11 native Persian speakers, 11 heritage speakers, 11 non-Iranians). Each responds to a subset of 30 TaarofBench scenarios in a zero-shot, open-ended role-play format, directly paralleling the LLM evaluation settings. The results confirm a substantial performance gap between culturally enculturated and non-enculturated users.
Human baselines provide essential calibration for automatic metrics. A secondary evaluation, where GPT-4 acts as an external judge, shows 94% agreement with expert human judgments, supporting the reliability of evaluation protocols and further highlighting the pitfalls of overreliance on generic “politeness” classifiers trained on Western conversational data.
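A hedged sketch of such an LLM-as-judge step, using the OpenAI Python client, is shown below; the judge prompt, model name, and yes/no output convention are illustrative assumptions rather than the paper’s exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_response(scenario: str, expected_behavior: str, model_response: str) -> bool:
    """Ask a judge model whether a response matches the expected taarof behavior.
    The prompt wording and the yes/no convention are illustrative, not the paper's."""
    prompt = (
        "You are evaluating adherence to Persian taarof etiquette.\n"
        f"Scenario: {scenario}\n"
        f"Expected behavior: {expected_behavior}\n"
        f"Response to judge: {model_response}\n"
        "Does the response conform to the expected behavior? Answer 'yes' or 'no'."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model; substitute as needed
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```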
The performance gradient—correlated with depth of experience in Persian cultural practice—demonstrates that real competence in taarof requires not only linguistic but also cultural and pragmatic conditioning. This underpins the critical importance of both in-domain data and culturally sensitive evaluation frameworks in LLM alignment.
6. Broader Implications and Applications
The construction and findings of TaarofBench have significant implications for the future of culturally attuned AI:
- Standard “polite” LLM responses by Western metrics can systematically violate non-Western norms, causing communication breakdowns in sensitive cross-cultural contexts.
- Direct application scenarios include educational tools for cultural training, customer service bots capable of engaging in nuanced diplomatic or service exchanges, and digital agents for cultural preservation.
- Embedding benchmarks such as TaarofBench into broader evaluation suites is necessary for rigorous global deployment of AI—models must demonstrate context-dependent adaptation, not simply generic “politeness”.
- The methodologies (role-play formulation, expert-validated annotation, contrastive adaptation) provide a blueprint for operationalizing other culture-specific communicative practices.
- A plausible implication is that similar frameworks could be extended to capture the subtleties of etiquette protocols in other cultures, advancing research on AI capable of robust, global, context-aware communication.
7. Limitations and Research Trajectory
TaarofBench, as formalized, is subject to coverage constraints: it targets a curated taxonomy of everyday scenarios validated by expert informants, not the full spectrum of interactional settings. The fine granularity of required pragmatic inferences remains challenging, especially in edge cases where taarof is contextually inappropriate. These limitations highlight the difficulty of scaling to richer scenarios or of generalizing to spontaneous, multi-party interactions.
Potential research directions include:
- Scaling benchmark construction to include more diverse and subtle interaction types.
- Exploring compositional approaches for scenario generation that encode escalating pragmatic ambiguity.
- Investigating curriculum learning strategies wherein LLMs are incrementally exposed to complex cultural negotiation patterns.
- Extending adaptation techniques beyond fine-tuning and preference optimization to incorporate elements of self-reflection or role conditioning.
TaarofBench thus forms a foundational step for the principled development and evaluation of AI systems capable of participating in, and mediating, culturally complex social environments.
Summary of reported accuracy:

| Model / cohort | Accuracy (Taarof Expected) | Accuracy (Non-Taarof) |
|---|---|---|
| GPT-4o | <42% | 76–93% |
| DeepSeek V3 (Persian prompting) | 68.6% | — |
| Human (Native) | 81.8% | — |
| Human (Heritage) | 60.0% | — |
| Human (Non-Iranian) | 42.3% | — |
TaarofBench, as the first rigorously annotated benchmark for Persian taarof, exposes and quantifies a substantial gap between current LLM output and the behavioral expectations of Persian-speaking societies. The techniques and findings demonstrate both the feasibility and necessity of integrating culturally specific evaluation and adaptation mechanisms as foundational elements in the next generation of globally deployed LLMs (Sadr et al., 1 Sep 2025).