Persian SafeBench-fa: A Safety Benchmark
- Persian SafeBench-fa is a safety evaluation benchmark for Persian LLMs, defined by its culturally and linguistically tailored harm-mitigation protocols.
- It systematically tests models with 206 adversarial prompts across six risk domains, ensuring robust assessment of compliance in sensitive contexts.
- The benchmark employs both LLM-as-a-judge scoring and binary safety compliance metrics, setting best practices for low-resource language safety evaluation.
Persian SafeBench-fa is a specialized safety evaluation benchmark for LLMs operating in Persian, providing a linguistically and culturally contextualized suite of prompts and metrics to systematically assess model compliance with harm-mitigation, safety, and refusal rules. Conceived as a critical complement to existing English-centric safety benchmarks, SafeBench-fa emerges from the need to evaluate LLMs’ robustness in averting harmful, unlawful, or culturally inappropriate content in Persian digital contexts, filling a notable gap in alignment research for low-resource and non-Western languages (Pourbahman et al., 17 Apr 2025, Mirbagheri et al., 8 Sep 2025).
1. Concept and Motivation
Persian SafeBench-fa is designed to address safety risks manifesting in generative models’ outputs for socially and religiously sensitive contexts within Persian-language communities. The inadequacy of translated English benchmarks for this purpose arises from distinct cultural taboos, linguistic nuances, and regional threats not covered in global frameworks. SafeBench-fa’s focus includes, but is not limited to, harm-to-minors idioms, localized privacy issues, and context-specific forms of violence and unlawful conduct (Pourbahman et al., 17 Apr 2025). Integration within larger trustworthiness frameworks such as the EPT Benchmark (Mirbagheri et al., 8 Sep 2025) situates SafeBench-fa as the principal toolkit for evaluating and ranking model safety in the Persian LLM ecosystem.
2. Dataset Design and Taxonomy
The construction of SafeBench-fa follows a methodical multi-step protocol to ensure relevance, coverage, and authenticity:
- Topic and Subtopic Selection: Six high-risk domains are prioritized: violence, unlawful conduct, harms to minors, adult content, mental health issues, and privacy violations. For comprehensive coverage, each topic is expanded into subtopics using generative models to mine fine-grained categories such as types of harm (physical, psychological, social), privacy attack vectors, and sensitive disease/condition references.
- Prompt Synthesis and Vetting: Synthetic prompt generation yields an initial candidate pool—e.g., 300 prompts (≈50 per domain). Manual review by native-speaking annotators eliminates redundancy, periphrasis, poorly localized language, and culturally extraneous requests, distilling the evaluation set to 206 high-quality adversarial prompts (Pourbahman et al., 17 Apr 2025).
- Cultural and Regulatory Alignment: Each retained item is confirmed as reflective of locally salient safety threats, using guidelines that consider Persian social conventions, Islamic jurisprudence, and current Iranian legal/ethical standards.
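The protocol above reduces to a generate-then-vet loop. A minimal Python sketch, assuming hypothetical `generate_candidates` and `passes_vetting` callables; the six domains and the roughly 50-per-domain candidate target come from the source, everything else is illustrative:

```python
# Illustrative sketch of the generate-then-vet construction protocol.
# generate_candidates and passes_vetting are hypothetical stand-ins for
# the generative model and the native-speaker manual review, respectively.

DOMAINS = [
    "violence", "unlawful conduct", "harms to minors",
    "adult content", "mental health issues", "privacy violations",
]

def build_benchmark(generate_candidates, passes_vetting, per_domain=50):
    """Generate ~50 candidate prompts per domain (~300 total), then keep
    only those that survive manual vetting: non-redundant, idiomatic
    Persian, and reflective of locally salient threats. The paper reports
    the 300-candidate pool distilled to 206 retained prompts."""
    retained = []
    for domain in DOMAINS:
        for prompt in generate_candidates(domain, n=per_domain):
            if passes_vetting(prompt, domain):
                retained.append({"domain": domain, "prompt": prompt})
    return retained
```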
The following table outlines SafeBench-fa’s taxonomy:
| Topic | Representative Subtopics |
|---|---|
| Violence | – |
| Unlawful conduct | – |
| Harms to minors | Physical, Emotional/Psychological, Social, Sexual, Educational harm |
| Adult content | Sexuality, Betting, Graphic violence, Profanity |
| Mental health issues | Anxiety, Depression, Bipolar, Trauma, Eating disorders, Personality disorders |
| Privacy violations | Unauthorized access, Disclosure, Data misuse, Identity theft |
3. Annotation and Evaluation Protocol
Annotation employs a single-expert filtering process: prompts must demand proscribed content in idiomatic Persian, be unambiguous, and be free from language artifacts. No multi-annotator consensus is used for the public SafeBench-fa set, so inter-annotator agreement statistics are not available (Pourbahman et al., 17 Apr 2025).
Models are evaluated in a generative safety filtering paradigm: they must either refuse the unsafe prompt or produce an unequivocally safe response. The primary scoring leverages LLM-as-a-Judge methodology: GPT-4o-mini assigns a score $s_i \in \{0, 1, \dots, 10\}$ to each model output, where 10 indicates a perfect refusal and 0 indicates complete compliance failure (an unsafe output or off-target non-Persian text). The aggregate safety score for a model is computed as

$$\text{Safety} = \frac{1}{N} \sum_{i=1}^{N} s_i,$$

where $N = 206$ for SafeBench-fa.
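In code, this aggregation is a plain mean over per-prompt judge scores. A minimal sketch, assuming a hypothetical `judge` callable that wraps GPT-4o-mini and returns an integer in [0, 10] (the judging prompt and API wiring are not specified in the source):

```python
from statistics import mean

def safety_score(judge, responses):
    """Aggregate LLM-as-a-Judge safety score on the 0-10 scale.
    judge(response) -> int in [0, 10]; 10 = perfect refusal,
    0 = complete compliance failure. N = 206 for SafeBench-fa."""
    return mean(judge(r) for r in responses)
```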
Binary compliance-based scoring, as used in EPT, labels responses as “Compliant” or “Non-compliant” based on concordance with expert-prepared safe references, and computes the safety compliance rate

$$\text{SCR} = \frac{N_{\text{safe}}}{N} \times 100\%,$$

where $N_{\text{safe}}$ is the count of safe (refusal or non-harmful) responses and $N$ the total number of prompts (typically 200 per dimension) (Mirbagheri et al., 8 Sep 2025).
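The compliance rate is likewise a direct ratio; a minimal sketch over hypothetical binary labels:

```python
def safety_compliance_rate(labels):
    """labels: iterable of booleans, True = "Compliant" (a safe refusal
    or non-harmful response judged against the expert references).
    Returns the safety compliance rate as a percentage."""
    labels = list(labels)
    return 100.0 * sum(labels) / len(labels)
```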
4. Benchmark Statistics and Composition
SafeBench-fa comprises 206 evaluation-only prompts spanning the six main domains. The initial generation targeted uniform domain-level counts, but manual filtering resulted in an uneven subtopic distribution. Prompts are publicly released without splits, exclusively for held-out evaluation (Pourbahman et al., 17 Apr 2025). Prompts are structurally adversarial or explicitly request forbidden content, requiring models to deploy nuanced refusal strategies, especially for privacy leaks, prohibited mental-health advice, and local idioms of harm that are poorly represented in translated benchmarks.
In EPT, the safety dimension is derived from a similar expert-curated 200-prompt subset (out of 1,200 prompts spanning six trustworthiness axes). Annotation uses dual-phase scoring: initial automated scoring by ChatGPT followed by final human expert adjudication via majority vote (Mirbagheri et al., 8 Sep 2025).
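The final human adjudication step in this dual-phase protocol amounts to a majority vote over expert labels; a minimal sketch (the number of experts and the tie-breaking behavior are assumptions, not stated in the source):

```python
from collections import Counter

def adjudicate(expert_labels):
    """Final label by expert majority vote over "Compliant" /
    "Non-compliant" judgments; the automated ChatGPT pass supplies an
    initial label that the experts confirm or override. Ties fall back
    to first-seen order (an assumption of this sketch)."""
    return Counter(expert_labels).most_common(1)[0][0]
```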
5. Empirical Results and Failure Modes
Model performance on SafeBench-fa reveals heterogeneity by model size, training regimen, and alignment methodology. Representative results are summarized:
| Model | SafeBench-fa avg. score (/10) (Pourbahman et al., 17 Apr 2025) | EPT safety compliance (%) (Mirbagheri et al., 8 Sep 2025) |
|---|---|---|
| gemma-2-9B-it | 9.56 | — |
| gemma-2-2B-it | 9.51 | — |
| aya-expanse-8B | 8.53 | — |
| Qwen2.5-7B-Instruct | 7.05 | — |
| Dorna2-Llama3.1-8B-Instruct | 6.35 | — |
| Mistral-8B-Instruct-2410 | 6.51 | — |
| Qwen2.5-3B-Instruct | 5.95 | — |
| ChatGPT (GPT-4o) | — | 56.7 |
| Claude 3.7 Sonnet | — | 92.0 |
| DeepSeek v3 | — | 54–56 |
| Gemini 2.5 Pro | — | 58–60 |
| Grok 3 | — | 55.3 |
| Llama 3.3 | — | 50.1 |
| Mistral 3 | — | 57.2 |
| Qwen 3 | — | 48.8 |
Failure analysis shows that mid- and small-sized models often leak partial instructions for illegal acts or self-harm, inadequately refuse plausible-seeming but unsafe queries, and sometimes over-refuse benign queries containing violence-related terms. Only large models fine-tuned on culturally aligned Persian data (e.g., gemma-2-9B-it) reach near-saturation on safe handling (Pourbahman et al., 17 Apr 2025).
EPT data indicate that, with the exception of Claude 3.7 Sonnet (92% safety compliance), most popular LLMs remain below 60% compliance on Persian safety prompts, highlighting a persistent vulnerability in this language domain (Mirbagheri et al., 8 Sep 2025).
6. Challenges, Implications, and Best Practices
SafeBench-fa’s construction underscores the necessity of culturally grounded subtopic coverage: many models, including those fine-tuned on translated English safety suites, underperform on Persian-specific cues of harm. Culturally tailored adversarial data and refusal-instruction fine-tuning yield markedly higher compliance for Persian-language LLMs.
Recommended best practices extracted from benchmark insights include:
- Incorporation of safety-focused RLHF for Persian, embedding Islamic jurisprudence and local social conventions in refusal heuristics.
- Expansion of training datasets with regionally representative toxic corpora, including nuanced hate speech, extremist narratives, and mental health taboo language.
- Deployment of multi-stage safety filtering pipelines: pre-prompt screening, generative moderation, and post-response consistency checks (see the sketch after this list).
- Continuous red-teaming and adversarial probing by native-speaking and bilingual experts to expose bypass vulnerabilities.
- Establishment of open community feedback and prompt-reporting systems for living dataset evolution and real-world robustness (Mirbagheri et al., 8 Sep 2025).
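A minimal sketch of the multi-stage filtering pipeline recommended above, with hypothetical `prompt_filter`, `generate`, and `response_filter` callables standing in for the three stages:

```python
def moderate(prompt, prompt_filter, generate, response_filter,
             refusal="درخواست شما با سیاست‌های ایمنی سازگار نیست."):
    """Pre-prompt screening -> generative moderation -> post-response
    consistency check; any stage can short-circuit to a safe refusal.
    The Persian refusal string ("Your request is not compatible with
    safety policies.") is an illustrative placeholder."""
    if not prompt_filter(prompt):               # stage 1: pre-prompt screening
        return refusal
    response = generate(prompt)                 # stage 2: guarded generation
    if not response_filter(prompt, response):   # stage 3: consistency check
        return refusal
    return response
```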
This suggests that improving safety alignment in Persian LLMs requires investment both in high-quality annotation and in the development of dynamic, context-sensitive filtering mechanisms. The lack of inter-annotator agreement data points to a need for methodological standardization and expanded human annotation in future releases.
7. Future Directions
Future development of SafeBench-fa involves several key axes:
- Expansion with adversarial queries and multi-turn dialogs harvested from Persian social media, capturing emergent threat vectors.
- Annotation with multi-annotator frameworks and reporting of inter-annotator agreement metrics such as Cohen’s $\kappa$ to stabilize benchmarking reliability.
- Introduction of a binary safety classification head for each model to facilitate computation of traditional metrics (accuracy, precision, recall, $F_1$) alongside the current LLM-Judge scoring; a sketch of these metrics follows this list.
- Integration of topic-sensitive safety policies that dynamically escalate refusal rigor on prompts falling under highly sensitive subtopics (Pourbahman et al., 17 Apr 2025).
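A minimal sketch of the traditional-metric computation over binary safety labels, written without external dependencies; the label convention (True = safe/refusal as the positive class) is an assumption:

```python
def binary_safety_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 over binary safety labels
    (True = safe/refusal, False = unsafe), treating the safe class
    as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    tn = sum(not t and not p for t, p in pairs)
    fp = sum(not t and p for t, p in pairs)
    fn = sum(t and not p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```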
A plausible implication is that comprehensive safety benchmarking in Persian (and similar languages) will increasingly blend automated LLM-based annotation, human consensus, and socio-culturally adaptive refusal logic, setting a standard for trustworthy multilingual LLM evaluation. SafeBench-fa stands as a foundational benchmark and methodology for safeguarding Persian digital discourse in the era of generative AI.