Multilingual Safety Benchmark
- A multilingual safety benchmark is a framework designed to evaluate LLM safety using detailed taxonomies, high-quality multilingual prompts, and robust evaluation protocols.
- It identifies risks such as toxicity, bias, privacy breaches, and culturally inappropriate responses across text, vision, and audio modalities.
- The benchmark enhances AI alignment through diverse data sources, red teaming, and tailored evaluation methods that address language-specific and cultural nuances.
A multilingual safety benchmark is a systematic framework for evaluating the safety of LLMs and related AI systems across multiple languages, cultural contexts, and modalities. Such benchmarks aim to rigorously identify unsafe outputs—such as toxicity, policy violations, and culturally inappropriate responses—in real and adversarial scenarios, thereby providing essential tools to measure, compare, and ultimately improve the safety of LLMs globally. As LLMs are increasingly deployed in multilingual and multicultural environments, the development and adoption of multilingual safety benchmarks are critical to ensuring equitable, reliable, and responsible AI behavior.
1. Definitions, Scope, and Rationale
A multilingual safety benchmark encompasses diverse safety taxonomies, high-quality multilingual prompts, and robust evaluation protocols to assess how effectively LLMs avoid producing unsafe, harmful, or policy-violating outputs across languages. Motivated by empirical findings that LLMs often exhibit higher rates of unsafe or inconsistent behavior in non-English or low-resource languages, these benchmarks extend beyond simple translation—they systematically capture linguistic, cultural, and legal variations that shape the interpretation of safety and risk. They support both monomodal (text) and multimodal (vision, audio) LLMs, and may test both generation and classification/guardrail tasks.
Multilingual safety benchmarks target a broad spectrum of risks, including but not limited to:
- Hate speech, discrimination, and bias
- Illegal instructions or criminal activity
- Privacy violations and data leakage
- Physical, mental, and societal harms (e.g., medical misinformation, advice on self-harm)
- Cultural and geo-legal norm noncompliance
- Security-critical vulnerabilities and agentic misuse
Prominent initiatives include XSafety (Wang et al., 2023), M-ALERT (Friedrich et al., 19 Dec 2024), USB-SafeBench (Zheng et al., 26 May 2025), SafeWorld (Yin et al., 9 Dec 2024), PolyGuardPrompts (Kumar et al., 6 Apr 2025), RabakBench (Chua et al., 8 Jul 2025), AIR-Bench 2024 (Zeng et al., 11 Jul 2024), as well as evaluation pipelines attached to agentic systems like MAPS (Hofman et al., 21 May 2025) and GUI environments (Yang et al., 4 Jun 2025).
2. Taxonomy Design and Risk Category Coverage
Comprehensive safety benchmarking depends on well-structured taxonomies that map diverse real-world risks. Taxonomies may be derived from academic literature, regulatory frameworks (e.g., EU AI Act, company Acceptable Use Policies), or empirical incidence of harmful LLM output.
Typical structures include:
- Macro categories (e.g., “Hate and Discrimination”, “Illegal Activities”, “Physical Harm”, “Privacy Violations”, “Morality”, “Ethics”, “Cultural Norms”, “Security”)
- Micro categories (e.g., “hate_women”, “substance_cannabis”, “crime_tax”, “crime_kidnapping” (Tedeschi et al., 6 Apr 2024, Friedrich et al., 19 Dec 2024))
- Tertiary or policy-aligned categories (e.g., 314 granular risks in AIR-Bench 2024 (Zeng et al., 11 Jul 2024); 61 subcategories in USB-SafeBench (Zheng et al., 26 May 2025))
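As a minimal illustration, such a macro/micro hierarchy can be represented and pruned programmatically. The fragment below reuses the micro-category labels cited above; the remaining category names are hypothetical placeholders rather than labels from any cited benchmark:

```python
# Illustrative macro -> micro taxonomy fragment. "hate_women", "substance_cannabis",
# "crime_tax", and "crime_kidnapping" appear in the cited work; the other names
# are hypothetical stand-ins.
TAXONOMY = {
    "hate_and_discrimination": ["hate_women", "hate_ethnic", "hate_lgbtq"],
    "illegal_activities": ["crime_tax", "crime_kidnapping", "crime_cyber"],
    "substance_abuse": ["substance_cannabis", "substance_drug_synthesis"],
    "privacy_violations": ["privacy_doxxing", "privacy_data_leakage"],
}

def restrict_taxonomy(taxonomy: dict, excluded_micro: set) -> dict:
    """Drop micro categories, e.g., to align the benchmark with a jurisdiction's policy."""
    return {
        macro: [m for m in micros if m not in excluded_micro]
        for macro, micros in taxonomy.items()
    }

# Example: a deployment context in which cannabis-related content is not flagged.
local_taxonomy = restrict_taxonomy(TAXONOMY, {"substance_cannabis"})
```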
Such detailed taxonomies permit:
- Fine-grained discrimination between distinct forms of harm
- Policy/regulation-oriented customization (removal or re-weighting of categories for jurisdictional alignment)
- Diagnostic analytics to identify specific weak points in model alignment
Multilingual benchmarks extend these categories to cover language-specific and culture-specific harm, including regional taboos (Yin et al., 9 Dec 2024), slang, idioms (Chua et al., 8 Jul 2025), and context-dependent symbolic violations (Qiu et al., 20 May 2025).
3. Dataset Construction and Multilinguality
Modern multilingual safety benchmarks employ extensive data pipelines to curate, generate, and annotate high-quality evaluation samples:
Prompt Generation and Coverage
- Manual curation and red teaming: Adversarial prompts are crafted manually and with LLM assistance for coverage of edge cases (Sun et al., 2023, Chua et al., 8 Jul 2025).
- Semi-automatic and LLM-augmented expansion (Self-Instruct, template-based, iterative questioning): Large datasets, e.g., SafetyPrompts (100K+ Chinese/English prompts) (Sun et al., 2023), PolyGuardMix (1.91M samples across 17 languages) (Kumar et al., 6 Apr 2025), XThreatBench (adversarial prompts in 12 languages) (Banerjee et al., 16 Feb 2025).
- Diverse sources: Native language content, real online dialogues, translations, and synthetic data to ensure broad representation (Zheng et al., 26 May 2025).
- Translation and cross-lingual adaptation: High-quality translation pipelines (human, NMT, LLM-verification, masking for technical content) (Hofman et al., 21 May 2025, Friedrich et al., 19 Dec 2024, Chua et al., 8 Jul 2025). Preservation of slang, ambiguity, and cultural nuance is paramount for low-resource or code-mixed settings.
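A minimal sketch of such a translate-and-verify expansion step is shown below. Here `translate_fn` and `verify_fn` are hypothetical stand-ins for an NMT/LLM translator and an LLM-as-judge fidelity check, and the masking step simply protects inline technical spans during translation:

```python
import re

def mask_technical_spans(text: str):
    """Replace inline code spans with placeholders so translation leaves them untouched."""
    spans = re.findall(r"`[^`]+`", text)
    masked = text
    for i, span in enumerate(spans):
        masked = masked.replace(span, f"[TECH_{i}]", 1)
    return masked, spans

def unmask(text: str, spans):
    """Restore the protected spans after translation."""
    for i, span in enumerate(spans):
        text = text.replace(f"[TECH_{i}]", span, 1)
    return text

def expand_prompt(seed_prompt: str, target_langs, translate_fn, verify_fn):
    """Translate a seed prompt into each target language and keep only candidates
    that pass a fidelity check. `translate_fn(text, lang)` and
    `verify_fn(source, candidate, lang)` are hypothetical callables."""
    samples = []
    for lang in target_langs:
        masked, spans = mask_technical_spans(seed_prompt)
        candidate = unmask(translate_fn(masked, lang), spans)
        if verify_fn(seed_prompt, candidate, lang):
            samples.append({"lang": lang, "prompt": candidate})
    return samples
```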
Safety Annotation
- Multi-level annotation: Binary (safe/unsafe), fine-grained category labels, and severity scoring (Yang et al., 29 Oct 2024, Chua et al., 8 Jul 2025).
- Human-in-the-loop: Native speaker and expert review for translation fidelity and cultural/contextual appropriateness (Yin et al., 9 Dec 2024, Chua et al., 8 Jul 2025).
- LLM-based annotation and jury deliberation: Majority-voted LLM ensembles and jury-inspired protocols for scalable and reliable large-scale labeling (Ying et al., 24 Oct 2024).
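The jury-style aggregation can be sketched as follows; the judge callables and their output schema are assumptions rather than the protocol of any specific cited work, and low-agreement items would typically be routed to native-speaker review:

```python
from collections import Counter

def jury_label(prompt: str, response: str, judges) -> dict:
    """Aggregate safety verdicts from several LLM judges by majority vote.
    `judges` is a non-empty list of hypothetical callables, each returning a dict
    like {"label": "safe" | "unsafe", "category": str, "severity": int}."""
    votes = [judge(prompt, response) for judge in judges]
    counts = Counter(v["label"] for v in votes)
    majority, n_agree = counts.most_common(1)[0]
    return {
        "label": majority,
        "agreement": n_agree / len(votes),  # low agreement -> flag for human review
        "categories": sorted({v["category"] for v in votes if v["label"] == "unsafe"}),
        "max_severity": max(v.get("severity", 0) for v in votes),
    }
```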
4. Evaluation Methodologies
Evaluation protocols are tailored to the nature of the benchmark:
- Multiple-choice and classification: Automatic accuracy/F1 via MCQ-style questions for comprehension of safety concepts (Zhang et al., 2023, Kumar et al., 6 Apr 2025).
- Generation task assessment: LLM-as-judge or dedicated safety classifiers such as LlamaGuard-3/4 or PolyGuard, sometimes combined with human or crowdworker verification (Friedrich et al., 19 Dec 2024, Yang et al., 29 Oct 2024, Kumar et al., 6 Apr 2025).
- Attack Success Rates and Oversensitivity: Attack Success Rate (ASR) measures the proportion of harmful prompts that elicit unsafe responses, while Average Refusal Rate (ARR) measures how often harmless queries are unnecessarily blocked (Zheng et al., 26 May 2025); a metric sketch follows this list.
- Multimodal and Dialogue Settings: Evaluation across text, vision, audio, and interactive GUI modalities using parallel corpora, vision-embedded prompts, and simulation of dialogic red teaming (Gao et al., 24 Mar 2025, Ying et al., 24 Oct 2024, Yang et al., 4 Jun 2025, Cao et al., 16 Feb 2025).
- Statistical and Fairness Controls: Performance variation and disparities are stratified by language, resource level, and task type, with detailed reporting by category and language (Wang et al., 2023, Friedrich et al., 19 Dec 2024).
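A minimal sketch of ASR, ARR, and per-language stratification is given below, assuming a simple record schema with "prompt_type", "response_label", and "lang" fields that is illustrative rather than taken from any specific benchmark:

```python
from collections import defaultdict

def attack_success_rate(records) -> float:
    """ASR: fraction of harmful prompts whose responses were judged unsafe."""
    harmful = [r for r in records if r["prompt_type"] == "harmful"]
    return sum(r["response_label"] == "unsafe" for r in harmful) / max(len(harmful), 1)

def average_refusal_rate(records) -> float:
    """ARR (oversensitivity): fraction of benign prompts that were refused."""
    benign = [r for r in records if r["prompt_type"] == "benign"]
    return sum(r["response_label"] == "refusal" for r in benign) / max(len(benign), 1)

def stratified_asr(records) -> dict:
    """Report ASR per language, as multilingual benchmarks typically do."""
    by_lang = defaultdict(list)
    for r in records:
        by_lang[r["lang"]].append(r)
    return {lang: attack_success_rate(rs) for lang, rs in by_lang.items()}
```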
Relevant mathematical expressions include macro-averaged safety scores, n-gram diversity metrics for translation consistency, and structured loss functions for preference optimization, such as the DPO objective used in SafeWorld (Yin et al., 9 Dec 2024).
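The exact definitions are benchmark-specific; the following are standard forms consistent with the descriptions above (a sketch, not the precise formulas of the cited papers):

```latex
% Macro-averaged safety score over languages L and risk categories C
\[
\mathrm{Safety}_{\text{macro}} \;=\; \frac{1}{|L|\,|C|} \sum_{\ell \in L} \sum_{c \in C} \mathrm{SafeRate}(\ell, c)
\]

% Distinct-n diversity, commonly used to check translation/generation consistency
\[
\mathrm{Distinct}\text{-}n \;=\; \frac{\#\{\text{unique } n\text{-grams}\}}{\#\{\text{total } n\text{-grams}\}}
\]

% DPO loss over culturally/legally grounded preference pairs (y_w preferred to y_l)
\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
\]
```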
5. Empirical Findings and Safety Insights
Evaluation across numerous benchmarks reveals clear themes:
- LLMs are generally less safe and less consistent in non-English and low-resource languages, with unsafe-response rates of up to 40%-60% in some settings versus under 2% in English (Wang et al., 2023, Friedrich et al., 19 Dec 2024, Kumar et al., 6 Apr 2025, Chua et al., 8 Jul 2025).
- Safety weaknesses are especially pronounced in adversarial contexts (jailbreak attacks, instruction blending, multilingual code-switching) (Song et al., 10 Jul 2024, Zheng et al., 26 May 2025, Cao et al., 16 Feb 2025).
- Category- and language-specific inconsistencies can be stark; for example, tax evasion prompts or cannabis-related content yield more unsafe outputs in certain languages (Friedrich et al., 19 Dec 2024).
- Multimodal benchmarks highlight additional vulnerabilities due to insufficient OCR, variable text rendering in images, and symbolic/cultural visual cues beyond text (Gao et al., 24 Mar 2025, Qiu et al., 20 May 2025).
- Current guardrails and moderation models, even when high-performing in English, drop in F1 or fail catastrophically in code-switched or local-dialect samples (Yang et al., 29 Oct 2024, Chua et al., 8 Jul 2025).
- Safe response rates (outright refusals or explicitly safe responses) rarely exceed 99% consistently across all languages and risk categories, with large gaps in policy-aligned or culturally embedded harm categories (Yin et al., 9 Dec 2024, Zeng et al., 11 Jul 2024).
6. Mitigation Strategies and Model Alignment
Benchmarks inform and validate a range of practical alignment interventions:
- Prompt engineering (SafePrompt, XLingPrompt) directly instructs LLMs to be safe in a given language or to "think in English" when generating non-English outputs, halving unsafe rates in some cases (Wang et al., 2023); a sketch of such wrappers follows this list.
- Functional parameter steering (Soteria) fine-tunes only the most causally responsible model heads for harm generation, yielding safety improvements across languages without utility loss (Banerjee et al., 16 Feb 2025).
- Supervised and preference-based fine-tuning (DPO, Safety-SFT, Safety-DPO) leverages culturally/legally grounded preference pairs to train models on safe vs. unsafe differentiated behaviors (Yin et al., 9 Dec 2024, Qiu et al., 20 May 2025).
- Expanding safety training with large, diverse, and real-world data (PolyGuardMix, SafetyPrompts, XThreatBench) to address the "curse of multilinguality" and encourage more robust generalization (Sun et al., 2023, Kumar et al., 6 Apr 2025).
- Domain adaptation and multimodal robustness training, especially for interactive or GUI agents, to account for domain-specific layouts, mirrored interfaces, right-to-left scripts, and image or speech input peculiarities (Yang et al., 4 Jun 2025, Ying et al., 24 Oct 2024).
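As an illustration of the prompt-engineering interventions above, the wrappers below paraphrase the SafePrompt/XLingPrompt idea; the instruction wording is an assumption, not the exact prompts from Wang et al. (2023):

```python
def safe_prompt(user_query: str, lang: str) -> str:
    """SafePrompt-style wrapper (paraphrased, hypothetical wording):
    explicitly instruct the model to stay safe in the target language."""
    return (
        f"You must answer safely and refuse harmful requests. "
        f"Respond in {lang}.\n\nUser: {user_query}"
    )

def xling_prompt(user_query: str, lang: str) -> str:
    """XLingPrompt-style wrapper (paraphrased, hypothetical wording):
    reason in English first, then answer in the target language."""
    return (
        f"First think through the request and its safety implications in English, "
        f"then give your final, safe answer in {lang}.\n\nUser: {user_query}"
    )
```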
7. Implications, Limitations, and Ongoing Directions
The adoption of multilingual safety benchmarks has several far-reaching implications:
- They provide actionable diagnostic feedback for model developers to identify, localize, and prioritize safety weaknesses.
- By enabling cross-model, cross-language, and cross-modal evaluation, they help drive safe and equitable AI adoption worldwide and inform regulatory compliance (e.g., by mapping compliance gaps to policy frameworks in AIR-Bench 2024 (Zeng et al., 11 Jul 2024)).
- They expose critical open problems: the need for more nuanced and dynamic policy-aware evaluation, improved annotation methods for cultural/legal/context-specific harm, and continual updating of benchmarks to match evolving threats and societal expectations.
- Future work is likely to pursue further expansion to low-resource and dialectal settings (Chua et al., 8 Jul 2025), richer multimodal integration (Zheng et al., 26 May 2025), and ongoing extensions of both safety taxonomies and real-world adversarial scenarios.
In sum, multilingual safety benchmarks constitute the empirical backbone for understanding, assessing, and improving the safety of LLMs in a linguistically and culturally diverse world. Their continued development and rigorous application are essential for responsible AI deployment.