Multilingual Safety Benchmark
- A multilingual safety benchmark is a framework designed to evaluate LLM safety using detailed taxonomies, high-quality multilingual prompts, and robust evaluation protocols.
- It identifies risks such as toxicity, bias, privacy breaches, and culturally inappropriate responses across text, vision, and audio modalities.
- The benchmark enhances AI alignment through diverse data sources, red teaming, and tailored evaluation methods that address language-specific and cultural nuances.
A multilingual safety benchmark is a systematic framework for evaluating the safety of LLMs and related AI systems across multiple languages, cultural contexts, and modalities. Such benchmarks aim to rigorously identify unsafe outputs—such as toxicity, policy violations, and culturally inappropriate responses—in real and adversarial scenarios, thereby providing essential tools to measure, compare, and ultimately improve the safety of LLMs globally. As LLMs are increasingly deployed in multilingual and multicultural environments, the development and adoption of multilingual safety benchmarks are critical to ensuring equitable, reliable, and responsible AI behavior.
1. Definitions, Scope, and Rationale
A multilingual safety benchmark encompasses diverse safety taxonomies, high-quality multilingual prompts, and robust evaluation protocols to assess how effectively LLMs avoid producing unsafe, harmful, or policy-violating outputs across languages. Motivated by empirical findings that LLMs often exhibit higher rates of unsafe or inconsistent behavior in non-English or low-resource languages, these benchmarks extend beyond simple translation—they systematically capture linguistic, cultural, and legal variations that shape the interpretation of safety and risk. They support both monomodal (text) and multimodal (vision, audio) LLMs, and may test both generation and classification/guardrail tasks.
Multilingual safety benchmarks target a broad spectrum of risks, including but not limited to:
- Hate speech, discrimination, and bias
- Illegal instructions or criminal activity
- Privacy violations and data leakage
- Physical, mental, and societal harms (e.g., medical misinformation, advice on self-harm)
- Cultural and geo-legal norm noncompliance
- Security-critical vulnerabilities and agentic misuse
Prominent initiatives include XSafety (Wang et al., 2023), M-ALERT (Friedrich et al., 19 Dec 2024), USB-SafeBench (Zheng et al., 26 May 2025), SafeWorld (Yin et al., 9 Dec 2024), PolyGuardPrompts (Kumar et al., 6 Apr 2025), RabakBench (Chua et al., 8 Jul 2025), AIR-Bench 2024 (Zeng et al., 11 Jul 2024), as well as evaluation pipelines attached to agentic systems like MAPS (Hofman et al., 21 May 2025) and GUI environments (Yang et al., 4 Jun 2025).
2. Taxonomy Design and Risk Category Coverage
Comprehensive safety benchmarking depends on well-structured taxonomies that map diverse real-world risks. Taxonomies may be derived from academic literature, regulatory frameworks (e.g., EU AI Act, company Acceptable Use Policies), or empirical incidence of harmful LLM output.
Typical structures include:
- Macro categories (e.g., “Hate and Discrimination”, “Illegal Activities”, “Physical Harm”, “Privacy Violations”, “Morality”, “Ethics”, “Cultural Norms”, “Security”)
- Micro categories (e.g., “hate_women”, “substance_cannabis”, “crime_tax”, “crime_kidnapping” (Tedeschi et al., 6 Apr 2024, Friedrich et al., 19 Dec 2024))
- Tertiary or policy-aligned categories (e.g., 314 granular risks in AIR-Bench 2024 (Zeng et al., 11 Jul 2024); 61 subcategories in USB-SafeBench (Zheng et al., 26 May 2025))
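As a minimal illustration, such a macro/micro hierarchy can be represented and pruned programmatically. The fragment below reuses the micro-category labels cited above; the remaining category names are hypothetical placeholders rather than labels from any cited benchmark:

```python
# Illustrative macro -> micro taxonomy fragment. "hate_women", "substance_cannabis",
# "crime_tax", and "crime_kidnapping" appear in the cited work; the other names
# are hypothetical stand-ins.
TAXONOMY = {
    "hate_and_discrimination": ["hate_women", "hate_ethnic", "hate_lgbtq"],
    "illegal_activities": ["crime_tax", "crime_kidnapping", "crime_cyber"],
    "substance_abuse": ["substance_cannabis", "substance_drug_synthesis"],
    "privacy_violations": ["privacy_doxxing", "privacy_data_leakage"],
}

def restrict_taxonomy(taxonomy: dict, excluded_micro: set) -> dict:
    """Drop micro categories, e.g., to align the benchmark with a jurisdiction's policy."""
    return {
        macro: [m for m in micros if m not in excluded_micro]
        for macro, micros in taxonomy.items()
    }

# Example: a deployment context in which cannabis-related content is not flagged.
local_taxonomy = restrict_taxonomy(TAXONOMY, {"substance_cannabis"})
```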
Such detailed taxonomies permit:
- Fine-grained discrimination between distinct forms of harm
- Policy/regulation-oriented customization (removal or re-weighting of categories for jurisdictional alignment)
- Diagnostic analytics to identify specific weak points in model alignment
Multilingual benchmarks extend these categories to cover language-specific and culture-specific harm, including regional taboos (Yin et al., 9 Dec 2024), slang, idioms (Chua et al., 8 Jul 2025), and context-dependent symbolic violations (Qiu et al., 20 May 2025).
3. Dataset Construction and Multilinguality
Modern multilingual safety benchmarks employ extensive data pipelines to curate, generate, and annotate high-quality evaluation samples:
Prompt Generation and Coverage
- Manual curation and red teaming: Adversarial prompts are crafted manually and with LLM assistance for coverage of edge cases (Sun et al., 2023, Chua et al., 8 Jul 2025).
- Semi-automatic and LLM-augmented expansion (Self-Instruct, template-based, iterative questioning): Large datasets, e.g., SafetyPrompts (100K+ Chinese/English prompts) (Sun et al., 2023), PolyGuardMix (1.91M samples across 17 languages) (Kumar et al., 6 Apr 2025), XThreatBench (adversarial prompts in 12 languages) (Banerjee et al., 16 Feb 2025).
- Diverse sources: Native language content, real online dialogues, translations, and synthetic data to ensure broad representation (Zheng et al., 26 May 2025).
- Translation and cross-lingual adaptation: High-quality translation pipelines (human, NMT, LLM-verification, masking for technical content) (Hofman et al., 21 May 2025, Friedrich et al., 19 Dec 2024, Chua et al., 8 Jul 2025). Preservation of slang, ambiguity, and cultural nuance is paramount for low-resource or code-mixed settings.
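A minimal sketch of such a translate-and-verify expansion step is shown below. Here `translate_fn` and `verify_fn` are hypothetical stand-ins for an NMT/LLM translator and an LLM-as-judge fidelity check, and the masking step simply protects inline technical spans during translation:

```python
import re

def mask_technical_spans(text: str):
    """Replace inline code spans with placeholders so translation leaves them untouched."""
    spans = re.findall(r"`[^`]+`", text)
    masked = text
    for i, span in enumerate(spans):
        masked = masked.replace(span, f"[TECH_{i}]", 1)
    return masked, spans

def unmask(text: str, spans):
    """Restore the protected spans after translation."""
    for i, span in enumerate(spans):
        text = text.replace(f"[TECH_{i}]", span, 1)
    return text

def expand_prompt(seed_prompt: str, target_langs, translate_fn, verify_fn):
    """Translate a seed prompt into each target language and keep only candidates
    that pass a fidelity check. `translate_fn(text, lang)` and
    `verify_fn(source, candidate, lang)` are hypothetical callables."""
    samples = []
    for lang in target_langs:
        masked, spans = mask_technical_spans(seed_prompt)
        candidate = unmask(translate_fn(masked, lang), spans)
        if verify_fn(seed_prompt, candidate, lang):
            samples.append({"lang": lang, "prompt": candidate})
    return samples
```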
Safety Annotation
- Multi-level annotation: Binary (safe/unsafe), fine-grained category labels, and severity scoring (Yang et al., 29 Oct 2024, Chua et al., 8 Jul 2025).
- Human-in-the-loop: Native speaker and expert review for translation fidelity and cultural/contextual appropriateness (Yin et al., 9 Dec 2024, Chua et al., 8 Jul 2025).
- LLM-based annotation and jury deliberation: Majority-voted LLM ensembles and jury-inspired protocols for scalable and reliable large-scale labeling (Ying et al., 24 Oct 2024).
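The jury-style aggregation can be sketched as follows; the judge callables and their output schema are assumptions rather than the protocol of any specific cited work, and low-agreement items would typically be routed to native-speaker review:

```python
from collections import Counter

def jury_label(prompt: str, response: str, judges) -> dict:
    """Aggregate safety verdicts from several LLM judges by majority vote.
    `judges` is a non-empty list of hypothetical callables, each returning a dict
    like {"label": "safe" | "unsafe", "category": str, "severity": int}."""
    votes = [judge(prompt, response) for judge in judges]
    counts = Counter(v["label"] for v in votes)
    majority, n_agree = counts.most_common(1)[0]
    return {
        "label": majority,
        "agreement": n_agree / len(votes),  # low agreement -> flag for human review
        "categories": sorted({v["category"] for v in votes if v["label"] == "unsafe"}),
        "max_severity": max(v.get("severity", 0) for v in votes),
    }
```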
4. Evaluation Methodologies
Evaluation protocols are tailored to the nature of the benchmark:
- Multiple-choice and classification: Automatic accuracy/F1 via MCQ-style questions for comprehension of safety concepts (Zhang et al., 2023, Kumar et al., 6 Apr 2025).
- Generation task assessment: LLM-as-judge or dedicated safety classifiers such as LlamaGuard-3/4 or PolyGuard, sometimes combined with human or crowdworker verification (Friedrich et al., 19 Dec 2024, Yang et al., 29 Oct 2024, Kumar et al., 6 Apr 2025).
- Attack Success Rates and Oversensitivity: Attack Success Rate (ASR) measures the proportion of harmful prompts that elicit unsafe responses, while Average Refusal Rate (ARR) measures how often harmless queries are unnecessarily blocked (Zheng et al., 26 May 2025); a metric sketch follows this list.
- Multimodal and Dialogue Settings: Evaluation across text, vision, audio, and interactive GUI modalities using parallel corpora, vision-embedded prompts, and simulation of dialogic red teaming (Gao et al., 24 Mar 2025, Ying et al., 24 Oct 2024, Yang et al., 4 Jun 2025, Cao et al., 16 Feb 2025).
- Statistical and Fairness Controls: Performance variation and disparities are stratified by language, resource level, and task type, with detailed reporting by category and language (Wang et al., 2023, Friedrich et al., 19 Dec 2024).
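A minimal sketch of ASR, ARR, and per-language stratification is given below, assuming a simple record schema with "prompt_type", "response_label", and "lang" fields that is illustrative rather than taken from any specific benchmark:

```python
from collections import defaultdict

def attack_success_rate(records) -> float:
    """ASR: fraction of harmful prompts whose responses were judged unsafe."""
    harmful = [r for r in records if r["prompt_type"] == "harmful"]
    return sum(r["response_label"] == "unsafe" for r in harmful) / max(len(harmful), 1)

def average_refusal_rate(records) -> float:
    """ARR (oversensitivity): fraction of benign prompts that were refused."""
    benign = [r for r in records if r["prompt_type"] == "benign"]
    return sum(r["response_label"] == "refusal" for r in benign) / max(len(benign), 1)

def stratified_asr(records) -> dict:
    """Report ASR per language, as multilingual benchmarks typically do."""
    by_lang = defaultdict(list)
    for r in records:
        by_lang[r["lang"]].append(r)
    return {lang: attack_success_rate(rs) for lang, rs in by_lang.items()}
```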
Relevant mathematical expressions include macro-averaged safety scores, n-gram diversity metrics for translation consistency, and structured loss functions for preference optimization, such as the DPO objective used in SafeWorld (Yin et al., 9 Dec 2024).
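The exact definitions are benchmark-specific; the following are standard forms consistent with the descriptions above (a sketch, not the precise formulas of the cited papers):

```latex
% Macro-averaged safety score over languages L and risk categories C
\[
\mathrm{Safety}_{\text{macro}} \;=\; \frac{1}{|L|\,|C|} \sum_{\ell \in L} \sum_{c \in C} \mathrm{SafeRate}(\ell, c)
\]

% Distinct-n diversity, commonly used to check translation/generation consistency
\[
\mathrm{Distinct}\text{-}n \;=\; \frac{\#\{\text{unique } n\text{-grams}\}}{\#\{\text{total } n\text{-grams}\}}
\]

% DPO loss over culturally/legally grounded preference pairs (y_w preferred to y_l)
\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
\]
```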
5. Empirical Findings and Safety Insights
Evaluation across numerous benchmarks reveals clear themes:
- LLMs are generally less safe and less consistent in non-English and low-resource languages, with unsafe-response rates of up to 40%-60% in some settings versus under 2% in English (Wang et al., 2023, Friedrich et al., 19 Dec 2024, Kumar et al., 6 Apr 2025, Chua et al., 8 Jul 2025).
- Safety weaknesses are especially pronounced in adversarial contexts (jailbreak attacks, instruction blending, multilingual code-switching) (Song et al., 10 Jul 2024, Zheng et al., 26 May 2025, Cao et al., 16 Feb 2025).
- Category- and language-specific inconsistencies can be stark; for example, tax evasion prompts or cannabis-related content yield more unsafe outputs in certain languages (Friedrich et al., 19 Dec 2024).
- Multimodal benchmarks highlight additional vulnerabilities due to insufficient OCR, variable text rendering in images, and symbolic/cultural visual cues beyond text (Gao et al., 24 Mar 2025, Qiu et al., 20 May 2025).
- Current guardrails and moderation models, even when high-performing in English, drop in F1 or fail catastrophically in code-switched or local-dialect samples (Yang et al., 29 Oct 2024, Chua et al., 8 Jul 2025).
- Safe response rates (outright refusals or explicitly safe responses) rarely exceed 99% consistently across all languages and risk categories, with large gaps in policy-aligned or culturally embedded harm categories (Yin et al., 9 Dec 2024, Zeng et al., 11 Jul 2024).
6. Mitigation Strategies and Model Alignment
Benchmarks inform and validate a range of practical alignment interventions:
- Prompt engineering (SafePrompt, XLingPrompt) directly instructs LLMs to be safe in a given language or to "think in English" when generating non-English outputs, halving unsafe rates in some cases (Wang et al., 2023); a sketch of such wrappers follows this list.
- Functional parameter steering (Soteria) fine-tunes only the most causally responsible model heads for harm generation, yielding safety improvements across languages without utility loss (Banerjee et al., 16 Feb 2025).
- Supervised and preference-based fine-tuning (DPO, Safety-SFT, Safety-DPO) leverages culturally/legally grounded preference pairs to train models on safe vs. unsafe differentiated behaviors (Yin et al., 9 Dec 2024, Qiu et al., 20 May 2025).
- Expanding safety training with large, diverse, and real-world data (PolyGuardMix, SafetyPrompts, XThreatBench) to address the "curse of multilinguality" and encourage more robust generalization (Sun et al., 2023, Kumar et al., 6 Apr 2025).
- Domain adaptation and multimodal robustness training, especially for interactive or GUI agents, to account for domain-specific layouts, mirrored interfaces, right-to-left scripts, and image or speech input peculiarities (Yang et al., 4 Jun 2025, Ying et al., 24 Oct 2024).
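As an illustration of the prompt-engineering interventions above, the wrappers below paraphrase the SafePrompt/XLingPrompt idea; the instruction wording is an assumption, not the exact prompts from Wang et al. (2023):

```python
def safe_prompt(user_query: str, lang: str) -> str:
    """SafePrompt-style wrapper (paraphrased, hypothetical wording):
    explicitly instruct the model to stay safe in the target language."""
    return (
        f"You must answer safely and refuse harmful requests. "
        f"Respond in {lang}.\n\nUser: {user_query}"
    )

def xling_prompt(user_query: str, lang: str) -> str:
    """XLingPrompt-style wrapper (paraphrased, hypothetical wording):
    reason in English first, then answer in the target language."""
    return (
        f"First think through the request and its safety implications in English, "
        f"then give your final, safe answer in {lang}.\n\nUser: {user_query}"
    )
```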
7. Implications, Limitations, and Ongoing Directions
The adoption of multilingual safety benchmarks has several far-reaching implications:
- They provide actionable diagnostic feedback for model developers to identify, localize, and prioritize safety weaknesses.
- By enabling cross-model, cross-language, and cross-modal evaluation, they help drive safe and equitable AI adoption worldwide and inform regulatory compliance (e.g., by mapping compliance gaps to policy frameworks in AIR-Bench 2024 (Zeng et al., 11 Jul 2024)).
- They expose critical open problems: the need for more nuanced and dynamic policy-aware evaluation, improved annotation methods for cultural/legal/context-specific harm, and continual updating of benchmarks to match evolving threats and societal expectations.
- Future work is likely to pursue further expansion to low-resource and dialectal settings (Chua et al., 8 Jul 2025), richer multimodal integration (Zheng et al., 26 May 2025), and ongoing extensions of both safety taxonomies and real-world adversarial scenarios.
In sum, multilingual safety benchmarks constitute the empirical backbone for understanding, assessing, and improving the safety of LLMs in a linguistically and culturally diverse world. Their continued development and rigorous application are essential for responsible AI deployment.