
Multilingual Safety Benchmark

Updated 14 July 2025
  • A multilingual safety benchmark is a framework designed to evaluate LLM safety using detailed taxonomies, high-quality multilingual prompts, and robust evaluation protocols.
  • It identifies risks such as toxicity, bias, privacy breaches, and culturally inappropriate responses across text, vision, and audio modalities.
  • The benchmark enhances AI alignment through diverse data sources, red teaming, and tailored evaluation methods that address language-specific and cultural nuances.

A multilingual safety benchmark is a systematic framework for evaluating the safety of LLMs and related AI systems across multiple languages, cultural contexts, and modalities. Such benchmarks aim to rigorously identify unsafe outputs—such as toxicity, policy violations, and culturally inappropriate responses—in real and adversarial scenarios, thereby providing essential tools to measure, compare, and ultimately improve the safety of LLMs globally. As LLMs are increasingly deployed in multilingual and multicultural environments, the development and adoption of multilingual safety benchmarks are critical to ensuring equitable, reliable, and responsible AI behavior.

1. Definitions, Scope, and Rationale

A multilingual safety benchmark encompasses diverse safety taxonomies, high-quality multilingual prompts, and robust evaluation protocols to assess how effectively LLMs avoid producing unsafe, harmful, or policy-violating outputs across languages. Motivated by empirical findings that LLMs often exhibit higher rates of unsafe or inconsistent behavior in non-English or low-resource languages, these benchmarks extend beyond simple translation—they systematically capture linguistic, cultural, and legal variations that shape the interpretation of safety and risk. They support both monomodal (text) and multimodal (vision, audio) LLMs, and may test both generation and classification/guardrail tasks.

Multilingual safety benchmarks target a broad spectrum of risks, including but not limited to:

  • Hate speech, discrimination, and bias
  • Illegal instructions or criminal activity
  • Privacy violations and data leakage
  • Physical, mental, and societal harms (e.g., medical misinformation, advice on self-harm)
  • Cultural and geo-legal norm noncompliance
  • Security-critical vulnerabilities and agentic misuse

Prominent initiatives include XSafety (Wang et al., 2023), M-ALERT (Friedrich et al., 19 Dec 2024), USB-SafeBench (Zheng et al., 26 May 2025), SafeWorld (Yin et al., 9 Dec 2024), PolyGuardPrompts (Kumar et al., 6 Apr 2025), RabakBench (Chua et al., 8 Jul 2025), AIR-Bench 2024 (Zeng et al., 11 Jul 2024), as well as evaluation pipelines attached to agentic systems like MAPS (Hofman et al., 21 May 2025) and GUI environments (Yang et al., 4 Jun 2025).

2. Taxonomy Design and Risk Category Coverage

Comprehensive safety benchmarking depends on well-structured taxonomies that map diverse real-world risks. Taxonomies may be derived from academic literature, regulatory frameworks (e.g., the EU AI Act, company Acceptable Use Policies), or empirically observed harmful LLM outputs.

Typical structures include:

  • Macro categories (e.g., “Hate and Discrimination”, “Illegal Activities”, “Physical Harm”, “Privacy Violations”, “Morality”, “Ethics”, “Cultural Norms”, “Security”)
  • Micro categories (e.g., “hate_women”, “substance_cannabis”, “crime_tax”, “crime_kidnapping” (Tedeschi et al., 6 Apr 2024, Friedrich et al., 19 Dec 2024))
  • Tertiary or policy-aligned categories (e.g., 314 granular risks in AIR-Bench 2024 (Zeng et al., 11 Jul 2024); 61 subcategories in USB-SafeBench (Zheng et al., 26 May 2025))

Such detailed taxonomies permit:

  • Fine-grained discrimination between distinct forms of harm
  • Policy- and regulation-oriented customization (removal or weighting of categories for jurisdictional alignment)
  • Diagnostic analytics to identify specific weak points in model alignment

Multilingual benchmarks extend these categories to cover language-specific and culture-specific harm, including regional taboos (Yin et al., 9 Dec 2024), slang, idioms (Chua et al., 8 Jul 2025), and context-dependent symbolic violations (Qiu et al., 20 May 2025).
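To make this structure concrete, the following is a minimal Python sketch of a two-level taxonomy supporting the removal and weighting described above. The class names and the `customize` helper are illustrative assumptions, not any benchmark's actual schema; only the example category names echo those cited above.

```python
from dataclasses import dataclass, field

@dataclass
class MicroCategory:
    name: str            # e.g., "hate_women"
    weight: float = 1.0  # jurisdiction-specific weight; removal is handled separately

@dataclass
class MacroCategory:
    name: str            # e.g., "Hate and Discrimination"
    micro: list[MicroCategory] = field(default_factory=list)

# Hypothetical fragment mirroring the macro/micro split described above
taxonomy = [
    MacroCategory("Hate and Discrimination",
                  [MicroCategory("hate_women")]),
    MacroCategory("Illegal Activities",
                  [MicroCategory("crime_tax"), MicroCategory("substance_cannabis")]),
]

def customize(taxonomy: list[MacroCategory],
              removed: set[str],
              weights: dict[str, float]) -> list[MacroCategory]:
    """Drop or re-weight micro categories for jurisdictional alignment."""
    out = []
    for macro in taxonomy:
        kept = [MicroCategory(m.name, weights.get(m.name, m.weight))
                for m in macro.micro if m.name not in removed]
        if kept:
            out.append(MacroCategory(macro.name, kept))
    return out

# Example: exclude cannabis-related categories, emphasize tax crime
local = customize(taxonomy, removed={"substance_cannabis"}, weights={"crime_tax": 2.0})
```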

3. Dataset Construction and Multilinguality

Modern multilingual safety benchmarks employ extensive data pipelines to curate, generate, and annotate high-quality evaluation samples:

Prompt Generation and Coverage

Safety Annotation
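Neither stage is elaborated in detail here. As a purely illustrative sketch of what the annotation stage must produce, each evaluation sample typically binds a prompt to its language, risk category, gold safety label, and provenance; the field names below are assumptions, not any benchmark's published schema.

```python
from dataclasses import dataclass

@dataclass
class SafetySample:
    prompt: str          # the (possibly adversarial) user input
    language: str        # BCP-47 tag, e.g., "de" or "zh-Hans"
    macro_category: str  # e.g., "Privacy Violations"
    micro_category: str  # hypothetical fine-grained label
    label: str           # annotated gold label: "safe" or "unsafe"
    source: str          # e.g., "human-written", "translated", "LLM-generated"

sample = SafetySample(
    prompt="...",        # elided
    language="de",
    macro_category="Privacy Violations",
    micro_category="privacy_doxxing",
    label="unsafe",
    source="human-written",
)
```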

4. Evaluation Methodologies

Evaluation protocols are tailored to the nature of the benchmark, covering both open-ended generation and classification or guardrail settings.

Relevant mathematical expressions include macro-averaged safety score formulas, n-gram diversity for translation consistency, $D_\ell = -\sum_n P_n \log P_n$, and structured loss functions for preference optimization ($\mathcal{L}_{\text{DPO}}$ in SafeWorld (Yin et al., 9 Dec 2024)).
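As a hedged illustration of the first two quantities (exact formulas vary across benchmarks), macro-averaging gives each category equal weight regardless of how many prompts it contains, and the diversity measure is a Shannon entropy over empirical n-gram frequencies:

```python
import math
from collections import Counter

def macro_safety_score(results: dict[str, list[bool]]) -> float:
    """Mean of per-category safe rates; every category counts equally,
    however many prompts it contains."""
    rates = [sum(flags) / len(flags) for flags in results.values()]
    return sum(rates) / len(rates)

def ngram_entropy(tokens: list[str], n: int = 2) -> float:
    """D = -sum_g P_g log P_g over the empirical n-gram distribution."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log(c / total) for c in grams.values())

# Example: per-category outcomes, True = response judged safe
score = macro_safety_score({
    "hate": [True, True, False],   # 0.67 safe
    "privacy": [True, False],      # 0.50 safe
})                                 # macro average: ~0.58
```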

5. Empirical Findings and Safety Insights

Evaluation across numerous benchmarks reveals a consistent theme: models exhibit markedly higher rates of unsafe or inconsistent behavior in non-English and low-resource languages than in English.

6. Mitigation Strategies and Model Alignment

Benchmarks inform and validate a range of practical alignment interventions:

  • Prompt engineering (SafePrompt, XLingPrompt) directly instructs LLMs to be safe in a given language or to "think in English" when generating non-English outputs, halving unsafe rates in some cases (Wang et al., 2023).
  • Functional parameter steering (Soteria) fine-tunes only the most causally responsible model heads for harm generation, yielding safety improvements across languages without utility loss (Banerjee et al., 16 Feb 2025).
  • Supervised and preference-based fine-tuning (DPO, Safety-SFT, Safety-DPO) leverages culturally and legally grounded preference pairs to train models to prefer safe over unsafe behavior (Yin et al., 9 Dec 2024, Qiu et al., 20 May 2025); the DPO objective is sketched after this list.
  • Expanding safety training with large, diverse, and real-world data (PolyGuardMix, SafetyPrompts, XThreatBench) to address the "curse of multilinguality" and encourage more robust generalization (Sun et al., 2023, Kumar et al., 6 Apr 2025).
  • Domain adaptation and multimodal robustness training, especially for interactive or GUI agents, to account for domain-specific layouts, mirrored interfaces, right-to-left scripts, and image or speech input peculiarities (Yang et al., 4 Jun 2025, Ying et al., 24 Oct 2024).
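For reference, the standard DPO objective behind the Safety-DPO variant above (as commonly formulated; SafeWorld's exact instantiation may differ) contrasts a preferred safe response $y_w$ with a dispreferred unsafe response $y_l$ for the same prompt $x$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

Here $\pi_{\text{ref}}$ is the frozen reference model, $\beta$ a temperature controlling deviation from it, and $\sigma$ the logistic function; in the safety setting, the preference pairs are the culturally and legally grounded safe/unsafe responses described above.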

7. Implications, Limitations, and Ongoing Directions

The adoption of multilingual safety benchmarks has several far-reaching implications:

  • They provide actionable diagnostic feedback for model developers to identify, localize, and prioritize safety weaknesses.
  • By enabling cross-model, cross-language, and cross-modal evaluation, they help drive safe and equitable AI adoption worldwide and inform regulatory compliance (e.g., by mapping compliance gaps to policy frameworks in AIR-Bench 2024 (Zeng et al., 11 Jul 2024)).
  • They expose critical open problems: the need for more nuanced and dynamic policy-aware evaluation, improved annotation methods for cultural/legal/context-specific harm, and continual updating of benchmarks to match evolving threats and societal expectations.
  • Future work is likely to pursue further expansion to low-resource and dialectal settings (Chua et al., 8 Jul 2025), richer multimodal integration (Zheng et al., 26 May 2025), and ongoing extensions of both safety taxonomies and real-world adversarial scenarios.

In sum, multilingual safety benchmarks constitute the empirical backbone for understanding, assessing, and improving the safety of LLMs in a linguistically and culturally diverse world. Their continued development and rigorous application are essential for responsible AI deployment.

