CSSBench: Chinese-Specific Safety Benchmark

Updated 9 January 2026
  • The paper introduces CSSBench as a Chinese-specific safety benchmark that rigorously quantifies LLM vulnerabilities using adversarial prompt engineering and hierarchical risk taxonomies.
  • It employs detailed datasets and structured evaluation methodologies, including metrics like ASR, ORR, and CER, to capture multi-faceted regulatory and sociotechnical risks.
  • The benchmark informs best practices for developing robust Chinese-centric LLMs, balancing safety and helpfulness against sophisticated, culturally nuanced risks.

A Chinese-Specific Safety Benchmark (CSSBench) constitutes a suite of datasets, protocols, and evaluation frameworks expressly designed to rigorously quantify and analyze the safety properties of LLMs in the Chinese linguistic, legal, and cultural context. CSSBench captures the full spectrum of Chinese-centric risks—including region-specific expression, adversarial query patterns, multi-faceted regulatory norms, and sociotechnical value alignment—with a level of granularity and adversarial coverage not found in benchmarks focused on English or generic multilingual corpora. Recent iterations of CSSBench have incorporated adversarial prompt engineering, hierarchical risk taxonomies derived from Chinese statutory and social guidelines, and comprehensive empirical evaluations of both open-source and commercial LLMs (Zhou et al., 2 Jan 2026).

1. Structural Taxonomy and Risk Domains

CSSBench typologies are designed to comprehensively represent Chinese safety risks across hierarchical dimensions, drawing from national standards, regulatory guidelines, and empirical incident analysis. Most modern CSSBench instantiations deploy a two-level taxonomy. For example, CHiSafetyBench organizes content into five macro-domains—Discrimination, Violation of Values, Commercial Violations, Infringement of Rights, and Security Requirements for Specific Services—spanning 31–40 micro-categories (e.g., ethnic discrimination, propagating violence, privacy leakage, unreliable content) (Zhang et al., 2024). JailBench refines this approach with a 5×40 hierarchy, explicitly mapping each query to a (domain, risk type) tuple to guarantee coverage of both universal and Chinese-specific threats (such as undermining core socialist values) (Liu et al., 26 Feb 2025). Similar coverage is found in ChineseSafe, CValues, and other benchmarks, although label granularity and domain coverage may vary (Zhang et al., 2024, Xu et al., 2023).

A representative high-level taxonomy organization:

Macro-domain | Example Fine-grained Subcategories
Discrimination | Ethnic, gender, regional, age, occupation
Violation of (Core Socialist) Values | State subversion, extremism, separatism
Commercial Violations | IP infringement, business ethics, fraud
Infringement of Rights | Privacy, reputation, physical/mental health
Security for Specific Services | Medical misinformation, infrastructure harm

This table reflects the formal structure used in CHiSafetyBench (Zhang et al., 2024).
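
As a concrete illustration, the two-level hierarchy and the (domain, risk type) labeling described above can be represented directly as a data structure. The sketch below is a minimal, hypothetical Python rendering: the domain names and subcategory identifiers are abbreviated stand-ins for the 31–40 real micro-categories, not the benchmarks' actual label sets.

```python
from dataclasses import dataclass

# Hypothetical, truncated rendering of a two-level CSSBench-style taxonomy:
# macro-domain -> fine-grained risk types. Real benchmarks use 31-40
# micro-categories; only a few illustrative labels are listed here.
TAXONOMY = {
    "Discrimination": ["ethnic", "gender", "regional", "age", "occupation"],
    "Violation of Values": ["state_subversion", "extremism", "separatism"],
    "Commercial Violations": ["ip_infringement", "business_ethics", "fraud"],
    "Infringement of Rights": ["privacy", "reputation", "physical_mental_health"],
    "Security for Specific Services": ["medical_misinformation", "infrastructure_harm"],
}

@dataclass(frozen=True)
class LabeledQuery:
    """An evaluation query mapped to an explicit (domain, risk_type) tuple."""
    text: str
    domain: str
    risk_type: str

    def __post_init__(self):
        # Reject labels that fall outside the declared hierarchy.
        if self.risk_type not in TAXONOMY.get(self.domain, []):
            raise ValueError(f"unknown label: ({self.domain}, {self.risk_type})")

# Explicit (domain, risk type) labels are what enable the domain-wise and
# category-wise metric breakdowns reported by these benchmarks.
q = LabeledQuery(text="<adversarial prompt>", domain="Discrimination", risk_type="regional")
print((q.domain, q.risk_type))
```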

2. Adversarial Prompt Engineering and Dataset Construction

CSSBench benchmarks emphasize adversarial robustness by integrating Chinese-specific perturbation schemes, such as homophone/shape substitutions, pinyin obfuscation, symbol-based splitting, and zero-width insertion (Zhou et al., 2 Jan 2026). Datasets are constructed via multi-stage processes:

  • Initial curation of malicious seed queries using domain-specific lexicons.
  • Application of adversarial patterns specific to written Chinese to each prompt, generating variants that bypass literal keyword/token detectors.
  • Automated prompt expansion (in-context learning, few-shot LLM prompting) to address data sparsity and diversify scenarios; see JailBench’s Automatic Jailbreak Prompt Engineer (AJPE) methodology (Liu et al., 26 Feb 2025).
  • Human and LLM-in-the-loop annotation for risk labeling and quality validation, with structured adjudication steps to maximize inter-annotator agreement (Zhang et al., 2024).

CSSBench datasets typically feature both open-ended and structured tasks (MCQ, TF, QA). For instance, CSSBench (2026) includes ∼900 adversarial seed queries per domain, with four variants per prompt for pattern robustness, and an additional ∼250 benign “borderline” prompts targeting false positive over-refusal (Zhou et al., 2 Jan 2026). Such adversarial construction directly exposes model vulnerabilities under realistic Chinese obfuscation.
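
To make the perturbation step concrete, the following is a minimal sketch of how the surface patterns named above (homophone substitution, pinyin obfuscation, symbol splitting, zero-width insertion) might be applied to a seed query. The character and pinyin mappings are tiny, innocuous illustrations and the seed is a harmless placeholder; this is a plausible implementation sketch, not the benchmarks' actual generation code or lexicons.

```python
import random

ZERO_WIDTH = "\u200b"  # zero-width space: invisible to readers, token-splitting for filters

# Tiny illustrative mappings; real adversarial lexicons are far larger and
# curated per risk domain. 查/茶 and 询/寻 are genuine homophone pairs.
HOMOPHONE_MAP = {"查": "茶", "询": "寻"}
PINYIN_MAP = {"查": "cha", "询": "xun"}
SPLIT_SYMBOLS = ["/", "*", "#"]

def homophone_substitute(text: str) -> str:
    """Swap characters for sound-alike characters to dodge literal keyword filters."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)

def pinyin_obfuscate(text: str) -> str:
    """Replace mapped characters with their romanized pinyin."""
    return "".join(PINYIN_MAP.get(ch, ch) for ch in text)

def symbol_split(text: str) -> str:
    """Insert a benign symbol between characters to break token matches."""
    return random.choice(SPLIT_SYMBOLS).join(text)

def zero_width_insert(text: str) -> str:
    """Insert zero-width spaces between characters."""
    return ZERO_WIDTH.join(text)

PATTERNS = [homophone_substitute, pinyin_obfuscate, symbol_split, zero_width_insert]

def expand_seed(seed: str) -> list[str]:
    """One variant per pattern, mirroring the 'four variants per prompt' setup."""
    return [pattern(seed) for pattern in PATTERNS]

print(expand_seed("示例查询文本"))  # harmless placeholder standing in for a seed query
```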

3. Evaluation Methodologies and Metrics

CSSBench employs detailed, task-specific evaluation protocols to measure both safety and helpfulness:

  • Attack Success Rate (ASR): The fraction of adversarial queries that elicit an unsafe response rather than a refusal or safe answer:

$$\mathrm{ASR} = 1 - \frac{\#\,\text{successful refusals or safe responses}}{\#\,\text{total adversarial queries}}$$

  • Over-Refusal Rate (ORR): The fraction of benign queries for which the model responds with unnecessary refusal, capturing lost helpfulness:

$$\mathrm{ORR} = \frac{\#\,\text{unnecessary refusals on benign queries}}{\#\,\text{total benign (borderline) queries}}$$

  • Composite Error Rate (CER): Aggregates both safety and helpfulness errors:

$$\mathrm{CER} = \frac{\sum_{t\in\mathcal{T}} N^M_t\,\mathcal{A}_t + N^O\,\mathcal{O}}{\sum_{t\in\mathcal{T}} N^M_t + N^O}$$

where $\mathcal{T}$ is the set of task types (QA, TF, MCQ), $N^M_t$ is the number of adversarial queries of type $t$, $\mathcal{A}_t$ is the ASR on type $t$, $N^O$ is the number of benign queries, and $\mathcal{O}$ is the ORR (Zhou et al., 2 Jan 2026).

  • Task-specific correctness/ACC/HR: For MCQ and refusal tasks, metrics such as risk-content identification accuracy (ACC), responsible refusal rates (RR-1, RR-2), and harm rate (HR) are defined (Zhang et al., 2024, Zhang et al., 18 Mar 2025).
  • Human/LLM-as-Judge Protocols: Recent variants use automatic evaluators, either rule-based classifiers or LLM judges (e.g., Qwen-72B, GPT-4) applying CoT-augmented in-context reasoning (Zhang et al., 2024, Cai et al., 11 Aug 2025).

This design enables granular domain-wise, pattern-wise, and modality-wise (QA/MCQ/TF) measurement of model safety postures under realistic adversarial and benign contexts.
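
The ASR, ORR, and CER definitions above translate directly into a small aggregation routine. The sketch below is illustrative only: the refusal-phrase heuristic and all counts are hypothetical placeholders, and a real evaluation would substitute the benchmark's rule-based or LLM-judge labels.

```python
from dataclasses import dataclass

# Rough keyword-based refusal heuristic (hypothetical phrase list); CSSBench-style
# evaluations typically combine such rules with an LLM judge for borderline cases.
REFUSAL_MARKERS = ("我不能", "无法提供", "抱歉", "不便回答")

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains a known refusal phrase."""
    return any(marker in response for marker in REFUSAL_MARKERS)

@dataclass
class TaskResults:
    """Aggregated outcomes for one task type (e.g. 'QA', 'TF', 'MCQ')."""
    task: str
    adversarial_safe: int   # refusals or judged-safe answers on adversarial queries
    adversarial_total: int  # N^M_t in the CER formula

def asr(r: TaskResults) -> float:
    """Attack Success Rate: 1 - (# safe responses / # adversarial queries)."""
    return 1.0 - r.adversarial_safe / r.adversarial_total

def cer(task_results: list[TaskResults], unnecessary_refusals: int, benign_total: int) -> float:
    """Composite Error Rate: query-count-weighted mix of per-task ASR and the ORR."""
    orr = unnecessary_refusals / benign_total
    numerator = sum(r.adversarial_total * asr(r) for r in task_results) + benign_total * orr
    denominator = sum(r.adversarial_total for r in task_results) + benign_total
    return numerator / denominator

# Hypothetical counts and responses, for illustration only.
results = [TaskResults("QA", 620, 900), TaskResults("TF", 810, 900), TaskResults("MCQ", 790, 900)]
benign_responses = ["抱歉，我不能回答这个问题。", "可以，以下是相关的公开信息。"]
unnecessary = sum(is_refusal(r) for r in benign_responses)
print(round(cer(results, unnecessary, benign_total=len(benign_responses)), 3))
```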

4. Empirical Findings and Model Analysis

Extensive empirical evaluation on CSSBench reveals the current limits and disparities of Chinese-capable LLMs:

  • Adversarial Robustness: Lightweight Chinese LLMs (0.5B–8B) exhibit high vulnerability, with ASR on obfuscated queries reaching 30–39%. Adult content and fraud/hate categories are especially weak spots (>50% ASR) (Zhou et al., 2 Jan 2026). Larger instruction-tuned LLMs generally perform better but remain vulnerable to sophisticated Chinese-specific patterns (Liu et al., 26 Feb 2025).
  • Safety–Helpfulness Trade-off: ORR can range from ∼31% to 87%, indicating that some models achieve safety by blunt over-refusal. The CER reveals trade-offs between these dimensions (Zhou et al., 2 Jan 2026).
  • Pattern and Domain Sensitivity: Pinyin mix and symbol mix patterns yield the highest ASR. In contrast, zero-width and homophone perturbations have less impact but still reveal systematic weaknesses in pattern normalization (Zhou et al., 2 Jan 2026).
  • Task and Domain Heterogeneity: Open-ended QA is most attack-prone; MCQ and TF demonstrate comparatively lower ASR. Public/political safety, fraud, adult content, and discrimination consistently dominate the unsafe response profiles across LLMs (Zhang et al., 2024, Liu et al., 26 Feb 2025).
  • Other CSSBench Implementations: Hierarchical dual-task evaluation in CHiSafetyBench uncovers systematic deficiencies in discrimination detection (lowest MCQ accuracy, lowest refusal rates) and a pronounced failure in multi-turn risk refusal (RR-1 drops up to 50 points vs. single-turn) (Zhang et al., 2024).

5. Comparative Context and Evolution

CSSBench represents an advance over prior general and translated safety benchmarks in several respects:

  • Coverage and Granularity: Compared to early “Do-Not-Answer” or SafetyBench datasets, CSSBench offers finer-grained taxonomy (up to 40 risk types), systematic adversarial coverage, explicit region- and policy-specific edge cases, and composite metrics (Wang et al., 2024, Sun et al., 2023).
  • Dynamic and Continuous Updates: Novel frameworks such as LiveSecBench propose dynamic dataset updates, ELO-style tournament scoring, and continual prompt injection to prevent overfitting and adapt to new threat vectors, including planned multimodal and agentic safety extensions (Li et al., 4 Nov 2025).
  • Dual-Task and Responsibility Design: Benchmarks like CValues and variants of CHiSafetyBench motivate dual-layer designs—incorporating both harm prevention (safety) and higher-order value alignment (responsibility, empathy)—with separate evaluation tracks for adversarial prompts and expert-sourced responsibility dilemmas (Xu et al., 2023).
  • Specialized Subdomains: Newer CSSBench instances include tailored modules for mental health support (PsyCrisis-Bench), handling suicidal ideation and self-injury with pointwise expert-criteria scoring, as well as factuality-focused SafetyQA targeting legal/policy knowledge (Cai et al., 11 Aug 2025, Tan et al., 2024).

6. Significance, Recommendations, and Ongoing Directions

CSSBench defines the de facto standard for evaluating LLM safety in Chinese deployment environments:

  • Development Implications: Adversarial-pattern data and rigorous taxonomy should be incorporated into fine-tuning or RLHF protocols for Chinese-centric LLMs. Static keyword lists and English-aligned safety filters are inadequate; model and pre-tokenizer pipelines must be hardened against region-specific obfuscation (Zhou et al., 2 Jan 2026), for example through input normalization as sketched after this list.
  • Benchmark Evolution: Benchmarks must provision for dynamic update mechanisms as models and attack strategies coevolve. Multi-modal, dialogue-history-aware, and domain-transferable safety evaluation represent ongoing research frontiers (Li et al., 4 Nov 2025, Zhang et al., 2024).
  • Best Practices: Model evaluation on CSSBench is foundational for commercial, governmental, and compliance monitoring of AI systems in China. Developers are advised to audit for both unsafe compliance and unhelpful over-refusal, calibrate sensitivity to subtle variants, and contextualize benchmarks with evolving regulations and societal values (Wang et al., 2024, Zhang et al., 2024).
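
As one concrete illustration of the input-hardening recommendation above, the sketch below shows a plausible pre-filter normalization step that undoes zero-width insertion, full-width/half-width tricks, and symbol splitting before text reaches keyword filters or the tokenizer. The specific character sets and regular expression are assumptions for illustration, not a vetted defense.

```python
import re
import unicodedata

# Zero-width and formatting code points commonly abused for token splitting.
ZERO_WIDTH_CHARS = "\u200b\u200c\u200d\u2060\ufeff"

# Benign symbols sometimes inserted between Chinese characters to break keyword
# matches; this set is an illustrative assumption, not an exhaustive list.
SPLIT_SYMBOL_RE = re.compile(r"(?<=[\u4e00-\u9fff])[\s/*#_\-|]+(?=[\u4e00-\u9fff])")

def normalize_for_safety_filter(text: str) -> str:
    """Canonicalize input before it reaches keyword filters or the tokenizer."""
    # 1. Unicode compatibility normalization (folds full-width forms, etc.).
    text = unicodedata.normalize("NFKC", text)
    # 2. Strip zero-width insertions.
    text = text.translate({ord(c): None for c in ZERO_WIDTH_CHARS})
    # 3. Collapse symbol splitting between Chinese characters.
    text = SPLIT_SYMBOL_RE.sub("", text)
    return text

print(normalize_for_safety_filter("示\u200b例/查*询#文本"))  # -> "示例查询文本"
```

Reversing homophone or pinyin substitutions requires lexicon-backed canonicalization and is intentionally out of scope for this sketch.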

A plausible implication is that CSSBench’s modular, adversarial, and hierarchical framework will inform not only Chinese LLM safety evaluation, but also the development of future multilingual and culturally contextualized AI safety protocols globally.
