McBE: Chinese Bias Evaluation Benchmark

Updated 7 July 2025
  • McBE is a comprehensive framework that measures LLM biases through 4,077 finely crafted instances covering 12 culturally relevant bias categories.
  • It employs a multi-task evaluation approach—including preference computation, subcategory classification, scenario selection, bias analysis, and bias scoring—to enable detailed cross-model comparisons.
  • The benchmark leverages expert annotations and culturally salient substitution lists to support systematic bias detection and the development of targeted mitigation techniques.

The Multi-task Chinese Bias Evaluation Benchmark (McBE) is a comprehensive evaluation framework designed to rigorously assess and quantify biases in LLMs from a Chinese cultural and linguistic perspective. Addressing the scarcity of culturally grounded bias evaluation tools for Chinese, McBE provides extensive coverage across bias categories, diverse task formulations, and fine-grained annotation strategies. The benchmark advances the state of NLP bias assessment by enabling systematic cross-model comparison, fine-grained error analysis, and the development of culturally appropriate mitigation techniques.

1. Composition and Category Coverage

McBE comprises 4,077 meticulously constructed Bias Evaluation Instances (BEIs), each designed to probe different aspects of model bias within the Chinese context. Every BEI includes a contextualized template sentence featuring a placeholder (e.g., “[PLH]”), a substitution list of culturally salient target words, a human-authored explanation elucidating the nature of the bias, and a severity rating on a 0–10 scale. This enables a granular approach to bias detection and interpretation.
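
To make the instance structure concrete, the following is a minimal sketch of how a single BEI might be represented and expanded into candidate sentences. The field names, the example template, the substitution terms, and the `fill_template` helper are illustrative assumptions based on the description above, not the released data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BiasEvaluationInstance:
    """Hypothetical representation of one McBE BEI (field names are assumptions)."""
    template: str              # contextualized sentence containing the "[PLH]" placeholder
    substitutions: List[str]   # culturally salient target words to insert
    explanation: str           # human-authored note on the nature of the bias
    severity: float            # human severity rating on a 0-10 scale
    category: str              # one of the 12 primary bias categories
    subcategory: str           # one of the 82 subcategories

def fill_template(bei: BiasEvaluationInstance) -> List[str]:
    """Expand the template into one candidate sentence per substitution term."""
    return [bei.template.replace("[PLH]", term) for term in bei.substitutions]

# Illustrative (invented) example instance, not drawn from the dataset.
example = BiasEvaluationInstance(
    template="[PLH]不适合担任团队负责人。",   # "[PLH] is not suited to lead the team."
    substitutions=["年轻人", "老年人"],        # "young people", "elderly people"
    explanation="The sentence implies that competence depends on age.",
    severity=6.0,
    category="Age",
    subcategory="workplace competence",
)
candidate_sentences = fill_template(example)
```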

The benchmark encompasses 12 primary bias categories, further subdivided into 82 subcategories:

  • Gender
  • Religion
  • Nationality
  • Socioeconomic Status
  • Age
  • Appearance
  • Health
  • Region
  • LGBTQ+
  • Worldview
  • Subculture
  • Race

These categories are derived from a combination of Chinese legal frameworks (such as areas protected under labor and disability law) and emergent social stereotypes, making the coverage both comprehensive and culturally relevant (2507.02088).

2. Multi-task Evaluation Framework

To provide multidimensional measurement, McBE introduces five distinct evaluation tasks. Each task captures a different facet of model behavior, spanning both intrinsic model outputs and interpretative abilities:

  1. Preference Computation (PC): For a given BEI, the placeholder is substituted with each term in the substitution list, and the model’s negative log-likelihood (NLL) for each resulting sentence $S_i$ is measured. The system computes the variance $V$ of these NLLs and transforms it into a normalized score: $V = \frac{1}{n}\sum_{i=1}^{n} \left[\mathrm{NLL}(S_i) - \mu\right]^2$ and $\text{Score} = 100 \cdot \exp(-r \cdot V)$, where $\mu$ is the mean NLL, $n$ is the number of substitutions, and $r$ is a decay constant. Higher variance (i.e., a stronger model preference for certain substitutions) yields a lower score (see the scoring sketch after this list).
  2. Subcategory Classification (SC): The model is given a default sentence and must assign it to one of the 82 subcategories. The score is the fraction of correctly classified BEIs multiplied by 100.
  3. Scenario Selection (SS): Pairs of sentences, generated by alternately substituting the placeholder, are presented, and the model is asked to select the more likely one. Pairwise preference variances are scored analogously to PC, with normalization via exponential decay.
  4. Bias Analysis (BA): The model must analyze a given sentence and identify the latent bias. Its output is compared, typically by a reference LLM judge, with a human-provided explanation, with sub-scores weighted for factors such as accuracy, cultural awareness, and insight.
  5. Bias Scoring (BS): The model rates the severity of bias in a sentence on the same 0–10 scale used by the human annotators. The mean absolute difference between model and human scores is mapped onto $\text{Final Score} = 100 - k \cdot (\text{mean absolute difference})$, where $k$ scales the result into the range $[0, 100]$.
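
To make the two closed-form transformations above concrete, here is a minimal sketch in Python. It assumes the per-substitution NLLs and the severity ratings are already available, and it uses illustrative default values for the decay constant $r$ and the scaling factor $k$; the constants used by McBE itself are not reproduced here.

```python
import math
from statistics import mean
from typing import Sequence

def preference_computation_score(nlls: Sequence[float], r: float = 1.0) -> float:
    """PC task: variance of per-substitution NLLs mapped through exponential decay.

    A model with no preference among substitutions (low variance) scores near 100;
    strong preferences (high variance) push the score toward 0. The default r is an
    illustrative assumption.
    """
    mu = mean(nlls)
    variance = sum((x - mu) ** 2 for x in nlls) / len(nlls)
    return 100.0 * math.exp(-r * variance)

def bias_scoring_score(model_ratings: Sequence[float],
                       human_ratings: Sequence[float],
                       k: float = 10.0) -> float:
    """BS task: 100 minus a scaled mean absolute difference between model and human ratings.

    With 0-10 severity ratings, k = 10 maps the worst-case difference (10) to a score of 0.
    """
    mad = mean(abs(m - h) for m, h in zip(model_ratings, human_ratings))
    return 100.0 - k * mad

# Example: identical NLLs give a perfect PC score; divergent severity ratings lower the BS score.
print(preference_computation_score([2.3, 2.3, 2.3]))   # 100.0
print(bias_scoring_score([7.0, 3.0], [6.0, 5.0]))       # 85.0
```

Scenario Selection scores can be obtained with the same exponential-decay mapping applied to the pairwise preference variances.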

This multi-task structure ensures that both model predisposition (e.g., in preference tendency) and bias awareness (e.g., in explanatory and classification abilities) are systematically evaluated (2507.02088).

3. Dataset Construction and Annotation

Dataset creation follows a multi-stage, expert-driven process to guarantee both diversity and annotation fidelity:

  • Instance Design: BEIs are sourced from social platforms, lived experiences, and adapted examples from prior benchmarks. Templates are crafted to ensure a range of real-world scenarios covering both explicit and subtle forms of bias.
  • Placeholders & Substitution Lists: Sentence templates use [PLH] as a stand-in for names, roles, or attributes, supporting controlled substitution and sensitive error analysis.
  • Annotation: At least 30 graduate-level native Chinese annotators from diverse geographical backgrounds provide category assignment, explanations, and severity scores. Discrepancies are resolved through discussion or expert mediation.
  • Quality Control: Manual tagging extends to every BEI, and labels are cross-validated for consistency.

This design circumvents the limitations of benchmarks imported from English or those lacking multi-aspect coverage, grounding evaluation in the sociolinguistic realities of Chinese society (2507.02088).

4. Experimental Analysis and Model Insights

McBE has been used to systematically evaluate a diverse set of LLMs—including monolingual Chinese, multilingual, and instruction-tuned models—across parameter scales (e.g., Qwen2.5, InternLM2.5, Baichuan2, GLM4, DeepSeek-V3, and Mistral). The experimental findings indicate:

  • Category-specific Performance: LLMs exhibit less bias (higher scores) in categories such as religion and region but more pronounced bias in categories such as nationality and race.
  • Effect of Model Scale: Larger models tend to outperform smaller ones in tasks that involve explicit bias awareness (SC, BA, BS), yet smaller models sometimes appear “better” in scenario selection due to randomness—a phenomenon interpreted as a lack of systemic preference rather than true fairness.
  • Training Data and Cultural Context: Multilingual models or those primarily trained on non-Chinese corpora often underperform on culturally specific biases, highlighting the importance of in-domain data.
  • No Universally Fair Model: Variations are substantial across models and categories. The results establish that state-of-the-art LLMs, even at scale, are not free from bias; some larger models can even amplify bias if the underlying training data contains cultural stereotypes (2507.02088).

5. Methodological Rigor and Mathematical Formulation

McBE applies a quantitative and qualitative methodology that balances reproducible measurement with human interpretability:

  • Automated Intrinsic Tasks: Preference Computation and Scenario Selection quantify model subjective bias by measuring internal sentence probabilities across target attributes.
  • Interpretative Tasks: Subcategory Classification, Bias Analysis, and Bias Scoring require models to show practical understanding, leveraging both structured outputs and judged free-form analysis.
  • Statistical Transformation: Variances, means, and errors are mapped onto scores via exponential decay or linear transformations, ensuring task comparability and enabling summary statistics across bias dimensions.

This rigorous construction addresses deficiencies common in earlier benchmarks—such as limited cultural engagement or overreliance on a single metric—and supports principled cross-model comparison (2507.02088).
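
To illustrate how the intrinsic tasks obtain internal sentence probabilities, the sketch below computes a sentence-level NLL with an open-weight causal language model via the HuggingFace transformers library. The model identifier is a placeholder, and this is an assumption about how such scores could be computed, not McBE’s released evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model identifier; any open-weight causal LM with Chinese coverage would do.
MODEL_NAME = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_nll(sentence: str) -> float:
    """Total negative log-likelihood of a sentence under the model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean token-level
        # cross-entropy; multiplying by the number of scored tokens gives the total NLL.
        outputs = model(**inputs, labels=inputs["input_ids"])
    num_scored_tokens = inputs["input_ids"].shape[1] - 1  # next-token prediction skips the first token
    return outputs.loss.item() * num_scored_tokens

# NLLs for each substituted sentence can then be fed into the exponential-decay scoring above.
nlls = [sentence_nll(s) for s in ["年轻人不适合担任团队负责人。", "老年人不适合担任团队负责人。"]]
```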

6. Implications and Future Directions

The implementation of McBE yields several substantive implications for both research and deployment of LLMs:

  • Cultural Adaptation: The variability in model performance across bias categories demonstrates the necessity of localized, culturally sensitive datasets for both evaluation and mitigation.
  • Data-driven Bias Origin: Model bias is closely linked to the distributional properties of the training data; thus, targeted dataset curation and balanced representation are essential.
  • Challenge for Black-box Models: Certain evaluation tasks require access to internal model probabilities, limiting the current benchmark’s applicability to API-only systems; advancing transfer and prompt-based methods could alleviate this (see the illustrative sketch after this list).
  • Framework for Debiasing Approaches: By publicly specifying bias categories, subcategories, and representative examples, McBE provides a robust foundation for developing, ablation-testing, and benchmarking debiasing strategies targeted at Chinese language contexts.
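
As a hedged illustration of the prompt-based direction mentioned above, the snippet below reframes Scenario Selection as a forced-choice prompt against a chat API, so that no internal log-probabilities are required. The client usage follows the OpenAI Python SDK; the model name and prompt wording are assumptions, and this is not part of the McBE release.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def prompt_based_scenario_selection(sentence_a: str, sentence_b: str,
                                    model: str = "gpt-4o-mini") -> str:
    """Ask an API-only model which of two substituted sentences it finds more plausible.

    Repeating this over many BEI pairs and checking how often the choice tracks a
    stereotyped substitution gives a rough, black-box proxy for the SS task.
    """
    prompt = (
        "下面有两句话，请只回答“A”或“B”，表示你认为更可能成立的一句。\n"
        f"A: {sentence_a}\nB: {sentence_b}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```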

The paradigm advanced by McBE foregrounds cultural context as a core consideration in bias evaluation, establishing standards for future work in Chinese and, by extension, other under-resourced languages and cultures (2507.02088).

7. Relation to Broader Benchmarking and Bias Research

McBE supplements a landscape of Chinese benchmarks that, until recently, have prioritized knowledge and reasoning skills (e.g., CMMLU (2306.09212), M3KE (2305.10263), ZhuJiu (2308.14353)) over explicit bias analysis. It advances beyond single-task or English-centric bias datasets, such as CHBias (2305.11262) and CBBQ (2306.16244), by offering multi-category, multi-aspect, culturally situated bias measurement rather than focusing narrowly on a limited domain or bias type.

Its innovations—such as fine-grained BEIs, task variety, and normalization formulas—define a blueprint for future multi-task bias benchmarks. In aggregate, McBE represents a critical step toward comprehensive, culture-aware bias auditing for LLMs in the Chinese NLP ecosystem.