Code Bias Score (CBS)

Updated 24 May 2026

Code Bias Score (CBS) is a quantitative metric that measures output disparities when controlled changes are made to protected attributes in model-generated code and text.
It encompasses methodologies such as metamorphic testing, classifier-based labeling, and masked language modeling to objectively evaluate bias.
CBS evaluations inform fairness trade-offs in LLMs, guiding bias mitigation through prompt engineering, classifier post-filtering, and debiasing techniques.

The Code Bias Score (CBS) is a quantitative metric designed to measure the prevalence and severity of social bias in generative models, with principal application in the assessment of LLMs tasked with code or language generation. It has emerged as a response to the growing deployment of LLM-powered systems in domains where systematic disparities based on demographic or protected attributes constitute a primary risk. Variants of CBS have been defined for code generation, template-based masked language modeling, and word embeddings; each instantiation operationalizes the underlying notion of demographic parity or invariance with respect to protected attributes (Lee et al., 2024, Liu et al., 2023, Ling et al., 2024, Rabbi et al., 1 May 2026).

1. Formal Definitions and Metric Variants

Each CBS variant measures whether varying a protected attribute (gender, race, age, etc.) changes the model’s output, and treats such a change as an indicator of bias. Three formalisms dominate:

Counterfactual Code Bias Score: Most LLM code-generation bias studies, including Solar and SocialBias-Bench, define CBS as the percentage of executable generated code snippets whose output changes under a controlled flip of a protected attribute. Formally, letting $N_e$ be the number of executable snippets and $N_b$ the number of those with any attribute-based output disparity, CBS is:

$\text{CBS} = \frac{N_b}{N_e} \times 100$

Per-dimension CBS can be similarly defined by restricting $N_b$ to failures involving a specific attribute (Ling et al., 2024, Rabbi et al., 1 May 2026).
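
The overall and per-dimension formulas can be illustrated with a minimal sketch. It assumes each executable snippet has already been reduced to the set of protected attributes whose flip changed its output; the function name `code_bias_score` and this input layout are illustrative choices, not taken from the cited papers.

```python
def code_bias_score(results, dimensions):
    """Compute overall and per-dimension CBS as percentages.

    results:    one entry per executable snippet (so len(results) is N_e);
                each entry is the set of protected attributes whose flip
                changed that snippet's output (empty set = no disparity).
    dimensions: protected attributes to report per-dimension scores for.
    """
    n_e = len(results)
    if n_e == 0:
        raise ValueError("CBS is undefined when no snippet is executable")
    # N_b: snippets with a disparity on at least one protected attribute.
    n_b = sum(1 for failed in results if failed)
    overall = 100.0 * n_b / n_e
    # Per-dimension CBS restricts N_b to failures on one attribute.
    per_dim = {
        d: 100.0 * sum(1 for failed in results if d in failed) / n_e
        for d in dimensions
    }
    return overall, per_dim


# Hypothetical results for four executable snippets.
results = [{"gender"}, set(), {"gender", "age"}, set()]
overall, per_dim = code_bias_score(results, ["gender", "age", "race"])
print(overall)  # 50.0
print(per_dim)  # {'gender': 50.0, 'age': 25.0, 'race': 0.0}
```

Because a snippet can fail on several attributes at once, per-dimension scores need not sum to the overall score.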

Classifier-Based CBS: In some work, notably Liu et al., the metric is the fraction of completions flagged as biased by a binary classifier trained on annotated data:

$\mathrm{CBS} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[P_{\mathrm{cls}}(\text{code}_i) \ge 0.5\bigr] \times 100$

Here $N$ is the number of evaluated completions and $P_{\mathrm{cls}}$ is the classifier’s bias confidence for each code snippet (Liu et al., 2023).
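
As a minimal sketch of this formula, the following assumes the classifier confidences are already available as a list of probabilities; the function name `classifier_cbs` and the sample values are hypothetical.

```python
def classifier_cbs(bias_probs, threshold=0.5):
    """Percentage of completions whose bias confidence meets the threshold."""
    if not bias_probs:
        raise ValueError("need at least one completion")
    # Indicator term: 1 when the confidence is at or above the threshold.
    flagged = sum(1 for p in bias_probs if p >= threshold)
    return 100.0 * flagged / len(bias_probs)


# Hypothetical confidences for four completions.
probs = [0.91, 0.12, 0.50, 0.49]
score = classifier_cbs(probs)
print(score)  # 50.0
```

The comparison is inclusive, so a confidence of exactly 0.5 counts as biased, matching the $\ge 0.5$ in the formula.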

Categorical Bias Score in MLMs: In template-based masked language modeling, CBS quantifies the average absolute deviation in the predicted probability mass assigned to each class relative to its base rate:

$\mathrm{CBS}(A,T) = \frac{1}{|A|}\sum_{a\in A} \biggl|\,\frac{1}{|T|}\sum_{t\in T} P_t(a) - P(a)\biggr|$

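Here $A$ is the set of attribute values (e.g., nationalities), $T$ is the template set, $P_t(a)$ is the probability the model assigns to $a$ in template $t$, and $P(a)$ is the base rate of $a$ (Lee et al., 2024).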
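
The formula as written above can be sketched as follows. The sketch assumes the per-template probabilities have already been extracted from the masked language model and that base rates are supplied by the caller; the function name, the dictionary layout, and the two nationalities are illustrative assumptions.

```python
def categorical_bias_score(template_probs, base_rates):
    """Mean absolute deviation of template-averaged probability from base rate.

    template_probs: list with one dict per template t, mapping a -> P_t(a)
    base_rates:     dict mapping each attribute value a -> P(a)
    """
    if not template_probs or not base_rates:
        raise ValueError("need at least one template and one attribute value")
    total = 0.0
    for a, base in base_rates.items():
        # Inner term: average P_t(a) over all templates.
        mean_p = sum(p[a] for p in template_probs) / len(template_probs)
        total += abs(mean_p - base)
    # Outer term: average the deviations over attribute values.
    return total / len(base_rates)


# Hypothetical probabilities for two templates and two nationalities.
template_probs = [
    {"korean": 0.7, "japanese": 0.3},
    {"korean": 0.5, "japanese": 0.5},
]
base_rates = {"korean": 0.5, "japanese": 0.5}
score = categorical_bias_score(template_probs, base_rates)
print(round(score, 4))  # 0.1
```

A score of zero here means the template-averaged probability of every attribute value equals its base rate.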

A related but distinct metric, FairScore, combines refusal rate and normalized entropy to balance subgroup parity with the model's tendency to avoid using sensitive attributes altogether (Du et al., 9 Jan 2025).

2. Methodological Foundations

For LLM-generated code, CBS is typically measured through a metamorphic testing framework (Ling et al., 2024, Rabbi et al., 1 May 2026):

Task and Prompt Construction: Human-centered tasks are formalized with all protected and relevant non-protected attributes explicit in the code template.
Code Generation and Filtering: Multiple completions per task are generated, and only successfully executable code snippets are retained.
Metamorphic Test Suite: For each protected attribute dimension, the instance is varied over all possible values (holding all else fixed), and the generated code’s outputs are compared.
Bias Attribution: A snippet is deemed "biased" if there exists any discrepancy in output due solely to a change in the protected attribute.
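
The metamorphic test and bias-attribution steps can be sketched as below. This is a simplified illustration, not the Solar or SocialBias-Bench implementation: the generated snippet is represented by an ordinary Python callable, and `biased_dimensions` and `eligible_for_benefit` are hypothetical names.

```python
def biased_dimensions(snippet, base_input, attribute_values):
    """Return the protected attributes whose value alone changes the output.

    snippet:          callable standing in for one generated code snippet
    base_input:       dict holding every attribute of one test instance
    attribute_values: dict mapping each protected attribute to its value set
    """
    flagged = set()
    for attr, values in attribute_values.items():
        # Vary one protected attribute over all its values, rest held fixed.
        outputs = [snippet({**base_input, attr: v}) for v in values]
        if any(out != outputs[0] for out in outputs[1:]):
            flagged.add(attr)
    return flagged


def eligible_for_benefit(person):
    # Hypothetical generated snippet; the gender check is a planted bias.
    return person["income"] < 30000 and person["gender"] != "male"


base = {"income": 20000, "gender": "female", "age": 40}
protected = {"gender": ["female", "male", "nonbinary"], "age": [25, 40, 70]}
flagged = biased_dimensions(eligible_for_benefit, base, protected)
print(flagged)  # {'gender'}
```

A snippet counts as biased when the returned set is non-empty. A single base input can miss bias that only appears for other values of the non-protected attributes (here, an income above the cutoff hides the gender check), which is why test-suite coverage matters.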

Classifier-based methods train a neural classifier to label completions as “biased” or “not biased,” using human-annotated data as ground truth (Liu et al., 2023).

For MLM-based CBS, template masking and aggregate probability deviations across attribute values are computed as outlined above (Lee et al., 2024).

3. Datasets, Attribute Coverage, and Test Construction

The SocialBias-Bench dataset exemplifies the scale and structure necessary for robust CBS estimation in code. It comprises 343 distinct tasks spanning domains such as social benefits, education, employment, health, licensing, hobbies, and occupations. Each task systematically cross-annotates seven demographic dimensions: race, age, employment status, education, gender, religion, and marital status, each with a finite value set (Rabbi et al., 1 May 2026, Ling et al., 2024).

Test generation uses a domain-specific language and code skeleton to emit all combinations of protected attributes, enabling exhaustive pairwise testing for causal-discrimination (Ling et al., 2024). In template-based language modeling, comparable coverage is constructed across multiple templates and target terms (Lee et al., 2024).
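
The combinatorial part of test generation can be approximated in plain Python. The sketch below does not reproduce the domain-specific language described in the cited work; it only enumerates all attribute combinations and the pairs of instances that differ in exactly one protected attribute. The function name and the attribute values are illustrative.

```python
from itertools import product


def single_flip_pairs(attribute_values):
    """Enumerate all instances and the pairs differing in one attribute."""
    names = list(attribute_values)
    # Every combination of protected-attribute values.
    instances = [
        dict(zip(names, combo))
        for combo in product(*attribute_values.values())
    ]
    pairs = []
    for i, a in enumerate(instances):
        for b in instances[i + 1:]:
            diff = [n for n in names if a[n] != b[n]]
            # Keep only pairs that isolate a single attribute.
            if len(diff) == 1:
                pairs.append((diff[0], a, b))
    return instances, pairs


# Hypothetical value sets for two of the seven dimensions.
attrs = {
    "gender": ["female", "male"],
    "marital_status": ["single", "married", "divorced"],
}
instances, pairs = single_flip_pairs(attrs)
print(len(instances), len(pairs))  # 6 9
```

The number of combinations grows multiplicatively with each added dimension, so a full seven-dimension suite is far larger than this two-dimension example.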

4. Empirical Properties, Interpretation, and Ranges

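In the code-generation variants, CBS is a percentage (0–100%) measuring how often the model produces output that is not invariant to protected-attribute flips; the MLM variant is instead a mean absolute probability deviation, reported as a fraction. In large-scale evaluations: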

CBS values in practice range widely: <20% is considered low, 20–40% moderate, >40% high (Rabbi et al., 1 May 2026, Ling et al., 2024).
Model size, generation hyperparameters (temperature, top-p), and prompt construction directly affect CBS. Larger models often achieve higher code quality (e.g., Pass@1) but also higher CBS, which points to a fairness/utility trade-off (Liu et al., 2023).
CBS can be computed globally or per attribute dimension, enabling fine-grained analysis (e.g., CBS_gender, CBS_race) (Ling et al., 2024).
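
The interpretive bands above translate into a small helper. The function name is hypothetical, and assigning exactly 40% to the moderate band is an assumption based on reading the ranges as "<20", "20–40", and ">40".

```python
def cbs_band(cbs):
    """Map a percentage CBS to the low / moderate / high bands."""
    if not 0.0 <= cbs <= 100.0:
        raise ValueError("percentage CBS must lie in [0, 100]")
    if cbs < 20.0:
        return "low"
    if cbs <= 40.0:
        return "moderate"
    return "high"


print(cbs_band(48.4))   # high
print(cbs_band(16.91))  # low
```

The same helper applies to per-dimension scores such as CBS_gender or CBS_race, since they share the 0–100 scale.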

For MLMs, the KcBERT study reported that fine-tuning with debiasing regularization reduced CBS for ethnic bias from 0.1175 to 0.0395 (Lee et al., 2024).

5. Mitigation Strategies and CBS Sensitivity

Intervention studies measure how CBS responds to changes in prompting, pipeline architecture, output filtering, and training:

Prompt Engineering: Iterative prompt feedback (feeding CBS results back into prompt construction) dramatically reduces CBS, while standard Chain-of-Thought and persona-based prompting have mixed or adverse effects (Ling et al., 2024, Rabbi et al., 1 May 2026).
Multi-Agent Frameworks and Monitors: Structured multi-agent pipelines with explicit fairness-scoping roles (e.g., the Fairness Monitor Agent, FMA) have achieved substantial CBS reductions without sacrificing functionality. The FMA pipeline reduced CBS from 48.4% to 16.91% (Rabbi et al., 1 May 2026).
Classifier Post-filtering: Real-time rejection of classifier-flagged completions is an effective practical mitigation (Liu et al., 2023).
Data Balancing and Regularization: For MLMs, data balancing and debiasing regularization each decrease CBS or the related log-probability bias score (LPBS) (Lee et al., 2024).
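
Classifier post-filtering, as described in the list above, can be sketched in a few lines. The keyword-based scorer below is a toy stand-in for a trained bias classifier, and both function names are hypothetical; a real deployment would plug in the classifier's own confidence function.

```python
def post_filter(completions, bias_prob, threshold=0.5):
    """Keep only completions the classifier does not flag as biased."""
    return [c for c in completions if bias_prob(c) < threshold]


def toy_bias_prob(code):
    # Toy stand-in for a trained classifier: flags any mention of gender.
    return 0.9 if "gender" in code else 0.1


completions = [
    "return income < 30000",
    "return income < 30000 and gender == 'female'",
]
kept = post_filter(completions, toy_bias_prob)
print(kept)  # ['return income < 30000']
```

If every completion is rejected the list is empty, so a caller would need a fallback such as regenerating; the effectiveness of the filter is bounded by the classifier's accuracy.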

6. Limitations and Methodological Caveats

Principal limitations of the CBS framework include:

CBS only detects bias for one attribute at a time; intersectional bias is outside its scope (Ling et al., 2024).
It cannot distinguish between malicious bias and justified, attribute-specific functional requirements (e.g., age-based health recommendations).
Measurement fidelity depends on exhaustiveness of test case coverage and metric sensitivity.
Classifier-based CBS is limited by classifier accuracy and potential annotation domain shift (Liu et al., 2023).
Refusal/entropy composite metrics (FairScore) do not reflect the real-world prevalence or severity of harm but rather balance two formal desiderata (Du et al., 9 Jan 2025).

7. Recommendations and Best Practices

Based on comparative studies and empirical audits, researchers are advised to:

Always report both overall and per-dimension CBS across tasks (Rabbi et al., 1 May 2026, Ling et al., 2024).
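In the geometric (word-embedding) setting, verify that the bias metric is magnitude-comparable, so that scores can be compared across embedding models, and unbiased-trustworthy, so that a score of zero indicates the absence of measurable bias (Schröder et al., 2024).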
Supplement CBS with distributions of test failures, functional correctness statistics, and downstream fairness/counterfactual assessments (Ling et al., 2024, Rabbi et al., 1 May 2026).
Use multi-pass and randomization protocols to mitigate sampling and positional bias (Du et al., 9 Jan 2025).
Ensure transparency by reporting all underlying metrics and test construction details.