Conditional Bias Scan (CBS)
- Conditional Bias Scan (CBS) is a flexible auditing framework that reveals intersectional and contextual biases by identifying subgroups with statistically significant deviations.
- It recasts bias detection as a conditional independence test, leveraging efficient subset scanning and robust statistical calibration to quantify disparities.
- Empirical case studies, such as COMPAS analyses, demonstrate CBS's superior capability in uncovering hidden false positive and calibration disparities compared to traditional fairness metrics.
Conditional Bias Scan (CBS) is a flexible auditing framework for detecting intersectional and contextual biases in classification models. Designed to reveal subgroup disparities not detectable by standard group fairness metrics, CBS systematically searches for subgroups within a protected class whose predicted outcomes—or realized errors on those predictions—differ significantly from their counterparts in the non-protected class, according to a broad range of fairness criteria. The method integrates efficient subset scanning, robust statistical calibration, and the capacity to audit both probabilistic and binarized model outcomes (Boxer et al., 2023).
1. Formalization of Intersectional Bias
Standard group-fairness metrics (e.g., equalized odds, calibration) compare rates between the protected ($A = 1$) and non-protected ($A = 0$) groups in aggregate. However, aggregate parity can mask disparities that manifest only in subgroups defined by intersections of covariates, such as race × gender or age × criminal history.
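Let $i$ index individuals, with the following observed for each:
- Sensitive attribute $A_i \in \{0, 1\}$ ($A_i = 1$ denotes membership in the protected class)
- Covariates $X_i$
- Binary outcome $Y_i \in \{0, 1\}$
- Probabilistic prediction $P_i \in [0, 1]$
- Binary recommendation $P_{i,\text{bin}} \in \{0, 1\}$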
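An intersectional bias exists if there is a subgroup $S$, defined by a conjunction of covariate values, whose fairness metric (e.g., false positive rate) in the protected class differs from that of the corresponding subgroup in the non-protected class, even when group-level parity holds. For the false positive rate of the binary recommendation $P_{\text{bin}}$, for example:
$$\Pr(P_{\text{bin}} = 1 \mid Y = 0, A = 1, X \in S) \neq \Pr(P_{\text{bin}} = 1 \mid Y = 0, A = 0, X \in S)$$
CBS operationalizes the discovery of $S$ and quantifies the statistical significance of the observed deviation.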
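The masking effect described above can be illustrated with a small sketch. The records below are made-up toy data (not COMPAS), and the tuple layout `(A, sex, Y, P_bin)` is an assumption chosen for illustration. Every record has $Y = 0$, so the share of positive recommendations is the false positive rate.

```python
# Hypothetical toy records: (A, sex, Y, P_bin). All values are invented.
records = (
    [(1, "M", 0, 1)] * 30 + [(1, "M", 0, 0)] * 20    # protected males
    + [(1, "F", 0, 1)] * 10 + [(1, "F", 0, 0)] * 40  # protected females
    + [(0, "M", 0, 1)] * 20 + [(0, "M", 0, 0)] * 30  # non-protected males
    + [(0, "F", 0, 1)] * 20 + [(0, "F", 0, 0)] * 30  # non-protected females
)

def fpr(rows):
    """False positive rate: share of true negatives (Y = 0) with P_bin = 1."""
    negatives = [r for r in rows if r[2] == 0]
    return sum(r[3] for r in negatives) / len(negatives)

# Aggregate comparison between protected (A = 1) and non-protected (A = 0).
group_fpr = {a: fpr([r for r in records if r[0] == a]) for a in (0, 1)}

# The same comparison inside each sex subgroup.
subgroup_fpr = {
    (a, s): fpr([r for r in records if r[0] == a and r[1] == s])
    for a in (0, 1)
    for s in ("M", "F")
}

print(group_fpr)     # both groups share the same aggregate rate
print(subgroup_fpr)  # protected males differ from non-protected males
```

In this toy data the aggregate rates coincide, while the male subgroup shows a gap that an aggregate metric would not reveal.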
2. Mathematical Formulation and Scan Statistics
CBS recasts the audit for subgroup bias as a conditional independence test. For each individual $i$, CBS defines an event variable $I_i$ (e.g., prediction, recommendation, or outcome), and tests the null hypothesis
$$H_0: I \perp A \mid (C, X)$$
where $C$ is a conditioning variable (e.g., $Y$ or $P$), as defined by the specific fairness criterion.
For each $i$ with $A_i = 1$, CBS estimates an expected “counterfactual” value:
$$\hat{I}_i = E_{H_0}[I_i \mid C_i, X_i]$$
A scan statistic $F(S)$ aggregates the evidence for bias within each candidate subgroup $S$, with the log-likelihood ratio (LLR) distinguishing the observed $I_i$'s in $S$ from their expected $\hat{I}_i$'s.
Table: CBS Scan Forms
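| Scan Type | $I$, $C$ | $F(S)$ Formulation (for subgroup $S$) |
|---|---|---|
| Separation, binaries | $I = P_{\text{bin}}$, $C = Y$ | Bernoulli LLR comparing observed $I_i$ with expected $\hat{I}_i$ over $i \in S$ |
| Separation, continuous | $I = P$, $C = Y$ | LLR for real-valued $I_i$, comparing observed predictions with $\hat{I}_i$ over $i \in S$ |
| Sufficiency, binaries | $I = Y$, $C = P$ | Bernoulli form above, with $\hat{I}_i$ estimated given $(P_i, X_i)$ |
| Sufficiency, binaries | $I = Y$, $C = P_{\text{bin}}$ | As above, conditioned on $P_{\text{bin}}$ |
The scan maximizes $F(S)$ over all possible $S$ to identify the most substantively biased subgroup $S^*$.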
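To make the Bernoulli scan statistic concrete, here is a minimal sketch of one form used in the bias-scan literature: under the alternative, the odds of $I_i = 1$ are multiplied by a factor $q > 1$ for every $i \in S$, giving $F(S) = \max_{q \ge 1} \sum_{i \in S} \left[ I_i \log q - \log(1 - \hat{I}_i + q \hat{I}_i) \right]$. Treat this as an assumed form for illustration; the exact CBS statistic may differ in detail. The function names are hypothetical, and each $\hat{I}_i$ is assumed to lie strictly between 0 and 1.

```python
import math

def bernoulli_llr(I, I_hat, q):
    """LLR of 'odds multiplied by q' versus the null, summed over a subgroup."""
    return sum(i * math.log(q) - math.log(1.0 - p + q * p)
               for i, p in zip(I, I_hat))

def score_subgroup(I, I_hat, log_q_max=20.0, iters=100):
    """Return (F, q*) maximising the LLR over 1 <= q <= exp(log_q_max)."""
    observed = sum(I)

    def slope(t):
        # Derivative of the LLR with respect to t = log q:
        # observed count minus the count expected under odds multiplier q.
        q = math.exp(t)
        return observed - sum(q * p / (1.0 - p + q * p) for p in I_hat)

    if slope(0.0) <= 0.0:       # no excess over expectation: best q is 1
        return 0.0, 1.0
    lo, hi = 0.0, log_q_max
    for _ in range(iters):      # bisection works because the LLR is concave in log q
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    q = math.exp(0.5 * (lo + hi))
    return bernoulli_llr(I, I_hat, q), q

# Three of four events observed where one half was expected for each individual.
score, q_star = score_subgroup([1, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(score, q_star)
```

In this example the maximising multiplier moves the expected rate from 0.5 to the observed 0.75, so the score equals the log-likelihood gain of a Bernoulli(0.75) fit over Bernoulli(0.5).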
3. Fairness Definitions Audited by CBS
CBS audits any group-fairness definition reducible to a conditional-independence test:
$$I \perp A \mid (C, X)$$
This unifies a range of commonly used criteria, spanning both “separation” (error-rate parity) and “sufficiency” (predictive value parity), and can be instantiated for both probabilistic and thresholded outcomes:
- Separation-based fairness (condition on ground truth $Y$):
  - Balance for the positive/negative class: $E[P \mid Y = 1]$, $E[P \mid Y = 0]$
  - True/false positive and negative rate parity (on $P_{\text{bin}}$)
- Sufficiency-based fairness (condition on model output $P$ or $P_{\text{bin}}$):
  - Calibration/predictive parity: $I = Y$, $C = P$
  - Positive/negative predictive value parity
CBS can also conduct value-conditional scans, e.g., restricting to $Y = 0$ to detect false positive rate disparities.
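The mapping from a fairness definition to an event variable and a conditioning variable can be sketched as a small lookup. The scan names, the row layout, and the `build_scan_inputs` helper are hypothetical and are not taken from any CBS implementation. The pairings follow the separation and sufficiency definitions above, with a false positive rate scan expressed as a separation scan on recommendations restricted to $Y = 0$.

```python
# Hypothetical scan names mapped to (event variable I, conditioning variable C).
SCAN_TYPES = {
    "separation_predictions": ("P", "Y"),
    "separation_recommendations": ("P_bin", "Y"),
    "sufficiency_predictions": ("Y", "P"),
    "sufficiency_recommendations": ("Y", "P_bin"),
}

def build_scan_inputs(rows, scan_type, restrict_to=None):
    """Project rows onto (A, X, I, C); optionally keep only rows with C == restrict_to."""
    event_key, cond_key = SCAN_TYPES[scan_type]
    out = []
    for row in rows:
        if restrict_to is not None and row[cond_key] != restrict_to:
            continue
        out.append({"A": row["A"], "X": row["X"],
                    "I": row[event_key], "C": row[cond_key]})
    return out

# Invented toy rows.
rows = [
    {"A": 1, "X": "a", "Y": 0, "P": 0.8, "P_bin": 1},
    {"A": 1, "X": "a", "Y": 1, "P": 0.9, "P_bin": 1},
    {"A": 0, "X": "b", "Y": 0, "P": 0.2, "P_bin": 0},
]

# False positive rate scan: recommendations among true negatives only.
fpr_inputs = build_scan_inputs(rows, "separation_recommendations", restrict_to=0)

# Calibration-style scan: outcomes conditioned on the probabilistic prediction.
calibration_inputs = build_scan_inputs(rows, "sufficiency_predictions")
```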
4. Algorithmic Implementation
The CBS algorithm integrates statistical estimation, subset scanning, and permutation inference.
- Expected Value Estimation: Train regression or logistic regression models (using $A = 0$ data), with inverse-propensity weights $w_i$, to estimate $\hat{I}_i$ for each $i$ with $A_i = 1$.
- Scan Statistic Construction: Compute $F(S)$ as specified for the chosen fairness definition and scan type.
- Iterative Subset Scan: For each of several random restarts:
  - Initialize candidate subgroup $S$.
  - For each unscanned attribute $X_j$, relax $S$ on $X_j$, evaluating $F(S)$ for possible value sets and constructing intervals over thresholds.
  - Select the interval/subgroup with maximal $F(S)$ as the current $S$.
  - Iterate until convergence, recording the maximal $F(S)$ and corresponding $S^*$.
- Statistical Significance Assessment: Evaluate the significance of $F(S^*)$ via permutation testing (shuffling $A$).
This approach exploits the Additive Linear-Time Subset Scanning (ALTSS) property to achieve near-linear time subset evaluation at each step (Boxer et al., 2023).
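The restart-and-relax loop can be sketched as coordinate ascent over categorical attributes. This is a simplified illustration, not the authors' implementation: it enumerates every value subset of the attribute being relaxed (exponential in arity) where an ALTSS-based scan would examine far fewer, it assumes a Bernoulli odds-multiplier LLR as the score, and all names and toy data are hypothetical. The first restart begins from the full population and later restarts begin from random value sets.

```python
import itertools
import math
import random

def llr_score(I, I_hat, log_q_max=20.0):
    """Assumed Bernoulli LLR, maximised over the odds multiplier q >= 1."""
    total = sum(I)
    def slope(t):
        q = math.exp(t)
        return total - sum(q * p / (1.0 - p + q * p) for p in I_hat)
    if not I or slope(0.0) <= 0.0:
        return 0.0
    lo, hi = 0.0, log_q_max
    for _ in range(60):                      # bisection on t = log q
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    q = math.exp(t)
    return sum(i * t - math.log(1.0 - p + q * p) for i, p in zip(I, I_hat))

def subgroup_score(rows, subgroup):
    """Score the rows whose attribute values all fall inside the subgroup."""
    members = [r for r in rows
               if all(r["X"][a] in vals for a, vals in subgroup.items())]
    return llr_score([r["I"] for r in members], [r["I_hat"] for r in members])

def coordinate_ascent(rows, domains, restarts=5, seed=0):
    rng = random.Random(seed)
    best_score, best_subgroup = 0.0, {a: set(v) for a, v in domains.items()}
    for restart in range(restarts):
        if restart == 0:                     # first start: everyone included
            subgroup = {a: set(v) for a, v in domains.items()}
        else:                                # later starts: random value sets
            subgroup = {a: set(rng.sample(sorted(v), rng.randint(1, len(v))))
                        for a, v in domains.items()}
        score = subgroup_score(rows, subgroup)
        improved = True
        while improved:                      # sweep attributes until no gain
            improved = False
            for attr, values in domains.items():
                for k in range(1, len(values) + 1):
                    for combo in itertools.combinations(sorted(values), k):
                        candidate = dict(subgroup, **{attr: set(combo)})
                        cand_score = subgroup_score(rows, candidate)
                        if cand_score > score + 1e-12:
                            subgroup, score, improved = candidate, cand_score, True
        if score > best_score:
            best_score, best_subgroup = score, subgroup
    return best_score, best_subgroup

# Invented toy data: expected rate 0.3 everywhere, inflated only for young males.
domains = {"sex": {"F", "M"}, "age": {"old", "young"}}
rows = []
for sex in ("F", "M"):
    for age in ("old", "young"):
        positives = 28 if (sex, age) == ("M", "young") else 12
        for k in range(40):
            rows.append({"X": {"sex": sex, "age": age},
                         "I": 1 if k < positives else 0, "I_hat": 0.3})

best_score, best_subgroup = coordinate_ascent(rows, domains)
print(best_score, best_subgroup)
```

On this toy data the search settles on the single cell with the injected excess. A start that excludes that cell along every coordinate can stall on a flat region, which is one reason multiple restarts are used.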
5. Theoretical Foundations
CBS’s scan statistics satisfy the ALTSS property, ensuring each attribute scan requires evaluating $O(m)$ subsets rather than $O(2^m)$, where $m$ is attribute arity. Multiple random restarts and coordinate ascent enable effective search for the global optimum in practice. The use of permutation-based p-values corrects for multiple testing across subgroups, maintaining control of the family-wise error rate while retaining high detection power. Empirical evaluation demonstrates that CBS achieves higher detection accuracy (e.g., Jaccard index for true subgroup recovery) than GerryFair and Multiaccuracy Boost, particularly for small or subtle subgroups.
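The multiple-testing correction comes from comparing the observed maximum score against the maximum score found in each permuted dataset. The sketch below shows that logic under loud assumptions: `max_gap` is a toy stand-in for $\max_S F(S)$ (the largest protected-minus-non-protected gap in mean $I$ over strata), labels $A$ are shuffled within strata of $X$, and the data are invented. A real CBS run would recompute the full scan on every replicate.

```python
import random
from collections import defaultdict

def max_gap(rows):
    """Toy stand-in for max_S F(S): largest gap in mean I (A=1 minus A=0) over strata."""
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])   # [sum_1, n_1, sum_0, n_0]
    for r in rows:
        s = sums[r["X"]]
        if r["A"] == 1:
            s[0] += r["I"]
            s[1] += 1
        else:
            s[2] += r["I"]
            s[3] += 1
    return max(s[0] / s[1] - s[2] / s[3] for s in sums.values() if s[1] and s[3])

def shuffle_within_strata(rows, rng):
    """Return a copy of rows with the A labels permuted inside each X stratum."""
    by_x = defaultdict(list)
    for idx, r in enumerate(rows):
        by_x[r["X"]].append(idx)
    shuffled = [dict(r) for r in rows]
    for idxs in by_x.values():
        labels = [rows[i]["A"] for i in idxs]
        rng.shuffle(labels)
        for i, a in zip(idxs, labels):
            shuffled[i]["A"] = a
    return shuffled

def permutation_p_value(rows, statistic, n_perm=200, seed=0):
    """Share of null replicates whose maximum statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(rows)
    hits = sum(statistic(shuffle_within_strata(rows, rng)) >= observed
               for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)               # add-one keeps p above zero

# Invented toy data: stratum g1 is maximally biased, stratum g2 is balanced.
rows = (
    [{"X": "g1", "A": 1, "I": 1} for _ in range(10)]
    + [{"X": "g1", "A": 0, "I": 0} for _ in range(10)]
    + [{"X": "g2", "A": a, "I": i} for a in (0, 1) for i in (0, 1) for _ in range(5)]
)

p_value = permutation_p_value(rows, max_gap)
print(p_value)
```

Because each replicate contributes its own maximum over all strata, a subgroup is only declared significant if its score is unusual relative to the best score that chance alone produces anywhere in the search space.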
6. Empirical Results: COMPAS Case Studies
CBS was evaluated using semi-synthetic and real-world analyses of the COMPAS pre-trial risk assessment tool:
- Simulated Bias Experiments: With injected biases over COMPAS covariates,
  - Separation scans were most sensitive to artificial shifts in predicted log-odds.
  - Sufficiency scans best detected shifts in true log-odds.
  - CBS outperformed competing auditors in accuracy across varied subgroup sizes and bias magnitudes.
- Real Data Analysis (COMPAS):
  - Significant false positive rate disparity detected for Black males relative to non-Black males.
  - Separation for predictions flagged the “under-25 felony” subgroup with substantial disparity in average predicted risk.
  - Sufficiency analysis revealed calibration issues among a subgroup of older males defined by their number of prior offenses.
7. Practical Considerations and Limitations
CBS’s reliability depends on accurate estimation of expected outcomes ($\hat{I}_i$), which in turn is sensitive to the specification of the propensity-score and outcome models. Doubly robust or targeted learning approaches may offer improved robustness. CBS addresses only group-level conditional-independence fairness, not individual fairness or counterfactual analyses. Detection power is reduced for very small subgroups or highly noisy data, especially in sufficiency scans with weak covariate signal. CBS identifies the single most significant subgroup per run; iterative re-scanning is required for disjoint subgroup discovery. Permutation-based significance computation is computationally intensive, though approximate null distributions may offer future acceleration.
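One way to read the iterative re-scanning requirement is as a remove-and-repeat loop: run a scan, set aside the detected subgroup, and scan the remainder. The sketch below assumes that reading, which the source does not spell out. `toy_scan` is a hypothetical stand-in for a full CBS run that returns the stratum with the largest summed excess of $I$ over $\hat{I}$, and the data are invented. In practice each round would also need its own significance test.

```python
from collections import defaultdict

def toy_scan(rows):
    """Hypothetical stand-in for one scan: returns (score, ids of the best stratum)."""
    excess = defaultdict(float)
    members = defaultdict(set)
    for r in rows:
        excess[r["X"]] += r["I"] - r["I_hat"]
        members[r["X"]].add(r["id"])
    best = max(excess, key=excess.get)
    return excess[best], members[best]

def iterative_rescan(rows, scan, max_rounds=5, min_score=0.0):
    """Repeatedly scan, record the detected subgroup, and drop it from the pool."""
    remaining, found = list(rows), []
    for _ in range(max_rounds):
        if not remaining:
            break
        score, ids = scan(remaining)
        if score <= min_score:               # nothing left above the threshold
            break
        found.append((score, ids))
        remaining = [r for r in remaining if r["id"] not in ids]
    return found

# Invented toy data: three strata with 10, 7, and 4 events out of 10 (5 expected).
rows = []
for stratum, positives in (("a", 10), ("b", 7), ("c", 4)):
    for k in range(10):
        rows.append({"id": f"{stratum}{k}", "X": stratum,
                     "I": 1 if k < positives else 0, "I_hat": 0.5})

found = iterative_rescan(rows, toy_scan)
print([(score, sorted(ids)[:2]) for score, ids in found])
```

The loop returns disjoint subgroups in decreasing order of score and stops once the best remaining candidate no longer exceeds the threshold.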
CBS provides a statistically principled and computationally efficient approach for discovering intersectional and contextual model biases under a broad range of group-fairness definitions, with demonstrated effectiveness both in synthetic settings and real-world deployments (Boxer et al., 2023).