Fairness Metrics for Clinical Predictive AI

Updated 1 July 2025
  • Fairness metrics for clinical predictive AI quantify potential discrimination in models against groups based on sensitive attributes like race, sex, or age.
  • The field features a diverse set of metrics, categorized by performance dependency, model output level, and base performance metric, each with conceptual and operational challenges.
  • Choosing appropriate metrics is complex; best practices recommend reporting multiple metrics, quantifying uncertainty, and prioritizing probability-based and utility-focused assessments for better clinical relevance.

Fairness metrics for clinical predictive AI quantify whether predictive models introduce or perpetuate discrimination against individuals or groups defined by sensitive attributes such as race, sex, age, and diagnosis. These metrics serve as core tools for assessing, reporting, and mitigating inequities in clinical AI applications. The literature reveals a diverse and fragmented landscape of fairness metrics, with varying levels of performance dependency, clinical validation, and alignment with real-world utility (2506.17035). A critical appraisal demonstrates the necessity for careful metric selection, uncertainty quantification, intersectional analysis, and clinical contextualization.

1. Classification and Types of Fairness Metrics

Fairness metrics in clinical predictive AI are categorized by three main dimensions: performance dependency, model output level, and base performance metric (2506.17035).

(a) Performance Dependency

  • Performance-independent (unsupervised) metrics: Assess parity in model outputs across groups without using outcome labels. Examples include mean score parity and statistical parity, focusing on whether the proportion of positive predictions is equal across demographic groups.
  • Performance-dependent (supervised) metrics: Evaluate disparities relative to true outcome labels, comparing predictive performance across groups. These metrics typically reflect disparities present in labeled data and are more common in clinical literature.

(b) Model Output Level

  • Probability-based metrics: Utilize estimated risk probabilities pre-threshold, allowing for nuanced assessments across the risk spectrum (e.g., AUROC parity, calibration parity).
  • Threshold-dependent metrics: Operate after applying a decision threshold, evaluating fairness in hard classifications (e.g., equal opportunity difference, predictive parity).

(c) Base Performance Metric

Metrics may reflect differences in:

  • Discrimination: e.g., AUROC, partial AUC
  • Calibration: e.g., calibration-in-the-large, Expected Calibration Error (ECE)
  • Overall performance: e.g., Brier Score, log-loss
  • Partial metrics: e.g., True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV)
  • Summary metrics: e.g., accuracy gap, balanced accuracy gap, F1 parity
  • Clinical utility: e.g., subgroup net benefit

2. Notable Metrics and Healthcare-Specific Extensions

Of the sixty-two metrics identified, eighteen were developed explicitly for healthcare (2506.17035). Key examples include:

  • AUROC Parity: Assesses discrimination differences by checking if area under the ROC curve is similar across groups.

$$\Delta_{\text{AUROC}} = \left| \text{AUROC}_{A=0} - \text{AUROC}_{A=1} \right|$$
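
A minimal sketch of this gap using scikit-learn's roc_auc_score; the two-group encoding (A ∈ {0, 1}) and all variable names below are illustrative assumptions, not specifications from the source.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_parity_gap(y_true, y_prob, group):
    """Absolute AUROC difference between groups A=0 and A=1 (illustrative helper)."""
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    auroc_0 = roc_auc_score(y_true[group == 0], y_prob[group == 0])
    auroc_1 = roc_auc_score(y_true[group == 1], y_prob[group == 1])
    return abs(auroc_0 - auroc_1)
```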

  • Calibration-in-the-large Parity: Evaluates whether mean predicted risk aligns with observed event rates in each group.

$$\Delta_{\text{CITL}} = \left| \left(\text{Mean}(p_{A=0}) - \text{Mean}(Y_{A=0})\right) - \left(\text{Mean}(p_{A=1}) - \text{Mean}(Y_{A=1})\right) \right|$$
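
A corresponding sketch for calibration-in-the-large parity, under the same assumed two-group encoding:

```python
import numpy as np

def citl_parity_gap(y_true, y_prob, group):
    """Gap in calibration-in-the-large (mean predicted risk minus observed event rate) between groups."""
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    citl_0 = y_prob[group == 0].mean() - y_true[group == 0].mean()
    citl_1 = y_prob[group == 1].mean() - y_true[group == 1].mean()
    return abs(citl_0 - citl_1)
```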

  • Equal Opportunity Difference: Measures differences in sensitivity.

$$\Delta_{\text{TPR}} = \left| \text{TPR}_{A=0} - \text{TPR}_{A=1} \right|$$
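
Because this is a threshold-dependent metric, the sketch below assumes hard predictions y_pred obtained after applying a decision threshold:

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute TPR (sensitivity) difference between groups, given binary predictions y_pred."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def tpr(g):
        # Mean of binary predictions among true positives equals TP / (TP + FN).
        return y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))
```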

  • Statistical Parity Difference:

$$\Delta_{\text{SP}} = \left| P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1) \right|$$
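
Statistical parity is performance-independent, so a sketch needs only the hard predictions and group membership, not outcome labels (again with illustrative names):

```python
import numpy as np

def statistical_parity_gap(y_pred, group):
    """Absolute difference in positive prediction rates between groups; no outcome labels required."""
    y_pred, group = map(np.asarray, (y_pred, group))
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())
```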

  • Subgroup Net Benefit: Calculates net benefit at the group level, integrating prevalence difference and weighting of harms/benefits.

$$NB_g = \frac{\text{TP}_g}{n_g} - \frac{\text{FP}_g}{n_g} \cdot \frac{w}{1-w}$$

where $w$ is the threshold probability, so that $w/(1-w)$ is the odds at the threshold.
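
A minimal sketch of subgroup net benefit under that reading of $w$; the function name and argument layout are illustrative assumptions:

```python
import numpy as np

def subgroup_net_benefit(y_true, y_prob, group_mask, threshold):
    """Net benefit within one subgroup at a given threshold probability (w in the text)."""
    y_true = np.asarray(y_true)[group_mask]
    y_pred = np.asarray(y_prob)[group_mask] >= threshold
    n = len(y_true)
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)
```

Comparing net benefit across subgroups then amounts to calling this once per group mask at the same threshold.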

Some metrics focus on intersectional subgroup parity, aiming to capture disparities that may only surface when multiple sensitive dimensions are considered together (e.g., race–sex intersections).
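
One way such an analysis could be operationalized, sketched below with pandas under hypothetical column names (e.g., race, sex, a predicted-risk column, and a binary outcome column), is to compute a per-subgroup metric over the cross-product of attributes:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def intersectional_auroc(df: pd.DataFrame, prob_col: str, outcome_col: str, attrs: list):
    """AUROC per intersectional subgroup, e.g. attrs=['race', 'sex'] (column names are hypothetical)."""
    results = {}
    for key, sub in df.groupby(attrs):
        # Subgroups containing only one outcome class cannot yield an AUROC.
        if sub[outcome_col].nunique() < 2:
            results[key] = float('nan')
            continue
        results[key] = roc_auc_score(sub[outcome_col], sub[prob_col])
    return results
```

Small intersectional cells are common in clinical data, which is precisely where the uncertainty issues discussed in the next section matter most.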

3. Conceptual and Operational Challenges

There is no consensus definition of a fairness metric in clinical predictive AI, leading to multiple, sometimes incompatible, metrics in use (2506.17035). Key challenges include:

  • Threshold dependency: Many metrics depend heavily on selected decision thresholds, which may be arbitrary or lack clinical justification.
  • Metric fragmentation: Lack of standardized, widely accepted metrics for assessing fairness, especially for probability-based or calibration-based assessments.
  • Clinical contextualization: Most group fairness metrics are adopted from other domains and can lack direct relevance to clinical impact or patient benefit.
  • Uncertainty quantification: Confidence intervals or bootstrapped estimates for fairness metrics are rarely reported, which is problematic for small or disadvantaged subgroups (a bootstrap sketch follows this list).
  • Intersectionality: Most metrics report on single attributes; few support operationalization across multiple intersecting sensitive factors due to computational and sample size hurdles.
  • Utility focus: Most metrics summarize statistical disparity but do not indicate potential patient benefit or harm, except for metrics like subgroup net benefit.
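
On the uncertainty point above, a percentile bootstrap is one simple, commonly used way to attach confidence intervals to any of the gap metrics. The sketch assumes a metric function with signature (y_true, y_prob, group) and is illustrative rather than prescriptive:

```python
import numpy as np

def bootstrap_ci(metric_fn, y_true, y_prob, group, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a group fairness metric with signature (y_true, y_prob, group)."""
    rng = np.random.default_rng(seed)
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample patients with replacement
        try:
            stats.append(metric_fn(y_true[idx], y_prob[idx], group[idx]))
        except ValueError:
            continue  # e.g. a resample in which a subgroup contains only one outcome class
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```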

4. Trade-offs, Impossibility Results, and Reporting Practices

Mathematical relationships documented in the literature show that not all fairness metrics are simultaneously achievable, especially when outcome base rates differ across groups (2001.07864). For example:

  • Equalized odds, statistical parity, and predictive parity cannot be met simultaneously unless base rates are equal (a numeric illustration follows this list).
  • A metric reflecting calibration (probability output) may not result in equitable classifications after thresholding.
  • Practically, targeting one fairness metric can worsen others—a phenomenon repeatedly validated empirically (2007.10306, 2501.13219).
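
The first point can be made concrete with invented numbers, chosen purely for illustration: if two groups share the same TPR and FPR but differ in prevalence, their PPVs necessarily differ.

```python
# PPV as a function of error rates and prevalence:
# PPV = TPR * prev / (TPR * prev + FPR * (1 - prev))
def ppv(tpr, fpr, prevalence):
    return tpr * prevalence / (tpr * prevalence + fpr * (1 - prevalence))

# Equalized odds holds (same TPR and FPR in both groups), but base rates differ:
tpr, fpr = 0.8, 0.1
print(f"{ppv(tpr, fpr, prevalence=0.30):.2f}")  # 0.77 in the higher-prevalence group
print(f"{ppv(tpr, fpr, prevalence=0.05):.2f}")  # 0.30 in the lower-prevalence group
```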

Best practice recommendations include:

  • Report multiple metrics: Relying on a single fairness metric can hide disparities not addressed by that particular approach.
  • Context-dependent selection: Choose metrics aligned with stakeholder priorities, clinical workflows, and regulatory requirements.
  • Quantify and report uncertainty: Confidence intervals for subgroup metrics are essential, especially for underrepresented populations.
  • Justify threshold choices: Clearly report and justify thresholds used in fairness assessments.

5. Recommendations and Directions for Future Research

The literature identifies several priorities and gaps for advancing fairness metric development and reporting in clinical AI (2506.17035):

  • Probability-based metrics should be prioritized for model development and validation, to avoid threshold-induced artifacts.
  • Intersectional fairness: Extend measurement and reporting beyond single attributes, incorporating methodologies for robust intersectional subgroup analysis.
  • Uncertainty quantification: Incorporate confidence intervals on all group and intersectional metrics, particularly where subgroups are small.
  • Clinical utility metrics: Develop and report metrics tied directly to patient benefit (e.g., subgroup net benefit) to complement statistical parity measures.
  • Empirical validation: Test and calibrate new fairness metrics on real clinical data, ensuring their applicability for decision-support.
  • Stakeholder engagement and contextualization: Align metric selection with ethical, clinical, and social priorities documented through participatory design and governance frameworks.

6. Summary Table: Example Fairness Metrics

| Metric | Formula | Intended Role |
|---|---|---|
| AUROC Parity | $\Delta_{\text{AUROC}} = \lvert \text{AUROC}_{A=0} - \text{AUROC}_{A=1} \rvert$ | Discrimination parity |
| Calibration-in-the-Large | $\Delta_{\text{CITL}} = \lvert (\text{Mean}(p_{A=0}) - \text{Mean}(Y_{A=0})) - (\text{Mean}(p_{A=1}) - \text{Mean}(Y_{A=1})) \rvert$ | Calibration parity |
| Equal Opportunity Gap | $\Delta_{\text{TPR}} = \lvert \text{TPR}_{A=0} - \text{TPR}_{A=1} \rvert$ | Sensitivity parity across groups |
| Statistical Parity | $\Delta_{\text{SP}} = \lvert P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1) \rvert$ | Output rate parity |
| Subgroup Net Benefit | $NB_g = \frac{\text{TP}_g}{n_g} - \frac{\text{FP}_g}{n_g} \cdot \frac{w}{1-w}$ | Clinical utility parity |

7. Conclusions and Outlook

The proliferation of fairness metrics for clinical predictive AI reflects the complexity of quantifying and achieving equity in model-driven decision-making. The current landscape is fragmented, over-reliant on threshold-dependent and group-based statistical parities, and underdeveloped regarding calibration, uncertainty, intersectionality, and clinical utility (2506.17035). A move towards probability-based, uncertainty-aware, and utility-grounded fairness metrics—implemented and reported transparently in alignment with clinical and ethical realities—remains a critical priority for future research and deployment in healthcare AI.