Fairness Metrics for Clinical Predictive AI

Updated 1 July 2025
  • Fairness metrics for clinical predictive AI quantify potential discrimination in models against groups based on sensitive attributes like race, sex, or age.
  • The field features a diverse set of metrics, categorized by performance dependency, model output level, and base performance metric, each with conceptual and operational challenges.
  • Choosing appropriate metrics is complex; best practices recommend reporting multiple metrics, quantifying uncertainty, and prioritizing probability-based and utility-focused assessments for better clinical relevance.

Fairness metrics for clinical predictive AI quantify whether predictive models introduce or perpetuate discrimination against individuals or groups defined by sensitive attributes such as race, sex, age, and diagnosis. These metrics serve as core tools for assessing, reporting, and mitigating inequities in clinical AI applications. The literature reveals a diverse and fragmented landscape of fairness metrics, with varying levels of performance dependency, clinical validation, and alignment with real-world utility (2506.17035). A critical appraisal demonstrates the necessity for careful metric selection, uncertainty quantification, intersectional analysis, and clinical contextualization.

1. Classification and Types of Fairness Metrics

Fairness metrics in clinical predictive AI are categorized by three main dimensions: performance dependency, model output level, and base performance metric (2506.17035).

(a) Performance Dependency

  • Performance-independent (unsupervised) metrics: Assess parity in model outputs across groups without using outcome labels. Examples include mean score parity and statistical parity, focusing on whether the proportion of positive predictions is equal across demographic groups.
  • Performance-dependent (supervised) metrics: Evaluate disparities relative to true outcome labels, comparing predictive performance across groups. These metrics typically reflect disparities present in labeled data and are more common in clinical literature.

(b) Model Output Level

  • Probability-based metrics: Utilize estimated risk probabilities pre-threshold, allowing for nuanced assessments across the risk spectrum (e.g., AUROC parity, calibration parity).
  • Threshold-dependent metrics: Operate after applying a decision threshold, evaluating fairness in hard classifications (e.g., equal opportunity difference, predictive parity).

(c) Base Performance Metric

Metrics may reflect differences in:

  • Discrimination: e.g., AUROC, partial AUC
  • Calibration: e.g., calibration-in-the-large, Expected Calibration Error (ECE)
  • Overall performance: e.g., Brier Score, log-loss
  • Partial metrics: e.g., True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV)
  • Summary metrics: e.g., accuracy gap, balanced accuracy gap, F1 parity
  • Clinical utility: e.g., subgroup net benefit

2. Notable Metrics and Healthcare-Specific Extensions

Of the sixty-two metrics identified, eighteen were developed explicitly for healthcare (2506.17035). Key examples include:

  • AUROC Parity: Assesses discrimination differences by checking if area under the ROC curve is similar across groups.

$$\Delta_{\text{AUROC}} = \left| \text{AUROC}_{A=0} - \text{AUROC}_{A=1} \right|$$
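
A minimal sketch of this gap using scikit-learn's roc_auc_score; the two-group encoding (A ∈ {0, 1}) and all variable names below are illustrative assumptions, not specifications from the source.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_parity_gap(y_true, y_prob, group):
    """Absolute AUROC difference between groups A=0 and A=1 (illustrative helper)."""
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    auroc_0 = roc_auc_score(y_true[group == 0], y_prob[group == 0])
    auroc_1 = roc_auc_score(y_true[group == 1], y_prob[group == 1])
    return abs(auroc_0 - auroc_1)
```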

  • Calibration-in-the-large Parity: Evaluates whether mean predicted risk aligns with observed event rates in each group.

$$\Delta_{\text{CITL}} = \left| \left(\text{Mean}(p_{A=0}) - \text{Mean}(Y_{A=0})\right) - \left(\text{Mean}(p_{A=1}) - \text{Mean}(Y_{A=1})\right) \right|$$
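
A corresponding sketch for calibration-in-the-large parity, under the same assumed two-group encoding:

```python
import numpy as np

def citl_parity_gap(y_true, y_prob, group):
    """Gap in calibration-in-the-large (mean predicted risk minus observed event rate) between groups."""
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    citl_0 = y_prob[group == 0].mean() - y_true[group == 0].mean()
    citl_1 = y_prob[group == 1].mean() - y_true[group == 1].mean()
    return abs(citl_0 - citl_1)
```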

  • Equal Opportunity Difference: Measures differences in sensitivity.

$$\Delta_{\text{TPR}} = \left| \text{TPR}_{A=0} - \text{TPR}_{A=1} \right|$$
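
Because this is a threshold-dependent metric, the sketch below assumes hard predictions y_pred obtained after applying a decision threshold:

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute TPR (sensitivity) difference between groups, given binary predictions y_pred."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def tpr(g):
        # Mean of binary predictions among true positives equals TP / (TP + FN).
        return y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))
```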

  • Statistical Parity Difference:

$$\Delta_{\text{SP}} = \left| P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1) \right|$$
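
Statistical parity is performance-independent, so a sketch needs only the hard predictions and group membership, not outcome labels (again with illustrative names):

```python
import numpy as np

def statistical_parity_gap(y_pred, group):
    """Absolute difference in positive prediction rates between groups; no outcome labels required."""
    y_pred, group = map(np.asarray, (y_pred, group))
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())
```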

  • Subgroup Net Benefit: Calculates net benefit at the group level, integrating prevalence difference and weighting of harms/benefits.

$$NB_g = \frac{\text{TP}_g}{n_g} - \frac{\text{FP}_g}{n_g} \cdot \frac{w}{1-w}$$

where $w$ is the threshold probability, so that $w/(1-w)$ is the odds at the threshold.
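
A minimal sketch of subgroup net benefit under that reading of $w$; the function name and argument layout are illustrative assumptions:

```python
import numpy as np

def subgroup_net_benefit(y_true, y_prob, group_mask, threshold):
    """Net benefit within one subgroup at a given threshold probability (w in the text)."""
    y_true = np.asarray(y_true)[group_mask]
    y_pred = np.asarray(y_prob)[group_mask] >= threshold
    n = len(y_true)
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)
```

Comparing net benefit across subgroups then amounts to calling this once per group mask at the same threshold.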

Some metrics focus on intersectional subgroup parity, aiming to capture disparities that may only surface when multiple sensitive dimensions are considered together (e.g., race–sex intersections).
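
One way such an analysis could be operationalized, sketched below with pandas under hypothetical column names (e.g., race, sex, a predicted-risk column, and a binary outcome column), is to compute a per-subgroup metric over the cross-product of attributes:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def intersectional_auroc(df: pd.DataFrame, prob_col: str, outcome_col: str, attrs: list):
    """AUROC per intersectional subgroup, e.g. attrs=['race', 'sex'] (column names are hypothetical)."""
    results = {}
    for key, sub in df.groupby(attrs):
        # Subgroups containing only one outcome class cannot yield an AUROC.
        if sub[outcome_col].nunique() < 2:
            results[key] = float('nan')
            continue
        results[key] = roc_auc_score(sub[outcome_col], sub[prob_col])
    return results
```

Small intersectional cells are common in clinical data, which is precisely where the uncertainty issues discussed in the next section matter most.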

3. Conceptual and Operational Challenges

There is no consensus definition of a fairness metric in clinical predictive AI, leading to multiple, sometimes incompatible, metrics in use (2506.17035). Key challenges include:

  • Threshold dependency: Many metrics depend heavily on selected decision thresholds, which may be arbitrary or lack clinical justification.
  • Metric fragmentation: Lack of standardized, widely accepted metrics for assessing fairness, especially for probability-based or calibration-based assessments.
  • Clinical contextualization: Most group fairness metrics are adopted from other domains and can lack direct relevance to clinical impact or patient benefit.
  • Uncertainty quantification: Confidence intervals or bootstrapped estimates for fairness metrics are rarely reported, which is problematic for small or disadvantaged subgroups (a bootstrap sketch follows this list).
  • Intersectionality: Most metrics report on single attributes; few support operationalization across multiple intersecting sensitive factors due to computational and sample size hurdles.
  • Utility focus: Most metrics summarize statistical disparity but do not indicate potential patient benefit or harm, except for metrics like subgroup net benefit.
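
On the uncertainty point above, a percentile bootstrap is one simple, commonly used way to attach confidence intervals to any of the gap metrics. The sketch assumes a metric function with signature (y_true, y_prob, group) and is illustrative rather than prescriptive:

```python
import numpy as np

def bootstrap_ci(metric_fn, y_true, y_prob, group, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a group fairness metric with signature (y_true, y_prob, group)."""
    rng = np.random.default_rng(seed)
    y_true, y_prob, group = map(np.asarray, (y_true, y_prob, group))
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample patients with replacement
        try:
            stats.append(metric_fn(y_true[idx], y_prob[idx], group[idx]))
        except ValueError:
            continue  # e.g. a resample in which a subgroup contains only one outcome class
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```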

4. Trade-offs, Impossibility Results, and Reporting Practices

Mathematical relationships documented in the literature show that not all fairness metrics are simultaneously achievable, especially when outcome base rates differ across groups (2001.07864). For example:

  • Equalized odds, statistical parity, and predictive parity cannot be met simultaneously unless base rates are equal (a numeric illustration follows this list).
  • A metric reflecting calibration (probability output) may not result in equitable classifications after thresholding.
  • Practically, targeting one fairness metric can worsen others—a phenomenon repeatedly validated empirically (2007.10306, 2501.13219).
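
The first point can be made concrete with invented numbers, chosen purely for illustration: if two groups share the same TPR and FPR but differ in prevalence, their PPVs necessarily differ.

```python
# PPV as a function of error rates and prevalence:
# PPV = TPR * prev / (TPR * prev + FPR * (1 - prev))
def ppv(tpr, fpr, prevalence):
    return tpr * prevalence / (tpr * prevalence + fpr * (1 - prevalence))

# Equalized odds holds (same TPR and FPR in both groups), but base rates differ:
tpr, fpr = 0.8, 0.1
print(f"{ppv(tpr, fpr, prevalence=0.30):.2f}")  # 0.77 in the higher-prevalence group
print(f"{ppv(tpr, fpr, prevalence=0.05):.2f}")  # 0.30 in the lower-prevalence group
```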

Best practice recommendations include:

  • Report multiple metrics: Relying on a single fairness metric can hide disparities not addressed by that particular approach.
  • Context-dependent selection: Choose metrics aligned with stakeholder priorities, clinical workflows, and regulatory requirements.
  • Quantify and report uncertainty: Confidence intervals for subgroup metrics are essential, especially for underrepresented populations.
  • Justify threshold choices: Clearly report and justify thresholds used in fairness assessments.

5. Recommendations and Directions for Future Research

The literature identifies several priorities and gaps for advancing fairness metric development and reporting in clinical AI (2506.17035):

  • Probability-based metrics should be prioritized for model development and validation, to avoid threshold-induced artifacts.
  • Intersectional fairness: Extend measurement and reporting beyond single attributes, incorporating methodologies for robust intersectional subgroup analysis.
  • Uncertainty quantification: Incorporate confidence intervals on all group and intersectional metrics, particularly where subgroups are small.
  • Clinical utility metrics: Develop and report metrics tied directly to patient benefit (e.g., subgroup net benefit) to complement statistical parity measures.
  • Empirical validation: Test and calibrate new fairness metrics on real clinical data, ensuring their applicability for decision-support.
  • Stakeholder engagement and contextualization: Align metric selection with ethical, clinical, and social priorities documented through participatory design and governance frameworks.

6. Summary Table: Example Fairness Metrics

| Metric | Formula | Intended Role |
|---|---|---|
| AUROC Parity | $\Delta_{\text{AUROC}} = \lvert \text{AUROC}_{A=0} - \text{AUROC}_{A=1} \rvert$ | Discrimination parity |
| Calibration-in-the-Large | $\Delta_{\text{CITL}} = \lvert (\text{Mean}(p_{A=0}) - \text{Mean}(Y_{A=0})) - (\text{Mean}(p_{A=1}) - \text{Mean}(Y_{A=1})) \rvert$ | Calibration parity |
| Equal Opportunity Gap | $\Delta_{\text{TPR}} = \lvert \text{TPR}_{A=0} - \text{TPR}_{A=1} \rvert$ | Sensitivity parity across groups |
| Statistical Parity | $\Delta_{\text{SP}} = \lvert P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1) \rvert$ | Output rate parity |
| Subgroup Net Benefit | $NB_g = \frac{\text{TP}_g}{n_g} - \frac{\text{FP}_g}{n_g} \cdot \frac{w}{1-w}$ | Clinical utility parity |

7. Conclusions and Outlook

The proliferation of fairness metrics for clinical predictive AI reflects the complexity of quantifying and achieving equity in model-driven decision-making. The current landscape is fragmented, over-reliant on threshold-dependent and group-based statistical parities, and underdeveloped regarding calibration, uncertainty, intersectionality, and clinical utility (2506.17035). A move towards probability-based, uncertainty-aware, and utility-grounded fairness metrics—implemented and reported transparently in alignment with clinical and ethical realities—remains a critical priority for future research and deployment in healthcare AI.