Explaining medical AI performance disparities across sites with confounder Shapley value analysis

Published 12 Nov 2021 in cs.LG and cs.AI | (2111.08168v1)

Abstract: Medical AI algorithms can often experience degraded performance when evaluated on previously unseen sites. Addressing cross-site performance disparities is key to ensuring that AI is equitable and effective when deployed on diverse patient populations. Multi-site evaluations are key to diagnosing such disparities as they can test algorithms across a broader range of potential biases such as patient demographics, equipment types, and technical parameters. However, such tests do not explain why the model performs worse. Our framework provides a method for quantifying the marginal and cumulative effect of each type of bias on the overall performance difference when a model is evaluated on external data. We demonstrate its usefulness in a case study of a deep learning model trained to detect the presence of pneumothorax, where our framework can help explain up to 60% of the discrepancy in performance across different sites with known biases like disease comorbidities and imaging parameters.

Authors (3)
Citations (7)

Summary

  • The paper demonstrates that a confounder Shapley value framework can quantify and explain AI performance disparities across different medical sites.
  • The methodology decomposes the cross-site performance gap into contributions from factors such as imaging view and comorbidities; on average, these factors explain 27% of the gap.
  • The findings stress the need for high-quality metadata and improved training practices to reduce bias and guide regulatory measures in medical AI deployment.

Explaining Medical AI Performance Disparities with Confounder Shapley Value Analysis

Introduction

The proliferation of AI in healthcare has brought both advancements and challenges. A prominent issue afflicting medical AI is the degradation of algorithmic performance when models are applied to sites with data not encountered during training. Such cross-site performance disparities demand effective diagnostic tools to ensure equitable and effective AI deployment across diverse patient populations. The paper "Explaining medical AI performance disparities across sites with confounder Shapley value analysis" addresses this by introducing a framework to quantify and explain biases in model performance across different sites.

Methodology

The proposed framework utilizes a confounder Shapley value analysis to decompose the performance disparities of AI models when applied across different healthcare sites. Shapley values provide a systematic approach to assessing the individual and cumulative contributions of varying site-specific factors, such as demographic differences and imaging parameters, which might bias the performance of AI algorithms.

The authors formalize the notion of site performance and disparity in AI models and model the confounding factors as measurable random variables. The paper's main contribution is a method that uses Shapley values to quantify how much each confounder contributes to the observed performance discrepancy. The framework accommodates multiple confounders and relies on the efficiency, null-player, and symmetry properties of Shapley values to attribute the gap among them.
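For reference, the standard game-theoretic definition of the Shapley value is reproduced below. The notation is an assumption made for this summary and is not taken from the paper: N is the set of confounders, and v(S) is taken to mean the change in performance observed when the confounders in S are matched between sites.

```latex
% Shapley value of confounder i (standard definition; notation assumed)
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
    \,\bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]

% Efficiency: the attributions add up to the total effect of matching all confounders
\sum_{i \in N} \phi_i(v) = v(N) - v(\emptyset)

% Null player: a confounder that never changes v receives zero attribution
v(S \cup \{i\}) = v(S) \ \ \forall S \subseteq N \setminus \{i\}
    \;\Longrightarrow\; \phi_i(v) = 0
```

Under this reading, efficiency is what allows the individual attributions to be summed and compared against the total cross-site gap.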

Case Study: Pneumothorax Detection

The efficacy of the proposed framework is illustrated via a case study on detecting pneumothorax using X-ray images. Three publicly available datasets from NIH, SHC, and BID served as the basis of this study. DenseNet-121 models were trained and evaluated across these datasets to detect pneumothorax, a condition significant for its clinical urgency and varied presentation.

On average, the framework explained 27% of the cross-site performance discrepancy, and up to 60% in some cases. Image view and comorbidities emerged as the primary contributors, which highlights the importance of these factors during training. Unexplained disparities persisted, particularly between SHC and BID, and were attributed to missing metadata.

Implications and Future Directions

This research holds significant implications for both AI developers and regulators. By diagnosing the causes of performance disparities, stakeholders can prioritize data that mitigate bias, refine training practices, and guide the regulatory scrutiny of AI models. The paper also emphasizes the necessity for high-quality, comprehensive metadata collection, supporting a broader understanding of bias origins and magnitude.

The implications for future AI deployment in medicine are substantial. As AI systems become more pervasive, the intricacies of cross-site performance disparities necessitate a concerted effort towards transparency and robustness in model evaluation. Furthermore, the methodology outlined could be adapted, with modifications, to various medical imaging tasks beyond pneumothorax detection.

Conclusion

The integration of Shapley value analysis to explain medical AI performance disparities represents a notable advancement in bias diagnosis. While challenges remain in metadata collection and inherent noise, the framework provides a rigorous and principled approach to identifying and addressing bias in AI models deployed across diverse clinical settings. This work serves as a foundational step towards achieving more equitable and generalizable medical AI solutions.

Explain it Like I'm 14

Explaining the paper: “Explaining medical AI performance disparities across sites with confounder Shapley value analysis”

Overview: What is this paper about?

This paper looks at why medical AI systems often work well at one hospital but worse at another. The authors create a way to “break down” the performance drop and explain how much each difference between hospitals—like patient ages, scan types, or other diseases—contributes to that drop. They test their idea on AI that detects pneumothorax (a collapsed lung) in chest X-rays.

What were the researchers trying to find out?

The main questions are:

  • Why do AI models for healthcare perform differently at different hospitals?
  • Which specific differences between hospitals cause the performance drop?
  • How much does each factor (like image view, patient age, or other illnesses) contribute to the difference?
  • Can we use this information to make AI fairer and more reliable across different places?

How did they study it? (Simple explanation of the method)

Think of an AI model like a student who learned from one school (Hospital A) and then takes a test at another school (Hospital B). The student might do worse because the tests are different in various ways. The researchers wanted to know which differences matter most.

They used a fair-division idea called “Shapley values” from game theory. Here’s an everyday analogy:

  • Imagine a team wins a game. You want to fairly split the credit among players. You look at how much the team scores with and without each player in different lineups. A Shapley value is a fair way to assign each player their share of the credit.
  • In this paper, the “players” are hospital differences (called “site factors”), like:
    • Patient age
    • Sex
    • X-ray view (front, side, etc.)
    • Image size/shape
    • Comorbidities (other illnesses seen on the X-ray, like enlarged heart or atelectasis)
  • The “score” is how well the model performs (measured by AUC, a number from 0 to 1 where higher is better).
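To make the "score" concrete, here is a minimal sketch of how AUC can be computed from labels and model scores. It uses the pairwise interpretation of AUC (the chance that a randomly chosen positive case is scored above a randomly chosen negative case). The function name and the toy numbers are illustrative assumptions, not the paper's code or data.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 1 = pneumothorax present, 0 = absent.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```

This quadratic-time version is only meant for intuition; a real evaluation would normally use a library implementation.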

Here’s what they did, step by step:

  • Train an AI model on chest X-rays from one hospital.
  • Test it on the same hospital’s data (best-case).
  • Test it on a different hospital’s data (often worse).
  • Measure the drop in performance between “same hospital” and “new hospital.”
  • Then, for one factor at a time, make the new hospital’s data “look like” the original hospital on that factor. For example, if Hospital A uses more front-view X-rays and Hospital B uses more side-view X-rays, they re-balance Hospital B’s test set to match A’s proportion of image views.
  • Test the model again and see how much the performance improves. That improvement is the amount of the drop explained by that factor.
  • Because multiple factors can interact, they repeat this process in many orders and combinations, then use Shapley values to fairly split the total “blame” among all factors.
  • They repeat the sampling many times to make the estimate stable.
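The "make the new hospital look like the original hospital" step can be sketched as weighted resampling. This is a rough illustration under stated assumptions, not the authors' implementation: the helper names (`proportions`, `matching_weights`, `resample_to_match`), the `view` field, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def proportions(values):
    """Share of each category, e.g. {'frontal': 0.8, 'lateral': 0.2}."""
    counts = Counter(values)
    return {k: c / len(values) for k, c in counts.items()}


def matching_weights(source_values, target_values):
    """Per-example weights that make the target's category mix equal the source's.

    Categories missing from the source get weight 0. Categories missing from
    the target cannot be recovered by reweighting at all.
    """
    p_src = proportions(source_values)
    p_tgt = proportions(target_values)
    return [p_src.get(v, 0.0) / p_tgt[v] for v in target_values]


def resample_to_match(target_rows, factor, source_values, n, seed=0):
    """Draw n rows (with replacement) from the target site using the matching weights."""
    rng = random.Random(seed)
    weights = matching_weights(source_values, [row[factor] for row in target_rows])
    return rng.choices(target_rows, weights=weights, k=n)


# Hypothetical toy data: Hospital A is 80% frontal views, Hospital B is 50%.
site_a_views = ["frontal"] * 8 + ["lateral"] * 2
site_b_rows = ([{"id": i, "view": "frontal"} for i in range(5)]
               + [{"id": i, "view": "lateral"} for i in range(5, 10)])

matched = resample_to_match(site_b_rows, "view", site_a_views, n=10_000)
print(proportions([row["view"] for row in matched]))  # frontal share is now close to 0.8
```

After resampling, the model would be scored again on `matched`, and the change in AUC relative to the unmatched test set is the part of the gap associated with that factor. Repeating the draw with different seeds gives a sense of how stable the estimate is.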

In short: they “match” the data factor-by-factor, observe how the model’s score changes, and use a fair-sharing method (Shapley values) to assign how much each factor contributes to the performance gap.
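The fair-sharing step can be illustrated with a small exact Shapley computation. This sketch enumerates every subset of factors, which is only practical for a handful of factors and is a simplification chosen for clarity, not necessarily how the paper estimates the values. The `RECOVERED_AUC` table and the total gap of 0.10 are made-up numbers for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: weighted average of each player's marginal contribution."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical numbers: AUC recovered on the new hospital when the listed
# factors are matched to the original hospital. Not taken from the paper.
RECOVERED_AUC = {
    frozenset(): 0.00,
    frozenset({"view"}): 0.03,
    frozenset({"comorbidity"}): 0.02,
    frozenset({"age"}): 0.00,
    frozenset({"view", "comorbidity"}): 0.04,
    frozenset({"view", "age"}): 0.03,
    frozenset({"comorbidity", "age"}): 0.02,
    frozenset({"view", "comorbidity", "age"}): 0.04,
}


def value(subset):
    return RECOVERED_AUC[frozenset(subset)]


phi = shapley_values(["view", "comorbidity", "age"], value)
total_gap = 0.10  # assumed AUC drop between "same hospital" and "new hospital"
explained_fraction = sum(phi.values()) / total_gap

for factor, share in phi.items():
    print(f"{factor}: {share:.3f}")
print(f"explained fraction of the gap: {explained_fraction:.0%}")
```

In this toy table, matching age never changes the score, so age receives zero credit, while view and comorbidity split the recovered 0.04 between them. The per-factor shares add up to the total recovered AUC, which is why they can be reported as a fraction of the overall gap.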

What did they find, and why is it important?

They tested their method on three large US chest X-ray datasets from different hospitals: NIH (Bethesda, MD), SHC/Stanford (Palo Alto, CA), and BIDMC (Boston, MA). They trained a deep learning model (DenseNet-121) to detect pneumothorax and evaluated it across hospitals.

Key findings:

  • Cross-hospital performance drops are real and can be significant.
  • Their framework could explain, on average, about 27% of the performance difference. In some cases, it explained much more—up to about 60%.
  • In two comparisons (SHC on BID and BID on SHC), their method explained very little. The likely reason: important metadata (like race, socioeconomic status, detailed scanner settings) wasn’t available in both datasets, so the method couldn’t analyze those factors.
  • The two biggest contributors to performance gaps were:
    • X-ray image view (AP/PA/lateral). Different views show the chest in different ways, and many AI models don’t train separately by view.
    • Comorbidities (other conditions in the image). Some illnesses can confuse the model. For example, atelectasis (part of the lung collapses) and cardiomegaly (enlarged heart) were particularly important. These can make images look different and lead to false positives or confusion.

Why this matters:

  • Instead of just saying “the model works worse elsewhere,” this method says “here’s how much worse, and here’s which differences are causing it.” That’s actionable.
  • It helps developers focus on the most important fixes—for example, training separate models by image view or collecting more images with certain comorbidities.
  • It can guide hospitals and researchers on what metadata to collect and share, making future models more reliable and fair.

What’s the big picture impact?

  • For clinicians and users: You can understand when and why an AI might make more mistakes at your hospital.
  • For developers: You can target data collection and model design to the factors that cause the biggest problems, instead of guessing.
  • For dataset creators and regulators: You can encourage or require collection of key metadata that makes fair evaluation possible and highlight when multi-site testing is truly diverse.

Limitations to keep in mind:

  • Missing or sparse metadata limits what can be explained. In some datasets, important details like race or detailed imaging parameters weren’t available.
  • Labels can be noisy (for example, different hospitals use different methods to label images), which can affect results.
  • The method matches factors one by one (not all at once), which is practical but may miss complex interactions.
  • More studies are needed across other diseases and image types.

Takeaway

Medical AI often performs differently across hospitals because hospitals differ in patients, imaging styles, and other conditions. This paper offers a fair and practical way to measure how much each difference contributes to the performance gap. The results show that factors like X-ray view and comorbidities are major drivers. By identifying and quantifying these factors, the approach helps make medical AI safer, fairer, and more reliable in the real world.
