
Multi-Source Conformal Inference Under Distribution Shift (2405.09331v1)

Published 15 May 2024 in stat.ME and stat.ML

Abstract: Recent years have experienced increasing utilization of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification from machine learning predictions, make it challenging to achieve valid inferences in multi-source environments. In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Hospital length of stay prediction intervals for pediatric patients undergoing a high-risk cardiac surgical procedure between 2016-2022 in the U.S. illustrate the utility of our methodology.


Summary

  • The paper introduces a conformal inference method that achieves nearly nominal marginal coverage despite distribution shifts by adaptively weighting multiple data sources.
  • The method leverages influence functions to accurately estimate quantiles, yielding shorter and more reliable prediction intervals compared to single-source approaches.
  • It demonstrates strong practical relevance, especially in healthcare, by enabling precise, privacy-aware predictions for outcomes such as hospital length of stay.


Introduction

Data scientists often need to predict outcomes from data collected at multiple sources with differing characteristics, such as different hospitals contributing healthcare data. The difficulty is that the covariate and outcome distributions can vary from source to source, a phenomenon known as distribution shift. In addition, sharing individual-level data among sources such as hospitals is often restricted by privacy concerns. This paper presents methods for constructing valid prediction intervals for a target population that draw on data from multiple sources while preserving privacy and accounting for distribution shift.

Key Concepts and Methods

Conformal Inference

Conformal inference is a statistical framework for constructing prediction intervals with a guaranteed coverage probability: a 90% interval, for example, is built to contain the future outcome at least 90% of the time. These methods are model-agnostic and can be applied to any predictive model. Standard conformal methods are particularly useful with finite sample sizes because, when the data are exchangeable, their coverage guarantee doesn’t rely on asymptotic approximations.
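As a minimal sketch of the basic idea, the snippet below implements standard single-source split conformal prediction with absolute-residual scores. It is not the paper's multi-source procedure; the function name `split_conformal_interval`, the linear base model, and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np

def split_conformal_interval(x_train, y_train, x_calib, y_calib, x_new, alpha=0.1):
    """Split conformal interval using a least-squares line as the base model (illustrative)."""
    # Fit any predictive model on the training split (here: a simple 1-D linear fit).
    slope, intercept = np.polyfit(x_train, y_train, deg=1)

    def predict(x):
        return slope * np.asarray(x) + intercept

    # Conformal scores on the held-out calibration split: absolute residuals.
    scores = np.abs(np.asarray(y_calib) - predict(x_calib))

    # Finite-sample-corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest score.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.inf if k > n else np.sort(scores)[k - 1]

    center = predict(x_new)
    return center - q_hat, center + q_hat

# Usage on simulated (hypothetical) data.
rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(0, 10, n)
    return x, 2.0 * x + 1.0 + rng.normal(0, 1, n)

x_tr, y_tr = simulate(500)
x_ca, y_ca = simulate(500)
x_te, y_te = simulate(2000)
lower, upper = split_conformal_interval(x_tr, y_tr, x_ca, y_ca, x_te, alpha=0.1)
print("empirical coverage:", np.mean((y_te >= lower) & (y_te <= upper)))
```

The calibration split is kept separate from the training split so that the calibration scores and the score of a new point are exchangeable, which is what the coverage guarantee rests on.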

Influence Functions

The paper derives the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations. Estimators built on these influence functions can incorporate machine learning algorithms when estimating nuisance functions while still converging to the nominal coverage probability at parametric rates, even when the data sources are potentially biased.
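For orientation, the textbook influence function of a single quantile in the simplest setting (one fully observed sample, no distribution shift) can be written as below. This is not the efficient influence function derived in the paper, which additionally handles unobserved outcomes and multiple sources; the symbols $q_\tau$ (the $\tau$-quantile), $f$ (the density of $Y$), and $\varphi_\tau$ are notation assumed here.

```latex
\[
\varphi_\tau(y) = \frac{\tau - \mathbf{1}\{y \le q_\tau\}}{f(q_\tau)},
\qquad
\hat{q}_\tau - q_\tau = \frac{1}{n}\sum_{i=1}^{n} \varphi_\tau(Y_i) + o_P\!\left(n^{-1/2}\right).
\]
```

The second expression says that, to first order, the sample quantile behaves like an average of influence-function values, which is why influence functions are a natural route to estimators with parametric ($n^{-1/2}$) convergence rates.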

Handling Distribution Shifts

When working with multiple data sources, assuming that the conditional outcome distribution is the same across all sources can be too strong. For cases where this assumption is violated, the paper proposes a data-adaptive strategy that upweights informative data sources for efficiency gain and downweights non-informative data sources for bias reduction.
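The general idea of downweighting sources that disagree with the target can be sketched as follows. The exponential weighting rule, the `penalty` parameter, and both function names are hypothetical choices made for this illustration; they are not the paper's data-adaptive estimator.

```python
import numpy as np

def adaptive_source_weights(target_estimate, source_estimates, penalty=10.0):
    """Hypothetical rule: a source's weight shrinks as its estimate departs from the target's.

    Returns weights for [target, source_1, ..., source_K] that sum to 1.
    """
    source_estimates = np.asarray(source_estimates, dtype=float)
    discrepancy = source_estimates - target_estimate
    # The target site gets raw weight 1; each source gets exp(-penalty * discrepancy^2).
    raw = np.concatenate(([1.0], np.exp(-penalty * discrepancy**2)))
    return raw / raw.sum()

def combine_estimates(target_estimate, source_estimates, weights):
    """Weighted average of the target-site and source-site estimates."""
    estimates = np.concatenate(([target_estimate], np.asarray(source_estimates, dtype=float)))
    return float(np.dot(weights, estimates))

# Usage with made-up site-level quantile estimates: the third source is far from the target.
weights = adaptive_source_weights(1.0, [1.02, 1.0, 3.0])
combined = combine_estimates(1.0, [1.02, 1.0, 3.0], weights)
print(weights, combined)
```

Only site-level summary estimates enter this calculation, so individual-level data need not leave each site, which is in keeping with the privacy constraint described above.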

Numerical Results

The numerical results presented in the paper show how the proposed methods perform across various scenarios. Here are some key takeaways:

  • Marginal Coverage: The method achieves coverage close to the nominal level (e.g., 90% intervals contain the true outcome 90% of the time), with tighter and less variable intervals compared to methods that use only the target site's data.
  • Efficiency Gain: By leveraging data from multiple sources, the method produces shorter prediction intervals, which is practically significant for applications needing precise and reliable predictions.
  • Robustness: The approach is robust to the choice of conformal score (the function that measures how far an observed outcome falls from the model's prediction) and to the data-generating mechanism, including scenarios where existing methods might underperform.

Practical Implications and Future Directions

Practical Applications

The methods presented in the paper have important applications, particularly in healthcare, where they can support reliable predictive analytics across hospitals without compromising patient privacy. In the paper's healthcare example, prediction intervals for hospital length of stay (LOS) among pediatric patients undergoing a high-risk cardiac surgical procedure in the U.S. between 2016 and 2022 illustrate the methodology; such intervals can inform resource allocation and patient care planning.

Theoretical Implications

From a theoretical standpoint, the paper augments the conformal inference framework by integrating it with influence function-based estimation and adaptive weighting. This combination allows for robust, efficient, and privacy-aware predictions in varied multi-source data environments.

Future Developments

The paper opens several avenues for further research:

  • Covariate-Adaptive Weights: Future work could explore dynamically adapting the weights based on covariates instead of global weights, potentially leading to more personalized and precise prediction intervals.
  • Sensitivity Analysis: Another interesting direction would be developing methods to perform sensitivity analysis when the common conditional outcome distribution (CCOD) assumption is violated. This would provide a framework to understand the robustness of the proposed method under different assumption violations.

Conclusion

This paper advances prediction under distribution shift, delivering robust prediction intervals by leveraging multiple data sources while respecting privacy constraints. It connects theory with practical needs in data-sensitive fields such as healthcare. Combining influence functions with conformal inference makes the approach both theoretically grounded and practically useful.