Multi-Source Conformal Inference Under Distribution Shift (2405.09331v1)
Abstract: Recent years have seen increasing use of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification for machine learning predictions, make it challenging to achieve valid inferences in multi-source environments. In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy that upweights informative data sources for efficiency gain and downweights non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Prediction intervals for hospital length of stay among pediatric patients who underwent a high-risk cardiac surgical procedure in the U.S. between 2016 and 2022 illustrate the utility of our methodology.