A Semiparametric Approach for Robust and Efficient Learning with Biobank Data (2404.01191v1)
Abstract: With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing their potential is to accurately classify patient phenotypes. Existing approaches to this goal rely on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may themselves be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias in both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and it can also produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rates on risk modeling. Our method outperforms existing methods in extensive simulation studies, as well as in a real-world application to phenotyping and genetic risk modeling of type II diabetes.
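The abstract's claim that ignoring noise in surrogate outcomes biases risk modeling can be illustrated with a small simulation. The sketch below is not the paper's method; it only shows the problem the method targets. It assumes a single biomarker `x`, a true outcome `y` following a logistic model, and a surrogate `s` that misclassifies `y` with hypothetical sensitivity 0.8 and specificity 0.9. All names and parameter values are illustrative assumptions.

```python
import math
import random


def fit_logistic(x, y, iters=15):
    """Fit P(y=1 | x) = expit(a + b*x) by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def simulate(n=10000, beta0=-1.0, beta1=1.0, sens=0.8, spec=0.9, seed=0):
    """Draw a biomarker x, a true outcome y, and a noisy surrogate s.

    sens and spec are assumed (hypothetical) misclassification rates.
    """
    rng = random.Random(seed)
    x, y, s = [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * xi)))
        yi = 1 if rng.random() < p else 0
        # The surrogate is positive w.p. sens for cases, 1 - spec for controls.
        q = sens if yi == 1 else 1.0 - spec
        si = 1 if rng.random() < q else 0
        x.append(xi)
        y.append(yi)
        s.append(si)
    return x, y, s


x, y, s = simulate()
_, slope_true = fit_logistic(x, y)    # uses the (normally unobserved) truth
_, slope_naive = fit_logistic(x, s)   # treats the surrogate as the truth
print(f"slope with true outcome:   {slope_true:.3f}")
print(f"slope with noisy surrogate: {slope_naive:.3f}")
```

In this setup the naive fit on the surrogate tends to shrink the biomarker effect toward zero, which is the kind of estimation bias that motivates modeling the noise explicitly rather than treating surrogates as true labels.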