MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records (2312.13454v3)
Abstract: Survival models can help medical practitioners evaluate the prognostic importance of clinical variables for patient outcomes such as mortality or hospital readmission, and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high-dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG that simultaneously integrates heterogeneous EHR data and models survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with the Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data, consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes, and the MIMIC-III data, consisting of 1,458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC of 0.89 on the simulated dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization, and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.
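To make contribution (1) concrete, the following is a minimal sketch of how a Cox proportional hazards partial log-likelihood can be evaluated on per-patient topic proportions. It is an illustration under stated assumptions, not the MixEHR-SurG implementation: the names `theta` (patient-by-topic proportions), `w` (one Cox coefficient per topic), `time`, and `event` are hypothetical, the topic proportions are treated as fixed inputs rather than inferred jointly with the hazard, and tied event times are handled with the simple Breslow convention.

```python
import numpy as np

def cox_partial_log_likelihood(theta, w, time, event):
    """Cox partial log-likelihood with topic proportions as covariates.

    theta: (N, K) array, row i holds patient i's topic proportions (assumed given).
    w:     (K,) array of Cox coefficients, one per topic (hypothetical name).
    time:  (N,) array of follow-up times.
    event: (N,) array, 1 if death was observed, 0 if censored.
    """
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(w, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event).astype(bool)

    eta = theta @ w  # linear risk score for each patient
    log_lik = 0.0
    for i in np.flatnonzero(event):
        # Risk set: every patient still under observation at patient i's event time.
        at_risk = time >= time[i]
        # Numerically stable log-sum-exp over the risk set.
        m = eta[at_risk].max()
        log_denominator = m + np.log(np.exp(eta[at_risk] - m).sum())
        log_lik += eta[i] - log_denominator
    return log_lik

# Toy data (made up for illustration): topic 1 is heavier in patients who die early.
theta = np.array([[0.5, 0.5],
                  [0.2, 0.8],
                  [0.9, 0.1],
                  [0.1, 0.9]])
time = np.array([5.0, 3.0, 8.0, 1.0])
event = np.array([1, 0, 1, 1])

ll_null = cox_partial_log_likelihood(theta, np.zeros(2), time, event)
ll_risk = cox_partial_log_likelihood(theta, np.array([0.0, 2.0]), time, event)
print(ll_null, ll_risk)
```

With all coefficients at zero, every patient in a risk set is equally likely to fail, so the value depends only on the risk-set sizes. Giving a positive coefficient to the topic that is enriched in early deaths raises the partial likelihood on this toy data, which is the sense in which a topic can be read as mortality-associated.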