
Scalable and accurate deep learning for electronic health records (1801.07860v3)

Published 24 Jan 2018 in cs.CY and cs.LG

Abstract: Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient's record. We propose a representation of patients' entire, raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient's final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient's chart.

This paper presents a deep learning approach for predictive modeling using raw Electronic Health Record (EHR) data, demonstrating its scalability and accuracy across multiple clinical tasks and hospital sites without requiring manual feature engineering or site-specific data harmonization. The core idea is to represent a patient's entire longitudinal EHR, including structured data, laboratory results, vital signs, medications, procedures, and free-text clinical notes, as a temporal sequence of events based on the Fast Healthcare Interoperability Resources (FHIR) format.

Methodology:

  1. Data Representation: The entire EHR for each patient was converted into a sequence of FHIR-based resources, ordered chronologically. Each piece of information within a resource (e.g., a medication name, a lab value, a word in a note) was treated as a "token." Numeric values were normalized. This process resulted in a massive dataset, totaling over 46 billion tokens for predictions made at discharge across two hospital systems.
  2. Datasets: De-identified EHR data from two US academic medical centers (UCSF and University of Chicago Medicine) were used, encompassing 216,221 adult hospitalizations (114,003 unique patients) lasting at least 24 hours. One dataset included free-text notes, while the other did not.
  3. Prediction Tasks: The models were trained to predict four distinct outcomes:
    • In-hospital mortality
    • 30-day unplanned readmission
    • Prolonged length of stay (>= 7 days)
    • All final discharge diagnoses (ICD-9 codes)
  4. Prediction Timing: Predictions were generated at multiple time points during a hospitalization (e.g., at admission, 24 hours after admission, at discharge) using all data available up to that point.
  5. Model Architecture: An ensemble of three different deep learning architectures was used:
    • Recurrent Neural Networks (specifically LSTMs)
    • Attention-based Time-Aware Neural Networks (TANNs)
    • Neural networks with boosted time-based decision stumps
    These architectures were chosen for their ability to handle sequential, variable-length, and high-dimensional data.
  6. Baselines: The deep learning models were compared against traditional, clinically-used predictive models adapted for the datasets: an augmented Early Warning Score (aEWS) for mortality, a modified HOSPITAL score for readmission, and a modified Liu score for length of stay.
  7. Evaluation: Model performance was primarily evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC). Calibration curves and the work-up-to-detection ratio (number needed to evaluate) were also assessed. For diagnosis prediction, frequency-weighted AUROC and micro-F1 scores were used.
  8. Explainability: Attribution methods were applied to visualize which input data tokens contributed most significantly to a specific prediction, demonstrated via a case study of mortality prediction.
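A minimal sketch of the sequential representation in step 1, assuming a simplified event schema (`time`, plus named fields): the paper's actual FHIR-based encoding is far richer, but the core idea of flattening a chronologically ordered record into tokens looks roughly like this:

```python
# Sketch: flatten simplified patient events into a chronological token sequence.
# The event schema used here is a stand-in for full FHIR resources.

def normalize(value, mean, std):
    """Standardize a numeric value (the paper normalizes numeric entries)."""
    return (value - mean) / std

def to_token_sequence(events):
    """Order events by time and emit one token per piece of information."""
    tokens = []
    for event in sorted(events, key=lambda e: e["time"]):
        for field, value in event.items():
            if field == "time":
                continue
            if isinstance(value, (int, float)):
                # Hypothetical population stats; real ones come from training data.
                tokens.append((field, round(normalize(value, 100.0, 20.0), 2)))
            else:
                # Free text (e.g. a clinical note) is split into word tokens.
                tokens.extend((field, word) for word in str(value).split())
    return tokens

events = [
    {"time": 2, "note": "chest pain"},
    {"time": 1, "heart_rate": 120},
]
print(to_token_sequence(events))
# [('heart_rate', 1.0), ('note', 'chest'), ('note', 'pain')]
```

Note that the later note event is correctly placed after the earlier vital sign, mirroring the chronological ordering the representation depends on.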
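For the ensemble in step 5, a simple way to combine the three architectures' per-task probabilities is averaging; this is an assumption for illustration, as the summary above does not specify the combination scheme:

```python
# Sketch: combine per-task probabilities from the three architectures.
# Simple averaging is an assumed combination scheme, not confirmed by the paper.

def ensemble(predictions):
    """Average predicted probabilities from several models for one patient."""
    return sum(predictions) / len(predictions)

# Hypothetical mortality scores from the LSTM, TANN, and boosted-stump models.
p_lstm, p_tann, p_stumps = 0.82, 0.78, 0.86
print(round(ensemble([p_lstm, p_tann, p_stumps]), 2))  # 0.82
```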

Key Results:

  • High Accuracy: The deep learning models significantly outperformed the baseline traditional models across all prediction tasks and time points.
    • Mortality (at 24h): AUROC 0.93-0.95 vs 0.85-0.86 (baseline)
    • Readmission (at discharge): AUROC 0.75-0.77 vs 0.68-0.70 (baseline)
    • Length of Stay (at 24h): AUROC 0.85-0.86 vs 0.74-0.76 (baseline)
    • Discharge Diagnoses (at discharge): Weighted AUROC 0.90
  • Scalability: The single FHIR-based data representation pipeline successfully processed vast amounts of diverse data (structured, unstructured text) from two different hospital systems for multiple prediction tasks without task-specific feature engineering.
  • Timeliness: The deep learning models often achieved high predictive accuracy earlier in the hospital stay compared to baseline models (e.g., achieving similar mortality prediction accuracy 24-48 hours earlier).
  • Explainability: The case study demonstrated that attribution methods could highlight clinically relevant information (e.g., specific notes, lab results, medications) that the model used to make its prediction, potentially increasing clinical trust.
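The AUROC figures above can be computed without any ML framework via the Mann-Whitney U formulation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one); the inputs below are hypothetical:

```python
# Sketch: AUROC as the Mann-Whitney U statistic; ties count as half a win.

def auroc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical mortality labels and predicted risks for six hospitalizations.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.92, 0.75, 0.40, 0.20, 0.55, 0.55]
print(auroc(labels, scores))  # ≈ 0.94
```

This rank-based view also explains why AUROC is insensitive to the choice of decision threshold, which is why the paper supplements it with calibration curves and the work-up-to-detection ratio.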

Practical Implications & Implementation Considerations:

  • Reduced Preprocessing: The FHIR-based sequential representation significantly reduces the manual effort typically required for feature selection and data harmonization when building predictive models, promoting scalability.
  • Leveraging Unstructured Data: The ability to directly incorporate free-text notes alongside structured data allows models to potentially capture nuances missed by traditional approaches.
  • Unified Data Pipeline: A single pipeline (Data -> FHIR Sequence -> Model Input) can support various predictive tasks, simplifying development and deployment across different clinical problems.
  • Computational Resources: Training these deep learning models on billions of data points is computationally intensive and requires significant infrastructure (e.g., frameworks such as TensorFlow running on distributed hardware) and expertise. However, making predictions (inference) for a new patient is very fast (milliseconds).
  • Data Format: Implementation requires mapping existing EHR data into the FHIR standard structure, which might be a significant initial step depending on the source system's format. The paper provides a link to a GitHub repository for their FHIR representation: https://github.com/google/fhir.
  • Interpretability: While attribution methods provide some insight, fully understanding and validating the model's reasoning remains an active research area. Presenting these explanations effectively to clinicians is crucial for adoption.
  • Generalizability: While the approach worked well at two sites, further research is needed to understand how well models trained at one site transfer to others, especially given the lack of explicit data harmonization between sites in this paper.
  • Prospective Validation: The paper is retrospective. Prospective clinical trials are needed to confirm if using these predictions actually improves patient outcomes and clinical workflows.
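As a model-agnostic illustration of the kind of token-level attribution discussed under Interpretability, one can occlude each token and measure the resulting score drop. This is a generic sketch with a toy scorer, not the paper's actual attribution method (which is architecture-specific):

```python
# Sketch: occlusion-based attribution. Remove one token at a time and record
# how much the prediction falls; larger drops suggest more influential tokens.
# TOKEN_RISK and predict() are a toy stand-in for a trained model.

TOKEN_RISK = {"sepsis": 0.30, "icu": 0.20, "stable": -0.10}  # hypothetical

def predict(tokens):
    """Toy risk score: clipped sum of per-token weights plus a base rate."""
    raw = 0.10 + sum(TOKEN_RISK.get(t, 0.0) for t in tokens)
    return min(max(raw, 0.0), 1.0)

def occlusion_attribution(tokens):
    """Map each distinct token to the score drop caused by removing it."""
    base = predict(tokens)
    return {t: base - predict([u for u in tokens if u != t]) for t in set(tokens)}

chart = ["icu", "sepsis", "stable"]
attr = occlusion_attribution(chart)
print(max(attr, key=attr.get))  # sepsis
```

Surfacing the highest-attribution tokens next to the prediction is one plausible way to present such evidence from the patient's chart to clinicians.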

In summary, the paper demonstrates a powerful and scalable method using deep learning on raw, temporally ordered FHIR-formatted EHR data to achieve state-of-the-art predictive performance on various clinical tasks, offering a potential path to more automated and comprehensive clinical prediction systems.

Authors (34)
  1. Alvin Rajkomar
  2. Eyal Oren
  3. Kai Chen
  4. Andrew M. Dai
  5. Nissan Hajaj
  6. Peter J. Liu
  7. Xiaobing Liu
  8. Mimi Sun
  9. Patrik Sundberg
  10. Hector Yee
  11. Kun Zhang
  12. Gavin E. Duggan
  13. Gerardo Flores
  14. Michaela Hardt
  15. Jamie Irvine
  16. Quoc Le
  17. Kurt Litsch
  18. Jake Marcus
  19. Alexander Mossin
  20. Justin Tansuwan
Citations (1,941)