EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models (2307.02028v3)

Published 5 Jul 2023 in cs.LG, cs.AI, and cs.CL

Abstract: While the general ML community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from our website: https://ehrshot.stanford.edu. Code to reproduce our results are available at our Github repo: https://github.com/som-shahlab/ehrshot-benchmark

Evaluating Clinical Foundation Models: An Exploration with EHRSHOT

The paper "EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models" makes a substantial contribution to machine learning in healthcare, addressing critical barriers to progress by introducing a benchmark dataset, releasing a pretrained foundation model, and defining a suite of few-shot clinical prediction tasks. This work represents a significant step toward improving reproducibility and evaluation standards for clinical prediction models built on electronic health records (EHRs).

Contributions

The authors present three key contributions:

  1. Dataset Release: The EHRSHOT dataset stands out for its longitudinal nature, featuring 6,739 patients' complete medical histories from Stanford Medicine, encompassing over 41.6 million clinical events. Unlike prior datasets such as MIMIC-III/IV, which focus predominantly on ICU/ED data, EHRSHOT's broader scope allows for more comprehensive evaluation scenarios that better reflect real-world healthcare systems. This dataset is released under a research data use agreement, ensuring accessibility while maintaining necessary privacy protections.
  2. Foundation Model - CLMBR-T-base: The paper introduces CLMBR-T-base, a 141 million parameter transformer-based model trained on structured EHR data from 2.57 million deidentified patient records. The model is notably published with its full pretrained weights, addressing a major gap in the availability of such models for EHR data. Many existing models, like GatorTron and ClinicalBERT, handle only unstructured text and often lack full weight availability for external validation or extension.
  3. Few-Shot Benchmark Tasks: By defining a set of 15 few-shot clinical prediction tasks, the authors provide a framework for evaluating the efficacy of foundation models with limited training samples. This is crucial in the healthcare domain, where acquiring labeled data is resource-intensive. The tasks benchmark sample efficiency and task adaptation by testing how well a model generalizes to diverse prediction targets from only a handful of labeled examples (a minimal evaluation sketch follows this list).
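
To make the evaluation protocol concrete, here is a minimal sketch of a few-shot evaluation loop in the spirit of the benchmark: subsample k labeled examples per class, fit a lightweight head on frozen foundation-model representations, and score with AUROC/AUPRC. The function and the assumption that patient embeddings are precomputed are illustrative, not the authors' implementation; the actual pipeline lives in the GitHub repo linked above.

```python
# Minimal few-shot evaluation sketch, assuming patient-level embeddings
# from a frozen foundation model (e.g., CLMBR-T-base) are already computed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

def few_shot_eval(train_X, train_y, test_X, test_y, k, seed=0):
    """Subsample k positive and k negative training examples, fit a
    lightweight head on frozen embeddings, and report AUROC / AUPRC."""
    rng = np.random.default_rng(seed)
    pos = rng.choice(np.where(train_y == 1)[0], size=k, replace=False)
    neg = rng.choice(np.where(train_y == 0)[0], size=k, replace=False)
    idx = np.concatenate([pos, neg])

    head = LogisticRegression(max_iter=1000)
    head.fit(train_X[idx], train_y[idx])

    probs = head.predict_proba(test_X)[:, 1]
    return roc_auc_score(test_y, probs), average_precision_score(test_y, probs)
```

Sweeping k over a range of small values and averaging over several random subsamples yields the sample-efficiency curves that this style of benchmark reports.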

Results and Analytical Discussion

The authors demonstrate significant AUROC/AUPRC improvements in few-shot settings when using CLMBR-T-base compared to a count-based gradient boosted machine (GBM) baseline. However, they note that while CLMBR-T-base outperforms across most task categories, especially at intermediate sample sizes, its advantage can shrink, and even reverse, as more labeled training data become available, particularly for certain diagnosis tasks.
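
For context, a count-based baseline of this kind featurizes each patient by counting occurrences of clinical codes in their history, then fits a gradient-boosted model on those counts. The sketch below illustrates the pattern; the `patient_code_lists` input format and the choice of LightGBM are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of a count-based GBM baseline: bag-of-codes features
# per patient, fed to a gradient-boosted classifier.
from collections import Counter

import lightgbm as lgb
from sklearn.feature_extraction import DictVectorizer

def count_featurize(patient_code_lists):
    """Map each patient's list of clinical codes to a sparse count vector."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(Counter(codes) for codes in patient_code_lists)
    return X, vectorizer

# Example usage (train_codes / train_y are hypothetical inputs):
# X, vec = count_featurize(train_codes)
# gbm = lgb.LGBMClassifier(n_estimators=200)
# gbm.fit(X, train_y)
```

With abundant labels, such task-specific count features can be highly competitive, which helps explain the crossover behavior the authors observe at larger training-set sizes.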

These observations suggest that while pretrained models like CLMBR-T-base provide an advantageous starting point thanks to prior exposure to millions of patient records, task-specific models trained from scratch can surpass them once sufficient labeled data are available. The performance parity or reversal at higher label counts underscores the complex interplay between task design, data availability, and model architecture in clinical deployments.

Implications and Future Directions

The authors' methodical approach of releasing a dataset alongside a pretrained model addresses several gaps noted in prior literature: the scarcity of structured EHR data beyond ICU/ED settings, the absence of openly available pretrained foundation models, and the lack of adequate few-shot learning benchmarks. The authors suggest that shared, open resources enable collaborative advances in model robustness and scope, promoting more standardized methods in healthcare ML research.

Future developments could involve investigating larger and multi-institutional datasets to assess the generalizability and reliability of foundation models across varying healthcare systems. Additionally, extending the work to include unstructured data such as clinical notes could provide further insights into the effective integration and utilization of all data types present in EHRs.

Conclusion

"EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models" establishes a framework for enhanced reproducibility and a new avenue for validating clinical foundation models in low-label environments. By opening path breaking avenues for few-shot learning evaluations in healthcare, this paper seeds future exploration in the fifty percent shades of clinical machine learning, navigating towards a more reliable and accessible embodiment of data-driven healthcare.

Authors
  1. Michael Wornow
  2. Rahul Thapa
  3. Ethan Steinberg
  4. Jason A. Fries
  5. Nigam H. Shah