Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction (2005.12833v1)

Published 22 May 2020 in cs.CL, cs.LG, and cs.NE

Abstract: Deep learning (DL) based predictive models from electronic health records (EHR) deliver impressive performance in many clinical tasks. Large training cohorts, however, are often required to achieve high accuracy, hindering the adoption of DL-based models in scenarios with limited training data size. Recently, bidirectional encoder representations from transformers (BERT) and related models have achieved tremendous successes in the natural language processing domain. The pre-training of BERT on a very large training corpus generates contextualized embeddings that can boost the performance of models trained on smaller datasets. We propose Med-BERT, which adapts the BERT framework for pre-training contextualized embedding models on structured diagnosis data from an EHR dataset of 28,490,650 patients. Fine-tuning experiments are conducted on two disease-prediction tasks: (1) prediction of heart failure in patients with diabetes and (2) prediction of pancreatic cancer, drawing on two clinical databases. Med-BERT substantially improves prediction accuracy, boosting the area under the receiver operating characteristic curve (AUC) by 2.02-7.12%. In particular, pre-trained Med-BERT substantially improves the performance of tasks with very small fine-tuning training sets (300-500 samples), boosting the AUC by more than 20%, or to a level equivalent to that of a training set 10 times larger. We believe that Med-BERT will benefit disease-prediction studies with small local training datasets, reduce data collection expenses, and accelerate the pace of artificial intelligence-aided healthcare.

An Overview of Med-BERT: Contextualized Embeddings for Disease Prediction

Med-BERT represents a methodological advance in leveraging structured electronic health records (EHRs) for disease prediction. Adapting the BERT model, originally successful in NLP, Med-BERT pre-trains contextualized embeddings tailored to EHR data. The paper evaluates its effectiveness on disease-prediction tasks, focusing on structured EHRs and on scenarios with limited training data.

Methodology

Med-BERT adapts the BERT framework to pre-train contextualized embeddings on structured EHRs. The paper introduces enhancements in layer representation and domain-specific pre-training tasks that better capture clinical semantics. The two disease-prediction tasks used for fine-tuning are heart failure prediction in diabetic patients and pancreatic cancer prediction, encompassing three distinct cohorts from two EHR databases.
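
To make the adaptation concrete, the minimal sketch below shows one plausible way to serialize a patient's structured diagnosis history into BERT-style inputs (code token ids plus per-visit segment ids and an attention mask). The vocabulary handling, function name, and field layout are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's released code): turning one patient's
# structured diagnosis history into token ids, visit (segment) ids, and an
# attention mask for a BERT-style encoder. Vocabulary and field names are hypothetical.
from typing import Dict, List

CODE_VOCAB: Dict[str, int] = {"[PAD]": 0, "[MASK]": 1}

def encode_patient(visits: List[List[str]], max_len: int = 64) -> Dict[str, List[int]]:
    """Flatten a patient's visits (each a list of diagnosis codes) into model inputs."""
    token_ids: List[int] = []
    visit_ids: List[int] = []
    for visit_idx, codes in enumerate(visits, start=1):
        for code in codes:
            CODE_VOCAB.setdefault(code, len(CODE_VOCAB))
            token_ids.append(CODE_VOCAB[code])
            visit_ids.append(visit_idx)  # segment id records which visit a code came from
    token_ids, visit_ids = token_ids[:max_len], visit_ids[:max_len]
    pad = max_len - len(token_ids)
    return {
        "input_ids": token_ids + [CODE_VOCAB["[PAD]"]] * pad,
        "visit_ids": visit_ids + [0] * pad,
        "attention_mask": [1] * len(token_ids) + [0] * pad,
    }

# Two synthetic visits: diabetes/hypertension codes, then a heart failure code.
print(encode_patient([["E11.9", "I10"], ["I50.9"]], max_len=8))
```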

Results

The results indicate a marked improvement in predictive performance with Med-BERT. The introduction of pre-trained embeddings improved the AUC for three predictive models (GRU, Bi-GRU, RETAIN) by 2.67–7.12%. Notably, Med-BERT’s efficacy is particularly high when training on small datasets of 300–500 samples, achieving performance levels comparable to those with datasets ten times larger without pre-training. Meaningful connections among clinical codes were observed via dependency analysis, affirming the quality of learned embeddings.
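
For reference, AUC figures such as those above are computed from model-predicted risk scores against true labels; a minimal illustration with scikit-learn (not the authors' evaluation pipeline) is shown below.

```python
# Minimal AUC computation with scikit-learn; illustrative only,
# not the paper's evaluation code or data.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # ground-truth disease labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # model-predicted risk scores
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```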

Implications

The potential of Med-BERT lies in its ability to provide robust predictions, especially when training data are scarce. The pre-training and fine-tuning paradigm shows promise across a range of EHR-driven predictive tasks, reducing data-collection costs and potentially accelerating AI-aided diagnostics. The pre-trained Med-BERT model is shared publicly, offering significant utility for researchers with limited local datasets.

Comparisons and Contributions

Med-BERT is compared with related models such as BEHRT and G-BERT, highlighting its larger pre-training cohort, more extensive vocabulary, and inclusion of clinically relevant prediction tasks. Notably, Med-BERT incorporates a domain-specific pre-training task, prediction of prolonged length of stay (LOS), which improves how clinical context is captured.
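
A rough sketch of how such a combined pre-training objective could look is given below, pairing masked-code prediction with a prolonged-LOS head; the pooling strategy, dimensions, and equal loss weighting are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a combined pre-training objective: masked-code prediction
# plus a prolonged-LOS classification head. Shapes and pooling are illustrative.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 192
mlm_head = nn.Linear(hidden, vocab_size)   # predicts masked diagnosis codes
los_head = nn.Linear(hidden, 1)            # predicts prolonged length of stay

def pretraining_loss(seq_out, masked_pos, masked_labels, los_labels):
    """seq_out: (batch, seq_len, hidden) transformer outputs."""
    # Masked-LM loss restricted to the positions that were masked out.
    masked_out = seq_out[torch.arange(seq_out.size(0)).unsqueeze(1), masked_pos]
    mlm_loss = nn.functional.cross_entropy(
        mlm_head(masked_out).flatten(0, 1), masked_labels.flatten())
    # Sequence-level prolonged-LOS loss; mean pooling is an assumption here.
    los_logits = los_head(seq_out.mean(dim=1)).squeeze(-1)
    los_loss = nn.functional.binary_cross_entropy_with_logits(
        los_logits, los_labels.float())
    return mlm_loss + los_loss

# Example call with random placeholder tensors (4 patients, 32 codes, 5 masked).
loss = pretraining_loss(torch.randn(4, 32, hidden),
                        torch.randint(0, 32, (4, 5)),
                        torch.randint(0, vocab_size, (4, 5)),
                        torch.randint(0, 2, (4,)))
print(loss.item())
```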

In fine-tuning phases, Med-BERT models showed superior results compared to traditional word2vec-style embeddings, particularly when combined with GRU, Bi-GRU, and RETAIN models. This suggests that Med-BERT may reduce reliance on complex model architectures, offering simpler, more effective solutions.
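
As a rough illustration of this fine-tuning pattern, the sketch below wires pre-trained code embeddings into a GRU classifier for binary disease prediction. The dimensions and the placeholder embedding matrix are assumptions, not the released Med-BERT weights; in the paper the classifier consumes Med-BERT's contextual outputs, whereas a static embedding lookup stands in here for simplicity.

```python
# Hypothetical fine-tuning head: a GRU classifier over pre-trained code
# embeddings (PyTorch). Dimensions and the embedding matrix are placeholders.
import torch
import torch.nn as nn

class GRUDiseaseClassifier(nn.Module):
    def __init__(self, pretrained_embeddings: torch.Tensor, hidden_size: int = 128):
        super().__init__()
        # Initialize from (placeholder) pre-trained code embeddings; keep them trainable.
        self.embed = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        self.gru = nn.GRU(pretrained_embeddings.size(1), hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # binary disease prediction

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(input_ids)                # (batch, seq_len, embed_dim)
        _, h = self.gru(x)                       # h: (1, batch, hidden_size)
        return torch.sigmoid(self.head(h[-1]))  # per-patient disease probability

# Example with random placeholder embeddings for a 1,000-code vocabulary.
emb = torch.randn(1000, 192)
model = GRUDiseaseClassifier(emb)
probs = model(torch.randint(0, 1000, (4, 32)))  # 4 patients, 32 codes each
print(probs.shape)  # torch.Size([4, 1])
```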

Future Directions

The future of AI in healthcare could see broader adoption of models like Med-BERT for varied clinical prediction tasks. Expanding datasets to include additional EHR modalities such as medications, procedures, and laboratory tests could enhance model accuracy. Furthermore, efforts to integrate temporal data more effectively are planned. Continued exploration of novel pre-training tasks and visualization techniques could also improve model interpretability and applicability.

In conclusion, Med-BERT advances the frontier of disease prediction using pre-trained embeddings on structured EHR data. Its potential to streamline predictive tasks, particularly with limited training data, suggests significant implications for clinical informatics and AI-driven healthcare solutions.

Authors (5)
  1. Laila Rasmy (5 papers)
  2. Yang Xiang (187 papers)
  3. Ziqian Xie (3 papers)
  4. Cui Tao (24 papers)
  5. Degui Zhi (8 papers)
Citations (556)