Clinical BERT: Pretrained on EHR Notes

Updated 16 November 2025
  • The paper shows that pretraining BERT on clinical notes significantly improves performance on clinical NLP benchmarks compared to general-domain models.
  • The methodology integrates rigorous deidentification, customized subword tokenization, and extensive pretraining on diverse EHR corpora to handle domain-specific language.
  • Key challenges include ensuring out-of-distribution generalization, managing long document contexts, and maintaining privacy while optimizing clinical model performance.

Bidirectional Encoder Representations from Transformers (BERT) pretrained on clinical notes refers to contextualized language models trained with self-supervised objectives (masked language modeling and, for most variants, next-sentence prediction) on large corpora of electronic health record (EHR) narrative text. These models capture domain-specific token distributions, semantic patterns, and task-relevant structural properties unique to clinical documentation, yielding strong gains on downstream clinical NLP benchmarks compared to baseline BERT models pretrained on general-domain corpora.

1. Clinical Note Corpora and Pretraining Procedures

BERT models pretrained on clinical notes employ EHR data such as discharge summaries, progress notes, radiology reports, and specialty note types. Common corpora include MIMIC-III (≈2 million notes, 45,000–60,000 unique patients) and institutional datasets (UCSF's 75-million-note, 39-billion-token corpus; de-identified social-work notes; or hospital-specific collections for disease-specific models such as AKI-BERT or AD-BERT) (Huang et al., 2019, Sushil et al., 2022, Mao et al., 2022, Mao et al., 2022, Sun et al., 2023). Preprocessing typically involves:

  • De-identification (automated PHI redaction and removal, or synthetic PHI tokens)
  • Text normalization (lowercasing, punctuation, whitespace cleanup)
  • Sentence and section segmentation (SpaCy, Punkt, or custom rules)
  • Subword tokenization (WordPiece or byte-level BPE with a medical-specific or general vocabulary)
  • Chunking or padding of clinical documents to the model’s maximum sequence length (512, 1,024, 4,096, or 8,192 tokens, depending on model)
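As a concrete illustration of the normalization, tokenization, and chunking steps above, the following minimal sketch uses the Hugging Face tokenizer API. The `bert-base-uncased` vocabulary, the 64-token stride, and the assumption that de-identification and section segmentation have already been applied upstream are placeholders for illustration, not the exact recipe of any cited model.

```python
import re
from transformers import AutoTokenizer

# Placeholder general-domain vocabulary; swap in an in-domain tokenizer where available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 512  # model maximum sequence length

def chunk_note(note: str):
    """Normalize one de-identified note and split it into overlapping MAX_LEN-token windows."""
    text = re.sub(r"\s+", " ", note.lower()).strip()   # lowercase + whitespace cleanup
    enc = tokenizer(
        text,
        max_length=MAX_LEN,
        truncation=True,
        return_overflowing_tokens=True,   # emit every window, not just the first
        stride=64,                        # token overlap between adjacent windows
        padding="max_length",
    )
    return enc["input_ids"]               # list of token-id lists, one per window
```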

The choice of vocabulary is critical: models trained from scratch on clinical note corpora adopt a subword vocabulary tailored to EHR token frequencies (e.g., BERT-XML's 20K-term vocabulary, which yields fewer than one OOV token per note (Zhang et al., 2020), or UCSF-BERT's 64K-subword WordPiece vocabulary (Sushil et al., 2022)), while others reuse the existing BERT-Base or BioBERT vocabulary (30,000–30,522 tokens).
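For models adopting an in-domain vocabulary, a WordPiece tokenizer can be trained directly on the note corpus. The sketch below uses the Hugging Face `tokenizers` library; the file path, the 64K vocabulary size (mirroring UCSF-BERT's reported size), and the frequency cutoff are illustrative assumptions rather than the exact recipes of the cited models.

```python
from tokenizers import BertWordPieceTokenizer

# Train an in-domain WordPiece vocabulary on de-identified notes (path is a placeholder).
tok = BertWordPieceTokenizer(lowercase=True)
tok.train(
    files=["deidentified_notes.txt"],
    vocab_size=64_000,       # e.g., ~64K subwords as reported for UCSF-BERT
    min_frequency=5,         # drop very rare subwords
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tok.save_model(".", "clinical-wordpiece")   # writes clinical-wordpiece-vocab.txt
```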

2. Model Architectures and Objective Functions

Standard practice is to adopt the BERT-Base transformer configuration:

  • 12 encoder layers
  • Hidden size 768
  • 12 attention heads per layer
  • Intermediate (FFN) size 3072
  • Approx. 110 M parameters
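Expressed with Hugging Face's `BertConfig`, this configuration looks as follows; the 30,522-token vocabulary is the original BERT-Base value and should be replaced when an in-domain vocabulary is used.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_522,            # original BERT-Base vocabulary; adjust for in-domain vocabularies
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~110M
```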

All major clinical BERT variants use the same stack, sometimes with modifications for longer inputs (e.g., Clinical-Longformer: sparse sliding-window attention, max 4,096 tokens (Li et al., 2023); Clinical ModernBERT: RoPE, FlashAttention, 8,192-token context (Lee et al., 4 Apr 2025)) or with additional attention mechanisms for multi-label tasks (as in BERT-XML's per-label attention for ICD coding).
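For the long-context case, the same encoder stack is combined with sparse attention. A generic sketch with Hugging Face's `LongformerConfig` is shown below; it illustrates the sliding-window mechanism rather than reproducing the exact Clinical-Longformer checkpoint.

```python
from transformers import LongformerConfig, LongformerForMaskedLM

config = LongformerConfig(
    attention_window=512,             # sliding-window span per layer (sparse attention)
    max_position_embeddings=4098,     # Longformer reserves two extra positions beyond 4,096
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = LongformerForMaskedLM(config)   # pretrained with MLM only (no NSP)
```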

The canonical pretraining objective is:

\mathcal{L} = \mathcal{L}_{\rm MLM} + \mathcal{L}_{\rm NSP}

where

  • \mathcal{L}_{\rm MLM} = -\sum_{i \in M} \log P(x_i \mid x_{[1:n] \setminus M}) is the masked-token prediction loss, with M the set of masked positions
  • \mathcal{L}_{\rm NSP} is the binary cross-entropy loss for next-sentence (sentence-continuity) prediction

Several recent long-context or efficient models drop NSP entirely, relying on MLM alone (Clinical-Longformer, Clinical-BigBird, ModernBERT) (Li et al., 2023, Lee et al., 4 Apr 2025).
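A minimal sketch of how the joint objective is computed in practice, using Hugging Face's `BertForPreTraining` with toy tensors; the 15% masking rate and the `[MASK]` id of 103 assume the standard BERT vocabulary and are illustrative only.

```python
import torch
from transformers import BertConfig, BertForPreTraining

model = BertForPreTraining(BertConfig())                # BERT-Base defaults
input_ids = torch.randint(1000, 30000, (2, 128))        # toy sentence-pair token ids
mask = torch.rand(input_ids.shape) < 0.15               # mask roughly 15% of positions

mlm_labels = torch.full_like(input_ids, -100)           # -100 positions are ignored by the MLM loss
mlm_labels[mask] = input_ids[mask]                      # predict the original ids at masked positions
masked_ids = input_ids.masked_fill(mask, 103)           # 103 = [MASK] in the standard BERT vocab

nsp_labels = torch.tensor([0, 1])                       # 0 = true continuation, 1 = random pair
out = model(input_ids=masked_ids, labels=mlm_labels, next_sentence_label=nsp_labels)
print(out.loss)                                         # sum of the MLM and NSP terms
```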

Pretraining is conducted either from scratch on clinical note corpora or as domain-adaptive continued pretraining: general-domain BERT/BioBERT weights are further trained on domain-specific EHR text (as in ClinicalBERT (Huang et al., 2019), Bio+Clinical BERT (Yang et al., 2023), and AD-BERT (Mao et al., 2022)), typically for hundreds of thousands to a million update steps with AdamW optimization, linear-decay or cosine learning-rate schedules, and batch sizes dictated by sequence length and GPU/TPU memory.
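A hedged sketch of domain-adaptive continued pretraining with the Hugging Face `Trainer` follows; the starting checkpoint, file path, and hyperparameters are placeholders for illustration rather than the settings used by the cited models.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # or a BioBERT checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# De-identified notes, one segment per line (path is a placeholder).
ds = load_dataset("text", data_files={"train": "deidentified_notes.txt"})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="clinical-mlm",
        per_device_train_batch_size=16,    # dictated by sequence length and GPU memory
        learning_rate=5e-5,                # AdamW with a linear-decay schedule is the Trainer default
        num_train_epochs=1,
    ),
    train_dataset=ds["train"],
    data_collator=collator,
)
trainer.train()
```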

3. Privacy, De-identification, and Model Release

Given the sensitive nature of EHR data, strict deidentification is the norm prior to pretraining (Huang et al., 2019, Lehman et al., 2021, Yang et al., 2023). Public corpora such as MIMIC-III replace real PHI with bracketed surrogate tokens; custom corpora follow institution-approved deidentification, often with additional review or automated PHI detection tools. Lehman et al. (2021) systematically probed ClinicalBERT variants for memorization or leakage of names and associated conditions using a suite of inference and generative attacks. Across MIMIC-III-trained BERTs, no evidence was found of meaningful name-condition linkage leaking beyond what could be inferred from global condition frequencies. However, the risk may increase for larger models (e.g., GPT-3 scale) and for data with richer name-mention density; rigorous deidentification and, when possible, differential privacy (DP-SGD) and access controls are recommended best practices (Lehman et al., 2021, Yang et al., 2023).
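Where differential privacy is required, DP-SGD can be layered onto fine-tuning. The sketch below uses the Opacus `PrivacyEngine` as one possible (assumed) implementation choice, with toy data and untuned noise/clipping values; it is not the setup used in the cited papers.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Toy token ids and labels standing in for a real de-identified dataset.
loader = DataLoader(
    TensorDataset(torch.randint(0, 30_000, (32, 128)), torch.randint(0, 2, (32,))),
    batch_size=8,
)

engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,   # gradient noise scale (privacy/utility trade-off)
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)
# Training then proceeds as usual; per-sample gradients are clipped and noised.
```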

4. Evaluation on Downstream Clinical NLP Tasks

Clinical-BERT variants are systematically evaluated on established EHR NLP benchmarks; representative NER results on i2b2 2010 are summarized in the table below.

Across all tasks, models pretrained or continually adapted on clinical corpora outperform general-domain BERT and often BioBERT (pretrained on PubMed/PMC only). Clinical-Longformer and Clinical ModernBERT show state-of-the-art results in long-document NER and document-level classification, with gains of 1–3 percentage points or more in F₁ (Li et al., 2023, Lee et al., 4 Apr 2025). Domain-specific variants (AKI-BERT, AD-BERT) yield 0.5–2 points of AUROC gain on the corresponding diagnosis/prognosis tasks relative to non-disease-adapted BERTs (Mao et al., 2022, Mao et al., 2022).

Model                              NER F₁ (i2b2 2010)
BERT-Base                          0.784
Clinical-BERT                      0.843–0.858
Fed_Clinical-BERT (FedPretrain)    0.808–0.820
Bio+Clinical BERT (Discharge)      0.9355 (strict)
Clinical Longformer                0.9284 (strict)
Clinical ModernBERT                0.883–0.886
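"Strict" entity-level F₁ in these comparisons corresponds to exact span-and-type matching. A minimal scoring sketch with the `seqeval` library is shown below; the IOB2 label sequences are toy examples, not data from the cited benchmarks.

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy gold and predicted IOB2 tag sequences for two sentences.
y_true = [["B-problem", "I-problem", "O", "B-treatment"], ["O", "B-test", "O"]]
y_pred = [["B-problem", "I-problem", "O", "O"],            ["O", "B-test", "O"]]

# Strict mode: an entity counts as correct only if boundaries and type match exactly.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```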

5. Specialized Pretraining and Adaptation Strategies

Several advanced paradigms have been demonstrated:

  • Federated Pretraining: BERT-Base can be both pretrained and fine-tuned across non-cooperative data silos (FedAvg, no raw data exchange), incurring only a modest F₁ decrease on NER (<5% for federated pretraining, <2% for federated fine-tuning, ~6% for fully federated training) (Liu et al., 2020).
  • Disease-Specific Adaptation: Continued pretraining on disease-focused clinical notes (e.g., AKI-BERT, AD-BERT) outperforms both generic clinical and biomedical BERT models on disease-specific prognostic tasks (Mao et al., 2022, Mao et al., 2022).
  • Multi-label attention architectures (e.g., BERT-XML's per-label attention) are critical for classification over large code sets such as full ICD vocabularies, especially for rare label classes (Zhang et al., 2020).
  • Hierarchical and long-context modeling (UCSF-BERT-MS, Clinical-Longformer, ModernBERT) explicitly address the 512-token limit, achieving superior document-level performance for patients with multiple/long notes (Sun et al., 2023, Li et al., 2023, Lee et al., 4 Apr 2025).
  • Ensembling multiple domain-specific and general models with soft majority voting yields up to 10–11 pp micro-F₁ and 9 pp macro-F₁ improvement over any single model on medication event labeling (Sarker et al., 29 Jun 2025).
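A minimal sketch of the soft-voting idea from the last bullet: average per-class probabilities across models and take the argmax. The number of models, token counts, and class counts below are illustrative.

```python
import numpy as np

def soft_vote(prob_list):
    """Average class probabilities from several models and return the argmax labels.

    prob_list: list of arrays, each of shape (n_tokens, n_classes), one per model.
    """
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return avg.argmax(axis=-1)

# Example: three models voting over four tokens and three classes.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(3), size=4) for _ in range(3)]
print(soft_vote(probs))
```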

6. Limitations and Open Challenges

Recognized caveats include:

  • Most clinical BERT research to date relies on single-institution, U.S.-centric data (e.g., MIMIC-III/IV, UCSF Health) (Sushil et al., 2022, Sun et al., 2023). This limits linguistic and systemic diversity and may inflate evaluation results when train and test sets overlap.
  • Out-of-distribution generalization, abbreviation/acronym expansion, numeric and temporal reasoning, and implicit causal inference remain suboptimal for all current models (Sushil et al., 2022, Lee et al., 4 Apr 2025).
  • Processing very rare codes, extremely long multi-document patient histories, and memory efficiency at 8 K+ token context remain open engineering hurdles (Zhang et al., 2020, Lee et al., 4 Apr 2025).
  • Privacy safeguards beyond deidentification (secure aggregation, DP-SGD, censoring) are not universally deployed; larger autoregressive architectures may present elevated leakage risk (Lehman et al., 2021, Yang et al., 2023).
  • Fine-grained, multi-dimensional contextual event classification (as in CMED Task 3) can suffer relative to general-domain BERTs, possibly owing to overfitting of co-occurrence statistics in highly specialized models (Abdul-Quddoos et al., 23 Sep 2025).

7. Best Practices and Practical Considerations

Successful application of BERT pretrained on clinical notes should incorporate the following:

  • Prefer subword vocabularies adapted to in-domain EHR text to reduce OOV rates and sequence expansion. In systems with markedly different note formats or markup, retrain tokenizers as appropriate (Zhang et al., 2020, Sushil et al., 2022).
  • For tasks involving long notes or multi-note aggregation, prefer sparse attention models (Longformer, BigBird, ModernBERT) or hierarchical pooling of contextual embeddings (UCSF-BERT-MS) (Li et al., 2023, Sun et al., 2023).
  • Employ soft voting ensembling for class-imbalanced entities or relations to harvest complementary strengths of general and clinical-domain BERTs (Sarker et al., 29 Jun 2025).
  • In disease-specific or extreme sublanguage cases, conduct additional domain-adaptive pretraining on the relevant patient cohort (Mao et al., 2022, Mao et al., 2022).
  • Apply rigorous deidentification and PHI suppression, and only release model weights after institutional review; consider DP techniques for privacy-critical applications (Lehman et al., 2021, Yang et al., 2023).
  • For resource-constrained settings, start from BERT-base or quantized open-source models with LoRA-style adapters, rather than training full large models (Yang et al., 2023).
  • For downstream sequence labeling, tune batch size, learning rate (2×10⁻⁵ is standard), and epochs (3–6), and evaluate with strict micro/macro F₁ when annotator agreement or label sparsity is a concern (Huang et al., 2019, Sarker et al., 29 Jun 2025).
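The fine-tuning hyperparameters in the last bullet map directly onto Hugging Face `TrainingArguments`; the values below mirror the ranges quoted above and are starting points, not tuned settings for any particular task.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="clinical-ner-finetune",
    learning_rate=2e-5,                  # standard starting point for clinical BERT fine-tuning
    num_train_epochs=4,                  # typically 3-6
    per_device_train_batch_size=16,      # tune to GPU memory and sequence length
    weight_decay=0.01,
    warmup_ratio=0.1,
)
```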

In conclusion, BERT models pretrained on large, in-domain, de-identified clinical note corpora consistently and substantially improve clinical NLP performance across a broad spectrum of tasks. They encode both the domain sublanguage and the contextual entity/event relations fundamental to robust automated EHR interpretation. Open issues remain around rare-entity generalization, privacy, and robust cross-site transfer, motivating ongoing development of pretraining methods and architectures.
