Pretrained Language Models in Biomedicine
- Pretrained language models (PLMs) are deep neural architectures based on transformers that leverage self-supervised objectives to learn robust language representations.
- They are extensively applied in biomedical NLP, with models like BioBERT and PubMedBERT capturing domain-specific semantic and syntactic patterns.
- Modern PLMs face challenges such as domain adaptation, vocabulary fragmentation, and efficiency, spurring research on optimization and responsible deployment.
Pretrained language models (PLMs) are deep neural architectures, typically based on transformer or related self-attention mechanisms, that are trained on large-scale unlabeled text corpora via self-supervised objectives and subsequently adapted to a wide range of downstream NLP tasks. In biomedical NLP, PLMs such as BioBERT, PubMedBERT, and BioELECTRA have become the default backbone due to their ability to encode both syntactic and domain-specific semantic knowledge from massive biomedical and clinical corpora (Kalyan et al., 2021; Wang et al., 2021). The evolution of PLMs has redefined information extraction, classification, and reasoning paradigms in the field, but has also surfaced new challenges around effective adaptation, domain knowledge representation, data efficiency, and responsible deployment.
1. Foundational Concepts: Architecture and Self-Supervised Learning
PLMs leverage self-supervised learning (SSL), where supervision is derived directly from properties of the input text rather than human annotation. Core tasks include masked language modeling (MLM), in which a proportion of tokens in the input sequence is masked and the model is trained to reconstruct them, and variants such as replaced token detection (RTD) and span-based objectives.
The input tokens are mapped to continuous vector spaces by embedding layers, which are typically constructed as follows:
$$E = E_{\mathrm{tok}} + E_{\mathrm{pos}} + E_{\mathrm{seg}} \in \mathbb{R}^{n \times d}$$
where $E$ is the embedding matrix, $E_{\mathrm{tok}}$, $E_{\mathrm{pos}}$, and $E_{\mathrm{seg}}$ correspond to token, positional, and segment embeddings of equivalent dimension, $n$ is the sequence length, and $d$ is the embedding size.
A stack of transformer encoder blocks processes these inputs, each including multi-head self-attention mechanisms with internals:
- Queries, keys, and values: $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- Scaled dot-product self-attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$
- Layer normalization and feed-forward sublayers: $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, each sublayer wrapped with a residual connection and layer normalization
This architecture enables modeling of hierarchical and long-range dependencies in language.
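To make these components concrete, the following minimal PyTorch sketch combines the summed token/positional/segment embeddings with a single-head scaled dot-product attention and a feed-forward sublayer. The dimensions, vocabulary size, and class name are illustrative choices, not the configuration of any particular biomedical PLM.

```python
import math
import torch
import torch.nn as nn

class MiniEncoderBlock(nn.Module):
    """Single-head transformer encoder block (illustrative sketch, not a real PLM)."""

    def __init__(self, vocab_size=30522, max_len=512, d=256, d_ff=1024):
        super().__init__()
        # Token, positional, and segment embeddings of the same dimension d.
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)
        self.seg = nn.Embedding(2, d)
        # Linear projections producing queries, keys, and values.
        self.W_q, self.W_k, self.W_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))

    def forward(self, token_ids, segment_ids):
        n = token_ids.size(1)
        positions = torch.arange(n, device=token_ids.device).unsqueeze(0)
        # E = E_tok + E_pos + E_seg, shape (batch, n, d).
        x = self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)
        # Scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V.
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        x = self.norm1(x + attn @ v)      # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))   # feed-forward sublayer + residual + layer norm
        return x

# Toy usage: a batch of 2 sequences of length 8.
ids = torch.randint(0, 30522, (2, 8))
segs = torch.zeros(2, 8, dtype=torch.long)
out = MiniEncoderBlock()(ids, segs)       # shape: (2, 8, 256)
```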
2. Pretraining Strategies and Objectives
Pretraining strategies for PLMs in biomedicine are categorized along axes of data source and objective function:
- Mixed-Domain Pretraining (MDPT): Continual pretraining on in-domain biomedical corpora using weights from a general-domain PLM (example: BioBERT continues from BERT on PubMed/PMC text).
- Domain-Specific Pretraining (DSPT): From-scratch pretraining solely on domain data—this enables optimized vocabulary segmentation and tokenization for biomedical terminology (example: PubMedBERT).
- Task Adaptive Pretraining (TAPT): Additional pretraining on a small collection of task-relevant but unlabeled data to impart domain and style characteristics at low cost.
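As a rough illustration of how continual (MDPT-style) or task-adaptive (TAPT-style) pretraining can be run in practice, the following sketch uses the Hugging Face transformers and datasets libraries; the starting checkpoint, corpus file, and hyperparameters are placeholder assumptions, not the recipe of any specific biomedical PLM.

```python
# Continual / task-adaptive MLM pretraining sketch with Hugging Face transformers.
# The checkpoint, corpus file, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"                      # general-domain starting weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Unlabeled in-domain text, one document per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamic masking for the MLM objective (15% of tokens).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biomed-mlm",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()   # the adapted model can then be fine-tuned on downstream tasks
```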
The main objectives include:
- Masked Language Modeling (MLM): $\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log P\left(x_i \mid \mathbf{x}_{\setminus M}\right)$,
where $M$ is the set of masked tokens in the input sequence $\mathbf{x}$ (a minimal sketch of this objective follows the list).
- Replaced Token Detection (RTD): Each token is classified as original or replaced, sidestepping issues of [MASK] tokens.
- Span Boundary Objective (SBO): Predicts entire masked spans from boundary representations.
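A minimal PyTorch sketch of the MLM objective above: a random subset of positions is masked, the model predicts the original tokens, and cross-entropy is computed only over the masked set. The model interface and mask probability are assumptions; refinements such as BERT's 80/10/10 replacement scheme are omitted.

```python
import torch
import torch.nn.functional as F

def mlm_step(model, input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """One illustrative MLM step: corrupt ~15% of tokens and predict them back.

    `model` is assumed to map (batch, seq_len) token ids to
    (batch, seq_len, vocab_size) logits. BERT's 80/10/10 replacement
    refinement is omitted for brevity.
    """
    # Sample the masked set M and build the corrupted input x_\M.
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    # Labels keep the original ids at masked positions and -100 elsewhere,
    # so cross_entropy realizes L_MLM = -sum_{i in M} log p(x_i | x_\M).
    labels = input_ids.masked_fill(~mask, -100)

    logits = model(corrupted)
    return F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1),
                           ignore_index=-100)
```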
Auxiliary objectives often inject external biomedical knowledge, e.g., predicting relationships extracted from UMLS or performing multi-label synonym prediction.
3. Adaptation and Fine-Tuning for Downstream Biomedical Tasks
PLMs are adapted to downstream tasks in two principal manners:
- Intermediate Fine-Tuning (IFT): The PLM is first fine-tuned on a larger, related labeled set (e.g., general-domain NLI) before being fine-tuned on the smaller biomedical target set.
- Multi-Task Fine-Tuning: Simultaneous fine-tuning on multiple related tasks using shared representations but separate output heads (see the sketch below).
These strategies are particularly crucial in biomedicine because of the scarcity and idiosyncrasy of labeled datasets (e.g., in NER for rare diseases, or relation extraction in clinical notes), as well as the need for models robust to complex biomedical language phenomena.
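To illustrate the multi-task fine-tuning setup mentioned above, here is a minimal sketch of a shared encoder with one output head per task, assuming the Hugging Face transformers API; the checkpoint name, task names, and label counts are hypothetical placeholders.

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskBiomedModel(nn.Module):
    """Shared PLM encoder with one lightweight output head per task (sketch)."""

    def __init__(self, checkpoint, task_num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)   # shared representations
        hidden = self.encoder.config.hidden_size
        # Separate output heads, one per task.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()})

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if task == "ner":                                        # token-level predictions
            return self.heads[task](out.last_hidden_state)
        return self.heads[task](out.last_hidden_state[:, 0])    # [CLS]-level predictions

# Hypothetical usage: two related biomedical tasks sharing one encoder.
model = MultiTaskBiomedModel("dmis-lab/biobert-base-cased-v1.1",
                             {"ner": 5, "relation_extraction": 3})
```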
4. Model Taxonomy: Data and Specialization
Biomedical PLMs are categorized by their data source and architecture extensions:
| Data Corpus Type | Representative Models | Special Features |
|---|---|---|
| EHR/Clinical Notes | ClinicalBERT, BEHRT, MedBERT | Tokenization of structured codes; clinical metadata embeddings (age, gender) |
| Scientific Articles | BioBERT, PubMedBERT, BlueBERT, BioELECTRA | PubMed/PMC data, custom vocabularies |
| Radiology Reports | Specialized models for imaging text | Multi-modal extensions |
| Social Media | CT-BERT, BioRedditBERT | Handling informal biomedical text |
| Hybrid | Mix of general and biomedical text | Up-sampling, vocabulary extension |
Extensions include language-specific (non-English) PLMs, ontology-enriched models (incorporating structured KBs like UMLS), vocabulary-adaptive "green models," debiased models, and multi-modal models unifying clinical text and images.
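To make the role of custom vocabularies in the table above concrete, the following short sketch compares how a general-domain tokenizer and a domain-specific one segment a biomedical term. The checkpoint identifiers are illustrative assumptions and may need to be swapped for whatever checkpoints are available on the Hugging Face Hub.

```python
from transformers import AutoTokenizer

# Checkpoint identifiers are illustrative; substitute any general-domain and
# domain-specific checkpoints available on the Hugging Face Hub.
general = AutoTokenizer.from_pretrained("bert-base-uncased")
domain = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

term = "pheochromocytoma"
print(general.tokenize(term))   # general-domain vocabulary fragments the term
print(domain.tokenize(term))    # a domain vocabulary typically yields far fewer pieces
```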
5. Challenges, Solutions, and Open Issues in Deployment
Biomedical PLMs face distinct challenges:
- Low-Cost Domain Adaptation: Pretraining from scratch is compute-intensive; lighter-weight solutions include TAPT and extending the embedding layer with an in-domain vocabulary, often built by retraining a WordPiece tokenizer or Word2Vec embeddings on in-domain text.
- Ontology Integration: Models may overlook curated knowledge-base (KB) insights; such knowledge is injected via auxiliary pretraining tasks (e.g., triple classification, entity linking).
- Scarcity of Target Data: Transfer approaches (IFT, multi-task, semi-supervised learning) and data augmentation (e.g., back translation, EDA) mitigate overfitting.
- Robustness to Noise: Character-level embeddings (e.g., BioCharBERT), adversarial training, and noisy sample augmentation improve performance on non-canonical biomedical language.
- Vocabulary Fragmentation: Biomedical terms split by general-domain tokenizers are inadequately represented; domain-specific tokenization or vocabulary augmentation remedies this (a minimal augmentation sketch follows this list).
- Limited In-domain Data for Pretraining: Corpus up-sampling and simultaneous pretraining on combined general and in-domain corpora help balance representation learning.
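For the low-cost adaptation and vocabulary fragmentation items above, a minimal sketch of vocabulary augmentation with the transformers API: new in-domain terms are added to the tokenizer and the embedding matrix is resized to match. The checkpoint and term list are illustrative; the newly added embedding rows start randomly initialized and would normally be tuned via continued pretraining or seeded from domain Word2Vec vectors.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative checkpoint and term list; in practice the new vocabulary would be
# mined from an in-domain corpus (e.g., frequently fragmented biomedical terms).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_terms = ["pheochromocytoma", "thrombocytopenia", "angiotensin"]
num_added = tokenizer.add_tokens(new_terms)

# Grow the embedding matrix to cover the added tokens. The new rows start randomly
# initialized and would normally be tuned via continued (TAPT/MDPT-style)
# pretraining, or seeded from domain Word2Vec vectors.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```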
Open issues remain regarding:
- Bias mitigation via intrinsic probes and new loss formulations.
- Data privacy and leakage risk from pretraining on sensitive EHR data.
- Efficient, environmentally sustainable pretraining and inference architectures.
- Novel pretraining tasks that offer richer learning signals or overcome the masking bottleneck.
- Development of benchmarks to reflect EHR and social media biomedical language.
- Intrinsic probes analogous to LAMA for evaluating domain-specific factuality and reasoning.
6. Impact and Future Research Directions
PLMs have driven unprecedented advances in biomedical NLP task performance, but the pathway forward requires greater efficiency, better domain adaptation, deeper integration of curated domain knowledge, and systematic interpretability and bias assessment. Research directions include adaptively scalable architectures, improved privacy-preserving pretraining, expanded and more diverse biomedical benchmarks (including EHR and social media), and explicit probing of how factual and procedural knowledge is internally represented by these models.
More broadly, as new architectures (e.g., ConvBERT, DeBERTa) and knowledge integration strategies are explored, the community is converging on more efficient, transparent, and knowledge-augmented PLMs that balance coverage, interpretability, and robustness.
References
- AMMU: A Survey of Transformer-based Biomedical Pretrained Language Models (Kalyan et al., 2021)
- Pre-trained Language Models in Biomedical Domain: A Systematic Survey (Wang et al., 2021)