Pretrained Language Models in Biomedicine

Updated 29 July 2025
  • Pretrained language models (PLMs) are deep neural architectures based on transformers that leverage self-supervised objectives to learn robust language representations.
  • They are extensively applied in biomedical NLP, with models like BioBERT and PubMedBERT capturing domain-specific semantic and syntactic patterns.
  • Modern PLMs face challenges such as domain adaptation, vocabulary fragmentation, and efficiency, spurring research on optimization and responsible deployment.

Pretrained language models (PLMs) are deep neural architectures, typically based on the transformer or related self-attention mechanisms, that are trained on large-scale unlabeled text corpora via self-supervised objectives and subsequently adapted to a wide range of downstream NLP tasks. In biomedical NLP, PLMs such as BioBERT, PubMedBERT, and BioELECTRA have become the default backbone because they scalably encode both syntactic and domain-specific semantic knowledge from massive biomedical and clinical corpora (Kalyan et al., 2021; Wang et al., 2021). The evolution of PLMs has redefined information extraction, classification, and reasoning paradigms in the field, but it has also surfaced new challenges around effective adaptation, domain knowledge representation, data efficiency, and responsible deployment.

1. Foundational Concepts: Architecture and Self-Supervised Learning

PLMs leverage self-supervised learning (SSL), where the supervision signal is derived directly from properties of the input text rather than from human annotation. Core pretraining tasks include masked language modeling (MLM), in which a proportion of tokens in the input sequence is masked and the model is trained to reconstruct them, as well as variants such as replaced token detection (RTD) and span-based objectives.

The input tokens are mapped to continuous vector spaces by embedding layers, which are typically constructed as follows:

X = I + P + S

where X \in \mathbb{R}^{n \times e} is the embedding matrix; I, P, and S are the token, positional, and segment embeddings of equal dimension; n is the sequence length; and e is the embedding size.

A stack of transformer encoder blocks processes these inputs, each block combining multi-head self-attention with the following components:

  • Queries, keys, and values: Q = XW^Q, K = XW^K, V = XW^V
  • Scaled dot-product self-attention: P = \text{Softmax}\left(\frac{QK^T}{\sqrt{q}}\right), Z = PV, where q is the query/key dimension
  • Layer normalization, residual connections, and a position-wise feed-forward sublayer: \text{PFN}(y) = \text{GELU}(yW_1 + b_1)W_2 + b_2

This architecture enables modeling of hierarchical and long-range dependencies in language.
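To make these operations concrete, the following minimal PyTorch sketch (all sizes and parameter names are illustrative and not taken from any particular biomedical PLM) computes the embedding sum X = I + P + S, a single scaled dot-product attention head, and the GELU feed-forward sublayer.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (not from any specific biomedical PLM)
n, e, vocab, segments = 128, 256, 30522, 2

# Embedding layers: token (I), positional (P), and segment (S)
tok_emb = torch.nn.Embedding(vocab, e)
pos_emb = torch.nn.Embedding(n, e)
seg_emb = torch.nn.Embedding(segments, e)

token_ids = torch.randint(0, vocab, (1, n))      # toy input sequence
positions = torch.arange(n).unsqueeze(0)
segment_ids = torch.zeros(1, n, dtype=torch.long)

# X = I + P + S  (batch x n x e)
X = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)

# Single attention head: Q = X W^Q, K = X W^K, V = X W^V
q = e  # query/key dimension (set equal to e here for simplicity)
W_Q, W_K, W_V = (torch.nn.Linear(e, q, bias=False) for _ in range(3))
Q, K, V = W_Q(X), W_K(X), W_V(X)

# Scaled dot-product self-attention: P = softmax(QK^T / sqrt(q)), Z = PV
P = F.softmax(Q @ K.transpose(-2, -1) / q ** 0.5, dim=-1)
Z = P @ V

# Residual connections, layer normalization, and GELU feed-forward sublayer
W1, W2 = torch.nn.Linear(e, 4 * e), torch.nn.Linear(4 * e, e)
out = torch.nn.LayerNorm(e)(X + Z)
out = torch.nn.LayerNorm(e)(out + W2(F.gelu(W1(out))))
```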

2. Pretraining Strategies and Objectives

Pretraining strategies for PLMs in biomedicine are categorized along axes of data source and objective function:

  • Mixed-Domain Pretraining (MDPT): Continual pretraining on in-domain biomedical corpora using weights from a general-domain PLM (example: BioBERT continues from BERT on PubMed/PMC text).
  • Domain-Specific Pretraining (DSPT): From-scratch pretraining solely on domain data—this enables optimized vocabulary segmentation and tokenization for biomedical terminology (example: PubMedBERT).
  • Task Adaptive Pretraining (TAPT): Additional pretraining on a small collection of task-relevant but unlabeled data to impart domain and style characteristics at low cost.
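As an example of how MDPT or TAPT is typically carried out in practice, the sketch below continues masked language modeling on an in-domain corpus with the Hugging Face transformers library; the starting checkpoint, the corpus file pubmed_abstracts.txt, and all hyperparameters are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from a general-domain checkpoint (MDPT/TAPT); a from-scratch DSPT run
# would instead initialize the model from a config and a domain vocabulary.
checkpoint = "bert-base-uncased"                 # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# "pubmed_abstracts.txt" is a hypothetical in-domain corpus file.
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Dynamic masking for the MLM objective (15% of tokens).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biomed-mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```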

The main objectives include:

  • Masked Language Modeling (MLM):

L_{MLM} = -\frac{1}{|m(x)|} \sum_{i \in m(x)} \log P(x_i \mid \hat{x})

where m(x) is the set of masked token positions in x and \hat{x} is the corrupted (masked) input sequence; a minimal computation of this loss is sketched after this list.

  • Replaced Token Detection (RTD): Each token is classified as original or replaced, sidestepping issues of [MASK] tokens.
  • Span Boundary Objective (SBO): Predicts entire masked spans from boundary representations.
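As a concrete illustration of the MLM loss above, the following sketch masks a few positions of a toy token sequence and averages the negative log-likelihood over only those positions; the "model" here is a stand-in embedding-plus-linear predictor, not an actual PLM.

```python
import torch
import torch.nn.functional as F

vocab, e, n = 100, 32, 10                      # toy sizes
x = torch.randint(0, vocab, (n,))              # original token ids
mask_positions = torch.tensor([2, 5, 7])       # m(x): indices that get masked

# Build the corrupted input x_hat by replacing masked positions with a [MASK] id.
MASK_ID = 0
x_hat = x.clone()
x_hat[mask_positions] = MASK_ID

# Stand-in "model": embedding + linear head producing a distribution over the vocab.
embed = torch.nn.Embedding(vocab, e)
head = torch.nn.Linear(e, vocab)
logits = head(embed(x_hat))                    # (n, vocab)

# L_MLM = -(1/|m(x)|) * sum over masked i of log P(x_i | x_hat)
log_probs = F.log_softmax(logits, dim=-1)
loss = -log_probs[mask_positions, x[mask_positions]].mean()
print(float(loss))
```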

Auxiliary objectives often inject external biomedical knowledge, e.g., predicting relationships extracted from UMLS or multi-label synonym prediction.

3. Adaptation and Fine-Tuning for Downstream Biomedical Tasks

PLMs are adapted to downstream tasks in two principal manners:

  • Intermediate Fine-Tuning (IFT): The PLM is first fine-tuned on a larger, related labeled set (e.g., general-domain NLI) before being fine-tuned on the smaller biomedical target set.
  • Multi-Task Fine-Tuning: Simultaneous fine-tuning on multiple related tasks using shared representations but separate output heads.

These strategies are particularly crucial in biomedicine because of the scarcity and idiosyncrasy of labeled datasets (e.g., NER for rare diseases, or relation extraction from clinical notes), and because models must be robust to complex biomedical language phenomena.
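A minimal multi-task fine-tuning setup is sketched below: one shared encoder feeds separate classification heads, one per task. The checkpoint name, task names, and label counts are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class MultiTaskBiomedModel(torch.nn.Module):
    """One shared PLM encoder with a separate output head per task."""

    def __init__(self, checkpoint: str, task_num_labels: dict[str, int]):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.heads = torch.nn.ModuleDict({
            task: torch.nn.Linear(hidden, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, task: str, **inputs):
        # Use the [CLS] representation for sentence-level tasks.
        hidden_states = self.encoder(**inputs).last_hidden_state
        return self.heads[task](hidden_states[:, 0])

# Placeholder checkpoint and task definitions.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
model = MultiTaskBiomedModel(checkpoint, {"relation_extraction": 5, "nli": 3})
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

batch = tokenizer(["aspirin inhibits COX-1"], return_tensors="pt")
logits = model("relation_extraction", **batch)   # (1, 5) task-specific logits
```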

4. Model Taxonomy: Data and Specialization

Biomedical PLMs are categorized by their data source and architecture extensions:

| Data Corpus Type | Representative Models | Special Features |
| --- | --- | --- |
| EHR/Clinical Notes | ClinicalBERT, BEHRT, MedBERT | Tokenization of structured codes; clinical metadata embeddings (age, gender) |
| Scientific Articles | BioBERT, PubMedBERT, BlueBERT, BioELECTRA | PubMed/PMC data; custom vocabularies |
| Radiology Reports | Specialized models for imaging text | Multi-modal extensions |
| Social Media | CT-BERT, BioRedditBERT | Handling of informal biomedical text |
| Hybrid | Mix of general and biomedical text | Up-sampling; vocabulary extension |

Extensions include language-specific (non-English) PLMs, ontology-enriched models (incorporating structured KBs like UMLS), vocabulary-adaptive "green models," debiased models, and multi-modal models unifying clinical text and images.

5. Challenges, Solutions, and Open Issues in Deployment

Biomedical PLMs face distinct challenges:

  • Low-Cost Domain Adaptation: Pretraining is compute-intensive; solutions include TAPT and extension of the embedding layer with in-domain vocabulary, often by retraining a WordPiece tokenizer or Word2Vec embeddings on in-domain text.
  • Ontology Integration: Models may overlook curated KB insights; injection via auxiliary pretraining tasks (triple classification, entity linking) is deployed.
  • Scarcity of Target Data: Transfer approaches (IFT, multi-task, semi-supervised learning) and data augmentation (e.g., back translation, EDA) mitigate overfitting.
  • Robustness to Noise: Character-level embeddings (e.g., BioCharBERT), adversarial training, and noisy sample augmentation improve performance on non-canonical biomedical language.
  • Vocabulary Fragmentation: Biomedical terms split into many subwords by general-domain tokenizers are poorly represented; domain-specific tokenization or vocabulary augmentation remedies this (see the sketch after this list).
  • Limited In-domain Data for Pretraining: Corpus up-sampling and combined simultaneous pretraining balance representation learning.
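To illustrate vocabulary augmentation, the sketch below adds a few whole-word biomedical terms to a general-domain tokenizer and resizes the model's embedding matrix; the term list and checkpoint are illustrative, and the newly added embedding rows would still need to be learned via continued in-domain pretraining.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "bert-base-uncased"                 # placeholder general-domain PLM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

term = "pharmacokinetics"
print(tokenizer.tokenize(term))                  # likely fragmented into several subwords

# Add whole-word biomedical terms to the vocabulary (illustrative list).
new_terms = ["pharmacokinetics", "angioplasty", "thrombocytopenia"]
num_added = tokenizer.add_tokens(new_terms)

# Grow the embedding matrix so the new ids get (randomly initialized) vectors;
# these rows are then learned during continued in-domain pretraining.
model.resize_token_embeddings(len(tokenizer))
print(num_added, tokenizer.tokenize(term))       # now kept as a single token
```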

Open issues remain regarding:

  • Bias mitigation via intrinsic probes and new loss formulations.
  • Data privacy and leakage risk from pretraining on sensitive EHR data.
  • Efficient, environmentally sustainable pretraining and inference architectures.
  • Novel pretraining tasks that offer richer learning signals or overcome the masking bottleneck.
  • Development of benchmarks to reflect EHR and social media biomedical language.
  • Intrinsic probes analogous to LAMA for evaluating domain-specific factuality and reasoning.

6. Impact and Future Research Directions

PLMs have driven unprecedented advances in biomedical NLP task performance, but the pathway forward requires greater efficiency, better domain adaptation, deeper integration of curated domain knowledge, and systematic interpretability and bias assessment. Research directions include adaptively scalable architectures, improved privacy-preserving pretraining, expanded and more diverse biomedical benchmarks (including EHR and social media), and explicit probing of how factual and procedural knowledge is internally represented by these models.

More broadly, as new architectures (e.g., ConvBERT, DeBERTa) and knowledge integration strategies are explored, the community is converging on more efficient, transparent, and knowledge-augmented PLMs that balance coverage, interpretability, and robustness.
