
Trial2Vec: Zero-Shot Clinical Trial Retrieval

Updated 30 March 2026
  • Trial2Vec is a zero-shot clinical trial retrieval model that produces medically meaningful embeddings using self-supervised contrastive learning and integrated UMLS knowledge.
  • It employs global and local contrastive objectives to encode key trial attributes and contextual sections via a Transformer-based TrialBERT backbone.
  • In expert-labeled retrieval evaluation, Trial2Vec outperforms BM25 (for example, Prec@1 of 0.88 versus 0.70 and Rec@5 of 0.89 versus 0.77), and its embeddings slightly improve trial outcome prediction.

Trial2Vec is a zero-shot clinical trial retrieval model that produces medically meaningful embeddings of clinical trial protocols using self-supervised contrastive learning and clinical knowledge integration, without requiring annotated similarity labels. Trial2Vec encodes the meta-structure of trial documents—including key attributes such as title, disease, intervention, and outcome—as well as contextual sections and Unified Medical Language System (UMLS) entities, to enable efficient and accurate document-level similarity search in the absence of human-labeled data. By leveraging both global and local contrastive objectives, the model yields embeddings that improve downstream applications such as retrieval and trial outcome prediction, with demonstrated gains over established baselines (Wang et al., 2022).

1. Motivation and Problem Formulation

The primary challenge addressed by Trial2Vec is similarity search for clinical trials, a task arising when designing new protocols and seeking to borrow key elements from or avoid pitfalls of historical studies. Clinical trial documents, such as those on ClinicalTrials.gov, are lengthy (∼600 words on average) and lack large-scale human-labeled similarity judgments, primarily due to the high annotation cost associated with expert review. This precludes conventional supervised retrieval approaches.

To address the absence of annotated relevance data, Trial2Vec adopts a zero-shot self-supervised paradigm. Instead of manual trial-to-trial similarity judgments, the model constructs training signals by exploiting the internal meta-structure of trial documents (e.g., section headers and key-value pairs) and external clinical ontologies (e.g., UMLS). This approach enables automatic construction of positive and negative pairs for contrastive learning, circumventing the need for expert annotation (Wang et al., 2022).

2. Trial Document Meta-Structure and Clinical Knowledge Integration

Trial2Vec’s document representation exploits explicit meta-structural segmentation:

  • Key attributes: Title, Condition/Disease, Intervention, and Primary Outcome Measure. These sections encode high-level, sparse, and predictive information.
  • Context sections: Description (study design, arms), Eligibility Criteria (inclusion/exclusion), as well as References, Locations, and Phases, which contain dense descriptive details.

For medical knowledge integration, SciSpacy is used to extract UMLS entities from attribute texts. UMLS’s hierarchical and synonymic relations are leveraged to systematically define similar entities (e.g., drugs in the same class) and establish contrastive pairs for learning. This includes parent-child and synonym relationships, enabling the model to sample positive and negative examples at both document and entity levels (Wang et al., 2022).

3. Self-Supervised Contrastive Learning Framework

Trial2Vec jointly optimizes global and local InfoNCE contrastive losses over each batch of $B$ trial documents. The procedure can be summarized as follows:

(a) Generation of Global Contrastive Pairs

  • Global positives: Constructed by randomly masking one key attribute from a trial, forming $x_i^+$; encoding the residual sections yields the positive embedding $\mathbf{h}_i^+$.
  • Global negatives: Derived by meta-swapping, wherein another trial $k$ (sharing some but not all key attributes) is selected, and a key field (e.g., intervention) is swapped into trial $i$, producing $x_i^-$ and its negative embedding $\mathbf{h}_i^-$.
  • In-batch negatives: Additional negatives are provided by embeddings of other trials within the same batch.
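The pair construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the field names in `KEY_ATTRIBUTES`, the helper functions, and the two example records are all hypothetical, and trials are represented as plain dictionaries.

```python
import random

# Illustrative field names (an assumption); a real trial schema may differ.
KEY_ATTRIBUTES = ("title", "condition", "intervention", "outcome")


def make_global_positive(trial, rng):
    """Drop one randomly chosen key attribute; the remainder is the positive view."""
    present = [k for k in KEY_ATTRIBUTES if k in trial]
    masked = rng.choice(present)
    positive = {k: v for k, v in trial.items() if k != masked}
    return positive, masked


def pick_swap_partner(trial, candidates, field="intervention"):
    """Find a trial that shares some other key attribute but differs on `field`."""
    for other in candidates:
        if other is trial:
            continue
        shares_some = any(
            other.get(k) == trial.get(k) for k in KEY_ATTRIBUTES if k != field
        )
        if shares_some and other.get(field) != trial.get(field):
            return other
    return None


def make_global_negative(trial, partner, field="intervention"):
    """Meta-swap: copy `trial`, then overwrite `field` with the partner's value."""
    negative = dict(trial)
    negative[field] = partner[field]
    return negative


# Made-up example records, for illustration only.
trial_a = {"title": "Drug A for asthma", "condition": "asthma",
           "intervention": "drug A", "outcome": "FEV1", "description": "..."}
trial_b = {"title": "Drug B for asthma", "condition": "asthma",
           "intervention": "drug B", "outcome": "exacerbation rate",
           "description": "..."}
```

Calling `make_global_positive(trial_a, random.Random(0))` returns the trial with one key field removed, and `make_global_negative(trial_a, trial_b)` returns a copy of `trial_a` that carries the intervention of `trial_b`.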

(b) Generation of Local (Attribute-Level) Contrastive Pairs

  • Local positives: For each attribute, its UMLS entity $e_{i,1}$ is replaced with a synonym or parent concept, forming $E(x_i^{\text{att}+})$ and yielding the positive local embedding $\mathbf{z}_i^+$.
  • Local negatives: Generated by removing $e_{i,1}$ or replacing it with an unrelated entity, giving the negative local embedding $\mathbf{z}_i^-$.
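A small sketch can make the entity-level substitution concrete. It assumes a toy ontology held in two hypothetical dictionaries, `SYNONYMS` and `PARENT`, standing in for UMLS; a real pipeline would query UMLS relations instead, and the string replacement here is a simplification of entity-span editing.

```python
import random

# Hypothetical toy ontology standing in for UMLS (assumption, not real UMLS data).
SYNONYMS = {
    "aspirin": ["acetylsalicylic acid"],
    "acetylsalicylic acid": ["aspirin"],
}
PARENT = {"aspirin": "nsaid", "ibuprofen": "nsaid", "metformin": "biguanide"}


def local_positive(text, entity, rng):
    """Replace `entity` with a synonym or its parent concept."""
    options = list(SYNONYMS.get(entity, []))
    if entity in PARENT:
        options.append(PARENT[entity])
    if not options:
        return None
    return text.replace(entity, rng.choice(options))


def local_negative(text, entity, rng):
    """Replace `entity` with a concept that is neither a synonym nor a sibling."""
    related = set(SYNONYMS.get(entity, [])) | {entity}
    unrelated = sorted(
        c for c in PARENT
        if c not in related and PARENT[c] != PARENT.get(entity)
    )
    if not unrelated:
        return None
    return text.replace(entity, rng.choice(unrelated))
```

With this toy ontology, `local_negative("aspirin 100 mg daily", "aspirin", random.Random(0))` can only substitute `metformin`, because `ibuprofen` shares the parent concept of `aspirin` and is therefore treated as related.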

(c) Loss Functions

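Both contrastive objectives use cosine similarity $\mathrm{sim}(\cdot,\cdot)$ with temperature $\tau$ (set to $0.05$):

Global contrastive loss:

$$\mathcal{L}_{\text{global}} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_i^+)/\tau\right)}{\sum_{j=1}^{B}\left[\exp\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_j^+)/\tau\right) + \exp\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_j^-)/\tau\right)\right]}$$

Local contrastive loss:

$$\mathcal{L}_{\text{local}} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^+)/\tau\right)}{\sum_{j=1}^{B}\left[\exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j^+)/\tau\right) + \exp\left(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j^-)/\tau\right)\right]}$$

Total objective: $\mathcal{L} = \mathcal{L}_{\text{global}} + \mathcal{L}_{\text{local}}$.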

This dual-level contrastive learning objective exploits both holistic document and fine-grained attribute relationships within and across trials (Wang et al., 2022).
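As a rough illustration of an InfoNCE loss with cosine similarity, the sketch below scores each anchor against every positive and every constructed negative in the batch. The function names, the use of plain Python lists as vectors, and the choice to sum over the batch are assumptions made for illustration, not details confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, negatives, tau=0.05):
    """InfoNCE summed over the batch.

    For anchor i the target is positives[i]; all other positives and all
    negatives in the batch act as contrast candidates.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / tau for p in positives]
        logits += [cosine(anchor, n) / tau for n in negatives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total


def total_loss(h, h_pos, h_neg, z, z_pos, z_neg, tau=0.05):
    """Global plus local loss; the unweighted sum is an assumption."""
    return info_nce(h, h_pos, h_neg, tau) + info_nce(z, z_pos, z_neg, tau)
```

The loss is close to zero when each anchor is far more similar to its own positive than to any other candidate, and it grows when a different candidate scores higher.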

4. Embedding Architecture

(a) TrialBERT Backbone

Trial2Vec is built on "TrialBERT," which is initialized from BioBERT and then further pre-trained with masked language modeling (MLM) on several corpora:

  • ClinicalTrials.gov trial documents: ∼240M tokens (400k trials)
  • Medical Encyclopedia articles: ∼3M tokens (4k articles)
  • Wikipedia medical articles: ∼11M tokens

(b) Attribute and Context Encoding

For each key attribute $x_i^{\text{att}}$, a local embedding $\mathbf{z}_i = E(x_i^{\text{att}})$ is computed. The context section $x_i^{\text{ctx}}$ (all non-key sections) is encoded as $\mathbf{z}_i^{\text{ctx}} = E(x_i^{\text{ctx}})$.

(c) Attention-Based Global Aggregation

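A global embedding is constructed by aggregating local (attribute) embeddings via multi-head attention, using the context embedding as the query and key attribute embeddings as keys/values: $\mathbf{h}_i = \mathrm{MultiHeadAttn}\left(\mathbf{Q} = \mathbf{z}_i^{\text{ctx}},\ \mathbf{K} = \mathbf{V} = \mathbf{Z}_i^{\text{att}}\right)$, where $\mathbf{z}_i^{\text{ctx}}$ is the context embedding and $\mathbf{Z}_i^{\text{att}}$ stacks the key attribute embeddings. This mechanism enables context-conditioned weighting of attribute embeddings, reflecting the varying clinical relevance of different sections (Wang et al., 2022).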
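The aggregation step can be illustrated with a stripped-down attention pooling function. This sketch uses a single head of scaled dot-product attention and omits the learned query, key, and value projections that a multi-head layer would have, so it shows only the weighting idea; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention (no learned projections).

    `query` plays the role of the context embedding; `keys` and `values`
    are the key-attribute embeddings. Returns the pooled vector and weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights
```

The returned weights sum to one, and an attribute whose embedding aligns more closely with the context embedding receives a larger share of the pooled vector.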

(d) Embedding Usage

  • Local embeddings $\mathbf{z}_i$: Supplied to the local contrastive loss.
  • Global embedding $\mathbf{h}_i$: Drives the global contrastive loss and is used for downstream retrieval.
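Retrieval with global embeddings reduces to nearest-neighbor search under a similarity measure. The following sketch assumes cosine similarity and a small in-memory dictionary of precomputed embeddings; the function names and trial identifiers are hypothetical, and a large corpus would typically use an approximate nearest-neighbor index instead of a full sort.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_trials(query_embedding, corpus, k=5):
    """Rank trials by similarity to the query and return the top-k identifiers.

    `corpus` maps a trial identifier to its precomputed global embedding.
    """
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [trial_id for trial_id, _ in ranked[:k]]
```

Because embeddings for the corpus can be computed once and stored, only the query trial needs to be encoded at search time.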

5. Training Procedure

(a) Pre-training

  • TrialBERT is pre-trained for 5 epochs with batch size 100 and learning rate [...] on the concatenated trial and medical corpora.

(b) Contrastive Fine-tuning

  • Optimizer: AdamW, learning rate [...], weight decay [...].
  • Batch size: 50, using 6 NVIDIA 2080 Ti GPUs.
  • Training proceeds for 3–4 epochs, with early stopping based on a validation metric.

(c) Core Hyperparameters

| Hyperparameter | Value |
|---|---|
| MLM epochs | 5 |
| Contrastive batch size | 50 |
| Temperature ($\tau$) | 0.05 |
| Fine-tuning epochs | 3–4 |

6. Empirical Evaluation

(a) Retrieval Performance

Trial2Vec was evaluated on 1,600 labeled trial pairs (160 query trials × 10 TF-IDF candidate matches each), with relevance judged by expert raters. Compared to BM25, Trial2Vec improved every reported metric:

| Metric | Trial2Vec | BM25 | Δ (pp) |
|---|---|---|---|
| Prec@1 | 0.88 | 0.70 | +18 |
| Prec@2 | 0.79 | 0.56 | +23 |
| Prec@5 | 0.51 | 0.42 | +9 |
| Rec@5 | 0.89 | 0.77 | +12 |
| nDCG@5 | 0.88 | 0.73 | +15 |

This demonstrates the efficacy of self-supervised, meta-structural contrastive learning for clinical trial retrieval, even in the zero-shot setting.
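For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute precision, recall, and nDCG at a cutoff $k$ for a single query. It assumes binary relevance labels and the standard logarithmic discount; the source does not state which exact variants were used, and the function names are illustrative.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Per-query scores are then averaged over all query trials to obtain corpus-level figures.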

(b) Downstream Task: Trial Outcome Prediction

A one-layer MLP classifier attached to the global embedding $\mathbf{h}_i$ was applied to predict trial termination versus completion on 210k completed and 34k terminated trials:

| Model | Accuracy | ROC-AUC | PR-AUC |
|---|---|---|---|
| TF-IDF features | 0.857 | 0.719 | 0.296 |
| TrialBERT (no contrastive) | 0.856 | 0.728 | 0.311 |
| Trial2Vec | 0.862 | 0.733 | 0.314 |

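The gains are small (ROC-AUC rises from 0.719 with TF-IDF features and 0.728 with TrialBERT to 0.733 with Trial2Vec), but they suggest that global trial embeddings help predictive modeling of trial status relative to both count-based and MLM-only baselines.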
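Because the two classes are imbalanced (210k completed versus 34k terminated), accuracy alone is hard to interpret, which is one reason ranking metrics such as ROC-AUC are reported alongside it. The sketch below computes the accuracy of always predicting the majority class and a pairwise ROC-AUC; the function names are illustrative, and the quadratic pairwise loop is meant for small examples only.

```python
def majority_class_accuracy(n_majority, n_minority):
    """Accuracy obtained by always predicting the larger class."""
    return n_majority / (n_majority + n_minority)


def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. `labels` holds 1 for positives and 0 for negatives.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

With the class counts given above, `majority_class_accuracy(210_000, 34_000)` is about 0.861, so the accuracy column leaves little room for any model to separate itself, while ROC-AUC and PR-AUC reflect how well terminated trials are ranked.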

7. Interpretability and Representation Analysis

t-SNE visualizations of 2,000 random global embeddings reveal well-separated clusters corresponding to disease categories; for example, breast cancer and depression trials form distinct groups. Analysis of attention weights from the multi-head attention aggregator shows model focus aligns with clinical intuition: for oncology trials, higher weights are assigned to "condition" and "outcome" embeddings, whereas for eligibility-focused protocols, the "eligibility criteria" embedding dominates. Case studies demonstrate that retrieval often surfaces appropriate historical trials with the same intervention, while keyword-based baselines retrieve less clinically relevant matches (Wang et al., 2022).

Taken together, Trial2Vec offers a principled framework for encoding clinical trial protocols into interpretable, multi-aspect vectors, yielding performance and interpretability benefits across search and predictive tasks in real-world clinical settings.
