Trial2Vec: Zero-Shot Clinical Trial Retrieval
- Trial2Vec is a zero-shot clinical trial retrieval model that produces medically meaningful embeddings using self-supervised contrastive learning and integrated UMLS knowledge.
- It employs global and local contrastive objectives to encode key trial attributes and contextual sections via a Transformer-based TrialBERT backbone.
- Empirical evaluations show Trial2Vec outperforms baselines such as BM25 on precision, recall, and nDCG for retrieval, with smaller gains on trial outcome prediction.
Trial2Vec is a zero-shot clinical trial retrieval model that produces medically meaningful embeddings of clinical trial protocols using self-supervised contrastive learning and clinical knowledge integration, without requiring annotated similarity labels. It encodes the meta-structure of trial documents, covering key attributes such as title, disease, intervention, and outcome as well as contextual sections and Unified Medical Language System (UMLS) entities, to support efficient and accurate document-level similarity search in the absence of human-labeled data. By combining global and local contrastive objectives, the model yields embeddings that improve downstream applications such as retrieval and trial outcome prediction, with reported gains over established baselines such as BM25 (Wang et al., 2022).
1. Motivation and Problem Formulation
The primary challenge addressed by Trial2Vec is similarity search for clinical trials, a task that arises when the designers of a new protocol want to reuse key elements of historical studies or avoid their pitfalls. Clinical trial documents, such as those on ClinicalTrials.gov, are lengthy (∼600 words on average) and lack large-scale human-labeled similarity judgments, primarily because expert review makes annotation expensive. This precludes conventional supervised retrieval approaches.
To address the absence of annotated relevance data, Trial2Vec adopts a zero-shot self-supervised paradigm. Instead of manual trial-to-trial similarity judgments, the model constructs training signals by exploiting the internal meta-structure of trial documents (e.g., section headers and key-value pairs) and external clinical ontologies (e.g., UMLS). This approach enables automatic construction of positive and negative pairs for contrastive learning, circumventing the need for expert annotation (Wang et al., 2022).
2. Trial Document Meta-Structure and Clinical Knowledge Integration
Trial2Vec’s document representation exploits explicit meta-structural segmentation:
- Key attributes: Title, Condition/Disease, Intervention, and Primary Outcome Measure. These sections encode high-level, sparse, and predictive information.
- Context sections: Description (study design, arms), Eligibility Criteria (inclusion/exclusion), as well as References, Locations, and Phases, which contain dense descriptive details.
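The split between key attributes and context sections can be pictured as a small data structure. The following is a minimal sketch, assuming a flat dictionary per trial; the field names, the `TrialDocument` class, and the `split_trial` helper are hypothetical and do not reflect the ClinicalTrials.gov schema or the authors' code.

```python
from dataclasses import dataclass, field

# Illustrative field names only; not the ClinicalTrials.gov schema.
KEY_ATTRIBUTES = ("title", "condition", "intervention", "primary_outcome")


@dataclass
class TrialDocument:
    trial_id: str
    key_attributes: dict                         # attribute name -> text
    context: dict = field(default_factory=dict)  # section name -> text


def split_trial(trial_id, record):
    """Split a flat record into key attributes and context sections."""
    key = {name: record[name] for name in KEY_ATTRIBUTES if name in record}
    ctx = {name: text for name, text in record.items()
           if name not in KEY_ATTRIBUTES}
    return TrialDocument(trial_id, key, ctx)


# Made-up example record.
record = {
    "title": "Drug A versus placebo in adults with asthma",
    "condition": "asthma",
    "intervention": "Drug A",
    "primary_outcome": "change in FEV1 at 12 weeks",
    "description": "Randomized, double-blind, two-arm study.",
    "eligibility": "Inclusion: age 18 or older. Exclusion: current smoker.",
}
doc = split_trial("TRIAL-0001", record)
```

Keeping the two groups separate makes it straightforward to encode each key attribute on its own while treating everything else as context.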
For medical knowledge integration, SciSpacy is used to extract UMLS entities from attribute texts. UMLS’s hierarchical and synonymic relations are leveraged to systematically define similar entities (e.g., drugs in the same class) and establish contrastive pairs for learning. This includes parent-child and synonym relationships, enabling the model to sample positive and negative examples at both document and entity levels (Wang et al., 2022).
3. Self-Supervised Contrastive Learning Framework
Trial2Vec jointly optimizes global and local InfoNCE contrastive losses for each batch of trial documents. The procedure can be summarized as follows:
(a) Generation of Global Contrastive Pairs
- Global positives: Constructed by randomly masking one key attribute of a trial $x$, forming $x^{+}$; encoding the remaining sections yields the positive embedding $h^{+}$.
- Global negatives: Derived by meta-swapping, wherein another trial $x'$ (sharing some but not all key attributes with $x$) is selected, and one of its key fields (e.g., intervention) is swapped into trial $x$, producing $x^{-}$ and its negative embedding $h^{-}$.
- In-batch negatives: Additional negatives are provided by embeddings of other trials within the same batch.
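The masking and meta-swapping steps can be sketched on plain dictionaries. This is an illustration under assumptions, not the authors' implementation: trials are represented as attribute-to-text mappings, and the helper names `make_global_positive` and `make_global_negative` are hypothetical.

```python
import random

# Illustrative attribute names only.
KEY_ATTRIBUTES = ("title", "condition", "intervention", "primary_outcome")


def make_global_positive(trial, rng):
    """Copy the trial with one randomly chosen key attribute masked out."""
    masked = rng.choice([k for k in KEY_ATTRIBUTES if k in trial])
    positive = {k: v for k, v in trial.items() if k != masked}
    return positive, masked


def make_global_negative(trial, donor, attribute):
    """Meta-swap: copy the trial, overwriting one key attribute with the donor's."""
    negative = dict(trial)
    negative[attribute] = donor[attribute]
    return negative


rng = random.Random(0)
# Made-up trials that share a condition but differ in intervention.
trial = {
    "title": "Drug A in asthma",
    "condition": "asthma",
    "intervention": "Drug A",
    "primary_outcome": "change in FEV1",
}
donor = {
    "title": "Drug B in asthma",
    "condition": "asthma",
    "intervention": "Drug B",
    "primary_outcome": "exacerbation rate",
}
positive, masked = make_global_positive(trial, rng)
negative = make_global_negative(trial, donor, "intervention")
```

The positive keeps most of the trial's content, so its embedding should stay close to the original; the negative looks superficially similar but differs in a clinically decisive field.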
(b) Generation of Local (Attribute-Level) Contrastive Pairs
- Local positives: For each attribute $a_k$, its UMLS entity $e$ is replaced with a synonym or parent concept, forming $a_k^{+}$ and yielding the positive local embedding $h_k^{+}$.
- Local negatives: Generated by removing $e$ or replacing it with an unrelated entity, giving the negative local embedding $h_k^{-}$.
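Entity-level substitution can be illustrated with a toy ontology. The sketch below is an assumption-laden stand-in: the `SYNONYMS` and `PARENTS` tables are hand-written examples rather than UMLS data, the helper names are hypothetical, and only the replacement variant of the negative is shown (not removal).

```python
import random

# Toy stand-in for UMLS relations; entries are illustrative, not real UMLS data.
SYNONYMS = {"acetaminophen": ["paracetamol"]}
PARENTS = {
    "acetaminophen": ["analgesic"],
    "ibuprofen": ["analgesic"],
    "insulin": ["hormone"],
}


def related_concepts(entity):
    """Synonyms and parent concepts of an entity."""
    return set(SYNONYMS.get(entity, [])) | set(PARENTS.get(entity, []))


def make_local_positive(text, entity, rng):
    """Replace the entity with a randomly chosen synonym or parent concept."""
    return text.replace(entity, rng.choice(sorted(related_concepts(entity))))


def make_local_negative(text, entity, vocabulary, rng):
    """Replace the entity with one that is neither related nor a sibling."""
    blocked = related_concepts(entity) | {entity}
    parents = set(PARENTS.get(entity, []))
    unrelated = sorted(
        e for e in vocabulary
        if e not in blocked and not parents & set(PARENTS.get(e, []))
    )
    return text.replace(entity, rng.choice(unrelated))


rng = random.Random(0)
text = "acetaminophen 500 mg twice daily"
vocabulary = ["acetaminophen", "ibuprofen", "insulin", "paracetamol"]
positive = make_local_positive(text, "acetaminophen", rng)
negative = make_local_negative(text, "acetaminophen", vocabulary, rng)
```

In this toy setup ibuprofen is excluded from the negatives because it shares a parent class with acetaminophen, mirroring the idea that drugs in the same class count as similar.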
(c) Loss Functions
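Both contrastive objectives use cosine similarity $\mathrm{sim}(\cdot,\cdot)$ with temperature $\tau$ (set to 0.05). In standard InfoNCE form, for a batch of $B$ trials with global embeddings $h_i$, masked-attribute positives $h_i^{+}$, and meta-swapped negatives $h_i^{-}$:
Global contrastive loss:
$$\mathcal{L}_{\text{global}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{B}\left[\exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right) + \exp\left(\mathrm{sim}(h_i, h_j^{-})/\tau\right)\right]}$$
Local contrastive loss, over the $K$ key attributes of each trial, with local embeddings $h_{i,k}$, entity-substituted positives $h_{i,k}^{+}$, and negatives $h_{i,k}^{-}$:
$$\mathcal{L}_{\text{local}} = -\frac{1}{BK}\sum_{i=1}^{B}\sum_{k=1}^{K} \log \frac{\exp\left(\mathrm{sim}(h_{i,k}, h_{i,k}^{+})/\tau\right)}{\exp\left(\mathrm{sim}(h_{i,k}, h_{i,k}^{+})/\tau\right) + \exp\left(\mathrm{sim}(h_{i,k}, h_{i,k}^{-})/\tau\right)}$$
Total objective: $\mathcal{L} = \mathcal{L}_{\text{global}} + \mathcal{L}_{\text{local}}$.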
This dual-level contrastive learning objective exploits both holistic document and fine-grained attribute relationships within and across trials (Wang et al., 2022).
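A batch-level InfoNCE loss with cosine similarity and temperature 0.05 can be sketched in plain Python. This is a generic formulation written for illustration, assuming every positive and negative in the batch acts as a candidate for each anchor; the function names are hypothetical and the exact normalization used by the authors may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, negatives, tau=0.05):
    """Mean InfoNCE loss over a batch.

    For anchor i the target is positives[i]; all other positives and all
    negatives in the batch serve as contrast candidates.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / tau for p in positives]
        logits += [cosine(anchor, n) / tau for n in negatives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy 2-dimensional embeddings (made up).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
loss_aligned = info_nce(anchors, positives, negatives)
loss_swapped = info_nce(anchors, negatives, positives)
```

When positives lie close to their anchors the loss is near zero; swapping the roles of positives and negatives makes it large, which is the gradient signal that pulls masked views together and pushes meta-swapped views apart.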
4. Embedding Architecture
(a) TrialBERT Backbone
Trial2Vec is built upon TrialBERT, which is initialized from BioBERT and then further pre-trained with masked language modeling (MLM) on the following corpora:
- ClinicalTrials.gov trial documents: ∼240M tokens (400k trials)
- Medical Encyclopedia articles: ∼3M tokens (4k articles)
- Wikipedia medical articles: ∼11M tokens
(b) Attribute and Context Encoding
For each key attribute $a_k$ ($k = 1, \dots, K$), a local embedding $h_k$ is computed with TrialBERT. The context section $c$ (all non-key sections) is encoded as $h_c$.
(c) Attention-Based Global Aggregation
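A global embedding is constructed by aggregating local (attribute) embeddings via multi-head attention, using the context embedding as the query and key attribute embeddings as keys/values: $h = \mathrm{MultiHeadAttn}\big(h_c,\ [h_1, \dots, h_K],\ [h_1, \dots, h_K]\big)$, where the arguments are the query, keys, and values, $h_c$ is the context embedding, and $h_1, \dots, h_K$ are the key attribute embeddings. This mechanism enables context-conditioned weighting of attribute embeddings, reflecting the varying clinical relevance of different sections (Wang et al., 2022).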
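The context-as-query aggregation can be illustrated with a simplified attention pool. The sketch below uses a single head with no learned projections, which is a deliberate simplification of multi-head attention; `attention_pool` is a hypothetical helper, and the embeddings are made-up toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(context, attributes):
    """Single-head scaled dot-product attention.

    The context embedding is the query; the attribute embeddings serve as
    both keys and values. No learned projections are applied.
    """
    scale = math.sqrt(len(context))
    scores = [sum(q * k for q, k in zip(context, attr)) / scale
              for attr in attributes]
    weights = softmax(scores)
    dim = len(attributes[0])
    pooled = [sum(w * attr[d] for w, attr in zip(weights, attributes))
              for d in range(dim)]
    return pooled, weights


context = [1.0, 0.0]
attributes = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
pooled, weights = attention_pool(context, attributes)
```

Attributes whose embeddings align with the context receive larger weights, so the pooled vector leans toward the sections most relevant to that trial's context.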
(d) Embedding Usage
- Local embeddings $h_k$ (one per key attribute): Supplied to the local contrastive loss.
- Global embedding $h$: Drives the global contrastive loss and is used for downstream retrieval.
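Once global embeddings are available, retrieval reduces to ranking trials by cosine similarity to the query. The following is a minimal brute-force sketch with made-up trial identifiers and 3-dimensional vectors; a real deployment would likely use a vector index, which is outside what the source describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, corpus, k=5):
    """Return the ids of the k trials most similar to the query embedding."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [trial_id for trial_id, _ in ranked[:k]]


# Hypothetical trial ids and toy embeddings.
corpus = {
    "T1": [0.9, 0.1, 0.0],
    "T2": [0.0, 1.0, 0.0],
    "T3": [0.7, 0.7, 0.0],
}
top = retrieve([1.0, 0.0, 0.0], corpus, k=2)
```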
5. Training Procedure
(a) Pre-training
- TrialBERT is pre-trained for 5 epochs with batch size 100 and learning rate [value missing] on the concatenated trial and medical corpora.
(b) Contrastive Fine-tuning
- Optimizer: AdamW, learning rate [value missing], weight decay [value missing].
- Batch size: 50, using 6 NVIDIA 2080 Ti GPUs.
- Training proceeds for 3–4 epochs, with early stopping based on validation performance.
(c) Core Hyperparameters
| Hyperparameter | Value |
|---|---|
| MLM epochs | 5 |
| Contrastive batch size | 50 |
| Temperature ($\tau$) | 0.05 |
| Fine-tuning epochs | 3–4 |
6. Empirical Evaluation
(a) Zero-Shot Similarity Search
Trial2Vec was evaluated using 1,600 labeled trial pairs (160 query trials × 10 TF-IDF candidate matches each). Expert raters determined relevance. Compared to BM25, Trial2Vec achieved substantial improvements:
| Metric | Trial2Vec | BM25 | Δ (pp) |
|---|---|---|---|
| Prec@1 | 0.88 | 0.70 | +18 |
| Prec@2 | 0.79 | 0.56 | +23 |
| Prec@5 | 0.51 | 0.42 | +9 |
| Rec@5 | 0.89 | 0.77 | +12 |
| nDCG@5 | 0.88 | 0.73 | +15 |
This demonstrates the efficacy of self-supervised, meta-structural contrastive learning for clinical trial retrieval, even in the zero-shot setting.
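The ranking metrics in the table can be computed from a list of binary relevance labels in ranked order. The sketch below uses standard textbook definitions and a made-up label list for one query with 10 candidates; it assumes recall is measured against the relevant items among those candidates, which the source does not state explicitly.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def recall_at_k(relevance, k):
    """Fraction of all relevant candidates that appear in the top k."""
    total = sum(relevance)
    return sum(relevance[:k]) / total if total else 0.0


def ndcg_at_k(relevance, k):
    """Normalized discounted cumulative gain with a log2 rank discount."""
    def dcg(labels):
        return sum(rel / math.log2(rank + 2)
                   for rank, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0


# Made-up relevance labels for one query, in ranked order (1 = relevant).
relevance = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p1 = precision_at_k(relevance, 1)
p5 = precision_at_k(relevance, 5)
r5 = recall_at_k(relevance, 5)
n5 = ndcg_at_k(relevance, 5)
```

Reported scores would then be averages of these per-query values over all query trials.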
(b) Downstream Task: Trial Outcome Prediction
A one-layer MLP classifier attached to the global trial embedding was applied to predict trial termination versus completion on 210k completed and 34k terminated trials:
| Model | Accuracy | ROC-AUC | PR-AUC |
|---|---|---|---|
| TF-IDF features | 0.857 | 0.719 | 0.296 |
| TrialBERT (no contrastive) | 0.856 | 0.728 | 0.311 |
| Trial2Vec | 0.862 | 0.733 | 0.314 |
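Trial2Vec scores highest on all three metrics, although the margins are small (for example, ROC-AUC of 0.733 versus 0.728 for TrialBERT and 0.719 for TF-IDF). This suggests that global trial embeddings offer a modest improvement in predicting trial status over both count-based and MLM-based baselines.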
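A classifier head over fixed trial embeddings can be sketched as a single linear layer with a sigmoid output, which is one reading of "one-layer MLP". The code below is an illustration under that assumption, trained by plain gradient descent on made-up 2-dimensional embeddings; the function names, hyperparameters, and data are hypothetical and unrelated to the reported results.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_classifier(embeddings, labels, epochs=200, lr=0.5):
    """Fit one linear layer with a sigmoid output by full-batch gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            # Gradient of the binary cross-entropy with respect to the logit.
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (terminated) if the predicted probability is at least 0.5."""
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5)


# Toy embeddings: label 1 = terminated, 0 = completed (values are made up).
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]
weights, bias = train_linear_classifier(embeddings, labels)
predictions = [predict(weights, bias, x) for x in embeddings]
```

Because the encoder is frozen in this sketch, any difference between embedding methods shows up purely through how separable the two classes are in embedding space.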
7. Interpretability and Representation Analysis
t-SNE visualizations of 2,000 random global embeddings reveal well-separated clusters corresponding to disease categories; for example, breast cancer and depression trials form distinct groups. Analysis of attention weights from the multi-head attention aggregator shows model focus aligns with clinical intuition: for oncology trials, higher weights are assigned to "condition" and "outcome" embeddings, whereas for eligibility-focused protocols, the "eligibility criteria" embedding dominates. Case studies demonstrate that retrieval often surfaces appropriate historical trials with the same intervention, while keyword-based baselines retrieve less clinically relevant matches (Wang et al., 2022).
Taken together, Trial2Vec offers a principled framework for encoding clinical trial protocols into interpretable, multi-aspect vectors, with reported benefits for both similarity search and trial outcome prediction.