- The paper shows that RAG-based LLMs significantly improve the matching of patients to complex trial criteria drawn from clinical narratives.
- It compares encoder and decoder models, demonstrating that domain-specific fine-tuning enhances micro-F1 and AUROC performance.
- The study reveals that dynamic evidence retrieval reduces token input while maintaining high accuracy for patient eligibility decisions.
Improving Clinical Trial Recruitment using Clinical Narratives and LLMs
Introduction
Automating clinical trial recruitment remains a critical challenge in the biomedical informatics community due to the complexity and scale of longitudinal electronic health records (EHRs), which increasingly comprise unstructured clinical narratives. Matching patients to inclusion and exclusion criteria for trials is resource-intensive and error-prone, with manual review processes often missing a significant fraction of eligible candidates. Traditional rule-based and machine learning approaches rely extensively on structured data, limiting their applicability as criteria grow in sophistication and come to require cross-document and temporal reasoning. Large language models (LLMs) present a promising advance, given their contextual understanding and reasoning capabilities over free text. This paper presents a systematic comparison of encoder and decoder LLMs, adapted with domain-specific and general weights, across three context strategies: direct long-context modeling, NER-based extractive summarization, and retrieval-augmented generation (RAG), on the clinical trial recruitment task using the 2018 N2C2 dataset (2604.05190).
Methods
The study frames patient eligibility determination as a multi-label classification problem, mapping a patient's longitudinal narratives and specific trial criteria to binary eligibility decisions. The N2C2 2018 Track 1 dataset is used, encompassing 288 diabetic patients annotated against 13 real-world criteria related to cardiovascular and diabetes risk. Each patient typically has several lengthy notes, and the average token count of the full record exceeds the native context limits of conventional transformer architectures.
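As a concrete illustration of this framing, the sketch below represents eligibility as a patient-by-criterion binary matrix and computes the micro-F1 and AUROC metrics used throughout the results. The criterion subset and all label/score values are fabricated for demonstration, not drawn from the actual N2C2 annotations.

```python
# Toy illustration of the multi-label eligibility framing; labels and scores
# are fabricated for demonstration, not drawn from the N2C2 dataset.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

CRITERIA = ["CREATININE", "DRUG-ABUSE", "MI-6MOS"]  # 3 of the 13 criteria

# Rows = patients, columns = criteria; 1 = "met", 0 = "not met".
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0]])
# Hypothetical model confidence scores in [0, 1], used for AUROC.
y_score = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.4],
                    [0.7, 0.9, 0.2]])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("AUROC   :", roc_auc_score(y_true, y_score, average="micro"))
# Per-criterion F1, as in the paper's per-criterion analysis:
print(dict(zip(CRITERIA, f1_score(y_true, y_pred, average=None))))
```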
Model Architectures and Adaptation
Both encoder-based (BERT, GatorTron-2k) and decoder-based (Llama 3.1 8B, MedGemma-27B, GPT-OSS 20B) LLMs are assessed. Encoder models are fully fine-tuned, capitalizing on manageable parameter counts, while parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) is employed for larger decoder LLMs to optimize memory and training efficiency without the overhead of updating billions of parameters.
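A minimal sketch of such a PEFT setup with Hugging Face's peft library is shown below; the rank, scaling factor, and target modules are illustrative assumptions, not the paper's reported hyperparameters.

```python
# LoRA fine-tuning sketch via Hugging Face peft; hyperparameters are assumed.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Only the low-rank adapters are trainable, typically well under 1% of weights.
model.print_trainable_parameters()
```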
Context Management Strategies
Three strategies are examined:
- Original long-context: Directly feeding the maximum-possible contiguous text windows to LLMs, subject to their architectural limits.
- NER-based extractive summarization: Utilizing a GatorTron-based clinical NER system to extract and concatenate sentences rich in clinically relevant entities (e.g., problems, treatments, tests), prioritizing recent history and truncating to model limits.
- Retrieval-Augmented Generation: For each criterion, clinical note chunks are embedded with BAAI/bge encoders, and the top-k (k=10) most criterion-relevant chunks by cosine similarity are passed to the LLM alongside the criterion prompt (a minimal sketch follows below).
Instruction-following prompt engineering enforces binary ("met" / "not met") responses, reducing stochasticity in model output.
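The following sketch shows the per-criterion retrieval step and a binary-response prompt, assuming the sentence-transformers library and a BGE embedding model; the chunking granularity, the choice of k, and the prompt wording are illustrative, not the paper's exact pipeline.

```python
# Per-criterion RAG sketch: embed note chunks, retrieve top-k by cosine
# similarity, and build an instruction prompt that forces a binary answer.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def retrieve_chunks(note_chunks: list[str], criterion: str, k: int = 10) -> list[str]:
    # With normalized embeddings, the dot product equals cosine similarity.
    # (BGE also recommends a query instruction prefix; omitted for brevity.)
    chunk_emb = encoder.encode(note_chunks, normalize_embeddings=True)
    query_emb = encoder.encode([criterion], normalize_embeddings=True)[0]
    scores = chunk_emb @ query_emb
    top = np.argsort(scores)[::-1][:k]
    return [note_chunks[i] for i in top]

def build_prompt(criterion: str, evidence: list[str]) -> str:
    context = "\n---\n".join(evidence)
    return (
        f"Criterion: {criterion}\n"
        f"Patient evidence:\n{context}\n\n"
        "Based only on the evidence above, decide whether the criterion is met. "
        "Answer with exactly 'met' or 'not met'."
    )
```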
Results
Token Reduction and Compute Efficiency
The strategies achieve significant input compression relative to the original context (mean 5,290 tokens): NER-based extraction reduces the mean input to 2,955 tokens (roughly 44% smaller), and RAG to 1,403 tokens (roughly 73% smaller). This suggests substantial latent redundancy in raw clinical narratives, and highlights the importance of strategic evidence localization, particularly when computational and latency resources are constrained.
MedGemma-27B, when paired with RAG (BAAI-large embeddings), achieves the highest micro-F1 (0.8905) and AUROC (0.8922), outperforming all encoder-based models (BERT: micro-F1 0.7240, AUROC 0.7260) and the general-purpose decoders. MedGemma-27B's direct long-context strategy also yields competitive results (micro-F1 0.8849), underscoring the advantage of domain adaptation.
Decoder-based LLMs consistently outperform encoder-based architectures across all strategies, particularly for trial criteria that demand multi-document reasoning. RAG boosts performance most for criteria hinging on sparse and temporally distributed evidence, such as ALCOHOL-ABUSE, DRUG-ABUSE, and ENGLISH, with per-criterion F1 improvements of over 40% versus the NER baseline. Criteria derivable from structured or short-context data (e.g., CREATININE, MI-6MOS) only show incremental improvement from advanced context management.
Context Management Trade-offs
The NER-based summarization strategy, though effective at reducing sequence length, incurs notable information loss, especially for criteria whose evidence is diffused across notes. For Llama 3.1 8B, NER filtering reduces micro-F1 from 0.8671 (long-context) to 0.8293, indicating that entity-driven filtering alone is less robust than targeted RAG.
RAG's synergy with generative LLMs can be attributed to its dynamic selection of pertinent evidence, which improves signal relevance and interpretability while significantly reducing the computational cost of long-context processing.
Implications
The findings have both pragmatic and theoretical implications. Practically, RAG-augmented generative LLMs such as MedGemma-27B enable near state-of-the-art automation in patient screening, approaching the performance of the previous best rule-based systems (micro-F1 ≈ 0.91) without extensive domain engineering and at substantially reduced computational cost. The cost-performance trade-off is especially compelling for real-world EHRs involving millions of notes and highly imbalanced eligibility classes.
From a methodological perspective, the results emphasize:
- The critical importance of domain-specific adaptation for both encoder and decoder LLMs in medical settings.
- The limits of simplistic summarization (NER-driven) and heuristic truncation in capturing complex eligibility logic.
- The necessity for advanced retrieval pipelines—beyond naïve chunking or static filtering—to localize temporally and semantically dispersed evidence for criteria requiring long-range inference.
The study further strengthens the theoretical framing of clinical trial recruitment as a test-bed for robust, interpretable clinical NLP systems, with the potential for generalization to allied tasks such as automated chart review, large-scale cohort selection, and clinical decision support.
Future Directions
While this work benchmarks on the N2C2 dataset, extending evaluation to more diverse and complex EHR corpora is crucial for external validity. Future work should explore scalable agentic LLMs and more sophisticated retrieval and reasoning paradigms, possibly integrating knowledge-graph grounding, causal inference, and real-time adaptation to evolving trial criteria. Attention to cost, latency, and interpretability remains paramount for clinical deployment.
Conclusion
This paper demonstrates that RAG-enhanced, domain-adapted generative LLMs substantially advance the automation of clinical trial recruitment from longitudinal clinical narratives, efficiently overcoming limitations of context window size and input redundancy. Decoder-based medical LLMs, such as MedGemma-27B, consistently outperform both encoder architectures and general-purpose LLMs, especially when equipped with dynamic retrieval pipelines. The empirical results provide a compelling case for RAG-based LLM adoption in large-scale, real-world trial screening workflows, and set a foundation for continued development of robust, computationally efficient AI systems for healthcare informatics.