CohortGPT: LLM-Enhanced Clinical Cohort Recruitment
- CohortGPT is a framework that applies transformer models, dynamic chain-of-thought prompting, and clinical knowledge graphs to automate disease classification and patient recruitment.
- It integrates hierarchical knowledge and policy-gradient optimized exemplar selection to enhance few-shot performance in interpreting unstructured clinical texts.
- Experimental results show significant improvements over traditional fine-tuned models, reducing manual effort and optimizing recruitment in clinical cohort studies.
CohortGPT refers to a family of architectures and methodologies that leverage generative pre-trained transformer (GPT) models, together with enhancement strategies, for cohort-level analysis, knowledge-intensive classification, and participant recruitment, primarily in clinical settings but with implications for other domains. The representative work is the CohortGPT framework for few-shot medical text classification, which supports automated participant identification when establishing cohorts for clinical research (Guan et al., 2023).
1. Clinical Cohort Recruitment via LLMs
CohortGPT is designed to automate the recruitment of participants for clinical research cohorts through the classification of unstructured clinical narratives (e.g., radiology reports, clinical notes) into disease labels. Manual parsing of such records is labor-intensive and fine-tuning LLMs for each cohort or task is typically limited by small labeled datasets in the clinical domain. CohortGPT instead exploits in-context learning in LLMs such as ChatGPT or GPT‑4, augmented with explicit domain knowledge and dynamic prompt selection, to enable accurate disease labeling with minimal supervision.
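The in-context-learning setup described above can be sketched as a prompt assembled from three parts: injected domain knowledge, worked exemplars, and the target report. The wording of the template, the label names, and the helper `build_prompt` are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of a CohortGPT-style few-shot classification prompt.
# The template wording and example labels are assumptions for illustration.

def build_prompt(report: str, knowledge: str,
                 exemplars: list[tuple[str, str]]) -> str:
    """Assemble a prompt: domain knowledge first, then chain-of-thought
    exemplars, then the target report awaiting its labels."""
    parts = [f"Domain knowledge:\n{knowledge}", ""]
    for ex_report, ex_reasoning in exemplars:
        parts.append(f"Report: {ex_report}\nReasoning and labels: {ex_reasoning}\n")
    parts.append(f"Report: {report}\nReasoning and labels:")
    return "\n".join(parts)

prompt = build_prompt(
    report="Mild cardiomegaly. Lungs clear.",
    knowledge="Cardiomegaly is a finding of the heart.",
    exemplars=[("No acute disease.", "No findings -> label: Normal")],
)
```

The resulting string would be sent to an LLM API (e.g., ChatGPT or GPT‑4), with the model's completion parsed into disease labels.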
A core motivation is overcoming the inability of general-purpose LLMs to reason about domain-specific information that is implicit or context-dependent in medical texts, and to function robustly in settings with scarce labeled data.
2. Methodological Components
CohortGPT integrates several key strategies for knowledge-intensive text classification in clinical settings:
A. Clinical Knowledge Graph (KG) Integration:
- A hierarchical KG captures relationships among diseases and their anatomical or tissue associations.
- Three encoding strategies are explored for introducing KG information into the LLM prompt: KG-as-Tree (markdown-like headers), KG-as-Relation (triplet facts), and KG-as-Rule (natural-language rules relating diseases and anatomical units). KG-as-Rule is empirically preferred.
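The three encodings can be illustrated on a toy knowledge graph. The graph contents and exact phrasings below are assumptions for demonstration, not the paper's actual KG.

```python
# Illustrative sketch of the three KG-to-prompt encodings.
# The toy graph maps anatomical units to associated disease labels.
KG = {"Lung": ["Pneumonia", "Atelectasis"], "Heart": ["Cardiomegaly"]}

def kg_as_tree(kg: dict) -> str:
    # Markdown-like headers: anatomy as a header, diseases nested beneath.
    lines = []
    for organ, diseases in kg.items():
        lines.append(f"## {organ}")
        lines.extend(f"- {d}" for d in diseases)
    return "\n".join(lines)

def kg_as_relation(kg: dict) -> str:
    # Triplet facts of the form (disease, located_in, anatomy).
    return "\n".join(
        f"({d}, located_in, {organ})" for organ, ds in kg.items() for d in ds
    )

def kg_as_rule(kg: dict) -> str:
    # Natural-language rules, the variant reported to work best.
    return "\n".join(
        f"If the report describes the {organ.lower()}, consider labels: "
        + ", ".join(ds)
        for organ, ds in kg.items()
    )
```

Any one of these strings would be prepended to the classification prompt as the knowledge section.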
B. Dynamic Chain-of-Thought (CoT) Prompting:
- Classification prompts are constructed using “chain-of-thought” examples, which show detailed, stepwise reasoning to arrive at correct disease labels from input reports.
- For every target report, a set of CoT exemplars is dynamically selected from a candidate pool by a policy neural network.
- Selection is optimized by policy-gradient reinforcement learning, maximizing the expected reward $J(\theta) = \mathbb{E}_{S \sim \pi_\theta}[R(S)]$, where the reward $R$ quantifies labeling performance for the prompt built from the selected exemplars $S$.
C. Policy-Gradient CoT Selection:
- Since LLM gradients are not directly accessible, the policy gradient is estimated with the score-function (REINFORCE) estimator, $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} R(S_i)\,\nabla_\theta \log \pi_\theta(S_i)$, where rewards strongly penalize incorrect labels and modestly reward correct ones.
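The score-function update can be sketched on a toy problem where the policy picks a single exemplar index. The reward values (+1 for correct, -2 for incorrect), learning rate, and the stand-in `reward_fn` are assumptions; in CohortGPT the reward would come from scoring the LLM's labels against ground truth.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.1, rng=random):
    """Sample one exemplar index from the policy, observe a reward, and
    apply the REINFORCE update: grad log pi(a) = one_hot(a) - probs."""
    probs = softmax(logits)
    a = rng.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(a)
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * r * grad
    return a, r

# Assumed toy reward: exemplar 0 yields correct labels (+1), others (-2).
reward = lambda a: 1.0 if a == 0 else -2.0
logits = [0.0, 0.0, 0.0]
random.seed(0)
for _ in range(500):
    reinforce_step(logits, reward)
# After training, the policy should concentrate on exemplar 0.
```

In the full method the policy would select a set of exemplars conditioned on the target report, but the update rule is the same score-function estimator.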
These mechanisms collectively facilitate high-quality, context-aware disease classification from text, especially in the few-shot learning regime.
3. Quantitative Performance and Experimental Setup
CohortGPT is evaluated on two real-world datasets:
- IU-RR: 3,955 radiology reports annotated with multiple disease labels.
- MIMIC-CXR: A subset of 1,808 chest radiograph reports drawn from the public MIMIC-CXR database.
Baselines include:
- Domain-adapted transformer fine-tuned models (BioBERT, BioGPT)
- Prompted LLMs: ChatGPT, GPT-4, Alpaca, BloomZ
Key metrics reported are Exact Match Ratio (EMR), macro/micro-F1 score, precision/recall, and Hamming loss.
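Two of these multi-label metrics are easy to state precisely. The sketch below, with illustrative label vectors, computes Exact Match Ratio (the fraction of reports whose full label vector is predicted exactly) and Hamming loss (the fraction of individual label slots predicted wrongly).

```python
# Minimal sketch of two reported multi-label metrics over binary
# label vectors. The example predictions are illustrative only.

def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose entire label vector matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    errors = sum(
        tl != pl for t, p in zip(y_true, y_pred) for tl, pl in zip(t, p)
    )
    return errors / (len(y_true) * len(y_true[0]))

y_true = [[1, 0, 1], [0, 1, 0]]  # two reports, three disease labels
y_pred = [[1, 0, 1], [0, 1, 1]]  # second report has one wrong slot
# EMR = 0.5 (one of two reports matched exactly);
# Hamming loss = 1/6 (one wrong slot out of six).
```

EMR is the strictest of the reported metrics, since a single wrong label on a report counts the whole report as a miss.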
| Model | Few-shot F1 (IU-RR) | EMR |
|---|---|---|
| BioBERT | 0.44 | low |
| BioGPT | 0.25 | low |
| ChatGPT | 0.69 (CoT+KG) | — |
| GPT‑4 | 0.81 (CoT+KG) | — |
CohortGPT substantially outperforms fine-tuned domain models (BioBERT, BioGPT) on all metrics in data-scarce settings. The best results are achieved with the KG-as-Rule knowledge prompt and policy-optimized CoT selection. Sensitivity analysis shows the strongest accuracy with a small number of CoT exemplars per prompt (up to $8$) and candidate pools of around 25, balancing context diversity against prompt length.
4. Ablations and Knowledge Integration
Ablation studies reveal:
- KG-as-Rule is superior to KG-as-Tree or KG-as-Relation for guiding LLM reasoning, likely due to the interpretability and explicitness of natural language rules.
- Dynamically optimized CoT selection consistently outperforms random, manual, and similarity-based selection, confirming the benefit of reinforcement learning in prompt construction.
- CoT sample pool size, number of shots, and candidate CoT diversity are critical factors for optimal few-shot performance.
5. Practical Implications and Deployment
CohortGPT enables rapid, high-precision labeling of unstructured clinical texts for recruitment or cohort definition with drastically reduced manual effort. This is especially valuable when annotated data is expensive or infeasible to collect at scale. The methodology is inherently compatible with LLM APIs (e.g., ChatGPT, GPT‑4) but is also extensible to open-source LLMs (e.g., Alpaca, Vicuna) for deployment in protected data environments.
The codebase, along with sample data and instructions, is publicly available for reproducibility and extension: https://anonymous.4open.science/r/CohortGPT-4872/
6. Extensions and Future Directions
Possible avenues for further research and application include:
- Extension to other clinical natural language processing tasks such as diagnosis, prognosis, and automated decision support.
- Incorporation of more granular or multi-modal domain knowledge (e.g., integrating biomedical ontologies or imaging features).
- More refined reward functions or feedback strategies in CoT selection, potentially incorporating direct signal from clinical adjudicators.
- Adaptation of the methodology for large-scale, privacy-preserving, or federated settings, akin to cohort-parallel learning strategies (Dhasade et al., 2024).
A plausible implication is that similar techniques could generalize beyond clinical text, supporting robust cohort identification in domains such as epidemiology, pharmacovigilance, and social sciences.
7. Comparison with Related Methods
CohortGPT differs from conventional fine-tuning approaches by prioritizing few-shot, context-enhanced, and knowledge-augmented prompting. Unlike supervised models that demand extensive labeled datasets per task, it achieves strong performance with minimal labeled examples by dynamically curating and integrating relevant knowledge and reasoning exemplars. This design both reduces annotation burden and enables rapid adaptation to new recruitment or labeling criteria, thus addressing a major bottleneck in evidence-based cohort construction for clinical trials and observational studies.
In summary, CohortGPT represents a robust framework for medical cohort recruitment via LLMs, leveraging structured domain knowledge and reinforcement-optimized reasoning chains to maximize performance under few-shot and knowledge-intensive conditions (Guan et al., 2023).