Effective Prompt Design for LLM Classification of Clinical Trial Abstracts

Identify prompt structures that yield high‑accuracy classification of biomedical abstracts into clinical trial versus non‑trial categories when using proprietary large language models such as GPT‑3.5 and GPT‑4, and derive task‑appropriate best‑practice guidelines for prompt design in this setting.

Background

The authors use GPT‑3.5 and GPT‑4 to label PubMed abstracts as clinical trials or non‑trials, then distill these labels into fine‑tuned open‑source models. Before iterating on prompts, they note the absence of established practices for designing effective prompts for this classification task.

Given the nascent state of prompt engineering research and the variability in model behavior across prompts, identifying robust prompt structures and generalizable design principles remains an open problem. The authors iteratively test several prompt formats but indicate that the broader best‑practice question is still unsettled.
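The prompt‑iteration workflow implied above can be sketched minimally: build a candidate classification prompt around an abstract, then parse the model's free‑form completion back into the binary label scheme. The prompt wording, label tokens (`TRIAL` / `NOT_TRIAL`), and parsing rules below are illustrative assumptions, not the authors' actual prompt.

```python
def build_prompt(abstract: str) -> str:
    """Format a binary-classification prompt for an instruction-tuned LLM.

    This wording is a hypothetical example of the kind of prompt one
    would iterate over, not the prompt used in the paper.
    """
    return (
        "You are classifying PubMed abstracts.\n"
        "Answer with exactly one word: TRIAL if the abstract reports a "
        "clinical trial, or NOT_TRIAL otherwise.\n\n"
        f"Abstract:\n{abstract}\n\nAnswer:"
    )


def parse_label(raw: str) -> str:
    """Map a free-form model completion onto the two-label scheme."""
    text = raw.strip().upper()
    # Check the negative label first so "NOT_TRIAL" is never mistaken
    # for "TRIAL" by a prefix match.
    if text.startswith("NOT_TRIAL") or text.startswith("NOT TRIAL"):
        return "non-trial"
    if text.startswith("TRIAL"):
        return "trial"
    return "unparsed"  # flags responses that ignore the output format
```

In a real pipeline, `build_prompt` output would be sent to the proprietary model's API, and the `"unparsed"` bucket would signal that a candidate prompt format fails to elicit the constrained answer, which is one concrete criterion for comparing prompt structures during iteration.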

References

A priori, it is not clear what prompt structure will work well. The relative infancy of this area of research renders it difficult to identify a set of "best practices."

Counting Clinical Trials: New Evidence on Pharmaceutical Sector Productivity (2405.08030 - Durvasula et al., 12 May 2024) in Subsection “Prompt Design” (Section 2, Model Distillation)