Improving LLMs for Clinical Named Entity Recognition via Prompt Engineering
The paper "Improving LLMs for Clinical Named Entity Recognition via Prompt Engineering" presents a substantial evaluation of generative pre-trained transformers, specifically GPT-3.5 and GPT-4, in the context of clinical Named Entity Recognition (NER). Recognizing the importance of precise information extraction from Electronic Health Records (EHRs), the paper proposes a task-specific prompt engineering framework to enhance the performance of these LLMs.
Study Outline and Methods
The research addresses two clinical NER tasks: extracting entities such as medical problems, treatments, and tests from clinical notes in the MTSamples corpus, and identifying nervous system disorder-related adverse events in safety reports from VAERS (the Vaccine Adverse Event Reporting System). As a baseline, the study uses BioClinicalBERT, a domain-specific BERT model, representing the conventional supervised fine-tuning approach.
To improve GPT performance, the authors devised a systematic approach to prompt engineering comprising the following components (a sketch of how they combine appears after the list):
- Baseline Prompts: Providing initial task descriptions and output format specifications.
- Annotation Guidelines: Incorporating entity definitions and rules from established guidelines to steer the model's entity recognition.
- Error Analysis-Based Instructions: Refining prompts based on a retrospective analysis of the model's output on training data.
- Annotated Samples for Few-Shot Learning: Providing annotated example texts to help the model recognize context-specific entities.
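The paper's exact prompt wording is not reproduced here; the sketch below only illustrates the additive structure of the four components. All prompt strings, guideline text, and the `build_prompt` helper are hypothetical placeholders, not the authors' actual prompts.

```python
# Minimal sketch of an additive prompt-assembly scheme in the spirit of the
# paper. All prompt text is illustrative placeholder wording, not the
# authors' actual prompts.

BASELINE = (
    "Extract medical problems, treatments, and tests from the clinical note "
    "below. Return one entity per line as: <entity text> | <entity type>."
)

GUIDELINES = (
    "Annotation guidelines: a 'problem' is a sign, symptom, or diagnosis; "
    "a 'test' is a procedure performed to obtain diagnostic information; "
    "a 'treatment' is a medication or therapeutic procedure."
)

ERROR_INSTRUCTIONS = (
    "Common errors to avoid: do not label medication routes (e.g., 'oral') "
    "as treatments; include modifiers such as laterality in the entity span."
)

FEW_SHOT_EXAMPLES = (
    "Example note: 'Patient denies chest pain; started on lisinopril.'\n"
    "Expected output:\n"
    "chest pain | problem\n"
    "lisinopril | treatment"
)

def build_prompt(note: str,
                 use_guidelines: bool = True,
                 use_error_instructions: bool = True,
                 use_examples: bool = True) -> str:
    """Concatenate the enabled prompt components ahead of the input note,
    mirroring the incremental baseline -> guidelines -> error instructions
    -> few-shot progression described in the paper."""
    parts = [BASELINE]
    if use_guidelines:
        parts.append(GUIDELINES)
    if use_error_instructions:
        parts.append(ERROR_INSTRUCTIONS)
    if use_examples:
        parts.append(FEW_SHOT_EXAMPLES)
    parts.append(f"Clinical note:\n{note}")
    return "\n\n".join(parts)
```

Toggling the flags one at a time reproduces the kind of ablation-style comparison that lets each component's contribution be measured separately.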
Results
Empirical testing demonstrates that incorporating all prompt components significantly improves model performance. With the full prompt, relaxed-match F1 on MTSamples reached 0.794 for GPT-3.5 and 0.861 for GPT-4; on VAERS, the corresponding scores were 0.676 and 0.736. Although neither GPT model outperformed BioClinicalBERT, which achieved a relaxed F1 of 0.901 on MTSamples, GPT-4's results are competitive, showcasing strong potential with minimal annotated training data.
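The paper distinguishes exact-match from relaxed-match scoring. A common relaxed criterion counts a prediction as correct when it overlaps a gold entity of the same type; the sketch below implements that convention over character-offset spans, though the paper's precise matching rules may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, entity_type)

def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share the same type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def relaxed_f1(gold: List[Span], pred: List[Span]) -> float:
    """Relaxed-match F1: a predicted span is a true positive if it overlaps
    any gold span of the same entity type (one common convention)."""
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    tp_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Under exact matching, `overlaps` would be replaced by strict span equality, which is why boundary errors (discussed below) depress exact-match scores far more than relaxed ones.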
Discussion
The paper suggests that LLMs such as GPT-3.5 and GPT-4 can benefit substantially from well-designed prompt engineering strategies, particularly in high-precision domains such as medicine. By reducing dependence on large annotated datasets, these techniques offer a less resource-intensive path toward deploying NLP in clinical settings. The authors advocate integrating domain knowledge into LLM prompts, leveraging the emergent abilities of LLMs to handle diverse clinical tasks.
Observed limitations include weaker performance under exact-match criteria, driven largely by difficulty detecting the exact boundaries of clinical entities. The authors also invite further exploration of more sophisticated few-shot strategies, such as chain-of-thought prompting, which may further improve LLM performance on intricate NER tasks.
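The paper leaves chain-of-thought prompting to future work, so the following is only a hypothetical illustration of what a reasoning-annotated few-shot example for clinical NER might look like; the note text, reasoning, and guideline claim are all invented for illustration.

```python
# Hypothetical chain-of-thought few-shot example for clinical NER.
# The paper proposes this direction as future work but does not evaluate it.
COT_EXAMPLE = """\
Note: 'MRI of the brain showed no acute infarct.'
Reasoning: 'MRI of the brain' is a diagnostic procedure, so it is a test.
'acute infarct' is a finding; although it is negated, many clinical NER
schemes still annotate negated findings as problems.
Answer:
MRI of the brain | test
acute infarct | problem
"""
```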
Future Implications
This work highlights the importance of prompt engineering in adapting LLMs to specialized domains and lays the groundwork for cost-effective development methodologies in clinical NLP. As LLMs continue to evolve, combining prompt engineering with advanced model architectures could enable broader applications, improving adaptability and performance without requiring expansive, domain-specific annotated datasets. Future research should also work toward standardized evaluation metrics for LLM-based NER systems, to gauge their capabilities accurately and to refine prompt design for diverse biomedical applications.