LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs (2505.08704v2)

Published 13 May 2025 in cs.AI and cs.CL

Abstract: Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using LLMs, specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting.

LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs

The paper explores a prompt-based approach that uses LLMs, specifically GPT-4o and DeepSeek-R1, for medical entity recognition from Electronic Health Records (EHRs). The authors introduce a prompt-ensemble methodology aimed at improving the reliability and performance of named entity recognition (NER) in the healthcare domain. The work addresses the challenges posed by unstructured clinical text in EHRs by using prompt engineering, rather than fine-tuning, to extract key entities such as problems, tests, and treatments.

Key Findings and Results

Employing several prompt engineering strategies, including zero-shot, few-shot, and an ensemble approach, the study reports the following findings:

  • Performance Metrics: GPT-4o with the prompt ensemble achieved the strongest results, with an F1-score of 0.95 and a recall of 0.98, outperforming DeepSeek-R1 on the same task. GPT-4o also showed higher recall and more consistent behavior across the prompting strategies.
  • Prompt Ensemble: The ensemble method aggregates outputs from multiple prompt formats using embedding-based similarity coupled with majority voting (a minimal sketch of this aggregation follows the list). This aggregation helps mitigate label noise and hallucinations, yielding a more reliable recognition framework.
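The paper does not reproduce its implementation of this aggregation step; the following is a minimal sketch of one way to realize embedding-based alignment plus majority voting. The embedding model, the similarity threshold, and the helper names are illustrative assumptions, not the authors' code.

```python
# Sketch: align entity mentions from several prompt variants by embedding
# similarity, then keep only majority-supported clusters with a voted label.
from collections import Counter

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
SIM_THRESHOLD = 0.85  # assumed cutoff for treating two mentions as the same entity


def aggregate(prompt_outputs):
    """Merge predictions from several prompt variants.

    prompt_outputs: one entry per prompt variant, each a list of
    (mention, label) pairs returned by the LLM.
    """
    clusters = []  # each: {"mention": str, "labels": [...], "embedding": tensor}
    for output in prompt_outputs:
        for mention, label in output:
            emb = embedder.encode(mention, convert_to_tensor=True)
            # Attach the mention to its most similar existing cluster, if close enough.
            best, best_sim = None, 0.0
            for cluster in clusters:
                sim = util.cos_sim(emb, cluster["embedding"]).item()
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= SIM_THRESHOLD:
                best["labels"].append(label)
            else:
                clusters.append({"mention": mention, "labels": [label], "embedding": emb})

    # Majority voting: keep a cluster only if more than half of the prompt
    # variants predicted it, and assign its most frequent label.
    final = []
    for cluster in clusters:
        if len(cluster["labels"]) > len(prompt_outputs) / 2:
            label, _ = Counter(cluster["labels"]).most_common(1)[0]
            final.append((cluster["mention"], label))
    return final
```

Requiring each mention cluster to be supported by a majority of prompt variants is what filters out one-off hallucinated entities while retaining consistently predicted ones.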

Methodology

The methodology relies on structured prompt templates that guide the LLMs to identify and classify medical entities without any fine-tuning. Three prompting strategies were examined:

  • Zero-shot Prompting: This baseline approach provides the model with entity definitions and task instructions, relying on its pretrained capabilities.
  • Few-shot Prompting: This variation adds examples at the document, sentence, and individual-entity level, letting the model learn from in-context demonstrations at several levels of granularity (an illustrative prompt sketch follows this list).
  • Prompt Ensemble: By combining outputs from different few-shot configurations, the ensemble further refines entity predictions through similarity-based alignment and voting.
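As a concrete illustration of the zero-shot and few-shot variants, the sketch below queries GPT-4o through the OpenAI chat-completions API. The prompt wording, label framing, and example sentence are assumptions for demonstration, not the paper's exact templates.

```python
# Sketch: zero-shot and few-shot prompt variants for clinical NER with GPT-4o.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = (
    "You are a clinical NER system. Extract every medical entity from the text "
    "and label it as PROBLEM, TEST, or TREATMENT. "
    'Return a JSON list of {"entity": ..., "label": ...} objects.'
)

# Few-shot variant: same instructions plus an in-context example (illustrative).
FEW_SHOT = ZERO_SHOT + (
    "\n\nExample:\n"
    "Text: The patient was started on metformin after an elevated HbA1c.\n"
    'Entities: [{"entity": "metformin", "label": "TREATMENT"}, '
    '{"entity": "HbA1c", "label": "TEST"}]'
)


def extract_entities(text: str, instructions: str = ZERO_SHOT, model: str = "gpt-4o") -> str:
    """Send one prompt variant to the model and return its raw JSON answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{instructions}\n\nText: {text}"}],
        temperature=0,
    )
    return response.choices[0].message.content
```

Running several few-shot variants of `extract_entities` and passing the parsed outputs to an aggregation step like the one sketched under Key Findings mirrors the ensemble design described here.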

Practical Implications

The robustness of the ensemble approach improves extraction reliability in clinical settings, with potential benefits for decision support systems, real-world evidence generation, and disease surveillance. The embedding-based ensemble methodology also offers an alternative route to high-precision medical text annotation with state-of-the-art LLMs that requires no task-specific fine-tuning.

Speculations on Future Developments in AI

The results suggest a continued evolution in AI's ability to extract structured data from large volumes of unstructured EHR text. Future work may focus on refining prompt engineering strategies, improving inference efficiency, and integrating context-aware models more deeply for even more accurate entity classification.

Conclusion

Overall, this paper provides valuable insights into leveraging LLMs through prompt-based learning for clinical NER. The methodology and findings underscore the evolving role of AI in effectively addressing the inherent complexities within EHRs, further establishing the foundational strategies for future research in medical NLP applications.

Authors (4)
  1. K M Sajjadul Islam (4 papers)
  2. Ayesha Siddika Nipu (5 papers)
  3. Jiawei Wu (43 papers)
  4. Praveen Madiraju (4 papers)