
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models (2502.14302v1)

Published 20 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Advancements in LLMs and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.

MedHallu: A Benchmark for Detecting Medical Hallucinations in LLMs

The paper introduces MedHallu, an innovative benchmark designed to assess the ability of LLMs to detect hallucinations in medical question-answering tasks. As LLMs like GPT-4 and Llama gain prominence in providing solutions across various domains, their tendency to produce hallucinations (outputs that appear plausible but are factually incorrect) presents a significant risk, particularly in the medical field, where the consequences of inaccuracies can be severe.

Key Contributions

MedHallu is structured to evaluate the hallucination detection capacity of LLMs through a series of intricately designed challenges. It comprises 10,000 high-quality question-answer pairs derived from PubMedQA, each annotated to identify hallucinations. Hallucinated answers are generated through a controlled pipeline, ensuring that both "easy" and "hard" hallucinations are present. The dataset categorizes hallucinations into four types specific to the medical domain: misinterpretation of question, incomplete information, mechanism and pathway misattribution, and methodological and evidence fabrication. This structure allows for nuanced evaluation of LLMs, highlighting their performance variability across different hallucination types.
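To make the dataset structure concrete, the sketch below shows what a MedHallu-style record might look like. The field names and example values are illustrative assumptions, not the released schema.

```python
# Hypothetical MedHallu-style record; field names and values are
# illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class MedHalluExample:
    question: str             # PubMedQA-derived medical question
    knowledge: str            # supporting context passage
    ground_truth: str         # verified answer
    hallucinated_answer: str  # answer produced by the controlled pipeline
    hallucination_type: str   # one of the four medical hallucination types
    difficulty: str           # e.g. "easy" or "hard"

example = MedHalluExample(
    question="Does drug X reduce post-operative infection rates?",
    knowledge="A randomized trial of 200 patients reported ...",
    ground_truth="Yes; the trial reported a significant reduction.",
    hallucinated_answer="Yes, by inhibiting the Y signaling pathway in 95% of patients.",
    hallucination_type="mechanism and pathway misattribution",
    difficulty="hard",
)
```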

Methodology

The authors implement a multi-stage pipeline to generate hallucinated answers. Candidates are produced by prompting an LLM with the given question, context, and ground-truth answer. Each candidate is then evaluated with a blend of quality checks and semantic analyses to ensure that the hallucinated answer is convincingly close to, yet distinct from, the verified answer. A bidirectional entailment check is used to assess semantic proximity, so that the hardest examples remain semantically close to genuine responses while still being factually incorrect.
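A minimal sketch of such a bidirectional entailment check is shown below, assuming an off-the-shelf NLI model from Hugging Face (microsoft/deberta-large-mnli); the paper's actual model choice, prompts, and thresholds may differ.

```python
# Minimal bidirectional entailment check; the NLI model used here is an
# assumption, not necessarily the one used in the paper.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that the premise entails the hypothesis."""
    result = nli([{"text": premise, "text_pair": hypothesis}])[0]
    return result["label"] == "ENTAILMENT"

def same_meaning(answer_a: str, answer_b: str) -> bool:
    """Two answers share a semantic cluster only if each entails the other."""
    return entails(answer_a, answer_b) and entails(answer_b, answer_a)

# A candidate that is bidirectionally entailed by the ground truth is not a
# hallucination and would be discarded; candidates that are close but fail
# the mutual-entailment check make for harder examples.
```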

Experimental Insights

The paper presents comprehensive experiments using both general-purpose and medically fine-tuned LLMs, with and without access to additional medical context. Intriguingly, general-purpose models like GPT-4o outperform the specialized models on hallucination detection when no extra context is provided. Providing domain-specific knowledge, however, enhances performance across the board, with some general models seeing up to a 32% improvement in F1 score. Introducing a "not sure" category further boosts detection precision by allowing models to abstain when uncertainty is high, which matters in high-stakes domains like medicine, where a false assertion can lead to adverse outcomes.
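The snippet below sketches one way to frame detection with an explicit abstention option and to score it. The prompt wording and the convention of excluding abstentions from the counts are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative detection prompt with a "not sure" abstention option
# (not the paper's exact wording).
DETECTION_PROMPT = """You are verifying a medical answer.
Question: {question}
World knowledge: {knowledge}
Answer to check: {answer}
Is the answer factual or hallucinated? Reply with exactly one of:
"factual", "hallucinated", or "not sure"."""

def score_with_abstention(predictions, labels):
    """Precision/recall/F1 for the 'hallucinated' class, skipping abstentions.
    Abstaining on uncertain cases trades coverage for higher precision."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, labels):
        if pred == "not sure":
            continue  # this sketch simply excludes abstentions from the counts
        if pred == "hallucinated" and gold == "hallucinated":
            tp += 1
        elif pred == "hallucinated" and gold == "factual":
            fp += 1
        elif pred == "factual" and gold == "hallucinated":
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```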

Semantic Analysis

An intriguing finding is that hallucinated content whose semantic clusters lie closest to the ground truth is the hardest for models to discern, highlighting an area where current LLMs struggle. These semantically nuanced hallucinations pose significant challenges for detectors, reinforcing the need for continued iteration and training to improve model robustness.
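As a rough illustration, the sketch below clusters answers greedily by mutual entailment, reusing the same_meaning() check from the earlier snippet; the paper's actual clustering procedure and difficulty assignment may differ.

```python
# Greedy bidirectional-entailment clustering; assumes same_meaning(a, b)
# from the earlier sketch is in scope. Illustrative only.
def cluster_answers(answers):
    """An answer joins the first cluster whose representative it mutually
    entails; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of semantically equivalent answers
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

# Hallucinated answers landing in clusters closest to the ground-truth
# cluster are the ones detectors miss most often (the "hard" cases).
```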

Implications and Future Directions

MedHallu not only exposes current limitations in LLM hallucination detection but also sets the stage for future enhancements. The findings indicate substantial room for improving LLM training methodologies, particularly in incorporating external knowledge sources and probabilistic modeling that better handle the nuanced semantic differences characteristic of medical knowledge. These insights could align future AI development with clinical standards, directing research toward more reliable and contextually aware models capable of supporting medical professionals without risking patient safety.

This work is a step forward in benchmarking the limits of current LLMs, and its findings can serve as a guidepost for developers and researchers aiming to minimize hallucinations and increase the safety of AI systems deployed in critical sectors like healthcare. As the field advances, such benchmarks will be invaluable for standardizing evaluation metrics and improving the AI models integral to future technological solutions.

Authors (7)
  1. Shrey Pandit (7 papers)
  2. Jiawei Xu (64 papers)
  3. Junyuan Hong (31 papers)
  4. Zhangyang Wang (374 papers)
  5. Tianlong Chen (202 papers)
  6. Kaidi Xu (85 papers)
  7. Ying Ding (126 papers)