How well do LLMs cite relevant medical references? An evaluation framework and analyses (2402.02008v1)

Published 3 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs are currently being used to answer medical questions across a variety of clinical domains. Recent top-performing commercial LLMs, in particular, are also capable of citing sources to support their responses. In this paper, we ask: do the sources that LLMs generate actually support the claims that they make? To answer this, we propose three contributions. First, as expert medical annotations are an expensive and time-consuming bottleneck for scalable evaluation, we demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors. Second, we develop an end-to-end, automated pipeline called SourceCheckup and use it to evaluate five top-performing LLMs on a dataset of 1200 generated questions, totaling over 40K pairs of statements and sources. Interestingly, we find that between ~50% to 90% of LLM responses are not fully supported by the sources they provide. We also evaluate GPT-4 with retrieval augmented generation (RAG) and find that, even still, around 30% of individual statements are unsupported, while nearly half of its responses are not fully supported. Third, we open-source our curated dataset of medical questions and expert annotations for future evaluations. Given the rapid pace of LLM development and the potential harms of incorrect or outdated medical information, it is crucial to also understand and quantify their capability to produce relevant, trustworthy medical references.

Evaluation of LLMs in Citing Medical References

The paper "How well do LLMs cite relevant medical references? An evaluation framework and analyses" presents a comprehensive evaluation of the ability of LLMs to cite appropriate medical references. With the increasing use of LLMs in healthcare, the accuracy and validity of the information they provide have become paramount, especially as these models are positioned to aid clinical decisions.

Contributions and Methodology

The paper introduces several key contributions to evaluate the efficacy of LLMs in generating relevant citations in medical contexts:

  1. Automated Evaluation Framework: The authors present an automated framework, termed SourceCheckup, that evaluates the ability of LLMs to cite appropriate medical references. The framework reduces the need for costly and time-consuming expert medical annotations by using GPT-4 as the verifier, which agrees with a panel of medical doctors 88% of the time; a minimal sketch of this verification step is given after the list.
  2. Dataset and Evaluation: The paper evaluated five prominent LLMs (GPT-4 RAG, GPT-4 API, Claude v2.1, Mistral Medium, and Gemini Pro) over a dataset comprising 1,200 questions, leading to over 40,000 statement-source pairs. This extensive dataset was meticulously curated from prominent medical sources such as Mayo Clinic, UpToDate, and Reddit's r/AskDocs. The provision of this open-source dataset and corresponding expert annotations will facilitate further research in this area.
  3. Quantitative Analysis: The paper provides startling insights into the performance of these models. Between roughly 50% and 90% of LLM responses were not fully supported by the sources they cited. Even with retrieval-augmented generation (RAG), around 30% of individual statements remained unsupported, and nearly half of the responses were not fully supported.
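To make the verification step concrete, here is a minimal sketch (not the authors' released code) of how an LLM judge can be asked whether a cited source supports a given statement, in the spirit of SourceCheckup; the prompt wording, model choice, and helper names are assumptions made for illustration.

```python
# Illustrative sketch of an LLM-as-judge check for citation support.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are verifying a medical citation.
Statement: {statement}
Source text: {source_text}
Does the source fully support the statement? Answer SUPPORTED or NOT_SUPPORTED."""

def is_statement_supported(statement: str, source_text: str) -> bool:
    """Ask the judge model whether the cited source supports the statement."""
    response = client.chat.completions.create(
        model="gpt-4",  # judge model; the paper reports ~88% agreement with doctors
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            statement=statement, source_text=source_text)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("SUPPORTED")
```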

Results

The evaluation framework elucidated several deficiencies in the LLMs' capabilities. Notably, GPT-4 with RAG, despite retrieving real-time sources via a search engine, failed to consistently provide accurate support: only 54% of its responses were fully supported. This points to a critical gap, particularly for high-stakes domains like healthcare, where misinformation can lead to serious consequences.
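For clarity on the response-level number quoted above, the sketch below encodes one natural reading of "fully supported": a response passes only if every one of its statements is backed by a cited source. This aggregation rule is an interpretation of the paper's metric, not its exact implementation.

```python
# Roll statement-level verdicts up into the response-level "fully supported" metric.
def response_fully_supported(statement_verdicts: list[bool]) -> bool:
    """A response is fully supported only when every statement in it is supported."""
    return bool(statement_verdicts) and all(statement_verdicts)

# Example: a single unsupported statement makes the whole response fail the check.
print(response_fully_supported([True, True, False, True]))  # False
```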

Other models, like the GPT-4 API and Claude v2.1, showed even less promising results. Specifically, the valid URL generation rates varied, with non-RAG models providing valid links only 40–70% of the time, indicating potential hallucination issues. Furthermore, results demonstrated that the type of question significantly influenced the models' citation performance, with Reddit-sourced questions generally being more challenging to support than those from curated medical sites.
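As a rough illustration of how "valid URL generation" might be measured, the following sketch simply checks whether each cited link resolves to a live page; the paper's actual validation procedure may differ, and the example URLs are placeholders.

```python
# Check whether generated citation URLs resolve, as a proxy for URL hallucination.
import requests

def url_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL responds with a non-error HTTP status."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

cited_urls = ["https://www.mayoclinic.org/", "https://example.invalid/made-up-page"]
valid_rate = sum(url_resolves(u) for u in cited_urls) / len(cited_urls)
print(f"valid URL rate: {valid_rate:.0%}")
```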

Theoretical and Practical Implications

The findings have profound implications, both theoretical and practical, for the future use and regulation of LLMs in medicine. Reliable citation mechanisms are clearly necessary if LLMs are to be trusted in clinical settings, and the results underscore the urgent need for better retrieval and verification so that models can meet regulatory standards for use in medical decision-making.

Given the current limitations observed, future work could focus on improving the factuality and source-attribution mechanisms of LLMs. Additionally, developing robust frameworks, potentially integrating LLMs with databases approved by regulatory bodies such as the FDA, could mitigate the risks of unsupervised LLM deployment in healthcare.

Conclusion

This comprehensive evaluation underscores existing gaps in the ability of LLMs to cite valid medical references. As the integration of AI in healthcare accelerates, ensuring the factual accuracy and reliability of model outputs will become increasingly critical. The research is a foundational step in addressing these challenges and sets a concrete path for future improvements in AI-assisted healthcare delivery. The accompanying open-source dataset further invites the community to extend this work toward models that consistently meet the stringent requirements of healthcare domains.

Authors (10)
  1. Kevin Wu (20 papers)
  2. Eric Wu (17 papers)
  3. Ally Cassasola (1 paper)
  4. Angela Zhang (10 papers)
  5. Kevin Wei (11 papers)
  6. Teresa Nguyen (1 paper)
  7. Sith Riantawan (1 paper)
  8. Patricia Shi Riantawan (1 paper)
  9. Daniel E. Ho (45 papers)
  10. James Zou (232 papers)