Evaluation of LLMs in Citing Medical References
The paper "How well do LLMs cite relevant medical references? An evaluation framework and analyses" presents a comprehensive evaluation of the ability of LLMs to cite appropriate medical references. With the increasing use of LLMs in healthcare, the accuracy and validity of the information they provide have become paramount, especially as these models are positioned to support clinical decisions.
Contributions and Methodology
The paper introduces several key contributions to evaluate the efficacy of LLMs in generating relevant citations in medical contexts:
- Automated Evaluation Framework: The authors present an automated framework, termed SourceCheckup, that evaluates the ability of LLMs to cite appropriate medical references. The framework is particularly useful because it reduces the need for costly and time-consuming expert medical annotation: the authors report that GPT-4's verification judgments agree with a panel of medical doctors 88% of the time. A minimal sketch of the verification-and-aggregation logic appears after this list.
- Dataset and Evaluation: The paper evaluated five prominent LLMs (GPT-4 RAG, GPT-4 API, Claude v2.1, Mistral Medium, and Gemini Pro) on a dataset of 1,200 questions, yielding over 40,000 statement-source pairs. The dataset was curated from sources spanning curated medical sites (Mayo Clinic, UpToDate) and patient-posed questions (Reddit's r/AskDocs). The release of this open-source dataset and the corresponding expert annotations should facilitate further research in this area.
- Quantitative Analysis: The paper provides sobering insights into the performance of these models: between 50% and 90% of LLM responses were not fully supported by the sources they cited. Even with retrieval-augmented generation (RAG), around 30% of individual statements remained unsupported, and nearly half of responses were not fully supported.
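To make concrete how such statement-level and response-level numbers can be computed, the following is a minimal Python sketch of the aggregation logic, assuming each response has already been decomposed into statement-source pairs. The names (`StatementSourcePair`, `support_rates`) and the stubbed `verify` judge are hypothetical illustrations, not the authors' implementation.

```python
# Sketch of SourceCheckup-style aggregation: a response counts as "fully
# supported" only if every statement in it is supported by its cited source.
# All names here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class StatementSourcePair:
    statement: str    # one factual claim extracted from the LLM response
    source_text: str  # text fetched from the URL the model cited

def verify(pair: StatementSourcePair) -> bool:
    """Hypothetical stand-in for an LLM judge answering: does the cited
    source fully support this statement?"""
    raise NotImplementedError("plug in an LLM-based verifier here")

def support_rates(
    responses: list[list[StatementSourcePair]],
    judge: Callable[[StatementSourcePair], bool],
) -> tuple[float, float]:
    """Return (statement-level support rate, response-level support rate).

    A response is supported only if *all* of its statements are supported,
    mirroring the strictest metric the paper reports.
    """
    verdicts = [[judge(p) for p in pairs] for pairs in responses]
    n_statements = sum(len(v) for v in verdicts)
    statement_rate = sum(sum(v) for v in verdicts) / max(n_statements, 1)
    response_rate = sum(all(v) for v in verdicts) / max(len(verdicts), 1)
    return statement_rate, response_rate
```

The gap between the two rates explains how roughly 30% unsupported statements can translate into nearly half of responses failing: a single unsupported statement sinks the whole response.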
Results
The evaluation framework elucidated several deficiencies in the LLMs' citation capabilities. Notably, GPT-4 with RAG, despite being able to retrieve sources from the live web via search, failed to consistently provide accurate support: only 54% of its responses were fully supported. This points to a critical gap for high-stakes domains like healthcare, where misinformation can lead to serious consequences.
Other models, such as the GPT-4 API and Claude v2.1, fared worse. Valid URL generation rates varied, with non-RAG models producing working links only 40-70% of the time, indicating that these models frequently hallucinate references; a simplified version of such a URL check is sketched below. Results also showed that question type significantly influenced citation performance: questions sourced from Reddit were generally harder to support than those drawn from curated medical sites.
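As a rough illustration of how URL validity might be measured, here is a small Python sketch using only the standard library. It is an assumption about how such a check could work, not the paper's exact procedure, which may apply additional filtering (for example, of error pages or redirects).

```python
# Simplified proxy for a URL-validity check: does each URL an LLM cites
# resolve to a live page? This is an illustrative assumption, not the
# authors' exact method.
import urllib.error
import urllib.request

def url_is_valid(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL resolves with an HTTP status below 400."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Hallucinated domains, malformed URLs, HTTP errors, and timeouts
        # all count as invalid. Note some sites block HEAD requests, so a
        # production check might fall back to GET.
        return False

cited = ["https://www.mayoclinic.org/", "https://example.invalid/made-up-page"]
valid_rate = sum(url_is_valid(u) for u in cited) / len(cited)
print(f"valid URL rate: {valid_rate:.0%}")
```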
Theoretical and Practical Implications
The findings have significant implications, both theoretical and practical, for the future use and regulation of LLMs in medicine. Reliable citation mechanisms are clearly necessary if LLMs are to be trusted in clinical settings, and the results underscore the urgent need for improved retrieval and verification so that models can meet regulatory standards for medical decision-making.
Given the current limitations, future work could focus on improving the factuality and source-attribution mechanisms of LLMs. Developing robust frameworks that integrate LLMs with databases approved by authoritative bodies such as the FDA could also mitigate the risks of unsupervised LLM deployment in healthcare.
Conclusion
This comprehensive evaluation underscores existing gaps in the ability of LLMs to cite valid medical references. As the integration of AI into healthcare accelerates, ensuring the factual accuracy and reliability of model outputs will become increasingly critical. The research is a foundational step in addressing these challenges and sets a concrete path for future improvements in AI-assisted healthcare delivery. The open-source dataset further invites follow-up work toward models that consistently meet the stringent requirements of the healthcare domain.