Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions (2402.18060v4)

Published 28 Feb 2024 in cs.CL

Abstract: LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exam or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. Human and automatic evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.

Benchmarking LLMs on Challenging Medical Question Answering

The paper "Benchmarking LLMs on Answering and Explaining Challenging Medical Questions" addresses the capabilities of LLMs in the domain of medical question answering, specifically focusing on their ability to handle complex clinical cases and provide coherent explanations. The authors introduce two novel datasets, JAMA Clinical Challenge and Medbullets, which aim to evaluate the proficiency of LLMs in more realistic and demanding medical scenarios than those posed by traditional benchmarks like medical licensing exams.

Overview and Motivation

Medical question answering is a critical area where LLMs have shown promise by achieving impressive scores on standard medical examinations, such as the United States Medical Licensing Examination (USMLE). However, these exams often rely on textbook knowledge and do not adequately simulate the intricacies of real-world clinical cases where nuanced reasoning and the interpretation of complex scenarios are required. The paper posits that merely achieving high accuracy on board exams is insufficient for these models to support clinical decision-making in practice.

To further the field, the authors focus on two key improvements: increasing the challenge level of testing datasets to better reflect realistic medical situations and incorporating expert-written explanations to assess the reasoning capabilities of LLMs. The lack of reliable reference explanations in existing datasets impedes evaluating the explainability of model predictions, a crucial aspect of their utility in clinical applications.

Dataset Construction and Description

The paper introduces two datasets:

  1. JAMA Clinical Challenge: Comprising 1,524 cases curated from the JAMA Network Clinical Challenge archive, this dataset presents challenging real-world clinical scenarios that demand detailed reasoning and diagnostic skill. Each case includes a comprehensive clinical vignette, a question, multiple-choice answer options, and a detailed expert-written explanation.
  2. Medbullets: Collected from publicly available USMLE Step 2/3-style questions, this dataset consists of 308 questions, each accompanied by a clinical scenario, multiple answer options, and an expert-written explanation. The questions are designed to mirror common clinical situations, testing the ability of LLMs to apply clinical reasoning effectively.

These datasets are not only more challenging than previous benchmarks but also pair every question with a high-quality expert-written explanation, making them valuable resources for evaluating, and potentially training, the next generation of medical LLMs.
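A minimal sketch of how a single item in either dataset might be represented is shown below; the field names are illustrative assumptions for exposition, not the authors' released schema.

```python
# Illustrative structure of one multiple-choice item with an expert explanation.
# Field names are assumptions, not the datasets' actual released schema.
item = {
    "question": "A 62-year-old man presents with ... Which of the following is the most likely diagnosis?",
    "options": {
        "A": "Candidate diagnosis A",
        "B": "Candidate diagnosis B",
        "C": "Candidate diagnosis C",
        "D": "Candidate diagnosis D",
    },
    "answer": "B",  # gold answer key
    "explanation": (
        "Expert-written rationale explaining why B is correct "
        "and why the alternative options are less likely."
    ),
}
```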

Evaluation of LLMs

The authors evaluated several LLMs, including GPT-3.5, GPT-4, PaLM 2, and Llama 2, on the newly constructed datasets. The evaluation tested the models' ability to predict answers and generate explanations under different prompting strategies, including zero-shot, few-shot (in-context), and chain-of-thought (CoT) prompting.
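As a rough illustration of the zero-shot and CoT regimes (not the authors' exact templates), the sketch below builds both prompt variants from an item in the format sketched earlier:

```python
def build_prompt(item: dict, chain_of_thought: bool = False) -> str:
    """Build a multiple-choice prompt from an item with 'question' and 'options' keys."""
    options = "\n".join(f"{key}. {text}" for key, text in item["options"].items())
    prompt = f"{item['question']}\n\n{options}\n\n"
    if chain_of_thought:
        # CoT prompting asks the model to reason step by step before answering.
        prompt += "Let's think step by step, then give the final answer as a single letter."
    else:
        prompt += "Answer with the letter of the single best option."
    return prompt
```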

  • Findings on Prediction Accuracy: The results highlight a significant challenge posed by the new datasets, with performance drops observed across all models compared to traditional benchmarks. GPT-4 demonstrated superior performance overall, indicating its robustness in handling complex clinical questions.
  • Chain-of-Thought and In-Context Learning: The experiments suggest that CoT prompting enhances model reasoning capabilities by encouraging step-by-step analysis. However, this improvement was marginal for the most challenging questions from the JAMA dataset. In-context learning showed benefits mainly for GPT-4, with other models displaying limited adaptability to new tasks through this mechanism.
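In-context learning here amounts to prepending a few solved questions, with their answers and explanations, before the test question. A hedged sketch of that construction follows; the number, ordering, and selection of demonstrations used in the paper may differ.

```python
def build_few_shot_prompt(demonstrations: list[dict], test_item: dict) -> str:
    """Prepend solved demonstrations (same illustrative schema) before the test question."""
    blocks = []
    for demo in demonstrations:
        options = "\n".join(f"{k}. {v}" for k, v in demo["options"].items())
        blocks.append(
            f"{demo['question']}\n{options}\n"
            f"Answer: {demo['answer']}\n"
            f"Explanation: {demo['explanation']}"
        )
    options = "\n".join(f"{k}. {v}" for k, v in test_item["options"].items())
    blocks.append(f"{test_item['question']}\n{options}\nAnswer:")
    return "\n\n".join(blocks)
```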

Explanation Evaluation and Human Alignment

One of the paper's pivotal contributions is the assessment of model-generated explanations. The authors utilized automatic metrics like ROUGE-L and BARTScore and conducted human evaluations to gauge the quality of explanations.
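For reference, ROUGE-L scores of generated explanations against the expert references can be computed with the open-source rouge-score package, as in the minimal sketch below; BARTScore relies on a separate pretrained scoring model and is omitted here.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Score a model-generated explanation against the expert-written reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Expert-written explanation supporting the correct answer."
candidate = "Model-generated explanation for its chosen answer."
result = scorer.score(reference, candidate)
print(result["rougeL"].fmeasure)  # LCS-based F1 between candidate and reference
```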

  • Automatic vs. Human Evaluation: The paper found notable discrepancies between automatic metrics and human judgments, with human evaluators often preferring explanations generated by models that did not score highest on automated measures. This misalignment underscores the need for developing more reliable evaluation metrics that better capture qualitative aspects valued in medical reasoning.

Implications and Future Directions

The introduction of these datasets sets a new standard for evaluating LLMs in the medical domain, pushing beyond mere knowledge recall to assess practical reasoning and explanation generation. The research suggests several avenues for future work, including refining evaluation metrics for explanations, exploring more sophisticated prompting strategies, and integrating multimodal capabilities to address cases involving visual data, such as X-rays.

By establishing a robust benchmark for challenging medical QA, the paper paves the way for developing LLMs that are not only accurate but also capable of providing insightful and trustworthy support in clinical decision-making.

Authors
  1. Hanjie Chen
  2. Zhouxiang Fang
  3. Yash Singla
  4. Mark Dredze