Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering (2403.04890v3)

Published 7 Mar 2024 in cs.CL

Abstract: In this paper, we propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN, which contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers. Additionally, we implement a prompt driven by Chain of Thought (CoT) reasoning, CLINICR, to mirror the prospective process of incremental reasoning, reaching a correct response to medical questions. We empirically demonstrate how CLINICR outperforms the state-of-the-art 5-shot CoT-based prompt (Liévin et al., 2022). We also present an approach that mirrors real-life clinical practice by first exploring multiple differential diagnoses through MCQ-CLINICR and subsequently narrowing down to a final diagnosis using MCQ-ELIMINATIVE. Finally, emphasizing the importance of response verification in medical settings, we utilize a reward model mechanism, replacing the elimination process performed by MCQ-ELIMINATIVE.

Few-shot Chain-of-thought Driven Reasoning for Open-ended Medical Question Answering

Introduction

In healthcare, leveraging LLMs for medical question answering is emerging as a promising way to support medical professionals and students. The paper presents a methodical investigation into improving how well LLMs answer open-ended medical questions. Distinctively, it shifts the focus toward open-ended response generation by developing a modified MedQA-USMLE dataset that mirrors real-life clinical scenarios more closely.

Methodology

A pivotal contribution of this work is an advanced prompting strategy designed specifically for the medical domain: incremental reasoning prompts. Unlike traditional few-shot Codex prompts, which often resort to eliminative reasoning over the answer options, this strategy advocates a forward-looking chain-of-thought (CoT) process that aligns more closely with how clinicians actually reach a diagnosis.
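To make the contrast concrete, the sketch below juxtaposes a hypothetical eliminative prompt with a forward, incremental-reasoning one. The template wording and the build_prompt helper are illustrative assumptions, not the paper's exact CLINICR or Codex prompts.

```python
# Illustrative sketch only: the exact CLINICR / Codex prompt wording is not
# reproduced here, and `build_prompt` is a hypothetical helper.

ELIMINATIVE_TEMPLATE = """Question: {question}
Options: {options}
Let's think step by step, ruling out the options that do not fit,
then state the remaining answer."""

INCREMENTAL_TEMPLATE = """Question: {question}
Let's reason forward from the presentation: summarize the key findings,
infer the most likely underlying process, and only then state the answer."""

def build_prompt(question: str, options: list[str] | None = None,
                 incremental: bool = True) -> str:
    """Few-shot exemplars (omitted here) would be prepended in the same format."""
    if incremental:
        return INCREMENTAL_TEMPLATE.format(question=question)
    return ELIMINATIVE_TEMPLATE.format(question=question,
                                       options=", ".join(options or []))
```

In the open-ended setting there are no options to eliminate, which is why the forward-reasoning template is the more natural fit.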

Key Differentiations and Dataset Modifications

  • The conventional Codex few-shot prompts and the newly proposed MedCodex few-shot prompts were applied to, and assessed on, both the traditional MedQA dataset and a novel variant tailored to elicit descriptive responses.
  • The MedQA-USMLE dataset was modified to produce two distinct versions: one retaining the original multiple-choice question (MCQ) format (MedQA-Original) and another adapted for descriptive, open-ended questioning (MedQA-No-Opt). This adaptation is essential for simulating a more genuine clinical inquiry environment; a minimal conversion sketch follows this list.
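A minimal sketch of the MCQ-to-open-ended conversion, under assumed field names (question, options, answer_idx) and a made-up vignette; the actual MedQA-USMLE schema and the clinician-approved reasoned answers are not reproduced here.

```python
# Hypothetical sketch: strip the answer choices so the model must produce a
# free-text answer; the gold option text serves as the reference answer.

def to_open_ended(item: dict) -> dict:
    return {
        "question": item["question"],  # clinical vignette, unchanged
        "reference_answer": item["options"][item["answer_idx"]],
    }

mcq_item = {
    "question": ("A 45-year-old man presents with crushing substernal chest pain "
                 "radiating to the left arm. What is the most likely diagnosis?"),
    "options": {"A": "Stable angina", "B": "Myocardial infarction",
                "C": "Pericarditis", "D": "Costochondritis"},
    "answer_idx": "B",
}
print(to_open_ended(mcq_item))  # options removed, reference answer retained
```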

Results and Observations

Evaluating the effectiveness of the incremental reasoning prompt revealed nuanced performance across different scenarios:

  • When applied to the original MCQ-format dataset, the standard Codex prompting approach outperformed the incremental reasoning prompts. This disparity underscores the Codex pattern's proficiency in navigating the constrained choice space inherent in MCQs.
  • Conversely, the incremental reasoning prompts demonstrated a significant advantage over Codex prompts within the descriptive version of the dataset. The observed superiority highlights the importance of a more dynamic and holistic reasoning approach when confronting open-ended medical questions.

Furthermore, a novel experiment on differential diagnosis generation first generates plausible candidate diagnoses and then employs either the Codex prompt or a specialized verifier model to select the final answer. This approach not only mirrors the clinical decision-making process but also improves performance, especially when combined with the trained verifier model.
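A minimal sketch of this generate-then-verify flow, assuming generic llm and verifier callables; the paper's MCQ-CLINICR prompt and trained reward model are not reproduced here.

```python
# Sketch under stated assumptions: `llm` maps a prompt string to generated text,
# and `verifier` maps (question, candidate) to a scalar plausibility score.

def generate_differentials(question: str, llm, k: int = 4) -> list[str]:
    """Ask the model for k plausible differential diagnoses, one per line."""
    prompt = f"{question}\nList {k} plausible differential diagnoses, one per line."
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()][:k]

def select_final_answer(question: str, candidates: list[str], verifier) -> str:
    """Score each candidate with the verifier and keep the highest-scoring one."""
    scores = [verifier(question, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```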

Implications and Future Directions

The paper's implications extend beyond enhancing LLMs' performance in medical question answering. By introducing and validating the incremental reasoning prompt strategy, the research opens pathways for developing more nuanced and context-aware LLM applications in healthcare. This approach could potentially refine LLMs’ utility in clinical decision support, patient education, and medical training.

Looking ahead, the paper suggests several avenues for continued exploration. Among them is applying the reward-model verification mechanism to LLMs beyond the Llama 2 model tested here. Expanding the developed methods to a broader range of medical datasets could further validate the approach's effectiveness and adaptability.

Conclusion

The paper's exploration into using few-shot, chain-of-thought driven reasoning to prompt LLMs for open-ended medical question answering contributes valuable insights into the potential for AI-driven tools in healthcare. The development of the modified MedQA dataset, alongside the introduction of a novel prompting strategy, lays foundational work for future research aimed at enhancing the precision and relevance of LLM outputs in medical contexts.

References (23)
  1. Katherine A. Batterton and Kimberly N. Hale. 2017. The Likert scale: what it is and how to use it. Phalanx, 50(2):32–39.
  2. E. Bolton et al. 2022. PubMedGPT 2.7B. Technical report, Stanford University Center for Research on Foundation Models.
  3. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  4. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  5. The future landscape of large language models in medicine. Communications Medicine, 3(1):141.
  6. Compositional semantic parsing with large language models. arXiv preprint arXiv:2209.15003.
  7. A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 10:178–206.
  8. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  9. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
  10. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146.
  11. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  12. Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143.
  13. Explainable ai for clinical risk prediction: a survey of concepts, methods, and modalities. arXiv preprint arXiv:2308.08407.
  14. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
  15. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR.
  16. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
  17. Large language models in medicine. Nature medicine, 29(8):1930–1940.
  18. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  19. BioMedLM: a domain-specific large language model for biomedical text. MosaicML.
  20. Secrets of RLHF in large language models part II: Reward modeling.
  21. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  22. Medical exam question answering with large-scale reading comprehension. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
  23. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348.
Authors (8)
  1. Ojas Gramopadhye
  2. Saeel Sandeep Nachane
  3. Prateek Chanda
  4. Ganesh Ramakrishnan
  5. Kshitij Sharad Jadhav
  6. Yatin Nandwani
  7. Dinesh Raghu
  8. Sachindra Joshi