Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain (2403.20288v2)
Abstract: We explore the potential of LLMs to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the model's answer is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy to as much as 74% depending on the prompt used, while Llama2 and Meditron exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.
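The binary (yes/no) interaction setting described above can be sketched as a small evaluation loop: the model sees a PubMedQA-style question together with a physician's proposed answer and may confirm or correct it. The prompt wording, the answer parser, and the accuracy bookkeeping below are illustrative assumptions, not the authors' actual code.

```python
import re

def build_prompt(question: str, physician_answer: str) -> str:
    """Compose an interaction prompt from the question and the physician's suggestion."""
    return (
        f"Question: {question}\n"
        f"A physician suggests the answer is: {physician_answer}.\n"
        "Do you agree? Reply with 'yes' or 'no'."
    )

def parse_binary(generation: str) -> str:
    """Map a free-form model generation to a yes/no label."""
    text = generation.lower()
    if re.search(r"\byes\b", text):
        return "yes"
    if re.search(r"\bno\b", text):
        return "no"
    return "unknown"

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions answered correctly after the interaction."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Running such a loop over the dataset with a physician baseline (e.g. 38% accurate suggestions) lets one measure whether the model's post-interaction accuracy rises above that baseline, which is the comparison the abstract reports.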
- Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus, 15(2):e35179.
- Large language models and the perils of their hallucinations. Critical Care, 27(1):1–2.
- Analogy generation by prompting large language models: A case study of InstructGPT. In Proceedings of the 15th International Conference on Natural Language Generation, pages 298–312, Waterville, Maine, USA and virtual meeting. Association for Computational Linguistics.
- Meditron-70B: Scaling medical pretraining for large language models. arXiv, abs/2311.16079.
- Empowering psychotherapy with large language models: Cognitive distortion detection through diagnosis of thought prompting. arXiv, abs/2310.07146.
- ChatAug: Leveraging ChatGPT for text data augmentation. arXiv, abs/2302.13007.
- A framework for few-shot language model evaluation.
- Few-shot chain-of-thought driven reasoning to prompt LLMs for open-ended medical question answering. arXiv, abs/2403.04890.
- Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12):248:1–248:38.
- Mistral 7B. arXiv, abs/2310.06825.
- What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577.
- The magic of IF: Investigating causal reasoning abilities in large language models of code. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9009–9022, Toronto, Canada. Association for Computational Linguistics.
- Can large language models reason about medical questions? arXiv, abs/2207.08143.
- ChatGPT-HealthPrompt: Harnessing the power of XAI in prompt-based healthcare decision support using ChatGPT. arXiv, abs/2308.09731.
- Capabilities of GPT-4 on medical challenge problems. arXiv, abs/2303.13375.
- Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv, abs/2311.16452.
- GPT-4 technical report. arXiv, abs/2303.08774.
- Can artificial intelligence help for scientific writing? Critical Care, 27(1).
- An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digital Medicine, 3.
- MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv, abs/2311.10537.
- Llama 2: Open foundation and fine-tuned chat models. arXiv, abs/2307.09288.
- Prompt engineering for healthcare: Methodologies and applications. arXiv, abs/2304.14670.