Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator (2403.08495v4)
Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in human interactions, yet their application within the medical field remains insufficiently explored. Previous work mainly measures medical knowledge through examinations, which are far from realistic scenarios and fall short in assessing the abilities of LLMs on clinical tasks. To enhance the application of LLMs in healthcare, this paper introduces the Automated Interactive Evaluation (AIE) framework and the State-Aware Patient Simulator (SAPS), targeting the gap between traditional LLM evaluations and the nuanced demands of clinical practice. Unlike prior methods that rely on static medical knowledge assessments, AIE and SAPS provide a dynamic, realistic platform for assessing LLMs through multi-turn doctor-patient simulations. This approach offers a closer approximation to real clinical scenarios and allows for a detailed analysis of LLM behaviors in response to complex patient interactions. Our extensive experimental validation demonstrates the effectiveness of the AIE framework, with outcomes that align well with human evaluations, underscoring its potential to revolutionize medical LLM testing for improved healthcare delivery.
Authors: Yusheng Liao, Yutong Meng, Yuhao Wang, Hongcheng Liu, Yanfeng Wang, Yu Wang