
MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine (2305.07340v1)

Published 12 May 2023 in cs.CL

Abstract: METHODS: First, a set of evaluation criteria is designed based on a comprehensive literature review. Second, the candidate criteria are refined using a Delphi method by five experts in medicine and engineering. Third, three clinical experts design a set of medical datasets for interacting with LLMs. Finally, benchmarking experiments are conducted on the datasets. The responses generated by LLM-based chatbots are recorded for blind evaluation by five licensed medical experts. RESULTS: The resulting evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with sixteen detailed indicators. The medical datasets include twenty-seven medical dialogues and seven case reports in Chinese. Three chatbots are evaluated: ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory. Experimental results show that Dr. PJ outperforms ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue and case report scenarios.
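The blind evaluation step described in the abstract could be tabulated roughly as follows. This is only a minimal sketch, not the authors' scoring procedure: the function name `aggregate_blind_scores`, the tuple layout, the anonymized labels "A" and "B", the numeric scale, and the sample scores are all hypothetical. Only the four capability categories come from the abstract.

```python
from statistics import mean

# The four capability categories named in the abstract.
CATEGORIES = (
    "medical professional capabilities",
    "social comprehensive capabilities",
    "contextual capabilities",
    "computational robustness",
)


def aggregate_blind_scores(ratings):
    """Average per-category scores over raters for each anonymized system.

    ratings: iterable of (system_label, rater_id, category, score) tuples.
    Returns {system_label: {category: mean score}}.
    This layout is an assumption made for illustration only.
    """
    buckets = {}
    for system, _rater, category, score in ratings:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        buckets.setdefault(system, {}).setdefault(category, []).append(score)
    return {
        system: {cat: mean(scores) for cat, scores in per_cat.items()}
        for system, per_cat in buckets.items()
    }


# Hypothetical ratings: systems are labeled "A" and "B" so that raters
# cannot tell which chatbot produced which response (blind evaluation).
example = [
    ("A", 1, "contextual capabilities", 4),
    ("A", 2, "contextual capabilities", 5),
    ("B", 1, "contextual capabilities", 3),
    ("B", 2, "contextual capabilities", 3),
]
summary = aggregate_blind_scores(example)
```

Keeping the system labels anonymized until after aggregation is one simple way to preserve blinding; the mapping from labels back to chatbots would be held separately from the raters.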

Authors (13)
  1. Jie Xu (467 papers)
  2. Lu Lu (189 papers)
  3. Sen Yang (191 papers)
  4. Bilin Liang (1 paper)
  5. Xinwei Peng (3 papers)
  6. Jiali Pang (2 papers)
  7. Jinru Ding (5 papers)
  8. Xiaoming Shi (40 papers)
  9. Lingrui Yang (1 paper)
  10. Huan Song (13 papers)
  11. Kang Li (207 papers)
  12. Xin Sun (151 papers)
  13. Shaoting Zhang (133 papers)
Citations (6)