
A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review (2405.02559v2)

Published 4 May 2024 in cs.CL and cs.AI

Abstract: With generative AI, particularly LLMs, continuing to make inroads in healthcare, it is critical to supplement traditional automated evaluations with human evaluations. Understanding and evaluating the output of LLMs is essential to assuring safety, reliability, and effectiveness. However, human evaluation's cumbersome, time-consuming, and non-standardized nature presents significant obstacles to comprehensive evaluation and widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, includes publications from January 2018 to February 2024. The review examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, selection, and recruitment of evaluators, frameworks and metrics, evaluation process, and statistical analysis type. Drawing on the diverse evaluation strategies employed in these studies, we propose a comprehensive and practical framework for human evaluation of LLMs: QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications by defining clear evaluation dimensions and offering detailed guidelines.

Authors (15)
  1. Thomas Yu Chow Tam
  2. Sonish Sivarajkumar
  3. Sumit Kapoor
  4. Alisa V Stolyar
  5. Katelyn Polanska
  6. Karleigh R McCarthy
  7. Hunter Osterhoudt
  8. Xizhi Wu
  9. Shyam Visweswaran
  10. Sunyang Fu
  11. Piyush Mathur
  12. Giovanni E. Cacciamani
  13. Cong Sun
  14. Yifan Peng
  15. Yanshan Wang
Citations (6)