
EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries (2402.16040v5)

Published 25 Feb 2024 in cs.CL

Abstract: Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. LLMs show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs, each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries. We offer EHRNoteQA in two formats, open-ended and multi-choice question answering, and propose a reliable evaluation method for each. We evaluate 27 LLMs using EHRNoteQA and examine various factors affecting model performance (e.g., the length and number of discharge summaries). Furthermore, to validate EHRNoteQA as a reliable proxy for expert evaluations in clinical practice, we measure the correlation between LLM performance on EHRNoteQA and LLM performance as manually evaluated by clinicians. Results show that LLM performance on EHRNoteQA has a higher correlation with clinician-evaluated performance (Spearman: 0.78, Kendall: 0.62) than other benchmarks, demonstrating its practical relevance for evaluating LLMs in clinical settings.
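The validation step described above boils down to rank-correlation statistics between two sets of model scores. Below is a minimal sketch of that calculation using SciPy, assuming each evaluated LLM's EHRNoteQA score and its clinician-assigned score are available as parallel lists; the variable names and score values are illustrative placeholders, not figures from the paper.

```python
# Minimal sketch: rank correlation between benchmark scores and clinician ratings.
# Requires scipy. The score values below are hypothetical, not from EHRNoteQA.
from scipy.stats import spearmanr, kendalltau

# One entry per evaluated LLM: benchmark score and clinician-evaluated score.
ehrnoteqa_scores = [0.81, 0.74, 0.69, 0.55, 0.48, 0.32]  # hypothetical
clinician_scores = [0.78, 0.76, 0.65, 0.58, 0.41, 0.35]  # hypothetical

rho, rho_p = spearmanr(ehrnoteqa_scores, clinician_scores)
tau, tau_p = kendalltau(ehrnoteqa_scores, clinician_scores)

print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.3f})")
```

A benchmark whose scores rank models in nearly the same order as clinician judgments will yield high Spearman and Kendall coefficients, which is the sense in which EHRNoteQA is argued to be a practical proxy for expert evaluation.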

Authors (9)
  1. Sunjun Kweon
  2. Jiyoun Kim
  3. Heeyoung Kwak
  4. Dongchul Cha
  5. Hangyul Yoon
  6. Kwanghyun Kim
  7. Seunghyun Won
  8. Edward Choi
  9. Jeewon Yang