LongHealth: A Question Answering Benchmark with Long Clinical Documents (2401.14490v1)

Published 25 Jan 2024 in cs.CL

Abstract: Background: Recent advancements in LLMs offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data. Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting, requiring models to extract and interpret information from large clinical documents. Results: We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation. Conclusion: While LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The LongHealth benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application. We make the benchmark and evaluation code publicly available.
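The evaluation described in the abstract reduces to a multiple-choice QA loop over long clinical documents: present the full patient record, ask one question with its answer options, and score the selected letter against the key. The sketch below illustrates that protocol only; the file name longhealth_cases.json, its field names, and the query_model helper are illustrative assumptions and do not reflect the released benchmark's actual data format or evaluation code.

```python
import json
import re


def query_model(prompt: str) -> str:
    """Placeholder for an LLM call (local open-source model or API).

    Assumed to return the model's raw text reply.
    """
    raise NotImplementedError


def build_prompt(case_text: str, question: str, options: list[str]) -> str:
    # Present the full patient record followed by one multiple-choice question.
    letters = "ABCDE"
    option_block = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return (
        "You are given a patient's clinical records.\n\n"
        f"{case_text}\n\n"
        f"Question: {question}\n{option_block}\n"
        "Answer with the letter of the correct option."
    )


def evaluate(path: str = "longhealth_cases.json") -> float:
    # Hypothetical structure: a list of cases, each with the full document text
    # and its multiple-choice questions (options plus the correct letter).
    with open(path) as f:
        cases = json.load(f)

    correct = total = 0
    for case in cases:
        for q in case["questions"]:
            prompt = build_prompt(case["text"], q["question"], q["options"])
            reply = query_model(prompt)
            # Take the first standalone letter A-E in the reply as the model's choice.
            match = re.search(r"\b([A-E])\b", reply)
            predicted = match.group(1) if match else None
            correct += int(predicted == q["answer"])
            total += 1
    return correct / total if total else 0.0
```

Because each case spans roughly 5,000 to 7,000 words, the assembled prompt requires a context window on the order of the 16,000-token minimum used to select models in the paper.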

Authors (10)
  1. Lisa Adams
  2. Felix Busch
  3. Tianyu Han
  4. Jean-Baptiste Excoffier
  5. Matthieu Ortala
  6. Alexander Löser
  7. Hugo JWL. Aerts
  8. Jakob Nikolas Kather
  9. Daniel Truhn
  10. Keno Bressem