WisPerMed at "Discharge Me!": Advancing Text Generation in Healthcare with Large Language Models, Dynamic Expert Selection, and Priming Techniques on MIMIC-IV (2405.11255v1)
Abstract: This study aims to leverage state-of-the-art large language models (LLMs) to automate the generation of the "Brief Hospital Course" and "Discharge Instructions" sections of discharge summaries from the MIMIC-IV dataset, reducing clinicians' administrative workload. We investigate how automation can improve documentation accuracy, alleviate clinician burnout, and enhance operational efficiency in healthcare facilities. This research was conducted as part of our participation in the Shared Task Discharge Me! at BioNLP @ ACL 2024. Various strategies were employed, including few-shot learning, instruction tuning, and Dynamic Expert Selection (DES), to develop models capable of generating the required text sections. Notably, using an additional clinical domain-specific dataset showed substantial potential to improve clinical language processing. The DES method, which selects the best text output from multiple predictions, proved especially effective: it achieved the highest overall score of 0.332 in the competition, surpassing single-model outputs. This finding suggests that advanced deep learning methods combined with DES can effectively automate parts of electronic health record documentation. These advancements could enhance patient care by freeing clinician time for patient interactions. The integration of text selection strategies represents a promising avenue for further research.
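The core idea of Dynamic Expert Selection, choosing the best candidate among multiple model generations, can be illustrated with a minimal sketch. The scoring function below (word overlap between a candidate and the patient context) is a hypothetical placeholder for illustration only; it is not the selection metric used in the paper, and all names here are assumptions.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase word tokens, stripped of punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(candidate: str, context: str) -> float:
    """Proxy quality metric: fraction of candidate words found in the context."""
    cand = _tokens(candidate)
    ctx = set(_tokens(context))
    if not cand:
        return 0.0
    return sum(w in ctx for w in cand) / len(cand)


def dynamic_expert_selection(candidates: list[str], context: str) -> str:
    """Return the candidate generation with the highest proxy score."""
    return max(candidates, key=lambda c: overlap_score(c, context))


# Toy example: two candidate "Brief Hospital Course" drafts for one admission.
context = "patient admitted with pneumonia, treated with antibiotics"
candidates = [
    "Patient was treated for pneumonia with antibiotics.",
    "Follow up in two weeks.",
]
best = dynamic_expert_selection(candidates, context)
```

In the actual system, the candidates would come from several fine-tuned LLMs and the scorer would reflect the competition metrics rather than simple word overlap.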
- Hendrik Damm
- Tabea M. G. Pakull
- Bahadır Eryılmaz
- Helmut Becker
- Ahmad Idrissi-Yaghir
- Henning Schäfer
- Sergej Schultenkämper
- Christoph M. Friedrich