Can GPT-3.5 Generate and Code Discharge Summaries? (2401.13512v2)
Abstract: Objective: To investigate GPT-3.5 in generating and coding medical documents with ICD-10 codes for data augmentation on low-resource labels.

Materials and Methods: Using GPT-3.5, we generated and coded 9,606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent codes (the generation codes) within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full codeset, generation codes, and their families. Weak Hierarchical Confusion Matrices were employed to determine within-family and outside-of-family coding errors in the latter codesets. The coding performance of GPT-3.5 was evaluated both on prompt-guided self-generated data and real MIMIC-IV data. Clinical professionals evaluated the clinical acceptability of the generated documents.

Results: Augmentation slightly hinders the overall performance of the models but improves performance for the generation candidate codes and their families, including one unseen in the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 can identify ICD-10 codes by the prompted descriptions, but performs poorly on real data. Evaluators note that the generated concepts are correct, but that the documents fall short in variety, supporting information, and narrative.

Discussion and Conclusion: GPT-3.5 alone is unsuitable for ICD-10 coding. Augmentation positively affects generation code families but mainly benefits codes with existing examples. Augmentation reduces out-of-family errors. Discharge summaries generated by GPT-3.5 state prompted concepts correctly but lack variety and authenticity in their narratives. They are unsuitable for clinical practice.
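The two reported metrics and the within-family versus outside-of-family distinction can be illustrated with a small sketch. This is not the authors' evaluation code, and it is not the Weak Hierarchical Confusion Matrix itself. It assumes a code's family is its first three characters (the ICD-10 category), and the function names `micro_macro_f1`, `family`, and `split_false_positives`, along with the toy data, are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold and pred are lists of code sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1
        for code in p - g:
            fp[code] += 1
        for code in g - p:
            fn[code] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Micro-F1 pools counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-code F1, so rare codes weigh as much as common ones.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


def family(code):
    """Assumption: the family is the three-character ICD-10 category."""
    return code[:3]


def split_false_positives(gold, pred):
    """Count wrong predictions that stay inside a gold family versus outside it."""
    within = outside = 0
    for g, p in zip(gold, pred):
        gold_families = {family(c) for c in g}
        for code in p - g:
            if family(code) in gold_families:
                within += 1
            else:
                outside += 1
    return within, outside


# Toy data: the third document is coded with a sibling of the correct code.
gold = [{"E11.9", "I10"}, {"I10"}, {"E11.65"}]
pred = [{"E11.9", "I10"}, {"I10"}, {"E11.9"}]
micro, macro = micro_macro_f1(gold, pred)
within, outside = split_false_positives(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f} within={within} outside={outside}")
```

In the toy data the missed rare code lowers macro-F1 more than micro-F1, which is why both are reported when the focus is on infrequent codes. The single error is a within-family one, the less severe kind under this simplified view.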