Can GPT-3.5 Generate and Code Discharge Summaries? (2401.13512v2)

Published 24 Jan 2024 in cs.CL

Abstract: Objective: To investigate GPT-3.5 in generating and coding medical documents with ICD-10 codes for data augmentation on low-resources labels. Materials and Methods: Employing GPT-3.5 we generated and coded 9,606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent (generation) codes within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full codeset, generation codes, and their families. Weak Hierarchical Confusion Matrices were employed to determine within-family and outside-of-family coding errors in the latter codesets. The coding performance of GPT-3.5 was evaluated both on prompt-guided self-generated data and real MIMIC-IV data. Clinical professionals evaluated the clinical acceptability of the generated documents. Results: Augmentation slightly hinders the overall performance of the models but improves performance for the generation candidate codes and their families, including one unseen in the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 can identify ICD-10 codes by the prompted descriptions, but performs poorly on real data. Evaluators note the correctness of generated concepts while suffering in variety, supporting information, and narrative. Discussion and Conclusion: GPT-3.5 alone is unsuitable for ICD-10 coding. Augmentation positively affects generation code families but mainly benefits codes with existing examples. Augmentation reduces out-of-family errors. Discharge summaries generated by GPT-3.5 state prompted concepts correctly but lack variety, and authenticity in narratives. They are unsuitable for clinical practice.

Summary

  • The paper evaluates GPT-3.5 for generating and coding synthetic discharge summaries used to augment ICD-10 coding models, reporting micro- and macro-F1 scores on the full code set, the targeted generation codes, and their families.
  • Augmentation improves performance on the low-frequency generation codes and their families and lowers out-of-family error rates, at the cost of a slight drop in overall performance.
  • GPT-3.5 alone codes real data poorly, and its generated summaries, while stating prompted concepts correctly, lack variety, supporting information, and authentic narratives, making them unsuitable for clinical use.

Introduction

The task of generating and coding medical documents, notably discharge summaries with ICD-10 codes, is a critical but resource-intensive process in healthcare. Automation through NLP has the potential to alleviate this burden, and recent progress in LLMs such as GPT-3.5 raises intriguing possibilities for medical document coding. This paper evaluates the ability of GPT-3.5 to generate and code discharge summaries, specifically in the context of data augmentation for rare labels in medical coding.

Methodology and Experimentation

The researchers selected patients with infrequent (generation) ICD-10 codes from the MIMIC-IV dataset and prompted GPT-3.5 with the corresponding lists of ICD-10 code descriptions, producing 9,606 synthetic discharge summaries. These summaries were combined with the baseline training set to form an augmented set, allowing a comparative analysis of neural coding models trained solely on authentic summaries versus those supplemented with synthetic data.
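
To make this pipeline concrete, the sketch below illustrates how such a summary might be requested from GPT-3.5 given a list of ICD-10 code descriptions; the client library, model name, prompt wording, and sampling settings are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch of the generation step, assuming the OpenAI Python client (>= 1.0)
# and the gpt-3.5-turbo chat model; the authors' exact prompt and settings are not
# reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_summary(code_descriptions: list[str]) -> str:
    """Ask the model for a discharge summary covering the given ICD-10 descriptions."""
    prompt = (
        "Write a hospital discharge summary for a patient whose record supports "
        "all of the following ICD-10 diagnoses and procedures:\n- "
        + "\n- ".join(code_descriptions)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Hypothetical example: two code descriptions for one synthetic patient.
summary = generate_summary([
    "Heart failure, unspecified",
    "Type 2 diabetes mellitus without complications",
])
print(summary)
```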

To assess coding performance, the paper uses micro- and macro-F1 scores, together with Weak Hierarchical Confusion Matrices to distinguish coding errors made within a code's family from those made outside it. GPT-3.5's own coding ability was evaluated both on its prompt-guided self-generated data and on real MIMIC-IV data. Additionally, clinical professionals reviewed the generated documents for clinical acceptability.
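
As a concrete illustration of these metrics, the sketch below computes micro- and macro-F1 over multi-label code sets and flags each incorrect prediction as within-family or out-of-family; the example codes, the use of scikit-learn, and the three-character ICD-10 category as the family definition are assumptions for illustration, not the paper's Weak Hierarchical Confusion Matrix tooling.

```python
# Simplified sketch of the evaluation ideas (not the paper's Weak Hierarchical
# Confusion Matrix implementation): multi-label micro-/macro-F1 via scikit-learn,
# plus labelling each wrong prediction as within-family or out-of-family, taking
# the three-character ICD-10 category as the "family" for illustration.
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

gold = [{"I50.9", "E11.9"}, {"J18.9"}]   # hypothetical gold code sets per document
pred = [{"I50.2", "E11.9"}, {"I10"}]     # hypothetical model predictions

mlb = MultiLabelBinarizer()
mlb.fit(gold + pred)                      # binarize over all codes seen in gold or predictions
y_true = mlb.transform(gold)
y_pred = mlb.transform(pred)

print("micro-F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))


def family(code: str) -> str:
    """ICD-10 category prefix, e.g. 'I50.9' -> 'I50'."""
    return code.split(".")[0][:3]


for g, p in zip(gold, pred):
    gold_families = {family(c) for c in g}
    for wrong in p - g:
        kind = "within-family" if family(wrong) in gold_families else "out-of-family"
        print(f"{wrong}: {kind} error")
```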

Results and Findings

Augmenting with GPT-3.5-generated summaries slightly hurt overall model performance; however, performance improved for the generation candidate codes and their families, including one code unseen in the baseline training data. Notably, augmented models displayed lower out-of-family error rates. Nevertheless, GPT-3.5 itself coded real data poorly, indicating that it is not viable as an autonomous ICD-10 coder in clinical practice.

The clinical review found that while GPT-3.5 stated individual prompted concepts correctly, the narratives lacked variety and supporting information and did not portray realistic clinical scenarios, making them insufficient for clinical use. This underscores the nuanced requirements of clinical documentation that go beyond mere factual correctness to include coherence, context, and prioritization of medical information.

Implications and Future Directions

This detailed investigation into the role of GPT-3.5 in medical document generation and coding provides valuable insights into the capabilities and limitations of LLMs in healthcare. Although GPT-3.5 showed promise in coding within a controlled synthetic environment, its real-world application is constrained by its inability to authentically replicate the clinical narrative structure and context.

Future directions might include exploring different prompting strategies, utilizing real clinical notes as contextual learning examples, or integrating chronological ordering in discharge summaries to guide LLMs towards more coherent narratives. Additionally, while GPT-3.5 alone may not suffice for clinical coding tasks, its role in data augmentation for training machine learning models points to a synergistic approach where human expertise is complemented by AI-generated insights for better model performance on rare codes.

Such studies are critical stepping stones in harnessing the power of AI to support healthcare systems, potentially leading to more efficient, accurate medical documentation processes in the future.
