Can GPT-3.5 Generate and Code Discharge Summaries? (2401.13512v2)

Published 24 Jan 2024 in cs.CL

Abstract: Objective: To investigate GPT-3.5 in generating and coding medical documents with ICD-10 codes for data augmentation on low-resources labels. Materials and Methods: Employing GPT-3.5 we generated and coded 9,606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent (generation) codes within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full codeset, generation codes, and their families. Weak Hierarchical Confusion Matrices were employed to determine within-family and outside-of-family coding errors in the latter codesets. The coding performance of GPT-3.5 was evaluated both on prompt-guided self-generated data and real MIMIC-IV data. Clinical professionals evaluated the clinical acceptability of the generated documents. Results: Augmentation slightly hinders the overall performance of the models but improves performance for the generation candidate codes and their families, including one unseen in the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 can identify ICD-10 codes by the prompted descriptions, but performs poorly on real data. Evaluators note the correctness of generated concepts while suffering in variety, supporting information, and narrative. Discussion and Conclusion: GPT-3.5 alone is unsuitable for ICD-10 coding. Augmentation positively affects generation code families but mainly benefits codes with existing examples. Augmentation reduces out-of-family errors. Discharge summaries generated by GPT-3.5 state prompted concepts correctly but lack variety, and authenticity in narratives. They are unsuitable for clinical practice.

Summary

  • The paper evaluates GPT-3.5 for generating and coding synthetic discharge summaries used to augment ICD-10 coding models, reporting micro- and macro-F1 scores on the full code set, the targeted generation codes, and their families.
  • Augmentation improves performance on the low-frequency generation codes and their families and lowers out-of-family error rates, at the cost of a slight drop in overall performance.
  • GPT-3.5 alone codes real data poorly, and its generated summaries, while stating prompted concepts correctly, lack variety, supporting information, and authentic narratives, making them unsuitable for clinical use.

Introduction

The task of generating and coding medical documents, notably discharge summaries with ICD-10 codes, is a critical but resource-intensive process in healthcare. Automation through NLP has the potential to alleviate this burden, and recent progress in LLMs such as GPT-3.5 raises intriguing possibilities for medical document coding. This paper evaluates the ability of GPT-3.5 to generate and code discharge summaries, specifically in the context of data augmentation for rare labels in medical coding.

Methodology and Experimentation

The researchers selected patients with infrequent (generation) ICD-10 codes from the MIMIC-IV dataset and prompted GPT-3.5 with the corresponding lists of ICD-10 code descriptions, producing 9,606 synthetic discharge summaries. These summaries were combined with the baseline training set to form an augmented set, allowing a comparative analysis of neural coding models trained solely on authentic summaries versus those supplemented with synthetic data.
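
To make this pipeline concrete, the sketch below illustrates how such a summary might be requested from GPT-3.5 given a list of ICD-10 code descriptions; the client library, model name, prompt wording, and sampling settings are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch of the generation step, assuming the OpenAI Python client (>= 1.0)
# and the gpt-3.5-turbo chat model; the authors' exact prompt and settings are not
# reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_summary(code_descriptions: list[str]) -> str:
    """Ask the model for a discharge summary covering the given ICD-10 descriptions."""
    prompt = (
        "Write a hospital discharge summary for a patient whose record supports "
        "all of the following ICD-10 diagnoses and procedures:\n- "
        + "\n- ".join(code_descriptions)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Hypothetical example: two code descriptions for one synthetic patient.
summary = generate_summary([
    "Heart failure, unspecified",
    "Type 2 diabetes mellitus without complications",
])
print(summary)
```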

To assess coding performance, the paper uses micro- and macro-F1 scores, together with Weak Hierarchical Confusion Matrices to distinguish coding errors made within a code's family from those made outside it. GPT-3.5's own coding ability was evaluated both on its prompt-guided self-generated data and on real MIMIC-IV data. Additionally, clinical professionals reviewed the generated documents for clinical acceptability.
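
As a concrete illustration of these metrics, the sketch below computes micro- and macro-F1 over multi-label code sets and flags each incorrect prediction as within-family or out-of-family; the example codes, the use of scikit-learn, and the three-character ICD-10 category as the family definition are assumptions for illustration, not the paper's Weak Hierarchical Confusion Matrix tooling.

```python
# Simplified sketch of the evaluation ideas (not the paper's Weak Hierarchical
# Confusion Matrix implementation): multi-label micro-/macro-F1 via scikit-learn,
# plus labelling each wrong prediction as within-family or out-of-family, taking
# the three-character ICD-10 category as the "family" for illustration.
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

gold = [{"I50.9", "E11.9"}, {"J18.9"}]   # hypothetical gold code sets per document
pred = [{"I50.2", "E11.9"}, {"I10"}]     # hypothetical model predictions

mlb = MultiLabelBinarizer()
mlb.fit(gold + pred)                      # binarize over all codes seen in gold or predictions
y_true = mlb.transform(gold)
y_pred = mlb.transform(pred)

print("micro-F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))


def family(code: str) -> str:
    """ICD-10 category prefix, e.g. 'I50.9' -> 'I50'."""
    return code.split(".")[0][:3]


for g, p in zip(gold, pred):
    gold_families = {family(c) for c in g}
    for wrong in p - g:
        kind = "within-family" if family(wrong) in gold_families else "out-of-family"
        print(f"{wrong}: {kind} error")
```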

Results and Findings

Augmenting with GPT-3.5-generated summaries slightly hurt overall model performance; however, performance improved for the generation candidate codes and their families, including one code unseen in the baseline training data. Notably, augmented models displayed lower out-of-family error rates. Nevertheless, GPT-3.5 itself coded real data poorly, indicating that it is not viable as an autonomous ICD-10 coder in clinical practice.

The clinical review found that while GPT-3.5 stated individual prompted concepts correctly, the narratives lacked variety and supporting information and did not portray realistic clinical scenarios, making them insufficient for clinical use. This underscores the nuanced requirements of clinical documentation that go beyond mere factual correctness to include coherence, context, and prioritization of medical information.

Implications and Future Directions

This detailed investigation into the role of GPT-3.5 in medical document generation and coding provides valuable insights into the capabilities and limitations of LLMs in healthcare. Although GPT-3.5 showed promise in coding within a controlled synthetic environment, its real-world application is constrained by its inability to authentically replicate the clinical narrative structure and context.

Future directions might include exploring different prompting strategies, utilizing real clinical notes as contextual learning examples, or integrating chronological ordering in discharge summaries to guide LLMs towards more coherent narratives. Additionally, while GPT-3.5 alone may not suffice for clinical coding tasks, its role in data augmentation for training machine learning models points to a synergistic approach where human expertise is complemented by AI-generated insights for better model performance on rare codes.

Such studies are critical stepping stones in harnessing the power of AI to support healthcare systems, potentially leading to more efficient, accurate medical documentation processes in the future.
