
Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification (2306.07297v1)

Published 10 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: The identification of key factors such as medications, diseases, and relationships within electronic health records (EHRs) and clinical notes has a wide range of applications in the clinical field. The n2c2 2022 shared tasks presented several challenges to promote the identification of such key factors in EHRs using the Contextualized Medication Event Dataset (CMED), and pretrained LLMs demonstrated exceptional performance on these tasks. This study explores the use of LLMs, specifically ChatGPT, for data augmentation to overcome the limited availability of annotated data for identifying key factors in EHRs. In addition, different pre-trained BERT models, initially trained on extensive corpora such as Wikipedia and MIMIC, were fine-tuned on the augmented datasets to identify these key variables in EHRs. Experimental results on two EHR analysis tasks, medication identification and medication event classification, indicate that ChatGPT-based data augmentation improves performance on both tasks.
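The augmentation strategy the abstract describes can be sketched as a simple pipeline: each labeled training example is sent to an LLM with a rephrasing prompt, and every returned paraphrase inherits the original label. The sketch below is a minimal illustration under that assumption; the `paraphrase` callable stands in for a ChatGPT API call (the prompt wording, filtering rules, and `augment_dataset` helper are hypothetical, not the authors' exact pipeline).

```python
from typing import Callable, List, Tuple

def augment_dataset(
    examples: List[Tuple[str, str]],          # (clinical sentence, event label) pairs
    paraphrase: Callable[[str], List[str]],   # LLM-backed paraphraser (assumed interface)
    n_per_example: int = 2,
) -> List[Tuple[str, str]]:
    """Expand a labeled dataset: each paraphrase inherits the source label."""
    augmented = list(examples)
    for text, label in examples:
        for variant in paraphrase(text)[:n_per_example]:
            # Keep only non-empty variants that differ from the source text.
            if variant.strip() and variant != text:
                augmented.append((variant, label))
    return augmented

# Stub standing in for a ChatGPT call such as:
#   "Rephrase the following clinical note sentence: <text>"
def dummy_paraphrase(text: str) -> List[str]:
    return [f"Note: {text}", text.replace("started", "initiated")]

data = [("Patient started lisinopril 10mg daily.", "Start")]
augmented = augment_dataset(data, dummy_paraphrase)
```

The fine-tuning step would then train a BERT-style classifier on `augmented` instead of `data`; only the training set grows, while labels and evaluation data stay fixed.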

Authors (3)
  1. Shouvon Sarker
  2. Lijun Qian
  3. Xishuang Dong
Citations (9)