DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 (2303.11032v2)

Published 20 Mar 2023 in cs.CL and cs.CY

Abstract: The digitization of healthcare has facilitated the sharing and re-using of medical data but has also raised concerns about confidentiality and privacy. HIPAA (Health Insurance Portability and Accountability Act) mandates removing re-identifying information before the dissemination of medical records. Thus, effective and efficient solutions for de-identifying medical data, especially those in free-text forms, are highly needed. While various computer-assisted de-identification methods, including both rule-based and learning-based, have been developed and used in prior practice, such solutions still lack generalizability or need to be fine-tuned according to different scenarios, significantly imposing restrictions in wider use. The advancement of large language models (LLMs), such as ChatGPT and GPT-4, has shown great potential in processing text data in the medical domain with zero-shot in-context learning, especially in the task of privacy protection, as these models can identify confidential information by their powerful named entity recognition (NER) capability. In this work, we developed a novel GPT4-enabled de-identification framework ("DeID-GPT") to automatically identify and remove the identifying information. Compared to existing commonly used medical text data de-identification methods, our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text while preserving the original structure and meaning of the text. This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification, which provides insights for further research and solution development on the use of LLMs such as ChatGPT/GPT-4 in healthcare. Codes and benchmarking data information are available at https://github.com/yhydhx/ChatGPT-API.

DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4

The paper "DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4" addresses the pressing need for effective de-identification techniques in the digitized healthcare domain, especially pertaining to free-form clinical text. In light of HIPAA requirements, which stipulate the removal of identifiable patient information from medical records, this research explores leveraging LLMs, specifically GPT-4, for zero-shot de-identification tasks.

Context and Innovation

The advent of electronic health records (EHR) has facilitated significant advancements in the medical field through data sharing and application of data-driven solutions. However, this digital transformation also brings heightened concerns over patient privacy and confidentiality. Previous efforts have primarily focused on rule-based and learning-based methods for de-identification, but these approaches often lack generalizability across different datasets and require extensive fine-tuning.

Recent developments in LLMs, notably GPT-4, open a new avenue for medical text processing through zero-shot and in-context learning. With their strong named entity recognition (NER) capabilities, these models can potentially identify and redact sensitive information efficiently, without requiring large-scale labeled data or manual intervention.

Methodology

The paper introduces the DeID-GPT framework, which is built upon GPT-4's capabilities to automatically identify and remove identifying information in clinical text. The approach centers around prompt engineering—designing tailored prompts embedded with HIPAA identifiers to guide the model in recognizing and redacting protected health information (PHI).

The methodology comprises two key steps:

  1. Prompt Design: Tailoring prompts that incorporate HIPAA guidelines, enabling the LLM to understand the specific information requiring redaction.
  2. Processing through GPT-4: Utilizing GPT-4 for the actual de-identification process, wherein both the prompt and the original clinical text are provided as input to generate an anonymized output (see the sketch after this list).
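
The paper does not reproduce its exact prompts, but the two steps can be illustrated with a minimal sketch along the following lines, assuming the OpenAI Chat Completions API; the prompt wording, masking token, and model identifier here are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the two-step DeID-GPT flow (illustrative, not the authors' code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: Prompt design -- embed the HIPAA identifier categories in the instruction.
DEID_PROMPT = (
    "You are a medical de-identification assistant. Replace every piece of "
    "protected health information (PHI) defined by HIPAA -- names, dates, "
    "geographic locations, contact numbers, medical record numbers, and other "
    "identifiers -- with the token [REDACTED], and leave all other text unchanged."
)

def deidentify(clinical_note: str, model: str = "gpt-4") -> str:
    """Step 2: send the prompt plus the original note to the model and return the masked text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output is preferable for redaction
        messages=[
            {"role": "system", "content": DEID_PROMPT},
            {"role": "user", "content": clinical_note},
        ],
    )
    return response.choices[0].message.content

# Example with a synthetic note:
# print(deidentify("John Smith, MRN 123456, was admitted to Boston General on 03/20/2023."))
```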

Experimental Results

The authors conducted an extensive evaluation of DeID-GPT using the i2b2/UTHealth de-identification challenge dataset. The findings show that GPT-4 outperforms existing de-identification methods, including BERT, RoBERTa, and ClinicalBERT, in redacting sensitive information: in the zero-shot prompting setting it reached an accuracy above 99%, demonstrating strong de-identification capability without any task-specific training or fine-tuning.
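
The paper's exact evaluation script is not reproduced here, but a simplified proxy for its masking accuracy can be sketched as follows: given the gold-standard PHI strings annotated in the i2b2/UTHealth corpus, count how many of them no longer appear in the model output. The span-matching granularity is an assumption made for illustration.

```python
def phi_masking_rate(deidentified_text: str, gold_phi: list[str]) -> float:
    """Fraction of annotated PHI strings that no longer appear verbatim in the output.

    Simplified proxy for the paper's accuracy metric: a PHI item counts as handled
    if it has been removed or replaced in the de-identified text.
    """
    if not gold_phi:
        return 1.0
    masked = sum(1 for span in gold_phi if span not in deidentified_text)
    return masked / len(gold_phi)

# Example: both identifiers masked -> 1.0
# phi_masking_rate("[REDACTED], MRN [REDACTED], admitted ...", ["John Smith", "123456"])
```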

Implications and Future Directions

The implications of this research are substantial for both theoretical exploration and practical applications in AI and healthcare. The introduction of GPT-4 for de-identification tasks has potential advantages, such as:

  • Scale and Efficiency: Rapid processing of large datasets, reducing time and resources compared to manual and rule-based methods.
  • Adaptability: Seamless application across varied medical text datasets without necessitating changes in the workflow.
  • Reduction of Annotation Efforts: Minimizing the need for large-scale annotated data, which is often a bottleneck in clinical NLP tasks.

Moving forward, several avenues present opportunities for further exploration:

  • Locally-Deployed Models: Developing open-source LLMs suitable for local deployment within healthcare institutions to ensure data privacy and compliance with regulations (a minimal local sketch follows this list).
  • Domain-Specific Enhancement: Refining LLMs with domain-specific data (such as clinical notes) could augment model performance and adaptability.
  • Fine-Tuning and Integration: Investigating fine-tuning techniques for LLMs, especially GPT-4, to optimize their capabilities for specific healthcare domains.
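
As a rough illustration of what a locally deployed de-identification component could look like, the sketch below runs a small open NER model entirely on-premise; the model name is a general-domain stand-in (not a clinical PHI model, and not something proposed by the paper), chosen only because it is publicly available.

```python
# Sketch of local, on-premise de-identification with an open NER model.
# "dslim/bert-base-NER" is a general-domain model used as a stand-in; a clinical,
# PHI-specific model (or a locally hosted LLM) would be needed in practice.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

def deidentify_locally(note: str) -> str:
    """Replace every detected entity span with [REDACTED], working right to left."""
    entities = sorted(ner(note), key=lambda e: e["start"], reverse=True)
    for ent in entities:
        note = note[: ent["start"]] + "[REDACTED]" + note[ent["end"] :]
    return note

# print(deidentify_locally("John Smith was admitted to Boston General on 03/20/2023."))
```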

In summary, the DeID-GPT framework signifies promising progress in leveraging LLMs for automated medical text de-identification, contributing valuable insights into the broader application of AI models in safeguarding patient privacy within the healthcare sector.

Authors (18)
  1. Zhengliang Liu
  2. Yue Huang
  3. Xiaowei Yu
  4. Lu Zhang
  5. Zihao Wu
  6. Chao Cao
  7. Haixing Dai
  8. Lin Zhao
  9. Yiwei Li
  10. Peng Shu
  11. Fang Zeng
  12. Lichao Sun
  13. Wei Liu
  14. Dinggang Shen
  15. Quanzheng Li
  16. Tianming Liu
  17. Dajiang Zhu
  18. Xiang Li
Citations (147)