
DeID-GPT: Zero-Shot Medical Text De-Identification

Updated 4 February 2026
  • DeID-GPT is a zero-shot LLM-based framework that automates redaction of protected health information from clinical notes.
  • It uses explicit prompt engineering with structured rules to map HIPAA identifiers, achieving state-of-the-art accuracy without fine-tuning.
  • The framework generalizes across clinical formats and domains, offering a scalable, low-overhead solution for privacy-preserving data processing.

DeID-GPT is a zero-shot, LLM-based framework for medical text de-identification, leveraging the capabilities of GPT-4 and related LLMs to automate the identification and masking of protected health information (PHI) from unstructured clinical notes under HIPAA constraints. By relying on explicit prompt engineering aligned with regulatory identifiers, DeID-GPT achieves state-of-the-art accuracy on benchmark datasets without any model fine-tuning or domain-specific retraining, providing a generalizable, low-overhead solution for privacy-sensitive data processing in healthcare and related domains (Liu et al., 2023).

1. System Architecture and Workflow

DeID-GPT employs a stepwise pipeline for the de-identification of clinical notes:

  • Input Processing: Raw clinical notes, such as XML-encoded i2b2/UTHealth 2014 files, undergo preprocessing where free-text sections are extracted, XML tags removed, and whitespace normalized. Gold-standard PHI annotations are retained for evaluation.
  • Prompt Construction: The framework incorporates all 18 HIPAA-specified identifiers as de-identification rules. These are mapped to dataset-specific PHI classes by a semantic similarity voting procedure performed by the same LLM. An explicit, structured prompt is then synthesized, comprising a task statement, command, and enumerated rules for each PHI category.
  • PHI Detection and Masking: Clinical notes and the constructed prompt are input to GPT-4 (or ChatGPT for GPT-3.5) via API or manual web interface. The LLM outputs the clinical note with all detected PHI spans replaced by “[redacted]”.
  • Output: The resulting de-identified notes are suitable for downstream sharing, privacy-preserving analytics, or surrogate data substitution.
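The preprocessing step described above (tag stripping and whitespace normalization for XML-encoded i2b2-style notes) can be sketched as follows; the function name and the exact cleanup rules are illustrative, not taken from the paper:

```python
import re

def preprocess_note(xml_text: str) -> str:
    """Extract free text from an XML-encoded clinical note (hypothetical
    i2b2-style input): drop CDATA delimiters, strip tags, normalize whitespace."""
    text = re.sub(r"<!\[CDATA\[|\]\]>", " ", xml_text)  # remove CDATA delimiters first
    text = re.sub(r"<[^>]+>", " ", text)                # strip remaining XML tags
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace
```

For example, `preprocess_note("<TEXT><![CDATA[Record date: 2067-05-04]]></TEXT>")` yields the bare text `Record date: 2067-05-04`. The CDATA delimiters are removed before tag stripping, since a `<![CDATA[...]]>` span contains no closing `>` until its end and would otherwise be consumed whole.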

Pseudocode outlining the main loop of DeID-GPT:

for each note in Dₒ:
    response = LLM_generate(model, prompt=Pₕ, input=note)
    D_d.append(response)
return D_d
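This loop can be written as a minimal runnable sketch, with the LLM call injected as a function argument; the names `deidentify_notes` and `llm_generate` are illustrative stand-ins for the API call, not identifiers from the paper:

```python
from typing import Callable, List

def deidentify_notes(
    notes: List[str],
    hipaa_prompt: str,
    llm_generate: Callable[[str, str], str],
) -> List[str]:
    """Apply the de-identification prompt to each note in turn.

    llm_generate(prompt, note) stands in for a call to GPT-4 (or another LLM);
    it should return the note with detected PHI spans replaced by "[redacted]".
    """
    return [llm_generate(hipaa_prompt, note) for note in notes]
```

In practice, `llm_generate` would wrap an API call that concatenates the prompt and the note; separating it out keeps the pipeline testable with a stub model.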

2. Prompt Engineering in Zero-Shot Settings

Prompt design is central to DeID-GPT’s efficacy. The optimal prompt template consists of three distinct segments:

  1. Task Statement: “Please anonymize the following clinical note.” [TASK]
  2. Command: “Replace all the following information with the term ‘[redacted]’:” [COMMAND]
  3. Specific Rules: Enumerated lines, each guiding the redaction of a PHI class.

Example rules:

  • Names: “Redact any strings that might be a name or acronym or initials, patients’ names, doctors’ names, the names of the M.D. or Dr.” [NAME]
  • Locations: “Redact any strings that might be a location or address, such as ‘3970 Longview Drive’.” [LOCATION]
  • Age: “Redact any strings that look like ‘something years old’ or ‘age 37’.” [AGE]
  • IDs: “Redact any dates and IDs and numbers and record dates.” [ID-like strings]
  • Professions: “Redact professions such as ‘manager’.” [PROFESSION]
  • Contact Info: “Redact any contact information.” [CONTACT]

No in-context demonstrations were included; all results are strictly zero-shot. Explicit, well-structured prompts are critical for model reliability, as underspecified or ambiguous prompts reduce coverage or induce errors.
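Under the three-segment template above, prompt construction can be sketched as follows; the rule strings abbreviate the examples given earlier, and the helper name is illustrative:

```python
def build_deid_prompt(rules: dict) -> str:
    """Assemble the explicit prompt: task statement, command, enumerated rules."""
    task = "Please anonymize the following clinical note."
    command = "Replace all the following information with the term '[redacted]':"
    rule_lines = [f"{i}. {text}" for i, text in enumerate(rules.values(), 1)]
    return "\n".join([task, command, *rule_lines])

# Abbreviated rule set, keyed by PHI class
RULES = {
    "NAME": "Redact any strings that might be a name or acronym or initials.",
    "LOCATION": "Redact any strings that might be a location or address.",
    "AGE": "Redact any strings that look like 'something years old' or 'age 37'.",
    "CONTACT": "Redact any contact information.",
}
```

Keeping the rules in a mapping makes it straightforward to swap in dataset-specific PHI classes produced by the semantic-mapping step without touching the task statement or command.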

Problematic prompt formats identified include:

  • Task-only prompts (insufficient coverage)
  • Extra punctuation after command (prompt truncation)
  • Mixing tasks (de-identification and summarization) in one prompt
  • Omission of output format expectations

Explicit prompts confer a substantial gain in accuracy: ChatGPT improves from 0.686 to 0.929, and GPT-4 from 0.908 to 0.990.

3. Algorithmic Details and Evaluation Metrics

PHI Mapping Function

HIPAA identifiers H_i are mapped to dataset PHI categories C_j by computing semantic similarity scores. For each H_i:

  • Compute S_ij = SemanticSim(H_i, C_j) via the GPT model
  • If max_j S_ij ≥ τ, map H_i to the category C_j with the highest score
  • Otherwise, map H_i to "Other"

The threshold τ is dataset-specific but not explicitly reported.
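The mapping procedure can be sketched as below; `semantic_sim` stands in for the LLM's similarity judgment (the paper obtains the scores from the GPT model itself), and the function name and default threshold are illustrative:

```python
from typing import Callable, Dict, List

def map_hipaa_to_phi(
    hipaa_ids: List[str],
    phi_categories: List[str],
    semantic_sim: Callable[[str, str], float],
    tau: float = 0.5,
) -> Dict[str, str]:
    """Map each HIPAA identifier to the most similar dataset PHI category,
    falling back to "Other" when no similarity reaches the threshold tau."""
    mapping = {}
    for h in hipaa_ids:
        scores = {c: semantic_sim(h, c) for c in phi_categories}
        best = max(scores, key=scores.get)
        mapping[h] = best if scores[best] >= tau else "Other"
    return mapping
```

Identifiers with no sufficiently similar dataset category land in "Other", so no HIPAA rule is silently dropped from the prompt.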

Entity-wise accuracy:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)

Standard NLP metrics:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2 · (Precision · Recall) / (Precision + Recall)
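These metrics translate directly into code from the confusion-matrix counts; the function name is illustrative, and zero denominators are guarded by returning 0.0:

```python
def deid_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute entity-wise accuracy, precision, recall, and F1
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Note that for de-identification, recall is the safety-critical number: a false negative is PHI that leaks through, whereas a false positive merely over-redacts.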

4. Experimental Setup and Comparative Performance

Dataset

  • i2b2/UTHealth 2014 De-Identification Challenge: 1,304 clinical notes from 296 diabetic patients, PHI manually annotated and replaced by surrogates, text split into train/dev/test. The DeID-GPT study uses the test set for all reported results.

Baselines and LLM Comparisons

Model               Implicit Prompt   Explicit Prompt   Fine-tuned
ChatGPT (GPT-3.5)   0.686             0.929             —
GPT-4               0.908             0.990             —
BERT (cased)        —                 —                 0.798
RoBERTa             —                 —                 0.947
ClinicalBERT        —                 —                 0.974
mT0                 —                 0.824             —
Falcon-7B           —                 0.605             —
Flan-T5-base        —                 0.737             —
LLaMA-1/2 (7B)      —                 ~0.61             —

GPT-4 with explicit prompting achieves 0.990 accuracy, surpassing fine-tuned clinical NER models. At the time of the study, GPT-4 was accessed manually via the web interface, limiting the evaluation set to 50 randomly sampled test notes.

5. Error Modes and Reliability

GPT-4 and ChatGPT

  • Missed PHI: Occurs when entity boundaries are ambiguous (e.g., geographical locations embedded in clinical terms).
  • Over-redaction: Non-PHI terms may be masked if they resemble a regulated entity (e.g., clinical codes similar to names).

Non-ChatGPT/GPT-4 LLMs

  • Task Misunderstanding: Models may repeat the prompt or inadvertently summarize the input instead of de-identifying.
  • Prompt Sensitivity: Longer, rule-heavy prompts are more likely to be truncated or yield incomplete redactions.

Ablation studies show explicit rule inclusion and clear output specification account for significant accuracy improvements.

6. Implications, Generalizability, and Future Prospects

  • Generalizability without Fine-Tuning: DeID-GPT requires no domain retraining or annotation, yet matches or exceeds the accuracy of models fine-tuned on clinical data. Prompt adaptation alone suffices for new institutions, languages, or clinical contexts.
  • Local Deployment Considerations: Due to data privacy requirements, on-premise deployment with open-weight LLMs (e.g., LLaMA, BLOOM, OPT) is a priority area. This requires advances in quantization and parallelism to manage computational cost.
  • Domain-Specific LLMs & Model Tuning: Custom LLMs trained on medical corpora (e.g., BioGPT, domain-specialized GPT-4) may further improve de-identification accuracy, particularly as GPT-4 API fine-tuning becomes available.
  • Extension to Multi-Modal Data: With GPT-4V, de-identification may extend to imaging data, enabling joint masking of PHI in both text and medical images.
  • Broad Applicability: Beyond healthcare, the DeID-GPT protocol applies to privacy protection in financial, legal, and social media documents.

7. Contextual Significance

DeID-GPT represents an early and effective application of general-purpose LLMs to the statutory task of medical text de-identification (Liu et al., 2023). Its ability to generalize across formats and clinical settings without task-specific training sets a new standard for rapid, compliant privacy-protection pipelines. The principal advances lie in prompt engineering rigor and semantic mapping procedures, which position DeID-GPT as a versatile solution for the privacy-preserving data sharing required by healthcare legislation and similar regulatory regimes.
