MEDEC: Medical Error Detection Dataset
- MEDEC is a clinical benchmark of 500 annotated passages that supports rigorous evaluation of LLMs on medical error detection and correction.
- It enables multi-task benchmarking across error-flag detection, error-sentence localization, and correction generation, with balanced sampling over five error types.
- The dataset supports assessment of prompting methods such as zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP), highlighting trade-offs in recall, precision, and correction accuracy.
The MEDEC dataset is a specialized resource for benchmarking medical error detection and correction in clinical documentation using LLMs. Comprising 500 annotated clinical passages drawn from authentic healthcare records, MEDEC supports rigorous, multi-faceted evaluation of both human and automated systems on tasks critical for improving the factual safety of electronic medical records. Its design enables systematic assessment of prompt-based learning methods—including zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP)—across core error-processing subtasks in the clinical domain (Ahmed et al., 25 Nov 2025).
1. Dataset Scope and Structure
MEDEC consists of 500 clinical note passages, each annotated at multiple granularities: presence/absence of an error (“error flag”), explicit error span (“error sentence”), and ground-truth correction. The exemplar pool is explicitly balanced across five clinically salient error types: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism. The dataset structure supports three aligned subtask labels per example:
- Task 1 – Error Flag Detection: Binary indicator for note-level error presence.
- Task 2 – Error Sentence Detection: Identification of the erroneous sentence within the passage.
- Task 3 – Correction: Surface-form rewrite providing the corrected sentence or assertion.
Annotations are curated to enable both fine-grained and aggregate evaluation of model performance on error recognition and repair.
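The three aligned labels per example can be pictured as a single record; the field names below are illustrative, since the text does not publish a formal schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedecExample:
    """One annotated MEDEC passage (illustrative field names, not the official schema)."""
    note_id: str
    text: str                      # full clinical note passage
    error_flag: int                # Task 1: 1 if the note contains an error, else 0
    error_sentence: Optional[str]  # Task 2: the erroneous sentence, or None
    correction: Optional[str]      # Task 3: ground-truth corrected sentence, or None
    error_type: Optional[str]      # one of: Diagnosis, Management, Treatment,
                                   # Pharmacotherapy, Causal Organism

# Hypothetical example for illustration only
ex = MedecExample(
    note_id="medec-0001",
    text="The patient was started on amoxicillin for influenza. ...",
    error_flag=1,
    error_sentence="The patient was started on amoxicillin for influenza.",
    correction="The patient was started on oseltamivir for influenza.",
    error_type="Treatment",
)
```

Error-free notes would carry `error_flag=0` with the remaining annotation fields set to `None`.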
2. Benchmark Tasks Supported
MEDEC operationalizes three core subtasks in automatic clinical error processing, each reflecting a distinct aspect of deployed model utility:
- Error-Flag Detection: Determines if a note contains any factual, diagnostic, or management error (binary classification).
- Error-Sentence Detection: Localizes the erroneous span at the sentence level (sequence labeling).
- Error Correction: Generates the corrected version of the error-containing sentence (natural language generation).
Evaluation metrics include recall, false-positive rate (FPR), and F₁-score for the detection tasks, and BLEU-1, ROUGE-L, and BERTScore F₁ for correction, all averaged over repeated random sampling configurations (Ahmed et al., 25 Nov 2025).
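The note-level detection metrics follow directly from the confusion-matrix counts; this is a generic sketch, not code from the benchmark release:

```python
def detection_metrics(y_true, y_pred):
    """Recall, false-positive rate, and F1 for binary error-flag predictions.

    y_true / y_pred are parallel lists of 0/1 note-level labels.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "fpr": fpr, "f1": f1}
```

In a full evaluation these values would be computed per sampling configuration and then averaged, as described above.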
3. Exemplar Pool and Balanced Sampling
The dataset’s architecture mandates balanced representation of all five targeted error types, mitigating class imbalance and ensuring error-type diversity for rigorous analysis. For each model evaluation, prompt construction involves uniform random sampling without replacement from the N = 500 exemplar pool. This configuration supports reproducibility and error-type coverage necessary for in-context learning studies (e.g., SPR/RDP).
A summary of the exemplar pool structure:
| Error Type | Approx. Proportion | Annotation Level |
|---|---|---|
| Diagnosis | ~20% | Sentence/Correction |
| Management | ~20% | Sentence/Correction |
| Treatment | ~20% | Sentence/Correction |
| Pharmacotherapy | ~20% | Sentence/Correction |
| Causal Organism | ~20% | Sentence/Correction |
Exemplars are formatted using fixed prompt blocks, and both train/test splits and random seeds are documented for replicability.
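The balanced sampling procedure can be sketched as follows; the pool layout and function name are assumptions for illustration, with a fixed seed standing in for the documented seeds:

```python
import random

ERROR_TYPES = ["Diagnosis", "Management", "Treatment",
               "Pharmacotherapy", "Causal Organism"]

def sample_balanced_exemplars(pool, k=5, seed=0):
    """Draw k exemplars uniformly at random, one per error type.

    `pool` maps error type -> list of exemplar records (illustrative layout).
    Drawing one exemplar from each type's list is sampling without replacement
    across the pool, and fixing the seed makes the draw reproducible.
    """
    assert k == len(ERROR_TYPES), "one exemplar per error type"
    rng = random.Random(seed)
    return [rng.choice(pool[etype]) for etype in ERROR_TYPES]
```

Each prompt thus contains exactly one exemplar of every error type, which is what guarantees the error-type coverage required for the SPR/RDP comparisons.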
4. Application to Prompting Methodologies
MEDEC’s design is central to empirical investigations of prompting schemes for LLM-based error detection and correction. Notably, it enables detailed comparison across:
- Zero-Shot Prompting: A single instruction with no in-context examples, yielding high precision but lower recall.
- Static Prompting with Random Exemplars (SPR): Prompts constructed from k = 5 randomly sampled, type-balanced exemplars per test instance. SPR yields substantial recall gains (≈13 points over zero-shot: 83.5% vs. 70.2%) at the cost of a moderate rise in FPR (12.3% vs. 5.1%). The prompt template uses canonical formatting and is size-constrained (<1024 tokens).
- Retrieval-Augmented Dynamic Prompting (RDP): Exemplar selection is tailored dynamically per query, improving recall and correction accuracy on domain-specific phenomena at the cost of a higher FPR (19.8%).
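RDP-style dynamic selection can be sketched with a simple similarity ranking; a real system would likely use dense embeddings, so the bag-of-words cosine here is a stand-in, and the function names are assumptions:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts (stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(query_note, pool, k=5):
    """RDP-style selection: rank pool exemplars by similarity to the query note."""
    ranked = sorted(pool, key=lambda e: cosine(query_note, e["text"]), reverse=True)
    return ranked[:k]
```

Because exemplars are chosen per query rather than at random, the prompt surfaces domain-relevant cases (e.g., rare drugs or organisms similar to those in the query), which is the mechanism behind RDP's recall gains.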
Experimental results highlight MEDEC’s sensitivity to error-type distribution and in-context guidance, providing a controlled environment for isolating prompt and model effects (Ahmed et al., 25 Nov 2025).
5. Analysis of Failure Modes
Quantitative and qualitative analyses using MEDEC reveal systematic model failures tied to dataset characteristics:
- Rare Entity Substitutions: Missed corrections when ground-truth involves entities (e.g., drug/organism names) absent from the sampled exemplars.
- Numerical Dosage Confusions: Poor detection of subtle numeric errors (e.g., “mg” vs. “mcg”) when such cases are underrepresented.
- Negation Artifacts: False positives in logically consistent double-negative statements, reflecting model limitations in deeper semantic parsing.
Manual audits on MEDEC outputs highlight the impact of limited exemplar diversity on rare-case detection and point to the advantages of dynamic retrieval for boosting coverage on specialized clinical content.
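Audits of the dosage-confusion failure mode could be partially automated with a check like the following; this helper is hypothetical and not part of the MEDEC release:

```python
import re

# Matches a dose value and unit, e.g. "50 mcg"; alternation tries "mcg" before "mg"
UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|mg|g)\b", re.IGNORECASE)

def dosage_unit_mismatch(reference: str, prediction: str) -> bool:
    """Flag predictions that keep a dose value but miss the unit fix (mg vs. mcg).

    Illustrative audit helper: compares aligned (value, unit) pairs between the
    ground-truth correction and a model's output.
    """
    ref = [(v, u.lower()) for v, u in UNIT_RE.findall(reference)]
    pred = [(v, u.lower()) for v, u in UNIT_RE.findall(prediction)]
    return any(rv == pv and ru != pu for (rv, ru), (pv, pu) in zip(ref, pred))
```

Such lightweight checks can triage model outputs for human review, but they cannot replace the manual audits described above, since most error types (e.g., diagnosis or causal-organism substitutions) are not reducible to surface patterns.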
6. Impact on Model Evaluation and Development
By providing balanced, annotation-rich clinical passages linked to realistic EMR errors, MEDEC enables:
- Standardized benchmarking of LLMs across plug-and-play (zero-shot), static in-context (SPR), and retrieval-augmented (RDP) settings.
- Controlled quantitative analysis of error detection, localization, and correction capability, making it possible to surface method-specific trade-offs in recall, precision, and domain-knowledge transfer.
- Detailed error-type breakdowns, supporting interpretability analyses and highlighting operational risks for clinical deployment.
MEDEC’s structure, balance, and task formulation are foundational for advancing LLM robustness and reliability in clinical safety-critical contexts (Ahmed et al., 25 Nov 2025).