MEDEC: Medical Error Detection Dataset
- MEDEC is a clinical benchmark of 500 annotated passages that supports rigorous evaluation of LLMs on medical error detection and correction.
- It enables multi-task benchmarking across error-flag detection, error-sentence localization, and correction generation, with balanced sampling over five error types.
- The dataset supports assessment of prompting methods such as zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP), highlighting trade-offs in recall, precision, and correction accuracy.
The MEDEC dataset is a specialized resource for benchmarking medical error detection and correction in clinical documentation using LLMs. Comprising 500 annotated clinical passages drawn from authentic healthcare records, MEDEC supports rigorous, multi-faceted evaluation of both human and automated systems on tasks critical for improving the factual safety of electronic medical records. Its design enables systematic assessment of prompt-based learning methods—including zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP)—across core error-processing subtasks in the clinical domain (Ahmed et al., 25 Nov 2025).
1. Dataset Scope and Structure
MEDEC consists of 500 clinical note passages, each annotated at multiple granularities: presence/absence of an error (“error flag”), explicit error span (“error sentence”), and ground-truth correction. The exemplar pool is explicitly balanced across five clinically salient error types: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism. The dataset structure supports three aligned subtask labels per example:
- Task 1 – Error Flag Detection: Binary indicator for note-level error presence.
- Task 2 – Error Sentence Detection: Identification of the erroneous sentence within the passage.
- Task 3 – Correction: Surface-form rewrite providing the corrected sentence or assertion.
Annotations are curated to enable both fine-grained and aggregate evaluation of model performance on error recognition and repair.
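The three aligned labels per example can be pictured as a single record; the field names below are illustrative, since the text does not publish a formal schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedecExample:
    """One annotated MEDEC passage (illustrative field names, not the official schema)."""
    note_id: str
    text: str                      # full clinical note passage
    error_flag: int                # Task 1: 1 if the note contains an error, else 0
    error_sentence: Optional[str]  # Task 2: the erroneous sentence, or None
    correction: Optional[str]      # Task 3: ground-truth corrected sentence, or None
    error_type: Optional[str]      # one of: Diagnosis, Management, Treatment,
                                   # Pharmacotherapy, Causal Organism

# Hypothetical example for illustration only
ex = MedecExample(
    note_id="medec-0001",
    text="The patient was started on amoxicillin for influenza. ...",
    error_flag=1,
    error_sentence="The patient was started on amoxicillin for influenza.",
    correction="The patient was started on oseltamivir for influenza.",
    error_type="Treatment",
)
```

Error-free notes would carry `error_flag=0` with the remaining annotation fields set to `None`.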
2. Benchmark Tasks Supported
MEDEC operationalizes three core subtasks in automatic clinical error processing, each reflecting a distinct aspect of deployed model utility:
- Error-Flag Detection: Determines if a note contains any factual, diagnostic, or management error (binary classification).
- Error-Sentence Detection: Localizes the erroneous span at the sentence level (sequence labeling).
- Error Correction: Generates the corrected version of the error-containing sentence (natural language generation).
Evaluation metrics include recall, false-positive rate (FPR), and F₁-score for the detection tasks, and BLEU-1, ROUGE-L, and BERTScore F₁ for correction, all averaged over repeated random sampling configurations (Ahmed et al., 25 Nov 2025).
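The note-level detection metrics follow directly from the confusion-matrix counts; this is a generic sketch, not code from the benchmark release:

```python
def detection_metrics(y_true, y_pred):
    """Recall, false-positive rate, and F1 for binary error-flag predictions.

    y_true / y_pred are parallel lists of 0/1 note-level labels.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "fpr": fpr, "f1": f1}
```

In a full evaluation these values would be computed per sampling configuration and then averaged, as described above.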
3. Exemplar Pool and Balanced Sampling
The dataset’s architecture mandates balanced representation of all five targeted error types, mitigating class imbalance and ensuring error-type diversity for rigorous analysis. For each model evaluation, prompt construction involves uniform random sampling without replacement from the N = 500 exemplar pool. This configuration supports reproducibility and error-type coverage necessary for in-context learning studies (e.g., SPR/RDP).
A summary of the exemplar pool structure:
| Error Type | Approx. Proportion | Annotation Level |
|---|---|---|
| Diagnosis | ~20% | Sentence/Correction |
| Management | ~20% | Sentence/Correction |
| Treatment | ~20% | Sentence/Correction |
| Pharmacotherapy | ~20% | Sentence/Correction |
| Causal Organism | ~20% | Sentence/Correction |
Exemplars are formatted using fixed prompt blocks, and both train/test splits and random seeds are documented for replicability.
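The balanced sampling procedure can be sketched as follows; the pool layout and function name are assumptions for illustration, with a fixed seed standing in for the documented seeds:

```python
import random

ERROR_TYPES = ["Diagnosis", "Management", "Treatment",
               "Pharmacotherapy", "Causal Organism"]

def sample_balanced_exemplars(pool, k=5, seed=0):
    """Draw k exemplars uniformly at random, one per error type.

    `pool` maps error type -> list of exemplar records (illustrative layout).
    Drawing one exemplar from each type's list is sampling without replacement
    across the pool, and fixing the seed makes the draw reproducible.
    """
    assert k == len(ERROR_TYPES), "one exemplar per error type"
    rng = random.Random(seed)
    return [rng.choice(pool[etype]) for etype in ERROR_TYPES]
```

Each prompt thus contains exactly one exemplar of every error type, which is what guarantees the error-type coverage required for the SPR/RDP comparisons.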
4. Application to Prompting Methodologies
MEDEC’s design is central to empirical investigations of prompting schemes for LLM-based error detection and correction. Notably, it enables detailed comparison across:
- Zero-Shot Prompting: A single instruction with no in-context examples, yielding high precision but lower recall.
- Static Prompting with Random Exemplars (SPR): Prompts constructed from k = 5 randomly sampled, type-balanced exemplars per test instance. SPR yields substantial recall gains (≈13 points over zero-shot: 83.5% vs. 70.2%) at the cost of a moderate rise in FPR (12.3% vs. 5.1%). The prompt template uses canonical formatting and is size-constrained (<1024 tokens).
- Retrieval-Augmented Dynamic Prompting (RDP): Exemplar selection is tailored dynamically per query, improving recall and correction accuracy on domain-specific phenomena at the cost of a higher FPR (19.8%).
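RDP-style dynamic selection can be sketched with a simple similarity ranking; a real system would likely use dense embeddings, so the bag-of-words cosine here is a stand-in, and the function names are assumptions:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts (stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(query_note, pool, k=5):
    """RDP-style selection: rank pool exemplars by similarity to the query note."""
    ranked = sorted(pool, key=lambda e: cosine(query_note, e["text"]), reverse=True)
    return ranked[:k]
```

Because exemplars are chosen per query rather than at random, the prompt surfaces domain-relevant cases (e.g., rare drugs or organisms similar to those in the query), which is the mechanism behind RDP's recall gains.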
Experimental results highlight MEDEC’s sensitivity to error-type distribution and in-context guidance, providing a controlled environment for isolating prompt and model effects (Ahmed et al., 25 Nov 2025).
5. Analysis of Failure Modes
Quantitative and qualitative analyses using MEDEC reveal systematic model failures tied to dataset characteristics:
- Rare Entity Substitutions: Missed corrections when ground-truth involves entities (e.g., drug/organism names) absent from the sampled exemplars.
- Numerical Dosage Confusions: Poor detection of subtle numeric errors (e.g., “mg” vs. “mcg”) when such cases are underrepresented.
- Negation Artifacts: False positives in logically consistent double-negative statements, reflecting model limitations in deeper semantic parsing.
Manual audits on MEDEC outputs highlight the impact of limited exemplar diversity on rare-case detection and point to the advantages of dynamic retrieval for boosting coverage on specialized clinical content.
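Audits of the dosage-confusion failure mode could be partially automated with a check like the following; this helper is hypothetical and not part of the MEDEC release:

```python
import re

# Matches a dose value and unit, e.g. "50 mcg"; alternation tries "mcg" before "mg"
UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|mg|g)\b", re.IGNORECASE)

def dosage_unit_mismatch(reference: str, prediction: str) -> bool:
    """Flag predictions that keep a dose value but miss the unit fix (mg vs. mcg).

    Illustrative audit helper: compares aligned (value, unit) pairs between the
    ground-truth correction and a model's output.
    """
    ref = [(v, u.lower()) for v, u in UNIT_RE.findall(reference)]
    pred = [(v, u.lower()) for v, u in UNIT_RE.findall(prediction)]
    return any(rv == pv and ru != pu for (rv, ru), (pv, pu) in zip(ref, pred))
```

Such lightweight checks can triage model outputs for human review, but they cannot replace the manual audits described above, since most error types (e.g., diagnosis or causal-organism substitutions) are not reducible to surface patterns.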
6. Impact on Model Evaluation and Development
By providing balanced, annotation-rich clinical passages linked to realistic EMR errors, MEDEC enables:
- Standardized benchmarking of LLMs across plug-and-play (zero-shot), static in-context (SPR), and retrieval-augmented (RDP) settings.
- Controlled quantitative analysis of error detection, localization, and correction capability, making it possible to surface method-specific trade-offs in recall, precision, and domain-knowledge transfer.
- Detailed error-type breakdowns, supporting interpretability analyses and highlighting operational risks for clinical deployment.
MEDEC’s structure, balance, and task formulation are foundational for advancing LLM robustness and reliability in clinical safety-critical contexts (Ahmed et al., 25 Nov 2025).