FactEHR: Clinical Factuality Dataset
- FactEHR is a large-scale clinical dataset featuring 2,168 de-identified notes and nearly 1 million entailment pairs derived from diverse EHR sources.
- It uses document-level fact decomposition and natural language inference to quantitatively and qualitatively assess factuality in clinical text.
- The dataset supports precision and recall evaluations of LLMs, offering actionable insights for safe AI deployment in healthcare settings.
FactEHR is the first large-scale dataset constructed for evaluating fine-grained factuality in clinical text using document-level fact decomposition and natural language inference (NLI) techniques. It enables quantitative and qualitative assessment of LLMs on their capacity to parse and verify atomic clinical claims within electronic health record (EHR) documentation spanning multiple hospital systems and note types. The dataset consists of full-document fact decompositions for 2,168 de-identified clinical notes, yielding 987,266 entailment pairs across four note types and three EHR sources. The goal is to support both precision- and recall-style factuality verification, standardizing a challenging workflow for safely deploying generative AI in healthcare settings (Munnangi et al., 17 Dec 2024).
1. Dataset Composition and Structure
FactEHR unifies clinical notes sampled from three distinct EHR repositories: MIMIC (Beth Israel Deaconess Medical Center), CORAL (UCSF, oncology-centric), and MedAlign (Stanford Health Care, general hospital corpus). Sampling was randomized and type-constrained, with token limits set from 64 to 3,840 per note. The dataset encompasses the following breakdown:
| Note Type | MIMIC | CORAL | MedAlign | Total |
|---|---|---|---|---|
| Progress Note | 250 | 172 | 250 | 672 |
| Nursing Note | 250 | — | 129 | 379 |
| Discharge Summary | 250 | — | 117 | 367 |
| Procedure Note | 500 | — | 250 | 750 |
| Total | 1,250 | 172 | 746 | 2,168 |

The 500 MIMIC procedure notes are MIMIC-CXR radiology reports; the remaining 250 come from MedAlign.
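A minimal sketch of the type-constrained sampling described above, assuming notes are dicts with hypothetical `type` and `text` fields and a naive whitespace token count:

```python
import random

MIN_TOKENS, MAX_TOKENS = 64, 3840  # per-note token limits used in FactEHR

def sample_notes(notes, note_type, n, seed=0):
    """Randomly sample n notes of one type within the token limits.

    Whitespace token counting is a stand-in for the tokenizer actually used.
    """
    eligible = [
        d for d in notes
        if d["type"] == note_type
        and MIN_TOKENS <= len(d["text"].split()) <= MAX_TOKENS
    ]
    random.Random(seed).shuffle(eligible)
    return eligible[:n]

# e.g., the 250 MIMIC progress notes in the table above:
# progress = sample_notes(mimic_notes, "progress_note", 250)
```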
A total of $|D| = 2{,}168$ notes were processed through four LLMs (GPT-4o, o1-mini, Gemini-1.5-Flash, Llama3-8B) for document-level fact decomposition, producing 8,665 decompositions and $N = 987{,}266$ entailment pairs, where each decomposition of note $d$ contributes one note-to-fact pair per atomic fact ($n_d$ denotes the number of atomic facts for note $d$) and one facts-to-sentence pair per source sentence. This construction supports both precision (note-to-fact) and recall (facts-to-sentence) evaluations.
Fact density, the average number of atomic facts per sentence, varies by LLM and note type: for example, GPT-4o yields 2.37 facts/sentence on Nursing Notes and 1.45 on Procedure Notes, while Llama3-8B produces fewer facts than there are sentences in multiple settings, suggesting omission of relevant information.
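Fact density can be computed directly from a decomposition; a minimal sketch, assuming facts and sentence counts are stored per note under hypothetical names:

```python
def fact_density(decompositions, sentence_counts):
    """Average atomic facts per source sentence over a set of notes.

    decompositions:  note_id -> list of atomic fact strings
    sentence_counts: note_id -> number of source sentences
    """
    total_facts = sum(len(facts) for facts in decompositions.values())
    total_sents = sum(sentence_counts[nid] for nid in decompositions)
    return total_facts / total_sents

# A value below 1.0, as seen for Llama3-8B on several note types, means the
# model emitted fewer facts than there are source sentences, i.e. likely omissions.
```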
2. Fact Decomposition Pipeline
Fact decomposition serves to rewrite dense clinical statements into concise, independent atomic facts. The pipeline comprises three stages:
A. LLM Fact Generation: Each note is processed by four LLMs using a prompt adapted from Min et al. (2023): “Rewrite each sentence as a list of independent, atomic facts. Output as a delimited string.”
B. Manual Entailment Annotation: From the total pool, 1,036 premise–hypothesis pairs are randomly selected for gold annotation by a team of clinicians (physicians, residents, medical students, and a researcher). Each atomic fact is judged for full entailment against its premise. A subset of 100 pairs undergoes double annotation for inter-rater reliability.
C. LLM-Based Judgement: Remaining entailment pairs (986,230) are automatically labeled using GPT-4o, tuned on a 2,468-pair development set.
The overall process may be represented by:
```python
for d in D:                                       # D: corpus of clinical notes
    C_d = LLM_fact_decompose(d)                   # atomic facts for note d
    for c in C_d:
        emit(entailment_pair(premise=d, hypothesis=c))    # fact-precision pair
    for s in tokenize_sentences(d):
        emit(entailment_pair(premise=C_d, hypothesis=s))  # fact-recall pair
```
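As a concrete illustration of stages A and C, a sketch using the OpenAI Python client; the prompt wording and label parsing are simplified assumptions, not the exact prompts used to build FactEHR:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def LLM_fact_decompose(note_text: str, model: str = "gpt-4o") -> list[str]:
    """Stage A: rewrite the note's sentences as independent atomic facts."""
    prompt = ("Rewrite each sentence as a list of independent, atomic facts. "
              "Output one fact per line.\n\n" + note_text)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    lines = resp.choices[0].message.content.splitlines()
    return [ln.lstrip("-* ").strip() for ln in lines if ln.strip()]

def judge_entailment(premise: str, hypothesis: str, model: str = "gpt-4o") -> bool:
    """Stage C: binary judgment of whether the premise fully entails the hypothesis."""
    prompt = (f"Premise:\n{premise}\n\nHypothesis:\n{hypothesis}\n\n"
              "Does the premise fully entail the hypothesis? "
              "Answer ENTAILMENT or NOT_ENTAILMENT.")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    answer = resp.choices[0].message.content.strip().upper()
    return answer.startswith("ENTAILMENT")
```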
3. Evaluation Metrics and Annotation Quality
Precision and recall metrics are defined per note $d$, where $C_d$ is the set of decomposed atomic facts, $S_d$ the set of source sentences, $\mathbb{1}[\cdot]$ the indicator function, and $\vDash$ denotes entailment:
- Fact Precision: $P_d = \frac{1}{|C_d|} \sum_{c \in C_d} \mathbb{1}[\, d \vDash c \,]$
- Unweighted Fact Recall: $R_d = \frac{1}{|S_d|} \sum_{s \in S_d} \mathbb{1}[\, C_d \vDash s \,]$
- Weighted Fact Recall: $R_d^{w} = \frac{\sum_{s \in S_d} a_s \, \mathbb{1}[\, C_d \vDash s \,]}{\sum_{s \in S_d} a_s}$, which accounts for sentence atomicity $a_s$ (the number of atomic facts in sentence $s$)
Traditional classification metrics (precision = TP/(TP+FP), recall = TP/(TP+FN)) are also reported using entailment judgments.
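A minimal sketch of the three document-level metrics above, assuming the boolean entailment judgments have already been collected:

```python
def fact_metrics(fact_entailed, sent_entailed, sent_atomicity):
    """Per-document factuality metrics from pre-computed entailment labels.

    fact_entailed:  list[bool], one per atomic fact c (does note d entail c?)
    sent_entailed:  list[bool], one per sentence s (does fact set C_d entail s?)
    sent_atomicity: list[int], atomic-fact count a_s for each sentence s
    """
    precision = sum(fact_entailed) / len(fact_entailed)
    recall = sum(sent_entailed) / len(sent_entailed)
    weighted_recall = (sum(a for a, e in zip(sent_atomicity, sent_entailed) if e)
                       / sum(sent_atomicity))
    return precision, recall, weighted_recall
```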
Inter-annotator agreement for human annotation is substantial, with Fleiss’ κ = 0.67 (values of 0.61–0.80 are classically interpreted as substantial agreement).
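For reference, Fleiss’ κ on the doubly annotated subset can be computed with statsmodels; the rating-table shape below is an assumption for illustration:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per doubly annotated pair, one column per annotator;
# labels: 0 = not entailed, 1 = entailed (toy data).
ratings = np.array([[1, 1], [0, 1], [1, 1], [0, 0]])

table, _ = aggregate_raters(ratings)  # rows: items, cols: per-label counts
print(fleiss_kappa(table, method="fleiss"))
```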
Tokenization error modes are quantified: junk sentences (e.g., formatting lines, orphaned dates) occur at rates of 0.2–9.0%, and partial sentences (fragmented splits) at 3.6–10.4%. Procedure notes exhibit the highest junk rate (9.0%) and progress notes the highest partial rate (10.4%).
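A hypothetical heuristic for auditing these two error modes in a sentence tokenizer's output; the patterns are illustrative assumptions, not FactEHR's actual filters:

```python
import re

DATE_RE = re.compile(r"^\W*\d{1,2}[/-]\d{1,2}([/-]\d{2,4})?\W*$")

def is_junk(sentence: str) -> bool:
    """Formatting lines, separators, or orphaned dates masquerading as sentences."""
    s = sentence.strip()
    return not any(ch.isalpha() for ch in s) or bool(DATE_RE.match(s))

def is_partial(sentence: str) -> bool:
    """Crude fragment check: very short with no terminal punctuation."""
    s = sentence.strip()
    return len(s.split()) < 3 and not s.endswith((".", "!", "?"))
```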
4. Comparative Performance of LLMs
LLM performance on fact decomposition and entailment recognition is model- and note-type dependent. Table 2 of the source summarizes fact-level precision (P), recall (R), and F1 for procedure notes and discharge summaries, averaged across documents:
| LLM | Procedure P (%) | Procedure R (%) | Procedure F1 (%) | Discharge P (%) | Discharge R (%) | Discharge F1 (%) |
|---|---|---|---|---|---|---|
| GPT-4o | 98.5 | 78.7 | 86.5 | 97.0 | 79.0 | 86.8 |
| o1-mini | 97.8 | 78.4 | 86.2 | 94.8 | 81.1 | 86.7 |
| Gemini-1.5 | 95.9 | 64.2 | 77.0 | 96.5 | 65.4 | 76.9 |
| Llama3-8B | 84.2 | 49.4 | 62.0 | 84.6 | 38.9 | 50.4 |
GPT-4o and o1-mini achieve the highest F1 scores (86–92%) on structured notes but lower recall on narrative discharge summaries (≈79–81%). The coefficient of variation in per-document fact counts for discharge summaries reaches 0.45, and some models produce up to 2.6 times more facts per sentence than others. Llama3-8B generates the highest rate of incorrect facts (~55% in a 20-note sample), whereas GPT-4o and Gemini-1.5 produce >95% correct facts, with >80% judged atomic and independent.
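For clarity on the variability figure, the coefficient of variation is the standard deviation of per-document fact counts divided by their mean; a toy computation:

```python
import statistics

fact_counts = [12, 18, 25, 31, 9]  # toy per-document fact counts for one model
cv = statistics.pstdev(fact_counts) / statistics.mean(fact_counts)
print(f"coefficient of variation: {cv:.2f}")
```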
5. Example Decompositions and Use Cases
A representative example from a radiology Procedure Note illustrates the atomic fact extraction:
Source excerpt:
“Left pleural effusion increased from prior exam. No consolidation or pneumothorax. Heart size is normal. Lines and tubes in place.”
Decomposed atomic facts:
- “There is a pleural effusion on the left side.”
- “The left pleural effusion has increased compared to the prior exam.”
- “No consolidation is present.”
- “No pneumothorax is present.”
- “The cardiac size is normal.”
- “Lines and tubes are properly positioned.”
Each fact is paired with the full note for precision, and with its source sentences for recall, enabling calculation of the metric scores after annotation; the sketch below makes the pairing concrete.
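A minimal sketch building both pair types for this example (variable names are illustrative; sentence splitting is naive):

```python
note = ("Left pleural effusion increased from prior exam. No consolidation "
        "or pneumothorax. Heart size is normal. Lines and tubes in place.")
facts = [
    "There is a pleural effusion on the left side.",
    "The left pleural effusion has increased compared to the prior exam.",
    "No consolidation is present.",
    "No pneumothorax is present.",
    "The cardiac size is normal.",
    "Lines and tubes are properly positioned.",
]
# Naive sentence split; FactEHR's actual tokenizer is more involved (see Section 3).
sentences = [s.strip() + "." for s in note.rstrip(".").split(".")]

precision_pairs = [(note, f) for f in facts]              # note-to-fact
recall_pairs = [(" ".join(facts), s) for s in sentences]  # facts-to-sentence
```

FactEHR supports: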
- Benchmarking LLM-generated clinical documentation (e.g., automated discharge summaries)
- Supervised fine-tuning or prompt-tuning of open-source LLMs for fact decomposition
- Stress-testing medical NLI models in highly compositional and jargon-rich domains
A plausible implication is that robust factual decomposition is essential for trustworthy clinical language modeling and downstream healthcare automation.
6. Dataset Limitations and Scope
FactEHR is constrained by several factors:
- Absence of gold-standard (“ground-truth”) fact decompositions
- Reliance on proprietary LLM APIs within a HIPAA-compliant environment
- Coverage limited to four note types and English language only
- Imperfections in sentence tokenization, particularly for de-identified notes
Despite these constraints, FactEHR provides 8,665 decompositions, 987,266 entailment pairs, and 1,036 expert annotations, establishing a foundation for future research in factuality evaluation of clinical NLP systems. The dataset addresses a previously understudied niche and highlights variability in LLM performance, the risks of omission, and the need for continued methodology refinement (Munnangi et al., 17 Dec 2024).