RadGraph Annotation
- RadGraph annotation is a structured methodology that maps free-text radiology reports into normalized clinical facts using a graph-based formalism.
- It employs expert annotators with strict calibration and high inter-annotator agreement (Cohen’s κ up to 0.99) to ensure data quality.
- The framework enables automated labeling, style-aware report generation, and disease progression tracking, advancing medical NLP applications.
RadGraph annotation is a methodology for systematically structuring and extracting clinical entities and their semantic relations from free-text radiology reports. Designed initially for chest X-ray reports, the framework underpins datasets, benchmark models, and information extraction pipelines widely adopted in medical NLP research. At its core, RadGraph provides a graph-based formalism mapping unstructured language to normalized, relational clinical facts, facilitating downstream applications ranging from automated report understanding to style-sensitive text generation and disease progression tracking (Jain et al., 2021, Khanna et al., 2023, Yan et al., 2023).
1. RadGraph Annotation Schema
The RadGraph schema formalizes the essential entities and directed relations comprising the clinical content of radiology reports. Entities and relations are annotated in a way that captures not only observed findings but also their clinical assertions and context.
Entity Types
- Anatomy: Any reference to an anatomic region or substructure (e.g., "lung," "pleura," "mediastinum").
- Observation:Definitely Present: Findings or pathological states explicitly marked as present (e.g., "consolidation," "effusion").
- Observation:Uncertain: Descriptors reflecting diagnostic uncertainty or differentials (e.g., "possible pneumothorax," "cannot exclude lymphadenopathy").
- Observation:Definitely Absent: Findings explicitly negated by the radiologist (e.g., "no pneumothorax," "without focal consolidation").
Relation Types
- located_at (Observation → Anatomy): Links a clinical observation to its anatomical site (e.g., "pleural effusion" located_at "right lung base").
- modify: Associates modifying entities to their targets, restricted to same-type pairs (Anatomy→Anatomy or Observation→Observation); includes degree/descriptor modifiers and spatial/anatomical qualifiers (e.g., "moderate" modify "pleural effusion").
- suggestive_of (Observation → Observation): Encodes inferential relationships in which one observation implies or suggests another (e.g., "ground-glass opacity" suggestive_of "early pneumonia") (Jain et al., 2021, Yan et al., 2023).
Span and Nesting
Entities are annotated over contiguous token spans (internally via an IOB tagging scheme). Spans do not overlap, except that same-type containment is permitted where necessary; cross-type nesting is disallowed (e.g., Anatomy and Observation spans remain strictly disjoint).
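The entity and relation constraints above can be sketched as a small typed data model. This is an illustrative encoding, not part of any RadGraph release; the names `Entity` and `relation_is_valid` are hypothetical:

```python
from dataclasses import dataclass

# Entity and relation labels as defined by the RadGraph schema.
ENTITY_TYPES = {
    "Anatomy",
    "Observation:Definitely Present",
    "Observation:Uncertain",
    "Observation:Definitely Absent",
}
RELATION_TYPES = {"located_at", "modify", "suggestive_of"}

@dataclass(frozen=True)
class Entity:
    tokens: str   # surface text of the contiguous span
    label: str    # one of ENTITY_TYPES
    start: int    # token index of span start (inclusive)
    end: int      # token index of span end (inclusive)

def is_observation(label: str) -> bool:
    return label.startswith("Observation")

def relation_is_valid(rel: str, head: Entity, tail: Entity) -> bool:
    """Check a directed edge head -> tail against the schema constraints."""
    if rel == "located_at":     # Observation -> Anatomy
        return is_observation(head.label) and tail.label == "Anatomy"
    if rel == "modify":         # restricted to same-type pairs
        return is_observation(head.label) == is_observation(tail.label)
    if rel == "suggestive_of":  # Observation -> Observation
        return is_observation(head.label) and is_observation(tail.label)
    return False
```

For example, `located_at` from a present effusion to an anatomical site validates, while the reversed direction does not.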
2. Annotation Process and Quality Assurance
Annotation is performed by expert annotators—typically board-certified radiologists—on standardized platforms such as Datasaur.ai. The process includes multiple calibration and adjudication steps to ensure clinical and linguistic consistency.
- Pilot Phase: Iterative pilots (~15 reports each), refining schema based on annotator disagreements and edge-case notes.
- Span Guidelines: Strict non-overlapping constraints with a granularity bias: when segmentation is ambiguous, annotators select the smaller spans and recover the combined meaning through explicit relation edges.
- Ambiguity Resolution: Written notes and group calls clarify cases such as device labeling and rare negations.
- Adjudication: Dual-annotation for test reports, recording all disagreements to serve as a human agreement benchmark.
- Inter-Annotator Agreement: Expressed as Cohen’s κ, reporting values of 0.974 (MIMIC-CXR, entities) and 0.841 (MIMIC-CXR, relations) (Jain et al., 2021); for extended schemas, pairwise Cohen’s κ values reach 0.9943–0.9963 (Khanna et al., 2023).
- Edge Labeling: Relations are annotated only when unambiguous; prominence is given to direct, syntactically evident edges.
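The Cohen's κ statistic used above corrects raw agreement for the agreement expected by chance from each annotator's label distribution. A minimal sketch for two parallel label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that span-level κ in practice also requires aligning the two annotators' spans before comparing labels; that alignment step is omitted here.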
3. Dataset Composition and Schema Evolution
The RadGraph dataset includes manually annotated and model-inferred corpora at scale, with subsequent releases (e.g., RadGraph2) introducing hierarchical labeling and new clinical event types.
| Dataset | Reports | Entities | Relations | Notes |
|---|---|---|---|---|
| RadGraph-dev | 500 | 14,579 | 10,889 | Manually labeled, MIMIC-CXR |
| RadGraph-test | 100 | ~1,300–1,500/annot. | ~900–1,100/annot. | Doubly labeled, MIMIC & CheXpert |
| Inference dataset | 220,763 | ~6 million | ~4 million | Auto-labeled, MIMIC-CXR |
| RadGraph2 | 800 | 23,457 | 17,373 | Explicit "Change" entities |
The entity taxonomy in RadGraph2 expands the schema to annotate disease and device progression: hierarchical entity types (e.g., "CHAN-CON-WOR" for condition worsening, "CHAN-DEV-AP" for new device), with explicit multi-level probabilities and a tree-structured loss during model training (Khanna et al., 2023).
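The multi-level factorization behind the tree-structured loss can be sketched as a chain of conditional probabilities along a label's taxonomy path (e.g., CHAN → CON → WOR). The conditional probability values below are made-up placeholders, and `path_nll` is an illustrative name:

```python
import math

def path_nll(label: str, cond_probs: dict) -> float:
    """Negative log-likelihood of a hierarchical label, factorized as a
    chain of conditionals along its taxonomy path:
    -log P(leaf) = -sum_k log P(level_k | level_{k-1}).
    cond_probs maps each prefix (e.g., "CHAN", "CHAN-CON", "CHAN-CON-WOR")
    to the model's conditional probability for that node."""
    parts = label.split("-")
    prefixes = ["-".join(parts[: i + 1]) for i in range(len(parts))]
    return -sum(math.log(cond_probs[p]) for p in prefixes)
```

Optimizing this quantity penalizes errors at every level of the hierarchy, so a prediction that gets the coarse "change" category right but the fine-grained direction wrong is punished less than one that misses the category entirely.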
4. Benchmark Models and Evaluation Metrics
RadGraph Benchmark employs the DYGIE++ framework for joint entity and relation extraction. Architectural details include:
- Encoder: PubMedBERT or other domain-pretrained BERT variants.
- Span Representation: Candidate spans up to length 3 tokens.
- Joint Modeling: Simultaneous prediction of entity types and directed graph edges.
- Hierarchical Loss (RadGraph2): Conditional probability chains over entity taxonomy, optimizing both leaf and internal nodes (Khanna et al., 2023).
Evaluation
All metrics use strict span and type matching for entities, and strict endpoint and type matching for relations.
| Task | Dataset | Micro-F1 (Entities) | Micro-F1 (Relations) |
|---|---|---|---|
| RadGraph Benchmark | MIMIC-CXR | 0.94 | 0.82 |
| RadGraph Benchmark | CheXpert | 0.91 | 0.73 |
| Human benchmark (κ) | MIMIC-CXR | 0.99 | 0.95 |
| Human benchmark (κ) | CheXpert | 0.93 | 0.75 |
| RadGraph2 - HGIE (rel.) | MIMIC-CXR | — | ~0.879 |
Performance is summarized as micro- and macro-F1 across classes, with error analysis referenced to the doubly labeled test set (Jain et al., 2021, Khanna et al., 2023, Yan et al., 2023).
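Strict matching means a predicted entity scores only if its exact span boundaries and type both match a gold entity. A minimal micro-F1 sketch over such triples (the representation as `(start, end, type)` tuples is illustrative):

```python
def micro_f1(gold, pred):
    """Micro-F1 under strict matching: an entity counts as correct only if
    its (start, end, type) triple matches a gold triple exactly. The same
    function applies to relations encoded as (head_span, tail_span, type)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact-match true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this criterion a span that is off by one token scores zero, which is why entity F1 typically exceeds relation F1: a relation inherits the strict-match requirement of both of its endpoints.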
5. Practical Annotation Examples
Annotation proceeds as token-span selection followed by drawing relation arcs between spans. For instance:
Report: “Moderate right pleural effusion with adjacent consolidation; no pneumothorax.”
- Entities:
- "Moderate" → Observation:Definitely Present
- "right" → Anatomy
- "pleural effusion" → Observation:Definitely Present
- "adjacent" → Observation:Definitely Present
- "consolidation" → Observation:Definitely Present
- "pneumothorax" → Observation:Definitely Absent (the negating "no" is not itself annotated; negation is carried by the label)
- Relations:
- modify("Moderate"→"pleural effusion")
- located_at("pleural effusion"→"right")
- modify("adjacent"→"consolidation")
- "pneumothorax" stands as an isolated negative finding, with no outgoing relations
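The worked example above can be serialized in the style of RadGraph's released JSON (keyed entity dicts carrying per-entity relation lists). Field names and span conventions here are illustrative approximations of the public release, not an exact reproduction:

```python
# Entities from the example report, keyed by ID; each entity lists its
# outgoing [relation_type, target_id] edges.
entities = {
    "1": {"tokens": "Moderate", "label": "Observation:Definitely Present",
          "relations": [["modify", "3"]]},
    "2": {"tokens": "right", "label": "Anatomy", "relations": []},
    "3": {"tokens": "pleural effusion", "label": "Observation:Definitely Present",
          "relations": [["located_at", "2"]]},
    "4": {"tokens": "adjacent", "label": "Observation:Definitely Present",
          "relations": [["modify", "5"]]},
    "5": {"tokens": "consolidation", "label": "Observation:Definitely Present",
          "relations": []},
    # Negation is carried by the label, not the span; no outgoing edges.
    "6": {"tokens": "pneumothorax", "label": "Observation:Definitely Absent",
          "relations": []},
}

# Flatten to (head_tokens, relation, tail_tokens) triples for inspection.
edges = [(e["tokens"], rel, entities[tid]["tokens"])
         for e in entities.values() for rel, tid in e["relations"]]
```

The flattened `edges` list recovers exactly the three relation arcs drawn in the example.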
The extended RadGraph2 schema additionally labels temporal and comparative changes, e.g.,
- "new" → CHAN-CON-AP
- "persists" → CHAN-NC
- "increased" → CHAN-CON-WOR, with modify edges to the changed entity (Jain et al., 2021, Khanna et al., 2023).
6. Applications and Extensions
RadGraph annotation underlies various downstream applications:
- Large-scale automatic labeling: Auto-annotation of hundreds of thousands of chest radiograph reports via trained RadGraph Benchmark or HGIE models facilitates secondary research in medical NLP and imaging (Jain et al., 2021, Khanna et al., 2023).
- Style-aware report generation: RadGraph graphs can be extracted from images by a vision-language encoder-decoder, then verbalized into free text by prompting LLMs, enabling disentanglement of report content from radiologist style (Yan et al., 2023).
- Progression tracking and device monitoring: RadGraph2 enables automated detection and linkage of disease or device evolution across serial studies, via a hierarchical change taxonomy (Khanna et al., 2023).
- Evaluation metrics: Graph F1 scores (node and edge matching) serve as objective measures of model content fidelity in both extraction and generation settings, with human annotation agreement (Cohen’s κ > 0.80) validating reproducibility (Jain et al., 2021, Yan et al., 2023).
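The graph F1 idea above compares a predicted graph to a reference graph by matching nodes (entity text plus label) and edges (endpoint pair plus relation type). A minimal sketch; averaging the two component scores into one number is an assumption here, since published work often reports entity and relation F1 separately:

```python
def graph_f1(ref_nodes, hyp_nodes, ref_edges, hyp_edges):
    """Content-fidelity sketch: F1 over matched nodes plus F1 over matched
    edges, averaged. Nodes are (text, label) pairs; edges are
    (head_text, relation, tail_text) triples."""
    def f1(ref, hyp):
        ref, hyp = set(ref), set(hyp)
        tp = len(ref & hyp)
        if tp == 0:
            return 0.0
        p, r = tp / len(hyp), tp / len(ref)
        return 2 * p * r / (p + r)
    return (f1(ref_nodes, hyp_nodes) + f1(ref_edges, hyp_edges)) / 2
```

Because the score is computed on extracted graphs rather than raw text, a generated report phrased differently from the reference can still score perfectly if it conveys the same clinical content.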
7. Significance and Future Directions
RadGraph annotation provides a clinically precise, machine-actionable representation of radiological findings, supporting reproducible research standards and interoperability. Its iterative, radiologist-in-the-loop protocol, strict consensus metrics, and extensible schema allow adaptation to new modalities (e.g., CT, MRI), multi-institutional data, and integration with vision-language learning frameworks.
The introduction of hierarchical schemas in RadGraph2 and the coupling with LLMs for style-aware text generation represent substantive advances. Ongoing challenges include maintaining high inter-annotator agreement for subtle relation types, adapting schemas to additional imaging contexts, and optimizing for both extraction fidelity and generation fluency (Khanna et al., 2023, Yan et al., 2023).