RadGraph Annotation
- RadGraph annotation is a structured methodology that maps free-text radiology reports into normalized clinical facts using a graph-based formalism.
- It employs expert annotators with strict calibration and high inter-annotator agreement (Cohen’s κ up to 0.99) to ensure data quality.
- The framework enables automated labeling, style-aware report generation, and disease progression tracking, advancing medical NLP applications.
RadGraph annotation is a methodology for systematically structuring and extracting clinical entities and their semantic relations from free-text radiology reports. Designed initially for chest X-ray reports, the framework underpins datasets, benchmark models, and information extraction pipelines widely adopted in medical NLP research. At its core, RadGraph provides a graph-based formalism mapping unstructured language to normalized, relational clinical facts, facilitating downstream applications ranging from automated report understanding to style-sensitive text generation and disease progression tracking (Jain et al., 2021, Khanna et al., 2023, Yan et al., 2023).
1. RadGraph Annotation Schema
The RadGraph schema formalizes the essential entities and directed relations comprising the clinical content of radiology reports. Entities and relations are annotated in a way that captures not only observed findings but also their clinical assertions and context.
Entity Types
- Anatomy: Any reference to an anatomic region or substructure (e.g., "lung," "pleura," "mediastinum").
- Observation:Definitely Present: Findings or pathological states explicitly marked as present (e.g., "consolidation," "effusion").
- Observation:Uncertain: Descriptors reflecting diagnostic uncertainty or differentials (e.g., "possible pneumothorax," "cannot exclude lymphadenopathy").
- Observation:Definitely Absent: Findings explicitly negated by the radiologist (e.g., "no pneumothorax," "without focal consolidation").
Relation Types
- located_at (Observation → Anatomy): Links a clinical observation to its anatomical site (e.g., "pleural effusion" located_at "right lung base").
- modify: Associates modifying entities to their targets, restricted to same-type pairs (Anatomy→Anatomy or Observation→Observation); includes degree/descriptor modifiers and spatial/anatomical qualifiers (e.g., "moderate" modify "pleural effusion").
- suggestive_of (Observation → Observation): Encodes inferential relationships in which one observation implies or suggests another (e.g., "ground-glass opacity" suggestive_of "early pneumonia") (Jain et al., 2021, Yan et al., 2023).
Span and Nesting
Entities are annotated over contiguous token spans (internally via an IOB tagging scheme). Spans do not overlap, except that same-type containment is permitted where necessary; cross-type nesting is disallowed (e.g., Anatomy and Observation spans remain strictly disjoint).
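The entity and relation constraints above can be sketched as a small typed data model. This is an illustrative encoding, not part of any RadGraph release; the names `Entity` and `relation_is_valid` are hypothetical:

```python
from dataclasses import dataclass

# Entity and relation labels as defined by the RadGraph schema.
ENTITY_TYPES = {
    "Anatomy",
    "Observation:Definitely Present",
    "Observation:Uncertain",
    "Observation:Definitely Absent",
}
RELATION_TYPES = {"located_at", "modify", "suggestive_of"}

@dataclass(frozen=True)
class Entity:
    tokens: str   # surface text of the contiguous span
    label: str    # one of ENTITY_TYPES
    start: int    # token index of span start (inclusive)
    end: int      # token index of span end (inclusive)

def is_observation(label: str) -> bool:
    return label.startswith("Observation")

def relation_is_valid(rel: str, head: Entity, tail: Entity) -> bool:
    """Check a directed edge head -> tail against the schema constraints."""
    if rel == "located_at":     # Observation -> Anatomy
        return is_observation(head.label) and tail.label == "Anatomy"
    if rel == "modify":         # restricted to same-type pairs
        return is_observation(head.label) == is_observation(tail.label)
    if rel == "suggestive_of":  # Observation -> Observation
        return is_observation(head.label) and is_observation(tail.label)
    return False
```

For example, `located_at` from a present effusion to an anatomical site validates, while the reversed direction does not.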
2. Annotation Process and Quality Assurance
Annotation is performed by expert annotators—typically board-certified radiologists—on standardized platforms such as Datasaur.ai. The process includes multiple calibration and adjudication steps to ensure clinical and linguistic consistency.
- Pilot Phase: Iterative pilots (~15 reports each), refining schema based on annotator disagreements and edge-case notes.
- Span Guidelines: Strict non-overlapping constraints with a granularity bias: when segmentation is ambiguous, annotators select the smaller spans and recover the combined meaning through explicit relation edges.
- Ambiguity Resolution: Written notes and group calls clarify cases such as device labeling and rare negations.
- Adjudication: Dual-annotation for test reports, recording all disagreements to serve as a human agreement benchmark.
- Inter-Annotator Agreement: Expressed as Cohen’s κ, reporting values of 0.974 (MIMIC-CXR, entities) and 0.841 (MIMIC-CXR, relations) (Jain et al., 2021); for extended schemas, pairwise Cohen’s κ values reach 0.9943–0.9963 (Khanna et al., 2023).
- Edge Labeling: Relations are annotated only when unambiguous; prominence is given to direct, syntactically evident edges.
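The Cohen's κ statistic used above corrects raw agreement for the agreement expected by chance from each annotator's label distribution. A minimal sketch for two parallel label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that span-level κ in practice also requires aligning the two annotators' spans before comparing labels; that alignment step is omitted here.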
3. Dataset Composition and Schema Evolution
The RadGraph dataset includes manually annotated and model-inferred corpora at scale, with subsequent releases (e.g., RadGraph2) introducing hierarchical labeling and new clinical event types.
| Dataset | Reports | Entities | Relations | Notes |
|---|---|---|---|---|
| RadGraph-dev | 500 | 14,579 | 10,889 | Manually labeled, MIMIC-CXR |
| RadGraph-test | 100 | ~1,300–1,500/annot. | ~900–1,100/annot. | Doubly labeled, MIMIC & CheXpert |
| Inference dataset | 220,763 | ~6 million | ~4 million | Auto-labeled, MIMIC-CXR |
| RadGraph2 | 800 | 23,457 | 17,373 | Explicit "Change" entities |
The entity taxonomy in RadGraph2 expands the schema to annotate disease and device progression: hierarchical entity types (e.g., "CHAN-CON-WOR" for condition worsening, "CHAN-DEV-AP" for new device), with explicit multi-level probabilities and a tree-structured loss during model training (Khanna et al., 2023).
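The multi-level factorization behind the tree-structured loss can be sketched as a chain of conditional probabilities along a label's taxonomy path (e.g., CHAN → CON → WOR). The conditional probability values below are made-up placeholders, and `path_nll` is an illustrative name:

```python
import math

def path_nll(label: str, cond_probs: dict) -> float:
    """Negative log-likelihood of a hierarchical label, factorized as a
    chain of conditionals along its taxonomy path:
    -log P(leaf) = -sum_k log P(level_k | level_{k-1}).
    cond_probs maps each prefix (e.g., "CHAN", "CHAN-CON", "CHAN-CON-WOR")
    to the model's conditional probability for that node."""
    parts = label.split("-")
    prefixes = ["-".join(parts[: i + 1]) for i in range(len(parts))]
    return -sum(math.log(cond_probs[p]) for p in prefixes)
```

Optimizing this quantity penalizes errors at every level of the hierarchy, so a prediction that gets the coarse "change" category right but the fine-grained direction wrong is punished less than one that misses the category entirely.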
4. Benchmark Models and Evaluation Metrics
RadGraph Benchmark employs the DYGIE++ framework for joint entity and relation extraction. Architectural details include:
- Encoder: PubMedBERT or other domain-pretrained BERT variants.
- Span Representation: Candidate spans up to length 3 tokens.
- Joint Modeling: Simultaneous prediction of entity types and directed graph edges.
- Hierarchical Loss (RadGraph2): Conditional probability chains over entity taxonomy, optimizing both leaf and internal nodes (Khanna et al., 2023).
Evaluation
All metrics use strict span and type matching for entities, and strict endpoint and type matching for relations.
| Task | Dataset | Micro-F1 (Entities) | Micro-F1 (Relations) |
|---|---|---|---|
| RadGraph Benchmark | MIMIC-CXR | 0.94 | 0.82 |
| RadGraph Benchmark | CheXpert | 0.91 | 0.73 |
| Human benchmark (κ) | MIMIC-CXR | 0.99 | 0.95 |
| Human benchmark (κ) | CheXpert | 0.93 | 0.75 |
| RadGraph2 - HGIE (rel.) | MIMIC-CXR | — | ~0.879 |
Performance is summarized as micro- and macro-F1 across classes, with error analysis referenced to the doubly labeled test set (Jain et al., 2021, Khanna et al., 2023, Yan et al., 2023).
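Strict matching means a predicted entity scores only if its exact span boundaries and type both match a gold entity. A minimal micro-F1 sketch over such triples (the representation as `(start, end, type)` tuples is illustrative):

```python
def micro_f1(gold, pred):
    """Micro-F1 under strict matching: an entity counts as correct only if
    its (start, end, type) triple matches a gold triple exactly. The same
    function applies to relations encoded as (head_span, tail_span, type)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact-match true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this criterion a span that is off by one token scores zero, which is why entity F1 typically exceeds relation F1: a relation inherits the strict-match requirement of both of its endpoints.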
5. Practical Annotation Examples
Annotation proceeds as token-span selection followed by drawing relation arcs between spans. For instance:
Report: “Moderate right pleural effusion with adjacent consolidation; no pneumothorax.”
- Entities:
- "Moderate" → Observation:Definitely Present
- "right" → Anatomy
- "pleural effusion" → Observation:Definitely Present
- "adjacent" → Observation:Definitely Present
- "consolidation" → Observation:Definitely Present
- "pneumothorax" → Observation:Definitely Absent (the negating "no" is not itself annotated; negation is carried by the label)
- Relations:
- modify("Moderate"→"pleural effusion")
- located_at("pleural effusion"→"right")
- modify("adjacent"→"consolidation")
- "pneumothorax" stands as an isolated negative finding, with no outgoing relations
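The worked example above can be serialized in the style of RadGraph's released JSON (keyed entity dicts carrying per-entity relation lists). Field names and span conventions here are illustrative approximations of the public release, not an exact reproduction:

```python
# Entities from the example report, keyed by ID; each entity lists its
# outgoing [relation_type, target_id] edges.
entities = {
    "1": {"tokens": "Moderate", "label": "Observation:Definitely Present",
          "relations": [["modify", "3"]]},
    "2": {"tokens": "right", "label": "Anatomy", "relations": []},
    "3": {"tokens": "pleural effusion", "label": "Observation:Definitely Present",
          "relations": [["located_at", "2"]]},
    "4": {"tokens": "adjacent", "label": "Observation:Definitely Present",
          "relations": [["modify", "5"]]},
    "5": {"tokens": "consolidation", "label": "Observation:Definitely Present",
          "relations": []},
    # Negation is carried by the label, not the span; no outgoing edges.
    "6": {"tokens": "pneumothorax", "label": "Observation:Definitely Absent",
          "relations": []},
}

# Flatten to (head_tokens, relation, tail_tokens) triples for inspection.
edges = [(e["tokens"], rel, entities[tid]["tokens"])
         for e in entities.values() for rel, tid in e["relations"]]
```

The flattened `edges` list recovers exactly the three relation arcs drawn in the example.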
The extended RadGraph2 schema additionally labels temporal and comparative changes, e.g.,
- "new" → CHAN-CON-AP
- "persists" → CHAN-NC
- "increased" → CHAN-CON-WOR, with modify edges to the changed entity (Jain et al., 2021, Khanna et al., 2023).
6. Applications and Extensions
RadGraph annotation underlies various downstream applications:
- Large-scale automatic labeling: Auto-annotation of hundreds of thousands of chest radiograph reports via trained RadGraph Benchmark or HGIE models facilitates secondary research in medical NLP and imaging (Jain et al., 2021, Khanna et al., 2023).
- Style-aware report generation: RadGraph graphs can be extracted from images by a vision-language encoder-decoder, then verbalized into free text by prompting LLMs, enabling disentanglement of report content from radiologist style (Yan et al., 2023).
- Progression tracking and device monitoring: RadGraph2 enables automated detection and linkage of disease or device evolution across serial studies, via a hierarchical change taxonomy (Khanna et al., 2023).
- Evaluation metrics: Graph F1 scores (node and edge matching) serve as objective measures of model content fidelity in both extraction and generation settings, with human annotation agreement (Cohen’s κ > 0.80) validating reproducibility (Jain et al., 2021, Yan et al., 2023).
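The graph F1 idea above compares a predicted graph to a reference graph by matching nodes (entity text plus label) and edges (endpoint pair plus relation type). A minimal sketch; averaging the two component scores into one number is an assumption here, since published work often reports entity and relation F1 separately:

```python
def graph_f1(ref_nodes, hyp_nodes, ref_edges, hyp_edges):
    """Content-fidelity sketch: F1 over matched nodes plus F1 over matched
    edges, averaged. Nodes are (text, label) pairs; edges are
    (head_text, relation, tail_text) triples."""
    def f1(ref, hyp):
        ref, hyp = set(ref), set(hyp)
        tp = len(ref & hyp)
        if tp == 0:
            return 0.0
        p, r = tp / len(hyp), tp / len(ref)
        return 2 * p * r / (p + r)
    return (f1(ref_nodes, hyp_nodes) + f1(ref_edges, hyp_edges)) / 2
```

Because the score is computed on extracted graphs rather than raw text, a generated report phrased differently from the reference can still score perfectly if it conveys the same clinical content.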
7. Significance and Future Directions
RadGraph annotation provides a clinically precise, machine-actionable representation of radiological findings, supporting reproducible research standards and interoperability. Its iterative, radiologist-in-the-loop protocol, strict consensus metrics, and extensible schema allow adaptation to new modalities (e.g., CT, MRI), multi-institutional data, and integration with vision-language learning frameworks.
The introduction of hierarchical schemas in RadGraph2 and the coupling with LLMs for style-aware text generation represent substantive advances. Ongoing challenges include maintaining high inter-annotator agreement for subtle relation types, adapting schemas to additional imaging contexts, and optimizing for both extraction fidelity and generation fluency (Khanna et al., 2023, Yan et al., 2023).