CT-RATE Dataset: Chest CT & Reports
- CT-RATE is a comprehensive dataset of non-contrast chest CT volumes and radiology reports, enabling robust multimodal AI research.
- It comprises data from over 21,000 patients with 25,692 scans reconstructed into 50,188 volumes, all standardized through resampling and intensity normalization.
- The dataset underpins derivative benchmarks like RadGenome-Chest CT and CTRATE-IR, driving advancements in segmentation-guided VQA and multi-granular image retrieval.
The CT-RATE dataset is a large-scale, open-source paired collection of non-contrast 3D chest computed tomography (CT) volumes and corresponding radiology reports, designed to enable the training and evaluation of multimodal models for medical image understanding, retrieval, and vision-language tasks. As a foundational resource, it underlies multiple derivative datasets and benchmarks, such as RadGenome-Chest CT (for grounded multimodal reasoning) and CTRATE-IR (for multi-granular retrieval), and serves as a gold standard for the development of generalist medical AI systems using chest CT imaging (Hamamci et al., 2024, Zhang et al., 2024, Zhang et al., 6 Mar 2025).
1. Dataset Composition and Structure
CT-RATE aggregates a cohort of 21,304 unique patients, encompassing 25,692 non-contrast chest CT studies, each paired with a textual radiology report. Each CT study was reconstructed with multiple kernels (typically a lung and a mediastinal kernel), yielding a total of 50,188 distinct volumes, two per scan. The dataset reflects routine clinical practice, with volumes acquired on Philips (61.5%), Siemens (30.1%), and PNMS (8.4%) scanners, in-plane resolutions of 512², 768², or 1024² pixels, and original axial slice thicknesses between 0.5 and 2.5 mm. Patient ages span 18 to 102 years, with balanced representation across sex (41.6% female, 58.4% male) (Hamamci et al., 2024).
Preprocessing standardizes all images by: (1) uniform resampling to 0.75 × 0.75 × 1.5 mm voxel spacing, (2) cropping or padding to 480 × 480 × 240 voxels (coronal × sagittal × axial), (3) conversion to Hounsfield Units followed by intensity normalization to [−1, 1] for network compatibility, and (4) anonymization (removal of protected health information from DICOM headers and from the reports, which were also translated from Turkish to English). Reports are distributed as plain text and images as NIfTI files (Hamamci et al., 2024).
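The intensity-normalization step can be sketched as follows. The [−1000, 1000] HU clipping window used here is an illustrative assumption (the exact window is not specified above); only the clip-then-rescale-to-[−1, 1] pattern is taken from the text.

```python
def normalize_hu(hu_values, hu_min=-1000.0, hu_max=1000.0):
    """Clip Hounsfield Units to a window and rescale to [-1, 1].

    The [-1000, 1000] window is an illustrative assumption; CT-RATE's
    actual clipping range may differ.
    """
    out = []
    for v in hu_values:
        v = max(hu_min, min(hu_max, v))  # clip to the HU window
        out.append(2.0 * (v - hu_min) / (hu_max - hu_min) - 1.0)  # rescale
    return out

# Air (-1000 HU) maps to -1, water (0 HU) to 0, the window's top to 1.
print(normalize_hu([-1000.0, 0.0, 1000.0]))  # [-1.0, 0.0, 1.0]
```

In practice this would be applied voxel-wise to an array (e.g., with NumPy) after resampling; the list version above just makes the mapping explicit.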
| Metric | Value |
|---|---|
| Unique patients | 21,304 |
| CT experiments (scans) | 25,692 |
| Reconstructed volumes | 50,188 (2 per scan) |
| Slices per volume | 100–600 (mean 304.7, mode 255) |
| Total slices (approx.) | ~15.7 million |
| Scanner manufacturers | Philips 61.5%, Siemens 30.1%, PNMS 8.4% |
| In-plane resolutions | 512² px 65.4%, 768² px 4.2%, 1024² px 30.4% |
2. Report Annotation and Label Extraction
Each radiology report in CT-RATE is divided into canonical sections: clinical information, technique, findings, and impression. The core annotation is based on the “Findings” and “Impression” sections. A set of 1,000 reports was manually annotated to train a RadBERT-RoBERTa-4m model for extracting 18 high-level abnormality types (e.g., consolidation, pleural effusion, nodules, ground-glass opacity, and lymphadenopathy). These labels are used both for downstream multi-abnormality detection tasks and, via further processing, for anatomy-specific retrieval and regional evaluation benchmarks (Hamamci et al., 2024, Zhang et al., 6 Mar 2025).
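CT-RATE's labels come from the fine-tuned RadBERT-RoBERTa-4m classifier. As a rough illustration of the task's input/output shape only, a naive keyword matcher over report text might look like the sketch below; the keyword lists are invented for illustration and are far weaker than the learned model (they handle no negation, for instance).

```python
# Hypothetical keyword lists covering a few of the 18 label types;
# the real pipeline uses a fine-tuned RadBERT-RoBERTa-4m classifier.
ABNORMALITY_KEYWORDS = {
    "Consolidation": ["consolidation"],
    "Pleural effusion": ["pleural effusion"],
    "Lung nodule": ["nodule", "nodular"],
    "Ground-glass opacity": ["ground-glass", "ground glass"],
    "Lymphadenopathy": ["lymphadenopathy", "enlarged lymph node"],
}

def extract_labels(report_text):
    """Return the abnormality labels whose keywords appear in the text."""
    text = report_text.lower()
    return sorted(
        label for label, kws in ABNORMALITY_KEYWORDS.items()
        if any(kw in text for kw in kws)
    )

findings = "Ground-glass opacities in both lungs with a 4 mm nodule."
print(extract_labels(findings))  # ['Ground-glass opacity', 'Lung nodule']
```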
Extracted metadata includes patient/study identifiers, manufacturer/model details, protocol name, slice thickness, slice counts, pixel spacing, sex, age (anonymized), and temporally shifted study dates. Each volume’s report is available in plain text for flexible vision-language modeling (Hamamci et al., 2024).
3. Derivative Datasets and Extensions
Multiple derivative datasets extend CT-RATE for specialized tasks:
RadGenome-Chest CT (Zhang et al., 2024) introduces region-grounded, segmentation-linked vision-language supervision:
- 197 organ-level binary segmentation masks per volume, generated using SAT (“One Model to Rule Them All”; text-prompted universal 3D segmentation).
- 665,000 multi-granularity grounded report sentences, each linked to anatomical segmentation masks and labeled using fine-tuned GPT-2 (sentence-to-region accuracy: 94.56%).
- Over 1.3 million grounded visual question-answer (VQA) pairs created via rule-based template instantiation, with answers linked to specific mask regions; question types include abnormality identification, presence (yes/no), anatomical location, size measurement, and case-level disorder summary.
- All region groundings and validation set VQA pairs undergo manual expert verification.
CTRATE-IR (Zhang et al., 6 Mar 2025) leverages CT-RATE for anatomy-conditioned image and report retrieval benchmarks (see Section 5).
4. Data Organization, Split, and Access
- Splitting: The official division comprises 20,000 patients (24,128 volumes) for training and 1,304 patients (1,564 volumes) for validation, with all corresponding reports. A held-out external validation set (RAD-ChestCT) contains 3,630 volumes without paired reports (Hamamci et al., 2024).
- Files and Hierarchy: Data is organized hierarchically by split, patient, and scan; volumes are NIfTI files, reports are text files, and supporting metadata is provided in CSV and JSON files.
- Access and Licensing: CT-RATE is licensed under CC-BY 4.0 and is distributed via the HuggingFace Datasets Hub (pip install datasets; load_dataset("ibrahimhamamci/CT-RATE")); pre-trained models, code, and derivative annotations are released on public repositories (Hamamci et al., 2024, Zhang et al., 6 Mar 2025).
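The split/patient/scan hierarchy can be sketched as a path-building helper. The directory and file naming below is an assumption for illustration; the actual CT-RATE layout on the Hub may differ.

```python
from pathlib import PurePosixPath

def volume_path(split, patient_id, scan_id, recon_id):
    """Illustrative layout: <split>/<patient>/<patient>_<scan>/<volume>.nii.gz.

    The naming scheme here is assumed for illustration; consult the
    dataset repository for the authoritative structure.
    """
    name = f"{patient_id}_{scan_id}_{recon_id}.nii.gz"
    return PurePosixPath(split) / patient_id / f"{patient_id}_{scan_id}" / name

p = volume_path("train", "train_1", "a", "1")
print(p)  # train/train_1/train_1_a/train_1_a_1.nii.gz
```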
5. Benchmarks and Evaluation Protocols
CT-RATE serves as a backbone for several benchmarks:
- Multi-abnormality Detection: CT-CLIP (contrastive language-image pretraining) achieves a mean AUROC of 0.900 ± 0.005 (internal) and 0.874 ± 0.006 (external RAD-ChestCT) in zero-shot detection; fine-tuned variants (CT-VocabFine/CT-LiPro) reach AUROC up to 0.947 (Hamamci et al., 2024).
- Retrieval Tasks:
- Volume-to-volume retrieval: MAP@1 ≈ 0.887 (internal), ≈ 0.993 (external)
- Report-to-volume retrieval: Recall@5 ≈ 0.15 (internal)
- CTRATE-IR Retrieval (RadIR): Three retrieval tasks are defined—unconditional image-to-image, image-to-report, and anatomy-conditioned retrieval. Similarity uses RaTEScore applied to regional text fragments, with metrics including Recall@K and NDCG@K. CTRATE-IR defines ~132 billion region-conditioned image–image pairs (Zhang et al., 6 Mar 2025).
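The retrieval metrics named above can be computed as in the generic sketch below: Recall@K over a set of relevant items and NDCG@K over graded relevance gains (such as RaTEScore-derived similarities). This is a standard formulation, not the benchmark's reference implementation.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(gains_in_ranked_order, k):
    """Normalized discounted cumulative gain over graded relevance gains."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True))
    return dcg(gains_in_ranked_order) / ideal if ideal > 0 else 0.0

ranking = ["vol3", "vol1", "vol7"]
print(recall_at_k(ranking, relevant_ids=["vol1", "vol9"], k=2))  # 0.5
print(round(ndcg_at_k([3.0, 1.0, 2.0], k=3), 3))                 # 0.973
```

A perfectly ordered ranking yields NDCG@K of exactly 1.0, since the ranked gains coincide with the ideal ordering.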
6. Methodological Components and Annotation Pipeline
Image Preprocessing: Raw DICOMs are anonymized, resampled for isotropy, and normalized in Hounsfield Units. Reports are de-identified, translated, and formatted for natural language processing.
Segmented Supervision: In RadGenome-Chest CT, 3D anatomical segmentation is performed with the SAT model, prompted with 197 chest-region terms. Grounded report sentences are produced by a sentence-to-region labeling workflow: GPT-4 annotates a subset of sentences with regions, GPT-2 is fine-tuned on these annotations, and the fine-tuned model is then applied at scale to label the entire dataset (94.56% validation accuracy).
VQA Generation: Rule-based and template methods instantiate region- and case-level question–answer pairs, each with ground-truth anatomical linkage. Five VQA types cover open-format and yes/no abnormalities, localization, measurement, and disorder summarization (Zhang et al., 2024).
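A minimal sketch of rule-based template instantiation, assuming a table of (region, finding) rows like the one RadGenome-Chest CT derives from grounded sentences. The templates and region names below are illustrative, not the dataset's actual wording.

```python
# Illustrative templates; RadGenome-Chest CT's actual templates differ
# and cover five VQA types (presence, location, size, abnormality, summary).
TEMPLATES = {
    "presence": "Is there {finding} in the {region}?",
    "location": "Where is {finding} located?",
}

def make_vqa_pairs(region_findings):
    """Instantiate yes/no and location QA pairs from (region, finding) rows."""
    pairs = []
    for region, finding in region_findings:
        q_presence = TEMPLATES["presence"].format(finding=finding, region=region)
        pairs.append((q_presence, "Yes"))
        pairs.append((TEMPLATES["location"].format(finding=finding), region))
    return pairs

rows = [("right lower lobe", "consolidation")]
for question, answer in make_vqa_pairs(rows):
    print(question, "->", answer)
```

In the real pipeline each answer is additionally linked to the segmentation mask of its region, which is what makes the resulting VQA pairs "grounded".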
Retrieval Annotation (CTRATE-IR): RadGraph-XL extracts 90 core anatomical entities from reports. Regional findings are linked through synonym unification and shallow hierarchy propagation, yielding “(study, anatomy, finding-sentence)” triplets for 2,582,477 instances, enabling multi-granular ranking annotations (Zhang et al., 6 Mar 2025).
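The “(study, anatomy, finding-sentence)” linkage can be sketched as a synonym-unification pass over extracted anatomy mentions. The synonym table below is a toy assumption; the real pipeline extracts entities with RadGraph-XL and propagates over a curated shallow hierarchy.

```python
# Toy canonicalization table; CTRATE-IR uses RadGraph-XL entities plus
# synonym unification and shallow hierarchy propagation.
CANONICAL = {
    "rll": "right lower lobe",
    "right lower lobe": "right lower lobe",
    "lul": "left upper lobe",
}

def build_triplets(study_id, entity_sentences):
    """Map (raw anatomy mention, sentence) pairs to canonical triplets."""
    triplets = []
    for mention, sentence in entity_sentences:
        anatomy = CANONICAL.get(mention.lower())
        if anatomy is not None:  # drop mentions outside the vocabulary
            triplets.append((study_id, anatomy, sentence))
    return triplets

print(build_triplets("study_001", [
    ("RLL", "Consolidation in the right lower lobe."),
    ("spleen", "Spleen unremarkable."),  # outside the chest vocabulary
]))
```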
7. Impact, Applications, and Significance
CT-RATE addresses the longstanding bottleneck of comprehensive, patient-scale, paired 3D imaging and report resources for vision-language medical AI. Its structure supports supervised model training, zero-shot learning, and grounded explainable reasoning through segmentation and region-specific annotation (Hamamci et al., 2024, Zhang et al., 2024). Rich supervision signals—ranging from global impressions to organ-level segmentation and VQA—enable the construction of explainable, generalist foundation models for 3D medical imaging. The dataset’s standardized preprocessing, high granularity, and open access permit robust benchmarking, domain transfer analysis, and direct comparisons between competing architectures.
A plausible implication is that the scale, granularity, and hierarchical anatomical organization of CT-RATE and its extensions will drive further advances in explainable, clinically robust multimodal medical AI, setting a precedent for future medical vision-language benchmarks and retrieval datasets (Hamamci et al., 2024, Zhang et al., 2024, Zhang et al., 6 Mar 2025).