RSNA-MICCAI Radiogenomic MRI Dataset
- The RSNA-MICCAI Radiogenomic Classification Dataset is a multi-institutional mpMRI resource used to predict MGMT promoter status in glioblastoma.
- It comprises 672 subjects with four MRI modalities (T1w, T1wCE, T2w, FLAIR), supporting diverse machine learning and deep learning pipelines.
- The dataset underpins reproducible radiogenomic benchmarking, highlighting modest prediction metrics and challenges in clinical translation.
The RSNA-MICCAI Radiogenomic Classification Dataset is a multi-institutional, publicly released multi-parametric MRI (mpMRI) data resource developed as part of the RSNA-ASNR-MICCAI BraTS 2021 challenge. It establishes a benchmark for the non-invasive prediction of O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation status in glioblastoma and related brain tumors. This dataset underpins radiogenomic biomarker research by providing standardized, high-quality imaging and molecular annotation, supporting the development and evaluation of machine learning and deep learning models targeting clinically actionable molecular characterization in neuro-oncology (Mohamed et al., 2024, Baid et al., 2021, Kollias et al., 2023, Jamil, 11 Jan 2026, Pálsson et al., 2021).
1. Dataset Composition and Cohort Structure
The primary RSNA-MICCAI radiogenomic classification releases consist of subsets from the broader BraTS 2021 repository, in which pre-operative mpMRI scans are paired with binary MGMT status as determined by tissue-based assays (e.g., methylation-specific PCR). The 2021 challenge dataset comprises:
- 672 glioblastoma subjects, split into:
  - Training set: 468 subjects
  - Validation set: 117 subjects
  - Testing set: 87 subjects
Each subject folder contains four mpMRI series: T1-weighted pre-contrast (T1w), T1-weighted post-contrast (T1wCE/T1Gd), T2-weighted (T2w), and fluid-attenuated inversion recovery (FLAIR). All are typically 3D series, provided as DICOM files, with alternative PNG 2D slice mirrors available for models expecting 2D image inputs. No cases of missing or incomplete modalities are reported in the main split (Mohamed et al., 2024).
Larger versions of the dataset, derived from BraTS 2021, comprise up to 2,040 subjects, sourced from both public initiatives (TCGA-GBM, TCGA-LGG, IvyGAP, CPTAC-GBM, ACRIN-FMISO-Brain) and 80+ academic/clinical institutions (Baid et al., 2021, Kollias et al., 2023). For MGMT status, the number of annotated samples usable for model development is smaller because of preprocessing and quality constraints, with papers citing labeled subsets ranging from 585 to 672 subjects (Kollias et al., 2023, Mohamed et al., 2024, Jamil, 11 Jan 2026).
A summary of the core RSNA-MICCAI radiogenomic challenge cohort structure:
| Cohort | Subjects | Modalities per Subject |
|---|---|---|
| Training | 468 | T1w, T1wCE, T2w, FLAIR |
| Validation | 117 | T1w, T1wCE, T2w, FLAIR |
| Testing | 87 | T1w, T1wCE, T2w, FLAIR |
2. Imaging Modalities and Data Organization
All cases in the radiogenomic classification challenge include four co-registered 3D MRI volumes: T1w, T1Gd/T1wCE, T2w, and FLAIR. Image resolution, voxel spacing, and slice thickness vary by institution and scanner, reflecting the diversity of clinical practice. DICOM is the canonical format; for some analysis pipelines, NIfTI conversions are performed, accompanied by rigid or affine registrations to anatomical templates, brain extraction (skull-stripping), and voxel spacing harmonization.
Data directory layouts are consistent with challenge splits:
    /<split>/<subjectID>/<modality>/<sliceIndex>.dcm
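A layout of this shape can be indexed with the standard library alone. The following is a minimal sketch, not part of the challenge tooling: the helper names `slice_key` and `index_split` are hypothetical, the modality folder names are assumed to match the four series labels used in this article, and each file name is assumed to contain an integer slice index.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: modality folders are named after the four series described above.
MODALITIES = ("T1w", "T1wCE", "T2w", "FLAIR")


def slice_key(path):
    """Sort key: the last integer in the file stem, so '10' sorts after '9'."""
    digits = re.findall(r"\d+", path.stem)
    return int(digits[-1]) if digits else -1


def index_split(split_dir):
    """Map subjectID -> modality -> list of .dcm paths ordered by slice index."""
    index = defaultdict(dict)
    for subject_dir in sorted(Path(split_dir).iterdir()):
        if not subject_dir.is_dir():
            continue
        for modality in MODALITIES:
            series_dir = subject_dir / modality
            if series_dir.is_dir():
                files = sorted(series_dir.glob("*.dcm"), key=slice_key)
                index[subject_dir.name][modality] = files
    return dict(index)
```

File-name order is only a proxy for anatomical order; a pipeline that needs the true slice sequence would typically sort on positional metadata read from the DICOM headers instead.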
For segmentation tasks, additional ground-truth label files are included, but these are distinct from the radiogenomic (MGMT) challenge subset, which does not require voxelwise labels (Baid et al., 2021).
3. Preprocessing, Annotation, and Labeling
Preprocessing steps vary across studies utilizing the dataset. Baseline protocols reported include:
- DICOM to 3D NumPy/NIfTI conversion
- Window-level contrast adjustment (VOI LUT)
- Spatial resizing to fixed arrays (e.g., 256×256×64), slice-cropping, and brain extraction in some workflows
- Intensity normalization, via percentile clipping or min-max scaling within regions-of-interest, is frequently applied but not universal
Advanced pipelines (e.g., Jamil, 11 Jan 2026) add N4 bias-field correction, ROI cropping around the segmented tumor, grid harmonization to isotropic voxels (e.g., 2.0 mm), and slice-wise contrast enhancement.
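The intensity normalization step listed among the baseline protocols can be sketched as percentile clipping followed by min-max scaling. This is an illustrative sketch only: the function name `normalize_volume`, the 1st/99th percentile defaults, and the choice to compute percentiles over non-zero voxels are assumptions, not settings reported by the cited studies.

```python
import numpy as np


def normalize_volume(volume, lower_pct=1.0, upper_pct=99.0):
    """Clip to percentiles of the non-zero voxels, then rescale to [0, 1]."""
    volume = np.asarray(volume, dtype=np.float32)
    foreground = volume[volume > 0]  # assumption: zero marks background
    if foreground.size == 0:
        return np.zeros_like(volume)
    lo, hi = (float(v) for v in np.percentile(foreground, [lower_pct, upper_pct]))
    if hi <= lo:
        # Constant foreground: nothing to rescale.
        return np.zeros_like(volume)
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / (hi - lo)
```

Restricting the percentiles to non-zero voxels keeps a large empty background from dominating the clipping range, which matters most after skull-stripping.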
MGMT ground truth labels are determined by standard tissue assays (methylation-specific PCR or pyrosequencing), binarized as “methylated” (1) or “unmethylated” (0). The definition of methylation may depend on the particular molecular assay used and threshold convention (e.g., >10% CpG methylation for pyrosequencing, ≥2% for bisulfite sequencing); however, the public challenge data present only the binary outcome. No multi-reader QC or consensus annotation for MGMT status is described.
Segmentation masks—when available for auxiliary tasks—are generated via ensemble fusion of leading deep segmentation models, refined and validated by expert neuroradiologists (Baid et al., 2021, Jamil, 11 Jan 2026).
4. Use in Machine Learning Pipelines
The RSNA-MICCAI dataset underlies both radiomics-based and deep learning MGMT classification algorithms. Distinct methodologies include:
- Radiomics approaches: Tumor segmentation (auto or manual), extraction of intensity, texture, shape (PyRadiomics), and engineered features (e.g., 3D HOG, FFT statistics), followed by selection (e.g., Fisher’s exact test binarization) and classification (e.g., random forest, MLP). Latent features from unsupervised models—such as 3D variational autoencoders—have been explored but can overfit on small or non-diverse datasets (Pálsson et al., 2021, Jamil, 11 Jan 2026).
- Deep learning approaches: 2D and 3D CNN architectures (ResNet, EfficientNet, Xception, ViT3D), sometimes augmented by RNNs or attention modules to aggregate volumetric context. Methods address input variability in volume depth via padding, cropping, or dynamic routing (as in BTDNet (Kollias et al., 2023)).
- Hybrid fusion models: Combine handcrafted radiomic descriptors and CNN-learned deep features at the fully connected classifier level, using fusion mechanisms such as early concatenation, attention, or ensemble learning (e.g., CatBoost, XGBoost) (Jamil, 11 Jan 2026).
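The padding and cropping used to handle variable volume depth can be illustrated with a short sketch. It assumes a `(depth, height, width)` array layout and uses a hypothetical helper name, `fix_depth`; it shows the simple center-crop or zero-pad strategy rather than the dynamic routing used in BTDNet.

```python
import numpy as np


def fix_depth(volume, target_depth):
    """Center-crop or zero-pad a (depth, height, width) volume along axis 0."""
    volume = np.asarray(volume)
    depth = volume.shape[0]
    if depth >= target_depth:
        # Keep the central target_depth slices.
        start = (depth - target_depth) // 2
        return volume[start:start + target_depth]
    # Split the missing slices between the two ends of the stack.
    pad_total = target_depth - depth
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before
    return np.pad(volume, ((pad_before, pad_after), (0, 0), (0, 0)), mode="constant")
```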
Label supervision is strictly at the whole-volume level; slice-level ground truth for MGMT status is not present, and models employ aggregation methods accordingly.
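Because labels exist only per volume, a slice-based model needs some rule for turning per-slice outputs into one subject-level score. A minimal sketch of three common pooling rules follows; the function name `aggregate_slices` and its options are hypothetical, and the cited studies may use different or learned aggregation.

```python
from statistics import mean


def aggregate_slices(slice_probs, method="mean", top_k=3):
    """Pool per-slice methylation probabilities into one volume-level score."""
    if not slice_probs:
        raise ValueError("at least one slice probability is required")
    if method == "mean":
        return mean(slice_probs)
    if method == "max":
        return max(slice_probs)
    if method == "top_k_mean":
        # Average only the k most confident slices.
        k = min(top_k, len(slice_probs))
        return mean(sorted(slice_probs, reverse=True)[:k])
    raise ValueError(f"unknown method: {method}")
```

Mean pooling treats every slice as equally informative, whereas max and top-k pooling let a few tumor-bearing slices dominate; which behaves better on this dataset is an empirical question.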
5. Evaluation Metrics and Benchmark Results
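The primary evaluation metric for the challenge is the area under the receiver operating characteristic curve (AUC):
$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\; d(\mathrm{FPR}),$$
where $\mathrm{TPR} = \frac{TP}{TP + FN}$ and $\mathrm{FPR} = \frac{FP}{FP + TN}$.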
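For a finite set of predictions, the AUC equals the probability that a randomly chosen methylated case receives a higher score than a randomly chosen unmethylated case, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name `roc_auc` is illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking, which is the reference point for the benchmark results reported in this section.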
Other metrics reported for some tasks include accuracy, F-score (macro and class-specific), and the Matthews correlation coefficient (MCC). Binary cross-entropy, focal loss, and composite ensemble losses are variously reported across studies as optimization objectives.
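The macro F-score and the MCC both follow from the binary confusion counts. The sketch below uses the standard definitions with hypothetical helper names; returning 0.0 when a denominator vanishes is a convention adopted here, not something specified by the challenge.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = methylated."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    f1_methylated = f1_from_counts(tp, fp, fn)
    # With class 0 as the target class, TN/FN/FP play the roles of TP/FP/FN.
    f1_unmethylated = f1_from_counts(tn, fn, fp)
    return (f1_methylated + f1_unmethylated) / 2


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, both metrics penalize a classifier that predicts only the majority class, which is relevant when class balance varies between splits.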
Key performance results (MGMT classification, within respective challenge protocols):
- 3D Vision Transformer (ViT3D) and Xception achieved AUCs of 0.6015 and 0.61745, respectively, on the RSNA-MICCAI held-out test set (Mohamed et al., 2024).
- Radiomics-only ML pipelines reported validation set AUCs up to 0.632 (Pálsson et al., 2021).
- BTDNet (CNN–RNN modulation) achieved 66.2% macro F₁ (cross-validation) (Kollias et al., 2023).
- Hybrid radiomics–deep fusion models attained cross-validated AUC = 0.871, ACC = 0.866 on the intersection cohort of 663 subjects; external test AUC = 0.82 (Jamil, 11 Jan 2026).
These results document the difficulty of MGMT radiogenomic classification, highlight the modest discriminative power of mpMRI alone, and illustrate performance gains from model ensembling and attention-based, multi-modal fusion.
6. Data Access, Licensing, and Restrictions
The challenge data are openly downloadable for non-commercial research:
- Primary DICOM dataset: Kaggle RSNA-MICCAI challenge page
- PNG mirror for slice-based models: Kaggle rsna-miccai-png dataset
- Use is governed by the RSNA-MICCAI challenge terms, stipulating non-commercial research only, required data acknowledgments, and compliance with institutional ethical (IRB) approval as necessary. Public test performance assessments are restricted: no external models or datasets may be blended for leaderboard submissions (Mohamed et al., 2024, Baid et al., 2021).
Demographic and other metadata—including exact counts of methylated/unmethylated samples, age distributions, and site identifiers—are not detailed in the primary publications. For such information, users are directed to the challenge’s Kaggle site and associated forums or to the underlying TCIA/BraTS data documentation.
7. Significance, Limitations, and Research Trajectory
The RSNA-MICCAI Radiogenomic Classification Dataset constitutes a foundational benchmark for non-invasive, AI-driven molecular imaging in glioblastoma. Its key features—multi-institutional coverage, pre-operative 3D mpMRI, tissue-derived MGMT labels, and public challenge results—directly enable the development, comparison, and reproducibility of radiogenomic models.
Notable limitations include the moderate cohort size for MGMT annotation, lack of slice- or region-level molecular labeling, potential site- and protocol-induced confounds, and the challenging nature of the molecular prediction task (modest AUCs for unimodal/deep networks). The dataset’s fixed splits limit cross-validation, and class balance may fluctuate at the per-split or institution level. Feature harmonization and intensity standardization remain open challenges (Pálsson et al., 2021, Jamil, 11 Jan 2026).
Suggested future directions include the integration of additional clinical or omics data, advanced intensity harmonization, semi-supervised annotation, and improved multi-modal fusion architectures. Enhanced explainability tools, such as Grad-CAM and SHAP, are being adopted to interpret model predictions, highlight clinically relevant imaging correlates, and facilitate clinical translation (Jamil, 11 Jan 2026).
The RSNA-MICCAI dataset and challenge structure thus provide a rigorous empirical substrate for advancing reproducible, interpretable, and generalizable radiogenomic biomarker discovery in brain tumor research (Mohamed et al., 2024, Baid et al., 2021, Kollias et al., 2023, Jamil, 11 Jan 2026, Pálsson et al., 2021).