Burdenko GBM Progression Cohort
- The Burdenko GBM Progression Cohort is a rigorously curated collection of clinical, imaging, and molecular data from 180 glioblastoma patients, designed to differentiate true tumor progression from treatment-related pseudoprogression.
- It leverages stage-specific imaging at early and late post-radiotherapy timepoints along with detailed molecular and dosimetric information to benchmark deep learning models, including CNNs and transformers.
- The cohort supports multimodal fusion studies, particularly through a 59-patient subset, enabling enhanced performance evaluation with metrics like accuracy, macro-F1, and AUC.
The Burdenko GBM Progression Cohort is a rigorously curated collection of clinical, imaging, and molecular data from adult patients with newly diagnosed glioblastoma, assembled at the Burdenko National Medical Research Center. It is specifically designed for research into the differentiation of true tumor progression (TP) and treatment-related pseudoprogression (PsP) following standard chemoradiotherapy, with particular emphasis on benchmarking deep learning algorithms and developing multimodal prediction tools for post-radiotherapy MRI. The cohort underpins recent advances in stage-specific benchmarking and multimodal transformer-based classification for glioblastoma progression, making it an emerging benchmark for method development in this domain (Guo et al., 23 Nov 2025, Gomaa et al., 6 Feb 2025).
1. Cohort Composition and Eligibility Criteria
The core Burdenko GBM Progression Cohort comprises 180 adult patients with a primary diagnosis of WHO grade IV glioblastoma who underwent standard radiotherapy between 2014 and 2020 (TCIA DOI:10.7937/E1QP-D183). Each patient is represented by at least one post-RT contrast-enhanced T1-weighted MRI (T1C), with analyses focusing on two critical post-therapy follow-up stages: (1) an early scan at 3–4 weeks after radiotherapy, prior to adjuvant chemotherapy, and (2) a later scan at approximately 2–3 months post-RT, following combined adjuvant chemoradiotherapy.
Inclusion criteria are: (i) histopathological confirmation of glioblastoma, (ii) completion of standard radiotherapy (with or without concurrent temozolomide), (iii) availability of at least one high-quality post-RT T1C MRI. Imaging-level exclusion criteria include non-physical T1C geometry (voxel spacing outside [0.3, 6.0] mm, aspect ratio >4.0, slice thickness <0.3×mean in-plane spacing) and insufficient visual clarity (determined by per-series scoring, retaining the top-ranked series per patient and timepoint). While the aggregate cohort demographics (age, sex, MGMT-methylation status, and performance status) are available in the public release, this breakdown is not reported in stage-specific benchmarking (Guo et al., 23 Nov 2025).
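The imaging-level geometry checks above can be sketched as a small filter. This is an illustrative reconstruction, not the released QC code: the function name is hypothetical, and "aspect ratio" is interpreted here as the max/min ratio of voxel spacings, which the source does not specify.

```python
def passes_geometry_qc(spacing_mm):
    """Hypothetical geometric QC mirroring the stated exclusion rules:
    voxel spacing within [0.3, 6.0] mm, aspect ratio <= 4.0 (assumed to
    mean max/min voxel spacing), and slice thickness >= 0.3x the mean
    in-plane spacing."""
    sx, sy, sz = spacing_mm  # in-plane (x, y) spacing and slice thickness (z), mm
    if not all(0.3 <= s <= 6.0 for s in (sx, sy, sz)):
        return False
    if max(spacing_mm) / min(spacing_mm) > 4.0:
        return False
    if sz < 0.3 * ((sx + sy) / 2.0):
        return False
    return True
```

Series that fail any check are excluded before the per-series clarity scoring described above.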
A 59-patient, comprehensively phenotyped subset was used for multimodal transformer-based studies, with detailed documentation of age (median 57, range 18–82), gender (47.5% female, 52.5% male), IDH and MGMT molecular status, progression labels (57.6% TP, 42.4% PsP), and associated clinical/RT planning variables (Gomaa et al., 6 Feb 2025).
2. Imaging and Multi-Modal Data Acquisition
The full Burdenko dataset includes multiparametric MRI (T1, T1-Gd, T2, T2-FLAIR), diffusion-weighted and perfusion MRI, planning and diagnostic CT, and associated molecular data. Stage-specific benchmarking focused exclusively on T1C volumes, acquired at both early and late post-RT stages.
For multimodal transformer models, subsets require both T1-weighted post-contrast (T1-CE) and T2-FLAIR MR sequences, planning CT, simulation MRI, and DICOM RT objects encoding the 3D dose distribution. Lesion segmentation involved nnU-Net models trained on BraTS2021, with rigid registration to planning CT using ANTs. Dose mapping was performed by projecting tumor masks into the RT dose domain to extract mean, min, median, and D98 dose statistics.
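The dose-statistic extraction can be sketched as below, assuming the tumor mask has already been projected into the RT dose grid. D98 is taken here as the dose received by at least 98% of masked voxels, i.e. the 2nd percentile of in-mask dose values; the function name is illustrative.

```python
import numpy as np

def tumor_dose_stats(dose_gy, tumor_mask):
    """Summarize the 3D dose distribution inside a (pre-registered) tumor mask.
    D98 = dose received by at least 98% of masked voxels, i.e. the
    2nd percentile of the in-mask dose values (assumed convention)."""
    vals = dose_gy[tumor_mask.astype(bool)]
    return {
        "mean": float(vals.mean()),
        "min": float(vals.min()),
        "median": float(np.median(vals)),
        "D98": float(np.percentile(vals, 2)),
    }
```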
Key MRI preprocessing steps included DICOM-to-NIfTI conversion, geometric and clarity-based quality control, resampling to isotropic 1 mm³ voxels, skull-stripping, spatial normalization, bias-field correction, and z-score normalization within the brain mask. Channel concatenation (for T1-CE and FLAIR) and histogram standardization were utilized for transformer models. RT planning and clinical covariates were preprocessed via z-normalization (continuous) and one-hot encoding (categorical), with SHAP analysis selecting the top predictive features (Gomaa et al., 6 Feb 2025).
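The z-score normalization step restricted to the brain mask can be written as a one-function sketch (assuming the skull-stripping mask is already available; the function name is illustrative):

```python
import numpy as np

def zscore_in_mask(volume, brain_mask):
    """Z-score a volume using mean/std computed only over brain voxels;
    voxels outside the mask are set to zero."""
    mask = brain_mask.astype(bool)
    mu = volume[mask].mean()
    sigma = volume[mask].std()
    out = np.zeros_like(volume, dtype=float)
    out[mask] = (volume[mask] - mu) / sigma
    return out
```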
3. Ground Truth Labeling and Class Assignment
Progression labels are assigned based on clinical and radiologic criteria, with stringent algorithms for cases without histopathological confirmation:
- True Progression (TP): Defined by unequivocal radiologic or clinical evidence of tumor recurrence.
- Pseudoprogression (PsP): Temporal enhancement attributed to treatment effect, stabilizing or improving without radiological or clinical evidence of new tumor growth.
- Additional rules: Any timepoint labeled as “progression” overrides earlier diagnoses; stable disease requires at least three responding or stable scans with no prior progression; patients with limited follow-up (≤2 scans) are assigned based on their last available label.
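The label-resolution rules above can be expressed as a short decision function. This is a sketch of the stated logic only: label strings, ordering conventions, and the handling of mixed non-progression labels are assumptions not spelled out in the source.

```python
def assign_patient_label(timepoint_labels):
    """Sketch of the stated label-resolution rules (label names illustrative).
    `timepoint_labels` is the chronological list of per-scan labels."""
    # Any "progression" call overrides earlier diagnoses.
    if "progression" in timepoint_labels:
        return "progression"
    # Limited follow-up (<= 2 scans): fall back to the last available label.
    if len(timepoint_labels) <= 2:
        return timepoint_labels[-1]
    # Stable disease: >= 3 responding/stable scans with no prior progression.
    if all(l in ("stable", "response") for l in timepoint_labels):
        return "stable"
    return timepoint_labels[-1]
```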
In multimodal studies, only patients with consensus TP or PsP status at the time of follow-up are included, further requiring the presence of both T1-CE and FLAIR sequences and complete RT planning data (Gomaa et al., 6 Feb 2025). Final class balance was skewed toward progression, with PsP as the minority class (exact class counts per benchmarking stage are not reported for the full 180-patient set).
4. Preprocessing and Data Augmentation
Image preprocessing pipelines entail seven main steps: (1) DICOM conversion and indexing, (2) geometric and clarity filtering, (3) resampling to 128³ or 160³ voxels, skull-stripping, and rigid registration to MNI space, (4) intensity z-score normalization, (5) deterministic label-series pairing, (6) training-only oversampling and augmentation (latent-space SMOTE via 3D autoencoder, up to 100 synthetic volumes/class; mild affine and Gaussian perturbations), and (7) non-augmented validation/testing (Guo et al., 23 Nov 2025). For multimodal transformer models, rigorous spatial normalization, bias correction, and histogram standardization are applied, with all volumes cropped to uniform dimensions prior to training.
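The latent-space SMOTE step (step 6) interpolates between a minority-class autoencoder code and one of its nearest same-class neighbours. A minimal sketch, assuming the 3D autoencoder codes are already computed and ignoring the decoder pass back to image space (function name and neighbour count are illustrative):

```python
import numpy as np

def latent_smote(latents, n_synthetic, k=5, rng=None):
    """Minimal latent-space SMOTE: for each synthetic sample, pick a
    latent vector of the minority class, choose one of its k nearest
    same-class neighbours, and interpolate at a random fraction.
    `latents` is an (N, D) array of autoencoder codes for one class."""
    rng = np.random.default_rng(rng)
    n = len(latents)
    # Pairwise Euclidean distances within the class; exclude self-matches.
    d = np.linalg.norm(latents[:, None, :] - latents[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per point
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(n)
        j = nn[i, rng.integers(min(k, n - 1))]
        lam = rng.random()  # interpolation fraction in [0, 1)
        out.append(latents[i] + lam * (latents[j] - latents[i]))
    return np.stack(out)
```

In the actual pipeline the synthetic codes would then be decoded into 3D volumes (up to 100 per class) before the mild affine and Gaussian perturbations.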
Dataset splits for benchmarking and predictive modeling are based on patient-level five-fold cross-validation (stratified by final label). Augmentations and oversampling are confined to training data; strict fold separation, geometry checks, clarity scoring, and audit logs minimize data leakage and enhance reproducibility.
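The patient-level stratified split can be sketched with scikit-learn's `StratifiedKFold` (the wrapper function is illustrative; the source does not name its splitting library):

```python
from sklearn.model_selection import StratifiedKFold

def patient_level_folds(patient_ids, final_labels, n_splits=5, seed=0):
    """Patient-level stratified K-fold: each patient (with all of their
    series) lands in exactly one fold, stratified by final label."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [
        ([patient_ids[i] for i in train], [patient_ids[i] for i in test])
        for train, test in skf.split(patient_ids, final_labels)
    ]
```

Splitting on patient IDs rather than individual series is what prevents scans from the same patient leaking across the train/test boundary.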
5. Benchmarking Protocols and Performance Metrics
Multiple deep learning architectures—including CNNs, LSTMs, Mamba hybrids, vision transformers, and state-space models—were systematically benchmarked under a unified, quality-controlled protocol (Guo et al., 23 Nov 2025). The primary protocol employs five-fold cross-validation at the patient level, with 80% of patients per fold allocated to the training set (inclusive of all series-associated preprocessing and augmentation) and 20% reserved for evaluation (no augmentation applied).
Performance is measured via accuracy, macro-averaged F1-score, and macro-averaged area under the ROC curve (AUC), defined as follows for the classes $c \in C = \{\text{Progression}, \text{PsP}, \text{Stable}\}$:
- Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]$
- Precision (per class): $P_c = \frac{TP_c}{TP_c + FP_c}$
- Recall (per class): $R_c = \frac{TP_c}{TP_c + FN_c}$
- F1-score (macro): $F1_{\mathrm{macro}} = \frac{1}{|C|}\sum_{c \in C} \frac{2\,P_c R_c}{P_c + R_c}$
- Macro-AUC: $\mathrm{AUC}_{\mathrm{macro}} = \frac{1}{|C|}\sum_{c \in C} \mathrm{AUC}_c$, with each $\mathrm{AUC}_c$ computed one-vs-rest
Classifier outputs are softmax probabilities per class, with a nominal threshold of 0.5 for one-vs-rest assignment. ROC curves and AUC are computed per class, sweeping thresholds on each score. Reported results are averaged over five cross-validation folds and three random seeds.
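A per-fold evaluation of these metrics can be sketched with scikit-learn (the wrapper is illustrative; hard labels are taken here as the argmax of the softmax outputs, which is a simplification of the stated 0.5 one-vs-rest threshold):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_fold(y_true, probs):
    """Compute the three reported metrics from (N, 3) softmax outputs
    over the classes (Progression, PsP, Stable), encoded as 0/1/2."""
    y_pred = np.argmax(probs, axis=1)  # hard one-vs-rest assignment (argmax)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "macro_auc": roc_auc_score(y_true, probs, multi_class="ovr",
                                   average="macro"),
    }
```

The reported numbers would then be the mean of these per-fold values across the five folds and three seeds.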
6. Empirical Findings and Comparative Modeling
Benchmarking across both post-RT follow-up timepoints revealed comparable overall accuracy (Stage 1: ~0.70 ± 0.02, Stage 2: ~0.72 ± 0.03), with richer class separability—reflected by improvements in macro-F1 and macro-AUC—emerging at the later follow-up (Stage 2 macro-F1: ~0.36 ± 0.06 vs. Stage 1: ~0.28 ± 0.04; macro-AUC: ~0.56 ± 0.06 vs. ~0.52 ± 0.04). The Mamba+CNN hybrid yielded the best accuracy-efficiency trade-off, transformer models offered strong AUC at higher computational cost, and lightweight CNNs were less reliable.
For the more deeply phenotyped 59-patient subset, a self-supervised multimodal vision transformer, integrating MRI, clinical, molecular, and dosimetric data via guided cross-modal attention, achieved mean cross-validated AUC of 0.883 ± 0.044, accuracy of 0.770 ± 0.062, and balanced sensitivity/specificity (~0.82/0.72). This outperformed prior CNN-LSTM and SVM baselines by ΔAUC ≈ 0.13–0.35 (p < 0.05), attributable to (i) self-supervised pretraining on 2,317 unlabeled glioma MRIs and (ii) explicit multimodal fusion. SHAP analysis identified time from RT to progression, minimum radiation dose, D98 dose, and MGMT methylation as pivotal predictors (Gomaa et al., 6 Feb 2025).
7. Limitations and Future Directions
The Burdenko GBM Progression Cohort provides a high-quality, stage-aware platform for benchmarking and method development. Nonetheless, several limitations are reported:
- Cohort imbalance: Prevalence of TP substantially exceeds PsP, constraining absolute discriminatory power. Latent-space SMOTE and mild augmentation techniques partially mitigate this, but modest macro-F1 and macro-AUC scores across models reflect the intrinsic class imbalance and clinical ambiguity.
- Data heterogeneity: Imaging protocol variation and limited coverage of later follow-up restrict model generalizability. The 59-patient subset for transformer studies is of modest size and represents a fraction of total available cases.
- Limited sequence scope: T1C is the sole imaging modality for cross-sectional benchmarking; transformer models leverage only T1-CE and FLAIR, potentially omitting diagnostically relevant features from diffusion, perfusion, or advanced molecular imaging.
- Absence of external benchmarking: The main benchmarking paper does not leverage multi-institutional external validation; external performance is only reported for the transformer-based model (tested on a GlioCMV/UKER cohort, n = 20) (Gomaa et al., 6 Feb 2025).
This suggests that future research should incorporate longitudinal sequence modeling, exploit the full spectrum of multiparametric imaging, and pursue larger, multi-center datasets for robust cross-institutional validation.
| Subset | Patients | Imaging Sequences | Modalities Used | Class Balance | Main Use |
|---|---|---|---|---|---|
| Full Cohort | 180 | T1C (2 timepoints/patient) | MRI (T1C), clinical, some molecular | Progression > PsP | Stage-specific benchmarking |
| Multimodal Subset | 59 | T1-CE, FLAIR, RT planning | MRI, RT dose, clinical, molecular | TP: 57.6%, PsP: 42.4% | Multimodal transformer |