Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge (1811.02629v3)

Published 5 Nov 2018 in cs.CV, cs.AI, cs.LG, and stat.ML

Abstract: Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art ML methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.

Citations (1,523)

Summary

  • The paper demonstrates that ensemble-based ML models, especially deep learning architectures like U-Net, significantly enhance segmentation robustness across heterogeneous multi-parametric MRI data.
  • The paper shows traditional ML methods with radiomic feature engineering can outperform deep learning in overall survival prediction given limited training data and complex imaging-clinical correlations.
  • The paper underscores the importance of standardized data preprocessing and robust evaluation metrics to reliably benchmark models for clinical translation.

This paper, "Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge" (1811.02629), provides a comprehensive overview and analysis of the International Brain Tumor Segmentation (BraTS) challenge from 2012 to 2018. The primary goal of BraTS is to evaluate state-of-the-art methods for segmenting brain tumors and their sub-regions in multi-parametric Magnetic Resonance Imaging (mpMRI) scans and, in later years, to assess methods for predicting patient overall survival.

The clinical relevance of this challenge lies in the heterogeneous nature of gliomas, which are difficult to delineate and track using manual methods. Accurate segmentation is crucial for treatment planning, monitoring disease progression (beyond standard RECIST/RANO criteria [recist1, recist2, recist3, recist4, rano]), and predicting patient prognosis, potentially using quantitative image analysis (radiomics) derived from segmentation masks [IbsiPaper].

The BraTS dataset is a key contribution of the initiative. It comprises pre-operative mpMRI scans (T1, T1-Gd, T2, T2-FLAIR) collected from multiple institutions with varying protocols and scanner types, reflecting real-world clinical variability. The data undergoes pre-processing, including co-registration to an anatomical template and resampling to 1mm³ isotropic resolution. Manual segmentations, reviewed by expert neuroradiologists, serve as ground truth. The definition of tumor sub-regions evolved over the years:

  • BraTS 2012-2016: Defined four regions: Necrotic Core (NCR), Edema (ED), Non-Enhancing Tumor (NET), and Active/Enhancing Tumor (AT).
  • BraTS 2017-Present: Simplified to three nested regions: Active/Enhancing Tumor (AT), Tumor Core (TC: the union of AT, NCR, and NET), and Whole Tumor (WT: the union of TC and ED). This change aimed to address the ambiguity and inconsistency of the NET label in earlier years. A standardized annotation protocol guides annotators to delineate from the outside in (WT → TC → AT).
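Because the post-2017 regions are nested, all three evaluation masks can be derived from a single label map. A minimal sketch, assuming the conventional BraTS label encoding (1 = NCR/NET, 2 = ED, 4 = AT), which this summary does not spell out:

```python
import numpy as np

# Hypothetical BraTS-style label map (flattened for brevity):
# 0 = background, 1 = NCR/NET, 2 = edema (ED), 4 = enhancing/active tumor (AT).
labels = np.array([0, 1, 2, 4, 4, 1, 2, 0])

at = labels == 4          # Active/enhancing tumor
tc = at | (labels == 1)   # Tumor core = AT + NCR/NET
wt = tc | (labels == 2)   # Whole tumor = TC + ED

# The evaluated regions are nested: AT is contained in TC, TC in WT.
assert np.all(tc[at]) and np.all(wt[tc])
```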

The dataset has grown significantly since 2012, increasing in size and incorporating a validation set from 2017 onwards to better facilitate algorithm development following a machine learning paradigm (training, validation, testing). Longitudinal data was included in 2014-2016 to support progression assessment, and clinical data (age, OS, resection status) was added in 2017-2018 for the survival prediction task. The data is made publicly available through repositories like SMIR and the CBICA IPP.

The challenge evaluates methods on two main tasks:

  1. Brain Tumor Segmentation: Participants develop automated methods to segment the AT, TC, and WT regions from pre-operative mpMRI scans. Evaluation metrics include the Dice score (overlap), the 95th-percentile Hausdorff distance (boundary robustness), sensitivity (true positive rate), and specificity (true negative rate). A case-wise ranking scheme averages ranks across subjects, regions, and metrics, and permutation testing assesses the statistical significance of rank differences between teams.
  2. Overall Survival Prediction: Introduced in 2017-2018, this task requires participants to predict the overall survival (in days) for patients who underwent Gross Total Resection (GTR). The evaluation focuses on classifying patients into short- (<10 months), mid- (10-15 months), and long-survivors (>15 months), using classification accuracy as the primary ranking metric. Additional metrics like Mean Squared Error (MSE) and Spearman Correlation are used for error analysis. Participants typically extract radiomic features from the segmented tumor sub-regions and surrounding tissues, often combining them with clinical features (age, resection status), and use machine learning models for prediction.
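The two geometric segmentation metrics above can be sketched with NumPy and SciPy. This is a simplified surface-based approximation, not the challenge's official evaluation code, and it assumes non-empty masks:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(pred, gt):
    """Dice overlap between two boolean masks (1.0 when both are empty)."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance between non-empty masks."""
    # Surface voxels: foreground voxels removed by a single erosion step.
    surf_p = pred & ~binary_erosion(pred)
    surf_g = gt & ~binary_erosion(gt)
    # Distance from every voxel to the nearest surface voxel of the other
    # mask, sampled at this mask's own surface voxels (both directions).
    d_to_g = distance_transform_edt(~surf_g, sampling=spacing)
    d_to_p = distance_transform_edt(~surf_p, sampling=spacing)
    dists = np.concatenate([d_to_g[surf_p], d_to_p[surf_g]])
    return float(np.percentile(dists, 95))
```

Details such as how empty predictions are scored differ in the official ranking pipeline; the sketch only illustrates the metric definitions.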

Key findings from the challenge instances presented in the paper include:

  • Segmentation:
    • Automated methods have improved over the years, largely due to the increasing size and diversity of the dataset and advances in Deep Learning (DL) architectures, particularly variations of U-Net and cascaded CNNs [brats17:lncs:biomedia1, brats17:lncs:UCL-TIG, brats18:rank1:seg:nvidia, brats18:rank2:seg:fabian].
    • Segmentation of the Whole Tumor (WT) is generally the most robust, followed by the Tumor Core (TC) and then the Active Tumor (AT), which is the most challenging due to its small size and variable appearance.
    • While individual automated methods perform well, ensembles or fusions of top-performing algorithms often achieve superior and more robust results, sometimes exceeding expert inter-rater agreement [bratsTmiPaper]. This suggests that combining predictions from diverse models is a practical strategy to improve robustness in clinical translation.
    • The variability across different methods highlights the challenge of handling the data heterogeneity inherent in multi-institutional clinical data.
  • Survival Prediction:
    • This task proved more challenging than segmentation, especially for DL methods, potentially due to the relatively smaller size of the training set for this specific task compared to segmentation, and the complex, potentially non-linear relationship between imaging features, clinical data, and survival.
    • Traditional Machine Learning methods, often combined with radiomic feature extraction, demonstrated competitive or superior performance compared to DL approaches in the survival prediction task during BraTS 2017-2018 [brats18:rank1:surv:feng, brats18:rank2:surv:elodie, brats18:rank2:surv:lisun, brats18:rank3:surv:tata, brats18:rank3:surv:leon]. This suggests that feature engineering and classical ML models remain relevant, especially when dealing with limited clinical datasets or when integrating multimodal data.
    • The top survival prediction accuracy was around 0.6 in both 2017 and 2018, indicating the difficulty of predicting OS from pre-operative scans alone and the potential need to integrate additional data types like radiogenomics [rgRutmanEjrLink] or clinical reports.
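The survival-task binning and metrics described above can be sketched as follows. The 30.44 days-per-month conversion and the exact boundary handling are assumptions for illustration, not the challenge's official scoring code:

```python
import numpy as np
from scipy.stats import spearmanr

MONTH_DAYS = 30.44  # assumed days-per-month conversion

def survival_class(days):
    """Map OS in days to the three BraTS bins:
    0 = short (<10 months), 1 = mid (10-15), 2 = long (>15)."""
    months = np.asarray(days, dtype=float) / MONTH_DAYS
    return np.digitize(months, [10.0, 15.0])

def evaluate_os(pred_days, true_days):
    """Primary ranking metric (bin accuracy) plus MSE and Spearman rho."""
    pred = np.asarray(pred_days, dtype=float)
    true = np.asarray(true_days, dtype=float)
    acc = float(np.mean(survival_class(pred) == survival_class(true)))
    mse = float(np.mean((pred - true) ** 2))
    rho, _ = spearmanr(pred, true)
    return acc, mse, rho
```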

Implementation considerations for developers and practitioners tackling these tasks based on the paper's insights include:

  • Data Pre-processing: Standardizing input data (co-registration, resampling, skull-stripping) is crucial for training consistent models across diverse datasets.
  • Segmentation Model Architecture: 3D convolutional neural networks (CNNs), often based on the U-Net architecture, are dominant. Cascaded or hierarchical approaches (e.g., segmenting the whole tumor first, then sub-regions) can improve performance. Handling class imbalance, especially for smaller sub-regions like AT, is important. Ensembling multiple models is a strong strategy for improving robustness.
  • Survival Prediction Model: Feature extraction (radiomics, potentially from segmented regions) combined with traditional ML methods (e.g., Random Forests, Gradient Boosting) or integrating clinical features (age, resection status) yielded good results. For DL-based survival prediction, strategies to handle small datasets or integrate with other data types are needed.
  • Evaluation: Implementing robust evaluation pipelines using metrics like Dice, Hausdorff Distance, and classification accuracy is essential for benchmarking models. Understanding the nuances of metrics (e.g., Dice sensitivity to small volumes) is important.
  • Computational Requirements: Training 3D CNNs for segmentation requires significant computational resources (GPUs, memory). Inference can also be computationally intensive, although optimizations are possible. Radiomic feature extraction followed by traditional ML for survival prediction is typically less demanding.
  • Limitations and Future: The field needs more robust individual segmentation models, better handling of clinical confounders (blood products, post-treatment effects), and improved performance on diffuse tumors like low-grade gliomas. Integrating radiomic, clinical, and potentially molecular data is a key direction for more accurate survival prediction. Translating these methods to clinical practice requires addressing issues like deployment (e.g., containerization, as encouraged by BraTS 2018), interpretability, and regulatory approval.
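The ensembling strategy recommended above amounts to late fusion over model outputs. Two generic schemes (probability averaging and majority voting) can be sketched as below; these are illustrative and not the fusion method of any particular BraTS entry:

```python
import numpy as np

def ensemble_softmax(prob_maps):
    """Late fusion by probability averaging.

    prob_maps: array of shape (n_models, n_classes, *spatial); returns the
    argmax label map of the mean class probabilities."""
    return np.mean(prob_maps, axis=0).argmax(axis=0)

def ensemble_vote(label_maps):
    """Late fusion by per-voxel majority vote over hard label maps of
    shape (n_models, *spatial); ties resolve to the lowest label."""
    label_maps = np.asarray(label_maps)
    n_classes = int(label_maps.max()) + 1
    counts = np.stack([(label_maps == c).sum(axis=0)
                       for c in range(n_classes)])
    return counts.argmax(axis=0)
```

Probability averaging is preferable when calibrated softmax maps are available; majority voting only needs the hard segmentations that challenge participants typically submit.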

Overall, the paper highlights the progress made in automated brain tumor image analysis through the BraTS challenge, underscores the continued challenges posed by data heterogeneity and task complexity, and points towards ensemble methods, integrated multi-modal data analysis (including clinical and radiogenomic features), and more robust individual models as key areas for future research and clinical translation.