Radiomics AutoML Framework Overview
- Radiomics-specific AutoML frameworks are integrated systems that automate the extraction, selection, and modeling of high-dimensional imaging biomarkers.
- They employ modular architectures and rich algorithm libraries—including deep learning and advanced preprocessing—to create reproducible radiomics pipelines.
- Advanced strategies like hyperparameter tuning, ensembling, and interpretability modules ensure robust performance and clinical relevance.
Radiomics-specific Automated Machine Learning (AutoML) frameworks provide end-to-end automation for the extraction, selection, modeling, and interpretation of high-dimensional quantitative imaging biomarkers, enabling non-programming users to construct predictive models from medical images. These systems address the unique challenges of radiomics—heterogeneous imaging modalities, feature reproducibility, complex preprocessing, and workflow diversity—through specialized architectures, rich algorithm libraries, and integrated evaluation protocols. Contemporary frameworks increasingly couple radiomics with deep learning components, harmonization modules, and large-scale optimization strategies to streamline clinically impactful analysis.
1. System Architectures and Agentic Design
Radiomics-tailored AutoML platforms implement modular, multi-component, or agentic architectures to automate the radiomics pipeline:
- mAIstro (Tzanis et al., 30 Apr 2025) utilizes a master LLM-powered "agent" following a ReAct-style (Reason–Act–Observe) loop. The master agent parses natural language user prompts, identifies required tasks, and orchestrates specialized downstream agents:
- Exploratory Data Analysis (EDA) Agent: Automates profiling, summary statistics, missing-value analysis, and standard visualizations.
- Feature Importance & Selection Agent: Supports ANOVA F-test, mutual information, tree-based importances, and recursive feature elimination.
- Radiomics Feature Extraction Agent: Wraps PyRadiomics to compute shape, first-order, and texture (GLCM, GLRLM, GLSZM, GLDM, NGTDM) features, supports multi-filter pipelines and normalization.
- Segmentation Agents: Automate nnU-Net and TotalSegmentator pipelines for mask generation across multiple modalities/anatomies.
- Classifier/Regressor Agents: Run end-to-end modeling with PyCaret across >20 classifier types, integrated preprocessing, and hyperparameter tuning.
- Simplatab (Lozano-Montoya et al., 13 Jan 2026) offers a no-code graphical interface, internally removing highly correlated features, using SULOV and RFE for stability selection, and training ensembles over seven classifiers. It incorporates bias detection and SHAP interpretability modules.
- WORC (Starmans et al., 2021) and DARWIN (Chang et al., 2020) employ modular/graph-based designs, integrating drag-and-drop or script-driven workflow assembly. Both support varied feature extraction, model libraries, and automated algorithm selection.
A defining property of advanced frameworks is their encapsulation of pipeline steps as independently addressable modules or agents, allowing flexible orchestration, reproducible experimentation, and dynamic optimization.
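The agent-based orchestration described above can be illustrated with a minimal sketch. This is not mAIstro's actual implementation (its agents wrap LLM tool calls); the registry, agent names, and plan format here are hypothetical, showing only the Reason-Act-Observe dispatch pattern:

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping task names to specialized agents.
AGENTS: Dict[str, Callable[[dict], dict]] = {}

def agent(name: str):
    """Decorator that registers a function as an addressable agent."""
    def register(fn):
        AGENTS[name] = fn
        return fn
    return register

@agent("eda")
def eda_agent(task: dict) -> dict:
    # A real EDA agent would profile task["table"]: summary stats,
    # missing values, standard visualizations.
    return {"agent": "eda", "status": "done"}

@agent("radiomics")
def radiomics_agent(task: dict) -> dict:
    # A real extraction agent would call PyRadiomics on
    # task["image"] / task["mask"].
    return {"agent": "radiomics", "status": "done"}

def react_loop(plan: List[dict]) -> List[dict]:
    """Reason-Act-Observe: take the next planned step (reason),
    dispatch it to the matching agent (act), and record the
    result (observe) until the plan is exhausted."""
    observations = []
    for step in plan:
        result = AGENTS[step["agent"]](step)  # act
        observations.append(result)           # observe
    return observations

# A parsed user prompt might yield a two-step plan:
results = react_loop([{"agent": "eda"}, {"agent": "radiomics"}])
```

The key design point is that each pipeline stage is an independently addressable callable, so the master loop can reorder, retry, or skip stages without any stage knowing about the others.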
2. Radiomics Feature Extraction: Categories and Mathematical Formulation
Frameworks implement comprehensive feature sets encompassing shape, first-order statistics, and texture, often leveraging PyRadiomics core definitions.
Shape Descriptors:
- Volume: $V = \sum_{i=1}^{N} v_i$, the sum of the volumes $v_i$ of all voxels in the ROI (or computed from the triangulated surface mesh)
- Surface Area: mesh-based, $A = \sum_{k=1}^{N_f} \frac{1}{2}\,|\mathbf{a}_k \times \mathbf{b}_k|$, summing the areas of the $N_f$ triangles of the ROI surface mesh
- Sphericity: $\Psi = \dfrac{\pi^{1/3}(6V)^{2/3}}{A}$
- Compactness: $C = \dfrac{36\pi V^2}{A^3}$ (equivalently $\Psi^3$)
First-Order Statistics:
For the $N$ voxel intensities $x_j$ in the ROI, with normalized intensity histogram $p(i)$ over $N_g$ discretized gray levels:
- Mean: $\mu = \frac{1}{N}\sum_{j=1}^{N} x_j$
- Variance: $\sigma^2 = \frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2$
- Skewness: $\dfrac{\frac{1}{N}\sum_{j=1}^{N}(x_j-\mu)^3}{\sigma^3}$
- Kurtosis: $\dfrac{\frac{1}{N}\sum_{j=1}^{N}(x_j-\mu)^4}{\sigma^4}$
- Energy: $\sum_{j=1}^{N} x_j^2$
- Entropy: $-\sum_{i=1}^{N_g} p(i)\log_2 p(i)$
Texture Features (GLCM, GLRLM, GLSZM, GLDM, NGTDM):
For a normalized gray-level co-occurrence matrix $p(i,j)$ and run-length matrix $P(i,j)$:
- Contrast (GLCM): $\sum_{i,j}(i-j)^2\,p(i,j)$
- Correlation (GLCM): $\dfrac{\sum_{i,j}(i-\mu_i)(j-\mu_j)\,p(i,j)}{\sigma_i \sigma_j}$
- Short-Run Emphasis (GLRLM): $\dfrac{\sum_{i,j} P(i,j)/j^2}{\sum_{i,j} P(i,j)}$
DARWIN (Chang et al., 2020) additionally includes higher-order transforms (wavelets, Laplacian of Gaussian, exponential, gradient, LBP2D/3D) and supports robust extraction via region-of-interest perturbation for repeatability.
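The first-order definitions above translate directly into code. The following sketch computes them with NumPy using population moments and a log2 entropy over a binned histogram, consistent with the standard PyRadiomics conventions; the bin count and test data are arbitrary:

```python
import numpy as np

def first_order_features(x: np.ndarray, bins: int = 32) -> dict:
    """First-order statistics over the ROI voxel intensities x."""
    mu = x.mean()                              # mean
    var = ((x - mu) ** 2).mean()               # population variance
    sigma = np.sqrt(var)
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()                  # histogram probabilities p(i)
    p = p[p > 0]                               # drop empty bins before log
    return {
        "mean": mu,
        "variance": var,
        "skewness": ((x - mu) ** 3).mean() / sigma ** 3,
        "kurtosis": ((x - mu) ** 4).mean() / sigma ** 4,  # non-excess
        "energy": (x ** 2).sum(),
        "entropy": -(p * np.log2(p)).sum(),
    }

# Synthetic "ROI": 5000 intensities drawn from N(100, 15^2).
rng = np.random.default_rng(0)
feats = first_order_features(rng.normal(100, 15, size=5000))
```

On this Gaussian sample the mean lands near 100 and the (non-excess) kurtosis near 3, which is a quick sanity check that the moment formulas are wired correctly.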
3. Feature Selection, Hyperparameter Optimization, and Ensembling Strategies
Radiomics-specific AutoML frameworks implement feature selection and optimization modules specialized for high-dimensional, low-sample tabular data:
- Feature Selection: Common algorithms include univariate F-tests, mutual information, RELIEF, SelectFromModel (LASSO- or random-forest-based), variance thresholding, PCA, and nonparametric tests (Mann–Whitney U).
- Dimensionality Reduction: Selection via top-k ranking, cumulative importance thresholds, or stability selection (SULOV).
- Hyperparameter Optimization:
- Grid search (exhaustive over discrete grid)
- Random search (uniform/log-uniform sampling)
- Bayesian optimization (Gaussian process or random forest surrogate; Expected Improvement acquisition)
- Internal cross-validation and early stopping
- Ensembling: averaging over the top-ranked pipelines, plus FitNumber and ForwardSelection strategies to combine the best-performing pipelines (Starmans et al., 2021).
Each framework logs search spaces, best configurations, and evaluation metrics for full traceability. Runtime efficiency and overfitting are controlled by budgeted optimization (e.g., capping random search at a fixed iteration budget in WORC).
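The combination of univariate selection, budgeted random search with log-uniform sampling, and internal cross-validation can be sketched with scikit-learn. The parameter grid and dataset here are illustrative, not the defaults of any published framework:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional, low-sample radiomics-like table:
# 120 patients x 400 features, only a few of them informative.
X, y = make_classification(n_samples=120, n_features=400,
                           n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),        # univariate F-test
    ("clf", LogisticRegression(max_iter=2000)),
])

# Budgeted random search with internal 5-fold cross-validation;
# selection and tuning happen inside each fold, avoiding leakage.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "select__k": [10, 25, 50, 100],
        "clf__C": loguniform(1e-3, 1e2),       # log-uniform sampling
    },
    n_iter=20, cv=5, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
```

Placing the selector inside the pipeline is the important design choice: it guarantees the F-test is refit on each training fold, so the reported cross-validated AUC is not inflated by peeking at validation data.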
4. Integration of Deep Learning and Unified End-to-End Pipelines
Contemporary frameworks unify radiomics and deep learning pipelines, enabling joint segmentation, feature extraction, and direct image classification:
- Combined Pipelines:
- Segmentation → radiomics extraction → tabular modeling with classifiers/regressors.
- Direct CNN-based modeling: image classifier agents train networks (ResNet, VGG16, InceptionV3) on raw or processed image slices, utilizing transfer learning, augmentations, dynamic learning rates, and device control (Tzanis et al., 30 Apr 2025).
- Preprocessing standardizes images (intensity clipping, discretization, resizing, per-channel normalization) and tabular features (imputation, encoding, scaling).
- Model Evaluation: Metrics include AUC, accuracy, F1, Cohen's κ, and MCC for classification; MAE, RMSE, and R² for regression; Dice Similarity Coefficient and IoU for segmentation; ROC curves and macro-averaged F1 for CNNs.
Such dual-mode integration allows frameworks to address image- and feature-level modeling, streamline inference on multimodal inputs, and flexibly deploy models for both clinical and research needs.
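The segmentation metrics above have compact closed forms. A minimal sketch for binary masks (the 4×4 example masks are arbitrary):

```python
import numpy as np

def dice_iou(pred: np.ndarray, ref: np.ndarray):
    """Overlap between two binary masks:
    Dice = 2|A∩B| / (|A| + |B|),  IoU = |A∩B| / |A∪B|."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    dice = 2 * inter / (pred.sum() + ref.sum())
    iou = inter / np.logical_or(pred, ref).sum()
    return float(dice), float(iou)

# Two masks with 3 foreground voxels each, overlapping in 2:
a = np.zeros((4, 4)); a[0, :3] = 1
b = np.zeros((4, 4)); b[0, 1:4] = 1
d, i = dice_iou(a, b)   # Dice = 4/6 ≈ 0.667, IoU = 2/4 = 0.5
```

Note that Dice is always at least as large as IoU for the same pair of masks (Dice = 2·IoU/(1+IoU)), which is why segmentation papers quoting Dice report systematically higher numbers than the same results expressed as IoU.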
5. Usability, Accessibility, and Interface Modalities
Radiomics-specific AutoML platforms differentiate themselves on usability and accessibility axes:
| Framework | Interface Type | Coding Required | Interpretability |
|---|---|---|---|
| Simplatab | GUI (no-code) | None | SHAP, bias modules |
| WORC | Script/config | Advanced | SHAP, LIME |
| DARWIN | Drag-and-drop GUI | None | Full reporting |
| mAIstro | Natural language | None | Interpretable status |
- No-code/Low-code: Simplatab and DARWIN are designed for radiologists or physicians without programming experience. mAIstro abstracts all tasks behind a prompt-driven natural-language interface, requiring only file paths and parameter specifications.
- Installation and Onboarding: Simplatab deploys via Docker/pip with moderate install complexity. Code-driven frameworks (WORC, AutoPrognosis) require advanced build skills and may suffer from dependency obsolescence.
- Interpretability: SHAP, LIME, and built-in bias/vulnerability analysis are available in Simplatab and WORC; DARWIN provides metric visualization and statistical testing with exportable models.
The usability spectrum remains a critical delineator between radiomics-specific and general-purpose AutoML solutions.
6. Quantitative Performance and Benchmarking Results
Frameworks are rigorously evaluated on public and private radiomics cohorts, reporting cross-validated discriminative metrics and runtime:
- Simplatab (Lozano-Montoya et al., 13 Jan 2026): Highest mean test AUC (81.81%) across 10 datasets (e.g., Desmoid 95.0%, Lipo 87.7%, Liver 96.4%), with ~1h runtime on web interface.
- LightAutoML (general-purpose baseline): Fastest training (6 min/dataset) with competitive AUC (78.20%).
- mAIstro (Tzanis et al., 30 Apr 2025): On MedMNIST classification, ResNet variants achieved accuracy 79.5–98.9% and macro-F1 0.598–0.991. Segmentation (nnU-Net): BraTS DSC up to 0.957, KiTS kidney DSC 0.951.
- WORC (Starmans et al., 2021): AUC 0.80–0.87 on liver, lipo, Alzheimer’s, head-neck T-stage; consistently outperformed radiomics baselines and radiologist experts. Default run: 18 h on 24-core Xeon for 500,000 train/val fits.
- DARWIN (Chang et al., 2020): Achieved AUC 0.97 on LIDC-IDRI (lung ROI classification), with sub-minute runtimes for moderate-sized datasets.
Performance metrics are routinely reported with cross-validated mean ± SD and confidence intervals; statistical significance is addressed via bootstrap or corrected resampled t-tests. Frameworks occupying the Pareto frontier demonstrate optimal trade-offs between discriminative power and computational efficiency.
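One of the uncertainty-quantification procedures mentioned above, a percentile bootstrap confidence interval for AUC, can be sketched as follows; the replicate count, synthetic labels, and score model are illustrative choices, not any framework's defaults:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC: resample cases with
    replacement and recompute the metric on each replicate."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # need both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Synthetic cohort: informative but noisy scores for 200 cases.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
s = y * 0.6 + rng.normal(0, 0.5, 200)
auc, (lo, hi) = bootstrap_auc_ci(y, s)
```

Resampling patients (cases) rather than predictions preserves the within-patient pairing of label and score, which is what makes the interval interpretable at the cohort level.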
7. Limitations, Gaps, and Future Directions
Despite significant advances, current radiomics-specific AutoML frameworks face several limitations (Lozano-Montoya et al., 13 Jan 2026):
- Survival Analysis: No accessible APIs—except for computationally infeasible AutoPrognosis—offer survival/time-to-event modeling with built-in C-index optimization and censoring.
- Reproducibility Controls: No frameworks directly enforce feature stability assessment (e.g., test–retest ICC, phantom-based filtering) within the radiomics pipeline.
- Harmonization: Intensity standardization and statistical harmonization methods (such as ComBat) are absent.
- Obsolescence and Sustainability: Many domain-specific packages (AutoRadiomics, AutoML for Radiomics) have lapsed due to maintenance challenges and dependency drift.
Recommended adaptations include direct integration of survival modeling, reproducibility modules, harmonization procedures, and true end-to-end automation from image ingestion to reporting. Expanding cohort sizes and diversity, as well as signal-to-noise benchmarking, are necessary for robust pipeline validation.
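The test–retest stability check identified as a gap above is straightforward to add to a pipeline. This sketch computes the two-way random-effects, absolute-agreement ICC(2,1) from the standard ANOVA decomposition; the stability threshold and synthetic data are illustrative assumptions:

```python
import numpy as np

def icc_2_1(Y: np.ndarray) -> float:
    """ICC(2,1) for a subjects x sessions matrix Y, e.g. one
    radiomic feature measured on test and retest scans."""
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # subjects
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # sessions
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# A stable feature: retest values track test values closely.
rng = np.random.default_rng(0)
test = rng.normal(0, 1, 100)
retest = test + rng.normal(0, 0.1, 100)
icc = icc_2_1(np.column_stack([test, retest]))
# Features below a chosen cutoff (commonly ~0.75-0.9) would be
# filtered out before modeling.
```

Running this filter per feature before selection would directly address the reproducibility-control gap, at the cost of requiring a test–retest or phantom acquisition for the cohort.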
Radiomics-specific AutoML frameworks have evolved toward fully autonomous, modular systems combining advanced feature extraction, algorithm selection, and deep learning with highly accessible, interpretable interfaces. Efficiency and discriminative performance increasingly rival or surpass expert-driven benchmarks, yet critical gaps in reproducibility, harmonization, and survival modeling remain persistent challenges for future research and clinical translation.