
AutoML Benchmark (AMLB) Framework

Updated 12 November 2025
  • The AutoML Benchmark (AMLB) framework is a standardized protocol enabling consistent evaluation of AutoML systems by enforcing identical datasets, budgets, hardware, and metrics.
  • It uses a modular architecture with orchestration scripts, API wrappers, and statistical analysis tools to achieve methodological rigor and full reproducibility.
  • AMLB supports diverse AutoML tools and integrates baseline comparisons and early-stopping mechanisms, facilitating fair, extensible, and energy-efficient model benchmarking.

The AutoML Benchmark (AMLB) framework is a standardized, open-source protocol and software suite for the rigorous, reproducible, and extensible empirical evaluation of automated machine learning (AutoML) frameworks, especially on tabular learning tasks. It defines the canonical testbed, resource controls, experiment workflows, statistical methodology, and analysis toolchain now recognized as the de facto standard for assessing and comparing AutoML systems, both in the research literature and in industrial settings. AMLB's design enforces strict comparability (identical datasets, budgets, hardware, and metrics), supports a large and diversified suite of open datasets, and offers extensibility and automation to accommodate new AutoML tools and emerging benchmarking desiderata.

1. Design Goals and Evaluation Principles

The AMLB framework is built on three foundational principles: open extensibility, strict scientific reproducibility, and methodological rigor in fair comparison. Key design elements include:

  • Open-source infrastructure: All code, including wrappers for AutoML tools, orchestration scripts, and visualization dashboards, is publicly available and version-controlled on GitHub (Gijsbers et al., 2019, Gijsbers et al., 2022).
  • Extensible modularity: Adding new frameworks or datasets is a low-effort, incremental process, enabled by wrapper abstractions and configuration-driven experiment setup.
  • Standardized evaluation regime: All frameworks are run under precisely identical computational constraints (time, memory, CPU) set per dataset and fold, with fixed budgets (e.g., 1 h, 4 h, or shorter, see Section 3), and no manual harmonization of hyperparameter spaces except for resource limits.
  • Baseline integration: Performance is contextualized against strong baselines such as constant predictors and tuned random forests, ensuring that gains are attributed to AutoML system merit rather than experimental idiosyncrasies (Gijsbers et al., 2019, Jurado et al., 1 Apr 2025).
  • Reproducibility: All runs are containerized (Docker), code snapshots and configuration files are archived, and full log provenance (including library versions and random seeds) accompanies published results.
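
As a small illustration of the provenance requirement, each run can archive a snapshot of the interpreter, random seed, and library versions alongside its logs. The helper below is a hypothetical sketch, not AMLB code:

```python
import json
import platform
from importlib import metadata

def snapshot_provenance(seed, packages=("numpy", "scikit-learn")):
    """Capture the information needed to rerun an experiment exactly
    (hypothetical helper; AMLB archives equivalent data with each run)."""
    return json.dumps({
        "python": platform.python_version(),
        "random_seed": seed,
        "package_versions": {p: metadata.version(p) for p in packages},
    }, indent=2)

print(snapshot_provenance(seed=42))
```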

2. Benchmark Architecture and Workflow

The AMLB ecosystem follows a modular architecture comprising an experiment orchestrator, API-based framework wrappers, centralized data/specification management, and result aggregation and visualization subsystems (Gijsbers et al., 2019, Gijsbers et al., 2022).

  • Orchestrator: Reads YAML/JSON configurations defining experiment matrix (frameworks, tasks, split/fold schemes, time/memory/CPU budgets), launches isolated containerized jobs, and manages resource enforcement and failure recovery.
  • Wrappers: Small, extensible Python classes or scripts standardize the call signature for each AutoML system, enforce timeouts (via subprocess, Docker cgroups, or signals), and harmonize the input/output format (X_train, y_train, X_test in; y_pred out); a minimal sketch of this contract follows the list.
  • Data Layer: OpenML APIs provide dataset ingestion, with metadata-driven preprocessing including missing value imputation, feature encoding (one-hot or ordinal), and scaling, strictly applied to ensure parity.
  • Evaluation Engine: Uses scikit-learn (versions specified for consistency) to compute metrics such as AUROC, accuracy, RMSE, and log-loss. Results and runtime/resource traces are stored per (framework, task, fold/seed).
  • Aggregation and Analysis: Aggregators compile per-run outputs into summary files (CSV/JSON/SQLite), feeding into dashboards (interactive HTML/Plotly/Vega-Lite) and statistical comparison routines, including significance testing and global ranking (Gijsbers et al., 2019, Gijsbers et al., 2022).
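
A minimal sketch of the wrapper contract is shown below. The class name and method signature are illustrative assumptions rather than the exact AMLB API, but they capture the uniform fit/predict/time-budget interface the orchestrator relies on:

```python
import time
import numpy as np
from sklearn.dummy import DummyClassifier

class FrameworkWrapper:
    """Illustrative wrapper: adapts one AutoML tool to a shared interface
    (class and method names are hypothetical, not the real AMLB classes)."""

    def __init__(self, name, make_automl):
        self.name = name
        self._make_automl = make_automl  # callable(time_budget_s) -> fresh estimator

    def fit_predict(self, X_train, y_train, X_test, time_budget_s):
        """Train within the per-fold budget and return predictions plus timing."""
        model = self._make_automl(time_budget_s)
        start = time.time()
        model.fit(X_train, y_train)      # the tool runs its own internal search
        train_time_s = time.time() - start
        return np.asarray(model.predict(X_test)), train_time_s

# Example: wrap scikit-learn's DummyClassifier as a stand-in "AutoML" system.
cp_wrapper = FrameworkWrapper(
    "constant_predictor",
    lambda budget_s: DummyClassifier(strategy="most_frequent"),
)
```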

The experiment flow can be summarized as:

for each dataset D:
    for each framework F:
        for each fold or seed:
            enforce resource budget (T, RAM, CPU)
            run framework wrapper: fit/predict (X_train, y_train, X_test, time_budget)
            log performance, timing, memory, failures
aggregate results: per framework, task, metric
apply statistical analysis and visualize
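
The sketch below is a condensed, in-process rendering of this flow under simplifying assumptions: a scikit-learn toy dataset stands in for an OpenML task, plain estimators stand in for full AutoML systems, and the time budget is only checked after training rather than enforced through containers or cgroups as AMLB does.

```python
import time
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical experiment matrix: framework name -> estimator factory.
FRAMEWORKS = {
    "constant_predictor": lambda: DummyClassifier(strategy="prior"),
    "random_forest": lambda: RandomForestClassifier(n_estimators=200, random_state=0),
}
TIME_BUDGET_S = 600  # per-fold budget; real AMLB kills over-budget jobs externally

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = []
for fw_name, make_model in FRAMEWORKS.items():
    for fold, (tr, te) in enumerate(cv.split(X, y)):
        model = make_model()
        start = time.time()
        model.fit(X[tr], y[tr])
        elapsed = time.time() - start
        if elapsed > TIME_BUDGET_S:   # soft check only in this sketch
            score = np.nan            # failed/over-budget runs are imputed later
        else:
            score = roc_auc_score(y[te], model.predict_proba(X[te])[:, 1])
        results.append({"framework": fw_name, "fold": fold,
                        "auroc": score, "train_time_s": elapsed})

# Aggregate: mean AUROC per framework across the 10 folds.
for fw_name in FRAMEWORKS:
    scores = [r["auroc"] for r in results if r["framework"] == fw_name]
    print(fw_name, round(float(np.nanmean(scores)), 4))
```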

3. Task Suite, Resource Regimes, and Preprocessing

AMLB's canonical benchmark suite is tabular, focused on public OpenML datasets. The most widely adopted setting comprises:

  • 104 tasks: 71 classification (binary and multiclass), 33 regression; 10-fold cross-validation (Jurado et al., 1 Apr 2025).
  • Dataset curation: Excludes time series, image, text, and multi-label tasks; the retained datasets span a wide range of scales (9/104 have >500K rows, 14/104 have >1K features) and domains such as business, medicine, physics, and finance (Gijsbers et al., 2022, Gijsbers et al., 2019).
  • Preprocessing uniformity: All frameworks receive identically preprocessed data per metadata prescription (e.g., median-impute numerics, most-frequent categorical imputation, one-hot/ordinal encoding, z-score scaling).
  • Resource configuration: Time budget per fold is fixed (originally T ∈ {1h, 4h}, with recent studies advocating for T ∈ {5, 10, 30, 60} minutes to accommodate rapid retraining and ecological considerations; see Table 1).
  • Hardware limits: Enforced via Docker or cgroups, e.g., 8 vCPUs and 28 GB RAM per job (Gijsbers et al., 2019).
Table 1. Per-fold time budgets.

| Description | Symbol | Value (seconds) |
| --- | --- | --- |
| 5 minutes | T₅ | 300 |
| 10 minutes | T₁₀ | 600 |
| 30 minutes | T₃₀ | 1800 |
| 60 minutes | T₆₀ | 3600 |
| 4 hours (original) | T₂₄₀ (orig.) | 14400 |

Shorter time budgets are shown to preserve discriminative power while reducing computational and energy cost by an order of magnitude (Jurado et al., 1 Apr 2025).
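
The preprocessing prescription above maps naturally onto a scikit-learn pipeline. The sketch below illustrates the described steps (median imputation for numerics, most-frequent imputation plus one-hot encoding for categoricals, z-score scaling) and is not the exact AMLB implementation:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_uniform_preprocessor(numeric_cols, categorical_cols):
    """Build the shared preprocessing applied identically to every framework."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # median-impute numerics
        ("scale", StandardScaler()),                    # z-score scaling
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])

# Usage on a DataFrame with known column roles (illustrative):
# pre = make_uniform_preprocessor(["age", "income"], ["country", "product"])
# X_processed = pre.fit_transform(df)
```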

4. Evaluation Metrics, Statistical Analysis, and Framework Ranking

Performance is measured per task as follows:

  • Primary metrics:
    • Classification: AUROC (binary), accuracy (multiclass)
    • Regression: Root-mean-square error (RMSE)
  • Failures and Imputation: Any run that fails (crashes, exceeds budget, returns NaN) is replaced by a constant-predictor (CP) baseline score (Jurado et al., 1 Apr 2025, Gijsbers et al., 2019).
  • Baseline normalization: Scores are rescaled so that 0 corresponds to the constant predictor and 1 to a tuned random forest.
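
A hypothetical helper illustrating this normalization (not part of AMLB itself); for error metrics such as RMSE the sign is flipped so that higher normalized values still mean better:

```python
def normalized_score(score, cp_score, rf_score, higher_is_better=True):
    """Rescale a raw metric so the constant predictor maps to 0 and the
    tuned random forest maps to 1 (illustrative helper, not AMLB code)."""
    if not higher_is_better:            # e.g. RMSE/log-loss: negate so higher is better
        score, cp_score, rf_score = -score, -cp_score, -rf_score
    denom = rf_score - cp_score
    if denom == 0:
        return 0.0                      # degenerate task: both baselines tie
    return (score - cp_score) / denom

# Example: AUROC of 0.91 where the constant predictor scores 0.50 and the RF 0.88.
print(normalized_score(0.91, 0.50, 0.88))   # ~1.08, i.e. slightly better than the RF
```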

Statistical methodology includes:

  • Ranking: Frameworks are ranked by mean rank across tasks and folds.
  • Significance testing: Friedman test for global rank differences, Nemenyi post-hoc for pairwise contrasts; results are visualized by critical difference (CD) diagrams at α=0.05 (Jurado et al., 1 Apr 2025).
  • Advanced analysis: Bradley-Terry models and BT-trees investigate global and subgroup skill parameters, capturing heterogeneity in relative framework performance (Gijsbers et al., 2022).
| Framework | ΔRank at 30 min | ΔRank at 10 min | ΔRank at 5 min |
| --- | --- | --- | --- |
| AutoGluon(B) | +0.22 | +0.08 | +0.26 |
| AutoGluon(HQ) | +0.44 | +0.57 | +0.96 |
| AutoGluon(HQIL) | −0.01 | −0.46 | +0.40 |

Relative ranking of frameworks is highly stable across time constraints (Pearson r > 0.96 between 5-min and 60-min ranks; r > 0.99 for 30-min vs. 60-min) (Jurado et al., 1 Apr 2025).
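
A minimal sketch of the rank-based comparison using SciPy is given below; the scores are made up for illustration, and the Nemenyi post-hoc test and CD diagrams require an additional package (e.g., scikit-posthocs) and are omitted here.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows: tasks, columns: frameworks (illustrative scores, not real AMLB results).
scores = np.array([
    [0.91, 0.89, 0.90],
    [0.75, 0.78, 0.74],
    [0.88, 0.88, 0.86],
    [0.69, 0.72, 0.70],
    [0.95, 0.93, 0.94],
])

# Mean rank per framework across tasks (rank 1 = best, so rank the negated scores).
ranks = rankdata(-scores, axis=1)
print("mean ranks:", ranks.mean(axis=0))

# Global test for any rank difference among the frameworks.
stat, p_value = friedmanchisquare(*scores.T)
print(f"Friedman chi2={stat:.3f}, p={p_value:.3f}")
```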

5. Framework Portfolio, Early-Stopping, and Practicality

AMLB covers a comprehensive set of mature AutoML frameworks—recently 11, including variants of AutoGluon, auto-sklearn/autosklearn2, FLAML, GAMA(B), H2OAutoML, lightAutoML, MLJAR(B), TPOT, and FEDOT (Jurado et al., 1 Apr 2025).

  • Early-Stopping: Five frameworks offer internal early stopping rules, typically "no improvement for p iterations":
    • AutoGluon: EarlyStoppingEnsembleCallback, p=5 models
    • H2OAutoML: stopping_rounds=3
    • FLAML: early_stop parameter
    • TPOT: early_stop=3 (generations)
    • FEDOT: early_stopping_iterations=3
    • Abstract logic: Stop the search once M(best_{i−p..i}) ≤ M(best_{i−p−1}), i.e., the best score found in the last p iterations does not improve on the best score available before that window (see the sketch after this list).
  • Quantitative effects:
    • Moving from 60 min to 5 min yields a ~92% reduction in per-fold CPU time, i.e., close to an order-of-magnitude less compute for large benchmarking campaigns.
    • Early stopping saves 20–50% of wall-time and energy, with minimal or sometimes negative regret (r ≤ 0) in most settings; exceptions occur mainly on complex tasks for FLAML/TPOT (r > 0.3) (Jurado et al., 1 Apr 2025).
    • Frameworks with meta-learned warm-starts (e.g., AutoGluon) are especially robust at short budgets; evolutionary search–based frameworks (GAMA, TPOT) are more sensitive to reduced optimization horizon.
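
The shared patience rule can be sketched generically; the function below is illustrative only, since each framework implements and names the rule internally as listed above.

```python
def run_with_patience(propose_and_score, max_iters, patience):
    """Generic "no improvement for p iterations" rule.

    propose_and_score: hypothetical callable that evaluates the next candidate
    pipeline and returns its validation score (higher is better).
    """
    best_score = float("-inf")
    since_improvement = 0
    for i in range(max_iters):
        score = propose_and_score(i)
        if score > best_score:
            best_score = score
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= patience:   # M(best_{i-p..i}) <= M(best_{i-p-1})
            break
    return best_score
```

Frameworks differ mainly in what counts as an "iteration" (models, generations, or optimizer rounds) and in the default patience value.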

6. Extensibility, Automation, and Open Benchmarking

AMLB is designed for direct extensibility and automation (Gijsbers et al., 2019, Gijsbers et al., 2022):

  • Adding a framework: A YAML or JSON configuration plus a Python/Bash wrapper script suffice; little boilerplate is needed if the wrapper exposes the standard fit/predict/time_budget APIs.
  • Adding a dataset: A structured metadata entry (OpenML ID, target, type, metric, budget) plus any dataset-specific preprocessing logic; an illustrative entry is sketched after this list.
  • Automation: The orchestrator can run the full campaign matrix, aggregate and publish results, and trigger CI-based regeneration of live dashboards.
  • All outputs, logs, and metrics are archived, with published dashboards facilitating transparency and reproducibility (e.g., https://openml.github.io/automlbenchmark/).
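
For illustration only, a dataset/benchmark entry covering the fields listed above might look like the following Python mapping; the field names are hypothetical, and real AMLB definitions live in the YAML/JSON configuration files mentioned in Section 2.

```python
# Hypothetical benchmark entry mirroring the fields described above
# (field names are illustrative; real AMLB definitions are YAML/JSON).
NEW_TASK = {
    "name": "example-credit-default",
    "openml_task_id": 0,            # placeholder OpenML ID
    "target": "class",
    "type": "binary_classification",
    "metric": "auc",
    "folds": 10,
    "max_runtime_seconds": 3600,    # 1 h per fold
    "cores": 8,
    "max_mem_size_mb": 28672,       # 28 GB, matching the hardware limits above
}
```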

7. Experimental Findings, Recommendations, and Outlook

Empirical studies, notably those involving shorter time budgets and early stopping (Jurado et al., 1 Apr 2025), yield several robust findings:

  • Relative framework rankings are stable across a wide range of time budgets; 30-min per fold is sufficient for nearly all frameworks/datasets to reach within 5% of their 60-min performance.
  • Early stopping decreases energy and wall-time without significant performance loss, though best parameters for patience must be tuned for framework and dataset.
  • No single AutoML system dominates; performance varies across domains, dataset size, and problem type (Gijsbers et al., 2019, Zöller et al., 2019, Gijsbers et al., 2022, Balaji et al., 2018).
  • Increasing compute beyond 1h per fold has diminishing returns; variance across folds mandates multiple CV splits.
  • Recommendations include embracing a range of budgets to reflect "anytime" learning behavior and enable broader accessibility, and adopting combined strategies (short budgets, early stop, warm-starting) for green, sustainable AutoML benchmarking.

A plausible implication is that as AutoML systems increasingly support advanced meta-learning, hybrid optimization, and early-convergence detection, benchmark protocols like AMLB will need to evolve to accurately capture the full Pareto frontier of accuracy, inference speed, and ecological impact.
