OpenML-CC18 Benchmark Suite

Updated 23 April 2026

OpenML-CC18 is a curated suite of 72 supervised classification tasks featuring real-world datasets with diverse domains, sample sizes, and feature types.
It enforces strict curation criteria on dataset size, class structure, and feature complexity, using standardized 10-fold cross-validation for consistency.
The benchmark underpins algorithm comparisons, meta-learning, AutoML, and active learning by providing rich meta-features and reproducible evaluation protocols.

OpenML-CC18 is a curated benchmarking suite of 72 real-world supervised classification tasks, introduced to promote reproducibility, comparability, and methodological rigor in empirical machine learning. Developed within the OpenML ecosystem, the suite targets the evaluation of algorithms under standardized settings, spanning diverse domains, sample sizes, feature types, and class cardinalities. It has become a critical reference point for large-scale algorithm comparison, meta-learning, active learning, AutoML, and tabular foundation model evaluation, supporting systematic advancement in supervised learning on heterogeneous tabular data (Bischl et al., 2017).

1. Construction, Curation Criteria, and Statistical Diversity

OpenML-CC18 was assembled via strict, multi-stage screening of OpenML’s public datasets to ensure variation in data properties and maintain experimental feasibility. The selection pipeline (Bischl et al., 2017) enforced:

Size constraints: $500 \leq n \leq 100{,}000$ (number of examples), $d_{\text{one-hot}} \leq 5{,}000$ (post-encoding feature count).
Class structure: $k \geq 2$; each class must have $\geq 20$ examples, and the minority/majority class ratio must be $\geq 0.05$.
Task realism: exclusion of artificial/simulated datasets, tasks that can be perfectly solved by a single feature or a simple decision tree, and data lacking public provenance.
No regression, clustering, time-series, or grouped designs.
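The numeric thresholds above can be read as a simple predicate over per-dataset summary statistics. The following is a minimal sketch under that reading, not the suite's actual selection code: the `DatasetStats` container, the function name, and the three toy candidates are all hypothetical, and the realism and task-type criteria (which need human judgment or extra metadata) are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetStats:
    """Summary statistics for one candidate dataset (hypothetical container)."""
    name: str
    n_examples: int
    n_features_one_hot: int  # feature count after one-hot encoding
    class_counts: List[int]  # number of examples per class


def passes_size_and_class_criteria(stats: DatasetStats) -> bool:
    """Apply only the numeric size and class-structure thresholds."""
    if not 500 <= stats.n_examples <= 100_000:
        return False
    if stats.n_features_one_hot > 5_000:
        return False
    if len(stats.class_counts) < 2:
        return False
    if min(stats.class_counts) < 20:
        return False
    # The minority/majority class ratio must be at least 0.05.
    return min(stats.class_counts) / max(stats.class_counts) >= 0.05


# Made-up candidates, for illustration only.
candidates = [
    DatasetStats("balanced_toy", 1_000, 40, [500, 500]),
    DatasetStats("too_small", 300, 10, [150, 150]),
    DatasetStats("too_imbalanced", 10_000, 25, [9_700, 300]),
]
kept = [c.name for c in candidates if passes_size_and_class_criteria(c)]
print(kept)
```

Only the first toy candidate survives: the second has fewer than 500 examples, and the third has a minority/majority ratio of roughly 0.03.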

The final benchmark includes both binary and multiclass tasks (classes up to $k=50$), with feature sets ranging from $d=5$ to more than 3,000. Domains comprise healthcare, finance, biomedicine, sensor data, vision (e.g., MNIST, CIFAR-10), and engineered datasets. Class imbalances vary, as do proportions of numeric/categorical attributes and rates of missingness (Bischl et al., 2017, Xu et al., 2021).

# Tasks	$n$ Range	# Features ($d$)	Classes ($k$)	Domains
72	500–100,000	5–3,000+	2–50	Tabular, vision, sensor

A subset of tasks may be filtered for specialized studies (e.g., only binary, only ≤100 features) (Jagadish et al., 2024).

2. Metadata, Access, and Evaluation Protocols

Each OpenML-CC18 task is uniquely defined by a dataset, target column, evaluation protocol (standard is 10-fold cross-validation with predefined splits), and meta-information fields (≈70 per set) such as class entropy, imbalance ratios, instance counts, feature modalities, and missing value prevalence (Bischl et al., 2017, Xu et al., 2021).
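To make a few of these meta-information fields concrete, here is a small sketch that computes class entropy, a minority/majority imbalance ratio, and the fraction of missing cells. It uses textbook definitions and is an assumption about what such fields measure; OpenML's own meta-feature names and exact formulas may differ, and the function names and toy inputs below are hypothetical.

```python
import math
from collections import Counter
from typing import Optional, Sequence


def class_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical class distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def imbalance_ratio(labels: Sequence[str]) -> float:
    """Size of the smallest class divided by the size of the largest."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


def missing_fraction(rows: Sequence[Sequence[Optional[float]]]) -> float:
    """Share of cells that are missing (represented here as None)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


# Toy inputs, for illustration only.
labels = ["a", "a", "a", "b"]
rows = [[1.0, None], [2.0, 3.0]]

print(round(class_entropy(labels), 4))
print(imbalance_ratio(labels))
print(missing_fraction(rows))
```

Values like these are what make programmatic filtering possible, for example keeping only tasks whose imbalance ratio exceeds some cutoff.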

OpenML exposes CC18 via Python, Java, and R APIs with standardized ARFF (and CSV) formats. Code snippets permit batch downloading, training, prediction, evaluation, and result upload with a few lines (Bischl et al., 2017). Meta-feature fields enable fine-grained programmatic filtering to assemble custom benchmark subsets.
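As an illustration of the Python route, the sketch below lists the predefined fold-0 split sizes for the first few tasks of the suite. It assumes the `openml` package is installed and the OpenML server is reachable; `SUITE_ALIAS` and `iter_cc18_first_folds` are hypothetical names introduced here, and the `openml` calls used (`openml.study.get_suite`, `openml.tasks.get_task`, `get_train_test_split_indices`) should be checked against the installed version, since argument details have changed between releases.

```python
from typing import Iterator, Tuple

SUITE_ALIAS = "OpenML-CC18"  # alias under which the suite is published


def iter_cc18_first_folds(limit: int = 3) -> Iterator[Tuple[int, int, int]]:
    """Yield (task_id, n_train, n_test) for fold 0 of the first `limit` tasks.

    Needs the `openml` package and network access. The import is done lazily
    so that defining this function has no side effects.
    """
    import openml

    suite = openml.study.get_suite(SUITE_ALIAS)
    for task_id in suite.tasks[:limit]:
        task = openml.tasks.get_task(task_id)
        # The split indices come from the task itself, so every client
        # evaluates on the same predefined folds.
        train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)
        yield task_id, len(train_idx), len(test_idx)
```

Calling `list(iter_cc18_first_folds(limit=3))` would download the three tasks and return their fold sizes. Training, evaluation, and result upload follow the same pattern on top of the task object.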

By construction, all tasks provide identical preprocessing: categorical features are one-hot or ordinal-encoded, missing values are imputed (mode or mean as appropriate), and evaluation splits are consistent across all clients (Bischl et al., 2017, Xu et al., 2021, Tanna et al., 14 Jan 2026).
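The encoding and imputation steps named above can be sketched in a few lines of plain Python. This is a generic illustration of mean imputation, mode imputation, and one-hot encoding, not the code path used by OpenML or by any of the cited studies; the helper names and toy columns are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import List, Optional


def impute_mean(column: List[Optional[float]]) -> List[float]:
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def impute_mode(column: List[Optional[str]]) -> List[str]:
    """Replace missing categorical values with the most frequent category."""
    observed = [v for v in column if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def one_hot(column: List[str]) -> List[List[int]]:
    """Encode categories as indicator vectors, one slot per sorted category."""
    categories = sorted(set(column))
    return [[int(v == c) for c in categories] for v in column]


# Toy columns, for illustration only.
ages = impute_mean([20.0, None, 40.0])
colors = impute_mode(["red", None, "red", "blue"])
encoded = one_hot(colors)

print(ages)
print(colors)
print(encoded)
```

In a cross-validated setting the fill values and category list would be derived from the training fold only and then applied to the test fold, to avoid leaking test information.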

3. Benchmark Use Cases and Influence on Research Methodologies

OpenML-CC18 is the de facto standard for:

Algorithm comparisons: Quantitative head-to-head studies of classical classifiers (RF, SVM, MLP), ensemble models, decision forests, and deep learning (Cardoso et al., 2020, Xu et al., 2021, Bischl et al., 2017).
Meta-learning and AutoML: Systematic meta-feature extraction for algorithm selection, automated hyperparameter optimization, and transfer/meta-learning (Brinkmeyer et al., 2019, Jagadish et al., 2024).
Active learning: Large-scale analysis of acquisition functions and pre-training approaches on tabular classification (Bahri et al., 2022).
Robust evaluation: Item Response Theory (IRT) and Glicko-2 ratings quantify dataset difficulty and classifier ability, allowing for tailored subset selection to either stress classifier limits or enable tight pairwise algorithm differentiation (Cardoso et al., 2020, Cardoso et al., 2021).
Tabular foundation models and fine-tuning: Pretrained tabular foundation models (TFMs) are systematically compared in zero-shot, meta-learned, supervised fine-tuning (SFT), and parameter-efficient adaptation regimes (Tanna et al., 14 Jan 2026).

Researchers frequently construct focused CC18 subsets to control for domain, dimensionality, class balance, or label cardinality for targeted benchmarking (Jagadish et al., 2024).

4. Empirical Insights from Systematic Benchmarking

Aggregate analyses over CC18 reveal:

Dataset stratification: Most CC18 datasets are “easy”: in 49 of the 60 datasets analyzed, over 80% of instances fall below the IRT difficulty threshold, and only ~10% of CC18 tasks are predominantly difficult. High-discrimination tasks are more common than genuinely hard ones (Cardoso et al., 2020, Cardoso et al., 2021).
Classifier ranking: Glicko-2 meta-ratings computed over CC18 favor Random Forest (a tree-based ensemble) and MLPClassifier on average; random, perfect, and pessimistic baselines define the ability extremes (Cardoso et al., 2020).
Acquisition function benchmarking: Across 69 tabular CC18 tasks, margin sampling matches or outperforms other active learning baselines (CoreSet, BALD, and cluster/diversity-based methods) across all data regimes and pre-training variants. Accuracy gains over random selection are consistently ~1–4 percentage points (Bahri et al., 2022).
Streaming vs. batch learning: Stream Decision Forest (SDF) and XForest achieve accuracy within ±5% of batch Random Forest on most of CC18, sometimes surpassing it in low-sample regimes and/or high-dimensional settings. Streaming models complete all tasks within commodity hardware constraints (Xu et al., 2021).
TFM adaptation: Zero-shot inference with tabular foundation models (TFMs) is the reference point on CC18 against which adaptation is measured. Meta-learning yields modest gains, the largest reported for TabPFN. Full supervised fine-tuning (SFT) often degrades performance except in specific “medium & wide” settings. Parameter-efficient fine-tuning (PEFT) recovers a large fraction of SFT gains without overfitting (Tanna et al., 14 Jan 2026).
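Margin sampling, the acquisition function singled out in the list above, is simple enough to sketch directly: for each unlabeled instance, take the gap between the two highest predicted class probabilities and query the instances with the smallest gap. The version below is a minimal illustration of that standard definition, assuming each instance has at least two class probabilities; the function names and the toy pool are hypothetical and not taken from the cited study.

```python
from typing import List, Sequence


def margin(probabilities: Sequence[float]) -> float:
    """Gap between the largest and second-largest class probability."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def margin_sampling(pool_probs: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the `batch_size` pool instances with the smallest margin."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:batch_size]


# Toy predicted probabilities for four unlabeled instances.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # small margin
    [0.55, 0.40, 0.05],  # moderate margin
    [0.34, 0.33, 0.33],  # near-tie, smallest margin
]
picked = margin_sampling(pool, batch_size=2)
print(picked)
```

The selection rule has no tunable parameters beyond the batch size, which is part of its practical appeal.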

5. Subset Selection, Difficulty Profiling, and Practical Recommendations

Analyses utilizing IRT and Glicko-2 illustrate that not all CC18 datasets are equally informative for algorithm benchmarking (Cardoso et al., 2020, Cardoso et al., 2021). Key findings:

Only a minor subset (<12%) of task instances are truly “hard”, i.e., above the IRT difficulty threshold.
In half the suite, 80% of instances are highly discriminative, i.e., above the IRT discrimination threshold.
For robust head-to-head comparisons, select high-discrimination datasets.
For stress-testing classifier ability, restrict to sets with >50% difficult instances.
A carefully chosen 50% subset of CC18 can be as discriminative as the full suite for most algorithmic comparisons.
For active learning, margin-based selection is nearly always optimal and requires no hyperparameter tuning (Bahri et al., 2022).
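For readers unfamiliar with IRT, the notions of difficulty and discrimination used in these findings can be illustrated with the standard three-parameter logistic (3PL) item response function, where a classifier plays the role of the respondent and a test instance plays the role of the item. That the cited analyses use exactly this parameterization is an assumption here, and the function and argument names are hypothetical.

```python
import math


def p_correct(ability: float, discrimination: float, difficulty: float,
              guessing: float = 0.0) -> float:
    """3PL item response function.

    Probability that a classifier with the given ability labels an instance
    correctly, given the instance's discrimination, difficulty, and guessing
    parameters.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# When ability equals difficulty and there is no guessing, the probability is 0.5.
print(p_correct(ability=0.0, discrimination=1.0, difficulty=0.0))

# Higher discrimination separates classifiers around the difficulty level more sharply.
print(p_correct(ability=1.0, discrimination=2.0, difficulty=0.0))
print(p_correct(ability=1.0, discrimination=0.5, difficulty=0.0))
```

Under this form, a hard instance is one that only high-ability classifiers get right, while a highly discriminative instance is one whose success probability changes steeply with ability, which is why such instances are useful for telling similar algorithms apart.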

Use Case	Recommended CC18 Subset
Stress-test ability	Datasets with >50% difficult items
Fine discrimination	Datasets with >80% high-discrimination items
Efficient protocol	50% highest-difficulty or discrimination subset
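The first two rows of the table translate into straightforward filters once each dataset has a per-dataset share of difficult and of highly discriminative instances. The sketch below assumes those shares have already been computed by an IRT analysis; the `DatasetProfile` container, the function names, and the four example profiles are hypothetical and do not correspond to real CC18 datasets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetProfile:
    """Per-dataset IRT summary (hypothetical container)."""
    name: str
    share_difficult: float       # fraction of instances flagged as difficult
    share_discriminative: float  # fraction flagged as highly discriminative


def stress_test_subset(profiles: List[DatasetProfile]) -> List[str]:
    """Datasets in which more than half of the instances are difficult."""
    return [p.name for p in profiles if p.share_difficult > 0.50]


def fine_discrimination_subset(profiles: List[DatasetProfile]) -> List[str]:
    """Datasets in which more than 80% of instances are highly discriminative."""
    return [p.name for p in profiles if p.share_discriminative > 0.80]


# Made-up profiles, for illustration only.
profiles = [
    DatasetProfile("dataset_a", 0.62, 0.40),
    DatasetProfile("dataset_b", 0.08, 0.91),
    DatasetProfile("dataset_c", 0.55, 0.85),
    DatasetProfile("dataset_d", 0.10, 0.30),
]

print(stress_test_subset(profiles))
print(fine_discrimination_subset(profiles))
```

A dataset can appear in both subsets, as `dataset_c` does here, and one that appears in neither is a natural candidate to drop when a shorter protocol is needed.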

6. Impact on Advanced Modeling and Future Directions

OpenML-CC18’s structure has enabled rapid development, robust testing, and reproducibility for advanced learning paradigms:

Cross-dataset meta-learning: Schema-alignment models (e.g., Chameleon) leverage CC18 for the first cross-task few-shot learning experiments on tabular domains, supporting both feature and instance subsampling regimes (Brinkmeyer et al., 2019).
Ecological priors and cognitive-aligned learning: Transformer models meta-trained on LLM-generated, ecologically plausible tasks (ERMI) match or surpass XGBoost, SVM, and TabPFN on CC18, exemplifying the suite’s utility for real-world generalization (Jagadish et al., 2024).
Foundational model evaluation: CC18 provides a uniform platform to benchmark zero-shot, meta-learned, and fine-tuned TFMs, clarifying the regimes where adaptation is beneficial (rarely for small $n$, often for imbalanced or wide-feature medium-sized tasks) (Tanna et al., 14 Jan 2026).

Future directions include extension to more challenging regimes (higher-dimensional, richer missingness patterns, variable label cardinality), and more granular difficulty profiling to enable adaptive benchmarking (Cardoso et al., 2020, Bischl et al., 2017).

7. Concluding Position in the Benchmarking Landscape

OpenML-CC18 has established itself as the canonical tabular classification benchmark suite, cited across empirical ML, AutoML, and theoretical algorithmic research. Its curation and rich meta-information facilitate reproducibility, empirical rigor, and community-driven extension. Subset selection methodologies and systematic evaluation protocols developed around CC18 now inform next-generation benchmarking standards within and beyond the OpenML platform (Bischl et al., 2017).