TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications
Abstract: We introduce TabRepo, a new dataset of tabular model evaluations and predictions. TabRepo contains the predictions and metrics of 1310 models evaluated on 200 classification and regression datasets. We illustrate the benefit of our dataset in multiple ways. First, we show that it enables analyses such as comparing Hyperparameter Optimization against current AutoML systems, while also accounting for ensembling at marginal cost by using precomputed model predictions. Second, we show that our dataset can be readily leveraged to perform transfer learning. In particular, we show that applying standard transfer-learning techniques makes it possible to outperform current state-of-the-art tabular systems in accuracy, runtime, and latency.
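To illustrate why precomputed predictions make ensembling cheap, here is a minimal sketch of greedy ensemble selection over cached validation predictions. The abstract does not say which ensembling method is used, so the choice of greedy selection with replacement is an assumption, and the function name, array layout, and toy data are all hypothetical. The point is only that no model is retrained: each round reuses the stored predictions.

```python
import numpy as np


def greedy_ensemble_selection(val_preds, y_val, n_rounds=4):
    """Greedy ensemble selection with replacement (hypothetical helper).

    val_preds: array of shape (n_models, n_samples) holding cached
               validation predictions, one row per model.
    y_val:     array of shape (n_samples,) with validation targets.
    Returns a weight per model; the weights sum to 1.
    """
    val_preds = np.asarray(val_preds, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    counts = np.zeros(val_preds.shape[0], dtype=int)
    running_sum = np.zeros_like(y_val)
    for step in range(1, n_rounds + 1):
        # Average each candidate model into the current ensemble and score it
        # by mean squared error; this only touches the cached predictions.
        candidates = (running_sum[None, :] + val_preds) / step
        errors = np.mean((candidates - y_val[None, :]) ** 2, axis=1)
        best = int(np.argmin(errors))
        counts[best] += 1
        running_sum += val_preds[best]
    return counts / counts.sum()


# Toy data (assumed for illustration): two models with opposite biases
# and one model that is far off.
y_val = np.array([1.0, 2.0, 3.0, 4.0])
val_preds = np.stack([y_val + 0.5, y_val - 0.5, y_val + 3.0])

weights = greedy_ensemble_selection(val_preds, y_val, n_rounds=4)
ensemble_pred = weights @ val_preds
ensemble_mse = float(np.mean((ensemble_pred - y_val) ** 2))
single_mse = np.mean((val_preds - y_val) ** 2, axis=1)
```

On this toy data the two oppositely biased models cancel each other out, so the selected ensemble has a lower validation error than any single model, and the cost of finding it is a few array operations.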