
TabArena: A Living Benchmark for Machine Learning on Tabular Data (2506.16791v2)

Published 20 Jun 2025 in cs.LG and cs.AI

Abstract: With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning and investigate the contributions of individual models. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

Summary

TabArena: A Living Benchmark for Machine Learning on Tabular Data

The paper "TabArena: A Living Benchmark for Machine Learning on Tabular Data" presents TabArena, a dynamic and continuously updated benchmarking system designed to evaluate machine learning models on tabular data. The authors address the prevalent limitations in existing benchmarking practices, which often involve static evaluations that do not adapt to new developments or address existing flaws. The introduction of TabArena signifies a shift towards sustainability and community-driven improvements in benchmarking.

Benchmarking Issues and New Approach

Current benchmarks for tabular data face several issues, including outdated datasets, problematic licensing, data leakage, and a biased representation of tasks. Furthermore, many benchmarks receive no maintenance after publication, so flawed baselines continue to be replicated in later work as if they represented the state of the art. To remedy these issues, TabArena is designed as a living benchmark, open to community contributions and professionally maintained in the manner of a software development project.

TabArena introduces substantial protocol changes across model and hyperparameter optimization, dataset curation, and evaluation design. Starting from 1053 candidate datasets, the authors curate 51 suitable ones, filtering out, for example, datasets with non-IID structure or data derived from non-tabular modalities. This rigorous curation step ensures that TabArena better represents real-world tabular data tasks.

Model Benchmarks and Results

TabArena curates 16 tabular machine learning models, alongside three tabular foundation models, and evaluates them through extensive experiments in which approximately 25 million model instances are trained. This setup avoids common issues of previous benchmarks, such as weak hyperparameter optimization and flawed validation procedures.

A key finding of the paper is that while gradient-boosted trees remain strong contenders, deep learning methods achieve competitive performance under larger time budgets when combined with ensembling. In particular, models such as TabM perform strongly under post-hoc ensembling protocols, and foundation models like TabPFNv2 dominate on smaller datasets thanks to effective in-context learning.

Innovative Ensembling and Evaluation Strategies

The authors emphasize the importance of ensembling for reaching peak model performance, a practice widely used in data science competitions. Post-hoc ensembling over hyperparameter configurations significantly affects model rankings, a factor that previous benchmarks have underemphasized, and the paper demonstrates that leaderboard dynamics change drastically when this strategy is applied, as illustrated by the sketch below.
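
The paper does not spell out an implementation here, but post-hoc ensembling of hyperparameter configurations is commonly realized as greedy weighted ensemble selection over validation predictions (in the style of Caruana et al.). The following is a minimal sketch under that assumption; the function name, the squared-error objective, and the number of iterations are illustrative choices, not TabArena's exact protocol.

```python
import numpy as np

def greedy_post_hoc_ensemble(val_preds, y_val, n_iterations=25):
    """Greedy weighted ensemble selection over validation predictions.

    val_preds: (n_configs, n_samples) validation predictions, one row per
               trained hyperparameter configuration.
    y_val:     (n_samples,) validation targets.
    Returns normalized ensemble weights over the configurations.
    """
    n_configs = val_preds.shape[0]
    counts = np.zeros(n_configs)
    ensemble_sum = np.zeros_like(y_val, dtype=float)

    for step in range(1, n_iterations + 1):
        best_score, best_idx = np.inf, 0
        for j in range(n_configs):
            # Validation error if configuration j were added (selection with replacement).
            candidate = (ensemble_sum + val_preds[j]) / step
            score = np.mean((candidate - y_val) ** 2)
            if score < best_score:
                best_score, best_idx = score, j
        counts[best_idx] += 1
        ensemble_sum += val_preds[best_idx]

    return counts / counts.sum()
```

Selection with replacement lets strong configurations accumulate larger weights, while configurations that never improve the validation score receive weight zero.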

Evaluation within TabArena uses pairwise comparisons through the Elo rating system, providing clear, calibrated insights into model competitiveness. This system aligns with methodologies in other fields, such as LLM evaluations, where robust comparative metrics are required.
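
For context, a standard Elo update for a single pairwise comparison is sketched below; the K-factor and logistic scale are conventional defaults, and TabArena's exact aggregation procedure (for example, bootstrapping over comparison orderings) is not reproduced here.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0, scale=400.0):
    """One Elo update after comparing model A against model B.

    score_a is 1.0 if A wins the pairwise comparison on a task,
    0.0 if A loses, and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```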

Implications and Future Directions

The implications of TabArena are substantial. Practically, it provides a robust benchmarking system that evolves in line with new methodologies and model improvements. Theoretically, it offers a foundation for broader discussions on benchmarking strategies across machine learning domains, potentially influencing practices in non-tabular domains.

As TabArena matures, it is expected to incorporate datasets and tasks beyond IID tabular data, including non-IID datasets with temporal dependencies or very small sample sizes. Extending the benchmark along these lines would make TabArena a more holistic tool for understanding how machine learning models behave in practice.

TabArena promises to redefine best practices by offering a transparent, actively maintained platform with reproducibility of results at its core. Its community-centric approach paves the way for continuous improvements, ensuring that benchmarks stay relevant amid evolving technologies and methodological trends.

In conclusion, TabArena serves as a comprehensive and continuously updated infrastructure for evaluating models on tabular data, addressing limitations of static benchmarks, and encouraging community participation to maintain and expand its scope. Its methodologies, like post-hoc ensembling and thorough dataset curation, provide a foundation for future studies in model evaluation and benchmarking. As an open-source project, TabArena invites widespread collaboration, aiming to foster innovation and establish reliable benchmarks in machine learning practice.