- The paper demonstrates that using proper scoring rules reveals significant differences in model ranking and uncertainty quantification compared to traditional point metrics.
- It introduces a comprehensive benchmark utilizing CRPS, Interval Score, and other metrics to assess calibration, sharpness, and sensitivity to tail events.
- Empirical findings underscore that scoring rule selection critically impacts performance evaluation in high-stakes applications like finance and clinical decision-making.
ScoringBench: Comprehensive Evaluation of Tabular Foundation Models via Proper Scoring Rules
Motivation and Problem Statement
The evaluation of tabular regression models, particularly recent transformer-based tabular foundation models such as TabPFN and TabICLv2, has not kept pace with their capabilities. While these models yield full predictive distributions, legacy evaluation protocols focus almost exclusively on point-estimation metrics such as RMSE or R2, omitting any assessment of distributional accuracy, tail behavior, or uncertainty quantification. This gap is critical in domains where risk is asymmetric or tail events are consequential—such as finance or clinical decision-making.
The practice of using point metrics implicitly assumes that the mean is always the quantity of primary interest, an assumption that breaks down in high-stakes environments. Although strictly proper scoring rules have been advocated for decades as the methodological cornerstone of probabilistic forecast evaluation, their adoption in tabular regression benchmarking is almost nonexistent. Without distributional diagnostics, deployment choices can be suboptimal: a model may excel on mean squared error yet fail to capture essential features of the conditional predictive distribution.
Benchmark Design and Methodology
ScoringBench addresses this deficiency by systematizing the evaluation of tabular regression models using a spectrum of proper scoring rules, in addition to conventional point metrics. The platform implements and reports on CRPS, CRLS (Exceedance Probability Score), Interval Score, Brier Score, β-Energy Scores (with several β values), and weighted CRPS (with domain-adaptable emphasis), yielding granular insights into how models trade off calibration, sharpness, and sensitivity to tail events.
Figure 1: ScoringBench: Evaluating tabular regression models with proper scoring rules.
The evaluation harness encompasses:
- Analytical definitions and implementations for each scoring rule, grounded in the foundational works of Gneiting and others.
- Diagnostics—sharpness and dispersion—for comparing models' uncertainty quantification, enabling discrimination between heteroscedastic and homoscedastic estimators.
- Empirical coverage at various nominal levels (90%, 95%) to quantify calibration of confidence intervals.
- Rank-based permutation tests over cross-validated, independent dataset-level performance metrics, explicitly controlling for pseudoreplication and the incommensurability of scores across heterogeneous tasks, following best practices established in statistical machine learning.
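ScoringBench's exact implementations live in its repository; as a minimal sketch of three of the metrics above, the ensemble estimator of CRPS, the interval score, and empirical coverage can be computed as follows (function names are illustrative, not the benchmark's API):

```python
from itertools import product

def crps_ensemble(samples, y):
    """Empirical CRPS for an ensemble forecast (lower is better):
    E|X - y| - 0.5 * E|X - X'|, estimated from the sample set."""
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a, b in product(samples, samples)) / (2 * m * m)
    return term1 - term2

def interval_score(lo, hi, y, alpha=0.1):
    """Interval score for a central (1 - alpha) prediction interval [lo, hi]:
    interval width plus a 2/alpha penalty per unit of miss outside it."""
    penalty = (2 / alpha) * ((lo - y) if y < lo else 0.0) \
            + (2 / alpha) * ((y - hi) if y > hi else 0.0)
    return (hi - lo) + penalty

def empirical_coverage(intervals, ys):
    """Fraction of observations falling inside their prediction intervals;
    compared against the nominal level (e.g. 90%, 95%) to assess calibration."""
    hits = sum(1 for (lo, hi), y in zip(intervals, ys) if lo <= y <= hi)
    return hits / len(ys)
```

The interval score makes the calibration–sharpness trade-off explicit: a narrow interval is rewarded only as long as it keeps covering the realized outcome.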
Models are benchmarked on standardized OpenML datasets (suites 269, 297, and 299, reduced to 46 unique datasets after filtering) with 5-fold cross-validation. The evaluated models include recent TabPFN variants with various fine-tuning objectives, TabICLv2, and adapted XGBoost baselines.
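The rank-based permutation testing described above can be sketched with a generic paired sign-flip construction (not necessarily ScoringBench's exact procedure): each dataset contributes a single cross-validated score per model, which avoids pseudoreplication from pooling fold-level results, and within-dataset comparison sidesteps the incommensurability of raw scores across tasks.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-dataset
    score differences between two models. Each dataset is one
    independent pair; the statistic is the mean difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Randomly swap which model "owns" each dataset's score.
        perm = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(perm) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Replacing the raw differences with within-dataset rank differences gives the rank-based variant, which is invariant to each dataset's score scale.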
Empirical Findings
A central finding is that model rankings are highly dependent on the choice of scoring rule. This corroborates prior work but is unambiguously demonstrated at scale for tabular foundation models. For instance, the Spearman rank correlation between Brier score and log-score-based rankings is as low as 0.15, confirming the lack of concordance between different proper rules for these tasks.
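Rank concordance between two scoring rules can be checked directly with a Spearman correlation over per-model rankings; a dependency-free sketch (the helper names are illustrative):

```python
def rankdata(values):
    """Ranks starting at 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-indexed
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Feeding in each model's Brier score and log score yields the correlation between the two induced leaderboards; values near 1 mean the rules agree on ranking, values near 0 (such as the 0.15 reported here) mean they largely do not.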
Fine-tuning TabPFNv2.5 on loss functions aligned with a given scoring rule shifts inductive bias, yielding significantly better performance according to that rule but potentially at the expense of others. For example, the CRLS-finetuned variant wins on CRLS and R2, while TabICLv2 dominates the leaderboard on CRPS. This effect is robust across multiple datasets and confirmed with nonparametric permutation tests, mitigating risks of spurious statistical significance.
The choice of metric is therefore not merely technical, but implicitly encodes risk preferences and domain-specific utility. For example, in financial forecasting, a weighted CRPS emphasizing the left tail may be essential, whereas in clinical risk estimation, interval scores may be of paramount concern. ScoringBench enables practitioners to select, fine-tune, and evaluate models according to metrics that faithfully reflect domain loss functions.
Model diagnostics on sharpness and dispersion further illuminate differences that would be invisible under point metrics, enabling a more nuanced comparison of how models quantify and allocate uncertainty across heterogeneous test instances.
Practical and Theoretical Implications
The implications for applied machine learning and statistical modeling are substantial:
- Model evaluation must be aligned with application utility: The scoring rule (or loss function) chosen for evaluation, training, and selection must reflect end-user requirements for risk and reward, especially for use cases with asymmetric or complex error tolerance.
- Reporting only RMSE or R2 is insufficient: Multiple models may attain similar means or variances but differ drastically in outlier or tail behavior, with potentially catastrophic real-world consequences.
- Benchmark standardization: By operationalizing a reproducible, extensible, and transparent benchmark (tracked by git pull requests and a live leaderboard), ScoringBench sets a higher methodological baseline for the field and makes statistical significance assessment tractable and defensible.
- Fine-tuning on domain-appropriate scoring functions: Empirical evidence confirms that the inductive bias of TabPFN-style models can be steered toward performance on a target scoring rule, but no universal pretraining objective exists. For high-risk or regulatory domains, this opens the door to customized training regimes tailored to domain requirements—using custom proper scoring rules when off-the-shelf ones are unsatisfactory.
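The paper's fine-tuning objectives are specific to the TabPFN-style models; as a generic illustration of what a scoring-rule-aligned loss looks like, the pinball (quantile) loss is a standard proper score for a single quantile and could serve as one such training target:

```python
def pinball_loss(y, q_hat, tau):
    """Pinball (quantile) loss for the tau-quantile prediction q_hat.
    Under-prediction (y > q_hat) is penalized at rate tau, over-prediction
    at rate (1 - tau), so minimizing it targets the tau-quantile of the
    conditional distribution rather than the conditional mean."""
    diff = y - q_hat
    return max(tau * diff, (tau - 1) * diff)
```

For tau = 0.9, under-predicting by one unit costs 0.9 while over-predicting by one unit costs only 0.1, which pushes the model's output toward the upper tail — exactly the kind of inductive-bias steering the empirical results describe.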
Future Directions
The platform is positioned for immediate extension, including support for multivariate (joint) regression targets, exploration of alternative normalization strategies for multi-dataset comparison, and broader community involvement in metric selection, dataset curation, and baseline submission. This builds toward a vision of statistically credible, risk-aware model evaluation pipelines for the next generation of tabular foundation models.
Conclusion
ScoringBench establishes a comprehensive, reproducible standard for the evaluation of tabular foundation models with proper scoring rules, making explicit the domain-dependent nature of "best" models and elevating the benchmark for empirical rigor in tabular regression research. By supporting calibrated comparison across a full suite of distributional metrics, it enables both practitioners and theorists to align model selection with the nuances of their operational context, rather than fixating on point-error proxies. The public availability of the tool and leaderboard catalyzes further adoption and method development in application-critical AI systems.