Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data (2507.03971v1)
Published 5 Jul 2025 in cs.LG, cs.AI, stat.ME, and stat.ML
Abstract: Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
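The two-stage recipe the abstract describes (pre-train on synthetic data, then continue pre-training on curated real-world data) can be illustrated with a minimal sketch. The linear model, dataset shapes, and learning rates below are purely illustrative assumptions; Real-TabPFN continues pre-training a transformer-based tabular foundation model, not a linear regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_grad_step(w, X, y, lr):
    """One gradient-descent step on mean-squared error for a linear model."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def train(w, X, y, steps, lr):
    for _ in range(steps):
        w = mse_grad_step(w, X, y, lr)
    return w

# Stage 1: large "synthetic" pre-training corpus drawn from a prior
# (stand-in for TabPFN's synthetic pre-training data).
prior_w = np.array([1.5, -2.0])
X_syn = rng.normal(size=(512, 2))
y_syn = X_syn @ prior_w + rng.normal(scale=1.0, size=512)

# Stage 2: smaller, curated "real-world" data from a shifted distribution
# (stand-in for the curated collection of large real datasets).
real_w = np.array([1.2, -2.3])
X_real = rng.normal(size=(128, 2))
y_real = X_real @ real_w + rng.normal(scale=0.1, size=128)

w0 = np.zeros(2)
# Synthetic pre-training, then continued pre-training at a smaller
# learning rate starting from the pre-trained weights.
w_pre = train(w0, X_syn, y_syn, steps=200, lr=0.05)
w_cont = train(w_pre, X_real, y_real, steps=100, lr=0.01)

def real_loss(w):
    return float(np.mean((X_real @ w - y_real) ** 2))

print(real_loss(w_pre), real_loss(w_cont))
```

In this toy setting, the continued-training stage adapts the synthetically pre-trained weights toward the real data distribution, lowering the loss on real data, which mirrors the intuition behind the paper's targeted continued pre-training phase.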