TabularFM: Framework for Tabular FMs

Updated 20 November 2025
  • TabularFM is an open framework and research methodology for training foundational models on structured tabular data, addressing a gap left by FM research focused on unstructured modalities.
  • It integrates state-of-the-art generative architectures such as CTGAN, TVAE, and Transformer-based models with standardized data curation and reproducible evaluation protocols.
  • Empirical results show that pretraining consistently improves model performance across synthetic data generation and realistic transfer tasks in diverse application domains.

TabularFM is an open framework and research methodology for developing, evaluating, and benchmarking foundational models (FMs) specifically tailored to structured tabular data. Responding to the limitations of prevailing FM research—largely focused on unstructured modalities such as text and vision—TabularFM consolidates state-of-the-art generative models, standardized data curation pipelines, massive open-access tabular corpora, and reproducible evaluation protocols to enable systematic progress in tabular machine learning (Tran et al., 14 Jun 2024).

1. Motivation for Tabular Foundation Models

Tabular data underpins analytics and decision-making in sectors such as healthcare, finance, and scientific measurement, yet historically lags behind unstructured domains in FM research. Two primary issues constrain advances: (i) publicly available tabular datasets are typically small and noisy, often containing heterogeneous mixtures of numeric, categorical, and sometimes textual or time-series fields, and (ii) there is no clear analog to domain-specialized neural inductive biases (e.g., CNNs for images, Transformers for text) in the tabular setting. The absence of standard large-scale corpora and pretraining paradigms impedes the development of FMs that learn transferable inductive biases for structured tables (Tran et al., 14 Jun 2024).

2. Curated Dataset Collection and Preprocessing

TabularFM curates and cleans two major corpora for model pretraining:

  • GitTables: Filtered from ∼1 million GitHub CSVs to 1,258 fully structured numeric/categorical tables (avg. 9.5 columns, 1,113 rows each).
  • Kaggle tabular datasets: From 43,514 datasets, stringent cleaning yields 1,435 tables (avg. 8.4 columns, 225 rows).

Cleaning involves excluding datasets with text, URLs, timestamps, or excessive nulls, removing columns with >50% nulls or >90% unique categorical values, and restricting to tables that retain ≥10% of original columns after cleaning. Datasets are split randomly (80/10/10%) or via clustered “domain splits” using BERT embeddings for broader robustness assessment (Tran et al., 14 Jun 2024).
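
These filters are mechanical enough to express directly in code. Below is a minimal pandas sketch of the column-level rules; the thresholds mirror the text, but the function name and exact checks are illustrative rather than the released TabularFM pipeline.

```python
from typing import Optional

import pandas as pd

def clean_table(df: pd.DataFrame,
                max_null_frac: float = 0.50,
                max_unique_frac: float = 0.90,
                min_kept_frac: float = 0.10) -> Optional[pd.DataFrame]:
    """Illustrative column filters following the rules described above."""
    kept = []
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > max_null_frac:
            continue  # drop columns with >50% nulls
        if s.dtype == object and s.nunique() > max_unique_frac * len(s):
            continue  # drop near-unique categoricals (e.g., ID-like columns)
        kept.append(col)
    if len(kept) < min_kept_frac * df.shape[1]:
        return None  # table keeps <10% of its columns: exclude it entirely
    return df[kept]
```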

3. Model Architectures and Pretraining Paradigms

TabularFM provides implementations of four primary generative FM families:

  • CTGAN: A conditional GAN tailored for tabular data mixing numeric and categorical fields, employing PacGAN and WGAN-GP loss formulations and conditional column targets.
  • Tabular VAE (TVAE), Shared TVAE (STVAE), STVAE+Metadata (STVAEM): TVAE models numeric columns with Gaussian mixtures and categorical columns with softmax outputs, optimizing the ELBO; STVAE restricts the decoder variance (reducing the numeric reconstruction term to MSE) to enable transferability; STVAEM additionally incorporates BERT-derived column signatures as row-level metadata.
  • Transformer-based (GReaT): Serializes rows into text, tokenizes with BPE, and trains an autoregressive decoder (distilled GPT-2 backbone) on next-token prediction (a serialization sketch follows this list).
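
As a concrete illustration of the GReaT serialization step, the sketch below encodes a row as comma-separated "<column> is <value>" clauses and permutes clause order, the augmentation GReaT uses to avoid overfitting a fixed column ordering; the example columns are hypothetical.

```python
import random

import pandas as pd

def serialize_row(row: pd.Series, shuffle: bool = True) -> str:
    """Encode one table row as "<column> is <value>" clauses, the
    textual format GReaT trains its autoregressive decoder on.
    Shuffling clause order is the permutation augmentation GReaT
    uses so the model does not memorize one fixed column order."""
    clauses = [f"{col} is {val}" for col, val in row.items()]
    if shuffle:
        random.shuffle(clauses)
    return ", ".join(clauses)

df = pd.DataFrame({"Age": [34], "Sex": ["F"], "ChronicDisease": ["yes"]})
print(serialize_row(df.iloc[0]))
# e.g. "Sex is F, ChronicDisease is yes, Age is 34"
```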

All models are pretrained across the entire corpus with self-supervised objectives (e.g., adversarial or variational for CTGAN/TVAE, next-token for GReaT). Pretrained models are released for transfer learning and benchmarking (Tran et al., 14 Jun 2024).
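
For the Transformer family, the next-token objective reduces to standard causal language modeling over serialized rows. A minimal sketch using the Hugging Face transformers API and the distilled GPT-2 backbone named above; the two-row corpus and training schedule are toy stand-ins for the full pretraining run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy corpus of serialized rows (see serialize_row above); the real
# pretraining run spans all cleaned GitTables/Kaggle tables.
texts = [
    "Age is 34, Sex is F, ChronicDisease is yes",
    "Age is 61, Sex is M, ChronicDisease is no",
]

tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for step in range(2):  # toy schedule; real pretraining runs far longer
    batch = tok(texts, return_tensors="pt", padding=True)
    # Next-token objective: labels equal the inputs (shifted internally);
    # padded positions are masked out of the loss with -100.
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```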

4. Evaluation Protocols and Benchmarking Metrics

TabularFM establishes quantitative, dataset-scale leaderboards evaluating both synthetic generative capacity and statistical transferability:

  • Column Shapes: Fidelity of univariate feature distributions (Kolmogorov-Smirnov for numeric, Total Variation Distance for categorical).
  • Column Trends: Pairwise correlation preservation (Pearson's ρ for numeric-numeric pairs, Total Variation Distance on contingency tables for categorical-categorical pairs, and binning of the numeric feature for mixed-type pairs).
  • Overall Score: The per-table mean (S_shape + S_trend)/2, reported under both random and domain-based test splits (a scoring sketch follows this list).
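
These metrics are straightforward to compute with scipy. The sketch below scores single column pairs on numpy-array inputs, assuming the common convention (as in SDMetrics-style scores) that fidelity is reported as one minus the distance so that higher is better; TabularFM's exact normalization may differ.

```python
import numpy as np
from scipy.stats import ks_2samp, pearsonr

def column_shape_score(real, synth, categorical: bool) -> float:
    """Univariate fidelity: 1 - TVD for categorical columns,
    1 - KS statistic for numeric columns (higher is better)."""
    if categorical:
        cats = np.union1d(real, synth)
        p = np.array([(real == c).mean() for c in cats])
        q = np.array([(synth == c).mean() for c in cats])
        return 1.0 - 0.5 * np.abs(p - q).sum()    # 1 - Total Variation Distance
    return 1.0 - ks_2samp(real, synth).statistic  # 1 - KS statistic

def column_trend_score(real_x, real_y, synth_x, synth_y) -> float:
    """Numeric-numeric trend: compare Pearson correlations,
    normalized to [0, 1] (assumed normalization, for illustration)."""
    r_real, _ = pearsonr(real_x, real_y)
    r_synth, _ = pearsonr(synth_x, synth_y)
    return 1.0 - abs(r_real - r_synth) / 2

def overall_score(s_shape: float, s_trend: float) -> float:
    """Overall Score for one table: the mean of the two aggregates."""
    return (s_shape + s_trend) / 2
```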

Benchmarks, public leaderboards, and code are hosted at https://tabularfm.github.io. Leaderboards cover both synthetic data generation and "realistic data transfer" scenarios, comparing fine-tuning of pretrained FMs against training from scratch (Tran et al., 14 Jun 2024).

5. Key Empirical Results and Insights

Pretraining consistently improves performance for all model types. For example, CTGAN and TVAE-family models pretrained on GitTables or Kaggle show an ∼0.10 absolute gain in Overall Score over models trained from scratch (p < 1e-4). Metadata (column-signature) augmentation fails to yield consistent additional benefit. Cross-domain splits slightly lower absolute scores, but the pretraining gains persist. GReaT initialized from a language-model (GPT-2) baseline outperforms versions further pretrained on tabular data, suggesting that generalized world knowledge may transfer to tabular statistics more strongly than tabular-specialized pretraining at limited scale. Qualitatively, broad-concept columns (e.g., "Age," "ChronicDisease") and general correlations transfer better than application-specific columns (Tran et al., 14 Jun 2024).

6. Model Transferability, Robustness, and Extended Applications

Recent work explores the robustness and transfer of TabularFM models under adversarial input perturbations and in novel domains. Transformer-based architectures such as TabPFNv2 ("TabPFN") and TabICL are notable for sample-efficient in-context learning: they predict for a test set given a context of example-label pairs, with no weight updates at inference (Djilani et al., 3 Jun 2025). Robustness studies reveal that both models are highly vulnerable to carefully tailored adversarial attacks, but "adversarial in-context training", which iteratively perturbs the support context adversarially (rather than the model weights), confers significant gains in adversarial accuracy across finance, cybersecurity, and healthcare benchmarks. Transfer attacks generated with these FMs are also effective against conventional models such as random forests and XGBoost (Djilani et al., 3 Jun 2025).
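
In-context prediction with TabPFN can be demonstrated in a few lines, assuming the open-source tabpfn package and its scikit-learn-style interface; the dataset and context size are illustrative, and the adversarial in-context training procedure itself is not shown.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # open-source TabPFN package (assumed installed)

X, y = load_breast_cancer(return_X_y=True)
X_ctx, X_test, y_ctx, y_test = train_test_split(X, y, train_size=256, random_state=0)

# "fit" merely stores the context; prediction is one forward pass
# conditioned on (X_ctx, y_ctx), with no gradient updates to the model.
clf = TabPFNClassifier()
clf.fit(X_ctx, y_ctx)
acc = (clf.predict(X_test) == y_test).mean()
print(f"in-context test accuracy: {acc:.3f}")
# Adversarial in-context training would iteratively perturb X_ctx
# (not the weights) before re-running this forward pass.
```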

TabularFMs have also been successfully adapted for node-level prediction on graphs by recasting node classification as a table-inpainting problem. Both “TabGFM” and “G2T-FM” frameworks convert graph nodes to tabular rows enriched with neighborhood statistics and structural features, then apply state-of-the-art tabular FMs (notably TabPFNv2) for zero-shot node classification, outperforming specialist graph neural networks (GNNs) and graph foundation models (GFMs) in both in-context and fine-tuned settings across standard graph benchmarks (Hayler et al., 8 Sep 2025, Eremeev et al., 28 Aug 2025).
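
The graph-to-table recasting can be sketched with plain numpy: each node becomes a row holding its own features plus simple neighborhood statistics, after which any tabular FM applies. The degree and neighbor-mean features below are an illustrative subset of the structural enrichment used by TabGFM and G2T-FM.

```python
import numpy as np

def nodes_to_table(adj: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Recast graph nodes as tabular rows: own features, degree, and
    mean of neighbor features (illustrative structural enrichment)."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = adj @ feats / np.maximum(deg, 1.0)  # safe for isolated nodes
    return np.hstack([feats, deg, neigh_mean])

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)          # toy undirected graph
feats = np.random.default_rng(0).normal(size=(3, 4))
rows = nodes_to_table(adj, feats)                 # shape (3, 9)
# rows can now be passed to a tabular FM (e.g., TabPFNv2) as a table
# whose label column is inpainted for unlabeled nodes.
```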

7. Limitations and Open Directions

TabularFM is currently constrained to fully structured (numeric + categorical) tables; extensions to real-world tabular scenarios, including mixed text, time-series, image, or spatial columns, remain unexplored. Only selected neural architectures (GAN, VAE, Transformer) are implemented; evaluating, for example, graph neural architectures directly on relational tables is an open direction. The framework's evaluation focuses primarily on generative modeling (synthetic data, transferability); comprehensive benchmarking on classical supervised tasks (classification, regression, imputation) is not yet finalized. Achieving robustness against adversarial and out-of-distribution shifts requires further research, particularly in regulatory or safety-critical domains (Tran et al., 14 Jun 2024, Djilani et al., 3 Jun 2025).

