
TabSTAR: Foundation Tabular Model

Updated 20 January 2026
  • TabSTAR is a foundation tabular model that integrates semantically target-aware representations by encoding both free-text and numeric features.
  • It employs a dual-stream architecture combining a pretrained text encoder with a numeric MLP, fused via a transformer encoder for joint modeling.
  • Leveraging LoRA for efficient fine-tuning, TabSTAR delivers state-of-the-art classification and competitive regression performance across diverse datasets.

TabSTAR is a Foundation Tabular Model that introduces Semantically Target-Aware Representations for the joint modeling of tabular features and textual semantics in supervised learning tasks. Designed to enable transfer learning across diverse tabular datasets, particularly those with free-text features, TabSTAR incorporates a pretrained text encoder with a unified, dataset-agnostic transformer backbone, achieving state-of-the-art performance for classification tasks and demonstrating favorable scaling laws in its pretraining regime (Arazi et al., 23 May 2025).

1. Model Architecture

TabSTAR processes tabular examples $x = (x_1, \dots, x_m)$ with an associated target variable $y$. Each example is transformed into $e = m + C$ “elements,” comprising the $m$ feature tokens and $C$ target tokens (for $C$-class classification). Each feature token is verbalized into a string $s_j$ (e.g., “Report: Mild chest discomfort.”) and a standardized numeric value $n_j$ (a z-score, clipped to $[-3, 3]$). Target tokens represent possible labels with strings such as “Target. Decision: ...”.
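A minimal sketch of this verbalization step (the exact string templates and the neutral numeric placeholder for purely textual cells are assumptions, not the paper's code):

```python
def verbalize_feature(name, value=None, mean=None, std=None, text=None):
    """Turn one tabular cell into an (s_j, n_j) element pair.

    Textual cells keep their text and get a neutral numeric placeholder
    (an assumption); numeric cells are z-scored against training-set
    statistics and clipped to [-3, 3], as described above.
    """
    if text is not None:
        return f"{name}: {text}", 0.0
    z = (value - mean) / std
    z = max(-3.0, min(3.0, z))  # clip the z-score to [-3, 3]
    return f"{name}: {value}", z
```

For example, a numeric value far outside the training distribution is clipped: `verbalize_feature("Age", 95, mean=40, std=10)` yields `("Age: 95", 3.0)`.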

TabSTAR employs two parallel embedding streams for every element: a pretrained encoder-only transformer (e.g., e5-small-v2) for text strings, yielding $\mathbf{E}_{\text{text}}(s_j) \in \mathbb{R}^d$, and a two-layer MLP for scalar numeric values, yielding $\mathbf{E}_{\text{num}}(n_j) \in \mathbb{R}^d$. These embeddings are concatenated, passed through a single order-invariant transformer encoder layer (2 heads, hidden size $4d$), and averaged to produce fused $d$-dimensional vectors $\mathbf{h}_j^{\mathrm{fuse}}$.

All fused vectors are stacked into $\mathbf{H}^{(0)} \in \mathbb{R}^{e \times d}$ and sequentially processed by $L = 6$ standard transformer encoder layers (pre-norm, no positional embeddings), yielding contextualized embeddings for the target and feature tokens.

The shared prediction head applies:

  • Classification: $f_{\mathrm{cls}}: \mathbb{R}^d \to \mathbb{R}$ is applied to each target token, giving logits $\ell_i = f_{\mathrm{cls}}(\mathbf{h}_i)$ and class probabilities $p_i = \exp(\ell_i) / \sum_{k=1}^{C} \exp(\ell_k)$.
  • Regression: $f_{\mathrm{reg}}: \mathbb{R}^d \to \mathbb{R}$ is applied to the first token.
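The classification head thus reduces each target token to a scalar logit, and class probabilities follow by softmax. A dependency-free sketch:

```python
import math

def class_probs(logits):
    """Softmax over per-class logits l_i = f_cls(h_i), one scalar per target token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With equal logits the head is indifferent: `class_probs([0.0, 0.0])` returns `[0.5, 0.5]`.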

For $d = 384$, the parameter breakdown is as follows:

| Component | Parameters (M) |
| --- | --- |
| Text encoder | 33.36 |
| Numeric MLP | 0.30 |
| Fusion layer | 1.77 |
| Interaction (6 layers) | 10.65 |
| Prediction heads | 1.19 |
| Total | 47.26 |
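The fusion and interaction entries can be sanity-checked from the stated dimensions. The sketch below assumes a standard encoder layer with separate biased Q/K/V/output projections, a $4d$ feed-forward block, and two LayerNorms (a common parameterization, not confirmed by the paper):

```python
def encoder_layer_params(d, ffn_mult=4):
    """Parameter count of one standard transformer encoder layer with biases."""
    attn = 4 * (d * d + d)  # Q, K, V, and output projections
    ffn = (d * ffn_mult * d + ffn_mult * d) + (ffn_mult * d * d + d)  # two FFN layers
    norms = 2 * 2 * d  # two LayerNorms (gain + bias each)
    return attn + ffn + norms

d = 384
fusion = encoder_layer_params(d)           # single fusion layer
interaction = 6 * encoder_layer_params(d)  # L = 6 interaction layers
print(f"fusion: {fusion / 1e6:.2f}M, interaction: {interaction / 1e6:.2f}M")
```

Under these assumptions the counts come out to 1.77M and 10.65M, matching the table.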

2. Pretraining and Fine-Tuning Procedures

TabSTAR is pretrained via supervised multitask objectives on a corpus of 350 tabular datasets (253 classification, 97 regression). The joint loss function is:

$$\mathcal{L} = \sum_{i:\,\mathrm{classification}} \mathrm{CE}(y_i, p_i) + \sum_{j:\,\mathrm{regression}} (y_j - \hat{y}_j)^2$$

where $\mathrm{CE}(y, p) = -\sum_c y_c \log p_c$. Training employs AdamW with a OneCycleLR schedule (peak LR $5 \times 10^{-5}$), weight decay 0.001, mixed precision, batch size 32 per dataset (global batch size 128 via gradient accumulation), and early stopping on a held-out 5% per-dataset validation set.
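The joint objective can be written down directly (a dependency-free sketch; batching and any loss weighting across tasks are simplifications):

```python
import math

def multitask_loss(cls_batch, reg_batch):
    """Joint pretraining loss: cross-entropy over classification examples
    plus squared error over regression examples.

    cls_batch: list of (one_hot_target, predicted_probs) pairs
    reg_batch: list of (y_true, y_pred) pairs
    """
    ce = sum(-math.log(ps[ys.index(1)]) for ys, ps in cls_batch)
    mse = sum((y - y_hat) ** 2 for y, y_hat in reg_batch)
    return ce + mse
```

A perfectly confident correct prediction and an exact regression output contribute zero loss; a 50/50 classification guess contributes $\log 2$.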

Fine-tuning is performed via Low-Rank Adaptation (LoRA), inserting rank-$r = 32$ decompositions into all transformer layers (LoRA dropout 0.1, $\alpha = 64$). Only the LoRA weights and the final heads are updated; all original parameters remain frozen except, in the best-performing configuration, the top 6 layers of the text encoder. Fine-tuning uses AdamW with a peak LR of 0.001 and early stopping after 5 epochs without improvement on a 10% validation split.
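LoRA replaces each frozen weight $W$ with $W + (\alpha/r)\,BA$, where only the low-rank factors $A$ and $B$ are trained. A toy, dependency-free illustration (the shapes are illustrative, not the model's actual dimensions):

```python
def matmul(X, Y):
    """Plain-list matrix product, sufficient for a toy-sized example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_apply(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * B @ A.

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    receive gradients during fine-tuning. B is initialized to zero, so
    fine-tuning starts exactly from the pretrained weights.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

With `B` all zeros (the standard initialization), `lora_apply` returns `W` unchanged; with the paper's $r = 32$ and $\alpha = 64$, each low-rank update is scaled by 2.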

3. Transfer Learning and Cross-Dataset Generalization

TabSTAR’s backbone is invariant to dataset specifics: all model weights are shared across datasets, features, and task structures (classification or regression). This enables a model pretrained on hundreds of tabular datasets to be adapted efficiently to any new tabular dataset (regardless of the number or semantics of input features and classes) using LoRA, which updates only $\approx 3.4\%$ of parameters. The approach leverages both tabular reasoning and real-world textual knowledge acquired during pretraining, aligning with the intent of foundation models.

4. Empirical Performance and Scaling Laws

Evaluation spans 50 real-world datasets (14 classification, 36 regression), each with 20 random splits (90% train, 10% test). Metrics include AUROC for classification and $R^2$ for regression, with scores normalized to $[0, 1]$ per dataset and then averaged.
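One way to realize this per-dataset normalization (a sketch; the exact normalization bounds are not given here, so min-max scaling over the compared methods within each dataset is an assumption):

```python
from collections import defaultdict

def normalized_mean(per_dataset_scores):
    """Min-max normalize each dataset's scores to [0, 1] across methods,
    then average per method across datasets.

    per_dataset_scores: list of dicts {method_name: raw_metric}
    """
    totals = defaultdict(float)
    for scores in per_dataset_scores:
        lo, hi = min(scores.values()), max(scores.values())
        for method, s in scores.items():
            totals[method] += (s - lo) / (hi - lo) if hi > lo else 0.5
    n = len(per_dataset_scores)
    return {method: t / n for method, t in totals.items()}
```

Under this scheme a method that is best on every dataset scores 1.0 and one that is worst on every dataset scores 0.0.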

Normalized performance on datasets with up to 10K examples:

| Method | Classification | Regression |
| --- | --- | --- |
| TabSTAR | 0.809 ± 0.019 | 0.649 ± 0.039 |
| TabPFN-v2 | 0.783 ± 0.023 | n/a |
| CatBoost-Tuned | 0.756 ± 0.023 | 0.784 ± 0.029 |
| XGBoost-Tuned | 0.744 ± 0.022 | 0.772 ± 0.031 |

TabSTAR establishes the state of the art for classification on datasets with up to 10K examples, and narrows, though does not close, the regression gap to GBDT methods (CatBoost, XGBoost).

Pretraining Scaling Laws

Normalized scores for varying pretraining dataset counts:

| Pretraining Datasets | Classification | Regression |
| --- | --- | --- |
| 0 | 0.352 ± 0.086 | 0.338 ± 0.073 |
| 16 | 0.450 ± 0.084 | 0.395 ± 0.068 |
| 64 | 0.558 ± 0.086 | 0.642 ± 0.066 |
| 256 | 0.786 ± 0.076 | 0.811 ± 0.055 |

Empirically, performance improves roughly logarithmically with the number of pretraining datasets: $\mathrm{Perf}(D) \approx a \log(D) + b$. A plausible implication is that further scaling may yield continued gains on broader benchmarks.
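Fitting this form to the reported classification scores (excluding $D = 0$, where $\log D$ is undefined) gives a slope of roughly 0.12 per natural-log unit; a least-squares sketch:

```python
import math

# Reported normalized classification scores per pretraining-corpus size
points = [(16, 0.450), (64, 0.558), (256, 0.786)]

def fit_log(points):
    """Least-squares fit of perf ~ a * log(D) + b over (D, perf) pairs."""
    xs = [math.log(d) for d, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

a, b = fit_log(points)  # each e-fold increase in datasets adds roughly 0.12
```

This is only a three-point fit, so the slope should be read as indicative rather than precise.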

5. Comparative Analysis with Contemporary Methods

TabSTAR is compared to existing tabular modeling techniques:

  • TabPFN-v2 (ICL-based): limited in scalability to ≤10K examples; TabSTAR consistently outperforms it by ~3 points of normalized AUROC.
  • CARTE (Graph plus textual encoding): TabSTAR exceeds CARTE on 93% of classification splits and 36% of regression splits (up to 10K examples).
  • GBDT Methods (CatBoost, XGBoost): GBDTs retain advantages in regression benchmarks, but TabSTAR achieves competitive classification performance and demonstrates scalability beyond 10K examples with its unlimited variant.

These comparisons demonstrate TabSTAR’s effectiveness as a foundation tabular model leveraging textual and tabular signals, extending the applicability of transformer-based techniques for tabular data analysis.

6. Architectural Innovations and Significance

TabSTAR introduces several architectural advances:

  • Semantic Target-Aware Verbalization: Explicitly encodes both feature and target information as textual and numeric tokens.
  • Unified Dataset-Agnostic Parameterization: All model parameters are shared, negating dataset-specific adaptation requirements.
  • Fusion of Text and Numeric Embeddings: Fused via a transformer encoder layer to capture interactions between modalities.
  • Efficient Transfer via LoRA: LoRA enables rapid, resource-efficient adaptation to new tasks, with only a minor fraction of weights updated.

This design supports robust transfer learning and positions TabSTAR as a foundation model for tabular tasks, suggesting potential for deployment in large-scale settings and as a basis for further research into tabular foundation models.

7. Limitations and Prospects for Future Work

While TabSTAR achieves strong classification performance and favorable scaling laws, regression results indicate that tree-based methods (CatBoost, XGBoost) remain advantageous in certain cases. The logarithmic scaling observed during pretraining implies that further increases in dataset diversity and corpus size may drive additional improvements. A plausible implication is that scaling both the training data and the backbone's capacity could extend TabSTAR's state-of-the-art performance to broader domains and larger-scale benchmarks.

The integration of Semantically Target-Aware Representations with a unified transformer backbone and transfer-efficient fine-tuning establishes TabSTAR as a leading paradigm for tabular machine learning, especially where free-text features and cross-dataset generalization are prominent.
