TabDPT: Discriminative Pre-trained Transformer

Updated 6 February 2026
  • The paper presents a novel TabDPT architecture that leverages self-supervised masked column modeling and in-context retrieval for enhanced tabular data performance.
  • It demonstrates efficient generalization and predictable scaling laws by pre-training on large-scale, real-world datasets with structural invariances.
  • Empirical benchmarks show TabDPT achieves state-of-the-art accuracy, AUC, and regression metrics compared to existing tabular foundation models.

Tabular Discriminative Pre-trained Transformer (TabDPT) is a specialized foundation model architecture and training paradigm for tabular data, integrating in-context retrieval, self-supervised learning via masked column modeling, and architectural modifications for tabular invariances. Unlike LLMs adapted to text-formatted tables, TabDPT leverages the structural properties of tabular data and directly pre-trains on large-scale real-world datasets, enabling efficient generalization to unseen data and tasks without task-specific tuning (Ma et al., 2024). TabDPT establishes new state-of-the-art results among open-source tabular foundation models (TFMs) in both classification and regression domains, while exhibiting predictable scaling laws reminiscent of LLMs.

1. Model Architecture and Tokenization

TabDPT utilizes a row-based transformer encoder, structurally similar to TabPFN v1 but fundamentally distinct in tokenization and context composition. Each table row, with up to $F_{\max}=100$ features, is embedded into a $d$-dimensional vector using a single linear layer followed by layer normalization:

  • Feature Embedding: $\phi_x: \mathbb{R}^{F_{\max}} \rightarrow \mathbb{R}^d$
  • Label/Target Embedding: $\phi_y: \{1, \dots, C_{\max}\} \cup \mathbb{R} \rightarrow \mathbb{R}^d$; for regression, targets are normalized and projected via a linear map; for classification, a learnable $C_{\max} \times d$ embedding table is used.

Context rows are represented by elementwise summation of feature and label embeddings, while query rows use feature embeddings alone.
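As an illustration, this embedding scheme can be sketched in NumPy. All weights below are random placeholders for parameters a trained model would learn, and `C_MAX = 10` is chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
F_MAX, D, C_MAX = 100, 768, 10  # F_max and d from the paper; C_MAX is illustrative

# Placeholder parameters (a real model learns these).
W_x = rng.normal(0, 0.02, size=(F_MAX, D))    # linear layer of phi_x
E_y = rng.normal(0, 0.02, size=(C_MAX, D))    # class embedding table of phi_y

def layer_norm(z, eps=1e-5):
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + eps)

def embed_rows(X, y=None):
    """Context rows (y given) sum feature and label embeddings; query rows use features only."""
    X_pad = np.zeros((X.shape[0], F_MAX))
    X_pad[:, :X.shape[1]] = X                  # zero-pad to F_max columns
    h = layer_norm(X_pad @ W_x)                # phi_x: single linear layer + layer norm
    if y is not None:
        h = h + E_y[y]                         # elementwise sum with label embedding
    return h

ctx = embed_rows(rng.normal(size=(5, 7)), y=np.array([0, 1, 0, 2, 1]))
qry = embed_rows(rng.normal(size=(3, 7)))
print(ctx.shape, qry.shape)  # (5, 768) (3, 768)
```

Note that one token per row (rather than one per cell) is what keeps the transformer context length at $O(N)$.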

Key architectural features include:

  • Row-based tokens restore permutation invariance and reduce transformer context length from $O(NF)$ to $O(N)$.
  • Absence of positional embeddings further enforces row-order invariance.
  • Regular random column sub-sampling and shuffling, which serve as data augmentation and prevent overfitting to column order.
  • Heads for classification and regression share the transformer backbone and use separate MLPs for output.

A typical large-scale TabDPT backbone includes $L=16$ layers, hidden size $d=768$, $h=4$ attention heads, an FFN inner dimension up to $4d$, and no dropout (normalization-first transformer blocks) (Ma et al., 2024).

2. Self-supervised Pre-training and In-context Retrieval

The core of TabDPT pre-training is a combined paradigm of self-supervised masked column modeling and in-context learning (ICL)-aligned retrieval.

Masked Column Modeling

For each dataset $\mathcal{D}=\{X \in \mathbb{R}^{N \times F},\, y \in \mathbb{R}^N\}$, a random column $c$ is selected as a pseudo-target. The model receives all rows with column $c$ removed ($X \setminus c$) and predicts its values (either as real-valued regression or as multiclass classification over up to $C_{\max}$ classes):

  • Loss functions:
    • Classification: $\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \sum_{k=1}^{C} \mathbf{1}[y_i = k] \log p(\hat y_i = k)$
    • Regression: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} (\hat y_i - y_i)^2$
    • Overall SSL objective: $\mathcal{L}_{\mathrm{mask}} = \mathbb{E}_{\mathrm{tasks}}\left[\mathbf{1}_{\mathrm{cls}}\, \mathcal{L}_{\mathrm{CE}} + \mathbf{1}_{\mathrm{reg}}\, \mathcal{L}_{\mathrm{MSE}}\right]$
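A minimal sketch of the masked-column pretext task on a toy table; a trivial mean-of-row predictor stands in for the transformer, purely to show how the pseudo-target and the regression loss are constructed:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))      # toy table: N=8 rows, F=5 columns

c = rng.integers(X.shape[1])     # pick a random column as the pseudo-target
X_in = np.delete(X, c, axis=1)   # model input: all rows with column c removed
y = X[:, c]                      # values the model must predict

# Stand-in predictor (the real model is the TabDPT transformer): row mean.
y_hat = X_in.mean(axis=1)

# Regression branch of the masked-column objective (MSE over the index set I).
mse = np.mean((y_hat - y) ** 2)
print(round(float(mse), 4))
```

The classification branch is analogous, with the masked column's values discretized into classes and scored by cross-entropy.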

In-context Retrieval

For each row, the $K$ nearest neighbors are selected via a FAISS index built on the pre-normalized feature space (excluding the masked column). These $K$ neighbors are randomly partitioned into context and query rows for training, constructing the input as

$$\hat y_{\mathrm{qy}} = \mathrm{Transformer}\Big[ \underbrace{\phi_x(X_{\mathrm{ctx}}) \oplus \phi_y(y_{\mathrm{ctx}})}_{\text{context tokens}},\ \phi_x(X_{\mathrm{qy}}) \Big]$$

where context tokens attend to one another, while query tokens attend to all context tokens.
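The retrieval-and-partition step can be sketched as follows. A brute-force L2 search stands in for the FAISS index here (FAISS's `IndexFlatL2` returns the same exact neighbors, just faster at scale), and the 50/50 context/query split is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))   # standardized features (masked column excluded)
K = 16

def knn(query, X, k):
    """Brute-force L2 nearest neighbors; equivalent to a FAISS IndexFlatL2 search."""
    d2 = ((X - query) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]

row = X[0]
nbrs = knn(row, X, K)             # K nearest rows (includes the row itself, distance 0)

# Randomly partition the retrieved neighborhood into context and query rows.
perm = rng.permutation(nbrs)
ctx_idx, qry_idx = perm[:K // 2], perm[K // 2:]
print(len(ctx_idx), len(qry_idx))  # 8 8
```

Restricting the context to a retrieved local neighborhood, rather than random rows, is what aligns pre-training with in-context inference on new tables.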

The pre-training objective is solely the combined mask loss: $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{mask}} = \lambda_{\mathrm{cls}}\mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{reg}}\mathcal{L}_{\mathrm{MSE}}$, with $\lambda_{\mathrm{cls}} = \lambda_{\mathrm{reg}} = 1$ (Ma et al., 2024).

3. Data Sources and Preprocessing

TabDPT’s pre-training corpus consists of 123 diverse, CC-BY-licensed OpenML datasets, spanning domains such as biology, finance, healthcare, industrial logs, and text-derived features. The corpus comprises 32 million rows and 2 billion cells.

Pre-processing steps include:

  • Label-encoding categorical features to integer indices.
  • Standardizing numerical columns to mean 0 and variance 1, with values clipped to $\pm 10$.
  • Filling missing values with zero post-standardization.
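The steps above can be sketched in NumPy (the toy data, missingness rate, and column names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Label-encode a categorical column to integer indices (sorted unique values).
cats = np.array(["red", "blue", "red", "green"])
codes = np.unique(cats, return_inverse=True)[1]

# Toy numeric table with ~10% missing entries.
X = rng.normal(3.0, 2.0, size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan

# Standardize each column using statistics over observed entries, clip to +/-10.
mu = np.nanmean(X, axis=0)
sd = np.nanstd(X, axis=0)
Xs = np.clip((X - mu) / sd, -10, 10)

# Missing values -> 0 after standardization (i.e., imputed at the column mean).
Xs = np.nan_to_num(Xs, nan=0.0)

print(np.abs(Xs).max() <= 10, np.isnan(Xs).any())  # True False
```

Zero-filling after standardization amounts to imputing each missing cell at its column mean, which keeps the imputed values neutral under the linear feature embedding.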

A central empirical finding is that real tabular data provides substantial pre-training benefit over synthetic data. Specifically, models pre-trained with real data converge faster, achieve lower validation loss, and yield superior downstream performance compared to those trained exclusively on synthetic generators (such as TabPFN’s mixture-based data) (Ma et al., 2024). This suggests intrinsic, transferable signal present in heterogeneous real tables beyond what handcrafted synthetic priors deliver.

4. Benchmark Evaluation and Performance

TabDPT’s performance was benchmarked on two large-scale suites:

  • CC18: 72 unseen OpenML classification datasets (500–100,000 rows, up to 5,000 features), with metrics of AUC and accuracy.
  • CTR23: 35 unseen OpenML regression datasets of analogous scale, evaluated by Pearson correlation $\rho$ and $R^2$.

| Metric | TabDPT | TabPFN v2 | XGBoost | LightGBM | CatBoost |
| --- | --- | --- | --- | --- | --- |
| CC18 AUC | 0.933 | 0.932 | 0.926 | 0.924 | 0.926 |
| CC18 Accuracy | 0.884 | 0.872 | 0.869 | 0.862 | 0.864 |
| CTR23 Corr. ($\rho$) | 0.837 | 0.835 | 0.827 | 0.825 | 0.822 |
| CTR23 $R^2$ | 0.742 | 0.740 | 0.711 | 0.713 | 0.703 |

TabDPT achieves top results on all key metrics as of publication. Pairwise dataset win-rates against these baselines exceed 60%. In Elo and Glicko2 rating tournaments, TabDPT is rated highest among open-source TFMs. For a context size of 2048, TabDPT processes 1,000 rows in approximately 0.1 seconds on an A100, which is over 10$\times$ faster than per-dataset hyperparameter search plus inference for tree-based or deep-learning alternatives (Ma et al., 2024).

5. Scaling Laws, Ablations, and Comparative Insights

Scaling experiments demonstrate that both model size (PP) and pre-training data volume (DD) drive predictable loss improvements following a joint power-law:

$$\hat{\ell}(P, D) = A P^{-\alpha} + B D^{-\beta} + E$$

with fitted exponents $\alpha = 0.42$, $\beta = 0.39$, indicating that doubling either model size or data yields a measurable reduction in excess loss. These scaling curves hold consistently from 33k to 78M parameters and from 52M to 2B data cells. This behavior closely parallels the scaling laws established for LLMs and vision foundation models, suggesting that sufficiently large, data-rich TFMs are achievable for tabular domains (Ma et al., 2024).
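To make the power law concrete, the fitted form can be evaluated directly. The exponents $\alpha$ and $\beta$ are the reported values; $A$, $B$, $E$, and the operating point $(P, D)$ are placeholder values for illustration only:

```python
# Joint power law: excess loss decays in both parameters P and data cells D.
alpha, beta = 0.42, 0.39   # fitted exponents from the paper
A, B, E = 1.0, 1.0, 0.5    # placeholder coefficients (not reported values)

def loss(P, D):
    return A * P**-alpha + B * D**-beta + E

# Doubling model size shrinks the P-term by a factor of 2**-alpha (~0.75),
# so the predicted loss strictly decreases.
P, D = 1e6, 1e8
print(loss(P, D) > loss(2 * P, D))  # True
```

The irreducible term $E$ means both axes eventually saturate; the relative sizes of the $P$- and $D$-terms indicate which axis to scale next.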

Ablation analyses underscore the importance of row-based tokenization, built-in invariances, and real-data pre-training. Both the absence of positional encoding and persistent random column shuffling/sub-sampling were necessary for generalization. A plausible implication is that permutation-invariance constraints are critical to transferable tabular representations.

6. Limitations and Future Research Directions

TabDPT assumes strictly rectangular, i.i.d. tables and does not explicitly accommodate temporal sequences, hierarchical dependencies, or multimodal (text or image) columns. Feature dimensionality ($F_{\max}$) and class count ($C_{\max}$) are fixed; wider or higher-cardinality tables are handled via principal component analysis for features and base-$C_{\max}$ encoding for classes. Free-form feature names and textual information are not utilized, as empirical studies found that feature-name embeddings bias toward overfitting under current dataset diversity (Ma et al., 2024).
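For instance, the PCA-based handling of tables wider than $F_{\max}$ can be sketched with an SVD projection (the table dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
F_MAX = 100
X = rng.normal(size=(500, 300))    # a table with more than F_max features

# Project rows onto the top F_max principal components.
Xc = X - X.mean(axis=0)            # center columns
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_red = Xc @ Vt[:F_MAX].T          # (500, 100): fits the fixed-width embedding

print(X_red.shape)  # (500, 100)
```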

Future directions identified include:

  • Incorporation of richer textual/contextual annotations (e.g., column names, physical units) through joint text+tabular embeddings.
  • Direct extension to structured (temporal or graph) tabular data, enabling modeling of event logs or hierarchical records.
  • Integration with generative models (e.g., TabPFGen, TabLatent) to serve tasks such as anomaly detection or missing value imputation.
  • Investigation of self-supervised objectives beyond column masking, including contrastive row-pair pretext tasks.

TabDPT is part of a broader trend toward tabular-specific foundation models and discriminative pre-trained transformers. Prior approaches including TP-BERTa (Yan et al., 2024) leverage LLM backbones augmented with techniques such as relative magnitude tokenization and intra-feature attention. However, LLM-based models for tabular prediction have shown limited success in ICL or cross-table transfer compared to models like TabDPT that are pre-trained on large real tabular corpora with tabular-structured inductive biases. Additionally, advances such as TabToken’s supervised contrastive token regularization further inform the design of transferable tabular transformers, emphasizing joint order-invariant tokenization and simultaneous embedding–transformer optimization (Zhou et al., 2023).

TabDPT distinguishes itself by unifying retrieval-augmented in-context learning and masked column modeling atop real heterogeneous data, confirming that real tabular datasets enable faster convergence and superior transfer while obeying power-law scaling. Its open-source pipeline and empirical record position it as a central architecture in the modern TFM landscape (Ma et al., 2024).
