TabICLv2: Open-Source Tabular Foundation Model

Updated 13 February 2026
  • TabICLv2 is an open-source tabular foundation model that leverages fully synthetic pretraining to establish new SOTA accuracy for regression and classification tasks.
  • It introduces novel architectural innovations such as repeated feature grouping, target-aware embeddings, and Query-Aware Scalable Softmax to enhance efficiency and scalability.
  • The model employs the Muon optimizer with a staged pretraining strategy, outperforming predecessors like RealTabPFN-2.5 on various benchmark datasets.

TabICLv2 is an open-source tabular foundation model designed for regression and classification, establishing new state-of-the-art accuracy across standard benchmarks without direct exposure to real-world tabular data during pretraining. Building on recent advances in tabular in-context learning (ICL), TabICLv2 introduces a synthetic prior for pretraining, novel architectural modifications—including a scalable attention mechanism—and optimized pretraining protocols. It achieves superior performance relative to prior state-of-the-art models, such as RealTabPFN-2.5 and TabPFNv2, on both middle- and large-scale tabular data benchmarks, while exhibiting marked gains in efficiency and scalability (Qu et al., 11 Feb 2026).

1. Synthetic Pretraining Data Engine

TabICLv2’s core innovation lies in its exclusive use of a fully synthetic, stochastic data-generation engine during pretraining, avoiding exposure to real tabular datasets and thus imparting diverse inductive biases.

Key components:

  • Correlated hyperparameter sampling: For all repeated scalar hyperparameters (e.g., number of categories, tree depths), the generator samples $t \sim \mathrm{Uniform}(0,1)$ and $s \sim \mathrm{LogUniform}(0.1, 10\,000)$, then computes $\alpha \leftarrow s t$, $\beta \leftarrow s(1-t)$ and draws $u \sim \mathrm{Beta}(\alpha, \beta)$, followed by deterministic normalization to the required range. This ensures correlations across all occurrences of a given hyperparameter name.
  • Random Cauchy DAGs: A random directed acyclic graph (DAG) with $K$ nodes is formed, where an edge $i \to j$ is included with probability $p_{ij} = \mathrm{sigmoid}(A + B_i + C_j)$, with $A, B_i, C_j$ drawn i.i.d. from the standard Cauchy distribution. $A$ modulates density globally; $B_i$ and $C_j$ define out- and in-degree biases, respectively.
  • Random node functions: Each non-root node computes its values by applying one of eight randomly chosen function classes:

    1. RandomNN: Small MLPs (random layers, widths, activations).
    2. RandomTree: Ensembles of oblivious decision trees (random number of trees $T$, random depths).
    3. RandomDiscretization: Nearest-center assignment in $\mathbb{R}^d$ under a random $L^p$ distance, followed by a small linear map.
    4. RandomGP: Approximate Gaussian processes via random Fourier features, with exponentially random spectral decay, and random orientation.
    5. RandomLinear: Dense multiplication by random matrices from various structured distributions.
    6. RandomQuadratic: Quadratic forms on random dimension subsamples.
    7. RandomEMAssignment: Expectation–Maximization–style soft cluster assignment with subsequent linear mapping.
    8. RandomProduct: Products of two random functions.
  • Feature extraction (converters): Output columns are produced by coordinate extraction from vectors with optional Kumaraswamy warping (numerical), or by discretizing a subvector into $c$ categories via softmax/neighborhood assignment:

    $v = \arg\max_{j \in \{1..c\}} \mathrm{softmax}(a \tilde{x} + b)_j$

  • Filtering: Any generated dataset for which a shallow ExtraTreesRegressor (25 trees, max depth 6, out-of-bag) cannot outperform a trivial baseline under a 95% bootstrap test, or for which the targets are independent of $x$, is rejected (25–35% rejection rate).
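As a concrete illustration, the correlated hyperparameter sampling described above can be sketched in NumPy; function and variable names here are illustrative, not from the released code:

```python
import numpy as np

def sample_correlated(rng, n_draws, lo=0.0, hi=1.0):
    """Draw n_draws occurrences of one scalar hyperparameter that are
    correlated across occurrences: a shared (t, s) pair parameterizes
    a Beta distribution, and each occurrence is a draw from it."""
    t = rng.uniform(0.0, 1.0)                          # shared location
    s = np.exp(rng.uniform(np.log(0.1), np.log(1e4)))  # shared concentration
    alpha, beta = s * t, s * (1.0 - t)
    u = rng.beta(alpha, beta, size=n_draws)            # correlated via (t, s)
    return lo + (hi - lo) * u                          # normalize to range

rng = np.random.default_rng(0)
depths = sample_correlated(rng, n_draws=5, lo=1, hi=12)  # e.g. tree depths
```

A large $s$ concentrates all draws near $t$, while a small $s$ spreads them out, so the same hyperparameter name takes correlated values across its occurrences in one dataset.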

This approach yields datasets encompassing multivariate Gaussian processes, tree ensembles, MLPs, quadratics, and clustering dynamics, boosting pretraining diversity and inductive generalization (Qu et al., 11 Feb 2026).
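The Cauchy-parameterized edge probabilities can likewise be sketched; restricting edges to a fixed topological order ($i < j$) is one simple way to guarantee acyclicity, assumed here for illustration:

```python
import numpy as np

def sample_cauchy_dag(rng, K):
    """Adjacency matrix of a random DAG on K nodes: a candidate edge
    i -> j is kept with probability sigmoid(A + B_i + C_j), with A,
    B_i, C_j i.i.d. standard Cauchy."""
    A = rng.standard_cauchy()                 # global density offset
    B = rng.standard_cauchy(K)                # per-node out-degree bias
    C = rng.standard_cauchy(K)                # per-node in-degree bias
    logits = np.clip(A + B[:, None] + C[None, :], -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-logits))         # sigmoid
    adj = rng.random((K, K)) < p
    return np.triu(adj, k=1)                  # upper triangle => acyclic

adj = sample_cauchy_dag(np.random.default_rng(1), K=6)
```

Because the Cauchy distribution is heavy-tailed, occasional extreme $A$, $B_i$, or $C_j$ values produce very dense or very sparse graphs and hub-like nodes, adding structural diversity to the prior.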

2. Model Architecture and Attention Scaling Innovations

TabICLv2 advances the "compress-then-ICL" architecture underlying TabICL, incorporating several substantial extensions:

  • Repeated feature grouping: Features are embedded in overlapping groups of size three, breaking symmetry and enabling richer interactions. For a table $X \in \mathbb{R}^{n \times m}$:

    $E_1[i,j] = \mathrm{Lin}(x_{i,j},\, x_{i,(j+1) \bmod m},\, x_{i,(j+3) \bmod m}) \in \mathbb{R}^d$

  • Target-aware embedding: The embedding of the training target $y_i$ is added to every feature token, not merely appended as a separate input:

    $E_2[i,j] = E_1[i,j] + \mathrm{Embed}_{\mathrm{TAE}}(y_i)$

  • Column-wise compression: Each column’s tokens are processed by Set-Transformer blocks with inducing points; the first attention layer replaces the standard softmax with Query-Aware Scalable Softmax (QASSMax):

    $\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \quad\rightarrow\quad \mathrm{softmax}\!\left(\frac{\tilde{Q}K^\top}{\sqrt{d}}\right)V$

    with

    $\tilde{q}_{h,i} = q_{h,i}\,\mathrm{MLP}_{\mathrm{base}}(\log n)_h \times \left(1 + \tanh\!\left(\mathrm{MLP}_{\mathrm{gate}}(q_h)_i\right)\right)$

  • Row-wise interaction: Four learnable [CLS] tokens are prepended to each row; outputs are concatenated into a $4d$ vector, processed via a small Transformer, again with QASSMax.
  • Dataset-wide in-context learning (ICL): Embeddings of training targets are added to each row embedding, followed by a deep Transformer (12 layers) where test rows attend to train rows. Final prediction is via a two-layer MLP yielding a 10-way softmax (classification) or 999 quantiles (regression, pinball loss).
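A minimal NumPy sketch of the repeated feature grouping step, with the shared linear map reduced to a plain weight matrix (all names here are assumptions for illustration):

```python
import numpy as np

def repeated_feature_grouping(X, W, b):
    """E1[i, j] = Lin(x[i,j], x[i,(j+1) mod m], x[i,(j+3) mod m]),
    where Lin is a shared linear map with weight W (3, d), bias b (d,)."""
    n, m = X.shape
    j = np.arange(m)
    groups = np.stack(
        [X[:, j], X[:, (j + 1) % m], X[:, (j + 3) % m]], axis=-1
    )                                # (n, m, 3): overlapping triples
    return groups @ W + b            # (n, m, d) feature-token embeddings

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))          # 8 rows, 5 features
E1 = repeated_feature_grouping(X, rng.normal(size=(3, 16)), np.zeros(16))
```

Each feature token thus sees two neighbors at fixed offsets, which breaks the permutation symmetry of a purely column-wise embedding.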

Query-Aware Scalable Softmax (QASSMax): Traditional attention suffers "attention fading" as sequence length $n$ grows, since the softmax denominator scales as $O(n)$. Scalable Softmax counteracts this with a multiplicative $\log n$ factor on the queries; QASSMax replaces the fixed per-head scale with a learned, length-dependent base factor plus query-specific gating:

$\tilde{q}_{h,i} = q_{h,i} \cdot s_h \log n \quad\rightarrow\quad q_{h,i}\,\mathrm{MLP}_{\mathrm{base}}(\log n)_h \times \left(1+\tanh\!\left(\mathrm{MLP}_{\mathrm{gate}}(q_h)_i\right)\right)$

This adjustment maintains sharply concentrated attention, especially at test-time scales exceeding those seen in training, improving "needle-in-haystack" retrieval (Qu et al., 11 Feb 2026).
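The QASSMax rescaling can be sketched as follows, with the two MLPs stubbed as simple callables (a minimal single-head NumPy sketch; all names are illustrative):

```python
import numpy as np

def qassmax_attention(Q, K, V, n_ctx, base_mlp, gate_mlp):
    """Rescale queries by a length-dependent base factor and a
    query-dependent gate before standard softmax attention."""
    d = Q.shape[-1]
    base = base_mlp(np.log(n_ctx))            # per-head scalar in the paper
    gate = 1.0 + np.tanh(gate_mlp(Q))         # query-aware, per coordinate
    Qt = Q * base * gate                      # rescaled queries
    logits = Qt @ K.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-stochastic attention
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))          # queries, keys, values
out = qassmax_attention(Q, K, V, n_ctx=4,
                        base_mlp=lambda x: 0.5 * x,   # stand-in MLPs
                        gate_mlp=lambda q: 0.1 * q)
```

Growing the query magnitude with $\log n$ sharpens the softmax just enough to offset the $O(n)$ denominator, keeping attention concentrated at long context lengths.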

3. Muon Optimizer and Pretraining Strategy

In contrast to common AdamW optimization, TabICLv2 adopts the Muon optimizer (Jordan et al. 2024; Schaipp 2025):

  • Parameter-wise learning rate scaling: For $W \in \mathbb{R}^{n \times m}$,

    $\eta_W = 0.2\,\eta\,\sqrt{\max(n, m)}$

  • Update rule:

    $w_{t+1} = w_t - \eta_W P_t g_t - \lambda\,\eta_W\,w_t\,\mathbf{1}[\mathrm{sign}(g_t) = \mathrm{sign}(w_t)]$

    where $P_t$ is an adaptive matrix preconditioner and the final term implements cautious weight decay, acting only when $g_t$ and $w_t$ share sign (Qu et al., 11 Feb 2026). This procedure encourages stability and proper scaling in large models.

  • Scheduling and batch size: Pretraining employs a batch size of 64 throughout. Training proceeds via:
    • Stage 1: 500K steps ($n = 1\,024$), LR up to $8\times10^{-4}$, cosine decay
    • Stage 2: 40K steps ($n \sim \mathrm{LogUniform}(400, 10\,240)$), LR $1\times10^{-4}$
    • Stage 3: 10K steps ($n \sim \mathrm{LogUniform}(400, 60\,000)$), LR $2\times10^{-5}$
    • Gradient clipping at 10 and cautious weight decay of 0.01 throughout.
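A toy sketch of one such update with the parameter-wise learning rate scaling and cautious weight decay described above (the preconditioner $P_t$ is stubbed as the identity here; in Muon proper it orthogonalizes the momentum via Newton–Schulz iterations):

```python
import numpy as np

def muon_style_step(W, G, eta, lam=0.01):
    """One update step: eta_W = 0.2 * eta * sqrt(max(n, m)), plus
    cautious weight decay applied only where gradient and weight
    share sign. P_t is stubbed as identity for illustration."""
    n, m = W.shape
    eta_W = 0.2 * eta * np.sqrt(max(n, m))   # parameter-wise LR scaling
    cautious = (np.sign(G) == np.sign(W))    # decay only where signs agree
    return W - eta_W * G - lam * eta_W * W * cautious

W_new = muon_style_step(np.ones((4, 3)), 0.1 * np.ones((4, 3)), eta=1e-3)
```

The cautious mask avoids shrinking a weight that the gradient is currently pushing larger in magnitude, which is the stability argument behind the decay term.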

Total pretraining cost is approximately 24.5 H100 GPU-days, less than half of TabICL’s 60 A100 days (Qu et al., 11 Feb 2026).

4. Empirical Performance and Scalability

TabICLv2 demonstrates robust generalization and speed advantages:

  • TabArena (51 datasets): Sits at the Pareto front of improvability versus compute, outperforming RealTabPFN-2.5 (tuned and ensembled) by 5–10% in relative error while being 5–10× faster for forward-pass ICL. It wins 59% of direct model comparisons.
  • TALENT (300 datasets): Matches or exceeds RealTabPFN-2.5 in both accuracy (classification) and RMSE (regression), and consistently improves over TabPFNv2, particularly for datasets with $n > 10\,000$.
  • Million-scale tables: Processes $1\,000\,000 \times 500$ tables in under 450 s on an H100 GPU using under $50\,\mathrm{GB}$ of GPU memory and under $24\,\mathrm{GB}$ of CPU RAM, via disk offloading together with QASSMax; no distillation or external context retrieval is needed.
  • Metrics: Relative to RealTabPFN-2.5:
    • Classification (AUC): 0.80 → 0.85; log-loss 0.25 → 0.18
    • Regression (RMSE): 1.02 → 0.95; improvements also seen in CRPS and distributional metrics
| Benchmark | TabICLv2 Outcome | Speedup vs. RealTabPFN-2.5 |
|---|---|---|
| TabArena (improvability) | ∼5–10% relative error improvement | 5–10× |
| TALENT (large datasets) | Matches/exceeds on accuracy and RMSE | — |
| Million-row tables | $1\,\mathrm{M} \times 500$ rows in < 450 s | — |

5. Component Ablation and Analysis

A comprehensive ablation over 60 validation datasets (each up to 2,048 samples) quantifies the impact of each design choice:

  • Synthetic prior: Moving from the legacy TabICL prior to the TabICLv2 prior yields >200 Elo gain; substituting the old prior "breaks" TabICLv2’s architecture.
  • QASSMax attention: Adds +100 Elo compared to standard softmax.
  • Muon optimizer: Switching from AdamW to Muon (with cautious decay, elevated LR) contributes ∼100 Elo.
  • Target-aware embedding: Adds +100 Elo.
  • Repeated feature grouping and prior filtering: Each adds 20–50 Elo; long-context pretraining stages are critical for handling large $n$.

Even a reduced TabICLv2 configuration (4 attention heads, 280K pretraining steps) matches RealTabPFN-2.5 in log-loss after about 200K pretraining steps (Qu et al., 11 Feb 2026).

6. Openness and Availability

TabICLv2’s inference code and model weights (for both classification and regression) are publicly available at https://github.com/soda-inria/tabicl. The authors have committed to releasing the complete synthetic prior engine, pretraining scripts, and model checkpoints in the same repository, in contrast to the proprietary status of competing models such as RealTabPFN. This extensive openness aims to democratize state-of-the-art tabular foundation modeling research (Qu et al., 11 Feb 2026).
