TabICLv2: Open-Source Tabular Foundation Model
- TabICLv2 is an open-source tabular foundation model that leverages fully synthetic pretraining to establish new SOTA accuracy for regression and classification tasks.
- It introduces novel architectural innovations such as repeated feature grouping, target-aware embeddings, and Query-Aware Scalable Softmax to enhance efficiency and scalability.
- The model employs the Muon optimizer with a staged pretraining strategy, outperforming predecessors like RealTabPFN-2.5 on various benchmark datasets.
TabICLv2 is an open-source tabular foundation model designed for regression and classification, establishing new state-of-the-art accuracy across standard benchmarks without direct exposure to real-world tabular data during pretraining. Building on recent advances in tabular in-context learning (ICL), TabICLv2 introduces a synthetic prior for pretraining, novel architectural modifications—including a scalable attention mechanism—and optimized pretraining protocols. It achieves superior performance relative to prior state-of-the-art models, such as RealTabPFN-2.5 and TabPFNv2, on both middle- and large-scale tabular data benchmarks, while exhibiting marked gains in efficiency and scalability (Qu et al., 11 Feb 2026).
1. Synthetic Pretraining Data Engine
TabICLv2’s core innovation lies in its exclusive use of a fully synthetic, stochastic data-generation engine during pretraining, avoiding exposure to real tabular datasets and thus imparting diverse inductive biases.
Key components:
- Correlated hyperparameter sampling: For each repeated scalar hyperparameter (e.g., number of categories, tree depths), the generator draws shared latent variables once per hyperparameter name and derives every occurrence from them, followed by deterministic normalization to the required range. This ensures correlations across all occurrences of a given hyperparameter name.
- Random Cauchy DAGs: A random directed acyclic graph (DAG) is formed whose edge-inclusion probabilities combine a global density parameter with per-node out-degree and in-degree biases, the node weights being drawn i.i.d. from the standard Cauchy distribution.
- Random node functions: Each non-root node transforms its parents' values by one of eight randomly chosen function classes:
- RandomNN: Small MLPs (random layers, widths, activations).
- RandomTree: Ensembles of oblivious decision trees (random ensemble sizes and depths).
- RandomDiscretization: Nearest-center assignment under a random distance metric, followed by a small linear map.
- RandomGP: Approximate Gaussian processes via random Fourier features, with exponentially random spectral decay, and random orientation.
- RandomLinear: Dense multiplication by random matrices from various structured distributions.
- RandomQuadratic: Quadratic forms on random dimension subsamples.
- RandomEMAssignment: Expectation–Maximization–style soft cluster assignment with subsequent linear mapping.
- RandomProduct: Products of two random functions.
Feature extraction (converters): Output columns are produced by coordinate extraction from node vectors with optional Kumaraswamy warping (numerical), or by discretizing a subvector into categories via softmax/neighborhood assignment.
- Filtering: Any generated dataset for which a shallow ExtraTreesRegressor (25 trees, max depth 6, out-of-bag) cannot outperform a trivial baseline under a 95% bootstrap test, or whose targets are independent of the features, is rejected (25–35% rejection rate).
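The rejection test above can be sketched as follows. This is a minimal, dependency-light stand-in: it replaces the paper's shallow ExtraTreesRegressor with a leave-one-out 1-nearest-neighbour predictor (an assumption for illustration), keeping only the decision logic of "accept if the cheap model beats the trivial mean baseline under a 95% bootstrap test":

```python
import numpy as np

def passes_filter(X, y, n_boot=200, rng=None):
    """Accept a synthetic dataset only if a cheap predictor beats the
    trivial mean baseline under a 95% bootstrap test. The paper uses a
    shallow ExtraTreesRegressor with out-of-bag predictions; a 1-NN
    predictor stands in here to keep the sketch self-contained."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # Leave-one-out 1-NN predictions act as the "cheap model".
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    pred = y[d.argmin(axis=1)]
    model_err = (y - pred) ** 2
    base_err = (y - y.mean()) ** 2
    # Bootstrap the mean error difference; require the one-sided 95%
    # interval to exclude zero (model strictly better than baseline).
    diffs = base_err - model_err
    boot = rng.choice(diffs, size=(n_boot, n), replace=True).mean(axis=1)
    return np.quantile(boot, 0.05) > 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
signal_y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)  # learnable target
noise_y = rng.normal(size=200)                        # independent of X
```

A dataset like `noise_y`, whose target is independent of the features, fails the test and would be rejected; `signal_y` passes.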
This approach yields datasets encompassing multivariate Gaussian processes, tree ensembles, MLPs, quadratics, and clustering dynamics, boosting pretraining diversity and inductive generalization (Qu et al., 11 Feb 2026).
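The DAG-sampling step can be illustrated with a short sketch. The exact edge-probability formula from the paper is not reproduced here; this version simply combines a global density term with per-node Cauchy-distributed out/in weights and restricts edges to go from lower- to higher-indexed nodes so the graph is acyclic by construction:

```python
import numpy as np

def sample_cauchy_dag(n_nodes, density=0.0, rng=None):
    """Illustrative random-Cauchy-DAG sampler in the spirit of the
    TabICLv2 data engine (not the paper's exact formula)."""
    rng = np.random.default_rng(rng)
    out_w = rng.standard_cauchy(n_nodes)  # out-degree bias per source node
    in_w = rng.standard_cauchy(n_nodes)   # in-degree bias per target node
    logits = density + out_w[:, None] + in_w[None, :]
    p = 0.5 * (1.0 + np.tanh(0.5 * logits))  # sigmoid, overflow-safe
    adj = rng.random((n_nodes, n_nodes)) < p
    return np.triu(adj, k=1)  # keep only i -> j with i < j: a DAG

adj = sample_cauchy_dag(6, density=0.5, rng=0)
```

Because Cauchy draws are heavy-tailed, a few nodes get extreme biases, yielding hub-like structures alongside sparse regions across sampled graphs.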
2. Model Architecture and Attention Scaling Innovations
TabICLv2 advances the "compress-then-ICL" architecture underlying TabICL, incorporating several substantial extensions:
- Repeated feature grouping: Features are embedded in overlapping groups of size three, breaking the symmetry of independent per-column embeddings and enabling richer feature interactions.
- Target-aware embedding: The embedding of the training target is added to every feature token, not merely appended as a separate input.
- Column-wise compression: Each column’s tokens are processed by Set-Transformer blocks with inducing points; the first attention layer employs Query-Aware Scalable Softmax (QASSMax).
- Row-wise interaction: Four learnable [CLS] tokens are prepended to each row; outputs are concatenated into a $4d$ vector, processed via a small Transformer, again with QASSMax.
- Dataset-wide in-context learning (ICL): Embeddings of training targets are added to each row embedding, followed by a deep Transformer (12 layers) where test rows attend to train rows. Final prediction is via a two-layer MLP yielding a 10-way softmax (classification) or 999 quantiles (regression, pinball loss).
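The repeated-grouping idea from the list above can be sketched in a few lines. The paper does not specify the grouping pattern, so this sketch assumes a simple circular sliding window of width three, which is one way to make every feature appear in several overlapping groups:

```python
import numpy as np

def overlapping_feature_groups(X, group_size=3):
    """Embed each feature in several overlapping windows of
    `group_size` columns (circular sliding window; an illustrative
    choice, not necessarily TabICLv2's exact pattern)."""
    n, f = X.shape
    idx = np.arange(f)
    groups = [np.take(idx, np.arange(s, s + group_size), mode="wrap")
              for s in range(f)]  # one window per starting column
    # Stack into (n_rows, n_groups, group_size) tokens ready for embedding.
    return np.stack([X[:, g] for g in groups], axis=1)

X = np.arange(12.0).reshape(3, 4)       # 3 rows, 4 features
tokens = overlapping_feature_groups(X)  # shape (3, 4, 3)
```

Each feature now occurs in three different windows, so the downstream embedding no longer treats columns as interchangeable singletons.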
Query-Aware Scalable Softmax (QASSMax): Traditional softmax attention suffers "attention fading" as sequence length increases, because the softmax denominator accumulates over ever more keys and spreads the attention mass. QASSMax counteracts this with a multiplicative logit-scaling factor that grows with context length, plus query-specific gating. This adjustment maintains sharply concentrated attention, especially at test-time scales exceeding those seen in training, improving "needle-in-haystack" retrieval (Qu et al., 11 Feb 2026).
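A hedged sketch of the mechanism, following the Scalable-Softmax idea of multiplying logits by a factor proportional to log(n) and adding a query-dependent gate. The paper's exact QASSMax formula is not reproduced here; `s` and `gate_w` are illustrative assumptions:

```python
import numpy as np

def scalable_softmax_attention(q, k, v, s=0.4, gate_w=None):
    """Attention with Scalable-Softmax-style logit scaling plus a
    query-dependent gate, in the spirit of QASSMax (illustrative only)."""
    n, d = k.shape
    logits = (q @ k.T) / np.sqrt(d)
    logits = logits * (s * np.log(n))  # counteracts attention fading with n
    if gate_w is not None:
        # Hypothetical query-specific gate in (0, 1).
        gate = 1.0 / (1.0 + np.exp(-(q @ gate_w)))
        logits = logits * gate[:, None]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=shape) for shape in [(2, 8), (16, 8), (16, 8)])
out = scalable_softmax_attention(q, k, v, gate_w=rng.normal(size=8))
```

As `n` grows, the log(n) factor keeps the top logit dominant after normalization instead of letting the distribution flatten toward uniform.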
3. Muon Optimizer and Pretraining Strategy
In contrast to common AdamW optimization, TabICLv2 adopts the Muon optimizer (Jordan et al. 2024; Schaipp 2025):
- Parameter-wise learning rate scaling: The learning rate of each weight matrix is rescaled according to its shape, keeping update magnitudes comparable across layers.
- Update rule: Momentum-accumulated gradients are passed through an adaptive matrix preconditioner (approximate orthogonalization), combined with cautious weight decay that acts only on entries where the update and the weight agree in sign (Qu et al., 11 Feb 2026). This procedure encourages stability and proper scaling in large models.
- Scheduling and batch size: Pretraining employs a batch size of 64 throughout. Training proceeds via:
- Stage 1: 500K steps at the base context length, peak LR with cosine decay
- Stage 2: 40K steps at a longer context, reduced LR
- Stage 3: 10K steps at the longest context, further reduced LR
- Gradient clipping (10), cautious weight decay (0.01).
Total pretraining cost is approximately 24.5 H100 GPU-days, less than half of TabICL’s 60 A100 days (Qu et al., 11 Feb 2026).
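The preconditioning step of Muon can be illustrated with the publicly documented Newton-Schulz iteration from the Muon reference implementation (Jordan et al. 2024); the cautious-decay mask below is an assumption about the sign rule described above, and the whole snippet is a sketch, not TabICLv2's training code:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix via the quintic
    Newton-Schulz iteration used by Muon (coefficients from the public
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize overall scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(w, grad, mom, lr=0.02, beta=0.95, wd=0.01):
    """One Muon-style step with a cautious weight-decay mask: decay is
    applied only where update and weight share sign (an assumption about
    the 'cautious' rule)."""
    mom = beta * mom + grad
    u = newton_schulz_orthogonalize(mom)
    mask = (np.sign(u) == np.sign(w)).astype(w.dtype)  # cautious decay mask
    w = w - lr * (u + wd * mask * w)
    return w, mom

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
w2, mom2 = muon_step(w, rng.normal(size=(8, 4)), np.zeros((8, 4)))
```

In practice Muon is applied only to 2-D weight matrices, with scalar and embedding parameters handled by a conventional optimizer.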
4. Empirical Performance and Scalability
TabICLv2 demonstrates robust generalization and speed advantages:
- TabArena (51 datasets): Sits at the Pareto front of improvability versus compute, outperforming RealTabPFN-2.5 (tuned and ensembled) by 5–10% in relative error while being 5–10× faster for forward-pass ICL. It wins 59% of direct model comparisons.
- TALENT (300 datasets): Matches or exceeds RealTabPFN-2.5 in classification accuracy and regression RMSE, and consistently improves over TabPFNv2, particularly on larger datasets.
- Million-scale tables: Processes million-row tables in under 450 s on a single H100 GPU within bounded GPU and CPU RAM, using disk offloading together with QASSMax; no distillation or external context retrieval is required.
- Metrics: Relative to RealTabPFN-2.5:
- Classification (AUC): 0.80 → 0.85; log-loss 0.25 → 0.18
- Regression (RMSE): 1.02 → 0.95; improvements also seen in CRPS and distributional metrics
| Benchmark | TabICLv2 Outcome | Speedup vs. RealTabPFN-2.5 |
|---|---|---|
| TabArena (improvability) | ∼5–10% error gap improvement | 5–10× |
| TALENT (large datasets) | Matches/exceeds on accuracy, RMSE | — |
| Million-row tables | processed in < 450 s | — |
5. Component Ablation and Analysis
A comprehensive ablation over 60 validation datasets (each up to 2,048 samples) quantifies the impact of each design choice:
- Synthetic prior: Moving from the legacy TabICL prior to the TabICLv2 prior yields >200 Elo gain; substituting the old prior "breaks" TabICLv2’s architecture.
- QASSMax attention: Adds +100 Elo compared to standard softmax.
- Muon optimizer: Switching from AdamW to Muon (with cautious decay, elevated LR) contributes ∼100 Elo.
- Target-aware embedding: Adds +100 Elo.
- Repeated feature grouping and prior filtering: Each adds 20–50 Elo; the long-context pretraining stages are critical for handling large training sets.
Even a reduced TabICLv2 configuration (4 attention heads, 280K pretraining steps) matches RealTabPFN-2.5 in log-loss after about 200K pretraining steps (Qu et al., 11 Feb 2026).
6. Openness and Availability
TabICLv2’s inference code and model weights (for both classification and regression) are publicly available at https://github.com/soda-inria/tabicl. The authors have committed to releasing the complete synthetic prior engine, pretraining scripts, and model checkpoints in the same repository, in contrast to the proprietary status of competing models such as RealTabPFN. This extensive openness aims to democratize state-of-the-art tabular foundation modeling research (Qu et al., 11 Feb 2026).