TabICL: In-Context Learning on Tabular Data

Updated 26 November 2025
  • TabICL is a paradigm that reformulates tabular prediction as an in-context learning task by presenting a sequence of labeled examples to produce predictions without parameter updates.
  • It utilizes a two-stage embedding process—column-wise transformations and row-wise interactions—to effectively capture feature relationships and manage high-dimensional data.
  • TabICL integrates advanced demonstration selection, fairness, privacy, and adversarial defense strategies, facilitating efficient predictions on large-scale tabular datasets.

TabICL refers to a family of tabular foundation models and associated methodologies that perform in-context learning (ICL) on tabular data. In this paradigm, the model is presented with a sequence of labeled tabular examples (the “context”) and queried on new inputs, producing predictions without any parameter updates. TabICL architectures and inference strategies are engineered for scalability (up to hundreds of thousands of instances), group fairness, privacy, adversarial robustness, and token efficiency, constituting an emerging alternative to traditional tabular machine learning approaches such as gradient-boosted trees and fully-supervised neural networks.

1. Formal Definition and TabICL Architectures

TabICL reformulates supervised tabular prediction as a contextual modeling problem, where the training set $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$, with each $\mathbf{x}_i \in \mathbb{R}^d$ (features) and $y_i \in \mathcal{Y}$ (labels), is supplied to the model as serialized context. At inference, a set of $k$ in-context demonstrations $C_k = \{(\mathbf{x}_j, y_j)\}_{j=1}^k$ is concatenated (typically via numerical or CSV-style serialization) together with a query instance $\mathbf{x}^*$:

$$I = ([\mathbf{x}_1, y_1], \ldots, [\mathbf{x}_k, y_k], [\mathbf{x}^*, ?])$$

The model then produces a predictive distribution over the possible labels for $\mathbf{x}^*$:

$$P(y^* \mid C_k, \mathbf{x}^*) = \mathrm{Softmax}(o(I))$$

The classifier’s prediction is obtained as $\hat{y} = \arg\max_{y} P(y \mid C_k, \mathbf{x}^*)$ (Kenfack et al., 14 May 2025).
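As a concrete sketch of this formulation, the snippet below serializes $k$ demonstrations with a query row and applies a softmax over label logits $o(I)$. All names and the CSV-style layout are illustrative assumptions, not taken from any particular TabICL implementation:

```python
import numpy as np

def serialize_context(demos, query):
    """Serialize k labeled demonstrations plus a query row, CSV-style.
    The query's label slot is left as '?' per the definition of I."""
    lines = [",".join(map(str, x)) + f",{y}" for x, y in demos]
    lines.append(",".join(map(str, query)) + ",?")
    return "\n".join(lines)

def predict(logits):
    """Softmax over the model's label logits o(I); argmax is the prediction."""
    z = np.exp(logits - np.max(logits))   # shift for numerical stability
    p = z / z.sum()
    return p, int(np.argmax(p))

demos = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]
prompt = serialize_context(demos, [2.5, 3.5])
probs, yhat = predict(np.array([0.2, 1.1]))  # logits stand in for o(I)
```

In a real system the logits would come from a pretrained transformer conditioned on the serialized prompt; here they are placeholders to show the decision rule.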

The canonical TabICL architecture employs a two-stage embedding pipeline (Qu et al., 8 Feb 2025):

  • Column-wise embedding: Each column is transformed with a Set Transformer encoding statistics and local semantics.
  • Row-wise interaction: The resulting per-row vector is processed by a transformer to model feature interactions. This sequence of fixed-sized row embeddings is then passed, along with (optionally one-hot encoded) labels, into a final transformer for in-context prediction. This approach scales to hundreds of thousands of rows and hundreds of columns on modest hardware.
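The two-stage pipeline can be caricatured in a few lines of NumPy. The normalization and dense projection below are toy stand-ins for the Set Transformer and row-wise transformer, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def column_embed(X, d_emb=8):
    """Stage 1 (stand-in for the Set Transformer): normalize each column
    by its statistics and lift every cell to a d_emb-dimensional vector."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    Z = (X - mu) / sd                  # per-column standardization
    W = rng.normal(size=(1, d_emb))    # shared value projection (toy)
    return Z[..., None] * W            # shape (n_rows, n_cols, d_emb)

def row_interact(E):
    """Stage 2 (stand-in for the row-wise transformer): pool each row's
    per-cell embeddings into one fixed-size row vector."""
    return E.mean(axis=1)              # shape (n_rows, d_emb)

X = rng.normal(size=(100, 5))
rows = row_interact(column_embed(X))   # fixed-size row embeddings
```

The key property illustrated is that the row embedding size is independent of the number of columns, which is what lets the final in-context transformer handle tables of varying width.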

2. Pretraining Regimens and Token-Efficient Serialization

TabICL models are typically pretrained on vast quantities of synthetic tasks generated from structural causal models (SCMs) or other random processes, using curriculum strategies to cover a diverse span of dataset sizes (1K to 500K examples), feature types, and activation functions (Qu et al., 8 Feb 2025, Dong et al., 8 Sep 2025). Pretraining minimizes the negative log-likelihood of the posterior predictive conditioned on the provided context:

$$\mathcal{L}_{\text{pre}}(\theta) = -\mathbb{E}_{\text{task}}\,\mathbb{E}_{(\mathbf{x}_{\text{test}}, y_{\text{test}}), \mathcal{D}_{\text{train}}} \log q_{\theta}(y_{\text{test}} \mid \mathbf{x}_{\text{test}}, \mathcal{D}_{\text{train}})$$

Recent works adopt CSV-style prompt serialization with integer-encoding of features for maximal token efficiency and throughput, enabling thousands of context examples within practical language-model context limits (Dong et al., 8 Sep 2025).
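A minimal sketch of such a serialization is shown below. The quantile-binning scheme, bin count, and header names are assumptions for illustration, not the exact encoding used in the cited work:

```python
import numpy as np

def integer_encode(X, n_bins=100):
    """Quantile-bin each feature to small integers (0..n_bins-1), which
    serializes to far fewer tokens than full-precision floats."""
    out = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        # internal quantile cut points for column j
        qs = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        out[:, j] = np.searchsorted(qs, X[:, j])
    return out

def to_csv_prompt(Xq, y):
    """Render integer-encoded rows plus labels as a CSV-style prompt."""
    header = ",".join(f"f{j}" for j in range(Xq.shape[1])) + ",label"
    rows = [",".join(map(str, r)) + f",{t}" for r, t in zip(Xq, y)]
    return "\n".join([header] + rows)

X = np.random.default_rng(1).normal(size=(50, 3))
Xq = integer_encode(X)
prompt = to_csv_prompt(Xq, np.arange(50) % 2)
```

Each row now costs a handful of short integer tokens, which is what lets thousands of context examples fit inside a practical context window.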

3. Demonstration Selection and Scalability

A major challenge for TabICL on large datasets is the finite context window. To mitigate this, advanced retrieval and demonstration-selection strategies have been developed:

  • Retrieval-augmented TabICL: Retrieves the top-$k$ nearest neighbors from the training set as in-context demonstrations, using carefully constructed distance metrics (feature weighting via Pearson/PPS, normalization, categorical matching), thereby scaling ICL to arbitrarily large datasets with $k \ll |\mathcal{D}_{\text{train}}|$ (Wen et al., 5 Feb 2025).
  • Residual-aware in-context example selection: For tabular data generation, selects demonstration examples that bridge the largest distributional residual between synthetic and real data, according to metrics such as Jensen–Shannon divergence or Kolmogorov–Smirnov statistics, iteratively focused on underrepresented regions (Fang et al., 23 Feb 2025).
  • Specialized selection for fairness and privacy: Procedures such as uncertainty-based selection, group-balancing, and correlation-removal can be applied prior to context construction to enforce group fairness or data protection constraints (Kenfack et al., 14 May 2025, Carey et al., 8 Mar 2024).
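The retrieval-augmented strategy can be sketched as follows. The Pearson-based feature weighting is a simplification of the weighting schemes mentioned above, and all names are illustrative:

```python
import numpy as np

def pearson_weights(X, y):
    """Weight each feature by |Pearson correlation| with the label
    (a simplified version of the cited feature-weighting schemes)."""
    w = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return w / (w.sum() + 1e-12)

def retrieve_demonstrations(X_train, y_train, x_query, k=16):
    """Select the top-k nearest training rows under a feature-weighted
    L2 distance to serve as in-context demonstrations."""
    w = pearson_weights(X_train, y_train)
    d = np.sqrt((((X_train - x_query) ** 2) * w).sum(axis=1))
    idx = np.argsort(d)[:k]
    return X_train[idx], y_train[idx]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(float)
X_demo, y_demo = retrieve_demonstrations(X_train, y_train, X_train[0], k=8)
```

Because only $k$ rows are serialized per query, the context length stays constant regardless of training-set size.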

4. Fairness, Privacy, and Societal Considerations

The fairness of in-context predictions in TabICL is not guaranteed by default and has been directly interrogated in recent work (Kenfack et al., 14 May 2025). Key findings include:

  • Preprocessing for fairness:
    • Correlation Remover (CR): Linear debiasing of features may paradoxically increase group disparity (e.g., doubling $\Delta\mathrm{DP}$).
    • Group-balanced sampling: Simple group balancing yields marginal fairness gains.
    • Uncertainty-based selection: Demonstration selection based on the conformal uncertainty of the sensitive attribute (i.e., ambiguity in predicted group memberships) systematically lowers the demographic-parity, equal-opportunity, and equalized-odds gaps ($\Delta\mathrm{DP}$, $\Delta\mathrm{EOP}$, $\Delta\mathrm{EOD}$), while minimally reducing predictive accuracy (typically 0–1 percentage points) (Kenfack et al., 14 May 2025).
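The uncertainty-based criterion can be sketched in a few lines. Scoring ambiguity as distance of the predicted group-membership probability from 0.5 is a simplified stand-in for the conformal-uncertainty machinery of the cited work:

```python
import numpy as np

def uncertainty_select(group_probs, k):
    """Pick the k candidate demonstrations whose sensitive-attribute
    prediction is most ambiguous (probability closest to 0.5)."""
    ambiguity = -np.abs(group_probs - 0.5)   # higher = more uncertain
    return np.argsort(ambiguity)[-k:]

# predicted probability that each candidate belongs to the protected group
probs = np.array([0.91, 0.52, 0.08, 0.48, 0.99])
chosen = uncertainty_select(probs, k=2)
```

The intuition is that examples whose group membership the model cannot resolve carry little group-identifying signal, so conditioning on them reduces the leakage that drives disparate predictions.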

TabICL has also been extended to admit rigorous privacy constraints:

  • LDP-TabICL and GDP-TabICL employ local and global differential privacy mechanisms, respectively, injecting noise at the record or group-statistic level prior to serialization. This yields $\epsilon$-DP guarantees while maintaining high utility on most benchmarks, and is particularly effective on unbalanced tasks, where LLM-based prompts outperform DP-trained classical baselines at low $\epsilon$ (Carey et al., 8 Mar 2024).
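Record-level noise injection of this kind can be sketched with the standard Laplace mechanism. The clamping bounds and sensitivity calculation below are generic; the cited papers' exact encodings may differ:

```python
import numpy as np

def laplace_ldp(x, epsilon, lo, hi):
    """Local DP for a bounded numeric record: clamp to [lo, hi], then add
    Laplace noise with scale = sensitivity / epsilon, where sensitivity
    is the range (hi - lo) of each clamped value."""
    rng = np.random.default_rng()
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    scale = (hi - lo) / epsilon
    return x + rng.laplace(0.0, scale, size=x.shape)

# each record is privatized locally before it is serialized into a prompt
record = laplace_ldp([0.2, 5.0], epsilon=1.0, lo=0.0, hi=1.0)
```

Smaller $\epsilon$ means larger noise and stronger privacy; the finding above is that prompt-based prediction degrades gracefully in this regime compared with DP-trained classical models.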

5. Robustness and Adversarial Defenses

TabICL architectures are susceptible to adversarial perturbations at test time. Structured attacks (gradient-based CAPGD within relational/box constraints, multi-objective search) can significantly degrade accuracy (e.g., on LCLD: from 60.5% clean to 12.5% under attack), even exceeding the fragility of random forests and deep models (Djilani et al., 3 Jun 2025). Two defense regimes have been proposed:

  • Adversarial Fine-Tuning (AFT): Standard adversarial training updating model weights. Yields modest robust accuracy gains at high compute cost.
  • Adversarial In-Context Learning (AICL): Incrementally perturbs the context examples without any weight updates. This can achieve substantial robust-accuracy improvements under constrained attacks (e.g., on LCLD: 37.2% robust accuracy after AICL vs. 12.5% originally), while preserving or even improving clean accuracy (Djilani et al., 3 Jun 2025).
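A schematic of the AICL idea: grow the context with perturbed copies of its own examples (labels unchanged), touching no model weights. The perturbation function here is a toy stand-in for a constrained attack such as CAPGD, and the interface is an assumption:

```python
import numpy as np

def toy_perturb(X, eps):
    """Toy stand-in for a constrained attack: sign noise in an eps-ball."""
    return X + eps * np.sign(np.random.default_rng(0).normal(size=X.shape))

def aicl_augment(context_X, context_y, perturb_fn, steps=3, eps=0.1):
    """AICL sketch: append adversarially perturbed versions of the
    context examples (with their original labels) over several steps,
    leaving the model's parameters untouched."""
    X_parts, y_parts = [context_X], [context_y]
    for _ in range(steps):
        X_parts.append(perturb_fn(context_X, eps))
        y_parts.append(context_y)
    return np.concatenate(X_parts), np.concatenate(y_parts)

Xc = np.random.default_rng(1).normal(size=(10, 4))
yc = np.arange(10) % 2
X_aug, y_aug = aicl_augment(Xc, yc, toy_perturb)
```

Exposing the model to perturbed neighborhoods of each demonstration at inference time plays the role that adversarial training plays for weight-updating defenses.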

AICL leverages the in-context mechanism for robustness, but even after adversarial defense, specialized models optimized for robustness (e.g., STG) can outperform TabICL. Regular curation and refreshing of context examples, as well as combined defense strategies (robust training, certified bounds, randomized smoothing), are recommended for critical applications.

6. Applications Across Domains

TabICL systems have demonstrated state-of-the-art performance, or competitive accuracy with classical baselines, across a wide variety of domains:

  • Large-scale tabular classification: On 200 real datasets (the TALENT benchmark), TabICL matches or exceeds prior foundation models (e.g., TabPFN-v2), with 1.5×–10× faster inference, and wins especially on datasets with more than 10K rows (Qu et al., 8 Feb 2025).
  • Empathy detection from video-derived tabular features: In many-shot settings, TabICL achieves higher empirical accuracy and AUC than XGBoost, random forests, and SVMs under stratified cross-validation (e.g., 0.634 accuracy, 0.665 AUC), though frozen ICL variants underperform fine-tuned models on leave-one-subject-out generalization (Hasan et al., 15 Apr 2025).
  • Tabular data generation: Residual-aware prompt selection via TabGen-ICL achieves substantial improvements in TSTR utility, fidelity metrics, and privacy over random/context-free sampling, with up to a 42.2% reduction in recall error and low risk of over-memorization (Fang et al., 23 Feb 2025).
  • Structured NLP tasks: Tabular ICL is effective for tasks such as relational triple extraction when prompt schemas and demonstration selection policies are customized (e.g., TABLEIE, I²CL), yielding strong few-shot F1 improvements versus baseline prompt designs (Li et al., 21 Feb 2024).

7. Limitations and Prospective Directions

Current limitations include:

  • Context length constraints and token efficiency: Serializing hundreds of demonstrations may exceed LLM context windows for high-dimensional data. Techniques such as batch inference, token-efficient integer encoding, and retrieval-based context selection partially alleviate this bottleneck (Dong et al., 8 Sep 2025, Wen et al., 5 Feb 2025).
  • Restriction to classification: Many core TabICL implementations are presently focused on classification; extension to regression and hierarchical multi-label tasks is straightforward but mostly unimplemented at scale (Qu et al., 8 Feb 2025).
  • Synthetic pretraining: Most TabICL models are pretrained on synthetic SCM-generated tasks, with limited integration of real-world tabular data (Qu et al., 8 Feb 2025). Incorporating natural tables may further strengthen their inductive biases.
  • Robustness and fairness generalization: Although ICL admits flexible preprocessing for fairness/robustness, transferability of adversarial attacks is high and context curation is non-trivial (Djilani et al., 3 Jun 2025). Uncertainty-based demonstration selection and AICL remain active areas for securing downstream deployments.

A plausible implication is that future work will focus on: more principled context selection algorithms; sublinear and retrieval-based attention architectures; direct leveraging of meta-data and cross-table generalization; and deeper integration with privacy/fairness constraints for regulated domains. These directions build on the unique advantages of the TabICL framework—scalability, amortization, and zero-parameter adaptation—over classical model-centric workflows.
