Tabular In-Context Learning Models
- Tabular in-context learning models are transformer-based architectures that condition on labeled examples to predict outcomes without updating parameters.
- Uncertainty-based demonstration selection, through techniques like conformal prediction, consistently reduces group fairness gaps at only a small (≈1–2 percentage point) cost in predictive accuracy.
- These methods are applied in domains such as finance and healthcare to mitigate bias by optimizing context selection rather than retraining models.
Tabular in-context learning (ICL) models are a class of neural architectures that achieve accurate predictions on structured tabular data by conditioning directly on labeled training examples, presented as context, without updating model parameters. Transformer-based tabular foundation models, trained on massive synthetic tabular datasets, underlie this approach and demonstrate highly competitive performance versus gradient-boosted trees, notably requiring no downstream fine-tuning. As ICL for tabular data moves toward deployment in fairness-sensitive domains such as finance, health, and policy, understanding and mitigating algorithmic bias in tabular foundation models is increasingly critical. "Towards Fair In-Context Learning with Tabular Foundation Models" (Kenfack et al., 14 May 2025) provides a rigorous characterization of the architecture, workflows, fairness metrics, and tested pre-processing strategies for bias mitigation in tabular ICL.
1. Transformer Architectures and Pre-training
Tabular ICL relies on transformer backbones adapted for structured data. Categorical features are one-hot encoded or embedded via learned lookup tables; numerical features undergo quantile normalization before projection into a shared embedding space. Column and row embeddings encode feature slot and position, respectively. Pre-training uses millions of synthetic tabular datasets drawn from random causal graphs, with each task formulated as tabular classification (≤500 features, ≤10,000 samples) in a single forward pass. Representative variants include TabPFN (handling up to 10,000 examples/500 features) and TabICL (scaling context capacity to 500,000 examples).
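As an illustration of the preprocessing described above (not the models' internal embedding code), a scikit-learn pipeline of the same flavor might look as follows; numeric_cols, categorical_cols, and X_raw are placeholder names.

```python
# Illustrative preprocessing sketch: quantile-normalize numeric columns and
# one-hot encode categorical ones before projection into the embedding space.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

preprocess = ColumnTransformer([
    # Quantile normalization of numerical features.
    ("num", QuantileTransformer(output_distribution="normal"), numeric_cols),
    # One-hot encoding of categorical features (learned embeddings are an alternative).
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_proc = preprocess.fit_transform(X_raw)  # fed to the transformer's embedding layer
```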
ICL prediction proceeds by constructing a single sequence of K labeled records and one query, formatted as
[CLS] embed(x₁), embed(y₁), ..., embed(x_K), embed(y_K), embed(x_{K+1}), [PT]
The final-layer output at [PT] passes through a classification head to yield the predicted label distribution p(y_{K+1} | x_{K+1}, context).
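In practice this workflow is exposed through a scikit-learn-style interface. The following is a minimal sketch assuming the open-source tabpfn package; X_context, y_context, and X_query are placeholder arrays.

```python
# Minimal sketch of vanilla tabular ICL with a TabPFN-style classifier.
from tabpfn import TabPFNClassifier

clf = TabPFNClassifier()
# "fit" only stores the K labeled demonstrations; no parameters are updated.
clf.fit(X_context, y_context)
# A single forward pass conditions on the stored context to score the query rows.
proba = clf.predict_proba(X_query)
```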
2. Demonstration Selection and Context Construction
Bias in tabular ICL arises through the choice of in-context demonstration records. The context set of K demonstrations is selected from a labeled training pool D = {(x_i, s_i, y_i)}, where s_i ∈ {0,1} denotes a binary sensitive attribute (e.g. gender, age group). Test-time predictions are conditioned on selected context records, with no model updates. The strategies analyzed include:
- Correlation Remover: Implements Feldman et al. (2015) by projecting each non-sensitive feature orthogonally to the sensitive attribute s, yielding decorrelated features x_j⊥. A composite feature x̃_j = α·x_j⊥ + (1−α)·x_j is formed for α ∈ [0,1]; the sensitive attribute is dropped prior to demonstration selection.
- Group-Balanced Selection: Ensures equal representation in context by sampling ⌊K/2⌋ examples from each group (defined by s), and one additional example from the minority group for odd K.
- Uncertainty-Based Selection: For each candidate, a sensitive-attribute classifier h(s | x) is trained on held-out data. The uncertainty score can be the entropy −Σ_s h(s | x)·log h(s | x) or the conformal prediction set size; the K examples with highest uncertainty are selected (see the sketch below). In experiments, conformal prediction (MAPIE) is preferred, retaining examples whose α-level prediction set has cardinality two, i.e. both groups remain plausible.
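A minimal sketch of the entropy variant ("uncertain_lr"), assuming NumPy arrays and a scikit-learn logistic regression as the sensitive-attribute model; the conformal variant would instead keep candidates whose α-level prediction set for s contains both groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_uncertain_context(X_pool, y_pool, X_holdout, s_holdout, K):
    """Pick the K demonstrations whose sensitive attribute is hardest to predict.

    Entropy variant of uncertainty-based selection; the conformal variant would
    instead keep candidates whose prediction set for s has cardinality two.
    """
    # Sensitive-attribute classifier h(s | x), fit on held-out data.
    h = LogisticRegression(max_iter=1000).fit(X_holdout, s_holdout)

    # Entropy of h(s | x) over the candidate pool: high entropy means the
    # record's group membership is ambiguous.
    p = h.predict_proba(X_pool)
    entropy = -np.sum(p * np.log(p + 1e-12), axis=1)

    # Keep the K most uncertain candidates as the in-context demonstrations.
    idx = np.argsort(-entropy)[:K]
    return X_pool[idx], y_pool[idx]
```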
3. Fairness Metrics for In-Context Predictions
Fairness is evaluated using established group metrics. For binary predictions ŷ and groups a ∈ {0,1},
- P(ŷ=1 | s=a) (demographic parity per group)
- TPR_a = P(ŷ=1 | y=1, s=a) (sensitive group True Positive Rate)
- FPR_a = P(ŷ=1 | y=0, s=a) (sensitive group False Positive Rate)
Group-fairness gaps are defined:
- Demographic parity gap: Δ_DP = |P(ŷ=1 | s=0) − P(ŷ=1 | s=1)|
- Equal opportunity gap: Δ_EOp = |TPR_0 − TPR_1|
- Equalized odds gap: Δ_EOd = max(|TPR_0 − TPR_1|, |FPR_0 − FPR_1|)
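Under these definitions, the three gaps can be computed directly from binary predictions. The helper below is an illustrative sketch, with the equalized-odds gap taken as the larger of the TPR and FPR disparities.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, s):
    """Group fairness gaps for binary labels, predictions, and sensitive attribute s."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    rate = lambda mask: y_pred[mask].mean()  # P(y_hat = 1 | mask)
    g0, g1 = (s == 0), (s == 1)

    dp = abs(rate(g0) - rate(g1))                                    # demographic parity gap
    eop = abs(rate(g0 & (y_true == 1)) - rate(g1 & (y_true == 1)))   # TPR gap
    fpr_gap = abs(rate(g0 & (y_true == 0)) - rate(g1 & (y_true == 0)))
    eod = max(eop, fpr_gap)                  # equalized odds gap (max-of-disparities)
    return {"dp": dp, "eop": eop, "eod": eod}
```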
4. Experimental Protocol and Representative Results
Experiments span eight real tabular datasets: several Folktables tasks, Diabetes (BRFSS), German Credit (age ≤25 vs >25), and CelebA (gender vs attractiveness). The workflow applies an 80/20 train/test split, with the held-out 20% dedicated to uncertainty-model fitting. On the 80% remainder, 5-fold cross-validation (×10 seeds) yields mean ± std metrics. Context size K is maximized per model capacity (≤10,000 for TabPFN). Baselines include:
- Vanilla ICL (random context)
- Balanced subsampling
- Correlation remover (α swept in [0,1]; see the sketch after this list)
- Uncertain_lr (logistic regression entropy/conformal uncertainty)
- Uncertain_tabpfn (foundation model-based uncertainty)
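For the correlation-remover baseline, the α sweep can be set up with fairlearn's CorrelationRemover. The sketch below assumes a pandas DataFrame that still contains the sensitive column; it is illustrative rather than the paper's exact pipeline.

```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover

def correlation_removed_views(df, sensitive_col, alphas=np.linspace(0.0, 1.0, 5)):
    """Yield (alpha, decorrelated feature matrix) pairs for the alpha sweep.

    alpha=0 keeps the original non-sensitive features; alpha=1 removes all
    linear correlation with `sensitive_col`, which is dropped from the output.
    """
    for alpha in alphas:
        remover = CorrelationRemover(sensitive_feature_ids=[sensitive_col], alpha=alpha)
        yield alpha, remover.fit_transform(df)
```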
Key findings:
- Group-balanced demonstration selection yields only marginal fairness improvement relative to vanilla ICL.
- Correlation remover frequently increases Δ_DP, Δ_EOp, and Δ_EOd: at test time, the transformer re-infers s via subtle reconstructed feature patterns.
- Uncertainty-based selection, especially with foundation-model-derived uncertainty ("uncert_tabpfn"), consistently provides the lowest group fairness gaps across datasets, with only ≈1–2 percentage point reduction in predictive accuracy.
- Pareto-front analysis over the conformal coverage parameter α shows uncertainty selection dominates other pre-processing strategies in the fairness-accuracy trade-off. TabPFN slightly outperforms TabICL for combined fairness–accuracy, but both benefit substantially from uncertainty-based selection.
5. Implications for Fairness Mitigation in Tabular ICL
Bias mitigation in tabular ICL fundamentally hinges on demonstration selection rather than model retraining. Correlation-based preprocessing, a mainstay in classic fair ML, may prove counterproductive: the transformer leverages intricate feature dependencies to restore dropped sensitive attributes post-decorrelation. Simple group-balanced contexts are insufficient for robust fairness. In contrast, selecting demonstration records for maximal uncertainty under a sensitive-attribute model minimizes the risk of context-driven bias propagation. This approach operates solely at the data presentation layer and does not require updating the pretrained tabular foundation model.
6. Future Directions and Open Challenges
The most effective fairness regimen found, uncertainty-based context selection, suggests further study of model-driven demonstration retrieval algorithms, potentially integrating advanced conformal prediction methods or active sampling protocols. Extending these techniques to multi-class sensitive attributes, structured outputs, and regression remains open. Likewise, formal proofs of group-fairness guarantees under context uncertainty selection are needed. As tabular ICL models propagate into privacy-regulated domains, intersectional analyses (fairness by multiple overlapping sensitive groups) and compatibility with privacy-preserving data frameworks (such as DP-TabICL (Carey et al., 8 Mar 2024)) will be increasingly important.
7. Summary Table: Fairness Strategies and Effects
| Selection Method | Fairness Impact | Accuracy Cost |
|---|---|---|
| Balanced Sampling | Marginal reduction | None |
| Correlation Remover | Often increases gap | None/small |
| Uncertainty-Based | Consistent reduction | ~1–2pp |
Uncertainty-based demonstration selection using conformal prediction outperforms both balanced and correlation-removal approaches for reducing group fairness gaps (Δ_DP, Δ_EOp, Δ_EOd) in tabular in-context learning, with minimal accuracy compromise (Kenfack et al., 14 May 2025). This establishes demonstration retrieval as the central tool for fair in-context learning over tabular data.