Tabular In-Context Learning Overview
- Tabular ICL is a meta-learning paradigm where transformer-based models use a few demonstration rows from structured data to make zero-shot predictions without gradient updates.
- Fairness in Tabular ICL is assessed using group fairness metrics such as ΔDP, ΔEOD, and ΔEOP to evaluate bias across sensitive attributes.
- Uncertainty-based demonstration selection improves fairness by suppressing group signals, achieving lower bias gaps with only a minor drop in overall accuracy.
Tabular In-Context Learning (ICL) refers to a meta-learning paradigm in which a pre-trained, often transformer-based, model is provided with a small set of example rows—termed "demonstrations" or "prompts"—from a tabular dataset along with a new test row, and is asked to predict the corresponding label for the test point without any gradient-based parameter updates. This approach enables zero-shot task adaptation on structured data and is increasingly positioned as a competitive alternative to traditional methods such as gradient-boosted trees. Recent research has extended tabular ICL to the scale and technical depth required for modern real-world prediction, and has initiated a systematic study of the fairness properties of these models and of possible interventions to mitigate group bias in these settings (Kenfack et al., 14 May 2025).
1. Foundations and Process of Tabular ICL
Tabular ICL is grounded in the use of transformer-based foundation models such as TabPFN or TabICL, which encode sequences comprising demonstration examples—each a pair $(x_i, y_i)$, with $x_i \in \mathbb{R}^d$ (features), $y_i \in \{0,1\}$ (binary target), and an associated $s_i \in \{0,1\}$ (binary sensitive attribute)—and a query test feature $x_{\text{test}} \in \mathbb{R}^d$ of the same dimension. The predictive process is as follows:
- The model encodes the concatenated sequence $\big((x_1, y_1), \ldots, (x_k, y_k), x_{\text{test}}\big)$ through stacked Transformer layers into a representation $h$.
- A classification head $f$ computes per-label scores, leading to the predicted label $\hat{y} = \arg\max_{y} f_y(h)$ and the probability estimate $\hat{p}(y \mid x_{\text{test}}, \text{context}) = \mathrm{softmax}(f(h))_y$.
- Demonstration selection is typically randomized or based on simple heuristics (e.g., nearest neighbor in feature space for small context sizes $k$); a minimal sketch follows.
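As a concrete illustration, the following minimal sketch shows how a prediction is obtained from a TabPFN-style model under the vanilla (random-prompt) baseline; it assumes the public `tabpfn` package, and all data here is synthetic.

```python
import numpy as np
from tabpfn import TabPFNClassifier  # assumes the public tabpfn package

rng = np.random.default_rng(0)

# Synthetic stand-ins: a labeled pool (X, y) and query rows to classify.
X, y = rng.normal(size=(1000, 10)), rng.integers(0, 2, size=1000)
X_test = rng.normal(size=(5, 10))

# Demonstration selection: a random prompt of k rows (the "vanilla" baseline).
k = 128
idx = rng.choice(len(X), size=k, replace=False)

# "Fitting" a tabular ICL model only stores the demonstrations; no gradient
# updates occur. The forward pass conditions jointly on prompt and query.
model = TabPFNClassifier()
model.fit(X[idx], y[idx])
probs = model.predict_proba(X_test)  # per-class probabilities for each query
preds = probs.argmax(axis=1)         # predicted labels
```

Swapping the `idx` selection rule is the single lever that the preprocessing interventions discussed below operate on.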
2. Fairness Metrics in Tabular ICL
Fairness evaluation in tabular ICL follows group fairness paradigms. Let $\hat{Y}$ denote the model's predicted label, $Y$ the ground-truth label, and $S$ the sensitive attribute. The principal metrics, scaled by 100 for readability, are:
| Metric | Formula | Interpretation |
|---|---|---|
| Demographic parity gap ($\Delta\mathrm{DP}$) | $\lvert \Pr(\hat{Y}{=}1 \mid S{=}0) - \Pr(\hat{Y}{=}1 \mid S{=}1) \rvert$ | Gap in positive-prediction rates |
| Equalized odds gap ($\Delta\mathrm{EOD}$) | $\max_{y \in \{0,1\}} \lvert \Pr(\hat{Y}{=}1 \mid S{=}0, Y{=}y) - \Pr(\hat{Y}{=}1 \mid S{=}1, Y{=}y) \rvert$ | Worst gap across true- and false-positive rates |
| Equal opportunity gap ($\Delta\mathrm{EOP}$) | $\lvert \Pr(\hat{Y}{=}1 \mid S{=}0, Y{=}1) - \Pr(\hat{Y}{=}1 \mid S{=}1, Y{=}1) \rvert$ | Gap in true-positive rates |
These metrics compare group-conditional rates of positive predictions, aiming to surface disparities attributable to the sensitive attribute.
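A minimal numpy sketch of these three gaps, following the definitions above (the function name and interface are illustrative):

```python
import numpy as np

def fairness_gaps(y_true, y_pred, s):
    """Group fairness gaps (scaled by 100) between groups s=0 and s=1."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    rate = lambda mask: y_pred[mask].mean()  # positive-prediction rate

    # Demographic parity: gap in overall positive-prediction rates.
    dp = abs(rate(s == 0) - rate(s == 1))
    # Equal opportunity: gap in true-positive rates (stratum Y = 1).
    eop = abs(rate((s == 0) & (y_true == 1)) - rate((s == 1) & (y_true == 1)))
    # Equalized odds: worst gap over both label strata (TPR and FPR).
    fpr_gap = abs(rate((s == 0) & (y_true == 0)) - rate((s == 1) & (y_true == 0)))
    eod = max(eop, fpr_gap)
    return 100 * dp, 100 * eod, 100 * eop
```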
3. Preprocessing Interventions for Fairness
The investigation isolates three distinct preprocessing strategies implemented on the demonstration set:
3.1 Correlation Remover (CR)
Employing the Feldman–Friedler procedure, the sensitive feature $s$ is (approximately) linearly regressed out of every other feature column $x_j$:

$$x_j \approx \beta_j\,(s - \bar{s}) + c_j,$$

resulting in adjusted (residual) features

$$\tilde{x}_j = x_j - \beta_j\,(s - \bar{s}),$$

and the final features

$$x_j' = \alpha\,\tilde{x}_j + (1 - \alpha)\,x_j,$$

where $\alpha \in [0, 1]$ governs the fairness/fidelity trade-off.
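A minimal numpy sketch of this transformation, assuming a single binary sensitive column and per-column ordinary least squares (the helper name is illustrative):

```python
import numpy as np

def correlation_remover(X, s, alpha=1.0):
    """Regress the sensitive attribute out of each feature column, then
    interpolate between residual and original features via alpha."""
    s_centered = s - s.mean()
    # OLS slope of each column on s (intercept absorbed by centering).
    beta = X.T @ s_centered / (s_centered @ s_centered)
    residual = X - np.outer(s_centered, beta)   # tilde{x}_j = x_j - beta_j (s - s_bar)
    return alpha * residual + (1 - alpha) * X   # alpha in [0, 1]: fairness vs. fidelity
```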
3.2 Group-Balanced Demonstration Selection
Exactly $k/2$ demonstrations are drawn from each sensitive group ($s = 0$ and $s = 1$). If a group is underrepresented, the overrepresented group is down-sampled.
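A sketch of this selection rule (hypothetical helper; `rng` is a numpy `Generator`):

```python
import numpy as np

def group_balanced_prompt(s, k, rng):
    """Return demonstration indices split evenly across sensitive groups;
    the overrepresented group is down-sampled to match the smaller one."""
    idx0, idx1 = np.flatnonzero(s == 0), np.flatnonzero(s == 1)
    per_group = min(k // 2, len(idx0), len(idx1))
    return np.concatenate([
        rng.choice(idx0, size=per_group, replace=False),
        rng.choice(idx1, size=per_group, replace=False),
    ])
```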
3.3 Uncertainty-Based Demonstration Selection
A separate predictor (logistic regression or TabPFN) is trained to estimate the sensitive attribute $s$ from the features $x$. Conformal prediction at a chosen coverage level is run on a validation split, and examples whose conformal prediction set $C(x)$ contains both classes ("uncertain" points) are selected for the prompt. The coverage level is a tunable hyperparameter controlling how strictly a point must be ambiguous to qualify.
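A split-conformal sketch of this selector, using a logistic-regression attribute predictor; the helper name, interface, and default coverage level are assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertain_prompt(X_tr, s_tr, X_cal, s_cal, X_pool, coverage=0.9):
    """Select pool indices whose conformal prediction set for the sensitive
    attribute contains BOTH classes, i.e., maximally ambiguous points."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

    # Split-conformal calibration: nonconformity = 1 - prob of the true class.
    cal_probs = clf.predict_proba(X_cal)
    scores = 1.0 - cal_probs[np.arange(len(s_cal)), s_cal]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * coverage) / n)
    q = np.quantile(scores, level, method="higher")

    # A class enters the prediction set when its nonconformity is <= q.
    in_set = (1.0 - clf.predict_proba(X_pool)) <= q  # shape (n_pool, 2)
    return np.flatnonzero(in_set.all(axis=1))        # both classes in the set
```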
4. Experimental Setup and Comparative Results
Benchmarks span Folktables subproblems (ACSIncome, ACSMobility, ACSTravelTime, ACSEmployment, ACSPublicCoverage), CelebA, a diabetes dataset, and German Credit, with group labels defined as gender, race, or age.
TabPFN (max 10,000 examples) and TabICL (max 500,000 examples) are compared under:
- Vanilla (random prompt)
- Group-balanced prompt
- Correlation remover
- Uncertainty-based selection with a logistic-regression attribute predictor (Uncertainty+LR)
- Uncertainty-based selection with a TabPFN attribute predictor (Uncertainty+TabPFN)
Key findings:
- CR reliably increases unfairness. For ACSIncome, the fairness gap nearly doubles (from ~14% to ~27%). This occurs because the foundation model learns to invert the linear transformation, effectively reconstructing $s$ with near-perfect accuracy and thus leaking sensitive information.
- Group-balancing yields only marginal improvements—gaps similar to vanilla on many tasks.
- Uncertainty-based selection consistently dominates all other methods on the accuracy–fairness trade-off. The "uncertain+TabPFN" variant achieves the lowest group-fairness gaps across almost all tasks, with only a 1–2% decrease in overall accuracy. A clear Pareto frontier is traced by varying the conformal coverage level of the uncertainty selector, unlike CR's $\alpha$ (see the sweep sketch after this list).
- TabPFN marginally outperforms TabICL on accuracy at equal fairness, but both models obey the same intervention ranking.
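Such a frontier can be traced with a simple sweep over the coverage level, reusing the hypothetical helpers sketched above (`uncertain_prompt`, `fairness_gaps`) and assuming train/calibration/pool/test splits named as in those sketches; the grid of levels is illustrative:

```python
# Trace an accuracy-fairness frontier by sweeping the conformal coverage level.
frontier = []
for coverage in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    idx = uncertain_prompt(X_tr, s_tr, X_cal, s_cal, X_pool, coverage=coverage)
    model = TabPFNClassifier().fit(X_pool[idx], y_pool[idx])
    y_hat = model.predict(X_test)
    dp, eod, eop = fairness_gaps(y_test, y_hat, s_test)
    frontier.append((coverage, (y_hat == y_test).mean(), dp, eod, eop))
```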
5. Mechanistic Analysis and Implications
The effectiveness of uncertainty-based selection is attributed to its suppression of group information in the context: by selecting demonstrations with maximal sensitive-attribute ambiguity, the prompt conveys little about $s$, precluding the model from encoding group membership. In contrast, group balancing addresses only the representation imbalance within the prompt and does not suppress proxy variables in feature space, yielding only modest gains.
The observed failure of CR underscores an important aspect: linear preprocessing that modifies test-time features can be "undone" by sufficiently expressive models pre-trained on similar patterns, resulting in a catastrophic leakage of sensitive information.
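One way to probe this failure mode empirically is to train an auditing classifier to recover $s$ from the transformed features; accuracy well above the group base rate signals that the "removed" attribute remains recoverable. A hypothetical sketch:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def leakage_audit(X_transformed, s):
    """Cross-validated accuracy of reconstructing the sensitive attribute
    from preprocessed features; values near 1.0 indicate leakage."""
    auditor = GradientBoostingClassifier()
    return cross_val_score(auditor, X_transformed, s, cv=5).mean()
```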
A plausible implication is that in tabular ICL, preprocessing interventions acting solely on the demonstration set—notably uncertainty-based selection—present the most reliable means of improving group fairness, whereas interventions that modify the test-time feature distribution may be systematically bypassed by sufficiently expressive architectures.
6. Limitations and Future Directions
The scope of the paper is confined to preprocessing-based interventions; it does not examine in-processing (e.g., fairness-aware fine-tuning) or post-processing of ICL outputs. The analysis also presumes ready access to ground-truth sensitive labels for both fairness evaluation and conformal calibration, which may conflict with legal or ethical constraints. Distribution shift between the prompt and test distributions is not considered.
Research gaps are explicitly enumerated:
- Exploring fairness-aware fine-tuning or calibration procedures for tabular foundation models.
- Investigating post-hoc calibration of ICL probabilities to enforce fairness constraints directly on predictions.
- Assessing the robustness of fairness interventions under covariate or label shift between demonstrations and test data.
7. Summary Table: Impact of Preprocessing Strategies
| Strategy | Group Fairness Effect | Accuracy Cost | Notable Failure Mode |
|---|---|---|---|
| Vanilla (random) | Baseline – moderate gaps | – | Encodes demographic bias |
| Correlation Remover (CR) | Worsens fairness—gaps increase | Small | Sensitive information is reconstructed and leaked |
| Group-Balanced Selection | Minor improvement over vanilla | – | Proxy signals not suppressed |
| Uncertainty-Based | Consistently improves all metrics | 1–2% drop | Relies on access to $s$ for selection |
In sum, tabular in-context learning foundation models can encode and even amplify group bias from their demonstration context, but strategic uncertainty-based demonstration selection provides a tractable and effective mitigation pathway at minimal utility cost (Kenfack et al., 14 May 2025). The limitations of preprocessing, the capacity for models to invert naive feature obfuscations, and concrete recommendations for future fairness interventions are all substantiated in the empirical and methodological analysis.