
Tabular In-Context Learning Overview

Updated 9 November 2025
  • Tabular ICL is a meta-learning paradigm where transformer-based models use a few demonstration rows from structured data to make zero-shot predictions without gradient updates.
  • Fairness in Tabular ICL is assessed using group fairness metrics such as ΔDP, ΔEOD, and ΔEOP to evaluate bias across sensitive attributes.
  • Uncertainty-based demonstration selection improves fairness by suppressing group signals, achieving lower bias gaps with only a minor drop in overall accuracy.

Tabular In-Context Learning (ICL) refers to a meta-learning paradigm in which a pre-trained, often transformer-based, model is provided with a small set of example rows—termed “demonstrations” or “prompts”—from a tabular dataset along with a new test row, and is asked to predict the corresponding label for the test point without any gradient-based parameter updates. This approach enables zero-shot task adaptation on structured data and is increasingly positioned as a competitive alternative to traditional methods such as gradient-boosted trees. Recent research has extended the capabilities of tabular ICL to the scale and technical depth required for modern real-world prediction, and has initiated a systematic study of the fairness properties of these models and of possible interventions to mitigate group bias in these settings (Kenfack et al., 14 May 2025).

1. Foundations and Process of Tabular ICL

Tabular ICL is grounded in the use of transformer-based foundation models such as TabPFN or TabICL, which encode sequences comprising $k$ demonstration examples—each a triple $(x_{i_j}, y_{i_j}, s_{i_j})$, with $x_{i_j} \in \mathbb{R}^d$ (features), $y_{i_j} \in \{0,1\}$ (binary target), and $s_{i_j} \in \{0,1\}$ (binary sensitive attribute)—and a query test feature $x_{\mathrm{test}}$ of the same dimension. The predictive process is as follows:

  • The model $\mathsf{F}$ encodes the concatenated sequence $[(x_{i_1}, y_{i_1}), \dots, (x_{i_k}, y_{i_k}), x_{\mathrm{test}}]$ through stacked Transformer layers into a representation $h$.
  • A classification head computes per-label scores, leading to the predicted label

$$\hat{y} = \arg\max_{y \in \{0,1\}} \mathsf{score}(h; y)$$

and probability

$$p(y \mid x_{\mathrm{test}}, \{(x_{i_j}, y_{i_j})\}) = \mathrm{softmax}\big(\mathsf{score}(h; y)\big)$$

  • Demonstration selection is typically randomized or based on simple heuristics (e.g., nearest neighbor in feature space for small $k$).
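The interface above can be sketched in a few lines. In a real foundation model the score function is a learned transformer head; here it is replaced by a hypothetical distance-weighted class score purely to make the argmax/softmax mechanics concrete (no parameters are updated at prediction time):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def icl_predict(demos_x, demos_y, x_test):
    # Hypothetical stand-in for the transformer's score(h; y):
    # demonstrations closer to the query vote more strongly for their label.
    dists = np.linalg.norm(demos_x - x_test, axis=1)
    weights = 1.0 / (1.0 + dists)
    scores = np.array([weights[demos_y == y].sum() for y in (0, 1)])
    probs = softmax(scores)          # p(y | x_test, demonstrations)
    y_hat = int(np.argmax(scores))   # argmax over per-label scores
    return y_hat, probs

# k = 4 demonstration rows (x_i, y_i) and one query row.
demos_x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
demos_y = np.array([0, 0, 1, 1])
y_hat, probs = icl_predict(demos_x, demos_y, np.array([0.95, 1.0]))
```

The query sits near the two class-1 demonstrations, so the class-1 score dominates; swapping in a different demonstration set changes the prediction without any retraining, which is the defining property of ICL.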

2. Fairness Metrics in Tabular ICL

Fairness evaluation in tabular ICL follows group fairness paradigms. Let $f(x)$ denote the model's predicted label. The principal metrics, scaled by 100 for readability, are:

| Metric | Formula | Interpretation |
| --- | --- | --- |
| $\Delta\mathrm{DP}$ | $\lvert P(f(X)=1 \mid S=0) - P(f(X)=1 \mid S=1)\rvert$ | Demographic parity gap |
| $\Delta\mathrm{EOD}$ | $\sum_{y \in \{0,1\}} \lvert P(f(X)=1 \mid S=0, Y=y) - P(f(X)=1 \mid S=1, Y=y)\rvert$ | Equalized odds gap |
| $\Delta\mathrm{EOP}$ | $\lvert P(f(X)=1 \mid S=0, Y=1) - P(f(X)=1 \mid S=1, Y=1)\rvert$ | Equal opportunity gap |

These metrics compare group-conditional rates of positive predictions, aiming to surface disparities attributable to the sensitive attribute.
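Under these definitions, the gaps can be computed directly from predictions, labels, and group membership. A minimal NumPy sketch (the function name and dictionary keys are illustrative; the ×100 scaling follows the text):

```python
import numpy as np

def fairness_gaps(y_pred, y_true, s):
    """Group-fairness gaps between groups s=0 and s=1, scaled by 100."""
    y_pred, y_true, s = (np.asarray(a) for a in (y_pred, y_true, s))
    rate = lambda m: y_pred[m].mean()   # group-conditional positive rate
    d_dp = abs(rate(s == 0) - rate(s == 1))
    d_eod = sum(abs(rate((s == 0) & (y_true == y)) -
                    rate((s == 1) & (y_true == y)))
                for y in (0, 1))
    d_eop = abs(rate((s == 0) & (y_true == 1)) - rate((s == 1) & (y_true == 1)))
    return {"dDP": 100 * d_dp, "dEOD": 100 * d_eod, "dEOP": 100 * d_eop}

# Toy example: group s=0 receives positive predictions far more often.
gaps = fairness_gaps(y_pred=[1, 1, 0, 1, 0, 1, 0, 0],
                     y_true=[0, 1, 0, 1, 0, 1, 0, 1],
                     s=[0, 0, 0, 0, 1, 1, 1, 1])
```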

3. Preprocessing Interventions for Fairness

The investigation isolates three distinct preprocessing strategies implemented on the demonstration set:

3.1 Correlation Remover (CR)

Employing the Feldman–Friedler procedure, the sensitive feature $s$ is (approximately) linearly regressed out of every other feature $z^j$:

$$w_j^* = \arg\min_{w} \left\| z^j - (\mathbf{S} - \mathbf{1}\bar{s}^\top) w \right\|_2^2,$$

resulting in adjusted features

$$\mathbf{Z}^* = \mathbf{Z} - (\mathbf{S} - \mathbf{1}\bar{s}^\top) W^*$$

and the final features

$$\mathbf{X}' = \alpha\, \mathbf{Z}^* + (1-\alpha)\, \mathbf{Z},$$

where $\alpha \in [0,1]$ governs the fairness/fidelity trade-off.
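A minimal NumPy sketch of this procedure, using one least-squares fit per feature column against the centred sensitive column (the helper name is illustrative; libraries such as Fairlearn ship a comparable `CorrelationRemover`):

```python
import numpy as np

def correlation_remover(Z, s, alpha=1.0):
    """Regress the centred sensitive column out of every feature column
    (Feldman-Friedler style), then interpolate with alpha in [0, 1]."""
    Sc = np.asarray(s, float).reshape(-1, 1)
    Sc = Sc - Sc.mean(axis=0)                    # S - 1 s_bar
    W, *_ = np.linalg.lstsq(Sc, Z, rcond=None)   # w_j* for all columns at once
    Z_star = Z - Sc @ W                          # Z* = Z - (S - 1 s_bar) W*
    return alpha * Z_star + (1 - alpha) * Z      # X' = alpha Z* + (1-alpha) Z

rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=200)
Z = rng.normal(size=(200, 3)) + 2.0 * s[:, None]  # features correlated with s
X_prime = correlation_remover(Z, s, alpha=1.0)    # linearly decorrelated from s
```

Note that only linear correlation with $s$ is removed; as the experiments below show, an expressive model can still invert this transformation.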

3.2 Group-Balanced Demonstration Selection

Exactly $\lfloor k/2 \rfloor$ demonstrations are drawn from each sensitive group ($s=0$ and $s=1$). If a group is underrepresented, the overrepresented group is down-sampled to match.
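This selection rule is straightforward to implement; a sketch (the function name is an assumption):

```python
import numpy as np

def group_balanced_prompt(s, k, rng=None):
    """Return floor(k/2) demonstration indices from each sensitive group,
    down-sampling toward the smaller group if one is underrepresented."""
    if rng is None:
        rng = np.random.default_rng()
    s = np.asarray(s)
    idx0, idx1 = np.flatnonzero(s == 0), np.flatnonzero(s == 1)
    per_group = min(k // 2, len(idx0), len(idx1))
    chosen = np.concatenate([
        rng.choice(idx0, size=per_group, replace=False),
        rng.choice(idx1, size=per_group, replace=False),
    ])
    rng.shuffle(chosen)  # avoid group-ordered prompts
    return chosen

s = np.array([0] * 30 + [1] * 10)   # group s=1 underrepresented
idx = group_balanced_prompt(s, k=16, rng=np.random.default_rng(1))
```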

3.3 Uncertainty-Based Demonstration Selection

A separate predictor $g(x)$ (logistic regression or TabPFN) is trained to predict $s$. Conformal prediction at coverage level $\epsilon$ is calibrated on a validation split, and the $k$ prompt examples are chosen from points whose prediction set $C_\epsilon(x)$ contains both classes (“uncertain” points). Typically $\epsilon = 0.05$ is used.
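A sketch of this selection step under standard split conformal prediction, with logistic regression as the sensitive-attribute predictor $g$ (the function name and the fit/calibration split are assumptions; the paper's exact conformal recipe may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertain_demo_indices(X_fit, s_fit, X_cal, s_cal, X_pool, k, eps=0.05):
    """Pick up to k pool points whose split-conformal prediction set for
    the sensitive attribute contains BOTH classes (ambiguous group)."""
    g = LogisticRegression(max_iter=1000).fit(X_fit, s_fit)
    # Nonconformity score: 1 - predicted probability of the true group label.
    p_cal = g.predict_proba(X_cal)
    scores = 1.0 - p_cal[np.arange(len(s_cal)), np.asarray(s_cal)]
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - eps)) / n))
    # C_eps(x) contains class y iff 1 - p(y|x) <= q; keep points with both.
    p_pool = g.predict_proba(X_pool)
    uncertain = np.all(1.0 - p_pool <= q, axis=1)
    return np.flatnonzero(uncertain)[:k]

rng = np.random.default_rng(0)
X_fit, s_fit = rng.normal(size=(300, 4)), rng.integers(0, 2, size=300)
X_cal, s_cal = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)
X_pool = rng.normal(size=(200, 4))
prompt_idx = uncertain_demo_indices(X_fit, s_fit, X_cal, s_cal, X_pool, k=10)
```

Because the selected rows are exactly those for which $g$ cannot commit to a group, the prompt carries little usable signal about $s$.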

4. Experimental Setup and Comparative Results

Benchmarks span Folktables subproblems (ACSIncome, ACSMobility, ACSTravelTime, ACSEmployment, ACSPublicCoverage), CelebA, a diabetes dataset, and German Credit, with group labels defined as gender, race, or age.

TabPFN (max 10,000 examples) and TabICL (max 500,000 examples) are compared under:

  • Vanilla (random prompt)
  • Group-balanced prompt
  • Correlation remover ($\alpha = 1$)
  • Uncertainty+LR ($\epsilon = 0.05$)
  • Uncertainty+TabPFN ($\epsilon = 0.05$)

Key findings:

  • CR reliably increases unfairness. For ACSIncome, $\Delta\mathrm{DP}$ nearly doubles (from ~14% to ~27%). This occurs because the foundation model learns to invert the linear transformation, effectively reconstructing $s$ with near-perfect accuracy, thus leaking sensitive information.
  • Group-balancing yields only marginal improvements—gaps similar to vanilla on many tasks.
  • Uncertainty-based selection consistently dominates all other methods in the accuracy–fairness trade-off. The “uncertain+TabPFN” variant achieves the lowest group-fairness gaps ($\Delta\mathrm{DP}$, $\Delta\mathrm{EOP}$, $\Delta\mathrm{EOD}$) across almost all tasks, with only a 1–2% decrease in overall accuracy. A clear Pareto frontier is traced by varying $\epsilon$ for uncertainty-based selection, unlike CR's $\alpha$.
  • TabPFN marginally outperforms TabICL on accuracy at equal fairness, but both models obey the same intervention ranking.

5. Mechanistic Analysis and Implications

The effectiveness of uncertainty-based selection is attributed to its suppression of group information in the context: by selecting demonstrations with maximal sensitive-attribute ambiguity, the prompt conveys little about $s$, preventing the model from encoding group membership. In contrast, group balancing addresses only representation imbalance in the prompt and does not suppress proxy variables in feature space, which explains its modest gains.

The observed failure of CR underscores an important aspect: linear preprocessing that modifies test-time features can be "undone" by sufficiently expressive models pre-trained on similar patterns, resulting in a catastrophic leakage of sensitive information.

A plausible implication is that in tabular ICL, preprocessing interventions acting solely on the demonstration set—even uncertainty-based ones—present the most reliable means for group fairness, whereas interventions affecting the test distribution may be systematically bypassed by sufficiently expressive architectures.

6. Limitations and Future Directions

The scope of the paper is confined to preprocessing-based interventions; it does not examine in-processing (e.g., fairness-aware fine-tuning) or post-processing of ICL outputs. The analysis also presumes ready access to ground-truth sensitive labels for both fairness evaluation and conformal calibration, which may conflict with legal or ethical guidelines. Distribution shift between prompt and test distributions is not considered.

Research gaps are explicitly enumerated:

  • Exploring fairness-aware fine-tuning or calibration procedures for tabular foundation models.
  • Investigating post-hoc calibration of ICL probabilities to enforce fairness constraints directly on predictions.
  • Assessing the robustness of fairness interventions under covariate or label shift between demonstrations and test data.

7. Summary Table: Impact of Preprocessing Strategies

| Strategy | Group Fairness Effect | Accuracy Cost | Notable Failure Mode |
| --- | --- | --- | --- |
| Vanilla (random) | Baseline: moderate gaps | — | Encodes demographic bias |
| Correlation Remover (CR) | Worsens fairness: gaps increase | Small | Sensitive information near-perfectly leaked |
| Group-Balanced Selection | Minor improvement over vanilla | — | Proxy signals not suppressed |
| Uncertainty-Based | Consistently improves all metrics | 1–2% drop | Relies on access to $s$ for selection |

In sum, tabular in-context learning foundation models can encode and even amplify group bias from their demonstration context, but strategic uncertainty-based demonstration selection provides a tractable and effective mitigation pathway at minimal utility cost (Kenfack et al., 14 May 2025). The limitations of preprocessing, the capacity for models to invert naive feature obfuscations, and concrete recommendations for future fairness interventions are all substantiated in the empirical and methodological analysis.
