TabGemma: Schema-Agnostic ICL for Tabular Data

Updated 9 November 2025
  • TabGemma is a schema-agnostic, text-based in-context learning framework for tabular prediction that integrates numeric canonicalization to handle mixed data types.
  • It mitigates issues like numeric tokenization instability and prompt size limitations by converting continuous values into standardized scientific notation and employing an n-gram based retrieval system.
  • Empirical evaluations on benchmarks like CARTE, TextTab, and TabArena-Lite reveal state-of-the-art classification performance and competitive sample efficiency in few-shot scenarios.

TabGemma is a schema-agnostic, text-based in-context learning (ICL) framework for tabular prediction using LLMs, specifically leveraging a continued-pretrained Gemma 3 12B model. It is designed to perform classification and regression tasks on general tabular data with mixed text, categorical, and numeric features, addressing key challenges of numeric token instability and limited context size. TabGemma establishes new state-of-the-art performance in classification on semantically rich tabular benchmarks and demonstrates competitive sample efficiency and scaling, particularly with increasing context size.

1. Challenges in Tabular ICL with LLMs

Tabular prediction via LLMs for rows containing mixed feature types (text, categorical, numeric) presents two primary difficulties:

  1. Numeric Tokenization Instability: Raw decimal representations (e.g., "3141.592") vary in string length and use locale-dependent symbols, causing inconsistent subword tokenization. This inconsistency precludes learning stable representations of magnitude and scale.
  2. Context Window Constraints: Even with large context capacity (128k tokens in Gemma 3), naïve text serialization of entire tables (or large numbers of rows) can rapidly consume the available prompt budget. This restricts the number of candidate exemplars available for ICL, particularly for large or wide tables, adversely impacting model performance.
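As a rough, purely illustrative budget calculation (the per-cell token cost is an assumption, not a figure from the paper): a table with 30 columns serialized at about 8 tokens per cell costs roughly 240 tokens per row, so 128 retrieved exemplars already consume on the order of 30k tokens, and a few hundred exemplars from a wider table would exhaust even a 128k-token window.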

2. Numeric Canonicalization

To mitigate unstable numeric tokenization, TabGemma converts every continuous value x to a standardized signed scientific-notation string of the form

x = \pm m \times 10^e,

where m \in [1, 10) and e \in \mathbb{Z}. The resulting string encodes x to four significant digits, for example:

+3.1416e+03 for 3141.592

This encoding yields several advantages:

  • Locale Independence: The decimal separator is always ".", and the exponent is marked by "e".
  • Tokenizer Reuse: Subpatterns such as "+", "e+0", and fixed decimal exponents are reusable, reducing the token vocabulary needed for all magnitudes.
  • Stable Embeddings: Eliminates the excessively long or fragmented tokenizations that otherwise prevent the model from learning robust number representations.

In Python, this transformation can be sketched as follows:

from math import floor, log10

def canonicalize(x: float, sig_digits: int = 4) -> str:
    """Encode x as a signed, fixed-precision scientific-notation string."""
    if x == 0:
        return "+0.000e+00"
    sign = "+" if x > 0 else "-"
    a = abs(x)
    e = floor(log10(a))                        # integer exponent
    m = round(a / (10 ** e), sig_digits - 1)   # mantissa, nominally in [1, 10)
    if m >= 10:                                # rounding can push the mantissa to 10.0
        m /= 10
        e += 1
    m_str = format(m, f".{sig_digits - 1}f")
    exp_sign = "+" if e >= 0 else "-"
    e_abs_str = str(abs(e)).zfill(2)
    return f"{sign}{m_str}e{exp_sign}{e_abs_str}"
After canonicalization, the LLM interacts only with a compact, regularized set of numeric strings, substantially enhancing generalization for magnitude and order-of-magnitude information.
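For illustration, a few inputs and the strings the sketch above produces with its default of four significant digits (the input values are chosen arbitrarily for this example):

canonicalize(-9.81)           # "-9.810e+00"
canonicalize(0.000271828)     # "+2.718e-04"
canonicalize(602214076000.0)  # "+6.022e+11"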

3. Continued Pretraining: Target Imputation Objective

TabGemma adapts the base Gemma 3 architecture through further pretraining on 3M real-world tables (the T4 corpus from Tabula 8B), optimizing a target imputation objective:

  • Sampling: In each pretraining iteration, one table is sampled uniformly, and 256 rows are drawn from it.
  • Target-Imputation Task: For every batch, a column C_{\text{target}} is selected at random. Rows are serialized with all features, including C_{\text{target}}, via teacher-forcing. However, the cross-entropy loss is applied only to the tokens of C_{\text{target}}.

Mathematically, the loss objective is:

\mathcal{L}(\theta) = -\sum_{i \in M} \log P(t_i \mid \text{context}_{<i}; \theta)

where M indexes positions belonging to the target column across the batch.

  • Curriculum: Tables and columns are both sampled uniformly, with no explicit curriculum. Although the framework is designed to accommodate varying numbers of context rows, pretraining always uses 256 rows per example.
  • Masking: Causal masking during teacher-forcing allows each target cell to serve as both context and prediction target, depending on token position.

This approach allows the model to learn context adaptation and missing-value imputation without specific per-dataset fine-tuning or feature engineering.
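The masked objective above can be illustrated with a short PyTorch-style sketch (this is not the authors' training code; the tensor layout and the helper name target_imputation_loss are assumptions for the example):

import torch
import torch.nn.functional as F

def target_imputation_loss(logits, token_ids, target_mask):
    """Cross-entropy restricted to tokens of the sampled target column.

    logits:      [batch, seq_len, vocab]  next-token predictions
    token_ids:   [batch, seq_len]         serialized row tokens
    target_mask: [batch, seq_len] (bool)  True where a token belongs to C_target
    """
    # Standard causal shift: the logits at position i predict token i + 1.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = token_ids[:, 1:]
    shifted_mask = target_mask[:, 1:]

    per_token = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        reduction="none",
    )
    # Zero out every position outside the target column, then average.
    per_token = per_token * shifted_mask.reshape(-1).float()
    return per_token.sum() / shifted_mask.float().sum().clamp(min=1)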

4. Row Retrieval and Prompt Construction

TabGemma employs a lightweight n-gram based retrieval system to maximize the informative context within a fixed prompt size:

Row Embedding Steps:

  1. Character n-grams: For each cell string, extract n-grams (n=3–5 characters).
  2. Hash and Vectorization: Bag n-grams into a 256-dimensional vector per cell (count or normalized count).
  3. Concatenation: Concatenate per-cell vectors in feature column order to form a unified row embedding of dimension 256 \times N_{\text{columns}}.
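A minimal sketch of these three steps in Python (the specific hash function and the per-cell normalization are illustrative assumptions, not details from the paper):

import hashlib
import numpy as np

def cell_embedding(cell: str, dim: int = 256, n_range=(3, 4, 5)) -> np.ndarray:
    """Hash character n-grams of one cell string into a fixed-size count vector."""
    vec = np.zeros(dim, dtype=np.float32)
    for n in n_range:
        for i in range(max(len(cell) - n + 1, 0)):
            gram = cell[i:i + n]
            bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
            vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def row_embedding(cells: list) -> np.ndarray:
    """Concatenate per-cell vectors in column order: 256 * n_columns dimensions."""
    return np.concatenate([cell_embedding(str(c)) for c in cells])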

Retrieval Index:

  • L2-normalize all row embeddings.
  • Build a FAISS index (IVF or HNSW) on the embedding pool.
  • At inference, compute the query row's embedding, then retrieve the k most similar rows using Euclidean distance (or cosine similarity via dot product).
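A sketch of index construction and lookup with FAISS, building on the row_embedding helper above (an HNSW index with M = 32 is chosen here purely for illustration; the retrieval described above allows either IVF or HNSW):

import faiss
import numpy as np

def build_index(row_vectors: np.ndarray) -> faiss.Index:
    """row_vectors: [n_rows, dim] float32 row embeddings."""
    vecs = np.ascontiguousarray(row_vectors, dtype=np.float32)
    faiss.normalize_L2(vecs)                        # in-place L2 normalization
    index = faiss.IndexHNSWFlat(vecs.shape[1], 32)  # HNSW graph with M = 32 links
    index.add(vecs)
    return index

def retrieve(index: faiss.Index, query_vec: np.ndarray, k: int = 128) -> np.ndarray:
    """Return the indices of the k candidate rows nearest to the query row."""
    q = np.ascontiguousarray(query_vec.reshape(1, -1), dtype=np.float32)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return ids[0]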

Prompt Serialization:

Each retrieved row is serialized as:

<cell₁>⟨SEP⟩<cell₂>⟨SEP⟩...<cellₙ>⟨SEP⟩<target_cell>⟨EOR⟩
⟨SEP⟩ denotes the cell separator, and ⟨EOR⟩ the end of row.

The query row follows, with the target cell left blank. The resulting prompt, combining k exemplars plus the incomplete query row, fits within the ~128k-token window.

Example prompt with unknown target:

Name:Apple<SEP> Price:+1.2345e+02<SEP> Demand:High<EOR>
Name:Banana<SEP> Price:+5.6789e+01<SEP> Demand:Low<EOR>
Name:Cherry<SEP> Price:+2.5000e+02<SEP> Demand:<EOR>
The model autoregressively decodes the missing target cell.
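A sketch of how such a prompt might be assembled from retrieved exemplar rows (the column-prefixed cell format and the helper names serialize_row and build_prompt are assumptions modeled on the example above):

def serialize_row(cells: dict, target_col: str, include_target: bool) -> str:
    """Serialize one row as "col:value<SEP> ... target_col:value<EOR>"."""
    parts = [f"{col}:{val}" for col, val in cells.items() if col != target_col]
    target_val = cells.get(target_col, "") if include_target else ""
    parts.append(f"{target_col}:{target_val}")
    return "<SEP> ".join(parts) + "<EOR>"

def build_prompt(exemplars: list, query: dict, target_col: str) -> str:
    """Concatenate k retrieved exemplars followed by the incomplete query row."""
    lines = [serialize_row(r, target_col, include_target=True) for r in exemplars]
    lines.append(serialize_row(query, target_col, include_target=False))
    return "\n".join(lines)

# build_prompt([{"Name": "Apple", "Price": "+1.2345e+02", "Demand": "High"}],
#              {"Name": "Cherry", "Price": "+2.5000e+02"}, "Demand")
# reproduces the exemplar-plus-query structure of the example above.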

5. Empirical Evaluation and Performance

TabGemma was evaluated on three benchmark suites:

  • CARTE: 51 tasks (11 binary-classification, 40 regression), emphasizing semantic text.
  • TextTab: 21 tasks (9 classification, 12 regression), also semantically rich.
  • TabArena-Lite: 51 tasks (38 classification, 13 regression), more conventional and numerics-heavy.

Baselines: AutoGluon (stacked AutoML), ConTextTab, RealMLP, LightGBM, TabPFN, Random Forest, and naïve predictors.

Key findings for k = 128 retrieved exemplars:

Metric              CARTE         TextTab   TabArena   Overall
Accuracy (%)        79.3 (SOTA)   84.1      84.8       83.6
R^2 (regression)    70.3          31.6      57.8       60.7
  • For classification, TabGemma matches or surpasses all baselines—including AutoGluon and ConTextTab—across data regimes.
  • For regression, TabGemma achieves competitive performance in the small-sample (few-shot) regime (outperforming tuned LightGBM on CARTE), but degrades as task size increases in domains dominated by numeric correlation.

Sample efficiency and context scaling:

  • Across 128 to 8k training samples, TabGemma outperforms conventional methods on semantic classification; for regression it is superior up to approximately 1k samples, after which performance declines.
  • Increasing the context size from k = 8 to k = 128 yields monotonic gains in classification accuracy and R^2, with classification performance reaching state-of-the-art at the largest context.

6. Limitations and Prospects for Advancement

Identified constraints and open challenges include:

  1. Regression Performance: In numerics-dominated, high-data regimes (TextTab, TabArena), TabGemma underperforms specialized tabular models beyond roughly 1k samples.
  2. Context and Retrieval: Scaling context beyond 128k tokens remains problematic for very wide tables or large candidate pools. Retrieval precision diminishes as the number of columns increases or more rows must be considered.
  3. Numeric Modeling: Scientific notation canonicalization mitigates, but does not resolve, issues with continuous target prediction. Approaches incorporating real-valued output heads or advanced numeric embedding may yield improvements.
  4. Permutation Equivariance: Autoregressive row serialization enforces a fixed column order. Investigating row/column permutations, or adopting permutation-invariant architectures, could improve generalization and schema-adaptivity.

A plausible implication is that further advances in numeric modeling and prompt construction, especially for long-context and permutation-invariant architectures, are likely necessary for optimal LLM-based tabular regression.

7. Summary and Contextual Significance

TabGemma demonstrates that schema-agnostic, text-based LLMs—when equipped with numeric canonicalization, large-scale continued pretraining via a target-imputation objective, and efficient n-gram–driven exemplar retrieval—can act as competitive in-context learners for tabular prediction. Performance is strongest on semantically rich classification tasks, establishing new accuracy benchmarks. For regression, especially with extensive, numerics-dominated data, conventional ensemble and gradient-boosted methods still have the edge. TabGemma motivates exploration into more sophisticated numeric handling, scalable context management, and permutation-invariant modeling for generalized tabular ICL.
