TabGemma: Schema-Agnostic ICL for Tabular Data
- TabGemma is a schema-agnostic, text-based in-context learning framework for tabular prediction that integrates numeric canonicalization to handle mixed data types.
- It mitigates issues like numeric tokenization instability and prompt size limitations by converting continuous values into standardized scientific notation and employing an n-gram based retrieval system.
- Empirical evaluations on benchmarks like CARTE, TextTab, and TabArena-Lite reveal state-of-the-art classification performance and competitive sample efficiency in few-shot scenarios.
TabGemma is a schema-agnostic, text-based in-context learning (ICL) framework for tabular prediction using LLMs, specifically leveraging a continued-pretrained Gemma 3 12B model. It is designed to perform classification and regression tasks on general tabular data with mixed text, categorical, and numeric features, addressing key challenges of numeric token instability and limited context size. TabGemma establishes new state-of-the-art performance in classification on semantically rich tabular benchmarks and demonstrates competitive sample efficiency and scaling, particularly with increasing context size.
1. Challenges in Tabular ICL with LLMs
Tabular prediction via LLMs for rows containing mixed feature types (text, categorical, numeric) presents two primary difficulties:
- Numeric Tokenization Instability: Raw decimal representations (e.g., "3141.592") vary in string length and use locale-dependent symbols, causing inconsistent subword tokenization. This inconsistency precludes learning stable representations of magnitude and scale.
- Context Window Constraints: Even with large context capacity (128k tokens in Gemma 3), naïve text serialization of entire tables (or large numbers of rows) can rapidly consume the available prompt budget. This restricts the number of candidate exemplars available for ICL, particularly for large or wide tables, adversely impacting model performance.
2. Numeric Canonicalization
To mitigate unstable numeric tokenization, TabGemma converts every continuous value $x$ to a standardized signed scientific-notation string of the form $\pm m.mmmm\,\mathrm{e}\,{\pm}EE$, where the mantissa satisfies $1 \le m < 10$ and the exponent is a signed, zero-padded two-digit integer. The mantissa is written with a fixed number of digits; for example, $3141.592$ is encoded as `+3.1416e+03`.
This encoding yields several advantages:
- Locale Independence: The decimal separator is always ".", and the exponent is marked by "e".
- Tokenizer Reuse: Subpatterns such as "+", "e+0", and the fixed-width exponent field recur across values, reducing the token vocabulary needed to cover all magnitudes.
- Stable Embeddings: Avoids excessively long or fragmented tokenizations that would otherwise prevent the model from learning robust number representations.
A Python implementation of this transformation (written with a four-decimal mantissa to match the encodings shown in this summary):

```python
from math import floor, log10

def canonicalize(x: float, decimals: int = 4) -> str:
    """Encode a float in signed scientific notation, e.g. 3141.592 -> '+3.1416e+03'."""
    if x == 0:
        return f"+{0:.{decimals}f}e+00"
    sign = "+" if x > 0 else "-"
    a = abs(x)
    e = floor(log10(a))          # integer exponent
    m = a / (10 ** e)            # mantissa in [1, 10)
    m_str = f"{m:.{decimals}f}"
    if m_str.startswith("10"):   # rounding pushed the mantissa to 10.0000; renormalize
        e += 1
        m_str = f"{m / 10:.{decimals}f}"
    exp_sign = "+" if e >= 0 else "-"
    return f"{sign}{m_str}e{exp_sign}{abs(e):02d}"
```
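For illustration, a few encodings produced by this function (the input values are arbitrary examples):

```python
for v in (3141.592, 56.789, 250.0, 0.0031416, -0.5):
    print(canonicalize(v))
# +3.1416e+03
# +5.6789e+01
# +2.5000e+02
# +3.1416e-03
# -5.0000e-01
```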
3. Continued Pretraining: Target Imputation Objective
TabGemma adapts the base Gemma 3 architecture through further pretraining on real-world tables (the T4 corpus from Tabula 8B), optimizing a target imputation objective:
- Sampling: In each pretraining iteration, one table is sampled uniformly, and 256 rows are drawn from it.
- Target-Imputation Task: For every batch, a target column $y$ is selected at random. Rows are serialized with all features, including $y$, via teacher forcing; the cross-entropy loss, however, is applied only to the tokens of $y$.
Mathematically, the loss objective is

$$\mathcal{L} = -\sum_{t \in \mathcal{T}_y} \log p_\theta\!\left(x_t \mid x_{<t}\right),$$

where $\mathcal{T}_y$ indexes the token positions belonging to the target column $y$ across the batch.
- Curriculum: Both tables and columns are sampled uniformly, with no explicit curriculum. The serialization supports a variable number of context rows, but pretraining always uses $256$ rows per example.
- Masking: Causal masking during teacher-forcing allows each target cell to serve as both context and prediction target, depending on token position.
This approach allows the model to learn context adaptation and missing-value imputation without specific per-dataset fine-tuning or feature engineering.
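A minimal PyTorch-style sketch of this masked objective; the tensor names, shapes, and the `-100` ignore-index convention are illustrative assumptions rather than the paper's actual code:

```python
import torch
import torch.nn.functional as F

def target_imputation_loss(logits: torch.Tensor,      # (B, T, V) next-token logits
                           input_ids: torch.Tensor,   # (B, T) serialized rows incl. target column
                           target_mask: torch.Tensor  # (B, T) True where a token belongs to the target column
                           ) -> torch.Tensor:
    """Cross-entropy restricted to target-column tokens; all other positions serve as context only."""
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    shift_labels = input_ids[:, 1:].clone()
    shift_mask = target_mask[:, 1:]
    shift_labels[~shift_mask] = -100          # positions outside the target column are ignored
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```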
4. Row Retrieval and Prompt Construction
TabGemma employs a lightweight n-gram based retrieval system to maximize the informative context within a fixed prompt size:
Row Embedding Steps:
- Character n-grams: For each cell string, extract n-grams (n=3–5 characters).
- Hash and Vectorization: Hash each n-gram into one of 256 buckets and accumulate counts, yielding a 256-dimensional vector per cell (raw or normalized counts).
- Concatenation: Concatenate the per-cell vectors in feature-column order to form a unified row embedding of dimension $256 \cdot C$ for a table with $C$ columns.
Retrieval Index:
- L2-normalize all row embeddings.
- Build a FAISS index (IVF or HNSW) on the embedding pool.
- At inference, compute the query row's embedding and retrieve the most similar rows by Euclidean distance (equivalently, cosine similarity via dot product on the L2-normalized vectors); a sketch of the embedding and lookup steps follows below.
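A minimal sketch of the hashing-based row embedding and nearest-neighbor lookup; a flat inner-product FAISS index stands in for the IVF/HNSW index mentioned above, and the hash choice and helper names are illustrative:

```python
import hashlib
import numpy as np
import faiss  # pip install faiss-cpu

def cell_embedding(cell: str, dim: int = 256, ns=(3, 4, 5)) -> np.ndarray:
    """Hash character n-grams (n = 3-5) of one cell string into a fixed-size count vector."""
    v = np.zeros(dim, dtype=np.float32)
    for n in ns:
        for i in range(max(len(cell) - n + 1, 0)):
            bucket = int(hashlib.md5(cell[i:i + n].encode()).hexdigest(), 16) % dim
            v[bucket] += 1.0
    return v

def row_embedding(cells: list[str], dim: int = 256) -> np.ndarray:
    """Concatenate per-cell vectors in column order, then L2-normalize."""
    e = np.concatenate([cell_embedding(c, dim) for c in cells])
    norm = np.linalg.norm(e)
    return (e / norm if norm > 0 else e).astype(np.float32)

# Index the candidate pool and look up neighbors for a query row.
rows = [["Apple", "+1.2345e+02", "High"], ["Banana", "+5.6789e+01", "Low"]]
pool = np.stack([row_embedding(r) for r in rows])
index = faiss.IndexFlatIP(pool.shape[1])          # inner product == cosine on unit vectors
index.add(pool)
query = row_embedding(["Cherry", "+2.5000e+02", ""])[None, :]
scores, ids = index.search(query, 2)              # indices of the most similar rows
```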
Prompt Serialization:
Each retrieved row is serialized as:
```
<cell₁><SEP><cell₂><SEP>...<cellₙ><SEP><target_cell><EOR>
```
The query row follows, with the target cell left blank. The resulting prompt, combining exemplars plus the incomplete query row, fits within the 128k-token window.
Example prompt with unknown target:
```
Name:Apple<SEP> Price:+1.2345e+02<SEP> Demand:High<EOR>
Name:Banana<SEP> Price:+5.6789e+01<SEP> Demand:Low<EOR>
Name:Cherry<SEP> Price:+2.5000e+02<SEP> Demand:<EOR>
```
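A small helper reproducing this serialization; the `column:value` layout and the `<SEP>`/`<EOR>` markers follow the example above, and the function names are illustrative:

```python
SEP, EOR = "<SEP>", "<EOR>"

def serialize_row(columns: list[str], cells: list[str]) -> str:
    """Render one row as 'col:value<SEP> col:value<SEP> ... col:value<EOR>'."""
    return f"{SEP} ".join(f"{c}:{v}" for c, v in zip(columns, cells)) + EOR

def build_prompt(columns, exemplars, query_cells):
    """Retrieved exemplars first, then the query row with its target cell left blank."""
    rows = [serialize_row(columns, r) for r in exemplars] + [serialize_row(columns, query_cells)]
    return "\n".join(rows)

prompt = build_prompt(
    ["Name", "Price", "Demand"],
    [["Apple", "+1.2345e+02", "High"], ["Banana", "+5.6789e+01", "Low"]],
    ["Cherry", "+2.5000e+02", ""],   # target cell left blank; the model completes it
)
```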
5. Empirical Evaluation and Performance
TabGemma was evaluated on three benchmark suites:
- CARTE: 51 tasks (11 binary-classification, 40 regression), emphasizing semantic text.
- TextTab: 21 tasks (9 classification, 12 regression), also semantically rich.
- TabArena-Lite: 51 tasks (38 classification, 13 regression), more conventional and numerics-heavy.
Baselines: AutoGluon (stacked AutoML), ConTextTab, RealMLP, LightGBM, TabPFN, Random Forest, and naïve predictors.
Key results (with retrieved exemplars as in-context examples):
| Metric | CARTE | TextTab | TabArena | Overall |
|---|---|---|---|---|
| Accuracy (%) | 79.3 (SOTA) | 84.1 | 84.8 | 83.6 |
| R² (regression) | 70.3 | 31.6 | 57.8 | 60.7 |
- For classification, TabGemma matches or surpasses all baselines—including AutoGluon and ConTextTab—across data regimes.
- For regression, TabGemma achieves competitive performance in the small-sample (few-shot) regime, outperforming tuned LightGBM on CARTE, but degrades as dataset size grows in domains dominated by numeric correlations.
Sample efficiency and context scaling:
- Across training sets of 128–8k samples, TabGemma outperforms conventional methods on semantically rich classification tasks; for regression it stays ahead up to roughly 1k samples, after which its advantage declines.
- Increasing the context size (up to $128$) yields monotonic gains in classification accuracy and R², with the largest contexts reaching state-of-the-art classification performance.
6. Limitations and Prospects for Advancement
Identified constraints and open challenges include:
- Regression Performance: In numerics-dominated, high-data regimes (TextTab, TabArena), TabGemma trails specialized tabular models beyond roughly 1k samples.
- Context and Retrieval: Scaling context beyond 128k tokens remains problematic for very wide tables or large candidate pools. Retrieval precision diminishes as the number of columns increases or more rows must be considered.
- Numeric Modeling: Scientific notation canonicalization mitigates, but does not resolve, issues with continuous target prediction. Approaches incorporating real-valued output heads or advanced numeric embedding may yield improvements.
- Permutation Equivariance: Autoregressive row serialization enforces a fixed column order. Investigating row/column permutations, or adopting permutation-invariant architectures, could improve generalization and schema-adaptivity.
A plausible implication is that further advances in numeric modeling and prompt construction, especially for long-context and permutation-invariant architectures, are likely necessary for optimal LLM-based tabular regression.
7. Summary and Contextual Significance
TabGemma demonstrates that schema-agnostic, text-based LLMs—when equipped with numeric canonicalization, large-scale continued pretraining via a target-imputation objective, and efficient n-gram–driven exemplar retrieval—can act as competitive in-context learners for tabular prediction. Performance is strongest on semantically rich classification tasks, establishing new accuracy benchmarks. For regression, especially with extensive, numerics-dominated data, conventional ensemble and gradient-boosted methods still have the edge. TabGemma motivates exploration into more sophisticated numeric handling, scalable context management, and permutation-invariant modeling for generalized tabular ICL.