TabImpute: Zero-Shot Transformer Imputation
- The paper introduces a pre-trained transformer that zero-shot imputes missing numerical values using entry-wise featurization and adaptive ensembling.
- TabImpute leverages large-scale synthetic training on 25 million tables with diverse missingness patterns to achieve state-of-the-art imputation accuracy.
- Efficient deployment without dataset-specific tuning enables up to 100× speedup over prior methods while maintaining robust performance across various domains.
TabImpute is a pre-trained transformer-based framework for zero-shot imputation of missing values in fully numerical tabular data. It achieves fast, accurate performance across a diverse set of domains and missingness patterns by leveraging large-scale synthetic training, a novel entry-wise featurization, and adaptive ensembling with established imputation baselines. TabImpute is designed for direct deployment without dataset-specific fitting or tuning at inference time, representing a significant advance in the practical usability of transformer-based imputation methods (Feitelberg et al., 3 Oct 2025).
1. Model Architecture and Entry-Wise Featurization
TabImpute builds upon the TabPFN v2 transformer architecture, removing the prior’s attention mask so that all cells within a table can attend to one another. Each transformer layer alternates between inter-feature attention (across the entries of a row) and inter-sample attention (across the entries of a column), with residual connections, layer normalization, and feed-forward sublayers. The hyperparameters mirror TabPFN v2: 12 layers, 8 attention heads, embedding dimension 128, and feed-forward hidden dimension 512, supporting inference on common CPUs and GPUs at scale (Feitelberg et al., 3 Oct 2025).
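A minimal sketch of one such alternating-attention layer in PyTorch, using the quoted dimensions (12 such layers would be stacked); this is an illustrative simplification, not the exact TabPFN v2 block:

```python
import torch
import torch.nn as nn

class AlternatingAxisBlock(nn.Module):
    """One layer that attends across features within each row, then across
    samples within each column, with residuals, layer norm, and an FFN.
    A simplified sketch; d_model=128, 8 heads, and FFN width 512 mirror the
    hyperparameters quoted above."""
    def __init__(self, d_model=128, n_heads=8, d_ff=512):
        super().__init__()
        self.attn_feat = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_samp = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, h):                          # h: (rows, cols, d_model)
        x, _ = self.attn_feat(h, h, h)             # attend across each row's features
        h = self.norm1(h + x)
        ht = h.transpose(0, 1)                     # (cols, rows, d_model)
        x, _ = self.attn_samp(ht, ht, ht)          # attend across each column's samples
        h = self.norm2(h + x.transpose(0, 1))
        return self.norm3(h + self.ffn(h))

h = torch.randn(20, 8, 128)                        # 20 rows × 8 columns of cell embeddings
print(AlternatingAxisBlock()(h).shape)             # torch.Size([20, 8, 128])
```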
The core innovation in input representation is the entry-wise featurization. For every missing entry $(i, j)$ in a table $X \in \mathbb{R}^{n \times d}$, TabImpute constructs a raw feature vector

$$z_{ij} = \big[\, e_i,\; e_j,\; x_{i,:},\; x_{:,j},\; m_{ij} \,\big],$$

where $e_i$ and $e_j$ are one-hot encodings of the row and column indices, $x_{i,:}$ ($d$-dimensional) and $x_{:,j}$ ($n$-dimensional) are the observed row and column respectively (after normalization), and $m_{ij}$ is the missingness indicator for cell $(i, j)$. A linear layer projects $z_{ij}$ into the transformer input space. This approach allows fully parallelized processing of all missing entries in a table (Feitelberg et al., 3 Oct 2025).
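A minimal sketch of this featurization in NumPy, assuming per-column standardization and zero-filling of missing values in the copied row and column (the zero-fill and exact normalization are illustrative assumptions):

```python
import numpy as np

def entrywise_features(X, i, j):
    """Build the raw feature vector z_ij for cell (i, j) of an (n, d) table X,
    where np.nan marks missing entries."""
    n, d = X.shape
    mask = np.isnan(X)
    mu = np.nanmean(X, axis=0)                 # per-column normalization over
    sigma = np.nanstd(X, axis=0) + 1e-8        # observed values only
    Xn = (X - mu) / sigma
    Xn[mask] = 0.0                             # zero-fill missing entries (assumption)

    e_row = np.eye(n)[i]                       # one-hot row index (n-dim)
    e_col = np.eye(d)[j]                       # one-hot column index (d-dim)
    row = Xn[i, :]                             # normalized row (d-dim)
    col = Xn[:, j]                             # normalized column (n-dim)
    m_ij = np.array([float(mask[i, j])])       # missingness indicator for (i, j)

    # A learned linear layer would project this concatenation into the
    # transformer's 128-dimensional input space.
    return np.concatenate([e_row, e_col, row, col, m_ij])

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])
print(entrywise_features(X, 0, 2).shape)       # (3 + 3 + 3 + 3 + 1,) = (13,)
```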
2. Synthetic Training Data and Realistic Missingness Patterns
To obtain wide generalization and strong zero-shot behavior, TabImpute is pre-trained on 25 million synthetic tables generated from randomized low-rank linear factor models. For each table, the row and column dimensions are uniformly sampled from [20, 200], and entries are constructed as $X_{ij} = u_i^\top v_j$, where the latent factors $u_i$ and $v_j$ are drawn from a suite of distributions (e.g., Gaussian, Laplace, Dirichlet, spike-and-slab).
Missingness is simulated via 13 distinct mechanisms:
- MCAR: Entrywise Bernoulli random masking.
- MAR (“Col”): Column-dependent logistic masking conditioned on predictor columns.
- MNAR (11 mechanisms): Includes neural network masking, sequence masking with bandits, self-masking, censoring, panel dropout, hard/soft polarization, latent-factor and cluster-based mechanisms, and two-phase designs reflecting realistic data collection or selection biases.
Pattern-specific parameters (fractions, thresholds, etc.) are randomized for each synthetic training instance, ensuring broad coverage of missingness phenomena encountered in real-world tabular datasets (Feitelberg et al., 3 Oct 2025).
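A minimal sketch of this generation recipe with two of the simpler masking mechanisms (MCAR and the column-conditioned MAR); the rank, noise level, and logistic parameters are illustrative assumptions:

```python
import numpy as np

def sample_table(rng, rank=4):
    """Draw one synthetic table X = U V^T (+ small noise) from a low-rank
    linear factor model, with dimensions sampled uniformly from [20, 200]."""
    n = int(rng.integers(20, 201))
    d = int(rng.integers(20, 201))
    U = rng.laplace(size=(n, rank))            # factor distributions vary per table
    V = rng.normal(size=(d, rank))
    return U @ V.T + 0.1 * rng.normal(size=(n, d))

def mask_mcar(rng, X, p=0.3):
    """MCAR: entrywise Bernoulli(p) masking."""
    return rng.random(X.shape) < p

def mask_mar_col(rng, X, predictor=0, target=1, frac=0.3):
    """MAR ('Col'): mask the target column with probability increasing in a
    fully observed predictor column, via a logistic link."""
    z = (X[:, predictor] - X[:, predictor].mean()) / (X[:, predictor].std() + 1e-8)
    p = 1.0 / (1.0 + np.exp(-(z - np.quantile(z, 1 - frac))))
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, target] = rng.random(X.shape[0]) < p
    return mask

rng = np.random.default_rng(0)
X = sample_table(rng)
M = mask_mcar(rng, X)
X_obs = np.where(M, np.nan, X)                 # masked entries become NaN
print(X.shape, round(M.mean(), 3))
```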
3. Pre-Training Objective, Curriculum, and Zero-Shot Inference
TabImpute optimizes the negative log-likelihood of the ground-truth values at masked entries (which are observed in the synthetic training tables):

$$\mathcal{L}(\theta) = -\sum_{(i,j) \in \mathcal{M}} \log q_\theta\!\left(X_{ij} \mid z_{ij}\right),$$

where $\mathcal{M}$ is the set of masked cells and $q_\theta(\cdot \mid z_{ij})$ is a univariate predictive distribution computed from the transformer’s contextual output for cell $(i, j)$. The output head implements a Riemann distribution as in Müller et al. 2022 (Feitelberg et al., 3 Oct 2025). To maintain robust performance across all missingness types, an adaptive sampling curriculum periodically re-weights training batches by the observed pattern-wise NLL: patterns with higher error are up-weighted, ensuring uniform competence.
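A minimal sketch of both ingredients, using a piecewise-constant (Riemann-style) predictive distribution over fixed bins and a softmax re-weighting of patterns by running NLL; the bin count and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def riemann_nll(logits, targets, edges):
    """Per-entry NLL under a piecewise-constant density.
    logits: (B, K) bin scores; edges: (K+1,) bin edges; targets: (B,) values."""
    widths = edges[1:] - edges[:-1]                                # (K,)
    log_dens = F.log_softmax(logits, dim=-1) - torch.log(widths)   # mass -> density
    bins = torch.clamp(torch.bucketize(targets, edges) - 1, 0, len(widths) - 1)
    return -log_dens.gather(1, bins.unsqueeze(1)).squeeze(1)

def pattern_weights(pattern_nll, temperature=1.0):
    """Adaptive curriculum: up-weight missingness patterns with higher NLL."""
    names = sorted(pattern_nll)
    nll = torch.tensor([pattern_nll[p] for p in names])
    return dict(zip(names, F.softmax(nll / temperature, dim=0).tolist()))

edges = torch.linspace(0.0, 1.0, 9)                                # 8 bins on [0, 1]
logits = torch.randn(4, 8)
targets = torch.tensor([0.10, 0.40, 0.70, 0.95])
print(riemann_nll(logits, targets, edges).mean())
print(pattern_weights({"MCAR": 0.5, "MAR-Col": 0.9, "MNAR-self": 1.4}))
```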
For zero-shot imputation, TabImpute proceeds as follows:
- Normalize each feature and build entry-wise featurizations for all missing entries in the given table.
- Perform a single forward pass through the 12-layer transformer.
- For each missing entry, extract the mean of its predictive posterior as the imputed value.
No per-dataset fine-tuning or hyperparameter search is required. For practical deployment, chunking (minibatching) mitigates memory use on large tables (Feitelberg et al., 3 Oct 2025).
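A minimal sketch of this forward-only inference loop with chunked minibatching, reusing the `entrywise_features` helper sketched above and a stand-in `model` assumed to map featurized entries to posterior means:

```python
import numpy as np

def impute_zero_shot(model, X, chunk_size=4096):
    """Impute all NaN entries of X in a single forward-only pass, processing
    featurized entries in chunks to bound memory. `model(batch)` is assumed
    to return the posterior mean for each entry in the batch."""
    mask = np.isnan(X)
    missing = np.argwhere(mask)
    if len(missing) == 0:
        return X.copy()
    feats = np.stack([entrywise_features(X, i, j) for i, j in missing])

    preds = []
    for start in range(0, len(feats), chunk_size):      # chunking bounds peak memory
        preds.append(model(feats[start:start + chunk_size]))

    X_imp = X.copy()
    X_imp[mask] = np.concatenate(preds)                  # write posterior means back
    return X_imp

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0]])
dummy_model = lambda batch: np.zeros(len(batch))          # stand-in for the real model
print(impute_zero_shot(dummy_model, X))
```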
4. Benchmarking, Performance, and Computational Properties
TabImpute is evaluated on MissBench, a comprehensive benchmark pairing 42 fully numerical OpenML datasets with the 13 missingness patterns described above. Datasets span medicine, finance, engineering, ecology, and education, across a wide range of table sizes. The evaluation metric is RMSE on masked entries, normalized across all methods to yield an overall imputation accuracy score (higher is better).
TabImpute+ (the ensembled variant) achieves a mean accuracy of 0.833, outperforming all 11 competitors, including HyperImpute (0.766), optimal transport (0.765), and MissForest (0.754). Across individual patterns (MCAR, MAR, and the MNAR variants), TabImpute+ consistently retains high accuracy, with notable gains under complex non-random missingness (Feitelberg et al., 3 Oct 2025).
In terms of speed, TabImpute processes entries at 0.001 ms/entry on NVIDIA H200 GPUs, an up-to-100× speedup over prior methods such as TabPFN’s iterative imputer, and remains practical on CPUs with batching (Feitelberg et al., 3 Oct 2025).
| Method | Mean Accuracy | Std. Dev. | Inference Speed (ms/entry, GPU) |
|---|---|---|---|
| TabImpute+ | 0.833 | 0.213 | 0.001 |
| HyperImpute | 0.766 | 0.259 | n/a |
| OT | 0.765 | 0.227 | n/a |
| MissForest | 0.754 | 0.248 | n/a |
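A minimal sketch of the base metric, RMSE over the artificially masked entries; the min-max normalization into an accuracy score below is an illustrative assumption, since the paper's exact normalization is not reproduced here:

```python
import numpy as np

def masked_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the artificially masked entries."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def normalized_accuracy(rmse_by_method):
    """Map per-method RMSEs to a [0, 1] accuracy (higher is better) via
    min-max scaling. This scaling rule is an assumption for illustration."""
    names = list(rmse_by_method)
    r = np.array([rmse_by_method[m] for m in names], dtype=float)
    acc = 1.0 - (r - r.min()) / (r.max() - r.min() + 1e-12)
    return dict(zip(names, acc.round(3)))

print(normalized_accuracy({"method_a": 0.21, "method_b": 0.29, "method_c": 0.31}))
```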
5. Comparative Methodology and Relationship to Related Approaches
TabImpute improves significantly over TabPFN’s iterative imputation by parallelizing over all entries and adopting richer, more realistic missingness patterns in training. Unlike operator-based methods (SoftImpute), probabilistic iterative methods (MICE, ICE), or classical random-forest imputation (MissForest), it requires no per-table tuning or iterative fitting. In contrast to deep generative methods such as GAIN or diffusion-based imputers, TabImpute’s transformer is used as a forward-only function after pre-training, resulting in substantially lower inference cost with no loss in overall accuracy (Feitelberg et al., 3 Oct 2025).
Whereas approaches like TREB (Wang et al., 16 Sep 2024) adapt masked-language-model objectives in BERT or RoBERTa for continuous imputation, and TabINR (Ochs et al., 1 Oct 2025) models tables as neural functions with learnable embeddings, TabImpute’s foundation-model paradigm and comprehensive synthetic pre-training make it agnostic to dataset size, domain, and missingness pattern, provided all features are numerical.
6. Deployment, Extensions, and Limitations
TabImpute is designed for direct zero-shot application. Deployment requires only standard normalization of observed features and construction of the entry-wise feature vectors $z_{ij}$ for all missing entries. For optimal results, the adaptive ensemble (TabImpute+) can combine TabImpute’s output with that of an entry-wise-featurized TabPFN (EWF-TabPFN).
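The paper's exact ensembling rule is not reproduced here; below is a minimal sketch of one plausible adaptive scheme that re-masks a fraction of observed entries for validation and selects, per table, whichever candidate imputer (e.g., TabImpute or EWF-TabPFN) achieves lower held-out error. All names and the selection rule are assumptions:

```python
import numpy as np

def adaptive_ensemble(imputers, X, val_frac=0.1, seed=0):
    """Select the best imputer per table by RMSE on temporarily re-masked
    observed entries, then apply it to the real missing entries.
    `imputers` maps a name to a function: array-with-NaNs -> imputed array."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))
    held_out = observed[rng.random(len(observed)) < val_frac]

    X_val = X.copy()
    X_val[held_out[:, 0], held_out[:, 1]] = np.nan       # re-mask for validation

    def val_rmse(fn):
        pred = fn(X_val)[held_out[:, 0], held_out[:, 1]]
        true = X[held_out[:, 0], held_out[:, 1]]
        return np.sqrt(np.mean((pred - true) ** 2))

    best = min(imputers, key=lambda name: val_rmse(imputers[name]))
    return imputers[best](X), best
```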
Limitations include:
- Scalability: The cost of full attention over all table cells restricts applicability to moderately sized tables. Future research into sparse or linear-cost attention may extend scalability.
- Categorical Data: Native support is restricted to real-valued inputs; proposed extensions involve categorical feature discretization or vector embedding.
- Multiple Imputation: The model yields a posterior for each entry, enabling principled multiple imputation or uncertainty quantification (see the sketch after this list).
- Causal/Panel Data: TabImpute does not directly address causal or panel missing data; these are possible future benchmarks.
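A minimal sketch of how multiple imputations could be drawn from a per-entry piecewise-constant (Riemann-style) posterior; the bin edges and probabilities are illustrative placeholders, not model outputs:

```python
import numpy as np

def sample_from_riemann(probs, edges, rng, num_draws=5):
    """Draw imputations from a piecewise-constant posterior: pick a bin by its
    probability mass, then sample uniformly within that bin."""
    bins = rng.choice(len(probs), size=num_draws, p=probs)
    return rng.uniform(edges[bins], edges[bins + 1])

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 9)                              # 8 bins on [0, 1]
probs = np.array([0.02, 0.05, 0.10, 0.35, 0.30, 0.10, 0.05, 0.03])
print(sample_from_riemann(probs, edges, rng))                 # 5 plausible values for one cell
```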
The open-source codebase, pretrained weights, and evaluation suite are available at https://github.com/jacobf18/tabular (Feitelberg et al., 3 Oct 2025).
7. Significance and Perspectives
TabImpute demonstrates that transformer-based models, when trained on massive, richly parameterized synthetic tables, can achieve state-of-the-art zero-shot imputation accuracy and efficiency without reliance on dataset-specific fitting or classical iterative optimization. Its strong generalization across 13 diverse missingness patterns indicates broad applicability in research and industry, particularly in reproducible science and high-throughput analytics where pipeline simplicity and inference speed are crucial. Continuing development directions include principled scaling, generalization to mixed-type data, and expansion to specialized nonlinear generation scenarios (Feitelberg et al., 3 Oct 2025).