MissBench: Tabular Imputation Benchmark
- MissBench is a comprehensive benchmark suite for missing-data imputation that incorporates 42 real-world datasets and 13 synthetic missingness patterns.
- It enforces a systematic, zero-shot evaluation protocol to fairly compare imputation methods across MCAR, MAR, and complex MNAR regimes.
- Empirical insights from MissBench reveal that while classical methods perform well under simple missingness, state-of-the-art models struggle with structured MNAR patterns.
MissBench is a comprehensive, publicly available benchmark suite for missing-data imputation in tabular settings. It was introduced in the context of “TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer” (Feitelberg et al., 3 Oct 2025), aiming to provide a rigorous, diverse, and scalable evaluation environment for comparing imputation methods across domains and missingness mechanisms. MissBench systematically addresses gaps in existing benchmarks by incorporating 42 real-world, fully observed OpenML datasets with a wide application range and 13 distinct synthetic missingness patterns, spanning the full complexity from MCAR/MAR to highly structured MNAR regimes.
1. Motivation and Objectives
MissBench was created to confront two major deficiencies in previous empirical standards:
- Existing imputation benchmarks, such as those associated with HyperImpute and GAIN, are limited both in dataset plurality and the diversity of missingness mechanisms.
- There was a lack of a unified, systematic suite for head-to-head comparison of imputation methods under zero-shot inference constraints—i.e., without access to held-out data or any hyperparameter tuning prior to imputation.
MissBench operationalizes Rubin’s missingness taxonomy, encompassing:
- Missing Completely At Random (MCAR): the probability that an entry is missing is independent of the data, $P(M_{ij}=1 \mid X) = P(M_{ij}=1)$,
- Missing At Random (MAR): missingness depends only on observed values, $P(M \mid X) = P(M \mid X_{\mathrm{obs}})$,
- Missing Not At Random (MNAR): $P(M \mid X)$ depends on the unobserved values $X_{\mathrm{mis}}$,
and substantially expands the coverage of MNAR regimes—an area historically underrepresented in benchmarking. The intent is to stress-test imputation algorithms across the array of practical patterns encountered in domains such as medicine, finance, engineering, ecology, and social science, and to enforce a standard evaluation protocol emulating operational constraints.
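The three mechanisms can be illustrated with a minimal NumPy sketch; the propensity coefficients and masking rates below are illustrative placeholders, not MissBench's configured values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # fully observed data matrix
p = 0.3                         # illustrative masking rate

# MCAR: every entry is masked i.i.d. with probability p,
# independent of the data values.
mask_mcar = rng.random(X.shape) < p

# MAR: the probability that column 1 is masked depends only on the
# always-observed "predictor" column 0, via a logistic propensity model.
propensity = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0)))
mask_mar = np.zeros(X.shape, dtype=bool)
mask_mar[:, 1] = rng.random(X.shape[0]) < propensity

# MNAR (self-masking): the masking probability of an entry depends on
# that entry's own, possibly unobserved, value.
mask_mnar = rng.random(X.shape) < 1.0 / (1.0 + np.exp(-X))
```

The key distinction is what the masking probability is allowed to condition on: nothing (MCAR), observed columns only (MAR), or the masked values themselves (MNAR).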
2. Dataset Composition and Domain Coverage
MissBench selects 42 datasets from OpenML, each fully numerical and devoid of pre-existing missing entries, enabling systematic synthetic masking and fair evaluation. The datasets cover a breadth of domains:
| Domain | Dataset Examples | Shape Range (rows × cols) |
|---|---|---|
| Anthropology | EgyptianSkulls | 150 × 5 |
| Biology | humans_numeric | 75 × 15 |
| Education/Economics | FacultySalaries, Student-Scores | 50–235 × 5–18 |
| Medicine | ICU, appendicitis_test, wisconsin | 106–200 × 8–33 |
| Environmental Sci. | SolarPower, Rainfall-in-Kerala | 60–204 × 5–18 |
| Chemistry | MercuryinBass, benzo32 | 53–195 × 6–33 |
| Finance/Fraud | creditscore, Swiss-banknote | 100–200 × 7 |
| Miscellaneous | machine_cpu, meta-features, etc. | 50–209 × 6–32 |
All datasets comprise real-valued features, enabling uniform application of missingness patterns and evaluation.
3. Missingness Patterns
MissBench encompasses 13 masking patterns:
- MCAR: Entrywise i.i.d. Bernoulli masking with a fixed probability.
- Col-MAR: Column-wise MAR via logistic propensity models over selected, always-observed "predictor" columns.
- Eleven MNAR mechanisms, characterizing complex dependencies:
- NN-MNAR: Neural network–parameterized entrywise masking probabilities.
- Seq-MNAR: Bandit-algorithm-based sequential masking, using algorithms such as $\epsilon$-greedy, UCB, or Thompson Sampling.
- Self-Masking-MNAR: Masking probability that depends on the entry's own value.
- Censoring-MNAR: Left or right censoring based on quantile thresholds.
- Panel-MNAR: Dropout on longitudinal data via a per-row stopping time, after which subsequent entries are masked.
- Polarization-MNAR: Deterministic masking of each column's middle quantiles, leaving only extreme values observed.
- Soft-Polarization-MNAR: Probabilistic masking proportional to absolute deviation from the column median.
- Latent-Factor-MNAR: Masking probability depends on latent factors and row/column biases.
- Cluster-MNAR: Probability as a function of latent row/column clusters.
- Two-Phase-MNAR: Block-masking guided by the value of “cheap” features.
- Block-MNAR: Convolutional blockwise masking with structured dependencies.
Hyperparameters for each masking pattern are provided in accompanying configuration files.
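As a concrete example, the two polarization mechanisms described above can be sketched as follows; the quantile bounds and decay scale are illustrative placeholders, not MissBench's published hyperparameters:

```python
import numpy as np

def polarization_mnar(X, lower=0.25, upper=0.75):
    """Hard variant: deterministically mask entries lying strictly
    between the lower and upper quantiles of their column, so that
    only extreme values stay observed."""
    lo = np.quantile(X, lower, axis=0)
    hi = np.quantile(X, upper, axis=0)
    return (X > lo) & (X < hi)

def soft_polarization_mnar(X, scale=1.0, seed=0):
    """Soft variant: masking probability decays with the absolute
    deviation from the column median, so near-median entries are
    masked most often."""
    rng = np.random.default_rng(seed)
    deviation = np.abs(X - np.median(X, axis=0))
    return rng.random(X.shape) < np.exp(-scale * deviation)

X = np.random.default_rng(1).normal(size=(100, 4))
mask = polarization_mnar(X)
mask_soft = soft_polarization_mnar(X)
```

Both mechanisms are MNAR because the masking decision for an entry depends on that entry's own value, which the imputer never observes.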
4. Evaluation Protocols and Metrics
For each (dataset, missingness) task, the benchmark follows this protocol:
- The observed data matrix $X$ is synthetically masked with the selected pattern to yield the incomplete matrix $\tilde{X}$ and a missingness mask $M \in \{0,1\}^{n \times d}$.
- The imputation method produces estimates $\hat{X}_{ij}$ for all masked entries $(i,j)$ with $M_{ij} = 1$.
- Root Mean Squared Error, $\mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} (\hat{X}_{ij} - X_{ij})^2}$, is computed over the set $\Omega$ of masked entries.
- RMSE scores are min-max normalized across methods for each task: Imputation Accuracy is defined as $1 - \frac{\mathrm{RMSE} - \mathrm{RMSE}_{\min}}{\mathrm{RMSE}_{\max} - \mathrm{RMSE}_{\min}}$, so that 1 indicates the best-performing method on a task and 0 the worst.
- The final method score is its mean Imputation Accuracy across all datasets and patterns.
Runtime per table entry is also logged on both CPU and GPU hardware.
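The scoring step above can be sketched directly from its definitions:

```python
import numpy as np

def rmse(X_true, X_hat, mask):
    """RMSE over the artificially masked (held-out) entries only."""
    diff = (X_hat - X_true)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def imputation_accuracy(rmses):
    """Min-max normalize the RMSEs of competing methods on one task:
    the best method on the task scores 1, the worst scores 0.  A
    method's final benchmark score is the mean of these values
    across all (dataset, pattern) tasks."""
    r = np.asarray(rmses, dtype=float)
    return 1.0 - (r - r.min()) / (r.max() - r.min())
```

Because accuracy is normalized per task, each task contributes equally to the final mean regardless of its absolute error scale.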
5. Software Suite and Reproducibility
MissBench is distributed as an open-source Python package, including:
- Utilities for dataset download, application of each missingness pattern with fixed seeds, and imputation result evaluation.
- JSON/YAML configuration files specifying hyperparameter choices for all masking patterns.
- Scripts for zero-shot evaluation, strictly preventing methods from training/tuning on held-out data.
- Data splits in which all available entries can be used by the imputer, and artificially masked entries form the test set—there is no train/test row split.
- Reproducibility guidelines: fixed seeds, containerized environments, and a standard comparison suite of 11 established imputation methods plus three TabImpute variants.
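The zero-shot protocol can be pictured as a driver loop over the task grid; the `impute_fn` interface and helper structure below are hypothetical sketches, not the package's actual API:

```python
import numpy as np

def evaluate_zero_shot(datasets, patterns, impute_fn, seed=0):
    """Sketch of the zero-shot protocol: the imputer sees every
    observed entry of each masked table (no train/test row split)
    and is scored only on the artificially masked entries.
    `impute_fn(X_masked, mask) -> X_hat` is a hypothetical interface."""
    scores = {}
    for name, X in datasets.items():
        for pat_name, make_mask in patterns.items():
            mask = make_mask(X, seed)          # fixed seed => reproducible
            X_masked = np.where(mask, np.nan, X)
            X_hat = impute_fn(X_masked, mask)  # no tuning on held-out data
            err = (X_hat - X)[mask]
            scores[(name, pat_name)] = float(np.sqrt(np.mean(err ** 2)))
    return scores
```

The essential constraint is that `impute_fn` receives only the masked table, so no hyperparameter search or fitting on the held-out entries is possible.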
Installation and benchmarking proceed via pip or direct repository cloning. A single command can execute the full matrix of 546 tasks (42 datasets × 13 patterns). All resources, code, and documentation are available at https://github.com/jacobf18/tabular.
6. Empirical Insights from MissBench
Evaluation of 11 classical and machine learning imputation methods on MissBench led to several findings:
- No classical method universally dominates; k-NN and tree-based methods exhibit strong performance under MCAR or simple MAR, but fail under complex MNAR.
- HyperImpute, the state-of-the-art on MAR/MCAR, degrades substantially on structured MNAR patterns such as panel-MNAR and polarization-MNAR.
- Iterative column-wise imputation from TabPFN is significantly slower and less accurate than entry-wise featurization approaches, motivating this methodological pivot.
- TabImpute+ (an ensemble of TabImpute and EWF-TabPFN) achieves the highest mean Imputation Accuracy across all tasks, outperforming HyperImpute, MissForest, and optimal-transport-based imputation.
- Under high MCAR missingness rates, methods leveraging generative pre-training (e.g., TabImpute+) perform notably better, as discriminative approaches lack sufficient observed data for parameter estimation.
- TabImpute and EWF-TabPFN demonstrate complementary strengths—TabImpute for structured MNAR, and EWF-TabPFN for latent factor models—while adaptive ensembling consistently exceeds the performance of individual models.
Collectively, these findings demonstrate the challenge posed by realistic missingness regimes and motivate continued methodological innovation for missing-data imputation (Feitelberg et al., 3 Oct 2025).
7. Significance and Impact
MissBench establishes a rigorous, reproducible, and comprehensive experimental standard for benchmarking imputation techniques in tabular domains. By spanning realistic domains and missingness mechanisms—including the previously unaddressed breadth of MNAR phenomena—MissBench enables fair, zero-shot comparative evaluations closely aligned with real-world operational requirements. Adoption of this benchmark is expected to facilitate progress in the development and robust evaluation of imputation methods for scientific and industrial data analyses.