MissBench: Tabular Imputation Benchmark
- MissBench is a comprehensive benchmark suite for missing-data imputation that incorporates 42 real-world datasets and 13 synthetic missingness patterns.
- It enforces a systematic, zero-shot evaluation protocol to fairly compare imputation methods across MCAR, MAR, and complex MNAR regimes.
- Empirical insights from MissBench reveal that while classical methods perform well under simple missingness, state-of-the-art models struggle with structured MNAR patterns.
MissBench is a comprehensive, publicly available benchmark suite for missing-data imputation in tabular settings. It was introduced in the context of “TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer” (Feitelberg et al., 3 Oct 2025), aiming to provide a rigorous, diverse, and scalable evaluation environment for comparing imputation methods across domains and missingness mechanisms. MissBench systematically addresses gaps in existing benchmarks by incorporating 42 real-world, fully observed OpenML datasets with a wide application range and 13 distinct synthetic missingness patterns, spanning the full complexity from MCAR/MAR to highly structured MNAR regimes.
1. Motivation and Objectives
MissBench was created to confront two major deficiencies in previous empirical standards:
- Existing imputation benchmarks, such as those associated with HyperImpute and GAIN, are limited both in dataset plurality and the diversity of missingness mechanisms.
- There was a lack of a unified, systematic suite for head-to-head comparison of imputation methods under zero-shot inference constraints—i.e., without access to held-out data or any hyperparameter tuning prior to imputation.
MissBench operationalizes Rubin’s missingness taxonomy, encompassing:
- Missing Completely At Random (MCAR): the probability that an entry is missing is independent of the data, $P(M_{ij}=1 \mid X) = P(M_{ij}=1)$,
- Missing At Random (MAR): missingness depends only on observed values, $P(M \mid X) = P(M \mid X_{\mathrm{obs}})$,
- Missing Not At Random (MNAR): $P(M \mid X)$ depends on the unobserved values $X_{\mathrm{mis}}$,
and substantially expands the coverage of MNAR regimes—an area historically underrepresented in benchmarking. The intent is to stress-test imputation algorithms across the array of practical patterns encountered in domains such as medicine, finance, engineering, ecology, and social science, and to enforce a standard evaluation protocol emulating operational constraints.
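The three mechanisms can be illustrated with a minimal NumPy sketch; the propensity coefficients and masking rates below are illustrative placeholders, not MissBench's configured values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # fully observed data matrix
p = 0.3                         # illustrative masking rate

# MCAR: every entry is masked i.i.d. with probability p,
# independent of the data values.
mask_mcar = rng.random(X.shape) < p

# MAR: the probability that column 1 is masked depends only on the
# always-observed "predictor" column 0, via a logistic propensity model.
propensity = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0)))
mask_mar = np.zeros(X.shape, dtype=bool)
mask_mar[:, 1] = rng.random(X.shape[0]) < propensity

# MNAR (self-masking): the masking probability of an entry depends on
# that entry's own, possibly unobserved, value.
mask_mnar = rng.random(X.shape) < 1.0 / (1.0 + np.exp(-X))
```

The key distinction is what the masking probability is allowed to condition on: nothing (MCAR), observed columns only (MAR), or the masked values themselves (MNAR).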
2. Dataset Composition and Domain Coverage
MissBench selects 42 datasets from OpenML, each fully numerical and devoid of pre-existing missing entries, enabling systematic synthetic masking and fair evaluation. The datasets cover a breadth of domains:
| Domain | Dataset Examples | Shape Range (rows × cols) |
|---|---|---|
| Anthropology | EgyptianSkulls | 150 × 5 |
| Biology | humans_numeric | 75 × 15 |
| Education/Economics | FacultySalaries, Student-Scores | 50–235 × 5–18 |
| Medicine | ICU, appendicitis_test, wisconsin | 106–200 × 8–33 |
| Environmental Sci. | SolarPower, Rainfall-in-Kerala | 60–204 × 5–18 |
| Chemistry | MercuryinBass, benzo32 | 53–195 × 6–33 |
| Finance/Fraud | creditscore, Swiss-banknote | 100–200 × 7 |
| Miscellaneous | machine_cpu, meta-features, etc. | 50–209 × 6–32 |
All datasets comprise real-valued features, enabling uniform application of missingness patterns and evaluation.
3. Missingness Patterns
MissBench encompasses 13 masking patterns:
- MCAR: Entrywise i.i.d. Bernoulli masking with a fixed probability.
- Col-MAR: Column-wise MAR via logistic propensity models over selected, always-observed "predictor" columns.
- Eleven MNAR mechanisms, characterizing complex dependencies:
- NN-MNAR: Neural network–parameterized entrywise masking probabilities.
- Seq-MNAR: Bandit-algorithm-based sequential masking, using algorithms such as $\epsilon$-greedy, UCB, or Thompson Sampling.
- Self-Masking-MNAR: Masking probability that depends on the entry's own value.
- Censoring-MNAR: Left or right censoring based on quantile thresholds.
- Panel-MNAR: Dropout on longitudinal data via a per-row stopping time, after which subsequent entries are masked.
- Polarization-MNAR: Deterministic masking of each column's middle quantiles, leaving only extreme values observed.
- Soft-Polarization-MNAR: Probabilistic masking proportional to absolute deviation from the column median.
- Latent-Factor-MNAR: Masking probability depends on latent factors and row/column biases.
- Cluster-MNAR: Probability as a function of latent row/column clusters.
- Two-Phase-MNAR: Block-masking guided by the value of “cheap” features.
- Block-MNAR: Convolutional blockwise masking with structured dependencies.
Hyperparameters for each masking pattern are provided in accompanying configuration files.
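As a concrete example, the two polarization mechanisms described above can be sketched as follows; the quantile bounds and decay scale are illustrative placeholders, not MissBench's published hyperparameters:

```python
import numpy as np

def polarization_mnar(X, lower=0.25, upper=0.75):
    """Hard variant: deterministically mask entries lying strictly
    between the lower and upper quantiles of their column, so that
    only extreme values stay observed."""
    lo = np.quantile(X, lower, axis=0)
    hi = np.quantile(X, upper, axis=0)
    return (X > lo) & (X < hi)

def soft_polarization_mnar(X, scale=1.0, seed=0):
    """Soft variant: masking probability decays with the absolute
    deviation from the column median, so near-median entries are
    masked most often."""
    rng = np.random.default_rng(seed)
    deviation = np.abs(X - np.median(X, axis=0))
    return rng.random(X.shape) < np.exp(-scale * deviation)

X = np.random.default_rng(1).normal(size=(100, 4))
mask = polarization_mnar(X)
mask_soft = soft_polarization_mnar(X)
```

Both mechanisms are MNAR because the masking decision for an entry depends on that entry's own value, which the imputer never observes.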
4. Evaluation Protocols and Metrics
For each (dataset, missingness) task, the benchmark follows this protocol:
- The observed data matrix $X$ is synthetically masked with the selected pattern to yield the incomplete matrix $\tilde{X}$ and a missingness mask $M \in \{0,1\}^{n \times d}$.
- The imputation method produces estimates $\hat{X}_{ij}$ for all masked entries $(i,j)$ with $M_{ij} = 1$.
- Root Mean Squared Error, $\mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} (\hat{X}_{ij} - X_{ij})^2}$, is computed over the set $\Omega$ of masked entries.
- RMSE scores are min-max normalized across methods for each task: Imputation Accuracy is defined as $1 - \frac{\mathrm{RMSE} - \mathrm{RMSE}_{\min}}{\mathrm{RMSE}_{\max} - \mathrm{RMSE}_{\min}}$, so that 1 indicates the best-performing method on a task and 0 the worst.
- The final method score is its mean Imputation Accuracy across all datasets and patterns.
Runtime per table entry is also logged on both CPU and GPU hardware.
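The scoring step above can be sketched directly from its definitions:

```python
import numpy as np

def rmse(X_true, X_hat, mask):
    """RMSE over the artificially masked (held-out) entries only."""
    diff = (X_hat - X_true)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def imputation_accuracy(rmses):
    """Min-max normalize the RMSEs of competing methods on one task:
    the best method on the task scores 1, the worst scores 0.  A
    method's final benchmark score is the mean of these values
    across all (dataset, pattern) tasks."""
    r = np.asarray(rmses, dtype=float)
    return 1.0 - (r - r.min()) / (r.max() - r.min())
```

Because accuracy is normalized per task, each task contributes equally to the final mean regardless of its absolute error scale.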
5. Software Suite and Reproducibility
MissBench is distributed as an open-source Python package, including:
- Utilities for dataset download, application of each missingness pattern with fixed seeds, and imputation result evaluation.
- JSON/YAML configuration files specifying hyperparameter choices for all masking patterns.
- Scripts for zero-shot evaluation, strictly preventing methods from training/tuning on held-out data.
- Data splits in which all available entries can be used by the imputer, and artificially masked entries form the test set—there is no train/test row split.
- Reproducibility guidelines: fixed seeds, containerized environments, and a standard comparison suite of 11 established imputation methods plus three TabImpute variants.
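The zero-shot protocol can be pictured as a driver loop over the task grid; the `impute_fn` interface and helper structure below are hypothetical sketches, not the package's actual API:

```python
import numpy as np

def evaluate_zero_shot(datasets, patterns, impute_fn, seed=0):
    """Sketch of the zero-shot protocol: the imputer sees every
    observed entry of each masked table (no train/test row split)
    and is scored only on the artificially masked entries.
    `impute_fn(X_masked, mask) -> X_hat` is a hypothetical interface."""
    scores = {}
    for name, X in datasets.items():
        for pat_name, make_mask in patterns.items():
            mask = make_mask(X, seed)          # fixed seed => reproducible
            X_masked = np.where(mask, np.nan, X)
            X_hat = impute_fn(X_masked, mask)  # no tuning on held-out data
            err = (X_hat - X)[mask]
            scores[(name, pat_name)] = float(np.sqrt(np.mean(err ** 2)))
    return scores
```

The essential constraint is that `impute_fn` receives only the masked table, so no hyperparameter search or fitting on the held-out entries is possible.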
Installation and benchmarking proceed via pip or direct repository cloning. A single command can execute the full matrix of 546 tasks (42 datasets × 13 patterns). All resources, code, and documentation are available at https://github.com/jacobf18/tabular.
6. Empirical Insights from MissBench
Evaluation of 11 classical and machine learning imputation methods on MissBench led to several findings:
- No classical method universally dominates; k-NN and tree-based methods exhibit strong performance under MCAR or simple MAR, but fail under complex MNAR.
- HyperImpute, the state-of-the-art on MAR/MCAR, degrades substantially on structured MNAR patterns such as panel-MNAR and polarization-MNAR.
- Iterative column-wise imputation from TabPFN is significantly slower and less accurate than entry-wise featurization approaches, motivating this methodological pivot.
- TabImpute+ (an ensemble of TabImpute and EWF-TabPFN) achieves the highest mean Imputation Accuracy across all tasks, outperforming HyperImpute, MissForest, and optimal-transport-based imputation.
- Under high MCAR missingness rates, methods leveraging generative pre-training (e.g., TabImpute+) perform notably better, as discriminative approaches lack sufficient observed data for parameter estimation.
- TabImpute and EWF-TabPFN demonstrate complementary strengths—TabImpute for structured MNAR, and EWF-TabPFN for latent factor models—while adaptive ensembling consistently exceeds the performance of individual models.
Collectively, these findings demonstrate the challenge posed by realistic missingness regimes and motivate continued methodological innovation for missing-data imputation (Feitelberg et al., 3 Oct 2025).
7. Significance and Impact
MissBench establishes a rigorous, reproducible, and comprehensive experimental standard for benchmarking imputation techniques in tabular domains. By spanning realistic domains and missingness mechanisms—including the previously unaddressed breadth of MNAR phenomena—MissBench enables fair, zero-shot comparative evaluations closely aligned with real-world operational requirements. Adoption of this benchmark is expected to facilitate progress in the development and robust evaluation of imputation methods for scientific and industrial data analyses.