
MissBench: Tabular Imputation Benchmark

Updated 20 December 2025
  • MissBench is a comprehensive benchmark suite for missing-data imputation that incorporates 42 real-world datasets and 13 synthetic missingness patterns.
  • It enforces a systematic, zero-shot evaluation protocol to fairly compare imputation methods across MCAR, MAR, and complex MNAR regimes.
  • Empirical insights from MissBench reveal that while classical methods perform well under simple missingness, state-of-the-art models struggle with structured MNAR patterns.

MissBench is a comprehensive, publicly available benchmark suite for missing-data imputation in tabular settings. It was introduced in the context of “TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer” (Feitelberg et al., 3 Oct 2025), aiming to provide a rigorous, diverse, and scalable evaluation environment for comparing imputation methods across domains and missingness mechanisms. MissBench systematically addresses gaps in existing benchmarks by incorporating 42 real-world, fully observed OpenML datasets with a wide application range and 13 distinct synthetic missingness patterns, spanning the full complexity from MCAR/MAR to highly structured MNAR regimes.

1. Motivation and Objectives

MissBench was created to confront two major deficiencies in previous empirical standards:

  • Existing imputation benchmarks, such as those associated with HyperImpute and GAIN, are limited in both the number of datasets and the diversity of missingness mechanisms they cover.
  • There was a lack of a unified, systematic suite for head-to-head comparison of imputation methods under zero-shot inference constraints—i.e., without access to held-out data or any hyperparameter tuning prior to imputation.

MissBench operationalizes Rubin’s missingness taxonomy, encompassing:

  • Missing Completely At Random (MCAR): $P(M \mid X) = P(M)$,
  • Missing At Random (MAR): $P(M \mid X_{obs}, X_{mis}) = P(M \mid X_{obs})$,
  • Missing Not At Random (MNAR): $P(M \mid X_{obs}, X_{mis})$ depends on $X_{mis}$,

and substantially expands the coverage of MNAR regimes—an area historically underrepresented in benchmarking. The intent is to stress-test imputation algorithms across the array of practical patterns encountered in domains such as medicine, finance, engineering, ecology, and social science, and to enforce a standard evaluation protocol emulating operational constraints.
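The three mechanisms can be made concrete with a short sketch. This is an illustrative NumPy example, not MissBench code; the logistic coefficients and masking probabilities are arbitrary choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # fully observed data matrix X*

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every entry is masked independently with the same probability p.
mask_mcar = rng.random(X.shape) < 0.3

# MAR: masking of column 1 depends only on the fully observed column 0.
p_mar = sigmoid(2.0 * X[:, 0])
mask_mar = np.zeros(X.shape, dtype=bool)
mask_mar[:, 1] = rng.random(len(X)) < p_mar

# MNAR (self-masking): masking of column 2 depends on its own values.
p_mnar = sigmoid(2.5 * X[:, 2])
mask_mnar = np.zeros(X.shape, dtype=bool)
mask_mnar[:, 2] = rng.random(len(X)) < p_mnar
```

The distinction is what the masking probability may depend on: nothing (MCAR), observed values only (MAR), or the missing values themselves (MNAR).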

2. Dataset Composition and Domain Coverage

MissBench selects 42 datasets from OpenML, each fully numerical and devoid of pre-existing missing entries, enabling systematic synthetic masking and fair evaluation. The datasets cover a breadth of domains:

Domain               | Dataset Examples                  | Shape Range (rows × cols)
---------------------|-----------------------------------|--------------------------
Anthropology         | EgyptianSkulls                    | 150 × 5
Biology              | humans_numeric                    | 75 × 15
Education/Economics  | FacultySalaries, Student-Scores   | 50–235 × 5–18
Medicine             | ICU, appendicitis_test, wisconsin | 106–200 × 8–33
Environmental Sci.   | SolarPower, Rainfall-in-Kerala    | 60–204 × 5–18
Chemistry            | MercuryinBass, benzo32            | 53–195 × 6–33
Finance/Fraud        | creditscore, Swiss-banknote       | 100–200 × 7
Miscellaneous        | machine_cpu, meta-features, etc.  | 50–209 × 6–32

All datasets comprise real-valued features, enabling uniform application of missingness patterns and evaluation.
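Both selection criteria are mechanical to verify. A minimal sketch, assuming a pandas DataFrame and using a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def eligible_for_missbench(df: pd.DataFrame) -> bool:
    """Check the two selection criteria described above:
    all features numeric, and no entry already missing."""
    all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
    fully_observed = not df.isna().any().any()
    return all_numeric and fully_observed

# A fully observed numeric table qualifies...
ok = eligible_for_missbench(pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}))
# ...a table with NaNs or string features does not.
bad = eligible_for_missbench(pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]}))
```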

3. Missingness Patterns

MissBench encompasses 13 masking patterns:

  • MCAR: $M_{ij} \sim \text{Bernoulli}(p)$ i.i.d.
  • Col-MAR: Column-wise MAR via logistic propensity models over selected “predictor” columns $C_p$.
  • Eleven MNAR mechanisms, characterizing complex dependencies:

    1. NN-MNAR: Neural network–parameterized entrywise masking, $p_{ij} = g_{ij}(X^*_{ij})$.
    2. Seq-MNAR: Bandit-algorithm-based sequential masking, using algorithms such as $\varepsilon$-greedy, UCB, or Thompson Sampling.
    3. Self-Masking-MNAR: Masking probability via $M_{ij} \sim \text{Bernoulli}(\sigma(\alpha X^*_{ij} + \beta_0))$.
    4. Censoring-MNAR: Left or right censoring based on quantile thresholds.
    5. Panel-MNAR: Dropout on longitudinal data by per-row stopping time $\tau_i$.
    6. Polarization-MNAR: Masking of middle quantiles, applied if $L_j < X^*_{ij} < H_j$.
    7. Soft-Polarization-MNAR: Probabilistic masking proportional to absolute deviation from the column median.
    8. Latent-Factor-MNAR: Masking probability depends on latent factors $U, V$ and row/column biases.
    9. Cluster-MNAR: Probability as a function of latent row/column clusters.
    10. Two-Phase-MNAR: Block-masking guided by the value of “cheap” features.
    11. Block-MNAR: Convolutional blockwise masking with structured dependencies.

Hyperparameters (e.g., $q_{censor}=0.25$, $q_{thresh}=0.25$, $\alpha=2.5$, $\varepsilon=0.05$) are provided in accompanying configuration files.
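Two of the simpler mechanisms can be sketched directly in NumPy. This is an illustrative reconstruction using the quoted $q_{censor}=0.25$ default; the polarization band width is an assumed value, not a documented one:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))  # fully observed matrix X*

# Censoring-MNAR (right censoring): mask the top q_censor fraction per column.
q_censor = 0.25
hi = np.quantile(X, 1.0 - q_censor, axis=0)
mask_censor = X > hi

# Polarization-MNAR: mask the middle quantile band (L_j, H_j) of each column.
band = 0.25  # assumed half-width of the masked band
L = np.quantile(X, 0.5 - band, axis=0)
H = np.quantile(X, 0.5 + band, axis=0)
mask_polar = (X > L) & (X < H)
```

Both masks are deterministic functions of the unobserved values themselves, which is what places them in the MNAR class.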

4. Evaluation Protocols and Metrics

For each (dataset, missingness) task, the benchmark follows this protocol:

  1. The observed data matrix $X^*$ is synthetically masked with the selected pattern to yield $X$ and a missingness mask $\Omega$.
  2. The imputation method produces $X^{imp}_{ij}$ for all $(i,j) \in \Omega$.
  3. Root Mean Squared Error (RMSE), $\sqrt{(1/|\Omega|) \sum_{(i,j)\in\Omega} (X^*_{ij} - X^{imp}_{ij})^2}$, is computed.
  4. RMSE scores are min-max normalized across methods for each task:

$$\mathrm{NormRMSE}_{\mathrm{method}} = \frac{\mathrm{RMSE}_{\mathrm{method}} - \min \mathrm{RMSE}}{\max \mathrm{RMSE} - \min \mathrm{RMSE}}$$

  5. Imputation Accuracy is defined as $1-\mathrm{NormRMSE}$, so that 1 indicates the best performance and 0 the worst.
  6. The final method score is its mean Imputation Accuracy across all datasets and patterns.

Runtime per table entry is also logged on both CPU and GPU hardware.
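The metrics above reduce to a few lines of code. A minimal sketch with toy data and hypothetical helper names, comparing three imputers of increasing noise on a single task:

```python
import numpy as np

def rmse(X_true, X_imp, mask):
    """RMSE over the artificially masked entries only."""
    diff = (X_true - X_imp)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def imputation_accuracy(rmses):
    """Min-max normalize per-task RMSEs across methods, then invert:
    1 = best method on this task, 0 = worst."""
    r = np.asarray(rmses, dtype=float)
    norm = (r - r.min()) / (r.max() - r.min())
    return 1.0 - norm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))           # ground-truth X*
mask = rng.random(X.shape) < 0.3       # synthetic missingness Omega

# Three "methods": the truth corrupted by increasing amounts of noise.
scores = [rmse(X, X + rng.normal(scale=s, size=X.shape), mask)
          for s in (0.1, 0.5, 1.0)]
acc = imputation_accuracy(scores)      # lowest-RMSE method gets accuracy 1.0
```

A method's final benchmark score is then the mean of these per-task accuracies over all 546 (dataset, pattern) combinations.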

5. Software Suite and Reproducibility

MissBench is distributed as an open-source Python package, including:

  • Utilities for dataset download, application of each missingness pattern with fixed seeds, and imputation result evaluation.
  • JSON/YAML configuration files specifying hyperparameter choices for all masking patterns.
  • Scripts for zero-shot evaluation, strictly preventing methods from training/tuning on held-out data.
  • Data splits in which all available entries can be used by the imputer, and artificially masked entries form the test set—there is no train/test row split.
  • Reproducibility guidelines: fixed seeds, containerized environments, and a standard comparison suite of 11 established imputation methods plus three TabImpute variants.

Installation and benchmarking proceed via pip or direct repository cloning. A single command executes the full matrix of 546 tasks (42 datasets $\times$ 13 patterns). All resources, code, and documentation are available at https://github.com/jacobf18/tabular.

6. Empirical Insights from MissBench

Evaluation of 11 classical and machine learning imputation methods on MissBench led to several findings:

  • No classical method universally dominates; k-NN and tree-based methods exhibit strong performance under MCAR or simple MAR, but fail under complex MNAR.
  • HyperImpute, the state-of-the-art on MAR/MCAR, degrades substantially on structured MNAR patterns such as panel-MNAR and polarization-MNAR.
  • Iterative column-wise imputation with TabPFN is significantly slower and less accurate than entry-wise featurization, motivating TabImpute's entry-wise design.
  • TabImpute+ (an ensemble of TabImpute and EWF-TabPFN) achieves superior mean Imputation Accuracy ($\approx 0.83$ across all tasks), outperforming HyperImpute ($\approx 0.77$), MissForest ($\approx 0.75$), and optimal transport ($\approx 0.77$).
  • Under high missingness (MCAR $p \geq 0.5$), methods leveraging generative pre-training (e.g., TabImpute+) perform notably better, as discriminative approaches lack sufficient observed data for parameter estimation.
  • TabImpute and EWF-TabPFN demonstrate complementary strengths—TabImpute for structured MNAR, and EWF-TabPFN for latent factor models—while adaptive ensembling consistently exceeds the performance of individual models.

Collectively, these findings demonstrate the challenge posed by realistic missingness regimes and motivate continued methodological innovation for missing-data imputation (Feitelberg et al., 3 Oct 2025).

7. Significance and Impact

MissBench establishes a rigorous, reproducible, and comprehensive experimental standard for benchmarking imputation techniques in tabular domains. By spanning realistic domains and missingness mechanisms—including the previously unaddressed breadth of MNAR phenomena—MissBench enables fair, zero-shot comparative evaluations closely aligned with real-world operational requirements. Adoption of this benchmark is expected to facilitate progress in the development and robust evaluation of imputation methods for scientific and industrial data analyses.
