
Missingness Simulation Techniques

Updated 30 April 2026
  • Missingness simulation is the process of artificially introducing missing data into complete datasets using defined stochastic mechanisms such as MCAR, MAR, and MNAR.
  • It employs simulation algorithms with logistic/probit models to generate random masks, enabling controlled evaluation of imputation and causal inference methods.
  • The technique supports sensitivity analysis and benchmarking by using performance metrics like bias, RMSE, and coverage to assess estimator robustness.

Missingness simulation refers to the artificial introduction of missing data into otherwise complete datasets according to precisely defined stochastic mechanisms. This controlled amputation of data is foundational for methodological evaluation, sensitivity analysis, benchmarking of imputation and inference algorithms, and theoretical investigation of missing data under various mechanisms—including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), and structural extensions such as block or jointly dependent missingness patterns. Simulation facilitates empirical assessment of estimators’ properties, robustness of analysis pipelines, and reproducibility of statistical results in the presence of structured or non-ignorable missingness.

1. Formal Missingness Taxonomy and Mechanisms

The simulation of missingness exploits the classical Rubin taxonomy (MCAR, MAR, MNAR) and its structured generalizations. In the multivariate setting, the missingness mechanism is formalized as a set of conditional distributions for indicator variables $M_j$ that may depend on the data matrix $\mathbf X$ and, in the structured setting, on the other missingness indicators $\mathbf M_{-j}$:

  • MCAR: $P(M_j \mid \mathbf X, \mathbf M_{-j}) = P(M_j \mid \gamma_j)$; completely independent of data.
  • MAR: $P(M_j \mid \mathbf X, \mathbf M_{-j}) = P(M_j \mid \mathbf X_{\rm obs}, \gamma_j)$; depends only on observed values.
  • MNAR: $P(M_j \mid \mathbf X, \mathbf M_{-j}) = P(M_j \mid \mathbf X_{\rm obs}, \mathbf X_{\rm mis}, \gamma_j)$; depends on unobserved information and can induce non-ignorability.
  • Structured Missingness (SM): Each $M_j$ can functionally or stochastically depend on multiple other indicators $\mathbf M_{-j}$, allowing for block, monotone, and sequential patterns (Jackson et al., 2023).

Practical models use generalized linear models (logistic, probit, or other links). In such a model, with $\beta$ denoting the coefficients on observed predictors and $\gamma$ the coefficients on unobserved predictors (including the variable's own value), MCAR is realized by $\beta=\gamma=0$, MAR by $\gamma=0$, and MNAR by a nonzero $\gamma$.
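The coefficient settings above can be illustrated with a minimal sketch using only the Python standard library. It assumes a single logistic model with one always-observed covariate `x1` and one variable `x2` to be amputed; the function name `simulate_mask`, the intercept `alpha`, and all numeric values are hypothetical choices for illustration, not taken from the cited papers.

```python
import math
import random


def sigmoid(z):
    """Logistic link mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_mask(x_obs, x_self, alpha, beta, gamma, rng):
    """Draw one missingness indicator per unit (1 = missing, 0 = observed).

    beta multiplies the always-observed covariate; gamma multiplies the value
    that is itself being amputed (assumed parameterization).
    """
    mask = []
    for xo, xs in zip(x_obs, x_self):
        p_missing = sigmoid(alpha + beta * xo + gamma * xs)
        mask.append(1 if rng.random() < p_missing else 0)
    return mask


def observed_mean(values, mask):
    """Mean of the values that remain observed under the mask."""
    kept = [v for v, m in zip(values, mask) if m == 0]
    return sum(kept) / len(kept)


rng = random.Random(0)
n = 5000
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]        # always observed
x2 = [0.5 * a + rng.gauss(0.0, 1.0) for a in x1]    # variable to ampute

mcar = simulate_mask(x1, x2, alpha=-1.0, beta=0.0, gamma=0.0, rng=rng)
mar = simulate_mask(x1, x2, alpha=-1.0, beta=1.5, gamma=0.0, rng=rng)
mnar = simulate_mask(x1, x2, alpha=-1.0, beta=0.0, gamma=1.5, rng=rng)

full_mean = sum(x2) / n
print("MCAR rate:", sum(mcar) / n)
print("full mean of x2:", full_mean)
print("observed mean under MCAR:", observed_mean(x2, mcar))
print("observed mean under MNAR:", observed_mean(x2, mnar))
```

Under these assumptions the MCAR mask leaves the observed mean of `x2` close to the full-data mean, while the MNAR mask removes large values preferentially and shifts it downward.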

2. Simulation Algorithms and Mask Generation

The central workflow is: (i) generate a complete dataset from a specified data-generating process, (ii) impose missingness using a random mask sampled from the desired mechanism, (iii) analyze or impute. Algorithms are tailored to the required missingness pattern.
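As a rough sketch of this three-step workflow, the following standard-library Python generates complete data, imposes a MAR mask, and compares two analyses across replicates. The data-generating process, the coefficient values, and the function names (`generate_complete`, `ampute_mar`, `regression_imputed_mean`) are all illustrative assumptions.

```python
import math
import random


def generate_complete(n, rng):
    """Step (i): complete data from an assumed linear model, E[y] = 1.0."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def ampute_mar(xs, ys, rng, intercept=-0.5, slope=1.0):
    """Step (ii): delete y with a probability that depends only on observed x."""
    amputed = []
    for x, y in zip(xs, ys):
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        amputed.append(None if rng.random() < p_missing else y)
    return amputed


def complete_case_mean(ys_amputed):
    """Step (iii), analysis A: mean of the observed outcomes only."""
    kept = [y for y in ys_amputed if y is not None]
    return sum(kept) / len(kept)


def regression_imputed_mean(xs, ys_amputed):
    """Step (iii), analysis B: fill gaps with a least-squares prediction from x."""
    pairs = [(x, y) for x, y in zip(xs, ys_amputed) if y is not None]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    filled = [y if y is not None else intercept + slope * x
              for x, y in zip(xs, ys_amputed)]
    return sum(filled) / len(filled)


rng = random.Random(1)
cc_means, reg_means = [], []
for _ in range(200):                      # simulation replicates
    xs, ys = generate_complete(500, rng)
    ys_amp = ampute_mar(xs, ys, rng)
    cc_means.append(complete_case_mean(ys_amp))
    reg_means.append(regression_imputed_mean(xs, ys_amp))

avg_cc = sum(cc_means) / len(cc_means)
avg_reg = sum(reg_means) / len(reg_means)
print("true mean: 1.0")
print("average complete-case mean:", avg_cc)
print("average regression-imputed mean:", avg_reg)
```

In this setup the complete-case mean is biased because units with large `x` (and hence large `y`) are more likely to be deleted, whereas regression imputation on `x` stays close to the true mean.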

General SM Simulation

For a dataset of $n$ subjects and $p$ variables (Jackson et al., 2023), each indicator $M_j$ is sampled from its conditional distribution given the data $\mathbf X$ and the relevant other indicators $\mathbf M_{-j}$, as defined in Section 1. The same logic generalizes to block or monotone mechanisms by forced ties or mixture models. In high dimensions, vectorized or matrix-based amputation (e.g., Bernoulli amputation using copula dependence) is employed (Hofert et al., 2024).

Specializations

  • Block/Sporadic Structured EHR: Remove specified entire rows/columns (block), then superimpose entrywise intermittent missingness by applying Bernoulli draws with a heterogeneous matrix of missingness probabilities (Tan et al., 10 Jun 2025).
  • State- and Time-Dependent HMMs: At each time point, the missingness probability is parameterized via the hidden state, time, or both. Simulation proceeds in three levels: sample the hidden state, then the observation, then the missingness indicator (Speekenbrink et al., 2021).
  • Tensor/Array MNAR: For each tensor entry, sample the observation indicator from a Bernoulli distribution whose probability is a logistic function of the latent tensor value (Zhang et al., 7 Sep 2025).
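The block-plus-sporadic specialization can be sketched as follows. This is a minimal illustration, assuming the block is a rows-by-columns submatrix and that the entrywise probabilities are supplied as a nested list; the function name `block_sporadic_mask` and the probability values are hypothetical.

```python
import random


def block_sporadic_mask(n_rows, n_cols, block_rows, block_cols, prob, rng):
    """Return a nested list where mask[i][j] is True when entry (i, j) is missing.

    First the whole block (block_rows x block_cols) is removed, then each
    remaining entry is deleted independently with probability prob[i][j].
    """
    mask = [[False] * n_cols for _ in range(n_rows)]
    for i in block_rows:                  # structured (block) missingness
        for j in block_cols:
            mask[i][j] = True
    for i in range(n_rows):               # sporadic entrywise missingness
        for j in range(n_cols):
            if not mask[i][j] and rng.random() < prob[i][j]:
                mask[i][j] = True
    return mask


rng = random.Random(2)
n_rows, n_cols = 20, 6
# Heterogeneous probabilities: later columns are deleted more often (assumed).
prob = [[0.05 + 0.05 * j for j in range(n_cols)] for _ in range(n_rows)]
block_rows, block_cols = range(0, 5), range(3, 6)

mask = block_sporadic_mask(n_rows, n_cols, block_rows, block_cols, prob, rng)
missing_rate = sum(sum(row) for row in mask) / (n_rows * n_cols)
print("overall missing rate:", missing_rate)
```

Keeping the block step and the sporadic step separate makes it straightforward to vary one while holding the other fixed in a benchmark.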

3. Advanced and Structured Missingness: Copulas, Patterns, and Hierarchies

Bernoulli amputation unifies most MCAR/MAR/MNAR/SM patterns within a single framework: given marginal missingness probabilities and a multivariate copula, sample a vector of dependent uniforms from the copula and threshold each component against its margin to obtain Bernoulli missingness indicators (Hofert et al., 2024). This supports:

  • MCAR/MAR/MNAR by selective dependence of the missingness probabilities on the data
  • Block missingness via comonotone copulas
  • Monotone missingness by introducing a latent cutpoint sampled from a categorical distribution, imposing monotonicity across variables

Detailed mathematical characterization of joint probabilities and pairwise missingness correlations is given via survival copula transforms, fully controlling dependence within or across rows and columns.
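A minimal sketch of copula-driven Bernoulli masks follows. It assumes an equicorrelated Gaussian copula built from a one-factor model and the thresholding rule "missing when the uniform falls at or below the margin"; both are illustrative choices rather than the construction in Hofert et al. (2024), and the name `copula_bernoulli_mask` is hypothetical.

```python
import math
import random
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def copula_bernoulli_mask(n, margins, rho, rng):
    """Draw n rows of dependent Bernoulli missingness indicators.

    margins[j] is the marginal missingness probability of variable j and
    rho in [0, 1] is the common latent correlation (assumed Gaussian copula).
    """
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)      # common factor inducing dependence
        row = []
        for p in margins:
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            u = _PHI(z)                   # uniform margin with copula dependence
            row.append(1 if u <= p else 0)
        rows.append(row)
    return rows


rng = random.Random(4)
n = 4000
margins = [0.1, 0.3, 0.5]
dependent = copula_bernoulli_mask(n, margins, rho=0.6, rng=rng)
rates = [sum(row[j] for row in dependent) / n for j in range(len(margins))]

# rho = 1 gives the comonotone case: with equal margins every row is either
# fully missing or fully observed, which reproduces block missingness.
block = copula_bernoulli_mask(n, [0.2, 0.2, 0.2], rho=1.0, rng=rng)
print("marginal missing rates:", rates)
print("rows with ties in the comonotone case:",
      sum(1 for row in block if len(set(row)) == 1), "of", n)
```

The margins control how much is missing per variable, while `rho` alone controls how strongly the indicators co-occur, so the two can be tuned independently.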

4. Missingness Simulation in Causal and Inference Frameworks

When simulating for causal inference with missing data, missingness must be applied post-treatment assignment with precise attention to the identifiability conditions, e.g., as encoded by m-DAGs (Zhang et al., 2023), guaranteeing proper estimation of effects under various mechanism types:

  • Causal m-DAG simulation: Construct latent variable and outcome DAG, then generate missingness indicators by logistic/threshold models aligned with canonical graphs (e.g., outcome-induced, sequential, or block missingness).
  • Design-based inference: Missingness is a fixed function of all unit attributes; for randomization-based analysis, masks are invariant under permutation of treatment (Heng et al., 2023).

When outcome or key confounder missingness is MNAR, substantive-model-compatible imputation, or imputation + re-imputation within permutation tests, can be embedded into every randomization replicate to preserve finite-sample inference guarantees (Heng et al., 2023, Zhang et al., 2024).
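The idea of embedding imputation in every randomization replicate can be sketched as a permutation test in which the mask stays fixed and the imputation is redone under each re-randomized assignment. This is a simplified illustration and not the procedure of the cited papers: the arm-mean imputation, the difference-in-means statistic, the simulated data, and the function names are all assumptions.

```python
import random


def impute_by_arm_mean(y, missing, z):
    """Fill missing outcomes with the observed mean of the unit's arm under z."""
    means = {}
    for arm in (0, 1):
        obs = [yi for yi, mi, zi in zip(y, missing, z) if not mi and zi == arm]
        means[arm] = sum(obs) / len(obs)
    return [means[zi] if mi else yi for yi, mi, zi in zip(y, missing, z)]


def diff_in_means(y, z):
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def imputation_assisted_pvalue(y, missing, z, n_perm, rng):
    """Two-sided permutation p-value; imputation is repeated in every replicate."""
    observed = diff_in_means(impute_by_arm_mean(y, missing, z), z)
    extreme = 0
    for _ in range(n_perm):
        z_perm = z[:]
        rng.shuffle(z_perm)               # re-randomize treatment; mask unchanged
        stat = diff_in_means(impute_by_arm_mean(y, missing, z_perm), z_perm)
        if abs(stat) >= abs(observed):
            extreme += 1
    return (1 + extreme) / (1 + n_perm)


rng = random.Random(3)
n = 100
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
z = [1] * 50 + [0] * 50
rng.shuffle(z)
y = [1.5 * zi + 0.5 * xi + rng.gauss(0.0, 1.0) for zi, xi in zip(z, x)]
# The mask depends only on the covariate, so it does not change with treatment.
missing = [rng.random() < (0.5 if xi > 0 else 0.3) for xi in x]

p_value = imputation_assisted_pvalue(y, missing, z, n_perm=999, rng=rng)
print("permutation p-value:", p_value)
```

Because the same imputation rule is applied to the observed assignment and to every permuted assignment, the reference distribution reflects the imputation step instead of ignoring it.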

5. Applications in Sensitivity Analysis and Evaluation

Missingness simulation under a spectrum of plausible mechanisms is foundational for sensitivity analysis, particularly for MNAR or selection models. In simulation-based sensitivity analysis (SSA) (Yin et al., 2015):

  • Construct a grid of non-ignorability parameter values
  • For each grid value, (i) fit the model, (ii) generate complete data, (iii) impose missingness under that value, (iv) compute a distance between the actual and generated incomplete datasets
  • Exclude grid values where the distance is “too large” (above a chosen threshold); among the remaining values, select the most plausible (minimum distance) or provide an interval of estimates

The practical implementation must report both point and interval estimates and quantify the shape similarity between empirical and synthetic incomplete data.

Metrics commonly used in evaluation studies include:

| Metric | Definition | Purpose |
| --- | --- | --- |
| Bias | $\mathbb{E}[\hat\theta] - \theta$ | Assess estimator bias |
| RMSE | $\sqrt{\mathbb{E}[(\hat\theta - \theta)^2]}$ | Aggregate error (variance + bias) |
| Coverage | Proportion of CIs containing $\theta$ | Calibrate CI performance |
| Predictive AUC/MSE | AUC for binary outcome, MSE for continuous | Prediction accuracy |
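The first three metrics can be computed from replicate-level results as in the sketch below. It assumes each replicate yields a point estimate and a standard error, and that intervals are Wald-type with a normal critical value; the function name `performance_metrics` and the toy simulation are illustrative.

```python
import math
import random
import statistics


def performance_metrics(estimates, std_errors, truth, z_crit=1.96):
    """Monte Carlo bias, RMSE, and CI coverage over simulation replicates."""
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z_crit * s <= truth <= e + z_crit * s
    )
    return {"bias": bias, "rmse": rmse, "coverage": covered / n}


# Toy check: the sample mean of 50 normal draws, repeated 1000 times.
rng = random.Random(7)
truth, n, reps = 2.0, 50, 1000
estimates, std_errors = [], []
for _ in range(reps):
    sample = [rng.gauss(truth, 1.0) for _ in range(n)]
    estimates.append(statistics.fmean(sample))
    std_errors.append(statistics.stdev(sample) / math.sqrt(n))

metrics = performance_metrics(estimates, std_errors, truth)
print(metrics)
```

In a missingness study the same function would be applied to the estimates produced after amputation and imputation, one call per method and condition.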

6. Handling Simulation Failures and Reporting Standards

Missingness simulation must address not only the imposed missing data but also secondary “simulation missingness” due to incomplete or failed simulation runs (DGM failure, method failure, metric non-computability) (Pawel et al., 2024). Each layer of missingness is tracked by indicator functions for data generation, method application, and metric computation; the corresponding missing rates for the three layers inform reporting and exclusion rules.

Best practice for missingness reporting in simulation studies includes:

  1. Quantify and report missingness rates by method and condition.
  2. Use pre-specified thresholds on the missingness rate to flag exclusion.
  3. State and justify handling strategy (case-wise, list-wise, method replacement).
  4. Run sensitivity analyses across plausible strategies.
  5. Share code and unaggregated replication-level data and indicators.

This protocol ensures methodological validity and reproducibility in simulation-based methodological research.
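A small sketch of how the three failure layers might be tracked per method and condition is given below. The record layout (keys `dgm_ok`, `method_ok`, `metric_ok`), the function names, and the threshold value are hypothetical; a failure is attributed to the earliest layer that failed.

```python
from collections import defaultdict

LAYERS = ("dgm_fail", "method_fail", "metric_fail")


def missingness_rates(records):
    """Per (method, condition) failure rates for each layer of simulation missingness."""
    counts = defaultdict(lambda: {"n": 0, "dgm_fail": 0, "method_fail": 0, "metric_fail": 0})
    for r in records:
        c = counts[(r["method"], r["condition"])]
        c["n"] += 1
        if not r["dgm_ok"]:               # data generation failed
            c["dgm_fail"] += 1
        elif not r["method_ok"]:          # method errored or did not converge
            c["method_fail"] += 1
        elif not r["metric_ok"]:          # metric could not be computed
            c["metric_fail"] += 1
    return {key: {name: c[name] / c["n"] for name in LAYERS}
            for key, c in counts.items()}


def flag_for_exclusion(rates, threshold):
    """Return the (method, condition) cells whose total failure rate exceeds threshold."""
    return [key for key, r in rates.items() if sum(r.values()) > threshold]


def rec(method, dgm_ok=True, method_ok=True, metric_ok=True):
    return {"method": method, "condition": "c1",
            "dgm_ok": dgm_ok, "method_ok": method_ok, "metric_ok": metric_ok}


records = [
    rec("A", dgm_ok=False), rec("A", method_ok=False), rec("A"), rec("A"),
    rec("B"), rec("B"), rec("B"), rec("B"),
]
rates = missingness_rates(records)
flagged = flag_for_exclusion(rates, threshold=0.3)  # threshold is illustrative
print(rates)
print("flagged:", flagged)
```

Storing one record per replicate, as here, also makes it easy to share the unaggregated indicators alongside the summary rates.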

7. Selected Illustrative Examples

  • EHR Matrix Completion: Block removal of large submatrices representing hospitals/sites, combined with Bernoulli (MAR) entrywise deletion to mimic variable patient-record structure (Tan et al., 10 Jun 2025).
  • Tensor MNAR: Observed-mask modeled as logistic function of latent tensor values, enabling value-dependent non-ignorability (Zhang et al., 7 Sep 2025).
  • IRT Omitted/Not-Reached: Sequential omitted/not-reached item indicators defined via cumulative logit models conditional on a latent “missingness propensity” that is correlated with ability, inducing non-ignorable missing data (Guo, 2019).

References

  • (Jackson et al., 2023) "A Complete Characterisation of Structured Missingness"
  • (Tan et al., 10 Jun 2025) "Integrated Analysis for Electronic Health Records with Structured and Sporadic Missingness"
  • (Zhang et al., 7 Sep 2025) "Generalized Tensor Completion with Non-Random Missingness"
  • (Hofert et al., 2024) "Bernoulli amputation"
  • (Zhang et al., 2024) "Sensitivity analysis methods for outcome missingness using substantive-model-compatible multiple imputation and their application in causal inference"
  • (Heng et al., 2023) "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment"
  • (Yin et al., 2015) "Simulation-based Sensitivity Analysis for Non-ignorable Missing Data"
  • (Guo, 2019) "An IRT-based Model for Omitted and Not-reached Items"
  • (Pawel et al., 2024) "Handling Missingness, Failures, and Non-Convergence in Simulation Studies: A Review of Current Practices and Recommendations"
  • (Speekenbrink et al., 2021) "Ignorable and non-ignorable missing data in hidden Markov models"
  • (Zhang et al., 2023) "Recoverability and estimation of causal effects under typical multivariable missingness mechanisms"

This body of recent work establishes missingness simulation as a rigorously formalized, empirically essential component of modern statistical research, enabling advances in both methodological robustness and practical data analytic workflows.
