Missingness Simulation Techniques

Updated 30 April 2026

Missingness simulation is the process of artificially introducing missing data into complete datasets using defined stochastic mechanisms such as MCAR, MAR, and MNAR.
It employs simulation algorithms with logistic/probit models to generate random masks, enabling controlled evaluation of imputation and causal inference methods.
The technique supports sensitivity analysis and benchmarking by using performance metrics like bias, RMSE, and coverage to assess estimator robustness.

Missingness simulation refers to the artificial introduction of missing data into otherwise complete datasets according to precisely defined stochastic mechanisms. This controlled amputation of data is foundational for methodological evaluation, sensitivity analysis, benchmarking of imputation and inference algorithms, and theoretical investigation of missing data under various mechanisms—including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), and structural extensions such as block or jointly dependent missingness patterns. Simulation facilitates empirical assessment of estimators’ properties, robustness of analysis pipelines, and reproducibility of statistical results in the presence of structured or non-ignorable missingness.

1. Formal Missingness Taxonomy and Mechanisms

The simulation of missingness exploits the classical Rubin taxonomy (MCAR, MAR, MNAR) and its structured generalizations. In the multivariate setting, the missingness mechanism is formalized as a set of conditional distributions for indicator variables $M_j$ that may depend on the data matrix $\mathbf X$ and, in the structured setting, on the other missingness indicators $\mathbf M_{-j}$:

MCAR: $P(M_j\,|\,\mathbf X, \mathbf M_{-j}) = P(M_j\,|\,\gamma_j)$; completely independent of the data.
MAR: $P(M_j\,|\,\mathbf X, \mathbf M_{-j}) = P(M_j\,|\,\mathbf X_{\rm obs}, \gamma_j)$; depends only on observed values.
MNAR: $P(M_j\,|\,\mathbf X, \mathbf M_{-j}) = P(M_j\,|\,\mathbf X_{\rm obs}, \mathbf X_{\rm mis}, \gamma_j)$; depends on unobserved information and can induce non-ignorability.
Structured Missingness (SM): Each $M_j$ can functionally or stochastically depend on multiple other indicators $\mathbf M_{-j}$, allowing for block, monotone, and sequential patterns (Jackson et al., 2023).

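Practical models use generalized linear models (logistic/probit/canonical links). Writing $\beta$ for the coefficients on observed values and $\gamma$ for the coefficients on unobserved values (including the variable's own value), MCAR is realized by $\beta=\gamma=0$, MAR by $\gamma=0$, and MNAR by $\gamma\neq 0$, that is, by including unobserved/self variables among the predictors.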
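As a minimal sketch of the logistic case, the snippet below draws masks under the three mechanisms by switching coefficients on and off. The function and variable names (`simulate_mask`, `x_obs`, `x_self`), the intercept `alpha`, and the specific coefficient values are illustrative assumptions, not taken from the cited papers; `beta` multiplies an always-observed covariate and `gamma` multiplies the variable being amputed.

```python
import math
import random

def expit(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_mask(x_obs, x_self, alpha, beta, gamma, rng):
    """Draw M_i ~ Bernoulli(expit(alpha + beta * x_obs_i + gamma * x_self_i)).

    M_i = 1 marks x_self_i as missing. beta = gamma = 0 gives MCAR,
    gamma = 0 gives MAR, and gamma != 0 gives MNAR.
    """
    return [
        1 if rng.random() < expit(alpha + beta * xo + gamma * xs) else 0
        for xo, xs in zip(x_obs, x_self)
    ]

def mean_of_observed(values, mask):
    """Mean of the entries that remain observed (mask == 0)."""
    kept = [v for v, m in zip(values, mask) if m == 0]
    return sum(kept) / len(kept)

rng = random.Random(0)
n = 5000
x_obs = [rng.gauss(0, 1) for _ in range(n)]   # always-observed covariate
x_self = [rng.gauss(0, 1) for _ in range(n)]  # variable to be amputed

# Illustrative (assumed) coefficient values for the three mechanisms.
mcar = simulate_mask(x_obs, x_self, alpha=-1.0, beta=0.0, gamma=0.0, rng=rng)
mar = simulate_mask(x_obs, x_self, alpha=-1.0, beta=1.5, gamma=0.0, rng=rng)
mnar = simulate_mask(x_obs, x_self, alpha=-1.0, beta=0.0, gamma=1.5, rng=rng)
```

Comparing `mean_of_observed(x_self, mcar)` with `mean_of_observed(x_self, mnar)` shows the expected pattern: the complete-case mean stays near zero under MCAR and shifts downward under this MNAR setting, because large values are more likely to be removed.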

2. Simulation Algorithms and Mask Generation

The central workflow is: (i) generate a complete dataset from a specified data-generating process, (ii) impose missingness using a random mask sampled from the desired mechanism, (iii) analyze or impute. Algorithms are tailored to the required missingness pattern.

General SM Simulation

For a dataset of $n$ subjects and $p$ variables (Jackson et al., 2023), each missingness indicator $M_j$ is sampled from a conditional distribution of the form $P(M_j\,|\,\mathbf X, \mathbf M_{-j})$ given in Section 1. The same logic generalizes to block or monotone mechanisms by forced ties or mixture models. In high dimensions, vectorized or matrix-based amputation (e.g., Bernoulli amputation with copula-based dependence) is employed (Hofert et al., 2024).
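One way to make the dependence on other indicators concrete is a column-by-column sampler in which each indicator depends on an observed value and on the previous indicator. This is only a sketch under simplifying assumptions: the first-order dependence, the parameter names `alpha`, `beta`, `delta`, and the choice to keep column 0 fully observed are all hypothetical and are not the algorithm of Jackson et al. (2023).

```python
import math
import random

def expit(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_structured_mask(X, alpha, beta, delta, rng):
    """Sample an n x p mask column by column (1 = missing).

    Column 0 of X is treated as always observed. For j >= 1,
    logit P(M_ij = 1) = alpha + beta * X[i][0] + delta * M_i(j-1),
    so each indicator depends on an observed value and on the
    previously drawn indicator in the same row.
    """
    mask = []
    for row in X:
        m_row = [0]  # column 0 is never amputed
        for j in range(1, len(row)):
            p = expit(alpha + beta * row[0] + delta * m_row[j - 1])
            m_row.append(1 if rng.random() < p else 0)
        mask.append(m_row)
    return mask

rng = random.Random(1)
n, p = 2000, 4
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
M = simulate_structured_mask(X, alpha=-2.0, beta=0.5, delta=3.0, rng=rng)

# Apply the mask: missing entries become None.
X_amputed = [[None if m else x for x, m in zip(xr, mr)] for xr, mr in zip(X, M)]
```

A positive `delta` makes missingness cluster along a row once it starts, which loosely mimics sequential patterns; setting `delta = 0` reduces the sampler to an ordinary MAR mechanism.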

Specializations

Block/Sporadic Structured EHR: Remove specified entire rows/columns (block), then superimpose entrywise intermittent missingness by applying masked Bernoulli draws with heterogeneous probability matrices (Tan et al., 10 Jun 2025).
State- and Time-Dependent HMMs: At each time step, the missingness indicator is a Bernoulli draw whose probability is parameterized via state, time, or both; the simulation has three levels: sample the hidden state, then the observation, then the missingness indicator (Speekenbrink et al., 2021).
Tensor/Array MNAR: For each tensor entry, sample the observation indicator from a Bernoulli distribution, with probability given by a logistic function of the latent tensor value (Zhang et al., 7 Sep 2025).
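The three-level HMM simulation can be sketched as follows, reading the three levels as hidden state, observation, and missingness indicator. The Gaussian emissions with unit variance, the fixed initial state, and all parameter values are assumptions made for illustration; they are not taken from Speekenbrink et al. (2021).

```python
import random

def simulate_hmm_with_missingness(T, trans, means, p_miss, rng):
    """Three-level simulation: hidden state, then observation, then mask.

    trans[s] is the transition distribution out of state s, means[s] is the
    Gaussian emission mean for state s, and p_miss[s] is the state-dependent
    probability that the observation at that time step is missing.
    """
    states, obs, mask = [], [], []
    s = 0  # assumed initial state
    for _ in range(T):
        s = rng.choices(range(len(trans)), weights=trans[s])[0]  # level 1: state
        y = rng.gauss(means[s], 1.0)                              # level 2: observation
        m = 1 if rng.random() < p_miss[s] else 0                  # level 3: mask
        states.append(s)
        obs.append(None if m else y)
        mask.append(m)
    return states, obs, mask

rng = random.Random(2)
states, obs, mask = simulate_hmm_with_missingness(
    T=5000,
    trans=[[0.9, 0.1], [0.2, 0.8]],  # hypothetical two-state chain
    means=[0.0, 3.0],
    p_miss=[0.05, 0.5],              # state 1 loses far more observations
    rng=rng,
)
```

A time-dependent variant would replace `p_miss[s]` with a function of the time index, or of both the state and the time index.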

3. Advanced and Structured Missingness: Copulas, Patterns, and Hierarchies

Bernoulli amputation unifies most MCAR/MAR/MNAR/SM patterns within a single framework: given marginal missingness probabilities and a copula of matching dimension, sample a uniform vector from the copula and threshold each component at its marginal probability to obtain the missingness indicators (Hofert et al., 2024). This supports:

MCAR/MAR/MNAR by selective dependence in the missingness probabilities
Block missingness via comonotone copulas
Monotone missingness by introducing a latent cutpoint sampled from a categorical distribution, imposing monotonicity across variables

Detailed mathematical characterization of joint probabilities and pairwise missingness correlations is given via survival copula transforms, fully controlling dependence within or across rows and columns.
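A small sketch of copula-driven amputation is given below. It assumes an equicorrelated Gaussian copula built from a one-factor construction and uses the thresholding convention that an entry is missing when its uniform falls at or below the marginal probability; both choices, along with the names `bernoulli_amputation`, `probs`, and `rho`, are illustrative assumptions rather than the construction in Hofert et al. (2024).

```python
import math
import random
from statistics import NormalDist

def bernoulli_amputation(n, probs, rho, rng):
    """Draw an n x d mask with P(M_ij = 1) = probs[j] (1 = missing).

    Dependence across columns comes from an equicorrelated Gaussian copula
    with correlation rho in [0, 1]. rho = 0 gives independent indicators;
    rho = 1 gives a comonotone copula, i.e. block-like missingness.
    """
    phi = NormalDist().cdf
    mask = []
    for _ in range(n):
        z0 = rng.gauss(0, 1)  # factor shared by all columns of this row
        row = []
        for p in probs:
            z = math.sqrt(rho) * z0 + math.sqrt(1 - rho) * rng.gauss(0, 1)
            u = phi(z)                      # marginally Uniform(0, 1)
            row.append(1 if u <= p else 0)  # assumed convention: M = 1{U <= p}
        mask.append(row)
    return mask

rng = random.Random(7)
probs = [0.1, 0.3, 0.5]
mask = bernoulli_amputation(4000, probs, rho=0.5, rng=rng)

# Comonotone case with equal margins: every row is all-missing or all-observed.
block = bernoulli_amputation(1000, [0.3, 0.3, 0.3], rho=1.0, rng=rng)
```

Making `probs` depend on observed or unobserved values of the data, instead of holding it constant, would move the same sampler from MCAR to MAR or MNAR.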

4. Missingness Simulation in Causal and Inference Frameworks

When simulating for causal inference with missing data, missingness must be applied post-treatment assignment with precise attention to the identifiability conditions, e.g., as encoded by m-DAGs (Zhang et al., 2023), guaranteeing proper estimation of effects under various mechanism types:

Causal m-DAG simulation: Construct latent variable and outcome DAG, then generate missingness indicators by logistic/threshold models aligned with canonical graphs (e.g., outcome-induced, sequential, or block missingness).
Design-based inference: Missingness is a fixed function of all unit attributes; for randomization-based analysis, masks are invariant under permutation of treatment (Heng et al., 2023).

When outcome or key confounder missingness is MNAR, substantive-model-compatible imputation, or imputation + re-imputation within permutation tests, can be embedded into every randomization replicate to preserve finite-sample inference guarantees (Heng et al., 2023, Zhang et al., 2024).

5. Applications in Sensitivity Analysis and Evaluation

Missingness simulation under a spectrum of plausible mechanisms is foundational for sensitivity analysis, particularly for MNAR or selection models. In simulation-based sensitivity analysis (SSA) (Yin et al., 2015):

Construct a grid of non-ignorability parameters.
For each grid value, (i) fit the model, (ii) generate complete data, (iii) impose missingness under that value, (iv) compute the distance between the actual and generated incomplete datasets.
Exclude grid values where the distance is “too large” (above a chosen threshold); among the remaining values, select the most plausible (minimum distance) or provide an interval of estimates.

The practical implementation must report both point and interval estimates and quantify the shape similarity between empirical and synthetic incomplete data.

Metrics commonly used in evaluation studies include:

Metric	Definition	Purpose
Bias	$\mathbb E[\hat\theta] - \theta$	Assess estimator bias
RMSE	$\sqrt{\mathbb E[(\hat\theta - \theta)^2]}$	Aggregate error (variance + bias)
Coverage	Proportion of CIs containing $\theta$	Assess CI calibration
Predictive AUC/MSE	AUC for binary outcome, MSE for continuous	Prediction accuracy
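The first three metrics can be estimated by Monte Carlo over simulation replicates. The sketch below uses a deliberately simple setting chosen for illustration only: a normal outcome, MCAR amputation, a complete-case mean as the estimator, and a normal-approximation 95% interval. The function names and parameter values are hypothetical.

```python
import math
import random

def run_replicate(rng, n=200, mu=1.0, p_miss=0.3):
    """One replicate: generate data, ampute under MCAR, estimate the mean."""
    y = [rng.gauss(mu, 1.0) for _ in range(n)]
    kept = [v for v in y if rng.random() >= p_miss]  # complete cases
    m = len(kept)
    est = sum(kept) / m
    sd = math.sqrt(sum((v - est) ** 2 for v in kept) / (m - 1))
    half = 1.96 * sd / math.sqrt(m)  # normal-approximation 95% CI half-width
    return est, est - half, est + half

def summarize(results, theta):
    """Monte Carlo bias, RMSE, and CI coverage over replicates."""
    R = len(results)
    bias = sum(est for est, _, _ in results) / R - theta
    rmse = math.sqrt(sum((est - theta) ** 2 for est, _, _ in results) / R)
    coverage = sum(lo <= theta <= hi for _, lo, hi in results) / R
    return {"bias": bias, "rmse": rmse, "coverage": coverage}

rng = random.Random(42)
results = [run_replicate(rng) for _ in range(1000)]
summary = summarize(results, theta=1.0)
```

Under MCAR the complete-case mean is unbiased, so this setting serves as a sanity check: bias should be close to zero and coverage close to the nominal level. Replacing the amputation step with a MAR or MNAR mechanism shows how the same summaries degrade.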

6. Handling Simulation Failures and Reporting Standards

Missingness simulation must address not only the imposed missing data but also secondary “simulation missingness” due to incomplete or failed simulation runs (DGM failure, method failure, metric non-computability) (Pawel et al., 2024). Each layer of missingness is tracked by indicator functions for data generation, method application, and metric computation; the corresponding missingness rates inform reporting and exclusion rules.

Best practice for missingness reporting in simulation studies includes:

Quantify and report missingness rates by method and condition.
Use thresholds on the missingness rate to flag exclusion.
State and justify handling strategy (case-wise, list-wise, method replacement).
Run sensitivity analyses across plausible strategies.
Share code and unaggregated replication-level data and indicators.

This protocol ensures methodological validity and reproducibility in simulation-based methodological research.
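The bookkeeping behind these recommendations can be sketched with per-replicate indicators for the three failure layers. The record layout, the method and condition labels, and the 5% threshold below are placeholders chosen for illustration; they are not values recommended by Pawel et al. (2024).

```python
from collections import defaultdict

def missingness_rates(records, threshold=0.05):
    """Summarize failed replicates by (method, condition).

    Each record is a dict with keys 'method', 'condition', and three boolean
    indicators: 'dgm_ok', 'method_ok', 'metric_ok'. A replicate counts as
    missing if any stage failed. The threshold is a placeholder value.
    """
    counts = defaultdict(lambda: {"n": 0, "missing": 0})
    for r in records:
        key = (r["method"], r["condition"])
        counts[key]["n"] += 1
        if not (r["dgm_ok"] and r["method_ok"] and r["metric_ok"]):
            counts[key]["missing"] += 1
    report = {}
    for key, c in counts.items():
        rate = c["missing"] / c["n"]
        report[key] = {"rate": rate, "flagged": rate > threshold}
    return report

# Hypothetical replication-level log: imputer_b fails to converge twice.
records = []
for rep in range(10):
    records.append({"method": "imputer_a", "condition": "mnar",
                    "dgm_ok": True, "method_ok": True, "metric_ok": True})
    records.append({"method": "imputer_b", "condition": "mnar",
                    "dgm_ok": True, "method_ok": rep >= 2, "metric_ok": True})
report = missingness_rates(records)
```

Keeping the unaggregated `records` alongside the summary makes it possible to rerun the analysis under a different handling strategy without repeating the simulation.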

7. Selected Illustrative Examples

EHR Matrix Completion: Block removal of large submatrices representing hospitals/sites, combined with Bernoulli (MAR) entrywise deletion to mimic variable patient-record structure (Tan et al., 10 Jun 2025).
Tensor MNAR: Observed-mask modeled as logistic function of latent tensor values, enabling value-dependent non-ignorability (Zhang et al., 7 Sep 2025).
IRT Omitted/Not-Reached: Sequential omitted/not-reached item indicators defined via cumulative logit models conditional on a latent “missingness propensity” that is correlated with ability, inducing non-ignorable missing data (Guo, 2019).
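The EHR-style pattern from the first example, block removal combined with entrywise deletion, can be sketched as below. The site labels, the set of blocked columns, and the constant entrywise probability are hypothetical; in practice the probability matrix would vary by row and column.

```python
import random

def block_sporadic_mask(site_of_row, missing_cols_by_site, prob, rng):
    """Combine block and entrywise missingness; 1 marks a missing entry.

    site_of_row[i] is the site label of row i, missing_cols_by_site[s] is the
    set of columns never recorded at site s, and prob[i][j] is the entrywise
    probability of sporadic missingness.
    """
    mask = []
    for i, site in enumerate(site_of_row):
        blocked = missing_cols_by_site.get(site, set())
        row = []
        for j, p in enumerate(prob[i]):
            # Block missingness takes precedence; otherwise a Bernoulli draw.
            row.append(1 if j in blocked or rng.random() < p else 0)
        mask.append(row)
    return mask

rng = random.Random(3)
site_of_row = ["A"] * 500 + ["B"] * 500
missing_cols_by_site = {"B": {2, 3}}       # site B never records columns 2 and 3
prob = [[0.1] * 4 for _ in range(1000)]    # assumed constant sporadic rate
mask = block_sporadic_mask(site_of_row, missing_cols_by_site, prob, rng)
```

Rows from site A show only sporadic gaps, while rows from site B lose columns 2 and 3 entirely and have sporadic gaps in the remaining columns.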

References

(Jackson et al., 2023) "A Complete Characterisation of Structured Missingness"
(Tan et al., 10 Jun 2025) "Integrated Analysis for Electronic Health Records with Structured and Sporadic Missingness"
(Zhang et al., 7 Sep 2025) "Generalized Tensor Completion with Non-Random Missingness"
(Hofert et al., 2024) "Bernoulli amputation"
(Zhang et al., 2024) "Sensitivity analysis methods for outcome missingness using substantive-model-compatible multiple imputation and their application in causal inference"
(Heng et al., 2023) "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment"
(Yin et al., 2015) "Simulation-based Sensitivity Analysis for Non-ignorable Missing Data"
(Guo, 2019) "An IRT-based Model for Omitted and Not-reached Items"
(Pawel et al., 2024) "Handling Missingness, Failures, and Non-Convergence in Simulation Studies: A Review of Current Practices and Recommendations"
(Speekenbrink et al., 2021) "Ignorable and non-ignorable missing data in hidden Markov models"
(Zhang et al., 2023) "Recoverability and estimation of causal effects under typical multivariable missingness mechanisms"

This body of recent work establishes missingness simulation as a rigorously formalized, empirically essential component of modern statistical research, enabling advances in both methodological robustness and practical data analytic workflows.