Missingness Simulation Techniques
- Missingness simulation is the process of artificially introducing missing data into complete datasets using defined stochastic mechanisms such as MCAR, MAR, and MNAR.
- It employs simulation algorithms with logistic/probit models to generate random masks, enabling controlled evaluation of imputation and causal inference methods.
- The technique supports sensitivity analysis and benchmarking by using performance metrics like bias, RMSE, and coverage to assess estimator robustness.
Missingness simulation refers to the artificial introduction of missing data into otherwise complete datasets according to precisely defined stochastic mechanisms. This controlled amputation of data is foundational for methodological evaluation, sensitivity analysis, benchmarking of imputation and inference algorithms, and theoretical investigation of missing data under various mechanisms—including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), and structural extensions such as block or jointly dependent missingness patterns. Simulation facilitates empirical assessment of estimators’ properties, robustness of analysis pipelines, and reproducibility of statistical results in the presence of structured or non-ignorable missingness.
1. Formal Missingness Taxonomy and Mechanisms
The simulation of missingness builds on the classical Rubin taxonomy (MCAR, MAR, MNAR) and its structured generalizations. In the multivariate setting, the missingness mechanism is formalized as a set of conditional distributions for indicator variables $M_{ij}$ (with $M_{ij} = 1$ when entry $X_{ij}$ is missing) that may depend on the data matrix $X = (X_{\mathrm{obs}}, X_{\mathrm{mis}})$ and, in the structured setting, on the other missingness indicators $M_{-(ij)}$:
- MCAR: $P(M \mid X) = P(M)$; completely independent of data.
- MAR: $P(M \mid X) = P(M \mid X_{\mathrm{obs}})$; depends only on observed values.
- MNAR: $P(M \mid X)$ depends on $X_{\mathrm{mis}}$; depends on unobserved information and can induce non-ignorability.
- Structured Missingness (SM): Each $M_{ij}$ can functionally or stochastically depend on multiple other indicators $M_{-(ij)}$, allowing for block, monotone, and sequential patterns (Jackson et al., 2023).
Practical models use generalized linear models (logistic/probit/canonical links), with MCAR realized by setting all slope coefficients to zero (intercept only), MAR by setting the coefficients on unobserved variables to zero, and MNAR by including unobserved variables, or the variable being masked itself, among the predictors.
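The logistic formulation above can be illustrated with a minimal sketch. The function name `logistic_mask`, the intercept and slope values, and the two-column toy data are assumptions made for illustration; they are not taken from any of the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mask(X, target, mechanism, rng, driver=0, intercept=-1.0, slope=1.5):
    """Return a boolean mask (True = missing) for column `target` of X.

    MCAR: intercept-only model (slope is zero).
    MAR : missingness depends on column `driver`, assumed fully observed.
    MNAR: missingness depends on the value of the target column itself.
    """
    n = X.shape[0]
    if mechanism == "MCAR":
        eta = np.full(n, intercept)
    elif mechanism == "MAR":
        eta = intercept + slope * X[:, driver]
    elif mechanism == "MNAR":
        eta = intercept + slope * X[:, target]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(n) < sigmoid(eta)

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
X[:, 1] += 0.5 * X[:, 0]  # make the two columns correlated

masks = {m: logistic_mask(X, target=1, mechanism=m, rng=rng)
         for m in ("MCAR", "MAR", "MNAR")}

# Amputate a copy of the data under the MNAR mask.
X_mnar = X.copy()
X_mnar[masks["MNAR"], 1] = np.nan
```

With a positive slope, larger values are more likely to be masked, so under MNAR the observed part of column 1 is shifted downward relative to the complete column.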
2. Simulation Algorithms and Mask Generation
The central workflow is: (i) generate a complete dataset from a specified data-generating process, (ii) impose missingness using a random mask sampled from the desired mechanism, (iii) analyze or impute. Algorithms are tailored to the required missingness pattern.
General SM Simulation
For a dataset of $n$ subjects and $p$ variables (Jackson et al., 2023), each missingness indicator is drawn from its conditional distribution given the data and the other indicators. The same logic generalizes to block or monotone mechanisms through forced ties or mixture models. In high dimensions, vectorized or matrix-based amputation (e.g., Bernoulli amputation using copula dependence) is employed (Hofert et al., 2024).
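One way to realize indicator-on-indicator dependence is to draw the columns of the mask sequentially, letting each column's logit depend on the data and on the indicators already drawn. The sketch below is an illustrative assumption, not the algorithm of Jackson et al. (2023); the names `simulate_structured_mask`, `alpha`, `B`, and `G` are hypothetical.

```python
import numpy as np

def simulate_structured_mask(X, alpha, B, G, rng):
    """Sequentially sample an n x p mask (True = missing).

    alpha : (p,)   intercepts
    B     : (p, p) B[k, j] is the effect of data column k on indicator j
    G     : (p, p) G[k, j] (k < j) is the effect of indicator k on indicator j
    """
    n, p = X.shape
    M = np.zeros((n, p), dtype=bool)
    for j in range(p):
        # Logit depends on the data and on the indicators drawn so far.
        eta = alpha[j] + X @ B[:, j] + M[:, :j].astype(float) @ G[:j, j]
        M[:, j] = rng.random(n) < 1.0 / (1.0 + np.exp(-eta))
    return M

rng = np.random.default_rng(1)
n, p = 4000, 3
X = rng.standard_normal((n, p))
alpha = np.array([-1.0, -2.0, -1.0])
B = np.zeros((p, p))   # no dependence on the data in this example
G = np.zeros((p, p))
G[0, 1] = 3.0          # column 1 is far more likely missing when column 0 is
M = simulate_structured_mask(X, alpha, B, G, rng)
```

Setting `G` to zero recovers independent column-wise mechanisms; very large entries in `G` approximate the forced ties used for block patterns.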
Specializations
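- Block/Sporadic Structured EHR: Remove specified entire rows/columns (block), then superimpose entrywise intermittent missingness by applying masked Bernoulli draws with heterogeneous, entry-specific probabilities (Tan et al., 10 Jun 2025).
- State- and Time-Dependent HMMs: At each time point, the missingness indicator is Bernoulli with a probability parameterized via the hidden state, time, or both. Simulation proceeds in three levels: sample the hidden state, then the observation, then the missingness indicator (Speekenbrink et al., 2021).
- Tensor/Array MNAR: For each tensor entry, sample an observation indicator from a Bernoulli distribution whose probability is a logistic function of the corresponding latent tensor value (Zhang et al., 7 Sep 2025).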
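The three-level HMM scheme (state, then observation, then missingness) can be sketched as follows. The two-state chain, Gaussian emissions, and state-specific missingness probabilities are illustrative assumptions, and `simulate_hmm_with_missingness` is a hypothetical helper rather than code from Speekenbrink et al. (2021).

```python
import numpy as np

def simulate_hmm_with_missingness(T, trans, means, p_miss, rng):
    """Simulate hidden states, observations, and a state-dependent mask."""
    K = trans.shape[0]
    states = np.empty(T, dtype=int)
    states[0] = rng.integers(K)                     # level 1: hidden states
    for t in range(1, T):
        states[t] = rng.choice(K, p=trans[states[t - 1]])
    y = rng.normal(means[states], 1.0)              # level 2: observations
    miss = rng.random(T) < p_miss[states]           # level 3: missingness
    y_obs = np.where(miss, np.nan, y)
    return states, y, miss, y_obs

rng = np.random.default_rng(2)
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
means = np.array([0.0, 3.0])
p_miss = np.array([0.05, 0.60])  # state 1 is much more likely to be unobserved
states, y, miss, y_obs = simulate_hmm_with_missingness(5000, trans, means, p_miss, rng)
```

Because the missingness probability depends on the hidden state, which also drives the observations, the mask carries information about the unobserved values; making `p_miss` constant across states removes that dependence.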
3. Advanced and Structured Missingness: Copulas, Patterns, and Hierarchies
Bernoulli amputation unifies most MCAR/MAR/MNAR/SM patterns within a single framework: given marginal missingness probabilities $p_1, \dots, p_d$ and a $d$-variate copula $C$, sample $U \sim C$ and threshold each component $U_j$ so that the indicator $M_j$ is Bernoulli with probability $p_j$ (Hofert et al., 2024). This supports:
- MCAR/MAR/MNAR by selective dependence in the marginal probabilities
- Block missingness via comonotone copulas
- Monotone missingness by introducing a latent cutpoint sampled from a categorical distribution, imposing monotonicity across variables
Detailed mathematical characterization of joint probabilities and pairwise missingness correlations is given via survival copula transforms, fully controlling dependence within or across rows and columns.
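A minimal sketch of copula-driven mask generation follows. The Gaussian copula, the thresholding convention (an entry is missing when its uniform falls below the marginal probability), and the function names are assumptions chosen for illustration; Hofert et al. (2024) may use a different convention.

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_mask(n, probs, corr, rng):
    """Mask with Bernoulli(probs[j]) margins and Gaussian-copula dependence."""
    probs = np.asarray(probs, dtype=float)
    # U_j <= p_j is equivalent to Z_j <= Phi^{-1}(p_j) for U_j = Phi(Z_j).
    cutoffs = np.array([NormalDist().inv_cdf(p) for p in probs])
    Z = rng.multivariate_normal(np.zeros(len(probs)), corr, size=n)
    return Z <= cutoffs

def comonotone_mask(n, probs, rng):
    """Comonotone copula: one shared uniform drives every column."""
    U = rng.random((n, 1))
    return U <= np.asarray(probs, dtype=float)

rng = np.random.default_rng(3)
probs = [0.1, 0.3, 0.5]
corr = np.array([[1.0, 0.7, 0.7],
                 [0.7, 1.0, 0.7],
                 [0.7, 0.7, 1.0]])
M = gaussian_copula_mask(20000, probs, corr, rng)

block = comonotone_mask(1000, [0.3, 0.3, 0.3], rng)  # whole rows go missing together
nested = comonotone_mask(1000, probs, rng)           # nested (monotone) pattern
```

With equal margins the comonotone construction removes whole rows at once, and with increasing margins a missing entry in one column implies missing entries in every column with a larger probability.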
4. Missingness Simulation in Causal and Inference Frameworks
When simulating for causal inference with missing data, missingness must be applied post-treatment assignment with precise attention to the identifiability conditions, e.g., as encoded by m-DAGs (Zhang et al., 2023), guaranteeing proper estimation of effects under various mechanism types:
- Causal m-DAG simulation: Construct latent variable and outcome DAG, then generate missingness indicators by logistic/threshold models aligned with canonical graphs (e.g., outcome-induced, sequential, or block missingness).
- Design-based inference: Missingness is a fixed function of all unit attributes; for randomization-based analysis, masks are invariant under permutation of treatment (Heng et al., 2023).
When outcome or key confounder missingness is MNAR, substantive-model-compatible imputation, or imputation + re-imputation within permutation tests, can be embedded into every randomization replicate to preserve finite-sample inference guarantees (Heng et al., 2023, Zhang et al., 2024).
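The design-based case, in which the mask is a fixed function of unit attributes, can be sketched with a simple randomization test. The toy data-generating process, the covariate threshold that defines the mask, and the difference-in-means statistic are all illustrative assumptions; the imputation-assisted variants described above are not implemented here.

```python
import numpy as np

def randomization_pvalue(y, z, missing, rng, n_perm=2000):
    """Two-sided randomization test on observed outcomes.

    The mask `missing` is held fixed while the treatment labels are permuted.
    """
    obs = ~missing

    def stat(z_):
        return y[obs & (z_ == 1)].mean() - y[obs & (z_ == 0)].mean()

    t_obs = stat(z)
    t_perm = np.array([stat(rng.permutation(z)) for _ in range(n_perm)])
    return (1 + np.sum(np.abs(t_perm) >= abs(t_obs))) / (n_perm + 1)

rng = np.random.default_rng(4)
n = 200
x = rng.standard_normal(n)                      # pre-treatment attribute
z = rng.permutation(np.repeat([0, 1], n // 2))  # completely randomized design
y = x + 1.0 * z + rng.standard_normal(n)        # true effect of 1.0 (assumed)
missing = x > 1.0                               # fixed function of the attribute
p_value = randomization_pvalue(y, z, missing, rng)
```

Because the mask does not change when treatment labels are shuffled, the same set of observed units enters every permutation replicate.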
5. Applications in Sensitivity Analysis and Evaluation
Missingness simulation under a spectrum of plausible mechanisms is foundational for sensitivity analysis, particularly for MNAR or selection models. In simulation-based sensitivity analysis (SSA) (Yin et al., 2015):
- Construct a grid of values for the non-ignorability parameters
- For each grid value, (i) fit the model, (ii) generate complete data, (iii) impose missingness under that value, (iv) compute a distance between the actual and generated incomplete datasets
- Exclude grid values whose distance is “too large” (above a chosen cutoff); among the remaining values, select the most plausible (minimum distance) or provide an interval of estimates
The practical implementation must report both point and interval estimates and quantify the shape similarity between empirical and synthetic incomplete data.
Metrics commonly used in evaluation studies include:
| Metric | Definition | Purpose |
|---|---|---|
| Bias | $\mathbb{E}[\hat{\theta}] - \theta$ | Assess estimator bias |
| RMSE | $\sqrt{\mathbb{E}[(\hat{\theta} - \theta)^2]}$ | Aggregate error (variance + bias) |
| Coverage | Proportion of CIs containing $\theta$ | Calibrate CI performance |
| Predictive AUC/MSE | AUC for binary outcome, MSE for continuous | Prediction accuracy |
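
The first three metrics in the table can be estimated by Monte Carlo. The sketch below evaluates a complete-case mean under MCAR and under a logistic MNAR mechanism; the sample size, number of replicates, mechanism parameters, and normal-approximation intervals are illustrative assumptions.

```python
import numpy as np

def evaluate_complete_case_mean(mechanism, n=500, reps=1000, seed=0):
    """Monte Carlo bias, RMSE, and 95% CI coverage of the complete-case mean."""
    rng = np.random.default_rng(seed)
    theta = 0.0                                   # true mean
    est = np.empty(reps)
    covered = np.empty(reps, dtype=bool)
    for r in range(reps):
        y = rng.normal(theta, 1.0, n)
        if mechanism == "MCAR":
            p_miss = np.full(n, 0.3)
        else:                                     # MNAR: large values go missing
            p_miss = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * y)))
        obs = y[rng.random(n) >= p_miss]
        est[r] = obs.mean()
        se = obs.std(ddof=1) / np.sqrt(obs.size)
        covered[r] = abs(est[r] - theta) <= 1.96 * se
    return {
        "bias": est.mean() - theta,
        "rmse": np.sqrt(np.mean((est - theta) ** 2)),
        "coverage": covered.mean(),
    }

results = {m: evaluate_complete_case_mean(m) for m in ("MCAR", "MNAR")}
```

Under MCAR the complete-case mean should be close to unbiased with near-nominal coverage, while under this MNAR mechanism the estimate is biased downward and coverage collapses.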
6. Handling Simulation Failures and Reporting Standards
Missingness simulation must address not only the imposed missing data but also secondary “simulation missingness” due to incomplete or failed simulation runs (DGM failure, method failure, metric non-computability) (Pawel et al., 2024). Each layer of missingness is tracked by indicator functions for data generation, method application, and metric computation; the corresponding missingness rates inform reporting and exclusion rules.
Best practice for missingness reporting in simulation studies includes:
- Quantify and report missingness rates by method and condition.
- Use pre-specified thresholds on the missingness rate to flag exclusion.
- State and justify handling strategy (case-wise, list-wise, method replacement).
- Run sensitivity analyses across plausible strategies.
- Share code and unaggregated replication-level data and indicators.
This protocol ensures methodological validity and reproducibility in simulation-based methodological research.
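A replicate-level bookkeeping scheme for these indicators might look like the following sketch. The record layout, the helper names, and the threshold of 0.2 are assumptions for illustration only; they are not values recommended by Pawel et al. (2024).

```python
from collections import defaultdict

def summarize_failures(records, threshold):
    """Failure rates per (method, condition), relative to all replicates."""
    counts = defaultdict(lambda: {"n": 0, "dgm": 0, "method": 0, "metric": 0})
    for rec in records:
        c = counts[(rec["method"], rec["condition"])]
        c["n"] += 1
        # Attribute each failed replicate to the first layer that failed.
        if not rec["dgm_ok"]:
            c["dgm"] += 1
        elif not rec["method_ok"]:
            c["method"] += 1
        elif not rec["metric_ok"]:
            c["metric"] += 1
    summary = {}
    for key, c in counts.items():
        rates = {k: c[k] / c["n"] for k in ("dgm", "method", "metric")}
        rates["flagged"] = any(r > threshold for r in rates.values())
        summary[key] = rates
    return summary

def rec(method, dgm_ok=True, method_ok=True, metric_ok=True):
    return {"method": method, "condition": "MAR", "dgm_ok": dgm_ok,
            "method_ok": method_ok, "metric_ok": metric_ok}

# Hypothetical replicate-level log: method B fails to converge 3 times in 10.
records = (
    [rec("A")] * 9 + [rec("A", method_ok=False)]
    + [rec("B")] * 6 + [rec("B", method_ok=False)] * 3
    + [rec("B", metric_ok=False)]
)
summary = summarize_failures(records, threshold=0.2)
```

Keeping the unaggregated `records` alongside the summary is what makes it possible to rerun the analysis under alternative handling strategies.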
7. Selected Illustrative Examples
- EHR Matrix Completion: Block removal of large submatrices representing hospitals/sites, combined with Bernoulli (MAR) entrywise deletion to mimic variable patient-record structure (Tan et al., 10 Jun 2025).
- Tensor MNAR: Observed-mask modeled as logistic function of latent tensor values, enabling value-dependent non-ignorability (Zhang et al., 7 Sep 2025).
- IRT Omitted/Not-Reached: Sequential omitted/not-reached item indicators defined via cumulative logit models conditional on a latent “missingness propensity” that is correlated with ability $\theta$ to induce non-ignorable missing data (Guo, 2019).
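The tensor MNAR example can be sketched in a few lines. The rank-2 latent tensor, the noise level, and the logistic coefficients `a` and `b` are illustrative assumptions rather than settings from Zhang et al. (7 Sep 2025), and `tensor_mnar_mask` is a hypothetical helper.

```python
import numpy as np

def tensor_mnar_mask(theta, a, b, rng):
    """Observation mask (True = observed) with logistic dependence on theta."""
    p_obs = 1.0 / (1.0 + np.exp(-(a + b * theta)))
    return rng.random(theta.shape) < p_obs

rng = np.random.default_rng(5)
# Low-rank latent tensor built from three random factor matrices.
U, V, W = (rng.standard_normal((d, 2)) for d in (30, 20, 10))
theta = np.einsum("ir,jr,kr->ijk", U, V, W)

observed = tensor_mnar_mask(theta, a=0.0, b=1.0, rng=rng)
Y = theta + 0.1 * rng.standard_normal(theta.shape)  # noisy measurements
Y_obs = np.where(observed, Y, np.nan)               # unobserved entries set to NaN
```

Since the observation probability increases with the latent value when `b > 0`, the observed entries over-represent large values, which is the value-dependent non-ignorability the example describes.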
References
- (Jackson et al., 2023) "A Complete Characterisation of Structured Missingness"
- (Tan et al., 10 Jun 2025) "Integrated Analysis for Electronic Health Records with Structured and Sporadic Missingness"
- (Zhang et al., 7 Sep 2025) "Generalized Tensor Completion with Non-Random Missingness"
- (Hofert et al., 2024) "Bernoulli amputation"
- (Zhang et al., 2024) "Sensitivity analysis methods for outcome missingness using substantive-model-compatible multiple imputation and their application in causal inference"
- (Heng et al., 2023) "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment"
- (Yin et al., 2015) "Simulation-based Sensitivity Analysis for Non-ignorable Missing Data"
- (Guo, 2019) "An IRT-based Model for Omitted and Not-reached Items"
- (Pawel et al., 2024) "Handling Missingness, Failures, and Non-Convergence in Simulation Studies: A Review of Current Practices and Recommendations"
- (Speekenbrink et al., 2021) "Ignorable and non-ignorable missing data in hidden Markov models"
- (Zhang et al., 2023) "Recoverability and estimation of causal effects under typical multivariable missingness mechanisms"
This body of recent work establishes missingness simulation as a rigorously formalized, empirically essential component of modern statistical research, enabling advances in both methodological robustness and practical data analytic workflows.