SRUW-MNARz Framework in Clustering

Updated 16 October 2025

The SRUW-MNARz framework is a statistical methodology that extends the SRUW approach by explicitly modeling MNAR missingness that depends on latent class membership.
It employs likelihood-based inference with an EM algorithm and modified imputation strategies; in simulations, offset-based imputation reduces bias by up to 80% compared to standard sequential regression imputation.
The framework also enhances diagnostic testing and high-dimensional variable selection through robust recovery sample designs and automated adjustment estimation.

The SRUW-MNARz framework is a class of statistical methodologies for model-based clustering, inference, and imputation when data are missing not at random (MNAR), with explicit modeling of both the missingness mechanism and the role each variable plays. It extends the standard SRUW approach, which partitions variables into Signal, Redundant, and Uninformative categories, by adding missingness mechanisms, most notably MNARz, in which the pattern of missingness is allowed to depend on latent class membership. The SRUW-MNARz family covers likelihood modeling, imputation procedures, diagnostic tests, sample design strategies, and variable selection principles tailored to MNAR settings, and it is supported by recent theoretical and empirical studies.

1. MNARz Mechanism and SRUW Variable Roles

The MNARz mechanism is characterized by missingness that depends solely on the latent clustering structure (class membership $z$) rather than on the values of the variables themselves. In model-based clustering, for an individual $i$ with $d$ features, the missingness pattern $c_i$ under component $k$ is modeled by

$f_k(c_i | y_i; \psi_k) = \prod_{j=1}^d \rho(\alpha_k)^{c_{ij}} [1-\rho(\alpha_k)]^{1-c_{ij}},$

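where $\rho(\cdot)$ is an inverse link function (for example the logistic or probit CDF) that maps the class-specific parameter $\alpha_k$ to a missingness probability, so $c_{ij}=1$ marks a missing entry. Although the mask density is written conditionally on $y_i$, the right-hand side does not involve $y_i$: the probability that a variable is missing is constant within each latent class. This parsimony is what makes likelihood inference tractable.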
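The mask model above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `logistic` and `mask_loglik`, the logistic choice for $\rho$, and the simulated parameter values are assumptions made here, not part of the cited work.

```python
import numpy as np

def logistic(a):
    """Logistic choice for rho: maps alpha_k to a missingness probability."""
    return 1.0 / (1.0 + np.exp(-a))

def mask_loglik(c, alpha):
    """Log of f_k(c_i; alpha_k) for every row i and class k.

    c     : (n, d) 0/1 array, 1 = entry is missing (assumed convention)
    alpha : (K,) class-specific parameters
    Returns an (n, K) array.
    """
    rho = logistic(np.asarray(alpha, dtype=float))   # (K,)
    d = c.shape[1]
    n_missing = c.sum(axis=1)[:, None]               # (n, 1)
    # Product over j collapses to counts because rho is shared across variables.
    return n_missing * np.log(rho) + (d - n_missing) * np.log1p(-rho)

# Simulate masks for two classes with different missingness rates.
rng = np.random.default_rng(0)
alpha = np.array([-2.0, 0.5])                        # hypothetical values
z = rng.integers(0, 2, size=1000)                    # latent class labels
c = (rng.random((1000, 5)) < logistic(alpha)[z][:, None]).astype(int)
ll = mask_loglik(c, alpha)                           # shape (1000, 2)
```

Because the missingness probability is shared by all variables within a class, the mask contributes to the likelihood only through the number of missing entries in each row.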

SRUW modeling, originally introduced for variable selection in clustering, partitions the variables into three roles:

S (Signal): informative for clustering
R (Redundant): explainable by S variables
U (Uninformative): independent of the latent structure

The MNARz extension allows missingness patterns to inform both cluster assignments and variable roles, thus augmenting the information available for inference.

2. Likelihood Formulation and EM Estimation

The central statistical object in the SRUW-MNARz framework is the joint likelihood of observed data, missingness indicators, and latent class assignments. The complete-data likelihood is

$\ell_{\text{comp}}(\theta; Y, Z, C) = \sum_{i=1}^n \sum_{k=1}^K z_{ik} \log \left[ \pi_k f_k(y_i; \lambda_k) f_k(c_i|y_i; \psi_k) \right],$

where $\pi_k$ is the mixture proportion, $f_k(y_i; \lambda_k)$ is the observed variable density (Gaussian, multinomial, etc.), and $f_k(c_i|y_i; \psi_k)$ encodes the MNARz mechanism.

Inference is performed using an Expectation-Maximization (EM) algorithm:

E-step: compute posterior probabilities $t_{ik} = P(z_{ik}=1 | y_i^{\text{obs}}, c_i; \theta)$ for each observation and class, and conditional expectations of missing values.
M-step: update parameters $\pi_k$, $\lambda_k$, and $\psi_k$ by maximizing the expected complete-data log-likelihood.

Under MNARz, the mask $C$ is treated as an additional observed variable, and the model is equivalent to a MAR formulation on the augmented data $[Y | C]$. This equivalence enables efficient parameter estimation and clustering and avoids the identifiability problems that arise in more general MNAR settings (Sportisse et al., 2021; Ho et al., 25 May 2025).
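The E-step and M-step described above can be sketched for one simple special case. The sketch below assumes a Gaussian mixture with diagonal covariances and a single missingness probability per class; the function name `em_mnarz`, the initialisation, and the simulated data are illustrative assumptions, not the implementation used in the cited papers.

```python
import numpy as np

def em_mnarz(y, K, n_iter=200, seed=0):
    """EM sketch: diagonal-Gaussian mixture with a class-dependent (MNARz) mask.

    y is an (n, d) float array with np.nan marking missing entries.
    Returns (pi, mu, var, rho, t): rho[k] is the missingness probability
    in class k and t[i, k] is the posterior class probability.
    """
    rng = np.random.default_rng(seed)
    n, d = y.shape
    c = np.isnan(y)                      # mask, True = missing
    y0 = np.where(c, 0.0, y)             # placeholder zeros, masked out below
    n_mis = c.sum(axis=1)                # number of missing entries per row
    miss = c[:, None, :]                 # (n, 1, d), broadcasts over classes

    # Crude initialisation from column summaries (an arbitrary choice).
    col_mean, col_var = np.nanmean(y, axis=0), np.nanvar(y, axis=0)
    pi = np.full(K, 1.0 / K)
    mu = col_mean + rng.normal(size=(K, d)) * np.sqrt(col_var)
    var = np.tile(col_var, (K, 1))
    rho = np.full(K, np.clip(c.mean(), 1e-6, 1 - 1e-6))

    for _ in range(n_iter):
        # E-step: posterior t_ik from observed coordinates and the mask.
        log_norm = -0.5 * (np.log(2 * np.pi * var) + (y0[:, None, :] - mu) ** 2 / var)
        log_f_y = np.where(miss, 0.0, log_norm).sum(axis=2)
        log_f_c = n_mis[:, None] * np.log(rho) + (d - n_mis)[:, None] * np.log1p(-rho)
        log_w = np.log(pi) + log_f_y + log_f_c
        log_w -= log_w.max(axis=1, keepdims=True)    # numerical stability
        t = np.exp(log_w)
        t /= t.sum(axis=1, keepdims=True)

        # M-step: missing entries enter through their conditional moments,
        # which under a diagonal covariance are the current mu and var.
        nk = t.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        ey = np.where(miss, mu, y0[:, None, :])      # (n, K, d)
        mu_new = (t[:, :, None] * ey).sum(axis=0) / nk[:, None]
        dev2 = (ey - mu_new) ** 2 + np.where(miss, var, 0.0)
        var = np.maximum((t[:, :, None] * dev2).sum(axis=0) / nk[:, None], 1e-6)
        mu = mu_new
        rho = np.clip((t * n_mis[:, None]).sum(axis=0) / (d * nk), 1e-6, 1 - 1e-6)
    return pi, mu, var, rho, t

# Synthetic example: two classes with different means and missingness rates.
rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=600)
y = rng.normal(loc=np.where(z == 0, -3.0, 3.0)[:, None], size=(600, 3))
y[rng.random((600, 3)) < np.where(z == 0, 0.1, 0.5)[:, None]] = np.nan
pi, mu, var, rho, t = em_mnarz(y, K=2)
```

Note that the mask term enters the E-step exactly like one more observed variable, which is the MAR-on-augmented-data view stated above. Rows with every entry missing are still assigned a posterior, driven by the mask alone.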

3. Imputation Strategies and Missingness Adjustment

Extensions of sequential regression multiple imputation (SRMI) to MNAR settings are central to the SRUW-MNARz framework. The conditional imputation distribution for variable $X_j$ is modified as:

$f(X_j | X_{-j}, R) \propto f(X_j | X_{-j}) \cdot \prod_{k \neq j} f(R_k | X)$

Approximations via Taylor expansions yield additive regression models with missingness indicators ($R_k$) and offset functions (e.g., $Z_k$ built from derivatives of the missingness probability):

For binary $X_1$:

$\text{logit}\left[P(X_1=1|X_{-1}, R)\right] = \omega_0 + \sum_{j=2}^p \omega_j X_j + \sum_{j=2}^p \omega_{R_j} R_j + \text{offset terms}$

For continuous $X_1$:

$X_1 | X_{-1}, R \sim \mathcal{N}\left( \omega_0 + \sum_{k=2}^p \omega_k X_k + \sum_{k=2}^p \omega_{R_k} R_k, \tau^2 \right)$

Simulation studies show that strategies embedding offsets or missing data indicators into the imputation model yield reduced bias in MNAR contexts; the offset approach can reduce bias by up to 80% compared to standard SRMI (Beesley et al., 2021). These approaches directly link the imputation with the fitted missingness models, following the principle of the SRUW-MNARz architecture.
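As an illustration of the indicator-augmented model for continuous $X_1$, the following sketch performs one imputation step. It is a simplification under stated assumptions: the other variables are taken to be currently filled in (as they would be inside a sequential imputation cycle), the offset terms are omitted, and parameters are plugged in rather than drawn from a posterior. The function name `impute_with_indicators` is hypothetical.

```python
import numpy as np

def impute_with_indicators(x1, x_rest, r_rest, rng):
    """One imputation step for continuous X_1 using missingness indicators.

    x1     : (n,) array, np.nan where X_1 is missing
    x_rest : (n, p-1) array of the other variables, currently completed
    r_rest : (n, p-1) 0/1 missingness indicators for the other variables
    Returns a copy of x1 with missing entries replaced by random draws.
    """
    obs = ~np.isnan(x1)
    # Design matrix: intercept, X_{-1}, and the indicators R_{-1}.
    design = np.column_stack([np.ones(len(x1)), x_rest, r_rest])
    omega, *_ = np.linalg.lstsq(design[obs], x1[obs], rcond=None)
    resid = x1[obs] - design[obs] @ omega
    dof = max(int(obs.sum()) - design.shape[1], 1)
    tau2 = resid @ resid / dof                       # residual variance
    out = x1.copy()
    n_mis = int((~obs).sum())
    out[~obs] = design[~obs] @ omega + rng.normal(scale=np.sqrt(tau2), size=n_mis)
    return out

# Small synthetic example (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 300
x_rest = rng.normal(size=(n, 2))
r_rest = (rng.random((n, 2)) < 0.3).astype(float)
x1_full = 1.0 + x_rest @ np.array([0.5, -0.2]) + 0.8 * r_rest[:, 0] + rng.normal(size=n)
x1 = x1_full.copy()
x1[rng.random(n) < 0.25] = np.nan
x1_imputed = impute_with_indicators(x1, x_rest, r_rest, rng)
```

A full multiple-imputation procedure would also draw the regression parameters, cycle over all incomplete variables, and repeat the process to produce several completed datasets.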

4. Diagnostic Testing and Sample Design

The diagnostic phase of SRUW-MNARz is formalized by score tests that distinguish MAR from MNAR.

These tests compare $H_0: \gamma=0$ (MAR) against $H_1: \gamma \neq 0$ (MNAR) in a missingness model such as

$P(D = 1 | X, Y) = \pi(X' \beta + \gamma Y)$

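Here $D$ is the binary indicator governing whether $Y$ is observed, and $\pi(\cdot)$ is an inverse link function (not to be confused with the mixing proportions $\pi_k$). The innovation is that, even when the alternative is MNAR, these tests require estimation only under MAR, which circumvents nonidentifiability.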
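A generic Rao score test for $\gamma = 0$ with a logistic link can be sketched as follows. This is a textbook construction, not the specific procedure of the cited paper: it assumes, hypothetically, that $Y$ is available for every unit in the analysis set (for example after follow-up recovery), and the function names `fit_logistic` and `score_test_mnar` are introduced here for illustration.

```python
import math
import numpy as np

def fit_logistic(X, d, n_iter=25):
    """Newton-Raphson fit of the null (MAR) model P(D=1|X) = expit(X @ beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (d - p))
    return beta

def score_test_mnar(X, y, d):
    """Score test of H0: gamma = 0 in P(D=1|X,Y) = expit(X @ beta + gamma * Y).

    Only the null model is fitted. Returns (statistic, p_value), with the
    statistic referred to a chi-square distribution with 1 degree of freedom.
    """
    beta = fit_logistic(X, d)
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    u = np.sum((d - p) * y)                          # score for gamma at gamma = 0
    i_gg = np.sum(w * y * y)
    i_gb = X.T @ (w * y)
    i_bb = X.T @ (w[:, None] * X)
    v = i_gg - i_gb @ np.linalg.solve(i_bb, i_gb)    # variance adjusted for beta
    stat = u * u / v
    p_value = math.erfc(math.sqrt(stat / 2.0))       # chi-square(1) upper tail
    return stat, p_value

# Synthetic check with a strongly Y-dependent (MNAR) mechanism.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
prob = 1.0 / (1.0 + np.exp(-(0.3 * x + 1.5 * y)))
d = (rng.random(n) < prob).astype(float)
stat, p_value = score_test_mnar(X, y, d)
```

The key property illustrated is that no parameter of the MNAR model is ever estimated: the statistic uses only the fitted values of the MAR model and the recovered $Y$ values.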

Optimal recovery sample designs maximize the power of MNAR tests and control Type I error by selecting missing values with covariates in a chosen region $C_A$ and, for non-logit link functions, additionally subsampling observed values (Noonan et al., 2022). For the logit link, including all observed responses in $C_A$ preserves the structure of the missingness mechanism. Algorithms that control the choice of recovery region and the subsampling proportion are used to ensure valid inference and efficient use of follow-up resources.

Simulation studies demonstrate 15–25% power improvements for MNAR detection when optimized recovery regions are used versus random sampling (Noonan et al., 2022).

5. Automatic Adjustment Estimation and Data-Driven Sensitivity

Standard MNAR sensitivity analyses often rely on ad hoc, user-supplied adjustment parameters. The Random Indicator (RI) method instead estimates the adjustment automatically: it iteratively generates pseudo missingness indicators $\dot{R}$ and cross-classifies the incomplete variable $X$ by both $R$ and $\dot{R}$.

Means within each $(R, \dot{R})$ group are used to estimate the adjustment parameter $\delta_{\text{adj}}$, which characterizes the shift between observed and missing data.
The method is summarized by the equation:

$E(X_1 | Z, R, \dot{R}=r) = Z^T \phi + \delta_{\text{adj}}(r-\text{const})$

Simulation and data analyses confirm that RI-imputed estimates maintain nominal coverage and low bias under MNAR, ameliorating reliance on subjectively chosen sensitivity parameters (Jolani et al., 22 Apr 2024).

A plausible extension is to integrate RI-style adjustment estimation into the SRUW-MNARz framework, which would provide fully data-driven MNAR corrections.

6. Variable Selection and High-Dimensional Applications

High-dimensional applications, particularly transcriptomics, motivate unified SRUW-MNARz frameworks combining penalized clustering, variable selection, and explicit modeling of missingness–class relationships:

Penalized likelihood (e.g., adaptive LASSO) numerically ranks variables for clustering relevance.
Role assignment partitions variables into S/R/U using BIC criteria and ranking sequence, with theoretical guarantees for selection and clustering consistency.
Explicit joint modeling of missingness and class membership ensures asymptotic identification even in the presence of complex patterns of MNAR data (Ho et al., 25 May 2025).

Empirical benchmarks and real-world applications confirm improved clustering accuracy and variable recovery over methods assuming MAR or ignoring missingness patterns.

7. Applications, Scope, and Limitations

SRUW-MNARz methodologies have been validated using synthetic and real data, including clinical registries (Traumabase, breast cancer, elderly blood pressure) and social science datasets (NLSY79). Cluster-dependent missingness rates inform group assignment and improve imputation, as well as the selection of relevant covariates.

Limitations arise in cases where missingness depends on both variable values and latent class membership (e.g., the general MNAR$y^kz^j$ models), which can introduce identifiability complexities and require more elaborate modeling than the parsimonious MNARz case. Furthermore, optimization of recovery designs, accurate modeling of offset parameters, and robustness to model misspecification remain areas of ongoing research.

The SRUW-MNARz framework offers a coherent, theoretically justified, and empirically validated set of tools for inference and clustering in the presence of MNAR data, integrating data augmentation, likelihood-based imputation, diagnostic testing, and scalable variable selection. Its principled handling of missingness ensures more reliable results in both low- and high-dimensional settings, with clear extensions emerging from recent literature.