
SRUW-MNARz Framework in Clustering

Updated 16 October 2025
  • SRUW-MNARz Framework is a statistical methodology that extends the SRUW approach by explicitly modeling MNAR missingness based on latent class membership.
  • It employs likelihood-based inference with an EM algorithm and modified imputation strategies, with simulations showing bias reductions of up to 80% compared to standard methods.
  • The framework also enhances diagnostic testing and high-dimensional variable selection through robust recovery sample designs and automated adjustment estimation.

The SRUW-MNARz Framework refers to a class of statistical methodologies designed to handle model-based clustering, inference, and imputation in the presence of missing not at random (MNAR) data, with explicit integration of missingness mechanisms and variable roles. The framework builds upon extensions of the standard SRUW approach (partitioning variables into Signal, Redundant, and Uninformative categories) by introducing mechanisms—most notably MNARz—in which the pattern of missingness is allowed to depend on latent class membership. The SRUW-MNARz family encompasses likelihood modeling, imputation procedures, diagnostic tests, sample design strategies, and variable selection principles tailored to MNAR environments, as substantiated by recent theoretical and empirical studies.

1. MNARz Mechanism and SRUW Variable Roles

The MNARz mechanism is characterized by missingness that depends solely on the latent clustering structure (class membership $z$) rather than the explicit values of the variables themselves. In model-based clustering, for an individual $i$ with $d$ features, the missing pattern $c_i$ under component $k$ is modeled by

$$f_k(c_i \mid y_i; \psi_k) = \prod_{j=1}^d \rho(\alpha_k)^{c_{ij}} [1-\rho(\alpha_k)]^{1-c_{ij}},$$

where $\rho(\cdot)$ is a link function (logit or probit), and $\alpha_k$ is a class-specific parameter. This parsimonious approach leads to tractable likelihood inference, as the probability of missingness for a variable is constant within each latent class.
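As a minimal sketch of the mask model above, the following Python function evaluates $\log f_k(c_i; \alpha_k)$ for every observation and class. It assumes that $c_{ij} = 1$ marks a missing entry and that $\rho$ is the logistic function (the inverse of the logit link); the function names are hypothetical and not taken from any cited implementation.

```python
import numpy as np

def logistic(a):
    """Inverse logit, used here as the assumed form of rho."""
    return 1.0 / (1.0 + np.exp(-a))

def mnarz_mask_loglik(C, alpha):
    """Log-probability of each missingness pattern under each class.

    C     : (n, d) binary mask, 1 = missing (assumed convention).
    alpha : (K,) class-specific parameters alpha_k.
    Returns an (n, K) array with entry [i, k] = log f_k(c_i; alpha_k).
    """
    C = np.asarray(C, dtype=float)
    rho = logistic(np.asarray(alpha, dtype=float))   # (K,) missingness probabilities
    d = C.shape[1]
    n_missing = C.sum(axis=1, keepdims=True)         # (n, 1) missing count per row
    # Because rho is constant within a class, only the count of missing
    # entries per row matters, not which entries are missing.
    return n_missing * np.log(rho) + (d - n_missing) * np.log1p(-rho)
```

Because the per-variable missingness probability is constant within a class, the pattern enters the likelihood only through the number of missing entries in each row, which is what makes the mechanism so parsimonious.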

SRUW modeling, originally introduced for variable selection in clustering, partitions the variables into three roles:

  • S (Signal): informative for clustering
  • R (Redundant): explainable by S variables
  • U (Uninformative): independent of the latent structure

The MNARz extension allows missingness patterns to inform both cluster assignments and variable roles, thus augmenting the information available for inference.

2. Likelihood Formulation and EM Estimation

The central statistical object in the SRUW-MNARz framework is the joint likelihood of observed data, missingness indicators, and latent class assignments. The complete-data likelihood is

$$\ell_{\text{comp}}(\theta; Y, Z, C) = \sum_{i=1}^n \sum_{k=1}^K z_{ik} \log \left[ \pi_k f_k(y_i; \lambda_k) f_k(c_i \mid y_i; \psi_k) \right],$$

where $\pi_k$ is the mixture proportion, $f_k(y_i; \lambda_k)$ is the observed variable density (Gaussian, multinomial, etc.), and $f_k(c_i \mid y_i; \psi_k)$ encodes the MNARz mechanism.

Inference is performed using an Expectation-Maximization (EM) algorithm:

  • E-step: compute posterior probabilities $t_{ik} = P(z_{ik}=1 \mid y_i^{\text{obs}}, c_i; \theta)$ for each observation and class, and conditional expectations of missing values.
  • M-step: update parameters $\pi_k$, $\lambda_k$, and $\psi_k$ by maximizing the expected complete-data log-likelihood.

Under MNARz, the observed mask $C$ is treated as an additional observed variable and the model is equivalent to a MAR formulation on the augmented data $[Y \mid C]$. This equivalence enables efficient parameter estimation and clustering, avoiding identifiability problems present in more general MNAR settings (Sportisse et al., 2021, Ho et al., 25 May 2025).
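The EM iteration described in this section can be sketched for one simple special case: a Gaussian mixture with diagonal covariances and one missingness probability $\rho_k$ per class. Under these assumptions (which are made here for illustration and are not the general model of the cited papers), the missing entries integrate out in closed form, so the E-step only needs the responsibilities $t_{ik}$ and no explicit conditional expectations. The function name, the random initialisation, and the direct parameterisation by $\rho_k$ rather than $\alpha_k$ are all hypothetical choices.

```python
import numpy as np

def em_mnarz_diag(Y, K, n_iter=200, seed=0, eps=1e-6):
    """EM for a diagonal Gaussian mixture with class-dependent (MNARz) missingness.

    Y is an (n, d) float array with np.nan marking missing entries.
    Returns (pi, mu, var, rho, T); rho[k] is the missingness probability in class k.
    """
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    obs = (~np.isnan(Y)).astype(float)      # 1 = observed, 0 = missing
    Yf = np.where(obs == 1, Y, 0.0)         # zero-filled; zeros never count as data
    n_miss = d - obs.sum(axis=1)            # missing count per row

    T = rng.dirichlet(np.ones(K), size=n)   # random initial responsibilities (n, K)
    for _ in range(n_iter):
        # M-step: weighted moments over observed entries only.
        Nk = T.sum(axis=0) + eps
        pi = Nk / Nk.sum()
        W = T.T @ obs + eps                  # (K, d) effective observed weight
        mu = (T.T @ Yf) / W
        var = np.maximum((T.T @ Yf**2) / W - mu**2, eps)
        rho = np.clip((T.T @ n_miss) / (d * Nk), eps, 1 - eps)

        # E-step: the mask acts as extra observed data alongside y_obs.
        diff2 = (Yf[:, None, :] - mu[None]) ** 2
        ll_y = -0.5 * (obs[:, None, :]
                       * (np.log(2 * np.pi * var)[None] + diff2 / var[None])).sum(axis=2)
        ll_c = n_miss[:, None] * np.log(rho) + (d - n_miss)[:, None] * np.log1p(-rho)
        logp = np.log(pi) + ll_y + ll_c
        logp -= logp.max(axis=1, keepdims=True)   # stabilise before exponentiating
        T = np.exp(logp)
        T /= T.sum(axis=1, keepdims=True)
    return pi, mu, var, rho, T
```

The term `ll_c` is where the missingness pattern informs cluster assignment: two rows with identical observed values but different numbers of missing entries can receive different responsibilities when the estimated $\rho_k$ differ across classes.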

3. Imputation Strategies and Missingness Adjustment

Extensions of sequential regression multiple imputation (SRMI) to MNAR settings are central to the SRUW-MNARz framework. The conditional imputation distribution for variable $X_j$ is modified as:

$$f(X_j \mid X_{-j}, R) \propto f(X_j \mid X_{-j}) \cdot \prod_{k \neq j} f(R_k \mid X)$$

Approximations via Taylor expansions yield additive regression models with missingness indicators ($R_k$) and offset functions (e.g., $Z_k$ built from derivatives of the missingness probability):

  • For binary $X_1$:

$$\text{logit}\left[P(X_1=1 \mid X_{-1}, R)\right] = \omega_0 + \sum_{j=2}^p \omega_j X_j + \sum_{j=2}^p \omega_{R_j} R_j + \text{offset terms}$$

  • For continuous $X_1$:

$$X_1 \mid X_{-1}, R \sim \mathcal{N}\left( \omega_0 + \sum_k \omega_k X_k + \sum_k \omega_{R_k} R_k, \tau^2 \right)$$

Simulation studies show that strategies embedding offsets or missing data indicators into the imputation model yield reduced bias in MNAR contexts; the offset approach can reduce bias by up to 80% compared to standard SRMI (Beesley et al., 2021). These approaches directly link the imputation with the fitted missingness models, following the principle of the SRUW-MNARz architecture.
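To make the indicator-augmented model for a continuous variable concrete, the sketch below performs one imputation step for a single column: it regresses the column on the other (currently filled-in) variables plus their missingness indicators, then draws imputations from the fitted normal distribution. This is a simplified illustration under several assumptions: `R` uses 1 for missing, the offset terms are omitted, and the regression parameters are plugged in rather than drawn from their posterior, so it is not a proper multiple-imputation step. The function name is hypothetical.

```python
import numpy as np

def impute_continuous_with_indicators(X_filled, R, j, rng):
    """One SRMI-style update of column j using missingness indicators as covariates.

    X_filled : (n, p) data with current imputations in the missing slots.
    R        : (n, p) indicators, 1 = originally missing (assumed convention).
    j        : index of the continuous column to re-impute.
    rng      : numpy random Generator used for the imputation draws.
    """
    n, p = X_filled.shape
    others = [k for k in range(p) if k != j]
    # Design matrix: intercept, other variables, and their missingness indicators.
    design = np.column_stack([np.ones(n), X_filled[:, others], R[:, others]])
    obs = R[:, j] == 0
    # Least squares on rows where column j is observed (handles rank deficiency).
    coef, *_ = np.linalg.lstsq(design[obs], X_filled[obs, j], rcond=None)
    resid = X_filled[obs, j] - design[obs] @ coef
    dof = max(obs.sum() - design.shape[1], 1)
    tau = np.sqrt(resid @ resid / dof)       # residual standard deviation
    out = X_filled.copy()
    mis = ~obs
    out[mis, j] = design[mis] @ coef + rng.normal(0.0, tau, mis.sum())
    return out
```

In a full sequential procedure this step would be cycled over all incomplete columns and repeated until the imputations stabilise, with a logistic analogue used for binary variables.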

4. Diagnostic Testing and Sample Design

Component frameworks such as score tests for distinguishing MAR from MNAR formalize the diagnostic phase of SRUW-MNARz:

  • Score tests compare $H_0: \gamma=0$ (MAR) against $H_1: \gamma \neq 0$ (MNAR) in missingness models, e.g.,

$$P(D = 1 \mid X, Y) = \pi(X' \beta + \gamma Y)$$

  • The innovation is that, even under MNAR, these tests require only estimation under MAR, circumventing nonidentifiability.
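The idea that only the MAR fit is needed can be illustrated with a generic Rao score test for $\gamma = 0$ in a logistic missingness model. The sketch below assumes a logit link and, importantly, assumes that $Y$ is available for every unit (as in a simulation or after complete follow-up); it does not implement the recovery-sample designs of Noonan et al. The function names are hypothetical.

```python
import math
import numpy as np

def fit_logistic(X, d, n_iter=50):
    """Newton-Raphson fit of logit P(D=1|X) = X beta, i.e. the MAR (null) model."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (d - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def score_test_mar(X, y, d):
    """Score test of gamma = 0 in logit P(D=1|X,Y) = X beta + gamma Y.

    X : (n, q) design matrix including an intercept column.
    y : (n,) outcome values, assumed known for all units in this sketch.
    d : (n,) binary indicator D.
    Returns (statistic, p_value) using a chi-square reference with 1 df.
    """
    beta = fit_logistic(X, d)                 # estimation under MAR only
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    U = y @ (d - p)                           # score for gamma at gamma = 0
    XtWX = X.T @ (X * w[:, None])
    XtWy = X.T @ (w * y)
    # Variance of the score after adjusting for the estimated beta.
    V = y @ (w * y) - XtWy @ np.linalg.solve(XtWX, XtWy)
    stat = U**2 / V
    p_value = math.erfc(math.sqrt(stat / 2.0))   # survival function of chi-square(1)
    return stat, p_value
```

Note that the model containing $\gamma$ is never fitted: the statistic is built entirely from quantities evaluated at the MAR estimate, which is how the test sidesteps nonidentifiability.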

Optimal recovery sample designs maximize the power of MNAR tests and control Type I error by selecting missing values with covariates in a chosen region $C_A$ and, for non-logit link functions, additionally subsampling observed values (Noonan et al., 2022). For the logit link, including all observed responses in $C_A$ preserves the missing mechanism structure. Algorithms controlling recovery region selection and subsampling proportion are used to ensure valid inference and efficient use of follow-up resources.

Simulation studies demonstrate 15–25% power improvements for MNAR detection when optimized recovery regions are used versus random sampling (Noonan et al., 2022).

5. Automatic Adjustment Estimation and Data-Driven Sensitivity

Standard MNAR sensitivity analyses often rely on ad hoc, user-supplied adjustment parameters. The Random Indicator (RI) method offers automatic estimation by iteratively generating pseudo missingness indicators $\dot{R}$ and cross-classifying the incomplete variable $X$ by both $R$ and $\dot{R}$:

  • Means within each $(R, \dot{R})$ group are used to estimate the adjustment parameter $\delta_{\text{adj}}$ characterizing the shift between observed and missing data.
  • This methodology is encapsulated by equations:

$$E(X_1 \mid Z, R, \dot{R}=r) = Z^T \phi + \delta_{\text{adj}}(r-\text{const})$$

Simulation and data analyses confirm that RI-imputed estimates maintain nominal coverage and low bias under MNAR, ameliorating reliance on subjectively chosen sensitivity parameters (Jolani et al., 22 Apr 2024).

A plausible implication is that RI-style adjustment estimation could be integrated with the SRUW-MNARz framework to provide fully data-driven MNAR corrections.

6. Variable Selection and High-Dimensional Applications

High-dimensional applications, particularly transcriptomics, motivate unified SRUW-MNARz frameworks combining penalized clustering, variable selection, and explicit modeling of missingness–class relationships:

  • Penalized likelihood (e.g., adaptive LASSO) numerically ranks variables for clustering relevance.
  • Role assignment partitions variables into S/R/U using BIC criteria and ranking sequence, with theoretical guarantees for selection and clustering consistency.
  • Explicit joint modeling of missingness and class membership ensures asymptotic identification even in the presence of complex patterns of MNAR data (Ho et al., 25 May 2025).
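One piece of the role-assignment step can be sketched with a BIC comparison that decides whether a single variable is better described as redundant (a linear regression on the signal variables) or uninformative (a marginal Gaussian unrelated to them). This is a deliberately reduced illustration: it assumes complete data, a fixed candidate signal set, and the convention BIC = 2 log-likelihood minus (number of parameters) times log n, so larger is better. It is not the full ranking and selection procedure of Ho et al., and the function names are hypothetical.

```python
import numpy as np

def gaussian_bic(resid, n_params):
    """BIC (larger is better) of a Gaussian model given its residuals at the MLE."""
    n = resid.size
    sigma2 = resid @ resid / n                       # ML variance estimate
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return 2.0 * loglik - n_params * np.log(n)

def redundant_or_uninformative(x, XS):
    """Compare role R (regression on signal variables) against role U (independent).

    x  : (n,) candidate variable.
    XS : (n, s) matrix of variables currently treated as signal (S).
    Returns (role, bic_regression, bic_independent).
    """
    n = x.size
    design = np.column_stack([np.ones(n), XS])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    # Regression model: one coefficient per design column plus a variance.
    bic_reg = gaussian_bic(x - design @ coef, design.shape[1] + 1)
    # Independent model: a mean and a variance.
    bic_ind = gaussian_bic(x - x.mean(), 2)
    role = "R" if bic_reg > bic_ind else "U"
    return role, bic_reg, bic_ind
```

For example, a variable constructed as a noisy multiple of a signal variable would be labelled `"R"`, whereas a variable uncorrelated with the signal set would be labelled `"U"` because the extra regression coefficients are penalised without improving the fit.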

Empirical benchmarks and real-world applications confirm improved clustering accuracy and variable recovery over methods assuming MAR or ignoring missingness patterns.

7. Applications, Scope, and Limitations

SRUW-MNARz methodologies have been validated using synthetic and real data, including clinical registries (Traumabase, breast cancer, elderly blood pressure) and social science datasets (NLSY79). Cluster-dependent missingness rates inform group assignment and improve imputation, as well as the selection of relevant covariates.

Limitations arise in cases where missingness depends on both variable values and latent class membership (e.g., general MNAR$y^k z^j$ models), which can introduce identifiability complexities and require more elaborate modeling than the parsimonious MNARz case. Furthermore, optimization of recovery designs, accurate modeling of offset parameters, and robustness to model misspecification remain areas of ongoing research.

The SRUW-MNARz framework offers a coherent, theoretically justified, and empirically validated set of tools for inference and clustering in the presence of MNAR data, integrating data augmentation, likelihood-based imputation, diagnostic testing, and scalable variable selection. Its principled handling of missingness ensures more reliable results in both low- and high-dimensional settings, with clear extensions emerging from recent literature.
