
MCMoE: Unified Missing Completion Framework

Updated 28 November 2025
  • The paper introduces a unified framework that combines imputation and expert fusion to efficiently handle missing modalities in diverse datasets.
  • It employs modality-specific experts, adaptive gating, and embedding-based completion to enable robust end-to-end learning in multimodal settings.
  • Empirical results highlight notable performance gains, including up to 38% reduction in MSE and improved latent variable recovery in various applications.

The Missing Completion Framework with Mixture of Experts (MCMoE) is a unified modeling paradigm for addressing incomplete data in multi-source or multimodal settings by integrating flexible imputation and fusion in a mixture-of-experts (MoE) architecture. MCMoE is designed to perform principled completion of missing data—whether modalities, features, or bounded discrete outcomes—while jointly supporting robust learning, inference, and prediction across datasets with arbitrary missingness patterns. Recent MCMoE frameworks instantiate this methodology in domains as diverse as action quality assessment, medical diagnostics, and bounded test score modeling, leveraging structured expert specialization, adaptive gating, and embedding-based completion mechanisms (Xu et al., 21 Nov 2025, Yun et al., 10 Oct 2024, Han et al., 5 Feb 2024, Suen et al., 2023). This approach resolves two key obstacles in classical and neural multimodal learning: catastrophic degradation in the presence of missing data and the inefficiency of conventional two-step “generate-then-fuse” pipelines.

1. Foundational Principles and Motivation

MCMoE unifies the imputation of missing components and the construction of multimodal or multivariate representations within a single-stage modeling framework. The central technical idea is to use a bank of modality- or feature-specific “experts,” each capable of ingesting arbitrary observed/missing patterns. Expert outputs are dynamically fused by a gating (“router”) mechanism, which adapts based on the available subset of observations and can be conditioned on context or covariates.

Classical mixture-of-experts models (Suen et al., 2023) leverage latent class assignments with covariate-dependent gating, while recent neural MCMoE designs generalize to high-dimensional embeddings, per-subset mask conditioning, and flexible fusion strategies (Xu et al., 21 Nov 2025, Yun et al., 10 Oct 2024, Han et al., 5 Feb 2024). This framework naturally accommodates “missing at random” (MAR) in statistical settings and arbitrary missingness or sensor dropout in neural architectures.

Key problems addressed by MCMoE include:

  • Reliable imputation of missing data by combining unimodal expert knowledge and cross-modal contextual information.
  • Graceful degradation under partial input, avoiding the total failure modes of monolithic multimodal networks.
  • End-to-end learning of modality generation, fusion, and supervision, enabling efficient training and deployment in real-world incomplete-data scenarios.

2. Mathematical Formulation and Model Architecture

2.1. Generalized Architecture

Suppose $M$ modalities or dimensions, each possibly missing in any instance. For example, in (Xu et al., 21 Nov 2025), $M = 3$: visual, optical flow, and audio. Denote the observed modalities for sample $i$ as $\tilde{M} \subset M$, and the missing ones as $\bar{M} = M \setminus \tilde{M}$.

The architecture comprises the following components; a minimal code sketch follows the list:

  • Modality encoders: Each present modality is encoded using a domain-specific network; missing modalities are either zero-initialized (Xu et al., 21 Nov 2025), replaced with learned missing tokens (Han et al., 5 Feb 2024), or approximated via a missing-modality bank (Yun et al., 10 Oct 2024).
  • Adaptive modality completion/generation: For each missing item, a generator (e.g., adaptive gated modality generator, AGMG) reconstructs features by cross-attention over available modalities, possibly with gating to control the influence of synthetic signals (Xu et al., 21 Nov 2025).
  • Mixture-of-experts module: Expert MLPs or probabilistic models process the completed feature vector. The router/gating network dynamically assigns mixture weights based on the observed subset, using schemes such as Laplace-gated routing (Han et al., 5 Feb 2024) or set-id-based top-1 selection (Yun et al., 10 Oct 2024).
  • Fusion and output: The expert outputs are composited via gated summation or further fusion nets (e.g., convolutional or transformer-based) for downstream prediction or imputation.
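
To make the data flow concrete, here is a minimal PyTorch-style sketch of such a pipeline for three modalities. It is an illustration under assumed shapes and hyperparameters, not the implementation of any cited paper: the `AGMG` and `MCMoE` class names, the zero-initialized learned missing tokens, the mean-pooled router input, and the single shared generator are all simplifying choices made here.

```python
import torch
import torch.nn as nn


class AGMG(nn.Module):
    """Illustrative adaptive gated modality generator: rebuilds a missing
    modality's features by cross-attending over the encoded slots."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, query_tok, context):
        # query_tok: (B, 1, D) placeholder for the missing modality;
        # context:   (B, M, D) all encoded slots (missing slots hold learned tokens).
        ctx, _ = self.attn(query_tok, context, context)
        return self.gate(ctx) * ctx          # gate downweights unreliable completions


class MCMoE(nn.Module):
    """Single-stage completion + mixture-of-experts head (illustrative only)."""
    def __init__(self, n_modalities=3, dim=256, n_experts=4, out_dim=1):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_modalities)])
        self.missing_tok = nn.Parameter(torch.zeros(n_modalities, dim))    # learned missing tokens
        self.generator = AGMG(dim)
        self.router = nn.Linear(n_modalities + dim, n_experts)             # mask-conditioned gating
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_modalities * dim, dim), nn.ReLU(), nn.Linear(dim, out_dim))
            for _ in range(n_experts)])

    def forward(self, feats, mask):
        # feats: (B, M, D) per-modality features; mask: (B, M), 1 = observed.
        B, M, _ = feats.shape
        enc = torch.stack([self.encoders[m](feats[:, m]) for m in range(M)], dim=1)
        enc = torch.where(mask.unsqueeze(-1).bool(), enc,
                          self.missing_tok.unsqueeze(0).expand_as(enc))
        completed = enc.clone()
        for m in range(M):                   # complete each missing slot from the others
            miss = mask[:, m] == 0
            if miss.any():
                completed[miss, m] = self.generator(enc[miss, m:m + 1], enc[miss]).squeeze(1)
        fused = completed.flatten(1)                                        # (B, M*D)
        gate = torch.softmax(
            self.router(torch.cat([mask.float(), completed.mean(dim=1)], dim=-1)), dim=-1)
        expert_out = torch.stack([e(fused) for e in self.experts], dim=1)   # (B, E, out_dim)
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)                 # gated summation


# Example usage with dummy features (batch of 8; optical-flow channel missing):
model = MCMoE()
x = torch.randn(8, 3, 256)
mask = torch.tensor([[1, 0, 1]] * 8)
score = model(x, mask)                   # (8, 1) prediction despite the missing modality
```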

2.2. Statistical MCMoE and EM-based Inference

In the case of bounded discrete data or outcomes (e.g., neuropsychological test scores), the generative MCMoE model is specified as a finite mixture over latent states/classes, with class posteriors (“gating probabilities”) dependent on covariates (Suen et al., 2023):

$$\pi_k(X_i; \beta) = P(Z_i = k \mid X_i) = \frac{e^{\beta_k^\top (1, X_i)}}{\sum_{\ell=1}^{K} e^{\beta_\ell^\top (1, X_i)}}$$

Given the latent class, the outcomes follow conditionally independent binomial distributions:

$$Y_i \mid \{Z_i = k, X_i\} \sim p_k(Y_i; \theta_k) = \prod_{j=1}^{d} \text{Binomial}(Y_{ij}; N_j, \theta_{k,j})$$

Under MAR, missing data is handled by a nested EM (Monte Carlo EM) algorithm alternating between multiple imputation (outer E-step) and standard EM on imputed data (inner steps), using closed-form local conditional distributions for outcome sampling (Suen et al., 2023).
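
The sketch below illustrates the inner EM computations implied by these formulas in NumPy/SciPy, assuming missing entries of `Y` are coded as `NaN` and `N` is the vector of trial counts $N_j$. Because the outcomes are conditionally independent given the class, missing components simply drop out of the class-conditional likelihood; the update of the gating coefficients $\beta$ (a weighted multinomial logistic regression) and the outer Monte Carlo imputation loop are omitted here.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import binom


def gating_probs(X, beta):
    """pi_k(X_i; beta): multinomial-logit gate. X: (n, p), beta: (K, p + 1)
    with the intercept coefficient first."""
    design = np.column_stack([np.ones(len(X)), X])                 # (n, p + 1)
    logits = design @ beta.T                                       # (n, K)
    return np.exp(logits - logsumexp(logits, axis=1, keepdims=True))


def responsibilities(Y, X, beta, theta, N):
    """E-step: posterior P(Z_i = k | Y_i,obs, X_i). Missing entries (NaN in Y)
    are marginalised out, which is exact here because the outcomes are
    conditionally independent binomials given the class."""
    log_r = np.log(gating_probs(X, beta))                          # (n, K)
    for k in range(theta.shape[0]):
        ll = binom.logpmf(Y, N, theta[k])                          # (n, d); NaN where Y is missing
        log_r[:, k] += np.nansum(ll, axis=1)
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))


def m_step_theta(Y, R, N):
    """Closed-form M-step for the binomial success probabilities theta_{k,j},
    using only the observed entries of Y."""
    obs = ~np.isnan(Y)
    successes = R.T @ np.nan_to_num(Y)                             # (K, d) weighted successes
    trials = R.T @ (obs * N)                                       # (K, d) weighted trial counts
    return successes / np.maximum(trials, 1e-12)
```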

3. Expert Specialization, Gating, and Missing Pattern Conditioning

A defining feature of MCMoE is explicit specialization of experts to particular patterns of observed modalities or to robust imputation within partial contexts.

  • Missing-modality bank and set-indexed experts: In Flex-MoE (Yun et al., 10 Oct 2024), a learned tensor $B_{S,m}$ provides a unique per-set/per-modality embedding for any missing combination, allowing the construction of completed inputs without zero-padding or naive mean imputation.
  • Adaptive expert gating: Two-stage routers are used, with a generalized router ($\mathcal{G}$-Router) trained on full data for balanced knowledge absorption, and a specialized router ($\mathcal{S}$-Router) that, on partial inputs, activates the subset-specific expert with cross-entropy supervision on expert IDs (Yun et al., 10 Oct 2024). Laplace-gating functions (Han et al., 5 Feb 2024) or lightweight MLP routers (Xu et al., 21 Nov 2025) sort and weight expert choices in neural variants.
  • Modality-completion via cross-attention: AGMG performs multi-layer, multi-head cross-attention to transfer contextual cues from available to missing modalities, followed by a learned gating function to downweight unreliable completions (Xu et al., 21 Nov 2025).

This conditioning ensures that, for any observed subset, the system efficiently routes inputs to those experts best equipped for context-specific completion and fusion.
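
As an illustration of set-indexed conditioning, the sketch below keys a learned missing-modality bank and a supervised top-1 router on the integer id of the observed subset. The class name, the mean-pooled router input, and the modulo subset-to-expert target are hypothetical simplifications for exposition, not the exact Flex-MoE design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SetIndexedRouting(nn.Module):
    """Illustrative missing-modality bank B[S, m] plus a specialised router
    supervised to pick a subset-specific expert."""
    def __init__(self, n_modalities: int, dim: int, n_experts: int):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(2 ** n_modalities, n_modalities, dim) * 0.02)
        self.router = nn.Linear(dim, n_experts)
        self.n_experts = n_experts

    def forward(self, feats, mask):
        # feats: (B, M, D) per-modality features; mask: (B, M), 1 = observed.
        weights = 2 ** torch.arange(mask.size(1), device=mask.device)
        set_id = (mask.long() * weights).sum(dim=1)                 # integer id of observed subset
        filled = torch.where(mask.unsqueeze(-1).bool(), feats, self.bank[set_id])
        logits = self.router(filled.mean(dim=1))                    # (B, n_experts)
        top1 = logits.argmax(dim=-1)                                # top-1 expert per sample
        # S-Router-style supervision: cross-entropy against an expert id assigned
        # to each subset (here a toy modulo mapping, purely for illustration).
        aux_loss = F.cross_entropy(logits, set_id % self.n_experts)
        return filled, top1, aux_loss
```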

4. Training Objectives and Unified Optimization

MCMoE models are trained with losses that jointly enforce accurate reconstruction, robust expert fusion, and balanced expert utilization.

Unified training steps typically randomize missingness during optimization, allowing the model to learn conditional completion strategies over all possible missingness patterns (Han et al., 5 Feb 2024, Xu et al., 21 Nov 2025). This replaces traditional two-stage imputation and fusion with a single, efficient optimization pipeline.
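
A minimal sketch of one such single-stage training step follows, assuming the illustrative `MCMoE` module from Section 2 and a scalar regression target; the dropout probability and the plain MSE supervision term (without the reconstruction and load-balancing terms a full system would add) are placeholders.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, feats, target, p_drop=0.3):
    """One optimisation step with randomised modality dropout, so the model is
    exposed to a different observed subset at every step and learns completion
    and fusion jointly (illustrative; loss terms are placeholders)."""
    B, M, _ = feats.shape
    mask = (torch.rand(B, M, device=feats.device) > p_drop).long()
    mask[mask.sum(dim=1) == 0, 0] = 1          # guarantee at least one observed modality
    pred = model(feats, mask).squeeze(-1)      # MCMoE sketch from Section 2
    loss = F.mse_loss(pred, target)            # add reconstruction / load-balancing terms as needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```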

5. Empirical Evaluation and Performance

MCMoE frameworks demonstrate strong empirical gains in scenarios with severe or structured missingness:

  • In multimodal action quality assessment benchmarks with up to three missing channels, MCMoE achieves up to 17% uplift in Spearman's $\rho$ and 38% MSE reduction versus prior incomplete-modality baselines, with minimal parameter count (4.9M) and FLOPs (1.3G) (Xu et al., 21 Nov 2025).
  • Flex-MoE on ADNI (Alzheimer’s) and MIMIC-IV (EHR) datasets attains up to 8–12 point boosts in macro-F1 and AUROC for partial input subsets, substantially outperforming FuseMoE and MulT/LIMoE (Yun et al., 10 Oct 2024).
  • In bounded discrete data, the MCMoE/EM approach provides proper confidence coverage and substantial clustering accuracy, correctly recovering latent disease trajectories that would be missed under complete-case or naive imputation strategies (Suen et al., 2023).
  • Theoretical convergence rates for parameter estimation benefit from Laplace gating, which achieves $O(n^{-1/4})$ rates for expert parameters, faster than softmax-based alternatives and independent of Voronoi cell size (Han et al., 5 Feb 2024).

A comparative summary of model features and empirical results:

| Reference | Domain | Completion Mechanism | Expert Routing | Empirical Advantage |
|---|---|---|---|---|
| (Xu et al., 21 Nov 2025) | AQA, video | AGMG + MoE on fused features | MLP gating, softmax | 11–38% error reduction |
| (Yun et al., 10 Oct 2024) | Health records | Missing-modality bank | Two-stage router | 3–7 AUC points uplift |
| (Suen et al., 2023) | Test scores | Binomial imputation via EM | Multinomial logit gating | >95% CI coverage, clustering |
| (Han et al., 5 Feb 2024) | Multimodal/ML | Token-based modality masking | Laplace gate | Parametric MLE convergence |

6. Adaptations, Generalizations, and Implementation Notes

MCMoE can be adapted from existing MoE architectures to handle missingness in both statistical and neural settings.

Closed-form imputation formulas are available for discrete latent class models under MAR assumptions (Suen et al., 2023), while neural variants attain end-to-end differentiability and are compatible with transformer or convolutional backbones (Yun et al., 10 Oct 2024, Xu et al., 21 Nov 2025).
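
For the discrete latent-class case, one multiple-imputation draw under MAR reduces to sampling the class from its posterior given the observed outcomes and then drawing each missing component from its class-conditional binomial. A sketch, reusing the hypothetical `responsibilities` helper from Section 2:

```python
import numpy as np


def impute_missing(Y, X, beta, theta, N, rng=None):
    """One multiple-imputation draw under MAR: sample Z_i from its posterior
    given the observed outcomes, then draw each missing Y_ij from
    Binomial(N_j, theta[z, j]). Relies on the `responsibilities` sketch above."""
    rng = rng or np.random.default_rng()
    R = responsibilities(Y, X, beta, theta, N)      # (n, K) posterior over latent classes
    Y_imp = Y.copy()
    for i in range(Y.shape[0]):
        z = rng.choice(R.shape[1], p=R[i])          # sample the latent class
        miss = np.isnan(Y[i])
        Y_imp[i, miss] = rng.binomial(N[miss], theta[z, miss])
    return Y_imp
```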

7. Significance and Outlook

The MCMoE framework offers a unifying solution to missing data completion—generalizing from MAR-structured inference for low-dimensional binomial variables (Suen et al., 2023) to high-dimensional, streaming, and multi-source neural architectures (Xu et al., 21 Nov 2025, Yun et al., 10 Oct 2024, Han et al., 5 Feb 2024). Its key strengths include architectural flexibility, theoretically grounded parameter estimation, and empirical robustness under incomplete or irregularly sampled data.

Potential generalizations include extension to arbitrary data types (beyond discrete or feature vectors), integration with variational inference, and adaptation to streaming or non-i.i.d. data. Further research may compare MCMoE with alternative generative or diffusion-based modality completion strategies, and investigate its interpretability, scalability, and convergence behavior in large-scale real-world applications.
