Conf-SMoE: Robust Multimodal Mixture-of-Experts
- The paper introduces a two-stage imputation mechanism and confidence-guided gating to overcome missing modalities and expert collapse in SMoE architectures.
- It employs a pre-imputation stage that averages pooled modality embeddings and a post-imputation stage that refines them via sparse cross-attention, so that only strong cross-modal correlations shape the imputed values.
- Empirical evaluations on real-world benchmarks demonstrate significant performance gains in F1 and AUC, validating its robust expert specialization.
Conf-SMoE is an extension of sparse Mixture-of-Experts (SMoE) architectures that addresses the challenge of missing modalities and expert collapse in multimodal learning scenarios. By integrating a two-stage imputation mechanism and a confidence-guided gating strategy, Conf-SMoE enables robust modality fusion and maintains expert diversity and specialization without additional load balancing terms. This approach is empirically validated on multiple real-world datasets and offers a theoretical framework clarifying the mechanics of expert collapse and the effectiveness of the confidence-guided gating mechanism (2505.19525).
1. SMoE Architectures, Multimodal Settings, and the Missing-Modality Challenge
Sparse Mixture-of-Experts (SMoE) models augment deep learning backbones by replacing each feed-forward layer with $N$ distinct experts $\{E_1, \dots, E_N\}$. A lightweight router computes, for each token embedding $x$, a set of routing scores $g_i(x)$, usually normalized via a softmax. Only the top-$k$ experts per token are activated:

$$y = \sum_{i \in \mathrm{Top}\text{-}k(g(x))} g_i(x)\, E_i(x)$$
In multimodal settings (e.g., combining language, vision, and audio), experts often specialize in different modalities or modality combinations. When modalities are missing (e.g., sensor failures, privacy restrictions), a router expecting complete input struggles, leading to truncation or nonsensical routing. This often results in "expert collapse," where most tokens are routed to a small subset of experts, harming diversity and generalization.
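To make the routing step concrete, here is a minimal PyTorch sketch of a standard SMoE layer with a softmax top-$k$ router. The class name `SMoELayer`, the expert width, and `k = 2` are illustrative assumptions, not details taken from the paper:

```python
# Illustrative top-k softmax router for a standard SMoE layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMoELayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # lightweight router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); scores g_i(x) normalized via softmax
        scores = F.softmax(self.router(x), dim=-1)
        top_g, top_idx = scores.topk(self.k, dim=-1)   # activate only top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += top_g[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```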
2. Two-Stage Imputation for Missing Modalities
To address missing modalities, Conf-SMoE employs a two-stage imputation block preceding the expert layers.
- Pre-Imputation (Common Structure): For every missing modality $m$, a modality pool $\mathcal{P}_m$ containing all training-set embeddings of that modality is maintained. $K$ random embeddings are drawn and averaged to obtain

$$\tilde{x}_m = \frac{1}{K} \sum_{k=1}^{K} x_m^{(k)}, \qquad x_m^{(k)} \sim \mathcal{P}_m.$$
This average retains modality-wise statistics while discarding instance-specific idiosyncrasies.
- Post-Imputation (Cross-Modal Refinement): The pre-imputed matrix $\tilde{X}_m$ undergoes sparse cross-attention with the existing modalities. For each observed modality $o$, specialized tokens from $X_o$ are aggregated through the top-$k$ experts and refined with

$$\hat{X}_m = \tilde{X}_m + \mathrm{softmax}\!\left(\mathrm{TopK}_k\!\left(\frac{Q K^\top}{\sqrt{d}}\right)\right) V, \qquad Q = \tilde{X}_m W_Q,\; K = X_o W_K,\; V = X_o W_V.$$

Only the top-$k$ keys per query are kept, ensuring that only strong cross-modal correlations influence the imputed values. The complete set of real and imputed modalities is then concatenated and processed by the SMoE layers (both stages are sketched below).
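A minimal sketch of both imputation stages, assuming a stored modality pool and hypothetical projection matrices `W_q`, `W_k`, `W_v`; the function names and default values are illustrative, not the authors' implementation:

```python
# Sketch of the two-stage imputation: pooled averaging, then sparse cross-attention.
import torch
import torch.nn.functional as F

def pre_impute(pool: torch.Tensor, n_tokens: int, K: int = 8) -> torch.Tensor:
    # pool: (pool_size, d) -- all training-set embeddings of the missing modality
    idx = torch.randint(0, pool.size(0), (K,))
    avg = pool[idx].mean(dim=0)              # modality-wise average, instance details washed out
    return avg.expand(n_tokens, -1).clone()  # fill the missing modality's token slots

def sparse_cross_attend(x_miss, x_obs, W_q, W_k, W_v, top_k: int = 4):
    # x_miss: (n_m, d) pre-imputed queries; x_obs: (n_o, d) observed keys/values
    q, k, v = x_miss @ W_q, x_obs @ W_k, x_obs @ W_v
    logits = q @ k.T / q.size(-1) ** 0.5     # (n_m, n_o) cross-modal affinities
    kth = logits.topk(min(top_k, logits.size(-1)), dim=-1).values[:, -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))  # keep top-k keys per query
    return x_miss + F.softmax(logits, dim=-1) @ v             # residual refinement
```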
3. Confidence-Guided Expert Gating
Conventional softmax routers create sharp distributions, predisposing some experts to receive disproportionately more gradient updates, resulting in expert collapse. Conf-SMoE introduces a decoupled, confidence-driven gating system:
- Each expert $E_i$ includes a confidence network yielding a logit $z_i(x)$ per token embedding $x$. Confidence is computed as

$$c_i(x) = \sigma\big(z_i(x)\big),$$

where $\sigma$ is the sigmoid function.
- These $c_i$ values are not normalized globally and do not enforce $\sum_i c_i(x) = 1$, eliminating inter-expert competition. Top-$k$ selection is still retained for computational efficiency.
- During training, $c_i$ is supervised by regressing onto the true task confidence $c_i^\star$:

$$\mathcal{L}_{\mathrm{conf}} = \big(c_i(x) - c_i^\star\big)^2.$$
This decoupled gating allows each expert to develop its own selection dynamics independent of the others.
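A minimal sketch of this decoupled gate, assuming one linear sigmoid confidence head per expert and an externally supplied target `c_star`; shapes and names are illustrative:

```python
# Decoupled confidence gating: per-expert sigmoid heads, no global normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # independent confidence network per expert -- no shared softmax competition
        self.conf_heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(num_experts))

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model); z_i(x) -> c_i(x) = sigmoid(z_i(x))
        logits = torch.cat([head(x) for head in self.conf_heads], dim=-1)
        conf = torch.sigmoid(logits)                # each c_i in (0, 1), no sum-to-one constraint
        top_c, top_idx = conf.topk(self.k, dim=-1)  # top-k kept only for efficiency
        return conf, top_c, top_idx

def confidence_loss(conf: torch.Tensor, c_star: torch.Tensor) -> torch.Tensor:
    # regress predicted confidences onto the true task confidence targets
    return F.mse_loss(conf, c_star)
```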
4. Theoretical Mechanisms: Avoiding Expert Collapse
Softmax-based routers induce gradients with strong cross-expert coupling via the covariance structure of the softmax Jacobian, especially for sharp outputs:

$$\frac{\partial p_i}{\partial z_j} = p_i\left(\delta_{ij} - p_j\right).$$

Here, $p = \mathrm{softmax}(z)$ is the softmax output. For sharp $p$, gradients concentrate only on the top expert, amplifying the rich-get-richer effect.
Load-balance losses based on entropy are classically introduced to mitigate collapse, but their gradients often oppose the main update direction, causing instability.
In contrast, the confidence-guided sigmoidal gating yields a local Jacobian with no cross-expert terms:

$$\frac{\partial c_i}{\partial z_j} = \delta_{ij}\, c_i \left(1 - c_i\right).$$
This structure enables independent expert specialization and prevents starvation, removing the need for auxiliary balancing losses.
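The contrast between the two Jacobians can be checked numerically; the snippet below uses PyTorch's autograd on arbitrary logits and is purely illustrative:

```python
# Numerical check of the two Jacobians above on sharp logits.
import torch
from torch.autograd.functional import jacobian

z = torch.tensor([4.0, 1.0, 0.5])  # sharp logits: one dominant expert

J_softmax = jacobian(lambda t: torch.softmax(t, dim=-1), z)
J_sigmoid = jacobian(torch.sigmoid, z)

print(J_softmax)  # J_ij = p_i(delta_ij - p_j): nonzero off-diagonals couple the experts
print(J_sigmoid)  # diag(c_i(1 - c_i)): zero off-diagonals, experts update independently
```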
5. Empirical Validation
Conf-SMoE's performance has been systematically evaluated on four real-world multimodal benchmarks—MIMIC-III, MIMIC-IV, CMU-MOSI, and CMU-MOSEI—across three regimes: natural missing modalities, random dropout, and asymmetric dropout. Representative results from MIMIC-IV (natural missingness) demonstrate substantial improvement:
| Model | F₁ (%) | AUC (%) |
|---|---|---|
| FuseMoE-L | 40.21 | 78.05 |
| FlexMoE | 35.29 | 80.45 |
| ConfMoE-T (Conf-SMoE) | 49.18 | 85.24 |
| ConfMoE-E (Conf-SMoE) | 48.32 | 85.09 |
On CMU-MOSI with 50% modalities dropped, ConfMoE-T achieves F₁ ≈ 43.9, outperforming FlexMoE (41.9) and standard SMoE (41.5). Under asymmetric dropout, Conf-SMoE retains superior performance for single-modality scenarios.
Ablation studies indicate that omitting the imputation block decreases F₁ by 6 points and AUC by 2 points, while removing confidence gating reduces F₁ by 4 points and degrades expert diversity. Alternative gating mechanisms (softmax with load balance, mean selection, Gaussian, Laplacian) underperform and exhibit oscillatory expert selection.
6. Training and Implementation
A typical Conf-SMoE training epoch proceeds as follows:
- For each sample and missing modality $m$, $K$ training embeddings are drawn from the modality pool $\mathcal{P}_m$ and averaged to form $\tilde{x}_m$.
- Observed and imputed modalities are encoded and passed to SMoE layers with confidence-based gating.
- After the expert layer, refinement is applied by sparse cross-attention, integrating cross-modal context into missing modality tokens.
- All modality tokens are concatenated and passed to the task head.
- The loss for each sample includes both the primary task loss (e.g., cross-entropy) and the confidence supervision loss.
- Parameters of experts and confidence networks are updated via backpropagation.
No explicit balancing loss or expert load regularizer is necessary. At inference, the gating proceeds solely via learned confidence scores.
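Putting the pieces together, here is a hedged sketch of one training step following the loop above. The model interface, the construction of `c_star`, and the weighting `lambda_conf` are assumptions for illustration; only the "task loss plus confidence regression, no balancing loss" structure comes from the text:

```python
# One training step: primary task loss + confidence supervision, nothing else.
import torch.nn.functional as F

def training_step(model, batch, optimizer, lambda_conf: float = 1.0):
    x, y, missing_mask = batch                     # inputs, labels, missing-modality mask
    logits, conf, c_star = model(x, missing_mask)  # impute, route by confidence, predict
    task_loss = F.cross_entropy(logits, y)         # primary task loss
    conf_loss = F.mse_loss(conf, c_star)           # confidence supervision loss
    loss = task_loss + lambda_conf * conf_loss     # no auxiliary load-balancing term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```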
7. Limitations, Connections, and Comparative Context
Conf-SMoE introduces additional memory requirements for storing the modality pool used in imputation. The effectiveness of confidence-based supervision depends on available ground-truth task confidence scores, which must be estimated if unavailable. The imputation mechanism presumes a non-negligible cross-modal correlation structure; performance may degrade if such correlations are weak.
Connections to Gaussian and Laplacian gating mechanisms reveal that, while these alternatives can partially maintain expert utilization, they retain off-diagonal Jacobian terms and remain susceptible to collapse over long training, as evidenced in empirical plots. Conf-SMoE’s confidence mechanism, by fully decoupling the gates, uniquely maintains robust expert diversity.
Conf-SMoE provides a principled and empirically substantiated approach for robust modality fusion and resistance to expert collapse in multimodal SMoE architectures, setting a new standard for handling the missing modality problem without reliance on auxiliary loss terms (2505.19525).