Conf-SMoE: Robust Multimodal Mixture-of-Experts

Updated 6 May 2026
  • The paper introduces a two-stage imputation mechanism and confidence-guided gating to overcome missing modalities and expert collapse in SMoE architectures.
  • It employs a pre-imputation stage to average modality embeddings and post-imputation refinement using sparse cross-attention for strong cross-modal correlations.
  • Empirical evaluations on real-world benchmarks demonstrate significant performance gains in F1 and AUC, validating its robust expert specialization.

Conf-SMoE is an extension of sparse Mixture-of-Experts (SMoE) architectures that addresses the challenge of missing modalities and expert collapse in multimodal learning scenarios. By integrating a two-stage imputation mechanism and a confidence-guided gating strategy, Conf-SMoE enables robust modality fusion and maintains expert diversity and specialization without additional load balancing terms. This approach is empirically validated on multiple real-world datasets and offers a theoretical framework clarifying the mechanics of expert collapse and the effectiveness of the confidence-guided gating mechanism (2505.19525).

1. SMoE Architectures, Multimodal Settings, and the Missing-Modality Challenge

Sparse Mixture-of-Experts (SMoE) models augment deep learning backbones by replacing each feed-forward layer with $N$ distinct experts $\{E_i\}_{i=1}^{N}$. A lightweight router $G(\cdot)$ computes, for each token embedding $\mathbf h\in\mathbb R^d$, a set of routing scores $\{g_i(\mathbf h)\}$, usually normalized via a softmax. Only the top-$K$ experts per token are activated:

$$E(\mathbf h) = \mathbf h + \sum_{i\in \mathrm{TopK}(g(\mathbf h))} g_i(\mathbf h)\,E_i(\mathbf h)$$

In multimodal settings (e.g., combining language, vision, and audio), experts often specialize in different modalities or modality combinations. When modalities are missing (e.g., sensor failures, privacy restrictions), a router expecting complete input struggles, leading to truncation or nonsensical routing. This often results in "expert collapse," where most tokens are routed to a small subset of experts, harming diversity and generalization.
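The routing rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the linear router and the expert callables are hypothetical stand-ins.

```python
import numpy as np

def smoe_layer(h, experts, router_W, top_k=2):
    """One SMoE layer: route token h to its top-k experts and combine.

    h        : (d,) token embedding
    experts  : list of callables E_i mapping (d,) -> (d,)
    router_W : (n_experts, d) weights of a hypothetical linear router
    """
    logits = router_W @ h                    # one routing score per expert
    g = np.exp(logits - logits.max())
    g = g / g.sum()                          # softmax-normalized gates g_i(h)
    top = np.argsort(g)[-top_k:]             # indices of the top-k gates
    # residual combination: E(h) = h + sum over selected i of g_i * E_i(h)
    return h + sum(g[i] * experts[i](h) for i in top)
```

Note that only `top_k` experts are evaluated per token, which is what makes the layer's compute cost sparse in the number of experts.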

2. Two-Stage Imputation for Missing Modalities

To address missing modalities, Conf-SMoE employs a two-stage imputation block preceding the expert layers.

  • Pre-Imputation (Common Structure): For every missing modality $i$, a modality pool containing all training-set embeddings $\mathbf M_{i,n}$ is maintained. $K$ random embeddings are drawn and averaged to obtain

$$\hat{\mathbf M}_i = \frac{1}{K}\sum_{k=1}^{K} \mathbf M_{i,n_k}$$

This average retains modality-wise statistics while discarding instance-specific idiosyncrasies.

  • Post-Imputation (Cross-Modal Refinement): The pre-imputed matrix $\hat{\mathbf M}_i$ undergoes sparse cross-attention with the observed modalities. For each observed modality $j$, queries $\mathbf Q$ are formed from the imputed tokens and keys/values $\mathbf K, \mathbf V$ from the tokens of $\mathbf M_j$, and the imputed tokens are refined with

$$\hat{\mathbf M}_i \leftarrow \hat{\mathbf M}_i + \mathrm{softmax}\!\left(\mathrm{TopK}\!\left(\frac{\mathbf Q\mathbf K^\top}{\sqrt{d}}\right)\right)\mathbf V$$

Only the top-$k$ keys per query are kept, ensuring that only strong cross-modal correlations influence the imputed values. The complete set of real and imputed modalities is then concatenated and processed by SMoE layers.
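The two imputation stages can be sketched as follows. This is a NumPy illustration under stated assumptions: the function names, the uniform random draw from the pool, and the scaled dot-product form of the attention are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def pre_impute(pool, K, rng):
    """Stage 1: average K randomly drawn training embeddings of the
    missing modality (keeps modality-wise statistics, drops
    instance-specific detail)."""
    idx = rng.choice(len(pool), size=K, replace=False)
    return pool[idx].mean(axis=0)

def sparse_cross_attention(queries, keys, values, top_k):
    """Stage 2: refine imputed tokens by cross-attending to an observed
    modality, keeping only the top_k strongest keys per query."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)             # (Tq, Tk)
    # mask everything below each row's top_k-th score to -inf
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # sparse attention weights
    return queries + w @ values                        # residual refinement
```

A usage sketch: `pre_impute` produces the placeholder token for a missing modality, which `sparse_cross_attention` then refines against the token matrix of each observed modality.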

3. Confidence-Guided Expert Gating

Conventional softmax routers create sharp distributions, predisposing some experts to receive disproportionately more gradient updates, resulting in expert collapse. Conf-SMoE introduces a decoupled, confidence-driven gating system:

  • Each expert $E_i$ includes a confidence network $c_i(\cdot)$ yielding a logit $z_i(\mathbf h)$ per token embedding. Confidence is computed as

$$g_i(\mathbf h) = \sigma\big(z_i(\mathbf h)\big) = \frac{1}{1+e^{-z_i(\mathbf h)}}$$

  • These $g_i$ values are not normalized globally and do not enforce $\sum_i g_i(\mathbf h)=1$, eliminating inter-expert competition. Top-$K$ selection is still retained for computational efficiency.
  • During training, $g_i$ is supervised by regressing onto the true task confidence $c_i^\star$:

$$\mathcal L_{\mathrm{conf}} = \sum_{i}\big(g_i(\mathbf h) - c_i^\star\big)^2$$

This decoupled gating allows each expert to develop its own selection dynamics independent of the others.
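A minimal sketch of the decoupled gate, assuming each confidence network is an arbitrary callable producing a scalar logit. The function names and the squared-error form of the supervision loss are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_gate(h, conf_nets, top_k=2):
    """Each expert i has its own confidence head producing a logit z_i.
    Gates g_i = sigmoid(z_i) are NOT jointly normalized, so experts do
    not compete for a shared probability mass; top-k selection is kept
    only for computational sparsity."""
    g = np.array([sigmoid(net(h)) for net in conf_nets])  # independent gates
    top = np.argsort(g)[-top_k:]                          # selected experts
    return g, top

def confidence_loss(g, target_conf, selected):
    """Supervise the selected gates by regressing onto the observed task
    confidence (squared-error sketch; the exact target may differ)."""
    return float(np.mean((g[selected] - target_conf[selected]) ** 2))
```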

4. Theoretical Mechanisms: Avoiding Expert Collapse

Softmax-based routers induce gradients with strong cross-expert coupling via the covariance structure of the softmax Jacobian, especially for sharp outputs:

$$\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j)$$

Here, $p_i$ is the softmax output. For sharp distributions $p$, gradients concentrate only on the top expert, amplifying the rich-get-richer effect.

Load-balance losses based on entropy are classically introduced to mitigate collapse, but their gradients often oppose the main update direction, causing instability.

In contrast, the confidence-guided sigmoidal gating yields a local Jacobian with no cross-expert terms:

$$\frac{\partial g_i}{\partial z_j} = \delta_{ij}\,g_i\,(1-g_i)$$

This structure enables independent expert specialization and prevents starvation, removing the need for auxiliary balancing losses.
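The contrast between the two Jacobians can be checked numerically. This is a self-contained NumPy demonstration of the two formulas above, not code from the paper.

```python
import numpy as np

def softmax_jacobian(z):
    """J_ij = p_i (delta_ij - p_j): nonzero off-diagonal terms couple
    every expert's gradient to every other expert's gate."""
    p = np.exp(z - z.max())
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

def sigmoid_jacobian(z):
    """J_ij = delta_ij * g_i (1 - g_i): strictly diagonal, so each
    expert's gate is updated independently of the others."""
    g = 1.0 / (1.0 + np.exp(-z))
    return np.diag(g * (1.0 - g))

z = np.array([4.0, 0.5, -1.0])   # a sharp routing logit vector
J_soft = softmax_jacobian(z)     # off-diagonals ~ -p_i p_j, all nonzero
J_sig = sigmoid_jacobian(z)      # off-diagonals exactly zero
```

With the sharp logits above, the softmax Jacobian's off-diagonal entries are nonzero (cross-expert coupling), while the sigmoid Jacobian's off-diagonal entries are exactly zero, matching the decoupling argument.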

5. Empirical Validation

Conf-SMoE's performance has been systematically evaluated on four real-world multimodal benchmarks—MIMIC-III, MIMIC-IV, CMU-MOSI, and CMU-MOSEI—across three regimes: natural missing modalities, random dropout, and asymmetric dropout. Representative results from MIMIC-IV (natural missingness) demonstrate substantial improvement:

| Model | F₁ | AUC |
| --- | --- | --- |
| FuseMoE-L | 40.21 | 78.05 |
| FlexMoE | 35.29 | 80.45 |
| ConfMoE-T (Conf-SMoE) | 49.18 | 85.24 |
| ConfMoE-E (Conf-SMoE) | 48.32 | 85.09 |

On CMU-MOSI with 50% modalities dropped, ConfMoE-T achieves F₁ ≈ 43.9, outperforming FlexMoE (41.9) and standard SMoE (41.5). Under asymmetric dropout, Conf-SMoE retains superior performance for single-modality scenarios.

Ablation studies indicate that omitting the imputation block decreases F₁ by 6 points and AUC by 2 points, while removing confidence gating reduces F₁ by 4 points and degrades expert diversity. Alternative gating mechanisms (softmax with load balance, mean selection, Gaussian, Laplacian) underperform and exhibit oscillatory expert selection.

6. Training and Implementation

A typical Conf-SMoE training epoch proceeds as follows:

  1. For each sample and missing modality $i$, $K$ training embeddings are drawn and averaged to form $\hat{\mathbf M}_i$.
  2. Observed and imputed modalities are encoded and passed to SMoE layers with confidence-based gating.
  3. After the expert layer, refinement is applied by sparse cross-attention, integrating cross-modal context into missing modality tokens.
  4. All modality tokens are concatenated and passed to the task head.
  5. The loss for each sample includes both the primary task loss (e.g., cross-entropy) and the confidence supervision loss.
  6. Parameters of experts and confidence networks are updated via backpropagation.

No explicit balancing loss or expert load regularizer is necessary. At inference, the gating proceeds solely via learned confidence scores.
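The per-sample objective in step 5 can be sketched as a task loss plus a confidence regression term. The weighting factor `lam` and all names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def total_loss(task_logits, label, gates, target_conf, lam=0.1):
    """Schematic per-sample loss: cross-entropy on the task head plus
    squared-error supervision of the confidence gates. No load-balancing
    or expert-usage regularizer appears anywhere in the objective."""
    # primary task loss: cross-entropy over class logits
    p = np.exp(task_logits - task_logits.max())
    p /= p.sum()
    task = -np.log(p[label])
    # confidence supervision: regress gates toward observed task confidence
    conf = np.mean((gates - target_conf) ** 2)
    return float(task + lam * conf)
```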

7. Limitations, Connections, and Comparative Context

Conf-SMoE introduces additional memory requirements for storing the modality pool used in imputation. The effectiveness of confidence-based supervision depends on available ground-truth task confidence scores, which must be estimated if unavailable. The imputation mechanism presumes a non-negligible cross-modal correlation structure; performance may degrade if such correlations are weak.

Connections to Gaussian and Laplacian gating mechanisms reveal that while these alternatives can partially maintain expert utilization, they retain off-diagonal Jacobian terms and remain susceptible to collapse over long training as evidenced in empirical plots. Conf-SMoE’s confidence mechanism, by fully decoupling gating, uniquely maintains robust expert diversity.

Conf-SMoE provides a principled and empirically substantiated approach for robust modality fusion and resistance to expert collapse in multimodal SMoE architectures, setting a new standard for handling the missing modality problem without reliance on auxiliary loss terms (2505.19525).
