
Confidence-Guided Gating in Sparse MoE

Updated 2 February 2026
  • The paper introduces a novel two-stage modality imputation mechanism that effectively restores missing modalities by averaging modality pools and refining through sparse cross-attention.
  • It replaces traditional softmax gating with a confidence-guided gating strategy, ensuring balanced expert utilization and preventing expert collapse during training.
  • Empirical results on clinical and sentiment benchmarks indicate significant F1 and AUC improvements, confirming the robustness and practical benefits of Conf-SMoE.

Confidence-guided gating in Sparse Mixture-of-Experts (Conf-SMoE) is an architecture for multimodal learning that addresses the pervasive problem of missing modalities and expert collapse encountered by traditional sparse MoE (SMoE) frameworks. Conf-SMoE combines a principled two-stage modality imputation mechanism with a confidence-driven gating strategy that replaces conventional softmax-based routers. This architecture offers both theoretical and empirical advances, improving robustness and performance in scenarios involving arbitrary combinations of modality availability, as substantiated across diverse clinical and sentiment classification benchmarks (2505.19525).

1. Sparse Mixture-of-Experts Backbone

The Conf-SMoE framework builds on the classic SMoE layer. For a given input $x \in \mathbb{R}^d$, typically a fused per-modality token embedding (e.g., the output of a Transformer block), a set of $N$ experts $\{E_1, \ldots, E_N\}$, each a feedforward subnetwork, processes $x$ in parallel. A gating network $G$ generates a routing logit vector $u(x) \in \mathbb{R}^N$, yielding softmax-activated routing weights

$$g_i(x) = \frac{\exp(u_i)}{\sum_{j=1}^{N} \exp(u_j)}.$$

The output aggregates the Top-$K$ experts:

$$y = x + \sum_{i \in \mathcal{K}(x)} g_i(x)\, E_i(x).$$

Here, $\mathcal{K}(x)$ selects the indices of the $K$ largest $g_i(x)$. This formulation is standard, yet it exposes the core deficits of SMoE in multimodal, incomplete, or imbalanced training regimes.
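The standard SMoE layer above can be sketched in a few lines of NumPy. The linear experts, gate weights, and dimensions here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 8, 4, 2  # embedding dim, number of experts, Top-K (illustrative sizes)

# Each expert is a simple linear map R^d -> R^d; the gate is a linear map R^d -> R^N.
W_experts = rng.standard_normal((N, d, d)) * 0.1
W_gate = rng.standard_normal((d, N)) * 0.1

def smoe_forward(x):
    u = x @ W_gate                      # routing logits u(x) in R^N
    g = np.exp(u - u.max())
    g = g / g.sum()                     # softmax routing weights g_i(x)
    topk = np.argsort(g)[-K:]           # indices K(x) of the K largest weights
    y = x.copy()                        # residual connection
    for i in topk:
        y += g[i] * (x @ W_experts[i])  # weighted sum over active experts
    return y, g, topk

x = rng.standard_normal(d)
y, g, topk = smoe_forward(x)
```

Only the $K$ selected experts contribute to the output, which is what makes the layer sparse at inference time.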

2. Missing Modalities and Expert Collapse

In multimodal settings, the feature vector $x$ is formed from concatenated embeddings of several modalities $\{M_j\}$. When some modalities are missing, $x$ contains “holes,” and SMoE performance degrades due to two factors:

  • Unreliable Gating Selections: Missing modalities distort $x$, causing the gate $G$ to select sub-optimal experts.
  • Expert Collapse: Even with all modalities available, the “sharpness” of softmax gating concentrates nearly all routing mass on a few experts (the “rich-get-richer” phenomenon), so only those experts receive meaningful gradient updates.

This behavior can be traced analytically to the Jacobian of the SMoE layer, where the gating gradient term $E_{\mathcal{K}(x)}(\mathrm{diag}(g) - g g^\top)$ vanishes as $g$ becomes sharp, limiting gradient flow to non-dominant experts. Attempts to correct this with auxiliary entropy-based load-balancing losses trigger conflicting gradients and oscillatory, rather than genuinely diverse, expert selection (2505.19525).
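The vanishing gating gradient is easy to observe numerically: as the routing logits sharpen, the Frobenius norm of the softmax Jacobian factor $\mathrm{diag}(g) - g g^\top$ collapses toward zero. A minimal NumPy demonstration, using arbitrary logits:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.array([2.0, 1.0, 0.5, 0.0])  # illustrative routing logits

# As the routing distribution sharpens, the Jacobian factor diag(g) - g g^T
# shrinks toward zero, so non-dominant experts stop receiving gradient.
norms = []
for sharpness in [1.0, 5.0, 20.0]:
    g = softmax(sharpness * u)
    J = np.diag(g) - np.outer(g, g)
    norms.append(np.linalg.norm(J))
print([round(n, 6) for n in norms])  # monotonically decreasing toward 0
```

At high sharpness the routing distribution is effectively one-hot and the Jacobian factor is numerically zero, which is exactly the collapse mechanism described above.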

3. Two-Stage Imputation for Modality Restoration

Conf-SMoE introduces a two-stage imputation process to address arbitrary missing subsets of modalities:

  • Pre-Imputation: For each missing modality $M_i$, a “modality pool” is sampled from the training set. $K$ instances $\{M_{i,1}, \ldots, M_{i,K}\}$ are averaged to produce a neutral embedding $\bar M_i = \frac{1}{K} \sum_{m=1}^{K} M_{i,m}$, capturing general modality characteristics while suppressing instance noise. $K$ is typically set to 10.
  • Post-Imputation: After SMoE fusion of the pre-imputed and present modalities, instance-specific refinement is applied. The pre-imputed token $\bar M_i$ undergoes sparse cross-attention,

$$M_i^* = \bar M_i + \mathrm{SparseCrossAttention}\bigl(\bar M_i, \{\text{expert tokens from active experts}\}\bigr),$$

where only the top $S$ interactions are permitted. $S$ is set to select roughly one quarter of the candidate tokens per missing modality, ensuring computational tractability and focusing attention on the most relevant context.

This imputation mechanism delivers significant improvements in both robustness and accuracy under various missingness scenarios (2505.19525).
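The two stages can be sketched in NumPy. The pool, token set, and dimensions below are illustrative, and the real model applies learned projections inside the cross-attention, which are omitted here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K_pool, S = 8, 10, 2  # embedding dim, pool size, sparse-attention budget (illustrative)

# Stage 1: pre-imputation -- average K pooled training-set instances of the
# missing modality into a neutral embedding \bar M_i.
pool = rng.standard_normal((K_pool, d))
m_bar = pool.mean(axis=0)

# Stage 2: post-imputation -- sparse cross-attention from \bar M_i over tokens
# produced by the active experts, keeping only the top-S attention scores.
expert_tokens = rng.standard_normal((6, d))  # tokens from active experts (illustrative)
scores = expert_tokens @ m_bar / np.sqrt(d)  # query = \bar M_i, keys = expert tokens
top_s = np.argsort(scores)[-S:]              # retain the S strongest interactions
w = np.exp(scores[top_s] - scores[top_s].max())
w /= w.sum()                                 # softmax over the surviving scores only
m_star = m_bar + w @ expert_tokens[top_s]    # refined imputation M_i^*
```

Masking all but the top-$S$ scores before the softmax is what makes the refinement both cheap and focused on the most relevant expert context.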

4. Confidence-Guided Gating Mechanism

The core innovation of Conf-SMoE is its confidence-guided expert gating, which replaces the softmax operation with a per-expert “ConfNet.” For each expert $i$, a sub-network $U_i: \mathbb{R}^d \rightarrow \mathbb{R}$ computes a scalar $v_i = U_i(h)$ from the shared embedding $h$. The confidence score $c_i(h) = \sigma(v_i)$, where $\sigma$ denotes the sigmoid function, represents the likelihood that expert $i$ is suitable for the current sample. During training, $c_i(h)$ is regressed toward the true downstream probability $p_t$ of the ground-truth label $y_t$, yielding a quadratic confidence loss:

$$\mathcal{L}_\mathrm{conf} = \frac{1}{|D|\,N} \sum_{(h, y_t) \in D} \sum_{i=1}^{N} \bigl[c_i(h) - p_t\bigr]^2.$$

At inference, $c_i(h)$ is used directly as the gating score to select the Top-$K$ experts. The output is then:

$$y = h + \sum_{i \in \mathrm{TopK}(c(h))} c_i(h)\, E_i(h).$$

Unlike softmax, sigmoid-based gating does not create excessively sharp or vanishing gradients. Theoretical analysis shows that the update term $c_i(1 - c_i)\, E_i(h)$ remains nonzero for every $i$ until $c_i$ saturates. This keeps gradient flowing to all experts, empirically preventing expert collapse without the need for auxiliary load balancing.
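A compact NumPy sketch of confidence-guided gating, assuming for simplicity that each ConfNet $U_i$ is a single linear map (the paper's $U_i$ may be deeper) and that the experts are linear:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, K = 8, 4, 2  # illustrative sizes

W_conf = rng.standard_normal((N, d)) * 0.1    # one linear ConfNet U_i per expert (a simplification)
W_experts = rng.standard_normal((N, d, d)) * 0.1

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conf_gate_forward(h, p_t=None):
    c = sigmoid(W_conf @ h)                   # confidence scores c_i(h) = sigma(U_i(h))
    topk = np.argsort(c)[-K:]                 # Top-K experts by confidence, no softmax
    y = h.copy()
    for i in topk:
        y += c[i] * (h @ W_experts[i])        # gate directly by c_i
    # Training-time confidence loss: regress every c_i toward the
    # true-label probability p_t (per-sample term of L_conf).
    L_conf = np.mean((c - p_t) ** 2) if p_t is not None else None
    return y, c, L_conf

h = rng.standard_normal(d)
y, c, L_conf = conf_gate_forward(h, p_t=0.9)
```

Because each $c_i$ is an independent sigmoid rather than a share of a normalized distribution, raising one expert's score does not starve the others of gradient.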

5. Comparisons with Alternative Gating Strategies

Conf-SMoE’s confidence-guided gating is contrasted with several alternatives:

  • Softmax Gating: $g_i = e^{u_i} / \sum_j e^{u_j}$. Prone to sharpness and collapse.
  • Laplacian Gating (FuseMoE): $g_i = \exp(-\|u - w_i\|_1) / \sum_j \exp(-\|u - w_j\|_1)$. Slight improvement in balance; collapse persists.
  • Gaussian Gating: $g_i = \exp(-\|u - w_i\|_2^2 / 2\sigma^2) / \sum_j \exp(-\|u - w_j\|_2^2 / 2\sigma^2)$.
  • Mean Gating: $g_i = 1/N$. Yields sinusoidal oscillation in expert usage.

Empirical ablations demonstrate that only confidence-guided gates preserve both expert specialization and usage balance over extended training (2505.19525).
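The four baseline gating rules can be compared side by side on a toy input; the routing vector $u$, prototypes $w_i$, and $\sigma$ below are arbitrary illustrative values:

```python
import numpy as np

u = np.array([1.5, 0.5, -0.5])       # routing representation (illustrative, N = 3 experts)
W = np.array([[1.4, 0.4, -0.4],      # per-expert prototypes w_i (illustrative)
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0]])
sigma = 1.0

def normalize(s):
    e = np.exp(s - s.max())
    return e / e.sum()

g_softmax   = normalize(u)                                            # logits read per expert
g_laplacian = normalize(-np.abs(u - W).sum(axis=1))                   # -||u - w_i||_1
g_gaussian  = normalize(-((u - W) ** 2).sum(axis=1) / (2 * sigma**2)) # -||u - w_i||_2^2 / 2s^2
g_mean      = np.full(3, 1 / 3)                                       # uniform routing

for name, g in [("softmax", g_softmax), ("laplacian", g_laplacian),
                ("gaussian", g_gaussian), ("mean", g_mean)]:
    print(f"{name:9s}", np.round(g, 3))
```

All four produce a normalized weight vector; they differ only in how distance from the expert prototypes (or the raw logits) shapes that distribution.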

6. Pseudocode and Model Hyperparameters

The Conf-SMoE training loop interleaves modality imputation, SMoE routing with confidence gating, and output refinement. The essential operations are as follows:

for each minibatch of samples {x, {M_j present}, y_t}:
    # Two-stage imputation
    for each missing modality i:
        sample K pool instances M_{i,1..K}
        pre_impute[i] = (1/K) sum M_{i,m}
    input_mods = {present M_j} ∪ {pre_impute[i]}
    h = Backbone(input_mods)  # e.g. Transformer encoder

    # SMoE forward
    for i=1..N:
        v_i = U_i(h)
        c_i = sigmoid(v_i)
    select top-K expert indices K = TopK(c, K)
    y = h
    for i in K:
        y += c_i * E_i(h)

    # Post-imputation refinement
    for each missing modality i:
        M_i^* = pre_impute[i] + SparseCrossAttention(pre_impute[i], {E_k(h) | k in K})

    # Task prediction & losses
    logits = Head(y)
    L_task = CrossEntropy(logits, y_t)
    L_conf = (1/(batch*N)) sum_i (c_i - p_t)^2
    L = L_task + L_conf

    backpropagate L, update all parameters

Critical hyperparameters include: number of experts $N = 8$ (MIMIC) or $N = 4$ (CMU); $K = 2$ active experts per token; embedding dimension $d = 128$; pre-imputation pool size $K = 10$ (note the overloaded symbol $K$); sparse-attention sparsity $B = 4$; learning rate $3 \times 10^{-4}$; dropout 0.1; $\mathcal{L}_\mathrm{conf}$ weight 1.
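For reference, the reported hyperparameters can be collected into a single configuration structure (a convenience layout for experiments, not taken from the paper's code):

```python
# Hyperparameters as reported above, gathered into one config dict.
conf_smoe_config = {
    "num_experts": {"MIMIC": 8, "CMU": 4},  # N
    "top_k_experts": 2,                     # K active experts per token
    "embed_dim": 128,                       # d
    "imputation_pool_size": 10,             # K (overloaded symbol; pool averaging)
    "sparse_attention_budget": 4,           # B
    "learning_rate": 3e-4,
    "dropout": 0.1,
    "conf_loss_weight": 1.0,
}
```

Keeping dataset-dependent values (like the expert count) nested under the dataset name makes it explicit which settings change between MIMIC and CMU runs.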

7. Experimental Evidence and Performance

Conf-SMoE was evaluated on MIMIC-III and MIMIC-IV (clinical timeseries/notes/ECG/X-ray) as well as multimodal sentiment datasets (CMU-MOSI, CMU-MOSEI). Three missingness scenarios were tested: natural missingness (clinical EHR), random modality dropout (up to 50%), and asymmetric dropout (half modalities always dropped during training; only 1–2 available at test). Performance was assessed with F1 and AUC metrics using 3-fold cross-validation.

Notably, on MIMIC-IV, Conf-SMoE-Token (“ConfMoE-T”) achieved F1 gains of +1.4–4.1% and AUC gains of +2.0–4.8% over strong baselines (FlexMoE). On CMU-MOSI, ConfMoE-T maintained superiority by 1–3 points in F1 and AUC even at 50% missing modalities. Ablations established that omitting two-stage imputation or confidence gating led to F1 drops of 5–6% and ~1%, respectively. Alternative gating methods (Gaussian, Laplacian, mean) improved balance compared to raw softmax but did not match Conf-SMoE's robustness. Computational cost was moderate: 47 GFLOPs and 3.1M parameters on CMU-MOSI, slightly more than a single-expert MoE and substantially less than large SMoE variants, while yielding the highest F1 (43.9 vs. 41–42 for the alternatives) (2505.19525).

In summary, Conf-SMoE’s integration of two-stage imputation and confidence-guided gating addresses fundamental challenges in sparse MoE architectures for multimodal and incomplete data, achieving empirically validated improvements in robustness, balance, and accuracy over prior methods.

References

  • arXiv:2505.19525
