
Confidence-Guided Gating in Sparse MoE

Updated 2 February 2026
  • The paper introduces a novel two-stage modality imputation mechanism that effectively restores missing modalities by averaging modality pools and refining through sparse cross-attention.
  • It replaces traditional softmax gating with a confidence-guided gating strategy, ensuring balanced expert utilization and preventing expert collapse during training.
  • Empirical results on clinical and sentiment benchmarks indicate significant F1 and AUC improvements, confirming the robustness and practical benefits of Conf-SMoE.

Confidence-guided gating in Sparse Mixture-of-Experts (Conf-SMoE) is an architecture for multimodal learning that addresses the pervasive problem of missing modalities and expert collapse encountered by traditional sparse MoE (SMoE) frameworks. Conf-SMoE combines a principled two-stage modality imputation mechanism with a confidence-driven gating strategy that replaces conventional softmax-based routers. This architecture offers both theoretical and empirical advances, improving robustness and performance in scenarios involving arbitrary combinations of modality availability, as substantiated across diverse clinical and sentiment classification benchmarks (2505.19525).

1. Sparse Mixture-of-Experts Backbone

The Conf-SMoE framework builds on the classic SMoE layer. For a given input $x \in \mathbb{R}^d$, typically a fused per-modality token embedding (e.g., the output of a Transformer block), a set of $N$ experts $\{E_1, \ldots, E_N\}$, each a feedforward subnetwork, processes $x$ in parallel. A gating network $G$ generates a routing logit vector $u(x) \in \mathbb{R}^N$, yielding softmax-activated routing weights

$$g_i(x) = \frac{\exp(u_i)}{\sum_{j=1}^{N} \exp(u_j)}.$$

The output aggregates the Top-$K$ experts:

$$y = x + \sum_{i \in \mathcal{K}(x)} g_i(x)\, E_i(x).$$

Here, $\mathcal{K}(x)$ selects the indices of the $K$ largest $g_i(x)$. This formulation is standard, yet it exposes the core deficits of SMoE in multimodal, incomplete, or imbalanced training regimes.
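The standard SMoE layer above can be sketched in a few lines of NumPy. The linear experts, gate weights, and dimensions here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 8, 4, 2  # embedding dim, number of experts, Top-K (illustrative sizes)

# Each expert is a simple linear map R^d -> R^d; the gate is a linear map R^d -> R^N.
W_experts = rng.standard_normal((N, d, d)) * 0.1
W_gate = rng.standard_normal((d, N)) * 0.1

def smoe_forward(x):
    u = x @ W_gate                      # routing logits u(x) in R^N
    g = np.exp(u - u.max())
    g = g / g.sum()                     # softmax routing weights g_i(x)
    topk = np.argsort(g)[-K:]           # indices K(x) of the K largest weights
    y = x.copy()                        # residual connection
    for i in topk:
        y += g[i] * (x @ W_experts[i])  # weighted sum over active experts
    return y, g, topk

x = rng.standard_normal(d)
y, g, topk = smoe_forward(x)
```

Only the $K$ selected experts contribute to the output, which is what makes the layer sparse at inference time.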

2. Missing Modalities and Expert Collapse

In multimodal settings, the feature vector $x$ is formed from concatenated embeddings of several modalities $\{M_j\}$. When some modalities are missing, $x$ contains “holes,” and SMoE performance degrades due to two factors:

  • Unreliable Gating Selections: Missing modalities distort $x$, causing the gate $G$ to select sub-optimal experts.
  • Expert Collapse: Even with all modalities available, the “sharpness” of softmax gating concentrates nearly all routing mass on a few experts (the “rich-get-richer” phenomenon), so only those experts receive meaningful gradient updates.

This behavior can be traced analytically to the Jacobian of the SMoE layer, where the gating gradient term $E_{\mathcal{K}(x)}(\mathrm{diag}(g) - g g^\top)$ vanishes as $g$ becomes sharp, limiting gradient flow to non-dominant experts. Attempts to correct this with auxiliary entropy-based load-balancing losses trigger conflicting gradients and oscillatory, rather than genuinely diverse, expert selection (2505.19525).
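The vanishing gating gradient is easy to observe numerically: as the routing logits sharpen, the Frobenius norm of the softmax Jacobian factor $\mathrm{diag}(g) - g g^\top$ collapses toward zero. A minimal NumPy demonstration, using arbitrary logits:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.array([2.0, 1.0, 0.5, 0.0])  # illustrative routing logits

# As the routing distribution sharpens, the Jacobian factor diag(g) - g g^T
# shrinks toward zero, so non-dominant experts stop receiving gradient.
norms = []
for sharpness in [1.0, 5.0, 20.0]:
    g = softmax(sharpness * u)
    J = np.diag(g) - np.outer(g, g)
    norms.append(np.linalg.norm(J))
print([round(n, 6) for n in norms])  # monotonically decreasing toward 0
```

At high sharpness the routing distribution is effectively one-hot and the Jacobian factor is numerically zero, which is exactly the collapse mechanism described above.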

3. Two-Stage Imputation for Modality Restoration

Conf-SMoE introduces a two-stage imputation process to address arbitrary missing subsets of modalities:

  • Pre-Imputation: For each missing modality $M_i$, a “modality pool” is sampled from the training set. $K$ instances $\{M_{i,1}, \ldots, M_{i,K}\}$ are averaged to produce a neutral embedding $\bar M_i = \frac{1}{K} \sum_{m=1}^{K} M_{i,m}$, capturing general modality characteristics while suppressing instance noise. $K$ is typically set to 10.
  • Post-Imputation: After SMoE fusion of the pre-imputed and present modalities, instance-specific refinement is applied. The pre-imputed token $\bar M_i$ undergoes sparse cross-attention,

$$M_i^* = \bar M_i + \mathrm{SparseCrossAttention}\bigl(\bar M_i, \{\text{expert tokens from active experts}\}\bigr),$$

where only the top $S$ interactions are permitted. $S$ is set to select roughly one quarter of the candidate tokens per missing modality, ensuring computational tractability and focusing attention on the most relevant context.

This imputation mechanism delivers significant improvements in both robustness and accuracy under various missingness scenarios (2505.19525).
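The two stages can be sketched in NumPy. The pool, token set, and dimensions below are illustrative, and the real model applies learned projections inside the cross-attention, which are omitted here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K_pool, S = 8, 10, 2  # embedding dim, pool size, sparse-attention budget (illustrative)

# Stage 1: pre-imputation -- average K pooled training-set instances of the
# missing modality into a neutral embedding \bar M_i.
pool = rng.standard_normal((K_pool, d))
m_bar = pool.mean(axis=0)

# Stage 2: post-imputation -- sparse cross-attention from \bar M_i over tokens
# produced by the active experts, keeping only the top-S attention scores.
expert_tokens = rng.standard_normal((6, d))  # tokens from active experts (illustrative)
scores = expert_tokens @ m_bar / np.sqrt(d)  # query = \bar M_i, keys = expert tokens
top_s = np.argsort(scores)[-S:]              # retain the S strongest interactions
w = np.exp(scores[top_s] - scores[top_s].max())
w /= w.sum()                                 # softmax over the surviving scores only
m_star = m_bar + w @ expert_tokens[top_s]    # refined imputation M_i^*
```

Masking all but the top-$S$ scores before the softmax is what makes the refinement both cheap and focused on the most relevant expert context.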

4. Confidence-Guided Gating Mechanism

The core innovation of Conf-SMoE is its confidence-guided expert gating, which replaces the softmax operation with a per-expert “ConfNet.” For each expert $i$, a sub-network $U_i: \mathbb{R}^d \rightarrow \mathbb{R}$ computes a scalar $v_i = U_i(h)$ from the shared embedding $h$. The confidence score $c_i(h) = \sigma(v_i)$, where $\sigma$ denotes the sigmoid function, represents the likelihood that expert $i$ is suitable for the current sample. During training, $c_i(h)$ is regressed toward the true downstream probability $p_t$ of the ground-truth label $y_t$, yielding a quadratic confidence loss:

$$\mathcal{L}_\mathrm{conf} = \frac{1}{|D|\,N} \sum_{(h, y_t) \in D} \sum_{i=1}^{N} \bigl[c_i(h) - p_t\bigr]^2.$$

At inference, $c_i(h)$ is used directly as the gating score to select the Top-$K$ experts. The output is then:

$$y = h + \sum_{i \in \mathrm{TopK}(c(h))} c_i(h)\, E_i(h).$$

Unlike softmax, sigmoid-based gating does not create excessively sharp or vanishing gradients. Theoretical analysis shows that the update term $c_i(1 - c_i)\, E_i(h)$ remains nonzero for every $i$ until $c_i$ saturates. This keeps gradient flowing to all experts, empirically preventing expert collapse without the need for auxiliary load balancing.
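A compact NumPy sketch of confidence-guided gating, assuming for simplicity that each ConfNet $U_i$ is a single linear map (the paper's $U_i$ may be deeper) and that the experts are linear:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, K = 8, 4, 2  # illustrative sizes

W_conf = rng.standard_normal((N, d)) * 0.1    # one linear ConfNet U_i per expert (a simplification)
W_experts = rng.standard_normal((N, d, d)) * 0.1

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conf_gate_forward(h, p_t=None):
    c = sigmoid(W_conf @ h)                   # confidence scores c_i(h) = sigma(U_i(h))
    topk = np.argsort(c)[-K:]                 # Top-K experts by confidence, no softmax
    y = h.copy()
    for i in topk:
        y += c[i] * (h @ W_experts[i])        # gate directly by c_i
    # Training-time confidence loss: regress every c_i toward the
    # true-label probability p_t (per-sample term of L_conf).
    L_conf = np.mean((c - p_t) ** 2) if p_t is not None else None
    return y, c, L_conf

h = rng.standard_normal(d)
y, c, L_conf = conf_gate_forward(h, p_t=0.9)
```

Because each $c_i$ is an independent sigmoid rather than a share of a normalized distribution, raising one expert's score does not starve the others of gradient.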

5. Comparisons with Alternative Gating Strategies

Conf-SMoE’s confidence-guided gating is contrasted with several alternatives:

  • Softmax Gating: $g_i = e^{u_i} / \sum_j e^{u_j}$. Prone to sharpness and collapse.
  • Laplacian Gating (FuseMoE): $g_i = \exp(-\|u - w_i\|_1) / \sum_j \exp(-\|u - w_j\|_1)$. Slight improvement in balance; collapse persists.
  • Gaussian Gating: $g_i = \exp(-\|u - w_i\|_2^2 / 2\sigma^2) / \sum_j \exp(-\|u - w_j\|_2^2 / 2\sigma^2)$.
  • Mean Gating: $g_i = 1/N$. Yields sinusoidal oscillation in expert usage.

Empirical ablations demonstrate that only confidence-guided gates preserve both expert specialization and usage balance over extended training (2505.19525).
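The four baseline gating rules can be compared side by side on a toy input; the routing vector $u$, prototypes $w_i$, and $\sigma$ below are arbitrary illustrative values:

```python
import numpy as np

u = np.array([1.5, 0.5, -0.5])       # routing representation (illustrative, N = 3 experts)
W = np.array([[1.4, 0.4, -0.4],      # per-expert prototypes w_i (illustrative)
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0]])
sigma = 1.0

def normalize(s):
    e = np.exp(s - s.max())
    return e / e.sum()

g_softmax   = normalize(u)                                            # logits read per expert
g_laplacian = normalize(-np.abs(u - W).sum(axis=1))                   # -||u - w_i||_1
g_gaussian  = normalize(-((u - W) ** 2).sum(axis=1) / (2 * sigma**2)) # -||u - w_i||_2^2 / 2s^2
g_mean      = np.full(3, 1 / 3)                                       # uniform routing

for name, g in [("softmax", g_softmax), ("laplacian", g_laplacian),
                ("gaussian", g_gaussian), ("mean", g_mean)]:
    print(f"{name:9s}", np.round(g, 3))
```

All four produce a normalized weight vector; they differ only in how distance from the expert prototypes (or the raw logits) shapes that distribution.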

6. Pseudocode and Model Hyperparameters

The Conf-SMoE training loop interleaves modality imputation, SMoE routing with confidence gating, and output refinement. The essential operations are as follows:

for each minibatch of samples {x, {M_j present}, y_t}:
    # Two-stage imputation
    for each missing modality i:
        sample K pool instances M_{i,1..K}
        pre_impute[i] = (1/K) sum M_{i,m}
    input_mods = {present M_j} ∪ {pre_impute[i]}
    h = Backbone(input_mods)  # e.g. Transformer encoder

    # SMoE forward
    for i=1..N:
        v_i = U_i(h)
        c_i = sigmoid(v_i)
    select top-K expert indices K = TopK(c, K)
    y = h
    for i in K:
        y += c_i * E_i(h)

    # Post-imputation refinement
    for each missing modality i:
        M_i^* = pre_impute[i] + SparseCrossAttention(pre_impute[i], {E_k(h) | k in K})

    # Task prediction & losses
    logits = Head(y)
    L_task = CrossEntropy(logits, y_t)
    L_conf = (1/(batch*N)) sum_i (c_i - p_t)^2
    L = L_task + L_conf

    backpropagate L, update all parameters

Critical hyperparameters include: number of experts $N = 8$ (MIMIC) or $N = 4$ (CMU); $K = 2$ active experts per token; embedding dimension $d = 128$; pre-imputation pool size $K = 10$ (note the overloaded symbol $K$); sparse-attention sparsity $B = 4$; learning rate $3 \times 10^{-4}$; dropout 0.1; $\mathcal{L}_\mathrm{conf}$ weight 1.
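For reference, the reported hyperparameters can be collected into a single configuration structure (a convenience layout for experiments, not taken from the paper's code):

```python
# Hyperparameters as reported above, gathered into one config dict.
conf_smoe_config = {
    "num_experts": {"MIMIC": 8, "CMU": 4},  # N
    "top_k_experts": 2,                     # K active experts per token
    "embed_dim": 128,                       # d
    "imputation_pool_size": 10,             # K (overloaded symbol; pool averaging)
    "sparse_attention_budget": 4,           # B
    "learning_rate": 3e-4,
    "dropout": 0.1,
    "conf_loss_weight": 1.0,
}
```

Keeping dataset-dependent values (like the expert count) nested under the dataset name makes it explicit which settings change between MIMIC and CMU runs.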

7. Experimental Evidence and Performance

Conf-SMoE was evaluated on MIMIC-III and MIMIC-IV (clinical timeseries/notes/ECG/X-ray) as well as multimodal sentiment datasets (CMU-MOSI, CMU-MOSEI). Three missingness scenarios were tested: natural missingness (clinical EHR), random modality dropout (up to 50%), and asymmetric dropout (half modalities always dropped during training; only 1–2 available at test). Performance was assessed with F1 and AUC metrics using 3-fold cross-validation.

Notably, on MIMIC-IV, Conf-SMoE-Token (“ConfMoE-T”) achieved F1 gains of +1.4–4.1% and AUC gains of +2.0–4.8% over strong baselines (FlexMoE). On CMU-MOSI, ConfMoE-T maintained superiority by 1–3 points in F1 and AUC even at 50% missing modalities. Ablations established that omitting two-stage imputation or confidence gating led to F1 drops of 5–6% and ~1%, respectively. Alternative gating methods (Gaussian, Laplacian, mean) improved balance compared to raw softmax but did not match Conf-SMoE's robustness. Computational cost was moderate: 47 GFLOPs and 3.1M parameters on CMU-MOSI, slightly more than a single-expert MoE and substantially less than large SMoE variants, while yielding the highest F1 (43.9 vs. 41–42 for the alternatives) (2505.19525).

In summary, Conf-SMoE’s integration of two-stage imputation and confidence-guided gating addresses fundamental challenges in sparse MoE architectures for multimodal and incomplete data, achieving empirically validated improvements in robustness, balance, and accuracy over prior methods.

References

  • arXiv:2505.19525
