
StAbilized Mixture-of-Experts (SAME)

Updated 9 February 2026
  • SAME is a method designed to stabilize Mixture-of-Experts architectures by addressing router drift and expert drift in Multimodal Continual Instruction Tuning.
  • It employs spectral-aware routing and curvature-based scaling to preserve critical pathways and prevent catastrophic parameter shifts in expert modules.
  • Adaptive expert activation minimizes computational redundancy and cross-task interference, achieving state-of-the-art performance on MCIT benchmarks.

StAbilized Mixture-of-Experts (SAME) is a methodology designed to resolve instability and interference in Mixture-of-Experts (MoE) architectures, specifically tailored for Multimodal Continual Instruction Tuning (MCIT) in large-scale multimodal models. SAME systematically targets two primary sources of continual-learning degradation—router drift and expert drift—by introducing spectral-aware routing, curvature-based expert regularization, and adaptive expert activation. These innovations allow MCIT systems to incorporate new capabilities without catastrophic forgetting or performance instability across tasks (Xie et al., 2 Feb 2026).

1. Background: Challenges in Multimodal Continual Instruction Tuning

MCIT tasks require a model to sequentially acquire, retain, and integrate new instruction-following or reasoning capabilities, often across heterogeneous modalities (e.g., images and text). Conventional MoE approaches leverage sparse, learnable routing to assign a subset of expert modules to each input, enabling specialization and scale. However, as new tasks are introduced sequentially, two coupled pathologies emerge:

  • Router Drift: The gating network responsible for expert selection (typically parameterized as W_G, with w(x) = Softmax(W_G x)) gradually shifts its selection behavior for previously seen input distributions. With w^{s,t} denoting the post-task-t routing for task-s data, the divergence D(w^{s,s} || w^{s,t}) grows with t > s, betraying an instability in reactivating the original expert pathway. The model therefore loses reliable access to task-specialized functionality.
  • Expert Drift: Even if router outputs are restored to their pre-drift state, the functional characteristics of the experts themselves may erode. With W_i^{(t)} denoting expert i after task t, accuracy on an earlier task s using (W_i^{(t)}, W_G^{(s)}) can be significantly lower than with the original expert parameters (W_i^{(s)}, W_G^{(s)}).

This dual-drift undermines the intended isolation and reusability of expert subroutines in MCIT pipelines (Xie et al., 2 Feb 2026).
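The divergence above can be probed directly. The following is a minimal sketch (hypothetical shapes and names, with KL divergence standing in for D) that compares a gate's routing distribution on task-s inputs before and after later-task updates:

```python
import numpy as np

def routing_weights(W_G, x):
    """Gate distribution w(x) = Softmax(W_G x) over experts."""
    logits = W_G @ x
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def router_drift(W_G_s, W_G_t, xs):
    """Mean KL divergence D(w^{s,s} || w^{s,t}) over task-s inputs."""
    kls = []
    for x in xs:
        p = routing_weights(W_G_s, x)      # routing right after task s
        q = routing_weights(W_G_t, x)      # routing after a later task t
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.mean(kls))

rng = np.random.default_rng(0)
W_s = rng.normal(size=(4, 8))                  # gate after task s (4 experts, d = 8)
W_t = W_s + 0.5 * rng.normal(size=(4, 8))      # gate perturbed by later-task training
xs = rng.normal(size=(16, 8))                  # held-out task-s inputs
```

Without stabilization, router_drift(W_s, W_t, xs) grows as the gate is perturbed further from its task-s state; this is precisely the instability SAME's spectral-aware routing suppresses.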

2. Spectral-aware Routing: Stabilization of the Expert Gating Process

To address router drift, SAME regularizes the update dynamics of the gating network through spectral analysis and constrained optimization in subspaces aligned with data geometry.

  • Historical Input Covariance: For each task t, the method maintains the (uncentered) empirical covariance C_t over all router inputs up to t:

C_t = α_{t-1} C_{t-1} + Σ_{x ∈ D_t} x x^T,    α_t = α_{t-1} + |D_t|

  • Low-rank Decomposition: Singular Value Decomposition (SVD) is applied: C_t = V Σ V^T. The top-k directions (retaining at least a fraction ε of the variance) form V_new, and the remainder span V_old.
  • Gradient Projection and Rescaling: Gating gradients ΔW_G are split: adaptations along V_new enable task adaptation, while V_old is preserved:

ΔW_G^new = V_new G (V_new^T ΔW_G)

where G rescales by the inverse of the smoothed singular values. The combined update is

ΔW_G = ΔW_G^new + ΔW_G^old

with ΔW_G^old = V_old V_old^T ΔW_G.

This ensures that updates favor subspaces informative for current tasks while exactly preserving routing integrity for history-critical (null) directions, greatly reducing routing inconsistency over time (Xie et al., 2 Feb 2026).
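The three steps above can be sketched in a few lines (a hedged illustration, not the authors' code; the gate gradient dW is written with the input dimension first so the projections typecheck, and all shapes are made up):

```python
import numpy as np

def update_covariance(C_prev, alpha_prev, X_task):
    # C_t = α_{t-1} C_{t-1} + Σ_{x in D_t} x x^T;  α_t = α_{t-1} + |D_t|
    C_t = alpha_prev * C_prev + X_task.T @ X_task
    return C_t, alpha_prev + len(X_task)

def spectral_split(C_t, eps=0.95):
    # SVD C_t = V Σ V^T; keep the top-k directions holding at least eps of the variance
    V, S, _ = np.linalg.svd(C_t)
    k = int(np.searchsorted(np.cumsum(S) / S.sum(), eps)) + 1
    return V[:, :k], V[:, k:], S[:k]

def project_gate_update(dW, V_new, V_old, S_new, smooth=1e-3):
    # dW: gate gradient with input dim first, shape (d, n_experts)
    G = np.diag(1.0 / (S_new + smooth))       # rescale by inverse smoothed singular values
    dW_new = V_new @ G @ (V_new.T @ dW)       # ΔW_G^new: adapted, task-informative part
    dW_old = V_old @ (V_old.T @ dW)           # ΔW_G^old: history-critical part, untouched
    return dW_new + dW_old

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                  # router inputs for the current task (d = 6)
C_t, alpha_t = update_covariance(np.zeros((6, 6)), 0.0, X)
V_new, V_old, S_new = spectral_split(C_t)
dW = rng.normal(size=(6, 4))                  # raw gate gradient (d x n_experts)
dW_proj = project_gate_update(dW, V_new, V_old, S_new)
```

Because V_new and V_old are orthogonal, the component of the projected update along V_old equals that of the raw gradient, while the V_new component is rescaled by the inverse spectrum.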

3. Curvature-aware Scaling: Mitigating Expert Drift

Mitigating expert parameter erosion requires explicit control over weight change effects on the function space covered by past data.

  • Degradation Metric: For each expert, the expected squared deviation due to parameter change is quantified:

D_i = E_{x ~ past}[ ||ΔW_i x||^2 ] = tr(ΔW_i C_{t-1} ΔW_i^T)

  • Drift-aware Objective: Optimization jointly minimizes the new-task loss ℓ and penalizes D_i above a threshold ε:

min_{ΔW_i}  ℓ(W_i + ΔW_i) + λ max(0, tr(ΔW_i C_{t-1} ΔW_i^T) - ε)

  • Riemannian Preconditioning: The update direction is preconditioned by the inverse input covariance, computed efficiently via the SVD as:

(C_{t-1})^{-1} ≈ V_new (Σ_new + pI)^{-1} V_new^T + (I - V_new V_new^T)

This restricts catastrophic update movement in subspaces most critical to earlier performance, preventing irreversible expert drift (Xie et al., 2 Feb 2026).
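A compact sketch of the drift penalty, hinge objective term, and approximate inverse-covariance preconditioner (illustrative shapes only; the expert update dW acts on d-dimensional inputs):

```python
import numpy as np

def drift_penalty(dW, C_past):
    # D_i = E_{x~past} ||ΔW_i x||^2 = tr(ΔW_i C_{t-1} ΔW_i^T)
    return float(np.trace(dW @ C_past @ dW.T))

def hinge_penalty(D, lam=1.0, eps=0.0):
    # λ max(0, D_i - ε): only above-threshold drift is penalized
    return lam * max(0.0, D - eps)

def precondition(grad, V_new, S_new, p=1e-2):
    # (C_{t-1})^{-1} ≈ V_new (Σ_new + pI)^{-1} V_new^T + (I - V_new V_new^T)
    d = V_new.shape[0]
    C_inv = V_new @ np.diag(1.0 / (S_new + p)) @ V_new.T + np.eye(d) - V_new @ V_new.T
    return grad @ C_inv          # damp movement along high-energy past directions

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))                 # past inputs (d = 5)
C = X.T @ X                                  # accumulated x x^T over history
V, S, _ = np.linalg.svd(C)
dW = rng.normal(size=(3, 5))                 # candidate expert update (out_dim x d)
step = precondition(dW, V[:, :3], S[:3])     # top-3 directions as the "new" subspace
```

Note that tr(ΔW C ΔW^T) equals the sum of ||ΔW x||^2 over the past inputs used to build C, which is what makes it a faithful proxy for functional degradation on earlier tasks.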

4. Adaptive Expert Activation and Freezing

SAME incorporates a dynamic mechanism to limit unnecessary computation and reduce cross-task interference by selectively freezing experts based on activation statistics.

  • Current-task Utilization: U(i), the moving-average routing weight for expert i during the current task.
  • Historical Importance: F(i), approximated by the input-energy-weighted routing over history.
  • Freezing Criterion: Each expert's activation score is S(i) = Ũ(i) - F̂_pre(i), computed from the normalized statistics. Experts with S(i) < T_score are frozen for the duration of the current task and excluded from forward/backward computation.

This mechanism reduces redundant computation, further decreasing cross-task parameter interference and computational overhead during training (Xie et al., 2 Feb 2026).
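The freezing rule reduces to a score threshold over normalized statistics; the sketch below uses made-up utilization numbers purely for illustration:

```python
import numpy as np

def freeze_mask(U, F_pre, T_score=0.0):
    # S(i) = Ũ(i) - F̂_pre(i): normalized current use minus normalized historical importance
    U_n = U / (U.sum() + 1e-12)
    F_n = F_pre / (F_pre.sum() + 1e-12)
    S = U_n - F_n
    return S < T_score, S        # frozen experts skip forward/backward for this task

U = np.array([0.40, 0.05, 0.30, 0.25])   # moving-average routing weight, current task
F = np.array([0.10, 0.50, 0.20, 0.20])   # input-energy-weighted historical routing
frozen, scores = freeze_mask(U, F)
# expert 1 matters historically but is barely used now, so it is frozen and protected
```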

5. Training Pipeline and Implementation Details

The SAME pipeline proceeds as follows:

  1. Initialization: Construct the initial input covariance C_0, scaling factor α_0, and expert-importance records.
  2. Task-wise Loop: For each task t,
    • Per minibatch: compute routing, update CtC_t, perform SVD, calculate expert activation/freezing, project/rescale routing updates and apply, precondition and update experts.
    • After all batches: update historical importance.
  3. Inference: Standard MoE routing is used, activating top-k experts (Xie et al., 2 Feb 2026).

This pipeline is rehearsal-free and efficiently incorporates history-conscious safeguards at every parameter update.
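Putting the pieces together, the task-wise loop might look like the following skeleton (hypothetical function and variable names; the per-minibatch routing, freezing, and projected updates are elided as comments):

```python
import numpy as np

def same_task_loop(tasks, d, n_experts, var_keep=0.95):
    """Rehearsal-free SAME training skeleton over a task sequence (names hypothetical)."""
    C, alpha = np.zeros((d, d)), 0.0              # 1. init covariance and sample count
    W_G = np.zeros((n_experts, d))                # gating network (updates elided below)
    V_new = V_old = None
    for X_task in tasks:                          # 2. task-wise loop
        C = alpha * C + X_task.T @ X_task         # update historical input covariance
        alpha += len(X_task)
        V, S, _ = np.linalg.svd(C)                # spectral split of the input geometry
        k = int(np.searchsorted(np.cumsum(S) / S.sum(), var_keep)) + 1
        V_new, V_old = V[:, :k], V[:, k:]
        # per minibatch (elided): compute routing with W_G, freeze low-score experts,
        # project/rescale gate gradients onto V_new / V_old, and precondition expert
        # updates by the approximate inverse covariance; then refresh importance stats.
    return C, alpha, (V_new, V_old)
```

At inference nothing from this loop is needed: the model routes with standard top-k MoE gating, as the text notes.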

6. Empirical Evaluation

SAME demonstrates state-of-the-art performance on the CoIN MCIT benchmark, which includes eight sequential Visual Question Answering (VQA) tasks (ScienceQA, TextVQA, ImageNet, GQA, VizWiz, Ref-REC, VQAv2, OCR-VQA).

  • Final Average Accuracy:
    • MoELoRA: 50.58%
    • Continual LLaVA: 53.33%
    • ModalPrompt: 55.06%
    • SEFE: 58.57%
    • ProgLoRA: 59.09%
    • LLaVA-CMoE: 59.23%
    • HiDe-LLaVA: 63.95%
    • SAME: 66.82% (+2.87% over HiDe-LLaVA) (Xie et al., 2 Feb 2026)
  • Ablation Study:
    • Spectral routing alone: 61.32%
    • +Curvature scaling: 65.89%
    • +Adaptive activation (full SAME): 66.82%
  • Task-specific improvements highlight robustness across both OCR (TextVQA: +2.5%) and vision (ImageNet: +5.3%) tasks.

7. Limitations and Future Directions

SAME is sensitive to ambiguous task boundaries and rapidly changing input conventions, which may weaken the effectiveness of covariance-based stabilization. Phenomena such as formatting-induced forgetting (where format conventions learned in early tasks are erased and only later relearned) are mitigated by SAME's modular rigidity, but such challenges may persist under more adversarially shifting distributions.

Enhancements such as tighter coupling of router and expert drift controls, dynamic subspace dimensioning, and integration of hybrid rehearsal techniques are cited as promising future avenues to further augment robustness and generalization in MCIT settings (Xie et al., 2 Feb 2026).


StAbilized Mixture-of-Experts (SAME) establishes a new paradigm for reliable, history-aware, and computationally efficient continual tuning of multimodal models, providing modular remedies for drift phenomena and setting empirical performance benchmarks in the MCIT domain (Xie et al., 2 Feb 2026).
