StAbilized Mixture-of-Experts (SAME)
- SAME is a method designed to stabilize Mixture-of-Experts architectures by addressing router drift and expert drift in Multimodal Continual Instruction Tuning.
- It employs spectral-aware routing and curvature-based scaling to preserve critical pathways and prevent catastrophic parameter shifts in expert modules.
- Adaptive expert activation minimizes computational redundancy and cross-task interference, achieving state-of-the-art performance on MCIT benchmarks.
StAbilized Mixture-of-Experts (SAME) is a methodology designed to resolve instability and interference in Mixture-of-Experts (MoE) architectures, specifically tailored for Multimodal Continual Instruction Tuning (MCIT) in large-scale multimodal models. SAME systematically targets two primary sources of continual-learning degradation—router drift and expert drift—by introducing spectral-aware routing, curvature-based expert regularization, and adaptive expert activation. These innovations allow MCIT systems to incorporate new capabilities without catastrophic forgetting or performance instability across tasks (Xie et al., 2 Feb 2026).
1. Background: Challenges in Multimodal Continual Instruction Tuning
MCIT tasks require a model to sequentially acquire, retain, and integrate new instruction-following or reasoning capabilities, often across heterogeneous modalities (e.g., images and text). Conventional MoE approaches leverage sparse, learnable routing to assign a subset of expert modules to each input, enabling specialization and scale. However, as new tasks are introduced sequentially, two coupled pathologies emerge:
- Router Drift: The gating network responsible for expert selection (typically a learnable gate $g_\phi(x) = \mathrm{softmax}(W_g x)$ with parameters $\phi$) gradually shifts its selection behavior on previously seen input distributions. With $g_{\phi_t}$ denoting the post-task-$t$ router, the divergence between $g_{\phi_t}(x)$ and $g_{\phi_i}(x)$ on task-$i$ data grows with $t - i$, revealing an instability in reactivating the original expert pathway. The model therefore loses reliable access to task-specialized functionality.
- Expert Drift: Even if router outputs are restored to their pre-drift state, the functional characteristics of the experts themselves may erode. With $E_j^{(t)}$ denoting expert $j$ after task $t$, accuracy on an earlier task $i < t$ using $E_j^{(t)}$ can be significantly lower than with the original expert parameters $E_j^{(i)}$.
This dual drift undermines the intended isolation and reusability of expert subroutines in MCIT pipelines (Xie et al., 2 Feb 2026). A simple way to quantify the router half of the problem is sketched below.
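Router drift can be measured as a divergence between routing distributions on a fixed probe set of earlier-task inputs. In the NumPy sketch below, the `router_drift` helper and the total-variation metric are expository assumptions, not quantities defined in the paper.

```python
import numpy as np

def router_drift(gate_before: np.ndarray, gate_after: np.ndarray) -> float:
    """Mean routing divergence on a probe set of earlier-task inputs.

    gate_before, gate_after: (N, E) routing distributions over E experts
    for the same N inputs, produced by the gate before and after later
    tasks were learned. Larger values indicate stronger router drift.
    """
    # Total-variation distance per input, averaged over the probe set.
    return float(0.5 * np.abs(gate_after - gate_before).sum(axis=1).mean())

# Toy probe: simulate drift by permuting which expert each input prefers.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
before = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
after = np.roll(before, shift=1, axis=1)
print(f"router drift: {router_drift(before, after):.3f}")
```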
2. Spectral-aware Routing: Stabilization of the Expert Gating Process
To address router drift, SAME regularizes the update dynamics of the gating network through spectral analysis and constrained optimization in subspaces aligned with data geometry.
- Historical Input Covariance: For each task $t$, the method maintains the (uncentered) empirical covariance over all router inputs observed up to $t$:
$$\Sigma_t \;=\; \frac{1}{N_t}\sum_{n=1}^{N_t} x_n x_n^{\top},$$
where the $x_n$ are the $N_t$ router inputs seen so far.
- Low-rank Decomposition: Singular Value Decomposition (SVD) is applied: $\Sigma_t = U \Lambda U^{\top}$. The top-$k$ directions (retaining at least a fraction $\tau$ of the total variance) form $U_k$, and the remainder span $U_\perp$.
- Gradient Projection and Rescaling: Gating gradients are split: adaptation in the $U_\perp$ directions enables task learning, while the component along $U_k$ is damped so that routing on historical inputs is preserved:
$$g_\parallel = U_k U_k^{\top}\,\nabla_\phi \mathcal{L}, \qquad g_\perp = \nabla_\phi \mathcal{L} - g_\parallel,$$
where $g_\parallel$ is rescaled by the inverse of the smoothed singular values, $\tilde{g}_\parallel = U_k\,\mathrm{diag}\big(\tfrac{\epsilon}{\lambda_i + \epsilon}\big)\,U_k^{\top}\,\nabla_\phi \mathcal{L}$, so that high-variance (history-critical) directions receive vanishing updates. The combined update is
$$\Delta\phi = -\eta\,\big(g_\perp + \tilde{g}_\parallel\big),$$
with $\eta$ the learning rate and $\epsilon > 0$ a smoothing constant.
This ensures that updates favor subspaces informative for the current task while preserving routing integrity along history-critical directions, greatly reducing routing inconsistency over time (Xie et al., 2 Feb 2026); a minimal sketch of the step follows.
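The NumPy sketch assumes a linear gate $g_\phi(x) = \mathrm{softmax}(W_g x)$; the helper name, hyperparameter defaults, and the damping form $\epsilon/(\lambda_i + \epsilon)$ are illustrative choices consistent with the description above, not the authors' released code.

```python
import numpy as np

def spectral_routing_update(grad_Wg, Sigma_hist, eta=1e-2, tau=0.99, eps=1e-3):
    """Project and rescale a gating-weight gradient against historical
    input geometry (illustrative sketch of the Section 2 update).

    grad_Wg:    (E, d) loss gradient w.r.t. the gating matrix W_g.
    Sigma_hist: (d, d) uncentered covariance of router inputs seen so far.
    """
    # Eigendecomposition of the symmetric PSD covariance (equals its SVD).
    lam, U = np.linalg.eigh(Sigma_hist)
    lam, U = np.clip(lam[::-1], 0.0, None), U[:, ::-1]  # descending order
    # Smallest k retaining at least a tau fraction of total variance.
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), tau)) + 1
    Uk = U[:, :k]                                       # history-dominant basis

    g_par = grad_Wg @ Uk @ Uk.T                         # history-aligned component
    g_perp = grad_Wg - g_par                            # free adaptation component
    # Damp history-critical directions by inverse smoothed singular values.
    g_par_tilde = (grad_Wg @ Uk) * (eps / (lam[:k] + eps)) @ Uk.T
    return -eta * (g_perp + g_par_tilde)                # Delta W_g
```

Applied as `Wg += spectral_routing_update(grad_Wg, Sigma_hist)`, the step moves freely in directions that historical inputs never exercised and barely at all along their dominant axes.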
3. Curvature-aware Scaling: Mitigating Expert Drift
Mitigating expert parameter erosion requires explicit control over weight change effects on the function space covered by past data.
- Degradation Metric: For each expert $j$, the expected squared output deviation induced by a parameter change $\Delta\theta_j$ is quantified over past inputs:
$$D_j \;=\; \mathbb{E}_{x \sim \mathcal{D}_{\le t}}\Big[\big\|E_j(x;\,\theta_j + \Delta\theta_j) - E_j(x;\,\theta_j)\big\|^2\Big],$$
which for a linear expert $E_j(x) = W_j x$ reduces to $\operatorname{tr}\!\big(\Delta W_j\,\Sigma_t\,\Delta W_j^{\top}\big)$.
- Drift-aware Objective: Optimization jointly minimizes the new-task loss and penalizes drift $D_j$ above a threshold $\delta$:
$$\min_{\Delta\theta_j}\;\; \mathcal{L}_{\text{new}}(\theta_j + \Delta\theta_j) \;+\; \lambda\,\max\!\big(0,\; D_j - \delta\big).$$
- Riemannian Preconditioning: The update direction is preconditioned by the inverse input covariance, computed efficiently from the SVD of $\Sigma_t$:
$$\Delta W_j \;=\; -\,\eta\,\nabla_{W_j}\mathcal{L}\;\Sigma_t^{-1}, \qquad \Sigma_t^{-1} = U\,(\Lambda + \epsilon I)^{-1}\,U^{\top}.$$
This restricts catastrophic parameter movement in the subspaces most critical to earlier performance, preventing irreversible expert drift (Xie et al., 2 Feb 2026); the linear-expert case is sketched below.
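Here the degradation metric becomes a trace form in the historical covariance, and the preconditioned step right-multiplies the gradient by the smoothed inverse covariance; the $\epsilon$ smoothing and function names are assumptions consistent with the text, not the paper's exact recipe.

```python
import numpy as np

def drift_penalty(dW, Sigma_hist):
    """Expected squared output deviation for a linear expert y = W x:
    E||dW x||^2 = tr(dW Sigma dW^T), i.e. the degradation metric D_j."""
    return float(np.trace(dW @ Sigma_hist @ dW.T))

def preconditioned_step(grad_W, Sigma_hist, eta=1e-2, eps=1e-3):
    """Right-precondition the expert gradient by the smoothed inverse
    input covariance, computed via eigendecomposition (the SVD of a
    symmetric PSD matrix). Steps shrink along high-energy directions."""
    lam, U = np.linalg.eigh(Sigma_hist)
    Sigma_inv = U @ np.diag(1.0 / (np.clip(lam, 0.0, None) + eps)) @ U.T
    return -eta * grad_W @ Sigma_inv
```

Because the inverse covariance shrinks the step exactly where past inputs carried the most energy, the drift penalty stays small per unit of new-task progress.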
4. Adaptive Expert Activation and Freezing
SAME incorporates a dynamic mechanism to limit unnecessary computation and reduce cross-task interference by selectively freezing experts based on activation statistics.
- Current-task Utilization: $u_j^{\mathrm{cur}}$, the moving-average routing weight assigned to expert $j$ during the current task.
- Historical Importance: $h_j$, approximated by the input-energy-weighted routing over history.
- Freezing Criterion: Each expert's activation score $s_j$ is computed from $u_j^{\mathrm{cur}}$ and $h_j$ and normalized across experts. Experts with $s_j$ below a threshold $\tau$ are frozen for the duration of the current task and excluded from forward/backward computation.
This mechanism avoids redundant computation and further decreases cross-task parameter interference during training (Xie et al., 2 Feb 2026); a small bookkeeping sketch follows.
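The sketch assumes an exponential moving average for current utilization and an additive, normalized combination for the score; both choices are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def update_utilization(u_cur, gate_weights, alpha=0.99):
    """EMA of per-expert routing mass over the current task.
    u_cur: (E,) running utilization; gate_weights: (B, E) batch routing."""
    return alpha * u_cur + (1.0 - alpha) * gate_weights.mean(axis=0)

def frozen_mask(u_cur, h_hist, tau=0.05):
    """Freeze experts whose normalized activation score falls below tau.
    The additive score u_cur + h_hist is an assumed instantiation of the
    criterion; frozen experts skip forward/backward for this task."""
    score = u_cur + h_hist
    score = score / max(score.sum(), 1e-12)   # normalize across experts
    return score < tau
```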
5. Training Pipeline and Implementation Details
The SAME pipeline proceeds as follows:
- Initialization: Construct the initial input covariance $\Sigma_0$, the singular-value scaling terms, and the expert-importance records $h_j$.
- Task-wise Loop: For each task $t$:
- Per minibatch: compute routing, update $\Sigma_t$, perform the SVD, compute expert activation/freezing scores, project and rescale the routing update and apply it, then precondition and update the active experts.
- After all batches: update historical importance.
- Inference: Standard MoE routing is used, activating top-k experts (Xie et al., 2 Feb 2026).
This pipeline is rehearsal-free and incorporates history-conscious safeguards at every parameter update; a skeleton of the loop is sketched below.
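The skeleton composes the helper sketches from Sections 2 through 4; the `backprop` callable standing in for the host model's gradient computation is hypothetical, as is the EMA used to refresh historical importance.

```python
import numpy as np

def train_same(tasks, Wg, experts, h_hist, backprop, decay=0.9):
    """Skeleton of the SAME task loop. Reuses spectral_routing_update,
    preconditioned_step, update_utilization, and frozen_mask from the
    sketches above. Illustrative only; not the authors' pipeline code."""
    d = Wg.shape[1]
    Sigma, n_seen = np.zeros((d, d)), 0
    for task_batches in tasks:
        u_cur = np.zeros(len(experts))
        for X in task_batches:                        # X: (B, d) router inputs
            # Running uncentered covariance over all router inputs (Sec. 2).
            Sigma = (n_seen * Sigma + X.T @ X) / (n_seen + len(X))
            n_seen += len(X)
            z = X @ Wg.T                              # gating logits
            gate = np.exp(z - z.max(axis=1, keepdims=True))
            gate /= gate.sum(axis=1, keepdims=True)   # softmax routing
            u_cur = update_utilization(u_cur, gate)   # Sec. 4
            mask = frozen_mask(u_cur, h_hist)         # freeze low-score experts
            g_gate, g_experts = backprop(X, gate, mask)
            Wg += spectral_routing_update(g_gate, Sigma)            # Sec. 2
            for j, frozen in enumerate(mask):
                if not frozen:
                    experts[j] += preconditioned_step(g_experts[j], Sigma)  # Sec. 3
        h_hist = decay * h_hist + (1.0 - decay) * u_cur  # assumed history update
    return Wg, experts
```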
6. Empirical Evaluation
SAME demonstrates state-of-the-art performance on the CoIN MCIT benchmark, which includes eight sequential Visual Question Answering (VQA) tasks (ScienceQA, TextVQA, ImageNet, GQA, VizWiz, Ref-REC, VQAv2, OCR-VQA).
- Final Average Accuracy:
- MoELoRA: 50.58%
- Continual LLaVA: 53.33%
- ModalPrompt: 55.06%
- SEFE: 58.57%
- ProgLoRA: 59.09%
- LLaVA-CMoE: 59.23%
- HiDe-LLaVA: 63.95%
- SAME: 66.82% (+2.87% over HiDe-LLaVA) (Xie et al., 2 Feb 2026)
- Ablation Study:
- Spectral routing alone: 61.32%
- +Curvature scaling: 65.89%
- +Adaptive activation (full SAME): 66.82%
- Task-specific improvements highlight robustness across both OCR (TextVQA: +2.5%) and vision (ImageNet: +5.3%) tasks.
7. Limitations and Future Directions
SAME is sensitive to ambiguous task boundaries and rapidly changing input conventions, which can weaken its covariance-based stabilization. Phenomena such as formatting-induced forgetting (where answer-format conventions learned in early tasks are overwritten in later tasks and only partially recovered) are mitigated by SAME's modular rigidity, but such challenges may persist under more adversarially shifting distributions.
Enhancements such as tighter coupling of router and expert drift controls, dynamic subspace dimensioning, and integration of hybrid rehearsal techniques are cited as promising future avenues to further augment robustness and generalization in MCIT settings (Xie et al., 2 Feb 2026).
StAbilized Mixture-of-Experts (SAME) establishes a new paradigm for reliable, history-aware, and computationally efficient continual tuning of multimodal models, providing modular remedies for drift phenomena and setting empirical performance benchmarks in the MCIT domain (Xie et al., 2 Feb 2026).