
MoE Upcycling: Transforming Dense Models

Updated 7 September 2025
  • MoE Upcycling is a technique that converts dense neural sub-layers into sparsely activated Mixture-of-Experts modules, enhancing scalability.
  • It leverages pretrained weights by partitioning and reusing them while introducing routers and diverse expert initialization methods like Drop-Upcycling and Delta Decomposition.
  • This approach delivers improved resource efficiency and performance boosts across language, vision, and multimodal applications.

The Mixture-of-Experts (MoE) upcycling technique is a class of methods for transforming a pretrained dense model into a sparsely activated MoE architecture by reusing, partitioning, or augmenting the existing model weights. Initially advanced in the context of neural language and vision models to address the prohibitive cost of large-scale pretraining, MoE upcycling leverages the “sunk cost” of dense model training to enable scalable, high-capacity networks without training from scratch. The technique is now foundational for resource-efficient scaling in LLMs, vision transformers, multimodal encoders, and domain-specialized systems.

1. Core Principles and Algorithmic Formulation

MoE upcycling centers on replacing selected dense sub-layers—typically feedforward multilayer perceptron (MLP) or feedforward network (FFN) modules in transformer architectures—with sparsely activated MoE equivalents. The process preserves the global architecture (e.g., number of layers, attention structure), augments the parameter count, and introduces conditional computation via routers or gating networks.

The generic upcycling workflow comprises:

  1. Layer Selection and Replacement: Within the pretrained dense model, choose which MLP/FFN layers to convert to MoE. Common practice is to substitute approximately half of these layers, but configurations vary (Komatsuzaki et al., 2022).
  2. Expert Initialization: For each layer to be upcycled, replace the dense MLP/FFN with an MoE block containing $E$ experts. Each expert is initialized by directly copying the dense model’s weights. Additional strategies include advanced allocation via importance-based weight sampling, co-activation clustering, or genetic parameter merging to increase expert diversity (Zhu et al., 7 Jun 2024, Hui et al., 2 Oct 2024).
  3. Router Introduction and Initialization: A new router (gating mechanism) is introduced to direct tokens to experts. Typically, routers are randomly initialized, although advanced variants can be upcycled from attention heads or constructed using domain/task priors for adaptive routing (Ran et al., 31 Aug 2025, Zhou et al., 21 Aug 2024).

The routing mechanism yields a sparse activation pattern per token:

$$\mathbf{R} \in \mathbb{R}^{n \times E}, \quad r_{ij} \ge 0, \quad \sum_{j=1}^{E} r_{ij} = 1$$

for $n$ tokens and $E$ experts. The top-$K$ routing scheme activates only $K$ experts per token, with the effective output:

$$x'_i = \sum_{j \in S_i} \frac{r_{ij}}{\sum_{k \in S_i} r_{ik}} \, e_j(x_i)$$

where $S_i$ are the selected experts for token $i$ and $e_j(x_i)$ their outputs (Komatsuzaki et al., 2022).
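
A minimal, unoptimized sketch of the surgery and top-$K$ dispatch described above, written in PyTorch; the class name, the assumption that the dense FFN maps $d_\text{model} \to d_\text{model}$, and the default hyperparameters are illustrative rather than taken from any particular paper:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class UpcycledMoE(nn.Module):
    """Sparse MoE block obtained by upcycling a pretrained dense FFN."""

    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Expert initialization: every expert starts as an exact copy of the dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        # Router introduction: a new, randomly initialized gating network.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (n, d_model) token representations; returns (n, d_model)."""
        probs = F.softmax(self.router(x), dim=-1)                   # r_ij >= 0, rows sum to 1
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)        # S_i = top-K experts per token
        weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)   # renormalize over S_i

        out = torch.zeros_like(x)
        for j, expert in enumerate(self.experts):
            rows, slots = (topk_idx == j).nonzero(as_tuple=True)    # tokens that selected expert j
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out
```

Because all experts are identical copies at initialization and the renormalized routing weights sum to one, this block initially reproduces the dense FFN's output exactly; specialization emerges only once continued training lets the copies diverge.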

Training is resumed (“continued pretraining” or “post-pretraining”), allowing the router and expert weights to diverge and specialize.


2. Model Surgery, Initialization Variants, and Parameter Efficiency

While basic upcycling duplicates dense weights into all experts, recent work has introduced several refinements to address redundancy and facilitate convergence:

  • Virtual Group Initialization: For “fine-grained” MoE architectures, the dense FFN is partitioned into $G$ shards, with the MoE constructed so each group collectively covers all shards. Router initialization is coordinated to maintain functional parity with the original dense function at startup (He et al., 10 Oct 2024).
  • Partial Re-initialization (Drop-Upcycling): To overcome the slow convergence and lack of expert specialization that arise from identical expert initialization, Drop-Upcycling randomly re-initializes a specified fraction $r$ of FFN weights per expert (see the sketch following this list). This introduces diversity while retaining most pretrained knowledge. Optimal $r$ values (e.g., $r = 0.5$) yield both rapid convergence and strong generalization (Nakamura et al., 26 Feb 2025).
  • Delta Decomposition (DeRS Paradigm): Given the high similarity of upcycled experts early in training, each expert’s weight is decomposed as $W_i = W_\text{base} + \Delta_i$, with $\Delta_i$ stored or trained as a lightweight, sparse, or low-rank delta (sketched after the table below). This reduces parameter count while retaining expert flexibility (Huang et al., 3 Mar 2025).
  • Attention Parameter Upcycling: Beyond FFN/MLP layers, BAM (Branch-Attend-Mix) introduces the upcycling of specialist attention modules, initializing Mixture-of-Attention (MoA) layers directly from multiple independently fine-tuned dense experts. Query/output parameters can be upcycled individually; key/value parameters can be shared for memory efficiency (Zhang et al., 15 Aug 2024).
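
The sketch below illustrates the Drop-Upcycling diversification step, assuming element-wise re-initialization from a normal distribution matched to each weight matrix's statistics; the paper's exact re-initialization granularity and distribution may differ, and the layer sizes are arbitrary:

```python
import copy

import torch
import torch.nn as nn


def drop_upcycle_expert(dense_ffn: nn.Module, r: float = 0.5) -> nn.Module:
    """Copy a dense FFN and randomly re-initialize a fraction r of its weights."""
    expert = copy.deepcopy(dense_ffn)
    with torch.no_grad():
        for p in expert.parameters():
            if p.dim() < 2:                      # leave biases and norm parameters untouched
                continue
            mask = torch.rand_like(p) < r        # entries selected for re-initialization
            fresh = torch.randn_like(p) * p.std() + p.mean()   # match original statistics
            p.copy_(torch.where(mask, fresh, p))
    return expert


# Build a diversified expert set from a single pretrained FFN (illustrative sizes).
dense_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
experts = nn.ModuleList(drop_upcycle_expert(dense_ffn, r=0.5) for _ in range(8))
```

Each expert receives an independent random mask, so the copies differ from one another immediately while still sharing roughly a fraction $1 - r$ of the pretrained weights.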

The table below summarizes several representative upcycling techniques:

| Technique | Expert Initialization | Additional Features |
|---|---|---|
| Vanilla Upcycling | Duplicate dense FFN | Random router, MoE GLU/MLP |
| Virtual Group (He et al., 10 Oct 2024) | Sharded FFN → grouped experts | Router grouping, scaling factor |
| Drop-Upcycling (Nakamura et al., 26 Feb 2025) | Duplicate & partial re-init. | Expert-level randomness, $r$-tuned |
| DeRS (Huang et al., 3 Mar 2025) | Base FFN + lightweight $\Delta_i$ | Sparse/low-rank delta during upcycling |
| BAM (Zhang et al., 15 Aug 2024) | Full FFN + full attention | MoA soft routing, shared vs. expert KV |
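
A minimal sketch of the delta-decomposed expert parameterization in the DeRS row, using a low-rank $\Delta_i = B_i A_i$; the rank, the choice to keep the shared base weight frozen, and the module name are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn


class DeltaExpert(nn.Module):
    """One expert linear map parameterized as W_i = W_base + B_i @ A_i."""

    def __init__(self, base_weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = base_weight.shape
        self.register_buffer("w_base", base_weight)                 # shared, frozen pretrained weight
        self.delta_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.delta_b = nn.Parameter(torch.zeros(out_dim, rank))     # zero init: Delta_i = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.w_base + self.delta_b @ self.delta_a               # W_i = W_base + Delta_i
        return x @ w.t()


# Eight experts that initially all compute the same dense linear map,
# while adding only rank * (in_dim + out_dim) trainable parameters each.
base = torch.randn(2048, 512)
experts = nn.ModuleList(DeltaExpert(base, rank=8) for _ in range(8))
```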

3. Routing Mechanisms and Diversification Strategies

Early upcycling schemes used random router initialization with “Top-K” or “Expert Choice” dispatching. To enhance specialization and avoid expert collapse:

  • Gating Logit Normalization: Applied in Skywork-MoE, this rescales router logits:

$$\tilde{z} = \lambda \, \frac{z - \mu}{\sigma}, \qquad g = \text{softmax}(\tilde{z})$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the router logits $z$ and $\lambda$ controls the distribution sharpness, increasing specialization and reducing token “dropouts” (Wei et al., 3 Jun 2024); a short sketch appears at the end of this section.

  • Mixture-of-Routers (Router Upcycling): Instead of a single linear router, multiple routers are initialized from attention heads of the dense model. Each token representation $x$ is projected into multiple queries $Q^{(j)} = W^{(j)} x$, and routing is performed via an attention-like alignment with expert keys $K_i$. The aggregation

$$S_i = \sum_j \frac{(Q^{(j)})^\top K_i}{\sqrt{d'}}, \qquad R = \text{softmax}(S)$$

leads to more diverse, well-specialized expert assignments and improved convergence (Ran et al., 31 Aug 2025).
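
As referenced above, here is a sketch of gating logit normalization, assuming per-token standardization of the router logits across the expert dimension; the shapes and the default $\lambda$ are illustrative:

```python
import torch
import torch.nn.functional as F


def normalized_gating(logits: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Gating logit normalization: g = softmax(lambda * (z - mu) / sigma).

    logits: (n_tokens, n_experts) raw router outputs z. mu and sigma are computed
    per token across experts; lam sharpens (>1) or flattens (<1) the distribution.
    """
    mu = logits.mean(dim=-1, keepdim=True)
    sigma = logits.std(dim=-1, keepdim=True) + 1e-6    # avoid division by zero
    z_tilde = lam * (logits - mu) / sigma
    return F.softmax(z_tilde, dim=-1)


# Larger lambda concentrates routing mass on fewer experts per token.
z = torch.randn(4, 8)
print(normalized_gating(z, lam=1.0).max(dim=-1).values)
print(normalized_gating(z, lam=4.0).max(dim=-1).values)
```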


4. Empirical Performance and Resource Efficiency

A consistent finding across studies is that sparse upcycling achieves favorable trade-offs in compute versus downstream performance relative to both dense continuation and MoE-from-scratch baselines.

LLMs:

  • Upcycled T5 models (Base, Large, XL) outperform dense baselines on SuperGLUE by 1.5–2 points, requiring only 50% additional training computation versus the initial dense pretraining (Komatsuzaki et al., 2022).
  • Upcycling Nemotron-4 15B (E8) with virtual grouping achieves a 2.3% absolute MMLU improvement over continued dense training under the same training compute (He et al., 10 Oct 2024).
  • At large scale, upcycled Skywork-MoE and Llama 3 MoE both demonstrate strong downstream performance (e.g., +2% 0-shot MMLU versus Llama 3-8B with only 1% of typical pretraining FLOPs) (Vavre et al., 13 Dec 2024).

Vision and Multimodal Models:

  • Upcycled ViT-B/16 models require only 13% of the dense training compute to achieve +1% improvement on 10-shot ImageNet compared to 58% for dense continuation (Komatsuzaki et al., 2022).
  • CLIP-MoE and CLIP-UP surpass dense and scratch-trained MoE counterparts by 6–7% Recall@1 on COCO while using only 30% of the inference FLOPs of larger dense models (Wang et al., 3 Feb 2025).

Resource Implications and Trade-offs:

  • Sparse upcycling can impose inference slowdowns up to 40% for larger models due to routing and increased active parameters, and is most advantageous when higher accuracy is desired under moderate compute budgets (Doubov et al., 13 Nov 2024).
  • Scaling laws reveal that upcycling’s efficiency diminishes with excessive sunk pretraining; there exists a critical threshold in additional upcycling tokens ($D^*$) beyond which scratch training is preferable (Liew et al., 5 Feb 2025).

5. Extensions and Adaptive Approaches

Recent innovations further broaden the MoE upcycling paradigm:

  • Parameter and Routing Compression: DeRS directly compresses upcycled experts through sparsification or quantization of delta weights, enabling bandwidth- and storage-limited deployment with negligible loss (Huang et al., 3 Mar 2025).
  • Instruction Fine-tuning Upcycling: UpIT exploits intermediate instruction-tuning checkpoints as naturally diverse experts, applies genetic expert expansion for scalability, and seed-based router warm-up to maximize expert specialization with minimal data (Hui et al., 2 Oct 2024).
  • Multitask and Domain Adaptivity: Nexus enables open-ended expansion by efficiently integrating new, independently trained domain experts via adaptive embedding-based routing, supporting "plug-and-play" community model assembly (Gritsch et al., 28 Aug 2024).
  • Fine-grained Scientific Induction: The Innovator series introduces multi-stage upcycling (induction, splitting, routing warmup, integration) to build fine-grained science experts within hybrid MoE architectures, achieving 25% performance gains on scientific benchmarks with 99% retention of general ability (Liao et al., 24 Jul 2025).
  • Early Stopping and Undertraining: Recent evidence from pipeline analysis demonstrates that aggressive early stopping during expert fine-tuning improves downstream merging and MoE upcycling accuracy by avoiding over-memorization of hard examples, counter to prior assumptions (Horoi et al., 17 Jun 2025).

6. Practical Guidance and Future Research Directions

MoE upcycling has matured into a flexible, efficient paradigm for repurposing dense checkpoints into high-capacity, adaptable architectures, but best practices are nuanced:

  • Employ diversity-inducing strategies (partial re-initialization, instruction checkpoint selection, expert expansion) to avoid slow convergence and expert collapse.
  • Tune router initialization and optimization (normalization, domain embedding, mixture-of-attention) to ensure effective specialization and balanced load.
  • Balance the allocation of dense pretraining and additional upcycling based on empirical scaling laws to avoid diminishing returns (Liew et al., 5 Feb 2025).
  • For multi-task or continual learning, publish intermediate checkpoints and favor early-stopped experts for subsequent upcycling or merging (Horoi et al., 17 Jun 2025).
  • Parameter-efficient storage and adaptive extension via lightweight delta representations (DeRS), adjugate experts (Grove MoE), or modular domain embeddings (Nexus) are promising areas for further exploration.

The MoE upcycling framework continues to enable advances across language, vision, multimodal, and domain-specialized models, supporting scalable, efficient, and evolvable AI systems.
