Fine-grained MoE-LLMs Overview

Updated 30 December 2025
  • Fine-grained MoE-LLMs are advanced language models that use modular expert routing and increased expert granularity to enhance computational efficiency and specialization.
  • They employ dynamic token-level routing, norm-based selection, and graph-based load balancing to optimize both pretraining and fine-tuning performance.
  • These models achieve superior scaling and convergence while addressing challenges such as additional router FLOPs and memory bottlenecks during deployment.

Fine-grained Mixture-of-Experts LLMs (Fine-grained MoE-LLMs) are advanced neural architectures that leverage modular, sparse activation of model parameters through sophisticated expert routing at scales ranging from whole FFNs down to individual neurons. By increasing the number of smaller experts, fine-grained MoE-LLMs achieve superior computational efficiency, convergence speed, specialization, and scaling properties across both pretraining and fine-tuning regimes. This paradigm is characterized by hierarchical modularity, dynamic token-level expert selection, and a variety of architectural innovations that collectively enable more effective and efficient large language modeling.

1. Architectural Frameworks for Fine-Grained MoE-LLMs

Fine-grained MoE-LLMs extend classic Mixture-of-Experts designs by increasing expert granularity—using many more experts of smaller size—with flexible routing mechanisms:

  • Standard MoE layer: the FFN in each transformer block is replaced by $N$ parallel expert FFNs plus a router (a minimal sketch follows this list). For input $x \in \mathbb{R}^h$:
    • Router logits: $l(x) = W_g x$, with $W_g \in \mathbb{R}^{N \times h}$
    • Gating: $g(x) = \mathrm{softmax}(l(x)) \in \Delta^N$; a Top-$k$ mask selects $k \ll N$ active experts
    • Output: $y = \sum_{i \in \mathcal{K}(x)} g_i(x)\, E_i(x)$
  • Fine-grained extension: each FFN may be further subdivided into micro-experts (neurons or slices), with routers operating at the neuron or subblock level. Notably, neurons themselves often function as “mini-experts,” exhibiting distinctive activation and specialization profiles (Lo et al., 26 Jun 2024).
  • Dynamic expert activation: advanced schemes, as in Grove MoE, introduce groups of heterogeneous experts with adjugate (“little”) modules for adaptive capacity scaling. The architecture computes both large (big) and small (adjugate) expert activations in parallel, modulating their output contribution according to input complexity (Wu et al., 11 Aug 2025).
  • Graph-based expert collaboration: routers may be implemented via message-passing GNNs over expert-token graphs, enabling direct expert–expert interaction and coordinated activation strategies (Bai et al., 18 Dec 2024).
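The MoE layer described in the first bullet can be sketched as follows. This is a minimal, illustrative PyTorch implementation rather than the code of any cited model; the expert count, expert width, and Top-$k$ value are assumed hyperparameters, and the per-expert loop stands in for the batched dispatch used in real systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoELayer(nn.Module):
    """Illustrative sparse MoE layer: N small expert FFNs plus a linear router W_g."""
    def __init__(self, hidden_dim=512, n_experts=32, expert_dim=128, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, n_experts, bias=False)   # W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, expert_dim),
                          nn.GELU(),
                          nn.Linear(expert_dim, hidden_dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, hidden_dim)
        logits = self.router(x)                  # l(x) = W_g x
        gates = F.softmax(logits, dim=-1)        # g(x) on the simplex
        top_vals, top_idx = gates.topk(self.top_k, dim=-1)
        top_vals = top_vals / top_vals.sum(-1, keepdim=True)   # renormalize over K(x)
        y = torch.zeros_like(x)
        for slot in range(self.top_k):           # y = sum_{i in K(x)} g_i(x) E_i(x)
            for e_id in top_idx[:, slot].unique().tolist():
                sel = top_idx[:, slot] == e_id
                y[sel] += top_vals[sel, slot:slot + 1] * self.experts[e_id](x[sel])
        return y
```

A fine-grained variant simply increases n_experts while shrinking expert_dim, keeping per-token compute roughly constant; the neuron-level and grouped (Grove-style) schemes above replace the uniform expert list with micro-experts or heterogeneous expert groups.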

2. Expert Routing Mechanisms and Load-Balancing

  • Linear and nonlinear routers: Most fine-grained MoE-LLMs employ linear gating networks ($W_g x$), often followed by Top-$k$ selection and softmax normalization. MixLoRA, for instance, uses $g(x) = \mathrm{Softmax}(\mathrm{TopK}(W_g x))$, with load-balance losses to prevent expert collapse (Li et al., 22 Apr 2024); a minimal sketch follows this list.
  • Norm-based routing: Routers empirically favor experts with the largest output norms: the selected expert indices nearly perfectly match those whose $L_2$ output norms are highest, indicating a strong norm preference in the gating mechanism (Lo et al., 26 Jun 2024).
  • Dynamic activation and group biasing: Grove MoE employs dual softmax-sigmoid heads, with group-wise adjugate experts whose activations depend on accumulated sigmoid scores, further balanced via learnable bias updates (Wu et al., 11 Aug 2025).
  • Graph routers and distributional balancing: GMoE’s router leverages graph-convolutions between experts and tokens, coupled with Poisson-based distinction (encouraging specialization) and normal-based load balancing, both enforced via KL-divergence regularization terms (Bai et al., 18 Dec 2024).
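A rough sketch of Top-$k$ gating with an auxiliary load-balance loss follows. The Switch-Transformer-style loss form used here (fraction of tokens per expert times mean gate probability) is one common choice and an assumption; MixLoRA and the other cited models may weight or formulate it differently.

```python
import torch
import torch.nn.functional as F

def topk_route(logits, k):
    """Softmax gating over router logits, keeping the k largest gates per token."""
    gates = F.softmax(logits, dim=-1)                  # (tokens, n_experts)
    top_vals, top_idx = gates.topk(k, dim=-1)
    return gates, top_vals, top_idx

def load_balance_loss(gates, top_idx, n_experts):
    """Auxiliary loss discouraging expert collapse:
    n_experts * sum_e (fraction of tokens whose top-1 expert is e) * (mean gate prob of e)."""
    counts = torch.bincount(top_idx[:, 0], minlength=n_experts).float()
    frac = counts / top_idx.shape[0]                   # routing fraction per expert
    prob = gates.mean(dim=0)                           # mean router probability per expert
    return n_experts * torch.sum(frac * prob)

# Toy usage: 16 tokens routed over 32 fine-grained experts with k = 4
logits = torch.randn(16, 32)
gates, top_vals, top_idx = topk_route(logits, k=4)
print("aux loss:", load_balance_loss(gates, top_idx, n_experts=32).item())
```

The norm-based observation above can be checked empirically by comparing the router's Top-$k$ indices against the experts ranked by the $L_2$ norm of their outputs on the same tokens.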

3. Scaling Laws and Granularity Optimization

  • Granularity ($G$) as a hyperparameter: The efficiency, convergence, and specialization of MoE-LLMs depend strongly on the number and size of experts. Scaling laws established in (Krajewski et al., 12 Feb 2024) show:
    • For compute-optimal configurations: as the FLOPs budget $F$ increases, the optimal expert granularity rises super-linearly ($G^* \propto F^{2.21}$).
    • Standard expert-width heuristics ($d_\text{expert} = 4m$) are suboptimal; at realistic budgets, experts should often be $4$–$16\times$ smaller, and $G$ should be set to 8, 16, 32, or larger.
  • Empirical findings: Fine-grained MoE with high $G$ achieves superior convergence, downstream accuracy, and efficiency compared to dense or heuristic-coarse MoE baselines. Gains are magnified at larger scales and longer training durations (Krajewski et al., 3 Jun 2025).
| FLOPs budget ($F$) | Optimal granularity ($G^*$) | Optimal expert size ($d^*$) |
|---|---|---|
| $2 \times 10^{20}$ | 16 | $4m/16 = 0.25m$ |
| $5 \times 10^{24}$ | 64 | $4m/64 = 0.0625m$ |
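The expert-size column follows directly from the $d^* = 4m/G$ relation. A tiny numeric sketch (the hidden size m below is an arbitrary placeholder, and the scaling-law constant relating $F$ to $G^*$ is not reproduced here):

```python
def expert_width(m: int, granularity: int) -> float:
    """Per-expert hidden width when a dense FFN of width 4*m is split into G experts."""
    return 4 * m / granularity

m = 4096  # assumed model hidden size, for illustration only
for G in (8, 16, 32, 64):
    print(f"G = {G:3d}  ->  d* = {expert_width(m, G):7.1f}  ({4 / G:.4f} * m)")
```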

4. Layer-Wise Expert Specialization and Allocation

  • Neuron-level and micro-expert behavior: Cosine similarity among expert neurons reveals dense clustering, with many “tiny experts” replicated across experts. In Mixtral, neuron similarity remains high across experts; in other models it is lower, indicating more specialized neurons (Lo et al., 26 Jun 2024).
  • Layer-dependent diversity: Expert diversity (measured by cosine similarity and routing entropy) increases along the depth of the network, peaking in later layers. The final MoE layer is an outlier and often reverts to more homogeneous specialization (Lo et al., 26 Jun 2024).
  • Layer-wise expert allocation: Algorithms such as LayerMoE allocate experts per layer in inverse proportion to cross-lingual representation similarity. Shallow and deep layers (low similarity) get more new experts, while middle layers require fewer (Zhang et al., 28 May 2025).
  • Fine-grained tuning: Expert-Specialized Fine-Tuning (ESFT) leverages high expert granularity to tune only a small fraction of relevant experts per downstream task, greatly enhancing PEFT efficiency and reducing negative transfer (Wang et al., 2 Jul 2024).
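A rough sketch of the expert-selection step behind ESFT-style tuning: run the frozen model over a sample of task data, rank experts by how much routing mass they receive, and mark only the top few as trainable. The mean-gate affinity metric and the coverage threshold below are illustrative assumptions; the cited work evaluates several selection criteria.

```python
import torch

def select_task_experts(router_gates, coverage=0.8):
    """Return the smallest set of experts whose cumulative mean gate mass on the
    task sample reaches `coverage`; only these experts would be fine-tuned.

    router_gates: (n_task_tokens, n_experts) softmax gate values collected by
    running the frozen MoE model over a sample of the downstream task.
    """
    affinity = router_gates.mean(dim=0)                       # mean gate mass per expert
    order = torch.argsort(affinity, descending=True)
    cum = torch.cumsum(affinity[order], dim=0) / affinity.sum()
    n_keep = int(torch.searchsorted(cum, torch.tensor(coverage)).item()) + 1
    return order[:n_keep].tolist()

# Toy usage: 1000 task tokens, 64 fine-grained experts
gates = torch.softmax(torch.randn(1000, 64), dim=-1)
tuned = select_task_experts(gates, coverage=0.8)
print(f"fine-tune {len(tuned)} of 64 experts")
```

The parameters of the selected experts (and typically the router) are then left trainable while all other weights stay frozen, which is what keeps the tuning footprint small.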

5. Compression, Deployment, and Practical Engineering

  • Mixed-precision expert compression: Fine-grained MoE-LLMs can be further compressed by mixed-precision quantization, with bit-widths assigned to each expert via integer programming that Pareto-optimizes importance and quantization error (Huang et al., 13 Oct 2025).
  • Dynamic expert pruning: At inference, Gumbel-softmax routers select token-dependent masks, reducing the number of activated experts per token while maintaining performance within 1–2% (Huang et al., 13 Oct 2025).
  • Serving and offloading: Fine-grained expert offloading systems (e.g. fMoE) analyze token-level expert selection trajectories and semantic input hints to optimize expert prefetching and caching, reducing serving latency by up to 47% versus coarse-grained approaches (Yu et al., 7 Feb 2025).
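A minimal sketch of token-dependent expert masking with a Gumbel-softmax relaxation, in the spirit of the dynamic pruning described above; the keep/drop parameterization, temperature, and straight-through sampling are assumptions rather than the cited method's exact formulation.

```python
import torch
import torch.nn.functional as F

def gumbel_expert_mask(router_logits, tau=0.5):
    """Sample a differentiable binary keep/drop mask per (token, expert) pair."""
    keep_logits = torch.stack([router_logits, -router_logits], dim=-1)   # (T, E, 2)
    sample = F.gumbel_softmax(keep_logits, tau=tau, hard=True, dim=-1)   # straight-through one-hot
    return sample[..., 0]                           # 1.0 where the expert is kept

# Toy usage: mask the gates, then renormalize over the surviving experts
logits = torch.randn(16, 32)
gates = F.softmax(logits, dim=-1)
mask = gumbel_expert_mask(logits)
pruned = gates * mask
pruned = pruned / pruned.sum(dim=-1, keepdim=True).clamp_min(1e-9)
print("avg experts kept per token:", mask.sum(dim=-1).mean().item())
```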

6. Training, Continual Pretraining, and Specialization

  • Continual MoE construction: Models such as LLaMA-MoE partition pretrained dense FFNs into fine-grained experts via random or clustering-based slicing. Routing and gating networks are introduced for token-wise sparse expert activation, followed by 200B token continual pretraining for stabilization and specialization (Zhu et al., 24 Jun 2024).
  • Upcycling strategies: Adjugate expert addition during mid-training or post-training leverages upcycled weights and group-wise specialization to further expand model capacity without significant overhead (Wu et al., 11 Aug 2025).
  • Diversity and initialization: Random expert initialization yields greater downstream task specialization and performance than dense upcycling or clustering-based methods (Lo et al., 26 Jun 2024).
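A simplified sketch of the FFN-partitioning step used to build such models: intermediate neurons of a pretrained dense FFN are randomly assigned to slices, each slice becoming one expert. The plain two-matrix FFN, the random split, and the layer sizes are simplifying assumptions (LLaMA's gated SwiGLU FFN and the clustering-based variant are omitted).

```python
import torch
import torch.nn as nn

def split_ffn_into_experts(up_proj: nn.Linear, down_proj: nn.Linear, n_experts: int, seed: int = 0):
    """Partition a dense FFN (hidden -> d_ff -> hidden) into n_experts slices by
    randomly assigning the d_ff intermediate neurons to experts (random slicing)."""
    d_ff = up_proj.out_features
    perm = torch.randperm(d_ff, generator=torch.Generator().manual_seed(seed))
    experts = []
    for idx in perm.chunk(n_experts):
        up = nn.Linear(up_proj.in_features, len(idx), bias=False)
        down = nn.Linear(len(idx), down_proj.out_features, bias=False)
        with torch.no_grad():
            up.weight.copy_(up_proj.weight[idx, :])       # rows of W_up for this slice
            down.weight.copy_(down_proj.weight[:, idx])   # matching columns of W_down
        experts.append(nn.Sequential(up, nn.GELU(), down))
    return nn.ModuleList(experts)

# Toy usage: a 512 -> 2048 -> 512 dense FFN split into 16 micro-experts
dense_up = nn.Linear(512, 2048, bias=False)
dense_down = nn.Linear(2048, 512, bias=False)
experts = split_ffn_into_experts(dense_up, dense_down, n_experts=16)
print(len(experts), "experts of intermediate width", experts[0][0].out_features)
```

In this simplified non-gated form, summing the outputs of all slices reproduces the dense FFN exactly, since the activation is elementwise and each intermediate neuron lands in exactly one slice; the continual pretraining noted above then adapts the model to sparse Top-$k$ activation and drives expert specialization.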

7. Evaluation, Limitations, and Future Directions

Fine-grained MoE-LLMs combine algorithmic advances in expert modularity, dynamic routing, compression, and specialized training to push the state of the art in scalable and efficient large language modeling. These systems offer principled strategies for balancing specialization, memory, and computational requirements in both research and deployment settings.
