Fine-grained MoE-LLMs Overview

Updated 30 December 2025
  • Fine-grained MoE-LLMs are advanced language models that use modular expert routing and increased expert granularity to enhance computational efficiency and specialization.
  • They employ dynamic token-level routing, norm-based selection, and graph-based load balancing to optimize both pretraining and fine-tuning performance.
  • These models achieve superior scaling and convergence while addressing challenges such as additional router FLOPs and memory bottlenecks during deployment.

Fine-grained Mixture-of-Experts LLMs (Fine-grained MoE-LLMs) are advanced neural architectures that leverage modular, sparse activation of model parameters through sophisticated expert routing at scales ranging from whole FFNs down to individual neurons. By increasing the number of smaller experts, fine-grained MoE-LLMs achieve superior computational efficiency, convergence speed, specialization, and scaling properties across both pretraining and fine-tuning regimes. This paradigm is characterized by hierarchical modularity, dynamic token-level expert selection, and a variety of architectural innovations that collectively enable more effective and efficient large language modeling.

1. Architectural Frameworks for Fine-Grained MoE-LLMs

Fine-grained MoE-LLMs extend classic Mixture-of-Experts designs by increasing expert granularity—using many more experts of smaller size—with flexible routing mechanisms:

  • Standard MoE layer: the FFN in each transformer block is replaced by $N$ parallel expert FFNs plus a router (a minimal sketch follows this list). For input $x \in \mathbb{R}^h$:
    • Router logits: $l(x) = W_g x$, with $W_g \in \mathbb{R}^{N \times h}$
    • Gating: $g(x) = \mathrm{softmax}(l(x)) \in \Delta^N$; a Top-$k$ mask selects $k \ll N$ active experts
    • Output: $y = \sum_{i \in \mathcal{K}(x)} g_i(x)\, E_i(x)$
  • Fine-grained extension: each FFN may be further subdivided into micro-experts (neurons or slices), with routers operating at the neuron or subblock level. Notably, neurons themselves often function as “mini-experts,” exhibiting distinctive activation and specialization profiles (Lo et al., 26 Jun 2024).
  • Dynamic expert activation: advanced schemes, as in Grove MoE, introduce groups of heterogeneous experts with adjugate (“little”) modules for adaptive capacity scaling. The architecture computes both large (big) and small (adjugate) expert activations in parallel, modulating their output contribution according to input complexity (Wu et al., 11 Aug 2025).
  • Graph-based expert collaboration: routers may be implemented via message-passing GNNs over expert-token graphs, enabling direct expert–expert interaction and coordinated activation strategies (Bai et al., 18 Dec 2024).
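The MoE layer described in the first bullet can be sketched as follows. This is a minimal, illustrative PyTorch implementation rather than the code of any cited model; the expert count, expert width, and Top-$k$ value are assumed hyperparameters, and the per-expert loop stands in for the batched dispatch used in real systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoELayer(nn.Module):
    """Illustrative sparse MoE layer: N small expert FFNs plus a linear router W_g."""
    def __init__(self, hidden_dim=512, n_experts=32, expert_dim=128, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, n_experts, bias=False)   # W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, expert_dim),
                          nn.GELU(),
                          nn.Linear(expert_dim, hidden_dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, hidden_dim)
        logits = self.router(x)                  # l(x) = W_g x
        gates = F.softmax(logits, dim=-1)        # g(x) on the simplex
        top_vals, top_idx = gates.topk(self.top_k, dim=-1)
        top_vals = top_vals / top_vals.sum(-1, keepdim=True)   # renormalize over K(x)
        y = torch.zeros_like(x)
        for slot in range(self.top_k):           # y = sum_{i in K(x)} g_i(x) E_i(x)
            for e_id in top_idx[:, slot].unique().tolist():
                sel = top_idx[:, slot] == e_id
                y[sel] += top_vals[sel, slot:slot + 1] * self.experts[e_id](x[sel])
        return y
```

A fine-grained variant simply increases n_experts while shrinking expert_dim, keeping per-token compute roughly constant; the neuron-level and grouped (Grove-style) schemes above replace the uniform expert list with micro-experts or heterogeneous expert groups.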

2. Expert Routing Mechanisms and Load-Balancing

  • Linear and nonlinear routers: Most fine-grained MoE-LLMs employ linear gating networks ($W_g x$), often followed by Top-$k$ selection and softmax normalization. MixLoRA, for instance, uses $g(x) = \mathrm{Softmax}(\mathrm{TopK}(W_g x))$, with load-balance losses to prevent expert collapse (Li et al., 22 Apr 2024); a minimal sketch follows this list.
  • Norm-based routing: Routers empirically favor experts with the largest output norms: the selected expert indices nearly perfectly match those whose $L_2$ output norms are highest, indicating a strong norm preference in the gating mechanism (Lo et al., 26 Jun 2024).
  • Dynamic activation and group biasing: Grove MoE employs dual softmax-sigmoid heads, with group-wise adjugate experts whose activations depend on accumulated sigmoid scores, further balanced via learnable bias updates (Wu et al., 11 Aug 2025).
  • Graph routers and distributional balancing: GMoE’s router leverages graph-convolutions between experts and tokens, coupled with Poisson-based distinction (encouraging specialization) and normal-based load balancing, both enforced via KL-divergence regularization terms (Bai et al., 18 Dec 2024).
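A rough sketch of Top-$k$ gating with an auxiliary load-balance loss follows. The Switch-Transformer-style loss form used here (fraction of tokens per expert times mean gate probability) is one common choice and an assumption; MixLoRA and the other cited models may weight or formulate it differently.

```python
import torch
import torch.nn.functional as F

def topk_route(logits, k):
    """Softmax gating over router logits, keeping the k largest gates per token."""
    gates = F.softmax(logits, dim=-1)                  # (tokens, n_experts)
    top_vals, top_idx = gates.topk(k, dim=-1)
    return gates, top_vals, top_idx

def load_balance_loss(gates, top_idx, n_experts):
    """Auxiliary loss discouraging expert collapse:
    n_experts * sum_e (fraction of tokens whose top-1 expert is e) * (mean gate prob of e)."""
    counts = torch.bincount(top_idx[:, 0], minlength=n_experts).float()
    frac = counts / top_idx.shape[0]                   # routing fraction per expert
    prob = gates.mean(dim=0)                           # mean router probability per expert
    return n_experts * torch.sum(frac * prob)

# Toy usage: 16 tokens routed over 32 fine-grained experts with k = 4
logits = torch.randn(16, 32)
gates, top_vals, top_idx = topk_route(logits, k=4)
print("aux loss:", load_balance_loss(gates, top_idx, n_experts=32).item())
```

The norm-based observation above can be checked empirically by comparing the router's Top-$k$ indices against the experts ranked by the $L_2$ norm of their outputs on the same tokens.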

3. Scaling Laws and Granularity Optimization

  • Granularity ($G$) as a hyperparameter: The efficiency, convergence, and specialization of MoE-LLMs depend strongly on the number and size of experts. Scaling laws established in (Krajewski et al., 12 Feb 2024) show:
    • For compute-optimal configurations: as the FLOPs budget $F$ increases, the optimal expert granularity rises super-linearly ($G^* \propto F^{2.21}$).
    • Standard expert-width heuristics ($d_\text{expert} = 4m$) are suboptimal; at realistic budgets, experts should often be $4$–$16\times$ smaller, and $G$ should be set to 8, 16, 32, or larger.
  • Empirical findings: Fine-grained MoE with high $G$ achieves superior convergence, downstream accuracy, and efficiency compared to dense or heuristic-coarse MoE baselines. Gains are magnified at larger scales and longer training durations (Krajewski et al., 3 Jun 2025).
| FLOPs budget ($F$) | Optimal granularity ($G^*$) | Optimal expert size ($d^*$) |
|---|---|---|
| $2 \times 10^{20}$ | 16 | $4m/16 = 0.25m$ |
| $5 \times 10^{24}$ | 64 | $4m/64 = 0.0625m$ |
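The expert-size column follows directly from the $d^* = 4m/G$ relation. A tiny numeric sketch (the hidden size m below is an arbitrary placeholder, and the scaling-law constant relating $F$ to $G^*$ is not reproduced here):

```python
def expert_width(m: int, granularity: int) -> float:
    """Per-expert hidden width when a dense FFN of width 4*m is split into G experts."""
    return 4 * m / granularity

m = 4096  # assumed model hidden size, for illustration only
for G in (8, 16, 32, 64):
    print(f"G = {G:3d}  ->  d* = {expert_width(m, G):7.1f}  ({4 / G:.4f} * m)")
```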

4. Layer-Wise Expert Specialization and Allocation

  • Neuron-level and micro-expert behavior: Cosine similarity among expert neurons reveals dense clustering, with many “tiny experts” replicated across experts. In Mixtral, neuron similarity remains high across experts; in other models it is lower, indicating more specialized neurons (Lo et al., 26 Jun 2024).
  • Layer-dependent diversity: Expert diversity (measured by cosine similarity and routing entropy) increases along the depth of the network, peaking in later layers. The final MoE layer is an outlier and often reverts to more homogeneous specialization (Lo et al., 26 Jun 2024).
  • Layer-wise expert allocation: Algorithms such as LayerMoE allocate experts per layer in inverse proportion to cross-lingual representation similarity. Shallow and deep layers (low similarity) get more new experts, while middle layers require fewer (Zhang et al., 28 May 2025).
  • Fine-grained tuning: Expert-Specialized Fine-Tuning (ESFT) leverages high expert granularity to tune only a small fraction of relevant experts per downstream task, greatly enhancing PEFT efficiency and reducing negative transfer (Wang et al., 2 Jul 2024).
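A rough sketch of the expert-selection step behind ESFT-style tuning: run the frozen model over a sample of task data, rank experts by how much routing mass they receive, and mark only the top few as trainable. The mean-gate affinity metric and the coverage threshold below are illustrative assumptions; the cited work evaluates several selection criteria.

```python
import torch

def select_task_experts(router_gates, coverage=0.8):
    """Return the smallest set of experts whose cumulative mean gate mass on the
    task sample reaches `coverage`; only these experts would be fine-tuned.

    router_gates: (n_task_tokens, n_experts) softmax gate values collected by
    running the frozen MoE model over a sample of the downstream task.
    """
    affinity = router_gates.mean(dim=0)                       # mean gate mass per expert
    order = torch.argsort(affinity, descending=True)
    cum = torch.cumsum(affinity[order], dim=0) / affinity.sum()
    n_keep = int(torch.searchsorted(cum, torch.tensor(coverage)).item()) + 1
    return order[:n_keep].tolist()

# Toy usage: 1000 task tokens, 64 fine-grained experts
gates = torch.softmax(torch.randn(1000, 64), dim=-1)
tuned = select_task_experts(gates, coverage=0.8)
print(f"fine-tune {len(tuned)} of 64 experts")
```

The parameters of the selected experts (and typically the router) are then left trainable while all other weights stay frozen, which is what keeps the tuning footprint small.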

5. Compression, Deployment, and Practical Engineering

  • Mixed-precision expert compression: Fine-grained MoE-LLMs can be further compressed by mixed-precision quantization, with bit-widths assigned to each expert via integer programming that Pareto-optimizes importance and quantization error (Huang et al., 13 Oct 2025).
  • Dynamic expert pruning: At inference, Gumbel-softmax routers select token-dependent masks, reducing the number of activated experts per token while maintaining performance within 1–2% (Huang et al., 13 Oct 2025).
  • Serving and offloading: Fine-grained expert offloading systems (e.g. fMoE) analyze token-level expert selection trajectories and semantic input hints to optimize expert prefetching and caching, reducing serving latency by up to 47% versus coarse-grained approaches (Yu et al., 7 Feb 2025).
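A minimal sketch of token-dependent expert masking with a Gumbel-softmax relaxation, in the spirit of the dynamic pruning described above; the keep/drop parameterization, temperature, and straight-through sampling are assumptions rather than the cited method's exact formulation.

```python
import torch
import torch.nn.functional as F

def gumbel_expert_mask(router_logits, tau=0.5):
    """Sample a differentiable binary keep/drop mask per (token, expert) pair."""
    keep_logits = torch.stack([router_logits, -router_logits], dim=-1)   # (T, E, 2)
    sample = F.gumbel_softmax(keep_logits, tau=tau, hard=True, dim=-1)   # straight-through one-hot
    return sample[..., 0]                           # 1.0 where the expert is kept

# Toy usage: mask the gates, then renormalize over the surviving experts
logits = torch.randn(16, 32)
gates = F.softmax(logits, dim=-1)
mask = gumbel_expert_mask(logits)
pruned = gates * mask
pruned = pruned / pruned.sum(dim=-1, keepdim=True).clamp_min(1e-9)
print("avg experts kept per token:", mask.sum(dim=-1).mean().item())
```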

6. Training, Continual Pretraining, and Specialization

  • Continual MoE construction: Models such as LLaMA-MoE partition pretrained dense FFNs into fine-grained experts via random or clustering-based slicing. Routing and gating networks are introduced for token-wise sparse expert activation, followed by 200B token continual pretraining for stabilization and specialization (Zhu et al., 24 Jun 2024).
  • Upcycling strategies: Adjugate expert addition during mid-training or post-training leverages upcycled weights and group-wise specialization to further expand model capacity without significant overhead (Wu et al., 11 Aug 2025).
  • Diversity and initialization: Random expert initialization yields greater downstream task specialization and performance than dense upcycling or clustering-based methods (Lo et al., 26 Jun 2024).
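A simplified sketch of the FFN-partitioning step used to build such models: intermediate neurons of a pretrained dense FFN are randomly assigned to slices, each slice becoming one expert. The plain two-matrix FFN, the random split, and the layer sizes are simplifying assumptions (LLaMA's gated SwiGLU FFN and the clustering-based variant are omitted).

```python
import torch
import torch.nn as nn

def split_ffn_into_experts(up_proj: nn.Linear, down_proj: nn.Linear, n_experts: int, seed: int = 0):
    """Partition a dense FFN (hidden -> d_ff -> hidden) into n_experts slices by
    randomly assigning the d_ff intermediate neurons to experts (random slicing)."""
    d_ff = up_proj.out_features
    perm = torch.randperm(d_ff, generator=torch.Generator().manual_seed(seed))
    experts = []
    for idx in perm.chunk(n_experts):
        up = nn.Linear(up_proj.in_features, len(idx), bias=False)
        down = nn.Linear(len(idx), down_proj.out_features, bias=False)
        with torch.no_grad():
            up.weight.copy_(up_proj.weight[idx, :])       # rows of W_up for this slice
            down.weight.copy_(down_proj.weight[:, idx])   # matching columns of W_down
        experts.append(nn.Sequential(up, nn.GELU(), down))
    return nn.ModuleList(experts)

# Toy usage: a 512 -> 2048 -> 512 dense FFN split into 16 micro-experts
dense_up = nn.Linear(512, 2048, bias=False)
dense_down = nn.Linear(2048, 512, bias=False)
experts = split_ffn_into_experts(dense_up, dense_down, n_experts=16)
print(len(experts), "experts of intermediate width", experts[0][0].out_features)
```

In this simplified non-gated form, summing the outputs of all slices reproduces the dense FFN exactly, since the activation is elementwise and each intermediate neuron lands in exactly one slice; the continual pretraining noted above then adapts the model to sparse Top-$k$ activation and drives expert specialization.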

7. Evaluation, Limitations, and Future Directions

Fine-grained MoE-LLMs combine algorithmic advances in expert modularity, dynamic routing, compression, and specialized training to push the state of the art in scalable and efficient large language modeling. These systems offer principled strategies for balancing specialization, memory, and computational requirements in both research and deployment settings.
