Fine-Grained MoE LLMs

Updated 1 March 2026
  • Fine-grained MoE LLMs are large language models that replace dense feed-forward networks with many small experts, enabling dynamic token routing.
  • They achieve high sample efficiency and lower perplexity by optimizing expert count, size, and granularity while keeping per-token FLOPs constant.
  • Innovative training methods and hardware co-design, including load-balancing losses and advanced routers, enhance scalability and inference efficiency.

Fine-grained Mixture-of-Experts LLMs (Fine-grained MoE-LLMs) are an advanced subclass of sparse neural architectures for scaling LLMs wherein the feed-forward sublayers are decomposed into numerous, relatively small (“fine-grained”) experts. Each token is routed, per layer, to a dynamically selected small subset of these experts, resulting in high model capacity at limited computational and memory cost. Fine-grained MoE-LLMs are distinguished from classical (“coarse-grained”) MoEs by their increased expert count, reduced per-expert parameter budget, and enhanced routing flexibility, enabling better convergence, task specificity, and hardware utilization at scale (Krajewski et al., 2024, Krajewski et al., 3 Jun 2025).

1. Architectural Principles of Fine-Grained MoE-LLMs

Fine-grained MoE-LLMs are built by replacing dense feed-forward blocks (FFNs) with sparse mixtures comprising many small experts. In a canonical transformer block, the dense FFN with hidden size $d_{\text{ff}}$ is replaced by $N=\mathcal{O}(10\text{–}100)$ experts, each of (potentially) reduced size $d_{\text{expert}}=d_{\text{ff}}/G$, where $G$ is the granularity parameter (Krajewski et al., 2024). The two principal design dimensions are:

  • Expert Count and Size: Fine granularity increases the expert count and decreases expert size. For granularity $G$, the system employs $G \times N_E$ experts of size $d_{\text{ff}}/G$, where $N_E$ is the expert count at $G=1$ (Krajewski et al., 3 Jun 2025).
  • Routing Sparsity: Each token is routed to $k=G\times k_0$ experts (with $k_0$ typically 1 or 2), preserving the per-token FLOPs budget while leveraging more diverse parameter subsets.
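The bookkeeping in the two bullets above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class and function names are hypothetical, and the example sizes (`d_ff=4096`, `n_e=8`, `k0=2`) are assumed values, not figures from the cited papers. It shows that raising $G$ multiplies the expert count and the number of routed experts while leaving both the active and the total hidden width unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerShape:
    num_experts: int    # total experts in the layer
    expert_hidden: int  # hidden width of each expert
    top_k: int          # experts activated per token


def fine_grained_shape(d_ff: int, n_e: int, k0: int, g: int) -> MoELayerShape:
    """Apply granularity g to a coarse MoE layer with n_e experts of width d_ff."""
    if d_ff % g != 0:
        raise ValueError("d_ff must be divisible by the granularity g")
    return MoELayerShape(num_experts=g * n_e, expert_hidden=d_ff // g, top_k=g * k0)


def active_hidden_units(shape: MoELayerShape) -> int:
    """Hidden units touched per token; proportional to per-token expert FLOPs."""
    return shape.top_k * shape.expert_hidden


def total_hidden_units(shape: MoELayerShape) -> int:
    """Hidden units stored in the layer; proportional to expert parameter count."""
    return shape.num_experts * shape.expert_hidden


# Assumed example sizes, chosen only for illustration.
coarse = fine_grained_shape(d_ff=4096, n_e=8, k0=2, g=1)
fine = fine_grained_shape(d_ff=4096, n_e=8, k0=2, g=8)
print(coarse, active_hidden_units(coarse))
print(fine, active_hidden_units(fine))
```

Router cost is not counted here; it grows with the expert count and is the reason the per-token budget is only approximately constant at high granularity.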

Routing Mechanisms: Most fine-grained MoEs utilize a top-$k$ gating mechanism: for each input, a router computes logits via $z(x)=W_r x + b_r$ and dispatches the token to the $k$ highest-scoring experts. Recent work also explores GNN-based routers (GMoE), which use graph message-passing to incorporate inter-expert collaboration and mitigate load imbalance (Bai et al., 2024).
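A minimal sketch of the linear top-$k$ router described above, in plain Python. The function names are hypothetical. Renormalizing a softmax over only the selected logits to obtain gate weights is an assumption made here for illustration; the text does not say how gate weights are computed, and some implementations apply the softmax before selecting.

```python
import math


def router_logits(x, w_r, b_r):
    """z(x) = W_r x + b_r, with w_r given as one weight row per expert."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_r, b_r)]


def top_k_route(logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest logits."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax restricted to the chosen experts.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Toy example: 4 experts, 2-dimensional token representation.
x = [1.0, -1.0]
w_r = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b_r = [0.0, 0.0, 0.5, 0.0]
routes = top_k_route(router_logits(x, w_r, b_r), k=2)
print(routes)
```

The token's output would then be the gate-weighted sum of the selected experts' outputs.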

Auxiliary Losses: To address load imbalance and mitigate routing collapse, auxiliary objectives such as load-balance loss, Z-loss, or specialized graph-based KL-divergence terms are added to the traditional cross-entropy loss (Li et al., 2024, Bai et al., 2024).
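The text names these auxiliary terms without giving formulas. As a hedged illustration, the sketch below uses two widely used formulations that are assumed here rather than taken from the cited works: a Switch-Transformer-style load-balance term, $N \sum_i f_i P_i$ (with $f_i$ the fraction of routing slots sent to expert $i$ and $P_i$ its mean router probability), and a router Z-loss equal to the mean squared log-sum-exp of the logits. In practice each term would be scaled by a small coefficient and added to the cross-entropy loss.

```python
import math


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(batch_logits, k):
    """N * sum_i f_i * P_i over a batch of per-token router logits."""
    n = len(batch_logits[0])
    t = len(batch_logits)
    counts = [0] * n
    prob_sums = [0.0] * n
    for z in batch_logits:
        for i in sorted(range(n), key=lambda j: z[j], reverse=True)[:k]:
            counts[i] += 1
        for i, p in enumerate(softmax(z)):
            prob_sums[i] += p
    f = [c / (t * k) for c in counts]   # fraction of routing slots per expert
    p_mean = [s / t for s in prob_sums]  # mean router probability per expert
    return n * sum(fi * pi for fi, pi in zip(f, p_mean))


def z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits."""
    total = 0.0
    for z in batch_logits:
        peak = max(z)
        lse = peak + math.log(sum(math.exp(v - peak) for v in z))
        total += lse ** 2
    return total / len(batch_logits)


# Balanced batch: each token prefers a different expert.
balanced = [[2.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
# Collapsed batch: every token prefers expert 0.
collapsed = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]
print(load_balance_loss(balanced, k=1), load_balance_loss(collapsed, k=1))
```

The balanced batch attains the minimum value of 1, while the collapsed batch scores higher, which is the signal the optimizer uses to spread tokens across experts.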

2. Scaling Laws and Efficiency Gains

Empirical and theoretical analyses reveal scaling laws unique to fine-grained MoE-LLMs. The core finding is that proper selection of granularity $G$ leads to better sample efficiency and lower perplexity at fixed compute budgets, especially as model and data scale increase (Krajewski et al., 2024):

$$L(N, D, G) = c + \frac{a + gG^{-\gamma}}{N^\alpha} + \frac{b}{D^\beta}$$

  • $N$: non-embedding parameter count
  • $D$: training token count
  • $G$: granularity, $d_{\text{ff}}/d_{\text{expert}}$ (the number of active experts per token when $k_0 = 1$)
  • $c, a, b, g, \gamma, \alpha, \beta$: empirically fitted constants
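The fitted form above can be evaluated directly to see how the granularity term behaves. In the sketch below, every constant in `HYPOTHETICAL_FIT` is a made-up placeholder chosen only so the code runs; none is a fitted value from Krajewski et al. The point is qualitative: the $gG^{-\gamma}$ term shrinks as $G$ grows, so the predicted loss falls with diminishing returns.

```python
# Placeholder constants for illustration only; NOT fitted values from any paper.
HYPOTHETICAL_FIT = dict(c=0.5, a=20.0, b=30.0, g_coef=2.0,
                        gamma=0.6, alpha=0.12, beta=0.15)


def moe_loss(n, d, g, *, c, a, b, g_coef, gamma, alpha, beta):
    """L(N, D, G) = c + (a + g_coef * G^-gamma) / N^alpha + b / D^beta."""
    return c + (a + g_coef * g ** (-gamma)) / n ** alpha + b / d ** beta


n_params, n_tokens = 1e9, 2e10  # assumed model and data sizes
losses = {g: moe_loss(n_params, n_tokens, g, **HYPOTHETICAL_FIT)
          for g in (1, 2, 4, 8, 16)}
for g, loss in losses.items():
    print(f"G={g:>2}  predicted loss={loss:.4f}")
```

The formula itself carries no routing cost, so it cannot by itself identify an optimal $G$; that requires accounting for router FLOPs at a given compute budget.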

Increasing $G$ (while keeping the per-token active parameter count constant) reduces the effective representation error, enhancing the convergence rate and quality. At high $G$, the FLOPs cost of routing becomes non-negligible, but the overall efficiency improves: fine-grained MoE models can achieve the same perplexity as dense models using $5$–$40\times$ fewer FLOPs depending on problem scale (Krajewski et al., 2024).

Optimal Granularity: For training budgets in the regime $F \sim 10^{20}$–$10^{25}$ FLOPs, optimal $G$ rises with compute, reaching $G \sim 16$–$64$ for the largest scales (Krajewski et al., 2024). The common practice of setting expert size to $d_{\text{ff}}$ ($G=1$) is strictly sub-optimal at almost any budget.

3. Fine-Grained MoE Construction and Training

Expert Construction: Fine-grained MoE-LLMs can be built either from scratch or by transforming dense LLMs via feed-forward factorization (Zhu et al., 2024, Zhao et al., 2024). In the latter, each FFN is partitioned into $N$ neuron subsets, often via random balanced splits or clustering. Additional gating layers are injected post-partition to dynamically select experts per token.
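The random balanced split can be sketched as follows. This is a simplified illustration under stated assumptions: a bias-free two-matrix FFN with ReLU activation, plain Python lists instead of tensors, and hypothetical function names. It is not the procedure of any specific cited paper. Because the activation is elementwise, summing the outputs of all experts reproduces the dense FFN exactly; sparsity enters only once a router selects a subset.

```python
import random


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dense_ffn(w_in, w_out, x):
    """y = W_out relu(W_in x); w_in is (d_ff x d_model), w_out is (d_model x d_ff)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def split_into_experts(w_in, w_out, num_experts, seed=0):
    """Randomly partition hidden neurons into equally sized experts."""
    d_ff = len(w_in)
    if d_ff % num_experts != 0:
        raise ValueError("d_ff must be divisible by num_experts")
    idx = list(range(d_ff))
    random.Random(seed).shuffle(idx)
    size = d_ff // num_experts
    experts = []
    for e in range(num_experts):
        group = idx[e * size:(e + 1) * size]
        expert_in = [w_in[j] for j in group]                       # rows of W_in
        expert_out = [[row[j] for j in group] for row in w_out]    # columns of W_out
        experts.append((expert_in, expert_out))
    return experts


# Toy weights with assumed sizes.
rng = random.Random(1)
d_model, d_ff = 4, 16
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_out = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

experts = split_into_experts(w_in, w_out, num_experts=8)
dense_out = dense_ffn(w_in, w_out, x)
# Activating every expert and summing recovers the dense output.
moe_out = [sum(vals) for vals in zip(*(dense_ffn(wi, wo, x) for wi, wo in experts))]
print(max(abs(a - b) for a, b in zip(dense_out, moe_out)))
```

A clustering-based split would replace the shuffle with a grouping of neurons by similarity; the slicing of weight rows and columns stays the same.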

Training Protocols:

Dynamic Adaptation: Some models employ dynamic activation mechanisms, e.g., adjugate experts in Grove MoE, which group experts into heterogeneous “big/LITTLE” clusters; shared adjugate experts are activated only as needed, further tailoring the capacity per token (Wu et al., 11 Aug 2025).

4. Empirical Performance, Specialization, and Trade-offs

Fine-grained MoE-LLMs consistently demonstrate superior task performance and convergence relative to dense or coarse MoE baselines at fixed compute (Krajewski et al., 3 Jun 2025, Zhu et al., 2024). Key empirical results include:

| Model/Setting | Active Params | Validation Loss (11B) | Avg. Downstream Acc. | Memory/Speed Trade-off |
|---|---|---|---|---|
| Dense | 2.7B | 2.23 | 48.4% | reference |
| 1×FLOPs-G8 | 2.7B | 2.18 | 50.6% | 1× FLOPs; similar throughput |
| 2×FLOPs-G8 | 3.9B | 2.17 | 51.5% | 1.5× memory; better accuracy |

Fine-grained MoEs converge in fewer steps and achieve higher sample efficiency. Benefits increase at larger model and data scales.

Task Specialization and Fine-Tuning: Expert Specialization Fine-Tuning (ESFT) exploits routing concentration in fine-grained MoEs: only the most heavily used experts per task are unlocked for tuning, resulting in parameter savings without quality loss (Wang et al., 2024). Finer granularity enables more sharply distinct sets of experts to be tuned for different tasks, improving both adaptation and generalization.
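One way to pick the "most heavily used" experts for a task is to rank experts by how often the router selects them on task data and keep the smallest set that covers a target share of routing decisions. The sketch below illustrates that idea only; the function name, the `coverage` threshold, and the frequency-based criterion are assumptions, not the exact selection rule of Wang et al.

```python
from collections import Counter


def select_experts_to_tune(routed_experts, coverage=0.8):
    """Pick the most-used experts until they cover `coverage` of routing decisions.

    routed_experts: iterable of expert ids, one per (token, slot) routing
    decision recorded while running the frozen model on task data.
    """
    counts = Counter(routed_experts)
    total = sum(counts.values())
    chosen, covered = [], 0
    for expert, count in counts.most_common():
        chosen.append(expert)
        covered += count
        if covered / total >= coverage:
            break
    return chosen


# Toy routing trace for one layer: expert ids 3 and 7 dominate.
trace = [3] * 50 + [7] * 30 + [1] * 15 + [4] * 5
tunable = select_experts_to_tune(trace, coverage=0.8)
print(tunable)
```

Only the parameters of the returned experts would be unfrozen for fine-tuning; all other experts and shared modules stay fixed.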

5. Inference, Compression, and Hardware Co-Design

Efficient Inference: Serving fine-grained MoEs presents challenges due to the sparsity and dynamism of expert activation. Systems like fMoE implement fine-grained expert prefetching/offloading via trajectory and semantic signal matching, cutting inference latency nearly in half and improving memory hit rates (Yu et al., 7 Feb 2025).

Model Compression: MC# achieves aggressive model size reduction and activation sparsity by combining pre-loading mixed-precision quantization per expert with token-level online expert pruning (via Gumbel-softmax sampling). This reduces parameter storage 6.2× and cuts expert activations >20%, with <2% accuracy drop on LM and VL benchmarks (Huang et al., 13 Oct 2025).

Hardware Acceleration: The A3D-MoE co-design demonstrates that conventional 2D accelerators are suboptimal for fine-grained MoE workloads due to variable GEMV/GEMM ratios and DRAM bottlenecks. Vertical integration (compute die + HBM + DRAM), V-Cache, dynamic 3D systolic arrays, and resource-aware fusion schedulers together achieve 1.8–2× latency and 2–4× energy reductions with minimal accuracy impact (Huang et al., 25 Jul 2025).

6. Routing Mechanisms and Expert Collaboration

Standard fine-grained MoEs employ a linear router to score and select experts, but recent proposals incorporate graph-based or dynamic routers for enhanced collaboration and load balance. GMoE introduces a graph convolutional network (GCN) router, which integrates token-expert and expert-expert signals, combined with Poisson and Normal distribution-based loss terms to encourage specialization and equitable expert utilization (Bai et al., 2024).

Task-Dependent Specialization: Empirical analysis shows lower cross-task overlap and entropy of expert activations in fine-grained MoEs, supporting their use in multi-task and transfer scenarios (Wang et al., 2024). Fine granularity allows the model to activate disjoint or near-disjoint expert sets for semantically distinct tasks.
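The two quantities mentioned above can be measured from routing traces. The sketch below is one possible operationalization, assumed for illustration: Shannon entropy (in bits) of the expert-usage distribution for a task, and Jaccard overlap between the sets of experts two tasks activate. The cited analysis may define these metrics differently.

```python
import math
from collections import Counter


def activation_entropy(routed_experts):
    """Shannon entropy (bits) of the expert-usage distribution."""
    counts = Counter(routed_experts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def jaccard_overlap(experts_a, experts_b):
    """Share of experts used by both tasks among experts used by either."""
    a, b = set(experts_a), set(experts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy routing traces (expert ids) for two hypothetical tasks.
task_code = [0, 0, 1, 1]
task_math = [1, 2, 2, 3]
print(activation_entropy(task_code), jaccard_overlap(task_code, task_math))
```

Lower entropy means routing is concentrated on fewer experts, and lower overlap means the two tasks rely on more nearly disjoint expert sets.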

7. Limitations, Open Challenges, and Practical Guidelines

Limitations:

  • Router Complexity and Initialization: More experts increase router/dispatch overhead and complicate balancing expert utilization, especially during early training (Krajewski et al., 3 Jun 2025, Li et al., 2024).
  • Inference Memory and Latency: Serving large fine-grained MoEs requires advanced caching, offloading, and scheduling strategies to avoid inference stalls (Yu et al., 7 Feb 2025, Huang et al., 13 Oct 2025).
  • Hardware Fragmentation: Variable expert activation stresses classical hardware designs, motivating specialized accelerators or co-designs (Huang et al., 25 Jul 2025).

Best Practices:

Future advances are expected in adaptive expert shaping, cross-layer expert sharing, unified attention-expert MoE layers, and tighter software/hardware feedback loops that improve scalability and energy efficiency.


References:

(Krajewski et al., 2024, Krajewski et al., 3 Jun 2025, Li et al., 2024, Bai et al., 2024, Wu et al., 11 Aug 2025, Zhu et al., 2024, Yu et al., 7 Feb 2025, Zhao et al., 2024, Huang et al., 13 Oct 2025, Huang et al., 25 Jul 2025, Wang et al., 2024)
