Fine-Grained MoE LLMs
- Fine-grained MoE LLMs are large language models that replace dense feed-forward networks with many small experts, enabling dynamic token routing.
- They achieve high sample efficiency and lower perplexity by optimizing expert count, size, and granularity while keeping per-token FLOPs constant.
- Innovative training methods and hardware co-design, including load-balancing losses and advanced routers, enhance scalability and inference efficiency.
Fine-grained Mixture-of-Experts LLMs (Fine-grained MoE-LLMs) are an advanced subclass of sparse neural architectures for scaling LLMs wherein the feed-forward sublayers are decomposed into numerous, relatively small (“fine-grained”) experts. Each token is routed, per layer, to a dynamically selected small subset of these experts, resulting in high model capacity at limited computational and memory cost. Fine-grained MoE-LLMs are distinguished from classical (“coarse-grained”) MoEs by their increased expert count, reduced per-expert parameter budget, and enhanced routing flexibility, enabling better convergence, task specificity, and hardware utilization at scale (Krajewski et al., 2024, Krajewski et al., 3 Jun 2025).
1. Architectural Principles of Fine-Grained MoE-LLMs
Fine-grained MoE-LLMs are built by replacing dense feed-forward blocks (FFNs) with sparse mixtures comprising many small experts. In a canonical transformer block, the dense FFN with hidden size $d_{\text{ff}}$ is replaced by a set of experts, each of (potentially) reduced size $d_{\text{expert}} = d_{\text{ff}}/G$, where $G$ is the granularity parameter (Krajewski et al., 2024). The two principal design dimensions are:
- Expert Count and Size: Fine granularity increases the expert count and decreases expert size. For granularity $G$, the system employs $G$ times as many experts, each of size $d_{\text{ff}}/G$ (Krajewski et al., 3 Jun 2025).
- Routing Sparsity: Each token is routed to $k$ experts (with $k$ standardly 1 or 2 in coarse-grained MoEs, and scaled up in proportion to $G$ in fine-grained ones), preserving the per-token FLOPs budget while leveraging more diverse parameter subsets.
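The accounting behind these two dimensions can be checked with a little arithmetic. The sketch below is illustrative only: the function name `moe_layer_shape`, the bias-free two-matrix expert, and the example sizes are assumptions, not values taken from the cited papers.

```python
def moe_layer_shape(d_model, d_ff, base_experts, base_top_k, granularity):
    """Return expert count, expert width, top-k and parameter counts for one MoE layer.

    Assumes each expert is a bias-free two-matrix FFN (up- and down-projection).
    """
    if d_ff % granularity != 0:
        raise ValueError("d_ff must be divisible by granularity")
    d_expert = d_ff // granularity              # each expert shrinks by a factor G
    n_experts = base_experts * granularity      # expert count grows by a factor G
    top_k = base_top_k * granularity            # route to G times more experts
    params_per_expert = 2 * d_model * d_expert
    return {
        "n_experts": n_experts,
        "d_expert": d_expert,
        "top_k": top_k,
        "total_params": n_experts * params_per_expert,
        "active_params": top_k * params_per_expert,
    }


# Hypothetical layer sizes, chosen only to make the arithmetic easy to follow.
coarse = moe_layer_shape(d_model=1024, d_ff=4096, base_experts=8, base_top_k=1, granularity=1)
fine = moe_layer_shape(d_model=1024, d_ff=4096, base_experts=8, base_top_k=1, granularity=8)
print(coarse)
print(fine)
```

Under these assumptions the fine-grained layer has eight times as many experts, each one eighth as wide, while total and active expert parameters are unchanged. Router parameters are not counted here; they grow with the expert count.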
Routing Mechanisms: Most fine-grained MoEs utilize a top-$k$ gating mechanism: for each input $x$, a router computes expert logits via a linear projection $W_r x$ and dispatches the token to the $k$ highest-scoring experts. Recent work also explores GNN-based routers (GMoE), which use graph message-passing to incorporate inter-expert collaboration and mitigate load imbalance (Bai et al., 2024).
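A minimal sketch of linear top-$k$ routing for a single token follows. The function name `route_top_k`, the list-of-rows weight layout, and the choice to normalize gate weights with a softmax over the selected logits are assumptions made for illustration; real routers operate on batched tensors and may normalize differently.

```python
import math


def route_top_k(x, router_weights, k):
    """Score experts with a linear router and return (expert_index, gate_weight) pairs.

    router_weights[e] is the weight row for expert e, so logit_e = router_weights[e] . x
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax over the selected logits only, shifted by the max for numerical stability.
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]


# Toy example: 4 experts, 2-dimensional token.
token = [1.0, 0.0]
w_router = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
assignment = route_top_k(token, w_router, k=2)
print(assignment)
```

The token is dispatched to experts 1 and 2, and the two gate weights sum to one. The layer output would then be the gate-weighted sum of the selected experts' outputs.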
Auxiliary Losses: To address load imbalance and mitigate routing collapse, auxiliary objectives such as load-balance loss, Z-loss, or specialized graph-based KL-divergence terms are added to the traditional cross-entropy loss (Li et al., 2024, Bai et al., 2024).
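To make the two most common auxiliary terms concrete, here is a sketch of a Switch Transformer-style load-balance loss and an ST-MoE-style router Z-loss. These are widely used formulations and may differ from the exact variants in the cited works; the function names, the top-1 dispatch used to count tokens, and the toy logits are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def load_balance_loss(batch_logits):
    """n_experts * sum_i f_i * P_i, where f_i is the fraction of tokens whose top-1
    expert is i and P_i is the mean router probability assigned to expert i."""
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    probs = [softmax(row) for row in batch_logits]
    counts = [0] * n_experts
    for p in probs:
        counts[max(range(n_experts), key=lambda e: p[e])] += 1
    f = [c / n_tokens for c in counts]
    mean_p = [sum(p[e] for p in probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, mean_p))


def z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits; penalizes large logits."""
    total = 0.0
    for row in batch_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(batch_logits)


balanced = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
            [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
collapsed = [[2.0, 0.0, 0.0, 0.0]] * 4
print(load_balance_loss(balanced), load_balance_loss(collapsed))
print(z_loss(balanced))
```

With tokens spread evenly the load-balance term sits at its uniform value of 1, and it grows when every token prefers the same expert. In training, both terms would be scaled by small coefficients and added to the cross-entropy loss.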
2. Scaling Laws and Efficiency Gains
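Empirical and theoretical analyses reveal scaling laws unique to fine-grained MoE-LLMs. The core finding is that proper selection of granularity leads to better sample efficiency and lower perplexity at fixed compute budgets, especially as model and data scale increase (Krajewski et al., 2024):
$$\mathcal{L}(N, D, G) = c + \left(\frac{g}{G^{\gamma}} + a\right)\frac{1}{N^{\alpha}} + \frac{b}{D^{\beta}}$$
- $N$: non-embedding parameter count
- $D$: training token count
- $G$: granularity (the factor by which each expert is shrunk, matched by the number of experts activated per token)
- $a, b, c, g, \alpha, \beta, \gamma$: empirically fitted constants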
Increasing $G$ (while keeping the per-token active parameter count constant) reduces the effective representation error, enhancing the convergence rate and quality. At high $G$, the FLOPs cost of routing becomes non-negligible, but the overall efficiency still improves: fine-grained MoE models can achieve the same perplexity as dense models using at least $5\times$ fewer FLOPs, with the savings depending on problem scale (Krajewski et al., 2024).
Optimal Granularity: Optimal $G$ rises with the training compute budget, reaching 16–64 for the largest scales (Krajewski et al., 2024). The common practice of setting expert size equal to the dense FFN hidden size $d_{\text{ff}}$ ($G = 1$) is sub-optimal at almost any budget.
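How granularity enters such a law can be explored numerically. The sketch below assumes a joint law of the form $\mathcal{L}(N, D, G) = c + (g/G^{\gamma} + a)/N^{\alpha} + b/D^{\beta}$; the constants in `ILLUSTRATIVE` are placeholders chosen for readability and are not fitted values from any paper.

```python
def moe_loss(n_params, n_tokens, granularity, consts):
    """Evaluate L(N, D, G) = c + (g / G**gamma + a) / N**alpha + b / D**beta."""
    k = consts
    capacity_term = (k["g"] / granularity ** k["gamma"] + k["a"]) / n_params ** k["alpha"]
    data_term = k["b"] / n_tokens ** k["beta"]
    return k["c"] + capacity_term + data_term


# Placeholder constants for illustration only; they are NOT fitted values.
ILLUSTRATIVE = {"a": 18.0, "b": 30.0, "c": 0.5, "g": 2.0,
                "alpha": 0.12, "beta": 0.10, "gamma": 0.6}

granularities = [1, 2, 4, 8, 16, 32, 64]
losses = [moe_loss(1e9, 2e10, G, ILLUSTRATIVE) for G in granularities]
for G, loss in zip(granularities, losses):
    print(f"G={G:>2}  predicted loss={loss:.4f}")
```

Taken alone, this form predicts that loss falls monotonically as $G$ grows at fixed $N$ and $D$. A finite optimal granularity appears only once the routing cost, which grows with $G$, is charged against a fixed FLOPs budget.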
3. Fine-Grained MoE Construction and Training
Expert Construction: Fine-grained MoE-LLMs can be built either from scratch or by transforming dense LLMs via feed-forward factorization (Zhu et al., 2024, Zhao et al., 2024). In the latter, each FFN is partitioned into neuron subsets, often via random balanced splits or clustering. Additional gating layers are injected post-partition to dynamically select experts per token.
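The neuron-partitioning step can be sketched as follows. This toy version assumes a bias-free ReLU FFN and a random balanced split; the helper names `ffn_subset` and `random_balanced_split` are hypothetical, and the cited methods may partition by clustering or use gated activations instead.

```python
import random


def ffn_subset(x, w_in, w_out, neurons):
    """Apply a bias-free ReLU FFN restricted to the given hidden neurons.

    w_in[j] is the input weight row of neuron j; w_out[j] is its output weight row.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(w_in[j], x))) for j in neurons]
    d_model = len(w_out[0])
    return [sum(h * w_out[j][o] for h, j in zip(hidden, neurons)) for o in range(d_model)]


def random_balanced_split(d_ff, n_experts, seed=0):
    """Shuffle hidden-neuron indices and cut them into equally sized experts."""
    if d_ff % n_experts != 0:
        raise ValueError("d_ff must be divisible by n_experts")
    idx = list(range(d_ff))
    random.Random(seed).shuffle(idx)
    size = d_ff // n_experts
    return [idx[e * size:(e + 1) * size] for e in range(n_experts)]


rng = random.Random(1)
d_model, d_ff, n_experts = 4, 8, 4
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_out = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

experts = random_balanced_split(d_ff, n_experts)
dense_out = ffn_subset(x, w_in, w_out, list(range(d_ff)))
expert_outs = [ffn_subset(x, w_in, w_out, e) for e in experts]
summed = [sum(col) for col in zip(*expert_outs)]
print(dense_out)
print(summed)
```

Because the activation is applied per neuron, summing all experts reproduces the dense output up to floating-point rounding. Sparsity, and hence approximation error that continued training must recover, enters only when the router selects a subset of experts.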
Training Protocols:
- Pretraining: Fine-grained MoEs are trained from dense checkpoints or randomly initialized, typically requiring hundreds of billions of tokens for large-scale efficacy (Zhu et al., 2024, Krajewski et al., 3 Jun 2025).
- Load Balancing: Explicitly regularizing expert utilization (e.g., auxiliary importance/load losses) is essential to prevent expert underutilization or collapse (Li et al., 2024, Zhu et al., 2024).
- Parameter-Efficient Fine-Tuning: Variants such as MixLoRA and GMoE apply LoRA-style low-rank adapters as the experts, with only adapter and router parameters trained for domain adaptation (Li et al., 2024, Bai et al., 2024).
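The adapter-as-expert idea in the last item can be sketched in a few lines. This is a generic illustration, not the MixLoRA or GMoE implementation: the function `lora_moe_forward`, the dictionary layout of adapters, and the omission of the usual LoRA scaling factor are all assumptions.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_moe_forward(x, w_frozen, adapters, gates):
    """Compute y = W x + sum_e gate_e * B_e (A_e x) over the routed adapters only.

    adapters maps expert id -> (A, B), with A of shape (rank, d_in) and
    B of shape (d_out, rank). gates is a list of (expert_id, weight) pairs.
    W stays frozen; only A, B and the router would receive gradients.
    """
    y = matvec(w_frozen, x)
    for expert_id, gate in gates:
        a, b = adapters[expert_id]
        delta = matvec(b, matvec(a, x))
        y = [yi + gate * di for yi, di in zip(y, delta)]
    return y


# Toy example: identity base weight and two rank-1 adapters.
w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    0: ([[1.0, 0.0]], [[1.0], [0.0]]),
    1: ([[0.0, 1.0]], [[0.0], [1.0]]),
}
y_routed = lora_moe_forward([1.0, 2.0], w, adapters, gates=[(0, 0.5)])
y_base = lora_moe_forward([1.0, 2.0], w, adapters, gates=[])
print(y_routed, y_base)
```

With no adapter routed, the layer reduces to the frozen base projection, which is why such variants can be added to a pretrained dense model without changing its initial behavior when the low-rank update starts at zero.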
Dynamic Adaptation: Some models employ dynamic activation mechanisms, e.g., adjugate experts in Grove MoE, which group experts into heterogeneous “big/LITTLE” clusters; shared adjugate experts are activated only as needed, further tailoring the capacity per token (Wu et al., 11 Aug 2025).
4. Empirical Performance, Specialization, and Trade-offs
Fine-grained MoE-LLMs consistently demonstrate superior task performance and convergence relative to dense or coarse MoE baselines at fixed compute (Krajewski et al., 3 Jun 2025, Zhu et al., 2024). Key empirical results include:
| Model/Setting | Active Params | Validation Loss (11B) | Avg. Downstream Acc. | Memory/Speed Trade-off |
|---|---|---|---|---|
| Dense | 2.7B | 2.23 | 48.4% | reference |
| 1×FLOPs-G8 | 2.7B | 2.18 | 50.6% | =1×FLOPs; similar throughput |
| 2×FLOPs-G8 | 3.9B | 2.17 | 51.5% | 1.5× memory; better accuracy |
Fine-grained MoEs converge in fewer steps and achieve higher sample efficiency. Benefits increase at larger model and data scales.
Task Specialization and Fine-Tuning: Expert Specialization Fine-Tuning (ESFT) exploits routing concentration in fine-grained MoEs: only the most heavily used experts per task are unlocked for tuning, resulting in parameter savings without quality loss (Wang et al., 2024). Finer granularity enables more sharply distinct sets of experts to be tuned for different tasks, improving both adaptation and generalization.
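One simple way to pick "the most heavily used experts" is to rank experts by how often a task's tokens are routed to them and keep the smallest set that covers a target share. The sketch below illustrates that idea under assumed names (`select_task_experts`, `threshold`) and made-up counts; the scoring rule used by ESFT itself may differ.

```python
def select_task_experts(token_counts, threshold=0.8):
    """Return the most-used experts whose cumulative routing share reaches `threshold`.

    token_counts maps expert id -> number of task tokens routed to that expert.
    """
    total = sum(token_counts.values())
    ranked = sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered = [], 0
    for expert_id, count in ranked:
        if covered >= threshold * total:
            break
        chosen.append(expert_id)
        covered += count
    return chosen


# Hypothetical routing statistics for one layer on one task.
counts = {0: 50, 1: 30, 2: 10, 3: 5, 4: 5}
trainable = select_task_experts(counts, threshold=0.75)
print(trainable)
```

Only the returned experts would be unfrozen for that task; all other experts, and typically the router, stay fixed. The more concentrated the routing, the fewer experts need to be tuned.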
5. Inference, Compression, and Hardware Co-Design
Efficient Inference: Serving fine-grained MoEs presents challenges due to the sparsity and dynamism of expert activation. Systems like fMoE implement fine-grained expert prefetching/offloading via trajectory and semantic signal matching, cutting inference latency nearly in half and improving memory hit rates (Yu et al., 7 Feb 2025).
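The hit-rate trade-off that expert offloading systems optimize can be seen with a plain least-recently-used cache. This sketch is not fMoE's policy, which relies on trajectory and semantic matching; the `ExpertCache` class and the `load_fn` callback are hypothetical stand-ins for moving expert weights from host to accelerator memory.

```python
from collections import OrderedDict


class ExpertCache:
    """Keep at most `capacity` experts resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # called on a miss to fetch expert weights
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self._store.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self.load_fn(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value


cache = ExpertCache(capacity=2, load_fn=lambda key: f"weights-{key}")
for expert in ["A", "B", "A", "C", "A", "B"]:
    cache.get(expert)
print(cache.hits, cache.misses)
```

Each miss stands for a blocking transfer during decoding. Prefetching policies try to issue those loads before the router asks for the expert, which is harder with many small experts because the set of active experts changes more from token to token.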
Model Compression: MC# achieves aggressive model size reduction and activation sparsity by combining pre-loading mixed-precision quantization per expert with token-level online expert pruning (via Gumbel-softmax sampling). This reduces parameter storage 6.2× and cuts expert activations >20%, with <2% accuracy drop on LM and VL benchmarks (Huang et al., 13 Oct 2025).
Hardware Acceleration: The A3D-MoE co-design demonstrates that conventional 2D accelerators are suboptimal for fine-grained MoE workloads due to variable GEMV/GEMM ratios and DRAM bottlenecks. Vertical integration (compute die + HBM + DRAM), V-Cache, dynamic 3D systolic arrays, and resource-aware fusion schedulers together achieve 1.8–2× latency and 2–4× energy reductions with minimal accuracy impact (Huang et al., 25 Jul 2025).
6. Routing Mechanisms and Expert Collaboration
Standard fine-grained MoEs employ a linear router to score and select experts, but recent proposals incorporate graph-based or dynamic routers for enhanced collaboration and load balance. GMoE introduces a graph convolutional network (GCN) router, which integrates token-expert and expert-expert signals, combined with Poisson and Normal distribution-based loss terms to encourage specialization and equitable expert utilization (Bai et al., 2024).
Task-Dependent Specialization: Empirical analysis shows lower cross-task overlap and entropy of expert activations in fine-grained MoEs, supporting their use in multi-task and transfer scenarios (Wang et al., 2024). Fine granularity allows the model to activate disjoint or near-disjoint expert sets for semantically distinct tasks.
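Both quantities mentioned here are easy to compute from routing logs. The sketch below uses Shannon entropy of per-expert token counts and Jaccard overlap between the expert sets of two tasks; these particular definitions, the function names, and the example counts are assumptions and may not match the metrics used in the cited analysis.

```python
import math


def activation_entropy(counts):
    """Shannon entropy (in nats) of the distribution of tokens over experts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def jaccard_overlap(experts_a, experts_b):
    """Share of experts used by both tasks among experts used by either."""
    a, b = set(experts_a), set(experts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


concentrated = activation_entropy([10, 0, 0, 0])
uniform = activation_entropy([5, 5, 5, 5])
overlap = jaccard_overlap({0, 1, 2}, {2, 3})
print(concentrated, uniform, overlap)
```

Low entropy means a task concentrates its tokens on a few experts, and low overlap means different tasks rely on different experts. Together they indicate the kind of specialization described above.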
7. Limitations, Open Challenges, and Practical Guidelines
Limitations:
- Router Complexity and Initialization: More experts increase router/dispatch overhead and complicate balancing expert utilization, especially during early training (Krajewski et al., 3 Jun 2025, Li et al., 2024).
- Inference Memory and Latency: Serving large fine-grained MoEs requires advanced caching, offloading, and scheduling strategies to avoid inference stalls (Yu et al., 7 Feb 2025, Huang et al., 13 Oct 2025).
- Hardware Fragmentation: Variable expert activation stresses classical hardware designs, motivating specialized accelerators or co-designs (Huang et al., 25 Jul 2025).
Best Practices:
- Choose granularity $G$ to match the compute budget rather than defaulting to $G = 1$, and use auxiliary load-balance objectives (Krajewski et al., 2024, Krajewski et al., 3 Jun 2025).
- Apply expert output rescaling and softmax-after-top-$k$ normalization to stabilize convergence (Zhu et al., 2024, Krajewski et al., 3 Jun 2025).
- For fine-tuning or adaptation, leverage parameter-efficient expert modules (LoRA/DoRA) and restrict training to task-relevant experts (Li et al., 2024, Bai et al., 2024, Wang et al., 2024).
- Design serving stack and hardware for dynamic, fine-grained expert workloads (Yu et al., 7 Feb 2025, Huang et al., 25 Jul 2025).
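The ordering of softmax and top-$k$ selection changes the gate weights an expert receives. The following sketch contrasts the two orderings on one vector of router logits; the function names and logits are illustrative assumptions rather than code from the cited works.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def topk_then_softmax(logits, k):
    """Select the k largest logits, then normalize over the selected ones only."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    return dict(zip(top, softmax([logits[e] for e in top])))


def softmax_then_topk(logits, k):
    """Normalize over all experts, then keep the k largest probabilities as-is."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    return {e: probs[e] for e in top}


logits = [2.0, 1.0, 0.0, -1.0]
after = topk_then_softmax(logits, k=2)
before = softmax_then_topk(logits, k=2)
print(after, sum(after.values()))
print(before, sum(before.values()))
print(topk_then_softmax(logits, k=1))
```

Normalizing after selection makes the active gate weights sum to one regardless of how many experts exist, which keeps the output scale stable as the expert count grows. With $k = 1$ the gate weight is always exactly 1, so that ordering carries no router signal in the top-1 case.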
Future advances are expected in adaptive expert shaping, cross-layer expert sharing, unified attention-expert MoE layers, and integration of software/hardware feedback loops for ultimate scalability and energy efficiency.
References:
(Krajewski et al., 2024, Krajewski et al., 3 Jun 2025, Li et al., 2024, Bai et al., 2024, Wu et al., 11 Aug 2025, Zhu et al., 2024, Yu et al., 7 Feb 2025, Zhao et al., 2024, Huang et al., 13 Oct 2025, Huang et al., 25 Jul 2025, Wang et al., 2024)