Router-weighted Expert Activation Pruning (REAP)
- The paper introduces REAP, an algorithm that scores experts using router gate-values and activation norms for principled pruning in MoE architectures.
- The paper demonstrates that REAP maintains generative performance—achieving less than an 8% drop—despite over 50% expert pruning, outperforming merging methods.
- The paper validates REAP on large-scale models, showing that preserving router-driven control enhances model specialization and speeds up inference.
Router-weighted Expert Activation Pruning (REAP) is an algorithmic paradigm addressing the efficient compression and adaptation of Mixture-of-Experts (MoE) architectures, with particular focus on maintaining performance in large-scale, sparsely-activated models. REAP encompasses both the identification and selective removal (pruning) of experts according to a saliency metric that incorporates router gate-values and expert activation norms. This approach preserves the router’s independent, input-conditioned control—an essential property for generative and task-flexible LLMs. REAP has demonstrated superior performance preservation compared to merging-based methods, particularly for generative workloads such as code synthesis, tool-calling, and reasoning under compression ratios exceeding 50% (Lasby et al., 15 Oct 2025).
1. Fundamental Principles of REAP
REAP is formulated to optimize expert selection in MoE layers by leveraging the statistical outputs of learned routing mechanisms. In SMoE models, a router assigns input tokens to experts via gate-values, which indicate both how likely and how strongly an expert is activated for a given input. Rather than indiscriminately merging or pruning experts, REAP scores and ranks experts by their cumulative utility across a representative dataset.
Formally, the REAP saliency score for expert $j$ is:

$$S_j = \frac{1}{|\mathcal{X}_j|} \sum_{x \in \mathcal{X}_j} g_j(x)\, \left\| E_j(x) \right\|_2$$

where $\mathcal{X}_j$ is the input subset for which expert $j$ is active (top-K by router gate), $g_j(x)$ is the gate-value, and $E_j(x)$ the expert's activation (Lasby et al., 15 Oct 2025). Experts with the lowest $S_j$ values are pruned to reduce parameter and memory overhead while preserving router control.
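As a concrete illustration, the following Python sketch computes this saliency score from calibration-time statistics. The function names, array shapes, and top-K mask construction are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

def reap_saliency(gate_values, expert_mask, expert_outputs):
    """REAP saliency S_j: the mean, over tokens routed to expert j, of the
    router gate-value times the L2 norm of the expert's activation.

    gate_values:    (T, E) router gate-values per token and expert
    expert_mask:    (T, E) boolean, True where the expert is in the top-K
    expert_outputs: (T, E, D) expert activations (only routed entries are used)
    """
    T, E = gate_values.shape
    scores = np.zeros(E)
    for j in range(E):
        routed = expert_mask[:, j]                  # X_j: tokens where expert j is active
        if routed.any():
            norms = np.linalg.norm(expert_outputs[routed, j, :], axis=-1)  # ||E_j(x)||_2
            scores[j] = np.mean(gate_values[routed, j] * norms)            # mean g_j(x) * ||E_j(x)||_2
    return scores

def experts_to_prune(scores, compression_ratio=0.5):
    """Indices of the least-salient experts at the desired compression ratio."""
    n_prune = int(len(scores) * compression_ratio)
    return np.argsort(scores)[:n_prune]

# Toy usage with synthetic statistics and a crude top-2 mask:
rng = np.random.default_rng(0)
gates = rng.random((256, 8))
mask = gates >= np.sort(gates, axis=-1)[:, [-2]]    # True for each token's top-2 gates
acts = rng.standard_normal((256, 8, 16))
print(experts_to_prune(reap_saliency(gates, mask, acts)))
```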
This design principle allows retention of critical model specialization and task fidelity by aligning pruning with real usage statistics from the router’s decision process.
2. Comparison with Expert Merging and Prior Pruning Criteria
Expert merging approaches average the parameters of "similar" experts into a single merged unit, typically guided by clustering over router logits or expert activation vectors (Muqeeth et al., 2023, Sarkar et al., 2 Sep 2024). Although merging can reduce redundancy, it leads to "functional subspace collapse": a loss of the router's ability to exert independent, input-dependent expert selection. Theoretical analysis in REAP shows this collapse introduces irreducible error, substantially degrading performance on generative tasks (Lasby et al., 15 Oct 2025).
Other pruning baselines operate on frequency counts (how often experts are selected) or activation norm thresholds, but lack joint consideration of the router’s confidence and functional activation, resulting in less principled expert selection. REAP, by contrast, incorporates both router gate-value and activation norm in its metric, aligning pruning more closely with the true operational significance of each expert.
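For contrast, here are hedged sketches of those two baseline families, using the same assumed shapes as the reap_saliency sketch above; both are simplified stand-ins rather than the exact criteria of any cited method. Each discards one of the two signals that REAP combines:

```python
import numpy as np

def frequency_score(expert_mask):
    """Frequency baseline: how often each expert is selected.
    Ignores both router confidence and activation magnitude."""
    return expert_mask.sum(axis=0).astype(float)

def activation_norm_score(expert_outputs, expert_mask):
    """Norm baseline: mean activation norm when the expert is active.
    Ignores the router's gate-values entirely."""
    T, E, _ = expert_outputs.shape
    scores = np.zeros(E)
    for j in range(E):
        routed = expert_mask[:, j]
        if routed.any():
            scores[j] = np.linalg.norm(expert_outputs[routed, j, :], axis=-1).mean()
    return scores
```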
Empirical results demonstrate that under 50% expert pruning, REAP's mean performance drop on coding benchmarks is substantially smaller than that of merging (Lasby et al., 15 Oct 2025).
3. Algorithmic Framework and Technical Implementation
REAP operates in a one-shot, post-training manner. Given a pretrained SMoE model and a calibration dataset, the procedure is as follows (a runnable sketch appears after this list):
- For each MoE layer, accumulate gate-values and activation norms for each expert across the calibration data.
- Compute the REAP saliency score $S_j$ for each expert.
- Rank experts by $S_j$ and prune those with the lowest scores according to the desired compression ratio.
- No retraining is required, but optional post-pruning distillation (for example, expert-wise knowledge distillation (Xie et al., 15 Oct 2024)) can be used to further recover or refine performance.
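A minimal end-to-end sketch of this one-shot procedure on a toy SMoE layer is shown below. The ToyMoELayer class, its shapes, and the pruning helper are illustrative assumptions; a production implementation must also re-index the router's output dimension and handle expert-parallel layouts, which are omitted here:

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Minimal token-routed SMoE layer, only for demonstrating the procedure."""
    def __init__(self, d_model=16, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

@torch.no_grad()
def reap_prune_layer(layer, calib_tokens, compression_ratio=0.5):
    """One-shot REAP for a single layer: accumulate g_j(x) * ||E_j(x)||_2
    over calibration tokens, then drop the least-salient experts."""
    E = len(layer.experts)
    sums, counts = torch.zeros(E), torch.zeros(E)
    gates = torch.softmax(layer.router(calib_tokens), dim=-1)    # (T, E) gate-values
    topv, topi = gates.topk(layer.top_k, dim=-1)                 # top-K routing decisions
    for j in range(E):
        mask = topi == j                                         # (T, K): expert j active
        if mask.any():
            tokens = mask.any(dim=-1)                            # X_j: tokens routed to j
            g = topv[mask]                                       # g_j(x), one per routed token
            norms = layer.experts[j](calib_tokens[tokens]).norm(dim=-1)  # ||E_j(x)||_2
            sums[j], counts[j] = (g * norms).sum(), tokens.sum()
    saliency = sums / counts.clamp(min=1)                        # S_j (zero if never routed)
    keep = saliency.argsort(descending=True)[: E - int(E * compression_ratio)]
    layer.experts = nn.ModuleList(layer.experts[i] for i in sorted(keep.tolist()))
    # NOTE: layer.router must also be re-indexed to `keep`; omitted for brevity.
    return saliency, keep

layer = ToyMoELayer()
saliency, kept = reap_prune_layer(layer, torch.randn(256, 16))
print("saliency per expert:", saliency)
print("kept experts:", sorted(kept.tolist()))
```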
The pruning process applies directly to SMoE architectures, that is, those with token-routed, feedforward expert layers. The approach generalizes across scales, proving effective on models from $20$ billion to $1$ trillion parameters, and applies to diverse tasks including code generation and multi-turn tool-calling.
4. Influence on Model Capacity, Specialization, and Routing
Preserving router-driven control is central to REAP. By retaining only salient experts, the remaining expert pool maintains high diversity, capacity, and specialization. Power-law distributions of expert saliency scores have been observed, indicating that a minority of experts contribute disproportionately to model capability, while many are near-obsolete and can be pruned with minimal impact (2505.17639).
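A small synthetic illustration of why a heavy-tailed saliency distribution permits aggressive pruning; the Pareto-distributed scores below are simulated, not measured from any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)
saliency = np.sort(rng.pareto(1.5, size=128))[::-1]       # synthetic heavy-tailed scores
cumulative = np.cumsum(saliency) / saliency.sum()
head = int(0.25 * len(saliency))                          # head = top 25% of experts
print(f"top 25% of experts carry {cumulative[head - 1]:.0%} of the saliency mass")
```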
Studies also show that REAP, unlike expert merging, maintains the input-dependent routing fidelity which is crucial for generative and compositional tasks. In settings requiring language family-specific capabilities or high task adaptivity, router-weighted pruning can be tuned by evaluating multilingual or domain-specific activation patterns (Liu et al., 26 Feb 2024, 2505.17639).
5. Performance Metrics and Empirical Validation
REAP has been extensively evaluated on large-scale generative tasks. For example, on Qwen3-Coder-480B and Kimi-K2 under 50% expert reduction, coding accuracy and tool-calling success rates remain nearly unchanged relative to uncompressed models (Lasby et al., 15 Oct 2025). For Mixtral-8x7B, similar gains have been reported: WikiText perplexity improves over alternative pruning baselines, and post-pruning knowledge distillation restores the model to near its original zero-shot accuracy (Xie et al., 15 Oct 2024).
Further, inference speed increases roughly in proportion to the reduction in expert parameters, and human evaluation reveals negligible loss in output quality for both general and specialized expert pruning (Zhao et al., 3 Jun 2025).
6. Adaptations, Extensions, and Related Approaches
Recent developments generalize the REAP principles. Probabilistic filtering via the Task-Conditioned Expected Selection Score (TCESS) combines top-K router selections with confidence thresholds for task-specific pruning (2505.17639), while dynamic test-time rerouting with lightweight additive vectors enables online adaptation to distribution shifts using router statistics alone (Su et al., 16 Oct 2025). Methods such as SteerMoE provide behavior control through paired-example expert (de)activation, leveraging differentials in router statistics for soft routing interventions (Fayyaz et al., 11 Sep 2025).
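As an illustration of the probabilistic-filtering idea, the sketch below implements one plausible reading of a TCESS-like score (top-K selection events gated by a confidence threshold); the exact formulation is given in (2505.17639):

```python
import numpy as np

def tcess_like_score(gate_values, top_k=2, tau=0.2):
    """Task-conditioned expected selection: the average gate mass an expert
    receives over a task's calibration set, counting only top-K selections
    whose gate-value clears the confidence threshold tau. One plausible
    reading of TCESS, not its exact published formulation."""
    T, E = gate_values.shape
    top = np.argsort(-gate_values, axis=-1)[:, :top_k]    # top-K experts per token
    selected = np.zeros_like(gate_values, dtype=bool)
    np.put_along_axis(selected, top, True, axis=-1)
    confident = selected & (gate_values >= tau)           # confidence filter
    return (gate_values * confident).sum(axis=0) / T      # expected selection mass
```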
Similarity-preserving load-balancing losses also complement REAP: by stabilizing router assignments across similar inputs, they reduce expert redundancy and improve convergence speed (Omi et al., 16 Jun 2025).
7. Practical Implications and Future Directions
REAP offers a principled, empirically validated solution for one-shot MoE compression. It facilitates deployment of large SMoE models in memory-constrained environments, supports broader accessibility for generative workloads, and provides a mathematical rationale for retaining router-driven modularity under compression.
Recent research suggests further directions:
- Integration with hybrid compression (quantization, low-rank factorization)
- Structured pruning for hardware-optimized (e.g., tensor-core) deployment
- Advanced task-sensitive router metrics, potentially combining confidence, frequency, and contextual entropy
- Fine-grained domain/language/task pruning via corpus-based relevance evaluation (Zhao et al., 3 Jun 2025, 2505.17639)
In summary, Router-weighted Expert Activation Pruning (REAP) stands as a key compression strategy for MoE architectures. By aligning expert retention with router activation statistics, REAP preserves the adaptive modularity underlying modern LLMs, unlocking efficient and high-fidelity deployment for both general and specialized generative applications (Lasby et al., 15 Oct 2025, Xie et al., 15 Oct 2024, 2505.17639).