Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

Published 29 Apr 2026 in cs.LG | (2604.26340v1)

Abstract: LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE). However, existing LoRA-MoE frameworks typically adopt a fixed and uniform expert configuration across heterogeneous Transformer modules (e.g., attention query/key projections and MLP gating networks), ignoring their distinct functional roles and capacity requirements. This design leads to localized over-provisioning, redundant trainable parameters, and unnecessary optimizer-state overhead. Moreover, prior methods enforce load balancing among experts throughout training. Although beneficial in the early stage, this constraint becomes restrictive once routing patterns stabilize, limiting expert specialization on downstream tasks. In this paper, we propose DMEP, a novel LoRA-MoE fine-tuning framework based on Dynamic Module-wise Expert Pruning. DMEP tracks expert utilization during training and physically removes low-utility experts on a per-module basis, yielding a more compact expert structure tailored to different modules. The pruned model then continues training without the load-balancing constraint, freeing the remaining experts to focus entirely on the downstream task and develop specialized expertise. By jointly adapting module-wise expert capacity and eliminating unnecessary balancing, DMEP improves both parameter efficiency and training efficiency. Extensive experiments on multiple reasoning benchmarks show that DMEP reduces trainable parameters by 35%–43% and improves training throughput by about 10%, while maintaining or surpassing the downstream reasoning accuracy of uniform LoRA-MoE baselines.

Authors (3)

Summary

  • The paper demonstrates that DMEP dynamically prunes underutilized experts to align module capacity with actual routing demands.
  • It introduces a three-phase framework using online routing statistics to structurally eliminate redundant parameters, enhancing training throughput.
  • Empirical evaluations on Qwen3-0.6B and Qwen3-8B show parameter reductions of 35–43% with maintained or improved accuracy.

Adaptive, Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

Motivation and Limitations of Uniform Expert Allocation

The LoRA-MoE paradigm enhances parameter-efficient fine-tuning by combining LoRA's low training cost with MoE's conditional computational capacity. However, typical LoRA-MoE implementations assume a fixed, uniform allocation of experts across all Transformer modules, irrespective of functional heterogeneity. This uniform configuration ignores module- and layer-specific requirements, resulting in localized over-provisioning and latent parameter redundancy. Empirical profiling of expert routing during fine-tuning exposes substantial intra-layer and intra-module heterogeneity: attention projection modules demonstrate skewed routing, with token traffic concentrated in a minority of experts, while MLP projections distribute tokens more evenly (Figure 1).

Figure 1: Intra-layer heterogeneity of expert utilization—O_PROJ attention projection shows dominating experts, while GATE_PROJ MLP projections distribute tokens uniformly.

Uniform layer-wise allocation thus fails to reflect the distinct demands of attention versus feed-forward modules, leading to severe capacity over-provisioning (Figure 2).

Figure 2: Module-wise utilization heatmap, displaying extreme load imbalance (dark red) in O_PROJ, versus uniform routing (light yellow) in GATE_PROJ, highlighting intra-layer heterogeneity and redundancy.

Moreover, persistently enforcing the auxiliary load-balancing loss $L_{\text{aux}}$ during training increases routing entropy and inhibits task-driven specialization even after routing preferences stabilize. Disabling $L_{\text{aux}}$ after the exploration phase yields sharper expert specialization and slightly higher accuracy (Figure 3).

Figure 3: Routing entropy with and without $L_{\text{aux}}$; entropy is lower and routing drift stabilizes rapidly once the auxiliary loss is removed.

Figure 4: Fine-tuning accuracy convergence, showing peak accuracy with load-balancing disabled after routing preferences stabilize.
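The routing-entropy comparison above can be made concrete with a small sketch. It assumes that routing entropy is the Shannon entropy of each module's empirical token share per expert; the summary does not state the exact definition, and the function name `routing_entropy` and the token counts are illustrative only.

```python
import math


def routing_entropy(token_counts):
    """Shannon entropy (in nats) of the per-expert token share for one module.

    Assumption: this is one plausible reading of "routing entropy";
    the exact definition used by the paper is not given in this summary.
    """
    total = sum(token_counts)
    if total == 0:
        raise ValueError("no tokens were routed")
    entropy = 0.0
    for count in token_counts:
        if count > 0:  # experts that received no tokens contribute nothing
            share = count / total
            entropy -= share * math.log(share)
    return entropy


# Illustrative counts for 8 experts: evenly loaded vs. dominated by a few experts.
uniform_entropy = routing_entropy([100] * 8)
skewed_entropy = routing_entropy([700, 200, 50, 50, 0, 0, 0, 0])
print(uniform_entropy > skewed_entropy)  # True
```

Under this reading, an evenly loaded module sits at the maximum entropy of log(8) for 8 experts, and concentrated routing pulls the value down, which matches the direction of the effect described for removing the auxiliary loss.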

Dynamic Module-wise Expert Pruning (DMEP) Framework

DMEP introduces a dynamic, module-adaptive approach wherein expert capacity is progressively pruned based on observed online routing statistics:

  • Phase I: The model initializes with a uniform dense configuration and auxiliary load-balancing to prevent early routing collapse. An online tracker records discrete token-to-expert assignments using hard Top-$k$ routing decisions, ensuring empirical utilization reflects actual computation.
  • Phase II: Upon stabilization of routing distributions (minimal routing drift), module-wise pruning is applied. Experts with utilization below a threshold $\tau$ are structurally excised, except that at least $K_{\min}$ experts are retained per module to keep routing valid. Pruning is realized through physical slicing of both parameter tensors and optimizer states, permanently eliminating the computational and memory overhead of removed experts.
  • Phase III: Fine-tuning resumes with the load-balancing loss fully disabled ($\lambda = 0$), enabling surviving experts to specialize for the downstream task without penalization for concentrated routing (Figure 5).

    Figure 5: DMEP workflow—dense initialization followed by online utilization tracking, module-wise structural pruning, and task-driven specialization after load-balancing removal.
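The three phases above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `UtilizationTracker`, `select_experts`, and `slice_experts` are hypothetical names, utilization is assumed to be each expert's share of all recorded Top-$k$ assignments, and nested lists stand in for real parameter and optimizer tensors.

```python
from collections import Counter


class UtilizationTracker:
    """Counts hard Top-k token-to-expert assignments for one module (hypothetical helper)."""

    def __init__(self, num_experts):
        self.num_experts = num_experts
        self.counts = Counter()
        self.total = 0

    def update(self, topk_indices):
        # topk_indices: one list of selected expert ids per token.
        for token_experts in topk_indices:
            for expert in token_experts:
                self.counts[expert] += 1
                self.total += 1

    def utilization(self):
        # Assumed normalization: share of all recorded assignments.
        if self.total == 0:
            return [0.0] * self.num_experts
        return [self.counts[e] / self.total for e in range(self.num_experts)]


def select_experts(utilization, tau, k_min):
    """Keep experts with utilization >= tau, but never fewer than k_min."""
    keep = [e for e, u in enumerate(utilization) if u >= tau]
    if len(keep) < k_min:
        # Fall back to the k_min most-used experts so routing stays valid.
        ranked = sorted(range(len(utilization)), key=lambda e: (-utilization[e], e))
        keep = sorted(ranked[:k_min])
    return keep


def slice_experts(per_expert_tensors, keep):
    """Physically drop pruned experts from any structure indexed by expert id."""
    return {name: [rows[e] for e in keep] for name, rows in per_expert_tensors.items()}


# Phase I: record hard routing decisions (four tokens, Top-2 routing, 4 experts).
tracker = UtilizationTracker(num_experts=4)
tracker.update([[0, 1], [0, 1], [0, 2], [0, 1]])

# Phase II: per-module pruning decision, then slice weights AND optimizer state.
keep = select_experts(tracker.utilization(), tau=0.2, k_min=2)
params = {"lora_A": [[0.1], [0.2], [0.3], [0.4]]}  # toy per-expert weights
adam_state = {
    "exp_avg": [[1.0], [2.0], [3.0], [4.0]],      # toy first moments
    "exp_avg_sq": [[1.0], [4.0], [9.0], [16.0]],  # toy second moments
}
params = slice_experts(params, keep)
adam_state = slice_experts(adam_state, keep)
print(keep, params["lora_A"])  # [0, 1] [[0.1], [0.2]]
```

Slicing the optimizer state with the same index list as the parameters is what keeps the two aligned after pruning. Phase III would then continue training on the sliced structures with the load-balancing coefficient set to zero; the values of `tau` and `k_min` here are illustrative and not taken from the paper.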

Empirical Evaluation

Extensive experiments on Qwen3-0.6B and Qwen3-8B across ScienceQA, OpenBookQA, and GSM8K confirm DMEP's efficacy. DMEP reduces trainable parameters by 35–43% and improves training throughput by ~10%, with accuracy maintained or surpassed relative to uniform MoE baselines. For instance, on OpenBookQA with Qwen3-8B, DMEP cuts parameters from 185.2M to 107.9M (41.7% reduction) while preserving accuracy at 95.00%. On GSM8K with Qwen3-0.6B, accuracy improved from 17.00% to 19.00% under a 40.7% parameter reduction.
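As a quick arithmetic check of the OpenBookQA figures quoted above, the relative reduction follows directly from the two parameter counts. The helper name `param_reduction` is illustrative only.

```python
def param_reduction(before_millions, after_millions):
    """Relative reduction in trainable parameters, as a fraction of the original."""
    return (before_millions - after_millions) / before_millions


# Qwen3-8B on OpenBookQA: 185.2M trainable parameters before pruning, 107.9M after.
reduction = param_reduction(185.2, 107.9)
print(f"{reduction:.1%}")  # 41.7%
```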

DMEP's impact is pronounced in smaller models, where redundancy removal leads to modest accuracy gains; in larger models, benefits are seen primarily in efficiency. Aggressive threshold adjustments provide controllable trade-offs: increasing $\tau$ to 0.15 or 0.20 further reduces parameters and increases throughput but yields minor accuracy losses. Early pruning before adequate exploration reduces accuracy, while late pruning retains more redundancy.
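Because pruning too early or too late both carry a cost, the pruning step needs some stabilization criterion. The sketch below shows one possible trigger, assuming routing drift is measured as the total-variation distance between consecutive utilization snapshots. The summary does not specify the drift metric, so `routing_drift`, `first_stable_step`, `eps`, and `patience` are all hypothetical.

```python
def routing_drift(prev, curr):
    """Total-variation distance between two utilization distributions (assumed metric)."""
    return 0.5 * sum(abs(p - c) for p, c in zip(prev, curr))


def first_stable_step(snapshots, eps=0.02, patience=2):
    """Index of the first snapshot where drift stayed below eps for
    `patience` consecutive comparisons, or None if routing never settles."""
    calm = 0
    for i in range(1, len(snapshots)):
        if routing_drift(snapshots[i - 1], snapshots[i]) < eps:
            calm += 1
            if calm >= patience:
                return i
        else:
            calm = 0  # drift spiked again, so restart the count
    return None


# Illustrative per-expert utilization snapshots for one module over training.
snapshots = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.50, 0.30, 0.15, 0.05],
    [0.51, 0.30, 0.14, 0.05],
    [0.51, 0.31, 0.13, 0.05],
]
prune_at = first_stable_step(snapshots)
print(prune_at)  # 4
```

A larger `patience` delays pruning and guards against premature decisions, at the price of carrying redundant experts for longer.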

Structural Analysis of the Post-Pruning Architecture

DMEP's pruning outcomes reflect the statistical routing heterogeneity. Post-pruning heatmaps reveal sparse expert retention in attention projections (often only 4–5 experts per module), while MLP projections retain 7–8 experts to support broad knowledge distribution. This observed asymmetry validates DMEP's adaptive capacity allocation, confirming alignment between utilization patterns and architectural compression (Figure 6).

Figure 6: Post-pruning expert heatmap—attention projections sparsified with fewer retained experts, MLP projections maintain broad capacity.

Implications and Future Directions

DMEP systematically overcomes key bottlenecks in symmetric LoRA-MoE fine-tuning: it delivers module-adaptive compression, removes optimizer-state overhead by structural parameter slicing, and enables sharper specialization by relaxing load-balancing constraints post-routing stabilization. The framework is deployable for both small and large models and allows practitioners to select desired efficiency-accuracy trade-offs. Future developments may extend online pruning to multi-task or dynamic environments, further optimizing conditional computation and expanding the scalability frontier for resource-constrained LLM fine-tuning.

Conclusion

DMEP provides a robust, task-adaptive framework for parameter-efficient fine-tuning, leveraging module-wise utilization profiling to prune redundant experts and optimize downstream specialization. Across empirical benchmarks, DMEP consistently reduces parameter footprint and accelerates throughput without sacrificing accuracy, advancing the Pareto frontier for fine-tuned LLMs. The alignment between statistical routing heterogeneity and structural compression underscores DMEP's effectiveness in tailoring capacity to functional module demands (2604.26340).