
Expert Pruning in Mixture-of-Experts

Updated 27 December 2025
  • Expert pruning approaches are techniques in MoE models that selectively remove underperforming expert subnetworks to enhance computational efficiency and reduce memory usage.
  • The methodology employs trajectory-driven strategies using dynamic programming to identify and retain the most critical paths based on node and edge importance metrics.
  • Empirical evaluations demonstrate that targeted pruning can achieve high sparsity (up to 50%) while maintaining competitive accuracy across various benchmarks.

An expert pruning approach refers to methods that selectively remove or deactivate entire experts (specialized subnetworks) within Mixture-of-Experts (MoE) architectures to optimize resource usage, memory footprint, and inference efficiency, while maintaining the composite model’s performance across target tasks. State-of-the-art approaches move beyond naïve or uniform expert removal by leveraging complex metrics and combinatorial optimization that account for both the heterogeneous roles of experts across layers and global dependencies in token-wise expert trajectories. This article provides a technical overview of advanced expert pruning methodologies, with a particular focus on trajectory-driven strategies as embodied by MoE Pathfinder, and situates them within the broader context of MoE model scaling and deployment.

1. MoE Architectures and the Rationale for Expert Pruning

Mixture-of-Experts architectures, widespread in LLMs, distribute capacity across many parallel expert subnetworks (typically parameterized as distinct feed-forward blocks), with lightweight routers dynamically selecting a sparse subset of experts per token. This design enables competitive accuracy with substantially lower FLOPs per inference, but introduces severe static memory and deployment costs due to the need to store all expert parameters, regardless of their runtime utilization.
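
To make the routing mechanism concrete, the following is a minimal sketch of sparse top-$k$ MoE routing in NumPy. It is illustrative only: the `moe_layer` helper, the array shapes, and the renormalization of gate weights over the selected experts are assumptions for exposition, not details taken from the cited work.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(h, router_w, experts, top_k=2):
    """h: (n_tokens, d) hidden states; router_w: (n_experts, d) router matrix;
    experts: list of callables mapping a (d,) vector to a (d,) vector."""
    probs = softmax(h @ router_w.T, axis=-1)          # (n_tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]      # sparse selection per token
    out = np.zeros_like(h)
    for t in range(h.shape[0]):
        gates = probs[t, top[t]]
        gates = gates / gates.sum()                   # renormalize over chosen experts
        for g, idx in zip(gates, top[t]):
            out[t] += g * experts[idx](h[t])          # gated sum of expert outputs
    return out
```

Only the parameters of the `top_k` selected experts contribute to each token's output, which is why per-token FLOPs stay low even though all expert parameters must remain resident in memory.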

Expert pruning addresses the proportional scaling of memory and storage requirements with the number of experts $N_e$ per layer. As observed empirically, many experts are functionally redundant or contribute disproportionately little to the final layer outputs for typical workload distributions. Pruning approaches aim to reduce computational and memory demands by removing such experts, either statically (architecture surgery before deployment) or adaptively (via task- or token-aware schemes), while avoiding the performance degradation associated with random or frequency-only pruning (Yang et al., 20 Dec 2025).

2. Trajectory-Driven Expert Pruning Formulation

The trajectory-driven expert pruning paradigm reconceptualizes MoE inference as traversing a weighted computation graph $G = (V, E)$, with each node $E_i^{(l)}$ (the $i$-th expert in layer $l$) annotated by a node weight $e_i^{(l)}$ reflecting expert importance, and each inter-layer connection $E_i^{(l)} \to E_j^{(l+1)}$ annotated by an edge weight $t_{i,j}^{(l)}$ capturing transition intensity.

A trajectory, or path $p = (i_1, i_2, \ldots, i_L)$, represents a sequence of expert choices, one per layer, for a specific input. The overall importance of a path is the product of its node and edge weights:

$$w_p = \prod_{l=1}^{L} e_{i_l}^{(l)} \cdot \prod_{l=1}^{L-1} t_{i_l, i_{l+1}}^{(l)}$$

For numerical stability, scores are summed in log-space:

$$\log w_p = \sum_{l=1}^{L} \log e_{i_l}^{(l)} + \sum_{l=1}^{L-1} \log t_{i_l, i_{l+1}}^{(l)}$$

The pruning task is recast as a global path-planning problem: over a representative calibration set, determine the top-$m$ most important trajectories (highest-scoring paths) and prune away all experts not involved in any of these top-$m$ trajectories (Yang et al., 20 Dec 2025).
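
As a concrete illustration of the log-space scoring above, a minimal sketch follows; the variable layout is hypothetical, with `node_w` and `edge_w` assumed to hold the precomputed $e_i^{(l)}$ and $t_{i,j}^{(l)}$ values from Section 3.

```python
import numpy as np

def log_path_score(path, node_w, edge_w):
    """path: expert indices (i_1, ..., i_L), one per layer.
    node_w[l][i]  -> e_i^{(l)};  edge_w[l][i][j] -> t_{i,j}^{(l)}."""
    # Sum of log node weights along the trajectory
    score = sum(np.log(node_w[l][i]) for l, i in enumerate(path))
    # Plus the log edge weights of consecutive expert transitions
    score += sum(np.log(edge_w[l][path[l]][path[l + 1]])
                 for l in range(len(path) - 1))
    return score
```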

3. Expert Importance Metrics—Node and Edge Weighting

MoE Pathfinder’s node and edge weights fuse three complementary signals, each computed over a calibration set:

  • Activation Strength: Measures upstream importance as the mean $\ell_2$-norm of an expert's output.

$$a_{i,k}^{(l)} = \left\| H_k^{(l-1)} (W_i^{(l)})^T \right\|_2, \quad a_i^{(l)} = \frac{1}{N_x} \sum_{k=1}^{N_x} a_{i,k}^{(l)}$$

  • Routing Probability: Measures downstream preference as the mean router softmax probability assigned to an expert.

$$r_{j,k}^{(l+1)} = \left[\operatorname{softmax}\!\left( H_k^{(l)} (R^{(l+1)})^T \right)\right]_j; \quad r_j^{(l+1)} = \frac{1}{N_x} \sum_{k=1}^{N_x} r_{j,k}^{(l+1)}$$

  • Reconstruction Error: Softmax of the negated average squared error between the full MoE output and the output with only one expert active.

$$\mathcal{L}_i^{(l)} = \frac{1}{N_x} \sum_{k=1}^{N_x} \left\| Y_k^{(l)} - \hat{Y}_{k,i}^{(l)} \right\|_2^2; \quad e_i^{(l)} = \operatorname{softmax}_i\!\left(-\mathcal{L}_i^{(l)}\right)$$

The directional nature of the computation graph means boundary-layer node weights are further weighted by the otherwise-missing edge terms (e.g., for the first layer, multiply $e_i^{(1)}$ by $r_i^{(1)}$; for the last, by $a_i^{(L)}$). Edge weights are defined as

$$t_{i,j}^{(l)} = a_i^{(l)} \cdot r_j^{(l+1)}$$

(Yang et al., 20 Dec 2025).
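
The sketch below illustrates how the three signals and the edge weights for a single layer might be computed from calibration statistics. The variable names and tensor shapes (`H_prev`, `H_curr`, `W_experts`, `R_next`, `Y_full`, `Y_single`) are assumptions for exposition, not the paper's implementation; in practice each expert is a full feed-forward block rather than a single matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_weights(H_prev, H_curr, W_experts, R_next, Y_full, Y_single):
    """H_prev: (N_x, d) inputs to layer l; H_curr: (N_x, d) outputs of layer l;
    W_experts: (N_e, d_out, d) per-expert weights; R_next: (N_e, d) router of layer l+1;
    Y_full: (N_x, d) full MoE layer output; Y_single: (N_e, N_x, d) single-expert outputs."""
    # Activation strength a_i: mean L2 norm of expert i's output on layer-l inputs
    a = np.array([np.linalg.norm(H_prev @ W.T, axis=-1).mean() for W in W_experts])
    # Routing probability r_j: mean softmax probability the next router assigns to expert j
    r = softmax(H_curr @ R_next.T, axis=-1).mean(axis=0)
    # Node weight e_i: softmax of the negated mean squared single-expert reconstruction error
    L_err = ((Y_full[None, :, :] - Y_single) ** 2).sum(axis=-1).mean(axis=1)
    e = softmax(-L_err)
    # Edge weights to the next layer: t_{i,j} = a_i * r_j
    t = np.outer(a, r)
    return e, a, r, t
```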

4. Algorithmic Pipeline: Calibration, Graph Construction, and Pruning

The step-by-step pruning workflow comprises:

  1. Calibration Set Construction: Apply $K$-means clustering to the training data, sampling one representative per cluster ($K = 5$–$20$ produces robust diversity). This mitigates sample bias by ensuring diverse token contexts.
  2. Forward Pass and Metric Computation: For each calibration sample $X$, perform a single forward pass through the MoE, recording all expert activations, router logits, and layer outputs. Compute $a_i^{(l)}$, $r_j^{(l)}$, $\mathcal{L}_i^{(l)}$, $e_i^{(l)}$, and $t_{i,j}^{(l)}$ as described above.
  3. Optimal Path Search: Treat the layered MoE as a weighted DAG, with log-weights assigned to each node (expert) and edge (transition). Using dynamic programming, recursively identify the top-$m$ highest-weight trajectories from input to output; each trajectory corresponds to a sequence of experts, one per layer (see the sketch after this list).
  4. Expert Set Aggregation: For each calibration instance, take the union of experts appearing in its top-$m$ trajectories across all layers, then globally unite these sets over all calibration instances to form the final "keep set" $E_{\text{keep}}$.
  5. Model Masking and Pruning: Define, per layer,

$$M_i^{(l)} = \begin{cases} 1, & E_i^{(l)} \in E_{\text{keep}} \\ 0, & \text{otherwise} \end{cases}$$

Prune all experts with $M_i^{(l)} = 0$ in each layer. Notably, per-layer pruning ratios $|E_{\text{keep}} \cap \text{layer } l| / N_e$ naturally vary across layers, as expert retention is trajectory- and data-dependent (Yang et al., 20 Dec 2025).
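
The following is a hedged sketch of steps 3–5: a dynamic-programming search over the layered expert DAG that keeps the $m$ best partial trajectories per expert (which is sufficient for exact top-$m$ full paths in a layered DAG), followed by aggregation of the surviving experts into a binary keep mask. Function names and tie-breaking details are assumptions; the paper's implementation may differ.

```python
import heapq
import numpy as np

def top_m_trajectories(log_e, log_t, m):
    """log_e: (L, N_e) log node weights; log_t: (L-1, N_e, N_e) log edge weights.
    Returns the m highest-scoring (score, path) pairs over the layered DAG."""
    L, N_e = log_e.shape
    # beams[i]: best partial trajectories ending at expert i of the current layer
    beams = [[(log_e[0, i], (i,))] for i in range(N_e)]
    for l in range(1, L):
        new_beams = []
        for j in range(N_e):
            cands = [(s + log_t[l - 1, i, j] + log_e[l, j], path + (j,))
                     for i in range(N_e) for s, path in beams[i]]
            # keeping the m best partials per node preserves all global top-m paths
            new_beams.append(heapq.nlargest(m, cands))
        beams = new_beams
    return heapq.nlargest(m, [p for beam in beams for p in beam])

def keep_mask(log_e, log_t, m):
    """Binary mask: mask[l, i] = 1 iff expert i of layer l lies on a top-m trajectory."""
    mask = np.zeros(log_e.shape, dtype=int)
    for _, path in top_m_trajectories(log_e, log_t, m):
        for l, i in enumerate(path):
            mask[l, i] = 1
    return mask
```

Per the aggregation step, such a mask would be computed per calibration instance and then united (element-wise OR) over all instances before pruning the experts whose mask value remains 0.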

5. Non-Uniformity, Hyperparameter Control, and Ablation Insights

A key outcome of the trajectory-driven formulation is non-uniform expert retention across layers: pruning is not enforced evenly, but emerges from the global optimality of cross-layer expert combinations. The primary sparsity hyperparameter is $m$ (the number of top paths per sample), which indirectly sets the target average sparsity via

$$\alpha = 1 - \frac{1}{L N_e} \sum_{l} \left| E_{\text{keep}}^{(l)} \right|$$
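
For reference, the identity above amounts to the following one-liner, assuming a binary keep mask of shape $(L, N_e)$ such as the one produced by the earlier (hypothetical) `keep_mask` sketch.

```python
def effective_sparsity(mask):
    """mask: (L, N_e) binary keep mask; returns alpha, the fraction of experts pruned."""
    L, N_e = mask.shape
    return 1.0 - mask.sum() / (L * N_e)
```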

Notable hyperparameters and their empirical impact:

| Parameter | Description | Effect / Default Range |
|---|---|---|
| $K$ | Number of $K$-means clusters | 5–20 (diversity; see Fig. 6 in (Yang et al., 20 Dec 2025)) |
| $m$ | Top-$m$ trajectories per sample | $m=1$ gives ≈50% sparsity; $m=500$ gives ≈25% sparsity |
| $N_x$ | Tokens per calibration sample | Set by prompt length; larger values yield more robust averages |
| $\alpha$ | Target sparsity | Implicitly determined by the choice of $m$ |
| Others | Cluster seed, calibration-set size, number of pruning rounds | Default settings typically sufficient |

Ablation studies confirm that leveraging both transition intensity (TI) and expert importance score (IS) yields state-of-the-art retention under fixed sparsity constraints. Discarding either signal, especially IS, sharply worsens pruning performance on reasoning tasks [(Yang et al., 20 Dec 2025), Table 6].

6. Empirical Evaluation and Deployment Outcomes

Experiments on state-of-the-art LLMs (e.g., Mixtral-8×7B, Mixtral-8×7B-Instruct) establish that trajectory-driven pruning at 50% expert sparsity achieves, for Mixtral-8×7B, 53.77% average accuracy versus 64.35% for the full model and only 28.87% for random expert dropping; for Mixtral-8×7B-Instruct, 53.52% versus 66.88% (full) and 34.52% (random). Against prior expert-merging and expert-pruning methods, this approach consistently achieves the top aggregate performance across MMLU, GSM8K, MedQA, and ARC [(Yang et al., 20 Dec 2025), Table 4]. Retaining a higher proportion of experts (e.g., 25% sparsity) further boosts accuracy (Mixtral-8×7B: 59.20% vs. 53.77% at 50% sparsity).

Analysis of layerwise expert retention patterns reveals emergent specialization: early layers typically retain a shared pool of generalists, while deeper layers are dominated by a few highly specialized experts, consistent with data-driven polarization in selection frequencies (see Figure 1). Pruning accuracy shows unimodal sensitivity to $K$, confirming the importance of calibration-set design (Figure 2).

The non-uniform, trajectory-level optimization enables high compression ratios (up to a 2× reduction in total parameters) with only moderate performance degradation, and is robust across the six evaluated benchmarks.

7. Comparative Perspective and Evolving Directions

Trajectory-driven expert pruning, as formalized in MoE Pathfinder, addresses deficiencies in prior art that relied on local expert metrics or enforced uniform pruning schedules. Conventional methods—such as frequency-based, router-activation, or reconstruction-loss–only expert ranking—fail to exploit the interdependencies between expert selections across layers, thereby discarding important non-local signals (Yang et al., 20 Dec 2025).

The present approach builds on global combinatorial search, dynamic programming for path enumeration, and a multi-signal fusion principle that is increasingly central in high-quality pruning strategies across recent MoE LLM literature.

Emerging future directions suggested by trajectory-driven pruning research include:

  • Dynamically scheduling calibration-set selection and top-$m$ path search over multiple rounds.
  • Integration with block-wise or semi-global importance coordination for further efficiency (Yang et al., 1 Nov 2024).
  • Downstream task-aware pruning where trajectory weights are modulated by externally supervised signals.
  • Application to hybrid MoE architectures beyond language (vision, multi-modal) domains.

In sum, the state of the art in expert pruning now rests on sophisticated optimization over expert trajectories, integrating both per-expert and cross-layer information, and establishing new efficiency benchmarks for scalable, deployable Mixture-of-Experts LLMs (Yang et al., 20 Dec 2025).
