C3PO: Optimizing MoE Expert Re-Mixing

Updated 18 December 2025
  • The paper introduces C3PO, a framework that re-mixes expert pathways at test time, closing a 10–20% performance gap in MoE LLMs.
  • It leverages surrogate objectives—via mode-finding, kernel regression, or gradient descent—and kNN-based neighbor retrieval to refine critical layers and core experts.
  • Empirical evaluations on OLMoE and DeepSeekMoE show significant accuracy gains (up to +15%) while reducing computational overhead.

Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO) for Test-Time Expert Re-Mixing

Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO) is a test-time intervention framework for Mixture-of-Experts (MoE) LLMs that addresses persistent sub-optimality in the routing decisions of pretrained routers. Empirical analysis across major benchmarks reveals a substantial (10–20%) accuracy gap between standard pretraining-based routing and an oracle that selects optimal expert pathways post hoc. C3PO optimizes only the most influential layers (“critical layers”) and a reduced set of high-impact experts (“core experts”) at inference time, using a collaborative surrogate loss estimated from neighboring successful reference samples. This yields test-sample-specific re-mixing of expert selection without modifying model weights or performing any supervised tuning at test time, producing robust task gains at modest computational cost (Li et al., 10 Apr 2025).

1. MoE Routing Suboptimality and Test-Time Pathway Gaps

Mixture-of-Experts LLMs route each input token through a sparsely activated subset of experts per layer, as determined by a learned router. Despite specialized expert layers, pretrained routers exhibit myopic routing that leaves a considerable margin for improved performance. Oracle studies on OLMoE and DeepSeek-MoE indicate achievable test-time accuracy improvements ranging from 7% to 15% absolute (e.g., on ARC-C: 51.3% → 66.3%) (Section 6), signifying that the core limitation is due to suboptimal expert pathways, not model capacity. Standard adaptation techniques like in-context learning or prompt tuning do not address expert selection and fail to close more than a fraction of this gap.

2. Surrogate Objectives for Pathway Optimization

Direct optimization of expert pathways at inference is infeasible, as ground truth is unknown. C3PO introduces reference-based surrogate objectives by leveraging a collection of reference samples with known ground truth and precomputed pathway activations:

  • Let $x$ be the test input and $f(x, \omega)$ the model output given pathway parameters $\omega \in \mathbb{R}^{L \times E}$, with $L$ layers and $E$ experts per layer.
  • Reference set $\{(x_i, y_i, \omega_i)\}_{i=1}^m$, where each $y_i$ is a correct output and $\omega_i$ the default routing weights.
  • Retrieve “successful neighbors” $N(x) = \mathrm{kNN}(x, \text{reference set})$ via an embedding model $E(\cdot)$ (e.g., NV-Embed-V2); define a kernel $K(x_i, x)$.

Three surrogates are proposed:

  • Mode-finding (Mean-shift): Iteratively shift $\omega$ towards the densest region in nearby successful pathways:

$$\bar{\omega} = \frac{\sum_{i \in N_\omega(\omega)} K(\omega_i, \omega)\,\omega_i}{\sum_{i \in N_\omega(\omega)} K(\omega_i, \omega)}, \qquad \omega \leftarrow \alpha \omega + (1-\alpha)\bar{\omega}$$

with $N_\omega(\omega)$ the top-$k$ reference pathways in $\omega$-space.
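As a concrete illustration, the mean-shift update above can be sketched in NumPy; the function name, kernel bandwidth, and array layout are assumptions for the sketch, not the paper's reference implementation.

```python
import numpy as np

def mode_finding_step(omega, ref_omegas, alpha=0.5, sigma=1.0, k=3):
    """One mean-shift step of omega toward the densest region of nearby
    successful pathways (illustrative sketch; names are hypothetical).

    omega      : (L, E) current pathway weights for the test sample
    ref_omegas : (m, L, E) pathway weights of successful reference samples
    """
    # Squared distances in pathway (omega) space, one per reference sample.
    d2 = ((ref_omegas - omega) ** 2).sum(axis=(1, 2))
    # Restrict to the top-k nearest reference pathways, N_omega(omega).
    nn = np.argsort(d2)[:k]
    w = np.exp(-d2[nn] / (2 * sigma ** 2))           # kernel K(omega_i, omega)
    omega_bar = (w[:, None, None] * ref_omegas[nn]).sum(axis=0) / w.sum()
    return alpha * omega + (1 - alpha) * omega_bar   # interpolate toward the mode
```

Repeating this step contracts $\omega$ toward the local mode of the successful reference pathways while the interpolation factor $\alpha$ retains part of the router's original decision.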

  • Kernel Regression: Compute a weighted average pathway $\hat{\omega}$ from the neighborhood $N(x)$:

$$\hat{\omega} = \frac{\sum_{i \in N(x)} K(x_i, x)\,\omega_i}{\sum_{i \in N(x)} K(x_i, x)}$$

and interpolate $\omega \leftarrow \alpha\omega + (1-\alpha)\hat{\omega}$, with the optimal $\alpha$ found by line search on a proxy loss.
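Under the same assumptions (Gaussian kernel, NumPy arrays), the kernel-regression surrogate reduces to a weighted average over the retrieved neighborhood plus a line search over $\alpha$; this is a sketch with hypothetical names, not the authors' code.

```python
import numpy as np

def kernel_regression_pathway(x_emb, ref_embs, ref_omegas, sigma=1.0):
    """Weighted-average pathway from the embedding-space neighborhood N(x)
    (illustrative sketch; the neighborhood here is all reference rows)."""
    d2 = ((ref_embs - x_emb) ** 2).sum(axis=1)   # distances d(E(x_i), E(x))^2
    w = np.exp(-d2 / (2 * sigma ** 2))           # kernel K(x_i, x)
    return (w[:, None, None] * ref_omegas).sum(axis=0) / w.sum()

def interpolate(omega, omega_hat, proxy_loss, alphas=np.linspace(0, 1, 11)):
    """Line search over alpha on a caller-supplied proxy loss."""
    best = min(alphas, key=lambda a: proxy_loss(a * omega + (1 - a) * omega_hat))
    return best * omega + (1 - best) * omega_hat
```

The proxy loss is whatever label-free criterion the caller supplies; the grid over $\alpha$ stands in for the line search described above.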

  • Neighborhood Gradient Descent (NGD): Use the average reference neighbor loss as a differentiable surrogate:

$$L(\omega) = \frac{\sum_{i \in N(x)} K(x_i, x)\,\ell(f(x_i, \omega), y_i)}{\sum_{i \in N(x)} K(x_i, x)}$$

and perform gradient updates $\omega \leftarrow \omega - \lambda \nabla_\omega L(\omega)$, typically for $T = 10$ steps.
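The NGD loop itself is generic; a minimal sketch with a caller-supplied gradient function (in the real system, autodiff through the MoE forward pass on the retrieved neighbors) might look like:

```python
import numpy as np

def neighborhood_gradient_descent(omega, grad_fn, lr=0.1, steps=10):
    """T gradient steps on the surrogate loss L(omega). grad_fn is a
    hypothetical callable returning dL/domega; in practice it would come
    from backprop through the model on the neighbor samples."""
    for _ in range(steps):
        omega = omega - lr * grad_fn(omega)
    return omega
```

For instance, with a toy quadratic surrogate $L(\omega) = \|\omega - \omega^*\|^2$ the iterate contracts geometrically toward $\omega^*$ at rate $(1 - 2\lambda)$ per step.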

All methods restrict optimization to a relevant subset of pathway weights (see Section 5) and avoid test-time reliance on true labels.

3. “Successful Neighbor” Retrieval and Kernel Structure

C3PO retrieves neighbors in either embedding space or pathway space:

  • Embeddings $E(x)$ are precomputed for the entire reference set and indexed for kNN retrieval (e.g., via FAISS).
  • A Gaussian kernel $K(x_i, x) = \exp\!\left(-d(E(x_i), E(x))^2 / (2\sigma^2)\right)$ defines the affinity.
  • To avoid leakage, near-duplicate questions are excluded.
  • Empirically, a small neighborhood ($k = 3$) offers the best trade-off: too few neighbors under-regularize, too many dilute the surrogate signal.
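The retrieval step above can be sketched with brute-force NumPy kNN; a production system would use an ANN index such as FAISS, and the function name here is an assumption.

```python
import numpy as np

def retrieve_successful_neighbors(x_emb, ref_embs, k=3, sigma=1.0):
    """Return indices of the k nearest reference samples in embedding
    space together with their Gaussian-kernel affinities K(x_i, x)."""
    d2 = ((ref_embs - x_emb) ** 2).sum(axis=1)   # squared Euclidean distances
    idx = np.argsort(d2)[:k]                     # kNN in embedding space
    weights = np.exp(-d2[idx] / (2 * sigma ** 2))
    return idx, weights
```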

4. C3PO Procedure: Algorithmic Outline

Input: Test sample $x$, pretrained MoE LLM $f(\cdot)$, reference set $\{(x_i, y_i, \omega_i)\}$, embedding model $E(\cdot)$.

Algorithmic steps:

  1. Compute the embedding $E(x)$; retrieve $N(x)$, the kNN in embedding space.
  2. Initialize $\omega$ from the pretrained router ($\omega = \text{router}(x)$).
  3. Select critical layers ($L_c$) and core experts per layer ($E_c$).
  4. Optimize $\omega_{L_c, E_c}$ via the chosen surrogate (mode-finding, kernel regression, or NGD):
    • For the mode-finding and kernel surrogates, compute the average pathway and interpolate.
    • For NGD, minimize the neighborhood-based surrogate loss with $T$ gradient steps.
  5. Stop by proxy-loss criterion or after a fixed number of iterations.
  6. Output: $f(x, \omega_{\text{opt}})$.

This design achieves strong performance gains while containing computational cost by reducing the search space to a manageable subset of pathway weights.
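Putting these steps together, a minimal end-to-end sketch using the kernel-regression surrogate could look as follows; all names are hypothetical, and the critical-layer/core-expert restriction is applied as an index mask.

```python
import numpy as np

def c3po_adapt(x_emb, ref_embs, ref_omegas, omega_init,
               crit_layers, core_experts, k=3, sigma=1.0, alpha=0.5):
    """Retrieve neighbors, average their pathways, and re-mix only the
    critical-layer / core-expert block of the routing weights."""
    d2 = ((ref_embs - x_emb) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]                     # step 1: kNN retrieval
    w = np.exp(-d2[idx] / (2 * sigma ** 2))
    omega_hat = (w[:, None, None] * ref_omegas[idx]).sum(axis=0) / w.sum()
    omega = omega_init.copy()                    # step 2: router initialization
    sub = np.ix_(crit_layers, core_experts)      # step 3: (L_c, E_c) restriction
    omega[sub] = alpha * omega[sub] + (1 - alpha) * omega_hat[sub]  # step 4
    return omega                                 # step 6: feed f(x, omega)
```

Entries outside the restricted block keep the router's default weights, which is what keeps the per-sample search space small.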

5. Critical Layer and Core Expert Selection

Extensive ablation reveals:

  • Critical Layers: Accuracy gains concentrate in the last $L_c$ layers (e.g., the last 5 of 16 in OLMoE). Optimizing early layers, or all layers, yields negligible added benefit.
  • Core Experts: Within each critical layer, only the top-$N$ experts (by router score, typically $N = 8$ to $20$) require optimization; coverage is nearly complete at $N = 20$ (99.8% of the eventual top-8 selection).
  • Token Position: Optimizing only the routing weights of the last token maximizes gains; multi-token or early-token optimization is less effective.

This targeted optimization reduces both computational and memory cost at test time and explains the practicality of C3PO, even in large MoE LLMs.
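The restriction described above can be expressed as a simple selection over router scores; this sketch assumes a dense (layers × experts) score matrix and hypothetical function names.

```python
import numpy as np

def select_restriction(router_scores, n_critical=5, n_core=20):
    """Pick the last n_critical layers and, within each, the n_core
    top-scoring experts (illustrative sketch of the (L_c, E_c) choice)."""
    L, E = router_scores.shape
    crit_layers = np.arange(max(0, L - n_critical), L)
    core_experts = {int(l): np.argsort(router_scores[l])[::-1][:n_core]
                    for l in crit_layers}
    return crit_layers, core_experts
```

With OLMoE-like dimensions (16 layers, 64 experts) this leaves 5 × 20 = 100 routing weights to optimize per sample.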

6. Empirical Evaluation and Ablation

On OLMoE (16 layers, 64 experts/layer, 1.3B active params) and DeepSeekMoE (28 layers, 2.8B active), C3PO yields:

  • OLMoE: 69.9% base → 79.2% (+9.3%) via NGD, +7.0% by kernel regression.
  • DeepSeekMoE: 66.4% base → 74.4% (+8.0%).
  • ARC-C: up to +15% gain, exceeding 7B–9B parameter dense LLMs in efficiency-adjusted accuracy.
  • In-context/prompt tuning baselines: C3PO outperforms by 4–8% (absolute).

Ablation shows optimal performance with:

  • $k = 3$ neighbors;
  • 10 NGD steps;
  • Gaussian kernel;
  • NV-Embed-V2 for retrieval quality;
  • Last 5 layers and top-20 experts;
  • fewer than 5% of cases risk flipping a correct prediction to an incorrect one early in optimization.

7. Implementation, Complexity, and Extensions

  • Reference set: Embeddings and pathway weights are stored offline; kNN retrieval costs $O(\log m)$ per test sample.
  • Optimization: Restricting the search to $L_c \times E_c$ weights (e.g., 5 layers × 20 experts = 100 parameters) enables per-sample adaptation within ~1,000 scalar updates.
  • Computation: Overhead is negligible relative to forward inference, as gradient-free surrogates avoid full backward passes.
  • Extensions: Unsupervised neighbor selection, hybrid surrogates, online meta-learning of loss interpolation, and dynamic adaptation to other sparse architectures (e.g., Mixture-of-Mixture, conditional computation).
  • Limitations: Requires a well-curated labeled reference set and reliable retrieval embedding; performance is tied to neighbor fidelity.

By leveraging critical-layer, core-expert restriction and reference-based collaborative surrogates, C3PO provides a reproducible, scalable blueprint for optimized expert re-mixing at test time in large-scale MoE LLMs. Its demonstrated efficiency and improvement over dense models and classical adaptation baselines significantly broaden the operational frontiers of sparse modular architectures (Li et al., 10 Apr 2025).
