Super Experts in MoE LLMs & Gossip Protocols

Updated 5 December 2025
  • Super Experts (SEs) are defined by strict activation thresholds in early MoE layers and as agents with complete higher-order knowledge in gossip protocols.
  • In MoE LLMs, SEs are extremely rare yet crucial, with their removal causing dramatic performance degradation in both general and reasoning tasks.
  • Preserving SEs through dedicated profiling and exclusion from compression safeguards MoE model capability at negligible cost; in gossip protocols, super-expert conditions guarantee epistemic convergence in distributed systems.

Super experts (SEs) are a concept shared by large-scale machine learning systems and epistemic gossip protocols, denoting agents or components whose influence or knowledge is exceptional and system-critical. In mixture-of-experts (MoE) LLMs, SEs are a minute subset of experts whose rare but extreme activations are mechanistically essential for overall model performance. In gossip protocols, super experts are agents who possess not only complete individual knowledge but also the higher-order knowledge that every agent in the system is likewise an expert. Both definitions are characterized by stringent conditions and by empirically observable effects on system efficiency, performance, and termination guarantees (Su et al., 31 Jul 2025, Ditmarsch et al., 2020).

1. Formal Definitions and Identification Criteria

In MoE LLMs, the formalism for SEs arises from activation profiling. Let $A = \{a_{\ell,e}\}$ denote the set of maximum absolute down_proj activations for every expert $e$ across all MoE layers $\ell$ that are candidates for massive activations. Denoting:

  • $P_{99.5} = \mathrm{Percentile}_{99.5}(A)$,
  • $a_{\max} = \max A$,
  • $L_m$ = the subset of early MoE layers where massive activations emerge,

an expert $e$ in layer $\ell$ is a super expert if and only if

$$a_{\ell,e} > P_{99.5}(A) \;\land\; a_{\ell,e} > \tfrac{1}{10}\,a_{\max} \;\land\; \ell \in L_m.$$

This identification is operationalized via a four-step profiling routine:

  1. For each $(\ell, e)$, run a batch of inputs, recording $a_{\ell,e} = \max_{\text{tokens, hidden dims}} |\mathrm{down\_proj}(H^{\ell}_e)|$.
  2. Accumulate all $a_{\ell,e}$ into $A$; compute $P_{99.5}$ and $a_{\max}$.
  3. Assign SE status to $(\ell, e)$ if all three criteria above are satisfied (see the sketch below).
  4. An automated SE-profiling tool from Zunhai Su et al. is available at github.com/ZunhaiSu/Super-Experts-Profiling.
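
As a concrete illustration, the sketch below implements the thresholding step in Python. It assumes the maximum activations have already been collected into a dictionary keyed by (layer, expert) and that the candidate-layer set $L_m$ is supplied by the caller; names and data layout are illustrative rather than the authors' implementation.

```python
import numpy as np

def identify_super_experts(max_acts, massive_activation_layers,
                           percentile=99.5, max_fraction=0.1):
    """Flag super experts from pre-collected activation statistics.

    max_acts: dict mapping (layer, expert) -> max |down_proj| activation
              observed over all tokens and hidden dimensions.
    massive_activation_layers: early MoE layer indices L_m where
              massive activations emerge.
    """
    A = np.array(list(max_acts.values()))
    p_thresh = np.percentile(A, percentile)   # P_99.5(A)
    a_max = A.max()                           # a_max

    return [(layer, expert) for (layer, expert), a in max_acts.items()
            if a > p_thresh
            and a > max_fraction * a_max
            and layer in massive_activation_layers]

# Synthetic example: two extreme outliers planted in an early layer.
rng = np.random.default_rng(0)
stats = {(l, e): float(rng.normal(1.0, 0.1))
         for l in range(24) for e in range(64)}
stats[(1, 7)], stats[(1, 42)] = 250.0, 180.0
print(identify_super_experts(stats, massive_activation_layers={0, 1, 2}))
# -> [(1, 7), (1, 42)]
```

Note that the percentile test alone would also flag the largest handful of ordinary activations; the $\tfrac{1}{10}a_{\max}$ and layer-membership conditions are what isolate the genuine outliers.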

In epistemic gossip protocols, super experts are agents $a$ satisfying $SE_a := K_a\,\mathrm{Exp}_A$, where $K_a$ denotes $a$'s knowledge operator and $\mathrm{Exp}_A = \bigwedge_{a \in A} \mathrm{Exp}_a$ denotes the state in which every agent is an expert (i.e., knows all secrets). Hence, $a$ is a super expert when it knows that everyone in $A$ is an expert. Super-expert protocols terminate only when $E\,\mathrm{Exp}_A := \bigwedge_{a \in A} K_a\,\mathrm{Exp}_A$ holds: all agents have achieved this higher-order knowledge (Ditmarsch et al., 2020).
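
To make these definitions concrete, here is a minimal Python model of synchronous gossip, under the assumption that a call pools both participants' secrets and everything each knows about other agents' knowledge; the `view` structure is a lower-bound simplification of the paper's epistemic semantics, not a reimplementation of it.

```python
def make_views(n):
    """view[a][b] = the set of secrets that agent a knows agent b knows.
    Initially each agent knows exactly its own secret."""
    return [[{b} if b == a else set() for b in range(n)] for a in range(n)]

def call(view, a, b):
    """Synchronous call between a and b: they pool their secrets and
    everything they know about other agents' knowledge."""
    n = len(view)
    pooled = view[a][a] | view[b][b]          # secrets both now know
    for c in range(n):
        shared = view[a][c] | view[b][c]      # pooled third-party knowledge
        view[a][c], view[b][c] = set(shared), set(shared)
    # Each participant now knows the other holds the pooled secret set.
    for x, y in ((a, a), (a, b), (b, a), (b, b)):
        view[x][y] = set(pooled)

def is_expert(view, a):
    return len(view[a][a]) == len(view)

def is_super_expert(view, a):
    # SE_a := K_a Exp_A -- a knows that every agent knows all secrets.
    return all(len(view[a][b]) == len(view) for b in range(len(view)))

n = 3
view = make_views(n)
for a, b in [(0, 1), (1, 2), (0, 2)]:
    call(view, a, b)
print([is_expert(view, a) for a in range(n)])        # [True, True, True]
print([is_super_expert(view, a) for a in range(n)])  # [True, False, True]
call(view, 0, 1)  # one more call turns agent 1 into a super expert
print(all(is_super_expert(view, a) for a in range(n)))  # True
```

The three-call run illustrates the gap between $\mathrm{Exp}_A$ and $E\,\mathrm{Exp}_A$: all three agents are experts, but agent 1 does not yet know that agent 0 is, so one further call is needed.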

2. Distribution, Rarity, and System-Invariance

Empirical results establish that SEs are rare and highly localized. For MoE LLMs:

  • Proportion: SEs account for only 0.05–0.39% of all experts across architectures (e.g., Qwen3-30B-A3B: 3 out of 6,144; DeepSeek-R1: 10 out of 15,677; Mixtral-8×7B-Instruct: 1 out of 256).
  • Layer localization: SEs almost always arise in the first 1–3 MoE layers, coincident with the emergence of massive activations.
  • Model-specificity and invariance: The SE set remains unchanged across post-training modifications (e.g., base vs. RLHF-tuned variants of Qwen3-30B-A3B, DeepSeek-V2-Lite vs. -Lite-Chat).
  • Input domain invariance: Profiling across diverse datasets (e.g., C4, WikiText-2, GSM8K, HumanEval, C-Eval) identifies the same SEs (Su et al., 31 Jul 2025); a sketch of such a check follows this list.
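
One way to check this input-domain invariance in practice is to profile each dataset separately and compare the resulting SE sets. In the sketch below, `profile_fn` is any SE profiler, such as the `identify_super_experts` helper sketched in Section 1; the data layout is illustrative.

```python
def check_se_invariance(per_dataset_stats, profile_fn):
    """per_dataset_stats: dataset name -> activation statistics in
    whatever form profile_fn expects (e.g., a (layer, expert) ->
    max-activation dict). Returns (invariant?, SE set per dataset)."""
    se_sets = {name: frozenset(profile_fn(stats))
               for name, stats in per_dataset_stats.items()}
    return len(set(se_sets.values())) == 1, se_sets
```

Per the reported results, running this over statistics gathered from C4, WikiText-2, GSM8K, HumanEval, and C-Eval should return True for the models studied.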

In gossip protocols, the property of super expertise is similarly robust; protocols designed for super-expert convergence (such as ANY and PIG) ensure that all agents attain $SE_a$, irrespective of the call sequence, provided the protocol's fairness or maximality conditions are satisfied (Ditmarsch et al., 2020).

3. Mechanistic Role and Performance Impact

In MoE LLMs, the ablation of SEs reveals their unique functional necessity:

  • In non-reasoning tasks, SE pruning causes dramatic model degradation: Qwen3-30B-A3B’s perplexity on WikiText-2 increases from 8.70 to 59.86, with accuracy drops of 21.7% (QA) and ~52.7% (GSM8K). Similar catastrophic collapses are found in DeepSeek and Mixtral models.
  • For reasoning and code tasks, the effect is decisive: DeepSeek-R1 5-shot GPQA Diamond Pass@1 plummets from 71.5% to 5.05% (–93%), and Pass@1 on Math-500, AIME’24/’25, LiveCodeBench, and HumanEval approaches zero after SE pruning (Su et al., 31 Jul 2025).

Mechanistically, SEs are responsible for producing the "massive activations" that create "attention sinks": tokens (typically the first) that attract almost all attention mass across heads. Metrics such as the attention-sink decay rate $D_{\text{sink}}$ show empirically that SE removal leads to 90–100% decay (i.e., the sink disappears), and output heatmaps confirm the loss of the attention sink following SE pruning. This effect is not observed with random expert pruning (Su et al., 31 Jul 2025).
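
The source quotes decay percentages without reproducing a formula; a plausible formalization, assuming $D_{\text{sink}}$ measures the relative loss of attention mass on the sink token after an intervention such as SE pruning, is sketched below (the paper's exact definition may differ).

```python
import numpy as np

def sink_mass(attn):
    """Mean attention mass on the first token (the sink), averaged over
    heads and query positions. attn: (heads, queries, keys), rows sum to 1."""
    return float(attn[:, :, 0].mean())

def sink_decay_rate(attn_before, attn_after):
    """Relative loss of sink attention after an intervention;
    1.0 means the sink has vanished entirely."""
    return 1.0 - sink_mass(attn_after) / sink_mass(attn_before)

# Synthetic illustration: a strong sink collapses to uniform attention.
heads, q, k = 8, 16, 16
before = np.full((heads, q, k), 0.1 / (k - 1))
before[:, :, 0] = 0.9                       # 90% of mass on the sink token
after = np.full((heads, q, k), 1.0 / k)     # uniform attention, no sink
print(f"D_sink = {sink_decay_rate(before, after):.2f}")  # -> 0.93
```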

In gossip protocols, super expertise corresponds to exhaustive epistemic closure. The transition from all-expert status ($\mathrm{Exp}_A$) to all-super-expert status ($E\,\mathrm{Exp}_A$) ensures the shared higher-order knowledge critical for distributed systems requiring higher-order agreement. Protocols not designed for this convergence (e.g., CMO without common knowledge) can fail to reach $E\,\mathrm{Exp}_A$, while super-successful protocols (ANY, PIG) guarantee it (Ditmarsch et al., 2020).

4. Protocols, Algorithms, and Quantitative Bounds

In MoE LLMs, the SE profiling and preservation pipeline is explicit:

  • Profile SEs using percentile-and-fraction thresholds.
  • Exclude identified SEs from all compression or quantization.
  • Subject non-SE experts to standard compression (e.g., Merge & Prune, MoEQuant); a sketch of this pipeline follows.
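
A minimal sketch of the preservation logic, assuming profiling has produced a set of SE identifiers; `quantize_expert` is a hypothetical callable standing in for any compression backend, not a named library API.

```python
def compress_moe(model_experts, super_experts, quantize_expert):
    """Compress every expert except the profiled super experts.

    model_experts: dict mapping (layer, expert) -> weight tensors.
    super_experts: set of (layer, expert) ids from the profiling step.
    quantize_expert: callable applying the chosen compression
                     (pruning, low-bit quantization, merging, ...).
    """
    compressed = {}
    for key, weights in model_experts.items():
        if key in super_experts:
            compressed[key] = weights            # keep SEs at full precision
        else:
            compressed[key] = quantize_expert(weights)
    return compressed
```

The design point is that SE exclusion is a set-membership check layered over any existing compression routine, which is why it adds essentially no overhead.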

For gossip protocols, three canonical protocols are analyzed:

  • ANY: Permits any call at any time. Super-successful both synchronously and asynchronously; may require up to $O(n^2)$ (specifically $n-2+\binom{n}{2}$) calls for termination without engaged agents, and $O(n)$ (specifically $3n-4$) with engaged agents, i.e., agents who cease participation upon becoming super experts.
  • PIG: Only allows calls when there is uncertainty over secret knowledge. Provides a tight epistemic invariant and is always super-successful.
  • CMO: Allows each pair to call only once; super-successful only if common knowledge is assumed and the protocol is not combined with engaged-agent status, under which it can fail, as shown by counterexample (Ditmarsch et al., 2020). A simulation sketch of ANY follows this list.
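
The following sketch runs random executions of ANY under the view-merging call model from the earlier gossip sketch and counts calls until $E\,\mathrm{Exp}_A$ holds. Two assumptions to note: engaged agents are modeled as no longer placing calls while still answering them (one possible reading of engagement), and random scheduling admits uninformative calls, so the printed counts are samples rather than verifications of the quoted worst-case figures.

```python
import random

def simulate_any(n, engaged=False, seed=0):
    """Count calls in a random ANY execution until every agent is a
    super expert. view[a][b] = secrets agent a knows that agent b knows;
    a call merges both participants' views."""
    rng = random.Random(seed)
    view = [[{b} if b == a else set() for b in range(n)] for a in range(n)]
    se = lambda a: all(len(view[a][b]) == n for b in range(n))

    calls = 0
    while not all(se(a) for a in range(n)):
        a, b = rng.sample(range(n), 2)   # a initiates, b answers
        if engaged and se(a):
            continue                     # engaged agents place no calls
        pooled = view[a][a] | view[b][b]
        for c in range(n):
            merged = view[a][c] | view[b][c]
            view[a][c], view[b][c] = set(merged), set(merged)
        for x, y in ((a, a), (a, b), (b, a), (b, b)):
            view[x][y] = set(pooled)     # each knows the other's new secrets
        calls += 1
    return calls

n = 8
print("ANY, no engagement:", simulate_any(n),
      "calls (quoted worst case: n-2+C(n,2) =", n - 2 + n * (n - 1) // 2, ")")
print("ANY, engaged agents:", simulate_any(n, engaged=True),
      "calls (quoted worst case: 3n-4 =", 3 * n - 4, ")")
```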

Table: SE Distribution in MoE LLMs

| Model | Total Experts | # SEs | SE % |
|---|---|---|---|
| Qwen3-30B-A3B | 6,144 | 3 | 0.049% |
| DeepSeek-R1 | 15,677 | 10 | 0.064% |
| Mixtral-8×7B-Instruct | 256 | 1 | 0.39% |

5. Practical Guidelines and Consequences

For MoE LLM compression and efficiency:

  • Always perform SE profiling before compression or quantization.
  • Tag and exclude SEs from all destructive expert-level operations.
  • SEs constitute less than 0.5% of all experts; preserving them imposes minimal efficiency loss while averting dramatic losses of model capability.
  • Aggressively prune/quantize non-SE experts using prior heuristics, but maintain SE-aware pipelines for robust model performance (Su et al., 31 Jul 2025).

In gossip network design:

  • Protocol selection and agent engagement rules can dramatically reduce protocol run time from quadratic ($O(n^2)$) to linear ($O(n)$).
  • Engaged agents, i.e., agents who cease calling and responding once they become super experts, are crucial for increasing efficiency in suitable protocols (ANY, PIG), but can prevent full termination in others (CMO) (Ditmarsch et al., 2020).

6. Key Takeaways

  • Super experts in MoE LLMs are rare, extreme-activation experts, demarcated by stringent percentile and absolute thresholds within specific layers; their role is irreplaceable for both general and reasoning-heavy model outputs.
  • In classical and distributed epistemic protocols, super experts denote agents with system-wide higher-order knowledge, essential for protocol correctness when strong epistemic termination conditions are required.
  • The methodological frameworks for identifying and utilizing SEs are now codified for both model compression (SE-aware preservation) and protocol analysis (epistemic conditions, engagement).
  • The removal or mishandling of SEs—in either context—has quantifiable, catastrophic consequences for system performance, efficiency, or epistemic guarantees (Su et al., 31 Jul 2025, Ditmarsch et al., 2020).