Autonomy-of-Experts Models (2501.13074v2)

Published 22 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train LLMs having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.

Summary

  • The paper introduces a novel MoE paradigm where experts autonomously select tokens based on internal activation norms, eliminating the need for a separate router.
  • It employs low-rank weight factorization to optimize activation pre-computation, enhancing efficiency and effective expert selection.
  • Experiments on models up to 4 billion parameters show that AoE improves load balancing and accuracy across various language tasks compared to traditional MoE setups.

The paper introduces Autonomy-of-Experts (AoE), a novel Mixture-of-Experts (MoE) paradigm designed to address the limitations arising from the separation between the router's decision-making process and the experts' execution in traditional MoE models. The authors posit that this separation leads to suboptimal expert selection and ineffective learning. AoE enables experts to autonomously select themselves for processing inputs based on their internal activation scales, thereby eliminating the need for a router.

The key insight behind AoE is the observation that an expert's capacity to effectively process a token is reflected in the scale of its internal activations. In AoE, all experts pre-compute internal activations for each input token, and then they are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. To mitigate the overhead of pre-computing activations, the paper introduces a low-rank weight factorization.

The $i$-th AoE expert is formulated as $E_{i}(\mathbf{x}) = \left(\mathrm{SiLU}\left(\mathbf{x}\mathbf{W}^{i}_{\text{down}}\mathbf{W}^{i}_{\text{up}}\right) \odot \left(\mathbf{x}\mathbf{W}^{i}_{p}\right)\right)\mathbf{W}^{i}_{o}$, where $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ is the input hidden state, $\mathbf{W}^{i}_{p} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{wide}}}$ and $\mathbf{W}^{i}_{o} \in \mathbb{R}^{d_{\text{wide}} \times d_{\text{model}}}$ are the expert weights, and $\mathbf{W}^{i}_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{low}}}$ and $\mathbf{W}^{i}_{\text{up}} \in \mathbb{R}^{d_{\text{low}} \times d_{\text{wide}}}$ are the low-rank matrices, with $d_{\text{low}} < d_{\text{model}} < d_{\text{wide}}$.
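
To make the formulation concrete, the following is a minimal PyTorch sketch of a single AoE expert with the low-rank factorization described above. The class and attribute names (`AoEExpert`, `W_down`, `W_up`, `W_p`, `W_o`) and the initialization scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AoEExpert(nn.Module):
    """Minimal sketch of one AoE expert; dimension names follow the paper,
    everything else (class/attribute names, init scheme) is illustrative."""

    def __init__(self, d_model: int, d_wide: int, d_low: int):
        super().__init__()
        # Low-rank factorization of the gate projection: W_g ~= W_down @ W_up
        self.W_down = nn.Parameter(torch.randn(d_model, d_low) / d_model ** 0.5)
        self.W_up = nn.Parameter(torch.randn(d_low, d_wide) / d_low ** 0.5)
        self.W_p = nn.Parameter(torch.randn(d_model, d_wide) / d_model ** 0.5)
        self.W_o = nn.Parameter(torch.randn(d_wide, d_model) / d_wide ** 0.5)

    def cache_activation(self, x: torch.Tensor) -> torch.Tensor:
        # Cheap pre-computation used for self-evaluation: x @ W_down -> (..., d_low)
        return x @ self.W_down

    def forward(self, x: torch.Tensor, cached: torch.Tensor | None = None) -> torch.Tensor:
        # Full forward pass E_i(x); reuses the cached low-rank activation if given.
        a = self.cache_activation(x) if cached is None else cached
        gate = F.silu(a @ self.W_up)              # SiLU(x W_down W_up)
        return (gate * (x @ self.W_p)) @ self.W_o
```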

The paper details the architecture of an AoE layer. Initially, the input vectors are compressed into low-dimensional activations via $\mathbf{W}^{i}_{\text{down}}$. These activations are cached, and their $L^2$ norms are used to rank the experts. For a given input, the experts with the top-$K$ norms continue the forward computation using the cache, while the unchosen experts are terminated. To further enhance efficiency, the activation cache is computed with a single matrix multiplication: $\hat{\mathbf{W}}_{\text{down}} = [\mathbf{W}^{1}_{\text{down}}, \cdots, \mathbf{W}^{n}_{\text{down}}] \in \mathbb{R}^{d_{\text{model}} \times (n\,d_{\text{low}})}$ and $\mathbf{C} = \mathbf{x}\hat{\mathbf{W}}_{\text{down}}$, where $\mathbf{C} \in \mathbb{R}^{n\,d_{\text{low}}}$ is then reshaped into an $n \times d_{\text{low}}$ matrix for the subsequent computations.
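
A sketch of this router-free selection step, built on the `AoEExpert` class above, is shown below. The names are again hypothetical, and the combination of expert outputs is simplified to a plain sum; all experts' $\mathbf{W}_{\text{down}}$ matrices are concatenated, $\mathbf{C}$ is produced by one matrix multiplication, experts are ranked by the $L^2$ norms of their slices of $\mathbf{C}$, and only the top-$K$ experts finish the forward pass from the cache.

```python
class AoELayer(nn.Module):
    """Sketch of an AoE layer built from the AoEExpert modules above
    (hypothetical names; output combination simplified to an unweighted sum)."""

    def __init__(self, experts: list[AoEExpert], top_k: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.top_k = top_k
        self.n = len(experts)
        self.d_low = experts[0].W_down.shape[1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        # \hat{W}_down = [W^1_down, ..., W^n_down] in R^{d_model x (n * d_low)}
        W_hat_down = torch.cat([e.W_down for e in self.experts], dim=1)
        # C = x \hat{W}_down, reshaped to (tokens, n, d_low) and cached
        C = (x @ W_hat_down).view(x.shape[0], self.n, self.d_low)
        norms = C.norm(dim=-1)                             # L2 norm per expert
        top_idx = norms.topk(self.top_k, dim=-1).indices   # self-selected experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top_idx == i).any(dim=-1)              # tokens choosing expert i
            if mask.any():
                out[mask] += expert(x[mask], cached=C[mask, i])
        return out
```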

The authors pre-trained AoE LLMs with up to 4 billion parameters and demonstrated that AoE outperforms traditional MoE models on downstream tasks while maintaining comparable efficiency. The advantages of AoE include improved expert selection, more specialized experts, and more effective training.

The paper presents a background on MoE, focusing on sparse MoE models where each feed-forward network (FFN) module acts as an expert. The $i$-th expert within a layer is represented as $E_{i}(\mathbf{x}) = \left(\mathrm{SiLU}(\mathbf{x}\mathbf{W}^{i}_{g}) \odot (\mathbf{x}\mathbf{W}^{i}_{p})\right)\mathbf{W}^{i}_{o}$, where $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ is the input hidden state, and $\mathbf{W}^{i}_{g}, \mathbf{W}^{i}_{p} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ffn}}}$ and $\mathbf{W}^{i}_{o} \in \mathbb{R}^{d_{\text{ffn}} \times d_{\text{model}}}$ are the expert weights. The router determines which expert processes which hidden state. The paper notes that a challenge faced by MoE is imbalanced expert load, which is addressed using a load-balancing loss: $\mathcal{L}_{\text{aux}} = \alpha_{\text{aux}} \cdot n \cdot \sum^{n}_{i=1} \mathbf{f}_{i} \cdot \mathbf{P}_{i}$, where $\mathbf{f}_{i} = \frac{1}{T}\sum_{\mathbf{x}\in\mathcal{B}} \mathbb{1}\left\{i \in \operatorname{argtopK}\left(R(\mathbf{x})\right)\right\}$ and $\mathbf{P}_{i} = \frac{1}{T}\sum_{\mathbf{x}\in\mathcal{B}} \operatorname{Softmax}\left(R(\mathbf{x})\right)[i]$.
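
As a point of reference for this background, here is a minimal sketch of the standard auxiliary load-balancing loss defined above; the function name and the default coefficient value are illustrative assumptions.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, top_k: int,
                        alpha_aux: float = 0.01) -> torch.Tensor:
    """Sketch of L_aux = alpha_aux * n * sum_i f_i * P_i for a batch of T tokens.
    `router_logits` holds the raw router scores R(x) with shape (T, n)."""
    T, n = router_logits.shape
    probs = router_logits.softmax(dim=-1)                    # Softmax(R(x))
    top_idx = router_logits.topk(top_k, dim=-1).indices      # argtopK(R(x))
    # f_i: fraction of tokens whose top-K set contains expert i
    f = torch.bincount(top_idx.reshape(-1), minlength=n).float() / T
    # P_i: mean routing probability assigned to expert i
    P = probs.mean(dim=0)
    return alpha_aux * n * (f * P).sum()
```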

The paper presents an experiment where routers are removed from pre-trained MoE-LLMs, and experts are selected during inference based on internal activation norms. This experiment demonstrates that the performance of pre-trained LLMs can be largely preserved without any parameter updates, provided the activations at suitable internal points within the experts are used for the ranking.

Ablation studies were conducted on 732M-parameter LLMs (with 247M active parameters) trained on 100 billion tokens. The results indicate that AoE configurations generally outperform traditional MoE setups in average accuracy across various tasks, including ARC-E, PIQA, SIQA, WINO, HELLA, MNLI, QNLI, and SST2. The paper also examines the impact of varying $d_{\text{low}}$, finding that performance peaks when $d_{\text{low}}$ is approximately one-third of $d_{\text{model}}$. The paper further explores the compatibility of AoE with different expert-selection strategies, such as top-$P$ and expert-choice selection, demonstrating AoE's versatility.

The paper analyzes the load balancing of AoE, revealing that AoE improves load balancing compared to traditional MoE models and exhibits stronger confidence in expert selection. The confidence entropy is defined as $\text{Ent}_{\text{conf}} = -\sum_{i=1}^{n} \mathbf{p}_{i} \log \mathbf{p}_{i}$, where $\mathbf{p}_{i}$ is obtained either from $\operatorname{Softmax}$ over the $L^2$ norms of $\mathbf{x}\mathbf{W}^{i}_{\text{down}}$ for AoE or from $\operatorname{Softmax}\left(R(\mathbf{x})\right)$ for traditional MoE.
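
A minimal sketch of this confidence-entropy computation (hypothetical function name) follows; it applies equally to AoE norms and router scores.

```python
import torch

def selection_confidence_entropy(scores: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the expert-selection distribution.
    `scores` is (tokens, n): L2 norms of x W^i_down for AoE, or router
    scores R(x) for a traditional MoE; lower entropy = higher confidence."""
    p = scores.softmax(dim=-1)                          # p_i over n experts
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)  # Ent_conf
```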

The authors examined whether the improvements stem from the factorization of $\mathbf{W}_{g}$, finding that the factorization itself does not significantly influence performance. They also examined the impact of involving more parameters in expert selection and found that the improvement in AoE is not primarily due to this factor.

The paper also analyzes whether experts share aligned self-evaluation criteria, showing that experts within the same layer produce activations of similar scale. The efficiency of AoE is evaluated in terms of throughput and memory usage, demonstrating that AoE achieves up to 97% of the throughput of traditional MoE models.

Finally, the paper presents results from pre-training LLMs with 4 billion parameters. These results show that AoE outperforms traditional MoE models as they scale, with the performance improvement being more pronounced in the larger models than in the smaller ones.
