Autonomy-of-Experts Models (2501.13074v2)

Published 22 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train LLMs having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.

Summary

  • The paper introduces a novel MoE paradigm where experts autonomously select tokens based on internal activation norms, eliminating the need for a separate router.
  • It employs low-rank weight factorization to optimize activation pre-computation, enhancing efficiency and effective expert selection.
  • Experiments on models up to 4 billion parameters show that AoE improves load balancing and accuracy across various language tasks compared to traditional MoE setups.

The paper introduces Autonomy-of-Experts (AoE), a novel Mixture-of-Experts (MoE) paradigm designed to address the limitations arising from the separation between the router's decision-making process and the experts' execution in traditional MoE models. The authors posit that this separation leads to suboptimal expert selection and ineffective learning. AoE enables experts to autonomously select themselves for processing inputs based on their internal activation scales, thereby eliminating the need for a router.

The key insight behind AoE is the observation that an expert's capacity to effectively process a token is reflected in the scale of its internal activations. In AoE, all experts pre-compute internal activations for each input token, and then they are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. To mitigate the overhead of pre-computing activations, the paper introduces a low-rank weight factorization.

The $i$-th AoE expert is formulated as $E_{i}(\mathbf{x}) = \left(\mathrm{SiLU}\left(\mathbf{x}\mathbf{W}^{i}_{\text{down}}\mathbf{W}^{i}_{\text{up}}\right) \odot \left(\mathbf{x}\mathbf{W}^{i}_{p}\right)\right)\mathbf{W}^{i}_{o}$, where $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ is the input hidden state, $\mathbf{W}^{i}_{p} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{wide}}}$ and $\mathbf{W}^{i}_{o} \in \mathbb{R}^{d_{\text{wide}} \times d_{\text{model}}}$ are the expert weights, and $\mathbf{W}^{i}_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{low}}}$ and $\mathbf{W}^{i}_{\text{up}} \in \mathbb{R}^{d_{\text{low}} \times d_{\text{wide}}}$ are the low-rank matrices, with $d_{\text{low}} < d_{\text{model}} < d_{\text{wide}}$.
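
To make the formulation concrete, the following is a minimal PyTorch sketch of a single AoE expert with the low-rank factorization described above. The class and attribute names (`AoEExpert`, `W_down`, `W_up`, `W_p`, `W_o`) and the initialization scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AoEExpert(nn.Module):
    """Minimal sketch of one AoE expert; dimension names follow the paper,
    everything else (class/attribute names, init scheme) is illustrative."""

    def __init__(self, d_model: int, d_wide: int, d_low: int):
        super().__init__()
        # Low-rank factorization of the gate projection: W_g ~= W_down @ W_up
        self.W_down = nn.Parameter(torch.randn(d_model, d_low) / d_model ** 0.5)
        self.W_up = nn.Parameter(torch.randn(d_low, d_wide) / d_low ** 0.5)
        self.W_p = nn.Parameter(torch.randn(d_model, d_wide) / d_model ** 0.5)
        self.W_o = nn.Parameter(torch.randn(d_wide, d_model) / d_wide ** 0.5)

    def cache_activation(self, x: torch.Tensor) -> torch.Tensor:
        # Cheap pre-computation used for self-evaluation: x @ W_down -> (..., d_low)
        return x @ self.W_down

    def forward(self, x: torch.Tensor, cached: torch.Tensor | None = None) -> torch.Tensor:
        # Full forward pass E_i(x); reuses the cached low-rank activation if given.
        a = self.cache_activation(x) if cached is None else cached
        gate = F.silu(a @ self.W_up)              # SiLU(x W_down W_up)
        return (gate * (x @ self.W_p)) @ self.W_o
```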

The paper details the architecture of an AoE layer. Initially, the input vectors are compressed into low-dimensional activations via $\mathbf{W}^{i}_{\text{down}}$. These activations are cached, and their $L^2$ norms are used to rank the experts. For a given input, the experts with the top-$K$ norms continue the forward computation using the cache, while the unchosen experts are terminated. To further enhance efficiency, the activation cache is computed with a single matrix multiplication: $\hat{\mathbf{W}}_{\text{down}} = [\mathbf{W}^{1}_{\text{down}}, \cdots, \mathbf{W}^{n}_{\text{down}}] \in \mathbb{R}^{d_{\text{model}} \times (n\,d_{\text{low}})}$ and $\mathbf{C} = \mathbf{x}\hat{\mathbf{W}}_{\text{down}}$, where $\mathbf{C} \in \mathbb{R}^{n\,d_{\text{low}}}$ is then reshaped into an $n \times d_{\text{low}}$ matrix for the subsequent computations.
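
A sketch of this router-free selection step, built on the `AoEExpert` class above, is shown below. The names are again hypothetical, and the combination of expert outputs is simplified to a plain sum; all experts' $\mathbf{W}_{\text{down}}$ matrices are concatenated, $\mathbf{C}$ is produced by one matrix multiplication, experts are ranked by the $L^2$ norms of their slices of $\mathbf{C}$, and only the top-$K$ experts finish the forward pass from the cache.

```python
class AoELayer(nn.Module):
    """Sketch of an AoE layer built from the AoEExpert modules above
    (hypothetical names; output combination simplified to an unweighted sum)."""

    def __init__(self, experts: list[AoEExpert], top_k: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.top_k = top_k
        self.n = len(experts)
        self.d_low = experts[0].W_down.shape[1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        # \hat{W}_down = [W^1_down, ..., W^n_down] in R^{d_model x (n * d_low)}
        W_hat_down = torch.cat([e.W_down for e in self.experts], dim=1)
        # C = x \hat{W}_down, reshaped to (tokens, n, d_low) and cached
        C = (x @ W_hat_down).view(x.shape[0], self.n, self.d_low)
        norms = C.norm(dim=-1)                             # L2 norm per expert
        top_idx = norms.topk(self.top_k, dim=-1).indices   # self-selected experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top_idx == i).any(dim=-1)              # tokens choosing expert i
            if mask.any():
                out[mask] += expert(x[mask], cached=C[mask, i])
        return out
```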

The authors pre-trained AoE LLMs with up to 4 billion parameters and demonstrated that AoE outperforms traditional MoE models on downstream tasks while maintaining comparable efficiency. The advantages of AoE include improved expert selection, more specialized experts, and more effective training.

The paper presents a background on MoE, focusing on sparse MoE models where each feed-forward network (FFN) module acts as an expert. The $i$-th expert within a layer is represented as $E_{i}(\mathbf{x}) = \left(\mathrm{SiLU}(\mathbf{x}\mathbf{W}^{i}_{g}) \odot (\mathbf{x}\mathbf{W}^{i}_{p})\right)\mathbf{W}^{i}_{o}$, where $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ is the input hidden state, and $\mathbf{W}^{i}_{g}, \mathbf{W}^{i}_{p} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ffn}}}$ and $\mathbf{W}^{i}_{o} \in \mathbb{R}^{d_{\text{ffn}} \times d_{\text{model}}}$ are the expert weights. The router determines which expert processes which hidden state. The paper notes that a challenge faced by MoE is imbalanced expert load, which is addressed using a load-balancing loss: $\mathcal{L}_{\text{aux}} = \alpha_{\text{aux}} \cdot n \cdot \sum^{n}_{i=1} \mathbf{f}_{i} \cdot \mathbf{P}_{i}$, where $\mathbf{f}_{i} = \frac{1}{T}\sum_{\mathbf{x}\in\mathcal{B}} \mathbb{1}\left\{i \in \operatorname{argtopK}\left(R(\mathbf{x})\right)\right\}$ and $\mathbf{P}_{i} = \frac{1}{T}\sum_{\mathbf{x}\in\mathcal{B}} \operatorname{Softmax}\left(R(\mathbf{x})\right)[i]$.
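
As a point of reference for this background, here is a minimal sketch of the standard auxiliary load-balancing loss defined above; the function name and the default coefficient value are illustrative assumptions.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, top_k: int,
                        alpha_aux: float = 0.01) -> torch.Tensor:
    """Sketch of L_aux = alpha_aux * n * sum_i f_i * P_i for a batch of T tokens.
    `router_logits` holds the raw router scores R(x) with shape (T, n)."""
    T, n = router_logits.shape
    probs = router_logits.softmax(dim=-1)                    # Softmax(R(x))
    top_idx = router_logits.topk(top_k, dim=-1).indices      # argtopK(R(x))
    # f_i: fraction of tokens whose top-K set contains expert i
    f = torch.bincount(top_idx.reshape(-1), minlength=n).float() / T
    # P_i: mean routing probability assigned to expert i
    P = probs.mean(dim=0)
    return alpha_aux * n * (f * P).sum()
```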

The paper presents an experiment where routers are removed from pre-trained MoE-LLMs, and experts are selected during inference based on internal activation norms. This experiment demonstrates that the performance of pre-trained LLMs can be largely preserved without any parameter updates, provided the activations at suitable internal points within the experts are used for the ranking.

Ablation studies were conducted on 732M-parameter LLMs (with 247M active parameters) trained on 100 billion tokens. The results indicate that AoE configurations generally outperform traditional MoE setups in average accuracy across various tasks, including ARC-E, PIQA, SIQA, WINO, HELLA, MNLI, QNLI, and SST2. The paper also examines the impact of varying $d_{\text{low}}$, finding that performance peaks when $d_{\text{low}}$ is approximately one-third of $d_{\text{model}}$. The paper further explores the compatibility of AoE with different expert-selection strategies, such as top-$P$ and expert-choice selection, demonstrating AoE's versatility.

The paper analyzes the load balancing of AoE, revealing that AoE improves load balancing compared to traditional MoE models and exhibits stronger confidence in expert selection. The confidence entropy is defined as $\text{Ent}_{\text{conf}} = -\sum_{i=1}^{n} \mathbf{p}_{i} \log \mathbf{p}_{i}$, where $\mathbf{p}_{i}$ is obtained either from $\operatorname{Softmax}$ over the $L^2$ norms of $\mathbf{x}\mathbf{W}^{i}_{\text{down}}$ for AoE or from $\operatorname{Softmax}\left(R(\mathbf{x})\right)$ for traditional MoE.
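
A minimal sketch of this confidence-entropy computation (hypothetical function name) follows; it applies equally to AoE norms and router scores.

```python
import torch

def selection_confidence_entropy(scores: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the expert-selection distribution.
    `scores` is (tokens, n): L2 norms of x W^i_down for AoE, or router
    scores R(x) for a traditional MoE; lower entropy = higher confidence."""
    p = scores.softmax(dim=-1)                          # p_i over n experts
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)  # Ent_conf
```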

The authors examined whether the improvements stem from the factorization of $\mathbf{W}_{g}$, finding that the factorization itself does not significantly influence performance. They also examined the impact of involving more parameters in expert selection and found that the improvement in AoE is not primarily due to this factor.

The paper also analyzes whether experts share aligned self-evaluation criteria, showing that experts within the same layer produce activations of similar scale. The efficiency of AoE is evaluated in terms of throughput and memory usage, demonstrating that AoE achieves up to 97% of the throughput of traditional MoE models.

Finally, the paper presents results from pre-training LLMs with 4 billion parameters. These results show that AoE outperforms traditional MoE models as they scale, with the performance improvement being more pronounced in the larger models than in the smaller ones.
