UMoE: Unifying Attention and FFN with Shared Experts (2505.07260v1)

Published 12 May 2025 in cs.LG and cs.AI

Abstract: Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
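To make the shared-expert idea in the abstract concrete, below is a minimal, hypothetical sketch of a token-level top-k MoE layer whose expert pool could be invoked from both the attention path and the FFN path. It is not the authors' implementation; the class name, routing scheme, and expert shape are illustrative assumptions only.

```python
# Hypothetical sketch (not the UMoE authors' code): a pool of FFN-style experts
# with token-level top-k routing that could be shared between attention and FFN layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExpertMoE(nn.Module):
    """Token-level top-k routing over a shared pool of FFN-style experts."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small two-layer FFN; the same pool can be reused by
        # multiple callers (attention-side and FFN-side), which is the
        # parameter-sharing idea described in the abstract.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to individual tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        scores = self.router(tokens)                    # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)
```

A usage note on the sketch: the same `SharedExpertMoE` instance could replace the FFN sublayer directly, and also process the mixed token representations inside an attention block, under the assumption that the attention computation has been reformulated to expose an FFN-like stage, as the abstract indicates.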

Authors (3)
  1. Yuanhang Yang (8 papers)
  2. Chaozheng Wang (28 papers)
  3. Jing Li (621 papers)