Unified Attention-FFN MoE
- The paper introduces UMoE, a method that reformulates multi-head attention as FFN-like operations to enable shared expert modules across both sub-layers.
- UMoE employs a top‑k gating mechanism with LoRA-style low‑rank expert-specific query adaptations, yielding improved perplexity and downstream performance under matched compute.
- Empirical results demonstrate that UMoE outperforms dense and FFN‑MoE baselines in language modeling and multilingual ASR, validating its unified design.
Unified Attention-FFN Mixture-of-Experts (UMoE) is an architectural strategy for scaling Transformer models by integrating Mixture-of-Experts (MoE) blocks into both the multi-head attention (MHA) and feed-forward network (FFN) sub-layers, with the unique property of parameter sharing between these traditionally distinct modules. UMoE introduces an algebraic reformulation of attention that reveals its underlying FFN-like structure, thereby enabling unified MoE design using a global bank of shared sparse experts. Empirical results indicate UMoE achieves state-of-the-art perplexity and downstream performance under tight compute and parameter budgets, and its principles have informed architectures in language modeling and multilingual speech recognition domains (Yang et al., 12 May 2025, Ma et al., 22 Jan 2025).
1. Motivation and Conceptual Insight
Sparse Mixture-of-Experts (MoE) layers scale Transformer capacity by routing each token through only a small subset of a large set of learned "experts," typically implemented as two-layer FFNs. Standard MoE adoption has concentrated on the FFN sub-layer, but efforts to extend MoE to self-attention components (e.g., MoA, SwitchHead) have seen reduced performance, owing to fundamental differences: attention layers consist of projection, softmax mixing, and weighted-sum operations, whereas FFNs are two straightforward linear transformations sandwiching a nonlinearity.
UMoE bridges this divide by expressing multi-head attention algebraically as a token-mixing operation followed by a two-matrix FFN transform, thus making both attention and FFNs amenable to the same MoE methodologies. Critically, this decomposition permits shared expert modules (i.e., parameterized two-layer FFNs) to be used interchangeably in both attention and FFN roles within any Transformer layer (Yang et al., 12 May 2025).
2. Algebraic Reformulation of Attention as FFN-like Blocks
UMoE relies on a "pre-mixing" view of multi-head attention. Given input embeddings $X \in \mathbb{R}^{n \times d}$ and per-head projections $W_Q, W_K, W_V, W_O$, for a single head:
- $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- $A = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)$, $O = AVW_O$
In the conventional multi-head output, for $H$ heads and $W_O$ block-decomposed into per-head $W_O^{(h)}$:
$$O = \sum_{h=1}^{H} A_h\, X W_V^{(h)} W_O^{(h)}$$
The term $A_h X$ aggregates a soft mix of all input tokens (token mixing), while the matrix product $W_V^{(h)} W_O^{(h)}$ constitutes a two-matrix transformation matching the FFN pattern. Therefore, per-head computation is interpretable as applying an FFN expert to a mixed token representation. This structure underpins the unified expert architecture (Yang et al., 12 May 2025).
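The equivalence above is pure matrix associativity and can be checked numerically. A minimal NumPy sketch for a single head (toy dimensions; all variable names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dh = 4, 8, 6  # tokens, model dim, head dim

X = rng.standard_normal((n, d))
Wq = rng.standard_normal((d, dh))
Wk = rng.standard_normal((d, dh))
Wv = rng.standard_normal((d, dh))
Wo = rng.standard_normal((dh, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# attention weights A: the token-mixing matrix
A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(dh))

# conventional head output: mix the projected values, then project out
standard = (A @ (X @ Wv)) @ Wo

# pre-mixing view: mix raw tokens first, then apply the
# FFN-like two-matrix transform Wv @ Wo to the mixed token
premixed = (A @ X) @ (Wv @ Wo)

assert np.allclose(standard, premixed)
```

Because the mixing matrix $A$ commutes past the value projection by associativity, the two orderings are exactly equal, which is what licenses treating $W_V W_O$ as an FFN-style expert.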
3. Routing and Expert Transform Mechanisms
Routing in UMoE applies the standard top-$k$ gating algorithm popularized in BASE layers: for input $x$, routing logits are $s = W_r x$, and the set $\mathcal{T}$ of active experts corresponds to the top-$k$ scores. The gating weight for expert $i$ is:
$$g_i(x) = \begin{cases} \exp(s_i) \,/\, \sum_{j \in \mathcal{T}} \exp(s_j) & i \in \mathcal{T} \\ 0 & \text{otherwise} \end{cases}$$
Each expert $E_i$ is realized as a two-layer FFN with nonlinearity $\sigma$:
$$E_i(x) = W_2^{(i)}\, \sigma\!\left(W_1^{(i)} x\right)$$
For attention-MoE, routing operates on the soft-mixed token vectors; for FFN-MoE, on the sub-layer input or output.
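The gating and expert transform can be sketched in a few lines of NumPy. This is a toy single-token illustration, assuming a renormalized softmax over the selected logits and a ReLU nonlinearity (common choices, not confirmed details of the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_ff, n_experts, k = 8, 16, 4, 2

x = rng.standard_normal(d)
Wr = rng.standard_normal((n_experts, d))         # router
W1 = rng.standard_normal((n_experts, d_ff, d))   # expert first layers
W2 = rng.standard_normal((n_experts, d, d_ff))   # expert second layers

# top-k gating: softmax over the k largest routing logits, zero elsewhere
s = Wr @ x
topk = np.argsort(s)[-k:]
w = np.exp(s[topk] - s[topk].max())
w /= w.sum()

# each active expert is a two-layer FFN E_i(x) = W2_i @ relu(W1_i @ x);
# the MoE output is the gate-weighted sum of active expert outputs
y = sum(wi * (W2[i] @ np.maximum(W1[i] @ x, 0.0))
        for wi, i in zip(w, topk))
```

Only the $k$ selected experts are evaluated, which is what keeps the per-token compute roughly constant as the expert pool grows.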
The query projections in attention require per-expert adaptation, implemented in UMoE with a LoRA-style low-rank additive term:
$$W_Q^{(i)} = W_Q + B_i A_i, \qquad B_i \in \mathbb{R}^{d \times r},\ A_i \in \mathbb{R}^{r \times d},\ r \ll d$$
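A minimal sketch of such a low-rank query adaptation (toy dimensions, illustrative names; the factored form is how LoRA-style updates are applied in practice, assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 16, 2  # model dim, low-rank bottleneck (r << d)

Wq = rng.standard_normal((d, d))       # shared query projection
B = rng.standard_normal((d, r)) * 0.1  # expert-specific down-projection
A = rng.standard_normal((r, d)) * 0.1  # expert-specific up-projection
x = rng.standard_normal(d)

# expert-i query: x @ (Wq + B @ A); the factored form never
# materializes the d x d update, costing O(d*r) extra per expert
q_full = x @ (Wq + B @ A)
q_factored = x @ Wq + (x @ B) @ A

assert np.allclose(q_full, q_factored)
assert np.linalg.matrix_rank(B @ A) <= r
```

Each expert thus adds only $2dr$ query parameters rather than a full $d \times d$ projection.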
4. Layer Structure and Parameter Sharing
A UMoE layer incorporates two major MoE sub-layers:
- Pre-mixing Attention-MoE:
- Compute the shared key matrix $K$ and attention weights $A$ for the sequence.
- For each top-$k$ expert $i$ allocated to token $t$, form the query $q_t^{(i)}$ via the low-rank expert-specific and shared query projections.
- Compute the mixed token $\tilde{x}_t = \sum_j A_{tj}\, x_j$, then the expert output $E_i(\tilde{x}_t)$.
- Aggregate the expert outputs into the sub-layer output via the router's mixture weights $g_i$.
- FFN-MoE:
- Route the updated token representation through the top-$k$ selected experts and aggregate their outputs.
All modules draw from a single global pool of experts $\{E_1, \dots, E_N\}$; only the routers (matrices $W_r^{\mathrm{attn}}$ and $W_r^{\mathrm{ffn}}$) are distinct for the attention and FFN sub-blocks. This sharing roughly halves the expert parameter count relative to duplicating the expert set for each sub-layer, with no loss of capacity.
Layer normalization and residual connections are implemented per standard Transformer practice.
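The sharing scheme above amounts to one expert pool consulted by two routers. A single-token NumPy sketch (illustrative names; ReLU and renormalized top-$k$ softmax assumed as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_ff, n_experts, k = 8, 16, 4, 2

# single global pool of two-layer FFN experts, shared by both sub-layers
experts = [(rng.standard_normal((d_ff, d)), rng.standard_normal((d, d_ff)))
           for _ in range(n_experts)]

# only the routers are sub-layer specific
Wr_attn = rng.standard_normal((n_experts, d))
Wr_ffn = rng.standard_normal((n_experts, d))

def moe(x, Wr):
    # top-k gate over the shared pool, then gate-weighted expert sum
    s = Wr @ x
    topk = np.argsort(s)[-k:]
    w = np.exp(s[topk] - s[topk].max())
    w /= w.sum()
    return sum(wi * (W2 @ np.maximum(W1 @ x, 0.0))
               for wi, (W1, W2) in zip(w, (experts[i] for i in topk)))

x_mixed = rng.standard_normal(d)  # pre-mixed token from the attention step
x_token = rng.standard_normal(d)  # raw token entering the FFN sub-layer

y_attn = moe(x_mixed, Wr_attn)    # attention-MoE draws on the shared pool
y_ffn = moe(x_token, Wr_ffn)      # FFN-MoE reuses the very same experts
```

The same `experts` list backs both calls; only `Wr_attn` versus `Wr_ffn` differ, mirroring the paper's router-only specialization.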
5. Empirical Results and Quantitative Performance
UMoE exhibits superior performance over Dense, FFN-MoE, MoA, and SwitchHead baselines. For "Base"-scale models (MoE variants at roughly 540M total parameters and ~610G MACs), the reported results are:
| Model | FineWeb PPL | Wiki103 PPL | Avg Zero-shot Acc (%) |
|---|---|---|---|
| Dense (134M) | 25.79 | 30.41 | 36.14 |
| FFN-MoE (535M) | 21.19 | 27.94 | 39.55 |
| MoA (525M) | 22.28 | 27.57 | 38.49 |
| SwitchHead (533M) | 22.91 | 29.47 | 38.30 |
| UMoE-Att only (547M) | 20.81 | 27.45 | 39.94 |
| UMoE full (540M) | **20.44** | **26.67** | **40.06** |
UMoE full denotes shared expert pool across attention and FFN MoE sub-layers. Best figures in each column are bolded.
Ablation studies show that allocating all active experts in a layer to attention (rather than FFN) yields further perplexity reduction, while the inclusion of nonlinearity in expert FFNs is critical (removing it degrades perplexity by 1.6 points). Slight gains are realized when using separate routers for the two sub-layer types, even with shared expert weights (Yang et al., 12 May 2025).
6. Application in Domain-Robust Multilingual ASR
UMoE principles have influenced architectures such as BLR-MoE for end-to-end multilingual automatic speech recognition (ASR) (Ma et al., 22 Jan 2025). In BLR-MoE:
- Each Transformer layer in the Mixture-of-Language Experts (MLE) block replaces standard MHA and FFN sub-layers with attention-MoE and FFN-MoE counterparts.
- The router, mediated by a LID (Language ID) signal and optionally augmented with a TDNN adapter, produces gating weights that are shared across attention and FFN MoE modules within a layer.
- During inference, "expert pruning" is employed using known language constraints to further improve recognition performance under domain shift.
Performance on a 10,000-hour MASR dataset demonstrates substantial relative WER reductions: BLR-MoE outperforms LR-MoE (FFN-only) by 16% relative WER (15.84% vs. 18.89% overall WER), and out-of-domain WER shows a 19% relative reduction. Additional ablations confirm that attention-MoE and router augmentation each independently yield notable improvements (Ma et al., 22 Jan 2025).
7. Comparative Architectural Approaches and Implications
The UMoE paradigm is closely related to recent advances in expert decomposition and routing, such as Union-of-Experts (UoE) (Yang et al., 4 Mar 2025), which conducts expert decomposition on both MLP and attention blocks using matrix partitioning and supports hierarchical, patch-wise, or expert-wise routing. While UoE attains strong efficiency gains and performance improvements, UMoE's distinct contribution is the algebraically motivated, exact equivalence between attention and FFN MoE mechanisms and the consequent ability to use a single global expert pool.
A plausible implication is that, as model designs continue to bring routing flexibility and parameter sharing to both attention and FFN modules, future large-scale Transformer architectures will increasingly converge toward UMoE-style unified, capacity-scaled frameworks. The algebraic insights of UMoE support strong parameter efficiency and enable scaling both key sub-layers jointly, rather than in isolation.
References
- "UMoE: Unifying Attention and FFN with Shared Experts" (Yang et al., 12 May 2025)
- "BLR-MoE: Boosted Language-Routing Mixture of Experts for Domain-Robust Multilingual E2E ASR" (Ma et al., 22 Jan 2025)
- "Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer" (Yang et al., 4 Mar 2025)