SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention (2312.07987v3)
Abstract: Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for the feedforward layers. Previous attempts to extend MoE to the self-attention layer fail to match the performance of a parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that reduces both the compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of the standard model with only 44% of the compute and 27% of the memory. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., it achieves a more than 3.5% absolute improvement on BLiMP over the baseline trained with equal compute.
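The abstract summarizes the mechanism only at a high level. The sketch below illustrates one way such an MoE attention layer can be organized: a small number of attention heads with dense query/key projections, plus per-token routing over small banks of expert value and output projections. This is a minimal illustration, not the authors' reference implementation; all names (`MoEAttentionSketch`, `n_experts`, `top_k`) are hypothetical, and for brevity it evaluates every expert and zeroes the non-selected gates, whereas a compute-saving implementation would gather and evaluate only the selected experts.

```python
# Illustrative sketch of MoE-style attention in the spirit of SwitchHead:
# few attention heads, each with dense Q/K projections, while the value and
# output projections are mixtures of experts selected per token by a router.
# NOT the authors' implementation; names and details are assumptions.
import torch
import torch.nn as nn


class MoEAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_head: int,
                 n_experts: int, top_k: int = 1):
        super().__init__()
        self.n_heads, self.d_head, self.top_k = n_heads, d_head, top_k
        # Dense query/key projections: one attention matrix per (reduced) head.
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Per-head expert banks for the value and output projections.
        self.v_experts = nn.Parameter(torch.randn(n_heads, n_experts, d_model, d_head) * 0.02)
        self.o_experts = nn.Parameter(torch.randn(n_heads, n_experts, d_head, d_model) * 0.02)
        # Router: per token and per head, a score for every expert.
        self.router = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head)

        # Non-competitive (sigmoid) routing; keep only the top-k experts per head.
        scores = torch.sigmoid(self.router(x)).view(B, T, self.n_heads, -1)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)          # (B, T, H, K)
        gates = torch.zeros_like(scores).scatter(-1, top_idx, top_scores)

        # MoE value projection. For clarity all experts are evaluated and the
        # unused ones are gated to zero; a real implementation would only
        # compute the selected experts to actually save compute.
        v_all = torch.einsum('btd,hedk->bthek', x, self.v_experts)     # (B, T, H, E, Dh)
        v = (gates.unsqueeze(-1) * v_all).sum(dim=-2)                  # (B, T, H, Dh)

        # Standard causal attention, but with n_heads kept small.
        att = torch.einsum('bqhd,bkhd->bhqk', q, k) / self.d_head ** 0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(mask, float('-inf')).softmax(dim=-1)
        ctx = torch.einsum('bhqk,bkhd->bqhd', att, v)                  # (B, T, H, Dh)

        # MoE output projection, gated the same way.
        out_all = torch.einsum('bqhd,hedk->bqhek', ctx, self.o_experts)
        return (gates.unsqueeze(-1) * out_all).sum(dim=(-3, -2))       # (B, T, D)
```

For example, `MoEAttentionSketch(d_model=512, n_heads=2, d_head=64, n_experts=4)(torch.randn(1, 16, 512))` returns a `(1, 16, 512)` tensor. In this reading, keeping `n_heads` small is what reduces the number of attention matrices, while the per-head expert banks preserve the overall parameter count.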
Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber