Optimised Grouped-Query Attention Mechanism for Transformers (2406.14963v1)
Published 21 Jun 2024 in cs.LG
Abstract: Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA into a GQA, neighbouring query heads in the MHA are evenly split into groups, and each group shares its key and value layers. In this work, we propose AsymGQA, an activation-informed approach that groups an MHA asymmetrically into a GQA for better model performance. AsymGQA outperforms GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
- Yuang Chen (19 papers)
- Cheng Zhang (388 papers)
- Xitong Gao (23 papers)
- George A. Constantinides (41 papers)
- Yiren Zhao (58 papers)
- Robert D. Mullins (4 papers)
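
The neighbour grouping described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, and the mean-pooling merge rule are assumptions made for illustration (the abstract does not say how the shared key and value layers are formed), and the uneven grouping shown is hand-picked rather than derived from activations as AsymGQA does.

```python
import numpy as np


def neighbour_groups(num_heads: int, num_groups: int) -> list[list[int]]:
    """Evenly split consecutive head indices into groups (neighbour grouping)."""
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def merge_kv_heads(w: np.ndarray, groups: list[list[int]]) -> np.ndarray:
    """Merge per-head key (or value) weights into one shared head per group.

    w has shape (num_heads, head_dim, d_model). Mean-pooling is an assumed
    merge rule, used here only for illustration. Groups may differ in size.
    """
    return np.stack([w[idx].mean(axis=0) for idx in groups])


rng = np.random.default_rng(0)
w_k = rng.standard_normal((8, 4, 16))  # 8 heads, head_dim 4, d_model 16

# Neighbour grouping: 8 heads -> 4 groups of 2 consecutive heads.
even = neighbour_groups(8, 4)
w_k_even = merge_kv_heads(w_k, even)

# Hypothetical asymmetric grouping with uneven group sizes (3, 1, 3, 1).
# A real activation-informed method would choose these indices from data.
uneven = [[0, 5, 6], [1], [2, 3, 7], [4]]
w_k_uneven = merge_kv_heads(w_k, uneven)

print(even)
print(w_k_even.shape, w_k_uneven.shape)
```

Both groupings produce the same number of shared key heads, so the parameter budget is unchanged; only the assignment of heads to groups differs, which is the degree of freedom the abstract says AsymGQA exploits.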