Optimised Grouped-Query Attention Mechanism for Transformers (2406.14963v1)

Published 21 Jun 2024 in cs.LG

Abstract: Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA model into a GQA model, neighbouring query heads in MHA are evenly split into groups, and each group shares its key and value layers. In this work, we propose AsymGQA, an activation-informed approach that groups an MHA asymmetrically into a GQA for better model performance. AsymGQA outperforms GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
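The neighbour-grouping baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the key (or value) projection weights are stored with heads stacked along the first axis, and that the heads in a group are merged by mean pooling, which is a common choice for MHA-to-GQA conversion but is not stated in the abstract. The function names `neighbour_group_kv` and `query_to_group` are hypothetical. This is not the AsymGQA procedure, whose activation-informed grouping the abstract does not spell out.

```python
import numpy as np

def neighbour_group_kv(w, num_heads, num_groups):
    """Merge per-head K (or V) projection weights into num_groups shared heads.

    w: array of shape (num_heads * head_dim, d_model), heads stacked on axis 0
       (assumed layout).
    Returns an array of shape (num_groups * head_dim, d_model).
    """
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    head_dim = w.shape[0] // num_heads
    heads_per_group = num_heads // num_groups
    # Adjacent heads fall into the same group: (groups, heads_per_group, head_dim, d_model)
    grouped = w.reshape(num_groups, heads_per_group, head_dim, w.shape[1])
    # Assumption: merge the heads of a group by averaging their weights
    merged = grouped.mean(axis=1)
    return merged.reshape(num_groups * head_dim, w.shape[1])

def query_to_group(num_heads, num_groups):
    """Map each query head index to the index of the K/V group it shares."""
    heads_per_group = num_heads // num_groups
    return [h // heads_per_group for h in range(num_heads)]

# Toy example: 8 heads of dimension 2, model width 3, merged into 2 groups
w_k = np.arange(48, dtype=float).reshape(16, 3)
merged_k = neighbour_group_kv(w_k, num_heads=8, num_groups=2)
assignment = query_to_group(num_heads=8, num_groups=2)
```

In this even split every group holds the same number of adjacent query heads. An asymmetric scheme such as the one the abstract proposes would instead allow groups of different sizes and membership.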


Authors (6)

Yuang Chen
Cheng Zhang
Xitong Gao
George A. Constantinides
Yiren Zhao
Robert D. Mullins

