
Grouped-Query Attention Overview

Updated 13 October 2025
  • Grouped-Query Attention (GQA) is a transformer mechanism that groups query projections to share key–value pairs, reducing memory usage during inference.
  • GQA offers a flexible tradeoff between standard multi-head and multi-query attention by adjusting the number of KV groups, which improves throughput while maintaining quality.
  • Recent GQA variants, such as QCQA, WGQA, and DHA, enhance adaptivity and accuracy through dynamic grouping and optimized head merging strategies.

Grouped-Query Attention (GQA) is a class of transformer attention mechanisms in which Q (query) projections are grouped so that each group shares its own set of K (key) and V (value) projections, thereby reducing the number of unique key–value heads stored in the key–value (KV) cache. The primary objective of GQA is to decrease the memory and bandwidth requirements during inference—particularly in autoregressive decoding—while maintaining high model quality. GQA acts as an interpolation between standard multi-head attention (MHA), where all heads are independent, and multi-query attention (MQA), where all query heads share a single KV head. Recent years have seen rapid evolution of GQA, leading to a rich family of architectural and optimization variants targeting improved efficiency, expressivity, and adaptivity for large-scale language and vision models.

1. Core Principles and Formal Definition

In GQA, the attention heads are partitioned into $G$ groups ($1 \leq G \leq H$, where $H$ is the total number of heads). Each group of query heads is assigned a shared key–value pair, reducing the number of distinct KV heads from $H$ (in MHA) to $G$. Formally, for query group $g \in \{1,\ldots,G\}$, if the group contains $n_g$ query heads $\{Q_i\}_{i=1}^{n_g}$, then all $n_g$ queries attend to the same $K_g$, $V_g$:

$$K_g = \frac{1}{n_g} \sum_{i \in \text{group } g} K_i, \qquad V_g = \frac{1}{n_g} \sum_{i \in \text{group } g} V_i$$

The attention output for all queries in the group is

$$\text{Attention}_g = \text{softmax}\!\left( \frac{Q_g K_g^{T}}{\sqrt{d_k}} \right) V_g$$

Concatenating the outputs over the $G$ groups yields the final multi-headed attention output.

When $G = 1$, GQA is equivalent to MQA; when $G = H$ (one query head per group), it reduces to standard MHA. GQA therefore lets practitioners balance memory use and representational capacity by selecting $G$.
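
As a concrete illustration of the mechanism above, the following PyTorch sketch implements grouped-query attention by broadcasting each of the $G$ KV heads across its query heads. It is a minimal, unmasked single-layer example with illustrative shapes, not a reference implementation from any of the cited papers.

```python
# Minimal GQA sketch (PyTorch). Causal masking omitted for brevity.
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, num_groups):
    """
    q:    [batch, num_q_heads, seq_len, d_h]  - one projection per query head
    k, v: [batch, num_groups,  seq_len, d_h]  - one shared projection per group
    num_groups = num_q_heads recovers MHA; num_groups = 1 recovers MQA.
    """
    b, h, t, d = q.shape
    heads_per_group = h // num_groups
    # Broadcast each grouped K/V head across the query heads in its group.
    k = k.repeat_interleave(heads_per_group, dim=1)   # [b, h, t, d]
    v = v.repeat_interleave(heads_per_group, dim=1)   # [b, h, t, d]
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # [b, h, t, t]
    return F.softmax(scores, dim=-1) @ v              # [b, h, t, d]

# Example: 8 query heads sharing 2 KV heads (GQA-2).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_attention(q, k, v, num_groups=2)   # [1, 8, 16, 64]
```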

2. Memory, Quality, and Efficiency Tradeoffs

GQA reduces the storage and bandwidth required for caching key–value tensors during inference. For a decoder layer with $H$ heads, head width $d_h$, sequence length $L$, and batch size $B$, the KV-cache footprint per attention layer (in stored elements) is:

$$\text{KV Cache Size}_{\text{GQA}} = 2 \cdot G \cdot d_h \cdot B \cdot L$$

compared to

$$\text{KV Cache Size}_{\text{MHA}} = 2 \cdot H \cdot d_h \cdot B \cdot L$$

This memory reduction is vital in large-scale LLMs and vision transformers, especially for long-sequence generation and large batch serving.
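
To make the formula concrete, the short script below evaluates both cache sizes for a hypothetical decoder layer; the model dimensions are illustrative, and fp16 storage at 2 bytes per element is assumed.

```python
# Worked example of the KV-cache formulas above (fp16 elements, 2 bytes each).
def kv_cache_bytes(num_kv_heads, d_h, batch, seq_len, bytes_per_elem=2):
    # Factor 2 counts both the K and the V tensor for every cached position.
    return 2 * num_kv_heads * d_h * batch * seq_len * bytes_per_elem

H, G, d_h, B, L = 32, 8, 128, 16, 8192        # hypothetical decoder layer
mha = kv_cache_bytes(H, d_h, B, L)            # full multi-head cache
gqa = kv_cache_bytes(G, d_h, B, L)            # grouped cache with G = 8
print(f"MHA   per layer: {mha / 2**30:.2f} GiB")   # ~2.00 GiB
print(f"GQA-8 per layer: {gqa / 2**30:.2f} GiB")   # ~0.50 GiB (H/G = 4x smaller)
```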

Empirical results (Ainslie et al., 2023, Brandon et al., 21 May 2024) consistently show that GQA with moderate $G$ matches or nearly matches MHA quality (e.g., accuracy, BLEU, or ROUGE) with much faster inference; in T5-XXL, for example, GQA-8 achieves nearly MHA-level quality at inference speed close to MQA. Quality drops are modest for moderate grouping but grow as $G$ decreases. The design space (illustrated below) allows interpolation between speed and accuracy:

| Attention Mechanism | Distinct KV Heads | KV Cache Size | Expected Quality | Inference Speed |
|---|---|---|---|---|
| Multi-Head (MHA) | $H$ | * | Best | Slowest |
| Grouped-Query (GQA-$G$) | $G$ | $G/H$ of MHA | Near-best | Faster |
| Multi-Query (MQA) | $1$ | Lowest ($1/H$ of MHA) | Lower | Fastest |

(* denotes the maximum, $2 \cdot H \cdot d_h \cdot B \cdot L$.)

Notably, GQA targets the memory bandwidth bottleneck that dominates autoregressive decoding (Brandon et al., 21 May 2024, Kong et al., 5 May 2025). However, raw compute cost (FLOPs for attention score computation) is not reduced relative to MHA, a limitation specifically addressed by new approaches such as Sparse Query Attention (Filipek, 2 Oct 2025) and CCGQA (Figliolia et al., 6 Oct 2025).
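
A rough operation count makes this point explicit: the score and value matmuls scale with the number of query heads, which GQA leaves unchanged, while only the cache scales with the number of KV heads. The snippet below uses an approximate FLOP count and hypothetical model dimensions; the query-head reduction factor in the last line is purely illustrative of the SQA-style approach, not a figure from that paper.

```python
# Approximate attention cost model (multiply-adds only; masking/softmax ignored).
def attention_flops(num_q_heads, seq_len, d_h):
    scores = num_q_heads * seq_len * seq_len * d_h   # Q @ K^T
    values = num_q_heads * seq_len * seq_len * d_h   # softmax(scores) @ V
    return 2 * (scores + values)

def kv_cache_elems(num_kv_heads, seq_len, d_h, batch=1):
    return 2 * num_kv_heads * d_h * batch * seq_len  # K and V tensors

H, G, L, d_h = 32, 8, 4096, 128                      # hypothetical layer
print(attention_flops(H, L, d_h))                    # MHA FLOPs
print(attention_flops(H, L, d_h))                    # GQA FLOPs: identical, all H query heads remain
print(kv_cache_elems(H, L, d_h) / kv_cache_elems(G, L, d_h))  # cache ratio: 4.0 (= H/G)
print(attention_flops(H // 4, L, d_h))               # illustrative query-head reduction (SQA-style)
```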

3. Extensions: Adaptive, Weighted, and Activation-Informed Grouping

Several works have demonstrated substantial gains by refining the grouping mechanism:

  • Quality- and Capacity-Aware Grouping (QCQA): Instead of fixed-size (naive) groups, QCQA (Joshi et al., 8 Jun 2024) employs an evolutionary algorithm and a lightweight fitness function (weight-sharing error) to discover groupings that minimize accuracy loss under a KV-cache constraint. QCQA achieves up to 20% higher accuracy than naive GQA at equal cache size, and up to 40% smaller KV cache at the same accuracy.
  • Weighted GQA (WGQA): WGQA (Chinnakonduru et al., 15 Jul 2024) replaces uniform mean-pooling with learnable weights for each key and value head in the group (a minimal sketch of this idea follows the list below). During finetuning, the model discovers optimal weightings, yielding an additional ~0.5% average improvement on summarization and translation tasks over regular GQA, with no extra inference cost.
  • Activation-Informed and Asymmetric Grouping (AsymGQA): AsymGQA (Chen et al., 21 Jun 2024) clusters attention heads by activation similarity (cosine on output activations) and allows for asymmetric, non-uniform group sizes, further tailoring groups for accuracy. LLaMA-2-7B finetuned with AsymGQA achieves a 7.5% accuracy gain on MMLU compared to contiguous grouping.
  • Key-Driven and Dynamic Grouped GQA (KDGQA/DGQA): These mechanisms (Khan et al., 15 Aug 2024) assign more queries to key heads with higher L₂ norm, interpreted as more "important". Dynamic variants track changes in key norm over training using EMA, allocating queries adaptively. For ViT-L, DGQA achieves up to +8% accuracy compared to static grouping.
  • Expert Mixtures (mixSGA): mixSGA (Song et al., 16 Jun 2025) introduces dynamic token-wise routing: importance scores route tokens to experts with different group sizes (KV granularity), allocating more compute/memory to "important" tokens. Weight-sharing ensures efficiency, and an auxiliary loss enforces one-hot expert selection for consistency between training and inference.

These approaches demonstrate that static or uniform grouping is suboptimal; activation- or importance-informed, weighted, and dynamic grouping strategies yield significant quality gains for equal or lower resource use.
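
To illustrate the weighted-grouping idea referenced above (WGQA), the module below replaces uniform mean pooling of the K/V heads in each group with learnable per-head weights initialised to the uniform mean. This is a simplified sketch of the concept under stated assumptions, not the paper's implementation; the class and parameter names are invented for illustration.

```python
import torch
import torch.nn as nn

class WeightedKVGrouping(nn.Module):
    """Collapse H key/value heads into G grouped heads via learnable weights."""
    def __init__(self, num_heads, num_groups):
        super().__init__()
        self.num_groups = num_groups
        self.heads_per_group = num_heads // num_groups
        # One learnable weight per member head, initialised to uniform mean pooling.
        init = torch.full((num_groups, self.heads_per_group), 1.0 / self.heads_per_group)
        self.k_weights = nn.Parameter(init.clone())
        self.v_weights = nn.Parameter(init.clone())

    def forward(self, k, v):
        # k, v: [batch, num_heads, seq_len, d_h] from a pretrained MHA layer.
        b, h, t, d = k.shape
        k = k.view(b, self.num_groups, self.heads_per_group, t, d)
        v = v.view(b, self.num_groups, self.heads_per_group, t, d)
        w_k = self.k_weights.view(1, self.num_groups, self.heads_per_group, 1, 1)
        w_v = self.v_weights.view(1, self.num_groups, self.heads_per_group, 1, 1)
        # Weighted combination over member heads -> one shared KV head per group.
        return (w_k * k).sum(dim=2), (w_v * v).sum(dim=2)  # [batch, num_groups, seq_len, d_h]
```

Initialising the weights to the uniform mean recovers standard GQA pooling at the start of finetuning; in principle the learned weights can then be folded into the grouped projection matrices, consistent with the no-extra-inference-cost claim above.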

4. Conversion, Compatibility, and Interactions

State-of-the-art methods address practical challenges in converting pretrained MHA checkpoints to efficient GQA models:

  • Progressive Head Merging via Orthogonal Alignment: (Jin et al., 30 Dec 2024) proposes aligning KV heads using block-diagonal orthogonal transforms (compatible with RoPE); this increases intra-group similarity before merging and allows compressing up to 87.5% of KV heads in LLaMA2-7B with negligible degradation.
  • Decoupled-Head Attention (DHA): (Chen et al., 3 Jun 2024) further generalizes GQA by adaptively fusing similar heads via linear fusion. Unlike mean pooling, the mapping between queries and group keys/values is learned layerwise, with adaptive budgets based on redundancy. DHA achieves 97.6% of MHA performance with only 0.25% of the original pre-training budget and 75% KV cache reduction, outperforming standard GQA, especially under tight training or fine-tuning budgets.
  • GQKVA: (Javadi et al., 2023) generalizes grouping across queries, keys, and values. By flexibly partitioning all components, one can realize various tradeoffs between parameter count, efficiency, and representational diversity.
  • Weighted Grouping and Anchored Initializations: (Chinnakonduru et al., 15 Jul 2024) and (Jin et al., 30 Dec 2024) empirically demonstrate that learned or optimized fusion and grouping significantly outperform naive mean pooling for head merging and model adaptation (a sketch of the naive mean-pooling baseline follows below).
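
For reference, the naive baseline these methods improve upon is simple mean pooling of the pretrained K/V projection weights within each group, as used in the original GQA uptraining recipe. The sketch below assumes a row-major [num_heads * d_h, d_model] weight layout; adapt the reshape to the actual layout of your checkpoint.

```python
# Naive MHA -> GQA checkpoint conversion: mean-pool K/V projection weights
# within each group. Layout assumption: rows ordered head-by-head.
import torch

def mean_pool_kv_proj(weight, num_heads, num_groups):
    d_out, d_model = weight.shape                  # d_out = num_heads * d_h
    d_h = d_out // num_heads
    heads_per_group = num_heads // num_groups
    w = weight.view(num_heads, d_h, d_model)
    w = w.view(num_groups, heads_per_group, d_h, d_model).mean(dim=1)
    return w.reshape(num_groups * d_h, d_model)    # grouped projection weights

# Example: collapse 32 K-projection heads into 8 grouped heads.
w_k = torch.randn(32 * 128, 4096)
w_k_gqa = mean_pool_kv_proj(w_k, num_heads=32, num_groups=8)  # [8 * 128, 4096]
```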

5. Hardware and Algorithmic Co-Design

Practical deployment of GQA and its variants is closely intertwined with hardware-oriented innovations:

  • Cache and Kernel Efficiency: GQA forms the basis of highly efficient inference engines in modern LLMs, including Opt-GPTQ (Kong et al., 5 May 2025), which combines GQA with paged memory management and kernel customizations optimized for data center units (DCUs). Query grouping reduces both memory usage and computation by batching shared KV access, and enables improved parallelism and hardware utilization (a simplified decode-time cache sketch follows this list).
  • Flash Sparse Attention (FSA): (Yan et al., 25 Aug 2025) develops a GPU kernel for efficient Native Sparse Attention (NSA) even when GQA group sizes are small (as in most modern LLMs), avoiding padding inefficiencies of previous NSA kernels. FSA achieves up to 3.5× kernel-level latency reduction and consistent end-to-end speedups across LLMs.
  • Grouped-Tied and Latent Attention: (Zadouri et al., 27 May 2025) introduces Grouped-Tied Attention (GTA), tying K and V projections to a single shared state per group, halving KV cache usage relative to GQA with equivalent perplexity, and doubling arithmetic intensity. Grouped Latent Attention (GLA) extends this principle for even more memory-efficient parallel execution.
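
As a simplified picture of why the grouped cache helps at decode time (a schematic sketch only, not the Opt-GPTQ, FSA, or GTA kernels), the snippet below stores just $G$ KV heads per position, appends one token per step, and broadcasts the grouped heads across the $H$ query heads when computing attention.

```python
# Simplified decode-step sketch of a grouped KV cache.
import torch
import torch.nn.functional as F

B, H, G, d_h, L_max = 2, 32, 8, 128, 2048
k_cache = torch.zeros(B, G, L_max, d_h)   # grouped cache: G heads, not H
v_cache = torch.zeros(B, G, L_max, d_h)

def decode_step(q_t, k_t, v_t, pos):
    # q_t: [B, H, 1, d_h]; k_t, v_t: [B, G, 1, d_h] for the new token at `pos`.
    k_cache[:, :, pos:pos + 1] = k_t
    v_cache[:, :, pos:pos + 1] = v_t
    k = k_cache[:, :, :pos + 1].repeat_interleave(H // G, dim=1)  # [B, H, pos+1, d_h]
    v = v_cache[:, :, :pos + 1].repeat_interleave(H // G, dim=1)
    scores = q_t @ k.transpose(-2, -1) / d_h ** 0.5               # [B, H, 1, pos+1]
    return F.softmax(scores, dim=-1) @ v                          # [B, H, 1, d_h]

out = decode_step(torch.randn(B, H, 1, d_h),
                  torch.randn(B, G, 1, d_h),
                  torch.randn(B, G, 1, d_h), pos=0)
```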

6. Beyond GQA: Differential, Sparse, and Latent Variants

  • Differential and Grouped Differential Attention: (Lim et al., 8 Oct 2025) extends GQA to unbalanced allocation regimes for noise and signal branches, assigning more heads to signal extraction, with controlled repetition (mirroring GQA) for noise reduction. This yields improved generalization and stability under constrained compute.
  • Sparse Query Attention (SQA): (Filipek, 2 Oct 2025) pursues FLOP reduction by decreasing the number of query heads directly, unlike GQA which reduces memory but not arithmetic cost. SQA achieves up to 3× throughput improvement in compute-bound training and prefill scenarios, with minimal quality loss.
  • Latent and Compressed Latent Attention (CCGQA): (Figliolia et al., 6 Oct 2025) demonstrates that combining group head sharing (from GQA) with latent-space compression (from CCA) yields an attention mechanism—CCGQA—that outperforms GQA and MLA both in FLOP and memory cost, enabling up to 8× cache compression in MoE models without loss of performance compared to MHA.
  • Differential Rescaling (DiffQKV): (Lin et al., 23 Jan 2025) decomposes the compression of Q, K, and V, compressing K more, preserving V more, and augmenting Q (by increasing its head width) to maintain expressivity. DiffQKV achieves up to 33% inference speed improvement over GQA on long contexts, establishing the importance of non-uniform rescaling.

7. Applications and Future Trajectory

GQA and its descendants have proven critical in scaling transformer models to longer contexts and larger batch sizes, especially in resource- and latency-sensitive environments. Major applications include:

  • LLMs such as T5, Llama, and GPT-like architectures, where GQA reduces memory cost during decoding, enabling longer contexts and higher throughput.
  • Vision transformers, where memory overhead is a limiting factor in large-scale pretraining and inference.
  • Mixture-of-Experts (MoE) transformers, where composability of groupings enables scaling diversity and capacity efficiently.
  • Adaptive sequence processing and real-time generation, enabled by dynamic, token-wise routing and adaptive grouping.

Research is rapidly progressing towards further synergy between grouping, latent and low-rank compression, kernel-level hardware optimizations, and dynamic computation routing. Pivotal directions include combining grouping with sequence compression (e.g., sliding windows, chunking), non-linear fusion for head aggregation, and hardware–algorithm co-design for maximally efficient LLM deployment. GQA, as both a family of architectures and a design paradigm, remains foundational to the continued scale-up and democratization of efficient transformer models.
