Sparse Multi-Query Attention (MQA)
- Sparse Multi-Query Attention (MQA) is an efficient attention mechanism that shares key and value projections across query heads to dramatically reduce memory and inference latency.
- It reduces KV-cache storage to 1/H of the MHA footprint (where H is the number of query heads), enabling up to a 6.3× decoding speedup with only a minor drop (≈0.6 points) in quality metrics compared to full MHA.
- MQA forms the basis for advanced adaptive attention schemes like GQA, QCQA, and MoAS, which offer flexible quality–performance trade-offs in large-scale Transformer models.
Sparse Multi-Query Attention (MQA) is an attention mechanism that addresses the memory and inference-time efficiency bottlenecks observed in large-scale Transformer decoders, especially during autoregressive text generation. By sharing a single key and value projection across all query heads, MQA achieves a significant reduction in the memory and bandwidth requirements for storing and accessing the key-value cache ("KV-cache") at inference, at the cost of some reduction in generation quality. MQA occupies a central role in the evolution of efficient attention mechanisms, serving as a limiting case of the broader grouped-query attention (GQA) family, and is foundational to subsequent adaptive paradigms such as QCQA and MoAS.
1. Definition and Mechanism of Sparse Multi-Query Attention
Multi-Query Attention (MQA) modifies the standard multi-head attention (MHA) architecture by retaining distinct query projections while sharing a single key and value projection across all heads. Formally, given an input sequence $X \in \mathbb{R}^{T \times d_{\text{model}}}$ and $H$ query heads (each with head dimension $d$), the projections are:
- $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d}$, $i = 1, \dots, H$: per-head query projections
- $W^K \in \mathbb{R}^{d_{\text{model}} \times d}$: single shared key projection
- $W^V \in \mathbb{R}^{d_{\text{model}} \times d}$: single shared value projection
The computation proceeds as follows. The projections are applied as $Q_i = X W^Q_i$, $K = X W^K$, and $V = X W^V$.
Each head computes the attention as
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{d}}\right) V.$$
The final output is obtained by concatenating all $\mathrm{head}_i$ over $i = 1, \dots, H$ and projecting via $W^O$ (Gumaan, 16 Dec 2025, Ainslie et al., 2023).
In this scheme, all heads attend over the same key and value content but with different queries, striking a balance between representational diversity and memory efficiency.
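A minimal PyTorch sketch of this computation is given below. It is an illustration under simplifying assumptions (batch-first input of shape (B, T, d_model), a fused per-head query projection, no masking or dropout), not code from the cited works.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Minimal MQA sketch: H query heads, one shared key/value head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.d = d_model // n_heads                           # head dimension d
        self.w_q = nn.Linear(d_model, d_model, bias=False)    # H per-head query projections, fused
        self.w_k = nn.Linear(d_model, self.d, bias=False)     # single shared key projection
        self.w_v = nn.Linear(d_model, self.d, bias=False)     # single shared value projection
        self.w_o = nn.Linear(d_model, d_model, bias=False)    # output projection W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.h, self.d).transpose(1, 2)  # (B, H, T, d)
        k = self.w_k(x).unsqueeze(1)                                 # (B, 1, T, d), shared by all heads
        v = self.w_v(x).unsqueeze(1)                                 # (B, 1, T, d), shared by all heads
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)         # (B, H, T, T); K broadcasts over heads
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                               # (B, H, T, d); V broadcasts over heads
        out = out.transpose(1, 2).reshape(b, t, self.h * self.d)     # concatenate heads
        return self.w_o(out)
```

For example, `MultiQueryAttention(512, 8)` applied to a tensor of shape `(2, 16, 512)` returns a tensor of the same shape while producing only one key/value head to cache.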
2. Correspondence with Other Attention Mechanisms
The spectrum of attention mechanisms includes:
- Multi-Head Attention (MHA): Each of the $H$ heads uses independent query, key, and value projections ($W^Q_i$, $W^K_i$, $W^V_i$).
- Grouped-Query Attention (GQA): The query heads are divided into $G$ disjoint groups, each group sharing a single key and value projection. MQA is the case $G = 1$ (Ainslie et al., 2023, Joshi et al., 2024).
- Sparse MQA: Used here as a synonym for standard MQA, highlighting the sparsity in the key/value projections.
The memory cost per layer for key-value storage (KV-cache) scales as follows:

| Mechanism | #KV heads per layer | KV-cache per layer | Compute profile |
|-----------|---------------------|--------------------|-----------------|
| MHA | $H$ | $2BTHd$ | dominated by Q-projections |
| GQA ($G$ groups) | $G$ | $2BTGd$ | ≈ same as MHA |
| MQA | $1$ | $2BTd$ | ≈ same as MHA |
Here, $B$ = batch size, $T$ = sequence length, $d_{\text{model}}$ = hidden dimension, and $d$ = head dimension.
The principal efficiency advantage of MQA is a factor-of-$H$ reduction in KV-cache memory compared to MHA (Gumaan, 16 Dec 2025, Ainslie et al., 2023).
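To make the grouping concrete, the sketch below (illustrative only, not code from the cited papers) broadcasts $G$ cached key/value heads to $H$ query heads and prints the per-layer cache sizes from the table above; $G = H$ recovers MHA and $G = 1$ recovers MQA.

```python
import torch

def expand_kv(kv: torch.Tensor, n_query_heads: int) -> torch.Tensor:
    """Broadcast G cached key/value heads to H query heads (requires H % G == 0).

    kv: (B, G, T, d) -> (B, H, T, d); each KV head serves H // G query heads.
    G = H corresponds to MHA, 1 < G < H to GQA, and G = 1 to MQA.
    """
    b, g, t, d = kv.shape
    assert n_query_heads % g == 0
    return kv.repeat_interleave(n_query_heads // g, dim=1)

# Per-layer cache sizes (before expansion) match the table above: 2*B*T*G*d floats.
B, T, H, d = 1, 1024, 16, 64
for name, g in [("MHA", H), ("GQA, G=4", 4), ("MQA", 1)]:
    k_cache = torch.zeros(B, g, T, d)            # cached keys; the value cache has the same size
    k_full = expand_kv(k_cache, H)               # what the H query heads actually attend over
    print(f"{name:10s} cached floats: {2 * k_cache.numel():>9d}  expanded shape: {tuple(k_full.shape)}")
```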
3. Memory–Latency–Quality Trade-offs
The utility of MQA and its generalizations is rooted in the KV-cache bottleneck during autoregressive inference, where for each new token all past key and value vectors must be read; the per-layer storage requirements (illustrated in the worked example after this list) are:
- MHA: $2 B T H d$ floats per layer
- MQA: $2 B T d$ floats per layer
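As a worked illustration with a hypothetical configuration chosen only for arithmetic clarity ($B = 1$, $T = 4096$, $H = 32$, $d = 128$, 2-byte fp16 storage), the per-layer KV-cache is

$$\text{MHA: } 2BTHd = 2 \cdot 1 \cdot 4096 \cdot 32 \cdot 128 \approx 3.4 \times 10^{7} \text{ values} \approx 64\ \text{MiB},$$
$$\text{MQA: } 2BTd = 2 \cdot 1 \cdot 4096 \cdot 128 \approx 1.0 \times 10^{6} \text{ values} \approx 2\ \text{MiB},$$

an $H$-fold (here 32×) reduction per layer, compounded across all decoder layers.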
On large LLMs such as T5-XXL, this enables up to a 6.3× speedup in decoding (Ainslie et al., 2023). However, the memory advantage comes with some model-quality degradation: on standard summarization/translation/QA benchmarks, MQA exhibits an absolute drop of roughly $0.6$ points in ROUGE/BLEU/F1 compared to MHA (e.g., $47.2$ for MHA vs. $46.6$ for MQA on T5-XXL). GQA with an intermediate number of groups offers a quality–performance balance between the two extremes (Ainslie et al., 2023).
Empirical results on WikiText-2 with MoAS show that pure MQA incurs a measurable increase in validation loss over full MHA in exchange for the maximal KV-cache reduction (Gumaan, 16 Dec 2025). In the context of Llama2-7B, similar scaling behavior is evident (Joshi et al., 2024).
4. Conversion and Deployment of MQA
Sparse MQA can be instantiated via direct conversion from existing MHA checkpoints (termed "uptraining") without full retraining:
- Mean-pool MHA key/value weights: for an MHA checkpoint with per-head projections $W^K_i, W^V_i$, set $W^K = \frac{1}{H}\sum_{i=1}^{H} W^K_i$ and $W^V = \frac{1}{H}\sum_{i=1}^{H} W^V_i$.
- Discard the per-head key/value projections; keep the query projections $W^Q_i$ unchanged.
- Uptrain for a small fraction (≈5%) of the original pretraining compute with the original hyperparameters (Ainslie et al., 2023).
The uptraining procedure restores most of the quality lost by strict conversion; on T5-XXL, for example, quality recovers from the initial post-conversion drop to near-MHA levels after 5% additional compute. A minimal sketch of the conversion step follows.
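The sketch below assumes the MHA weights are available as stacks of per-head matrices of shape $(H, d_{\text{model}}, d)$; this layout and the function name are illustrative assumptions, not details from Ainslie et al. (2023).

```python
import torch

def mha_to_mqa_kv(w_k_heads: torch.Tensor, w_v_heads: torch.Tensor):
    """Strict MHA -> MQA conversion of key/value projections by mean pooling.

    w_k_heads, w_v_heads: (H, d_model, d) stacks of per-head projection matrices.
    Returns a single shared (d_model, d) key and value projection; query and
    output projections are kept unchanged. Uptraining (~5% of the original
    pretraining compute) then recovers most of the quality lost here.
    """
    w_k_shared = w_k_heads.mean(dim=0)   # average the H key projections
    w_v_shared = w_v_heads.mean(dim=0)   # average the H value projections
    return w_k_shared, w_v_shared
```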
5. Modeling Considerations and Adaptive Extensions
MQA is a fixed grouping scheme and does not optimize the trade-off between memory footprint and quality for a given application. Adaptive variants such as Grouped-Query Attention (GQA) and Quality- and Capacity-Aware Grouped Query Attention (QCQA) generalize MQA by flexibly partitioning query heads into higher-quality groupings that use more than one key/value head per layer. QCQA further exploits evolutionary search with a weight-sharing error (WSE) proxy objective to identify Pareto-optimal groupings, yielding significant quality improvements at similar or reduced KV-cache size: on Llama2-7B, QCQA achieves higher accuracy than GQA at the same cache size and requires less cache to match GQA quality (Joshi et al., 2024).
Mixture-of-Attention-Schemes (MoAS) introduces token-level dynamic routing between MQA, GQA, and MHA branches, with a learned router MLP selecting the optimal scheme per token, yielding further gains in efficiency while maintaining competitive modeling quality (Gumaan, 16 Dec 2025).
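A schematic sketch of the token-level routing idea follows. It is illustrative only; the router width, soft mixing, and branch interfaces are assumptions rather than the published MoAS design.

```python
import torch
import torch.nn as nn

class AttentionSchemeRouter(nn.Module):
    """Per-token routing between attention branches (e.g., MQA / GQA / MHA), MoAS-style.

    A small router MLP scores each scheme per token; here the branch outputs
    are mixed with the softmaxed scores. Hard top-1 routing is another option.
    """

    def __init__(self, d_model: int, branches: nn.ModuleList, hidden: int = 64):
        super().__init__()
        self.branches = branches                      # e.g., [mqa_module, gqa_module, mha_module]
        self.router = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, len(branches)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.router(x).softmax(dim=-1)                     # (B, T, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)    # (B, T, d_model, n_branches)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)            # per-token weighted mix
```

Each branch could be an attention module such as the `MultiQueryAttention` sketch above, together with GQA and MHA counterparts; with hard top-1 routing, only the selected branch would be computed per token.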
6. Practical Implementation Guidelines
Implementing MQA entails:
- Using the same number of query heads $H$ as the MHA baseline.
- Setting a single shared key projection $W^K$ and value projection $W^V$ (each of shape $d_{\text{model}} \times d$) in place of the per-head $W^K_i, W^V_i$.
- Keeping the head dimension as in the MHA baseline (typically $d = d_{\text{model}}/H$).
- At inference, storing a single $(K, V)$ pair per token per layer in the KV-cache (see the decoding sketch after this list).
- For conditional computation, a small MLP router may be added if adopting MoAS-style adaptive routing.
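The decoding sketch below illustrates the cache bookkeeping (the tensor layout and function name are illustrative assumptions): each new token adds a single shared key and value vector to the cache, against which all $H$ query heads attend.

```python
import math
import torch
import torch.nn.functional as F

def mqa_decode_step(q_t, k_t, v_t, k_cache, v_cache):
    """One autoregressive decoding step with an MQA KV-cache.

    q_t: (B, H, 1, d) queries for the new token (one per query head).
    k_t, v_t: (B, 1, 1, d) shared key/value for the new token.
    k_cache, v_cache: (B, 1, T_past, d) cached keys/values (a single KV head).
    """
    k_cache = torch.cat([k_cache, k_t], dim=2)    # cache grows by only 2*d values per token
    v_cache = torch.cat([v_cache, v_t], dim=2)
    d = q_t.size(-1)
    scores = q_t @ k_cache.transpose(-2, -1) / math.sqrt(d)   # (B, H, 1, T_past+1); KV broadcasts over heads
    out = F.softmax(scores, dim=-1) @ v_cache                 # (B, H, 1, d)
    return out, k_cache, v_cache
```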
Expectations from deployment:
- An $H$-fold reduction in KV-cache storage.
- Minimal additional compute over MHA (projection FLOPs are dominated by the query projections).
- Quality degradation of $0.6$–$1$ points (ROUGE/BLEU/F1) on generative tasks, substantially recoverable via uptraining and minimized in adaptive/hybrid schemes (Gumaan, 16 Dec 2025, Ainslie et al., 2023).
7. Position in the Attention Efficiency Landscape
Sparse MQA represents an extremal point on the attention speed–quality Pareto frontier, enabling significant reductions in decoder inference latency and memory usage for large generative models. It is foundational for scalable, memory-constrained LLM inference. Research trends indicate an ongoing preference for grouped and adaptive attention schemes, such as QCQA and MoAS, which further balance efficiency and quality by leveraging more sophisticated groupings and per-token routing. These advances generalize MQA's central insight: aggressive KV-cache reduction is feasible with only a small, quantifiable, and adjustable loss in output quality (Joshi et al., 2024, Gumaan, 16 Dec 2025, Ainslie et al., 2023).