Mobile Multi-Query Attention (MQA)
- Mobile Multi-Query Attention (MQA) is an attention mechanism that shares a single key/value projection across multiple query heads, reducing model parameters.
- It significantly lowers memory consumption and speeds up inference on mobile and edge devices by diminishing cache storage needs.
- The trade-off between efficiency and minor accuracy loss has spurred enhancements like GQA and WGQA to regain performance while retaining resource savings.
Mobile Multi-Query Attention (MQA) refers to an attention mechanism within transformer architectures where multiple query heads attend to a single shared key and value projection, as opposed to the distinct key/value projections per head in standard multi-head attention. Designed primarily for lowering memory and computational costs, especially during autoregressive inference or on resource-constrained devices, MQA and its generalizations have become central to efficient transformer deployment, particularly in mobile and edge environments.
1. Formal Definition and Core Principles
The defining characteristic of Multi-Query Attention is the sharing of key and value heads across all query heads. Given an input sequence $X \in \mathbb{R}^{n \times d}$, standard multi-head attention (MHA) with $h$ heads computes
- $Q_i = X W_i^Q$,
- $K_i = X W_i^K$,
- $V_i = X W_i^V$.
Each head computes $\mathrm{head}_i = \mathrm{softmax}\!\left(Q_i K_i^\top / \sqrt{d_k}\right) V_i$.
In MQA, there are $h$ distinct query projections but only a single, shared key and value projection per layer: $K = X W^K$, $V = X W^V$.
Attention for query head $i$: $\mathrm{head}_i = \mathrm{softmax}\!\left(Q_i K^\top / \sqrt{d_k}\right) V$.
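In code, the shared projection amounts to computing $K$ and $V$ once per layer while each query head attends to them. A minimal NumPy sketch (shapes and names are illustrative, not from any cited implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa(X, Wq, Wk, Wv):
    """Multi-Query Attention: h query heads, one shared K/V projection.

    X  : (n, d) input sequence
    Wq : (h, d, dk) per-head query projections
    Wk : (d, dk) single shared key projection
    Wv : (d, dk) single shared value projection
    Returns (h, n, dk) per-head outputs.
    """
    K = X @ Wk                      # (n, dk), computed (and cached) once
    V = X @ Wv                      # (n, dk)
    dk = K.shape[-1]
    heads = []
    for Wq_i in Wq:                 # every query head attends to the same K/V
        Q_i = X @ Wq_i              # (n, dk)
        A = softmax(Q_i @ K.T / np.sqrt(dk))
        heads.append(A @ V)
    return np.stack(heads)

rng = np.random.default_rng(0)
n, d, h, dk = 6, 16, 4, 8
out = mqa(rng.normal(size=(n, d)),
          rng.normal(size=(h, d, dk)),
          rng.normal(size=(d, dk)),
          rng.normal(size=(d, dk)))
print(out.shape)   # (4, 6, 8)
```

Note that only the loop body differs from MHA: swapping the shared `K`, `V` for per-head `K_i = X @ Wk[i]`, `V_i = X @ Wv[i]` recovers standard multi-head attention.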
This approach reduces both the number of key/value parameters and, crucially, the amount of cache storage required during inference by a factor of $h$, making it particularly attractive for deployment in memory-constrained settings and in mobile architectures (Ainslie et al., 2023, Brandon et al., 2024).
2. Motivations for MQA in Mobile and Efficient Inference
Transformers, especially at large scale, are memory-bound at inference time. The requirement to cache all key and value activations for each attention head at every decoding step becomes a limiting factor for sequence length, batch size, and model size in real-time or on-device scenarios. MQA directly addresses this by enabling:
- Drastic reduction in per-token memory: For $h$ heads and sequence length $n$, the per-layer KV-cache in MHA is $O(n h d_k)$; in MQA it is $O(n d_k)$ (Chinnakonduru et al., 2024, Ainslie et al., 2023).
- Lower latency: Reducing memory footprint relieves DRAM bandwidth bottlenecks that dominate on mobile NPUs and DSPs (Qin et al., 2024).
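The cache arithmetic above can be made concrete with a small calculator; the model dimensions below are hypothetical, chosen only to exhibit the factor-of-$h$ reduction:

```python
def kv_cache_bytes(n_layers, n_tokens, n_kv_heads, head_dim,
                   batch=1, bytes_per_elem=2):
    """Per-request KV-cache size in bytes:
    2 (K and V) x layers x tokens x kv_heads x head_dim x batch x dtype."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * batch * bytes_per_elem

# Hypothetical 32-head fp16 model at 4k context: MHA caches all 32 KV heads,
# MQA caches a single shared one.
mha = kv_cache_bytes(n_layers=32, n_tokens=4096, n_kv_heads=32, head_dim=128)
mqa = kv_cache_bytes(n_layers=32, n_tokens=4096, n_kv_heads=1, head_dim=128)
print(mha / 2**30, "GiB vs", mqa / 2**30, "GiB")
print(mha / mqa)   # 32.0 -> cache shrinks by exactly a factor of h
```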
In MobileNetV4, Mobile MQA is implemented specifically to optimize memory traffic for visual transformers:
- Only $n+2$ projections ($n$ for query heads, one shared $K$, and one shared $V$) versus $3n$ in MHSA.
- Optionally downsamples spatially using depthwise strided convolution, further boosting operational intensity on mobile accelerators (Qin et al., 2024).
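The spatial downsampling step can be sketched as average pooling over the token grid; Mobile MQA actually uses a depthwise strided convolution, so this is only a stand-in for the shape bookkeeping:

```python
import numpy as np

def downsample_kv(tokens, H, W, stride=2):
    """Spatially downsample the tokens used for K/V (a stand-in for Mobile
    MQA's depthwise strided convolution): average-pool stride x stride
    patches, shrinking KV length -- and attention cost -- by stride**2."""
    d = tokens.shape[-1]
    grid = tokens.reshape(H, W, d)
    pooled = grid.reshape(H // stride, stride, W // stride, stride, d).mean(axis=(1, 3))
    return pooled.reshape(-1, d)

x = np.arange(16 * 16 * 8, dtype=float).reshape(256, 8)   # 16x16 token grid
kv = downsample_kv(x, H=16, W=16)
print(x.shape[0], "->", kv.shape[0])   # 256 -> 64: 4x fewer KV entries
```

Because queries keep full resolution while only K/V shrink, the output retains the full token count at a quarter of the attention cost.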
3. Empirical Performance, Trade-Offs, and Scaling Laws
The core trade-off with MQA is between memory efficiency and model quality:
- Accuracy Degradation: Sharing $K$/$V$ across all heads can impair representational fidelity, resulting in lower accuracy compared to MHA. For example, in T5-XXL, MQA achieves ∼46.6 average score versus 47.2 for full MHA on summarization, translation, and QA (Ainslie et al., 2023). In Pythia-160M, MQA-12 shows ∼49.43% accuracy versus ∼52.79% for baseline (Zuhri et al., 2024).
- Speed and Memory Gains: MQA runs up to 6× faster on memory-bound inference and, in OPT-175B, consumes 1.5 GB of KV-cache versus 144 GB for full MHA at a fixed batch size and sequence length (Zuhri et al., 2024).
Scaling laws are favorable: MQA provides a robust reduction in inference-time cost that scales with both model and sequence length, making larger architectures feasible on fixed memory (Ainslie et al., 2023, Brandon et al., 2024).
4. Generalizations: GQA, WGQA, and Quality-Aware Sharing
Grouped-Query Attention (GQA) interpolates between MQA and MHA by sharing key/value projections across groups of query heads, recovering part of the expressive power of MHA while retaining significant memory savings. Weighted Grouped-Query Attention (WGQA) augments GQA by learning per-head weights for the key and value averaging process, resulting in empirically superior results (WGQA improves on GQA by ≈0.5% in T5-base on ROUGE and BLEU) (Chinnakonduru et al., 2024).
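A sketch of the GQA forward pass, assuming the common assignment where consecutive query heads share a KV group (illustrative NumPy, not the cited implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(X, Wq, Wk, Wv):
    """Grouped-Query Attention sketch: h = len(Wq) query heads share
    g = len(Wk) key/value projections; g=1 is MQA, g=h is MHA."""
    h, g = len(Wq), len(Wk)
    Ks = [X @ W for W in Wk]        # only g K/V pairs are computed and cached
    Vs = [X @ W for W in Wv]
    dk = Ks[0].shape[-1]
    out = []
    for i in range(h):
        j = i * g // h              # query head i -> KV group j
        Q = X @ Wq[i]
        out.append(softmax(Q @ Ks[j].T / np.sqrt(dk)) @ Vs[j])
    return np.stack(out)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))
Wq = rng.normal(size=(4, 12, 8))    # 4 query heads
Wk = rng.normal(size=(2, 12, 8))    # 2 shared KV groups
Wv = rng.normal(size=(2, 12, 8))
print(gqa(X, Wq, Wk, Wv).shape)     # (4, 5, 8)
```

Passing `Wk[:1]`, `Wv[:1]` collapses this to the MQA case with no other changes.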
Quality and Capacity-aware Grouped Query Attention (QCQA) algorithms further refine the head grouping by employing quality-aware (evolutionary) optimization for the grouping structure, yielding up to 20% higher accuracy than GQA for the same cache in Llama2-7B without fine-tuning, and 10.6% higher accuracy after fine-tuning at 50% KV-cache (Joshi et al., 2024).
Low-Rank KV adaptation (LRKV) encompasses MQA as a special case; with moderate low-rank head-specific adaptations, it recovers almost all head diversity and downstream performance at much reduced cache size (O'Neill et al., 16 Jan 2026).
5. Fine-Tuning, Uptraining, and Deployment Considerations
MQA and its variants can be efficiently instantiated from existing multi-head checkpoints via a two-stage process:
- Checkpoint Conversion: Replace the per-head key/value projections with their mean (for MQA) or group-wise mean (for GQA).
- Uptraining: Further pre-train for 5% of the original compute budget to restore most of the lost quality (≥97% of full MHA performance is typical) (Ainslie et al., 2023).
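The conversion step can be sketched as a group-wise mean over the stacked per-head projection matrices (illustrative shapes; applied identically to $W^K$ and $W^V$):

```python
import numpy as np

def convert_mha_to_gqa(W_heads, n_groups):
    """Checkpoint conversion (Ainslie et al., 2023): group-wise mean of the
    per-head key (or value) projection matrices. n_groups=1 gives the MQA
    conversion; uptraining then restores most of the lost quality."""
    h, d, dk = W_heads.shape
    assert h % n_groups == 0, "head count must divide evenly into groups"
    return W_heads.reshape(n_groups, h // n_groups, d, dk).mean(axis=1)

Wk = np.random.default_rng(0).normal(size=(16, 64, 32))   # 16 MHA key heads
print(convert_mha_to_gqa(Wk, n_groups=4).shape)   # (4, 64, 32) -> GQA-4
print(convert_mha_to_gqa(Wk, n_groups=1).shape)   # (1, 64, 32) -> MQA
```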
WGQA introduces $2h$ new scalars per attention layer for fine-tuning, initialized as mean-pooling coefficients, and recovers ∼0.5% of performance with no extra inference burden (Chinnakonduru et al., 2024).
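A sketch of the weighted pooling WGQA learns, shown for a single group of one projection; initializing the weights at $1/\text{group size}$ reproduces the plain GQA mean, matching the mean-pooling initialization described above:

```python
import numpy as np

def wgqa_pool(W_heads, alphas):
    """WGQA-style pooling sketch: instead of a plain group mean, combine the
    heads in a group with learned scalar weights (the 2h scalars per layer:
    one set for K, one for V). `alphas` holds one group's weights."""
    w = np.asarray(alphas).reshape(-1, 1, 1)
    return (w * W_heads).sum(axis=0)

heads = np.random.default_rng(0).normal(size=(4, 64, 32))
uniform = wgqa_pool(heads, [0.25] * 4)          # initialization = GQA mean
assert np.allclose(uniform, heads.mean(axis=0))
```

During fine-tuning the scalars move away from uniform, after which they can be folded into the pooled weights, which is why inference cost is unchanged.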
On mobile platforms, architectures like MobileNetV4 implement Mobile MQA by combining multi-query key/value sharing with INT8 quantization, depthwise convolution spatial reduction, and operator fusion, yielding 39–84% end-to-end speedup with only ≈0.03pt drop in top-1 accuracy versus MHSA (Qin et al., 2024). Inference on memory-constrained hardware thus becomes possible for models and sequences infeasible with full MHA.
6. Limitations, Head Diversity, and Adaptive/Hybrid Schemes
Fully sharing keys/values (pure MQA) restricts head diversity: PCA-based measures show a drop from ∼94% rank in MHA to ∼91% for MQA. As a result, downstream task accuracy and perplexity also suffer, with MQA consistently worse than MHA or even GQA (O'Neill et al., 16 Jan 2026). While query heads can compensate by specializing, this does not fully close the gap.
Hybrid methods such as Mixture of Attention Schemes (MoAS) propose dynamic, per-token routing between MHA, GQA, and MQA via a learned router. This strategy achieves validation loss competitive with MHA while still enabling potential for conditional memory efficiency, although at present hard branching is not enforced during training (Gumaan, 16 Dec 2025).
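The soft routing used during MoAS training can be sketched as a per-token convex combination of the candidate schemes' outputs (function and variable names are hypothetical, not from the cited work):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moas_mix(outputs, router_logits):
    """MoAS-style soft routing sketch: per-token softmax weights over the
    candidate schemes' outputs (e.g. MHA, GQA, MQA). Training uses this soft
    mixture rather than enforcing a hard branch."""
    w = softmax(router_logits)             # (n_tokens, n_schemes)
    stacked = np.stack(outputs, axis=-1)   # (n_tokens, d, n_schemes)
    return (stacked * w[:, None, :]).sum(axis=-1)

rng = np.random.default_rng(0)
outs = [rng.normal(size=(5, 8)) for _ in range(3)]   # MHA, GQA, MQA outputs
mixed = moas_mix(outs, rng.normal(size=(5, 3)))
print(mixed.shape)   # (5, 8)
```

Conditional memory savings would require hardening the router (taking the argmax branch per token), which, as noted, is not yet enforced during training.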
For deployment, if absolute minimal cache is required and slight accuracy loss is acceptable, pure MQA is optimal. For higher quality at moderate cost, GQA, WGQA, QCQA, or low-rank LRKV with moderate rank offer practical operating points. Cross-layer KV sharing (CLA, MLKV) can further reduce cache requirements (e.g., CLA2 plus MQA yields a 32× cache reduction at ≤0.1 perplexity cost) (Brandon et al., 2024, Zuhri et al., 2024).
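The reductions stack multiplicatively, which is simple arithmetic to verify; the 16-head figure below is a hypothetical example consistent with the cited 32× number, not taken from the papers:

```python
def cache_reduction(n_heads, n_kv_heads=1, layer_sharing=1):
    """Combined KV-cache reduction from head sharing (MQA/GQA: n_heads /
    n_kv_heads) and cross-layer sharing (CLA/MLKV: layer_sharing layers
    share one cache). The two factors multiply."""
    return (n_heads / n_kv_heads) * layer_sharing

# Hypothetical 16-head model: MQA alone gives 16x; stacking CLA2
# (every 2 layers share one KV-cache) doubles it to 32x.
print(cache_reduction(n_heads=16))                    # 16.0
print(cache_reduction(n_heads=16, layer_sharing=2))   # 32.0
```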
7. Practical Recommendations and Application Domains
MQA and its extensions are integral to modern transformer deployment in memory-bound or latency-sensitive contexts such as mobile devices, edge accelerators (EdgeTPU, ANE, DSP), and large-scale inference with strict throughput requirements. Recipe and deployment guidance include:
- Use MQA for maximum cache reduction when quality can be modestly sacrificed.
- Prefer GQA or WGQA with moderate group counts for best memory–quality trade-off.
- With off-the-shelf models, uptrain MQA/GQA variants from MHA checkpoints, using mean pooling and 5% retraining for performance recovery.
- Integrate spatial-downsampling and quantization for hybrid vision or multimodal scenarios on mobile (Qin et al., 2024).
- For advanced scenarios, use LRKV with moderate rank or QCQA for Pareto-optimal accuracy/memory on LLMs.
Across the transformer landscape, MQA and its generalizations constitute the dominant paradigm for scaling attention under practical engineering constraints (Ainslie et al., 2023, Chinnakonduru et al., 2024, Qin et al., 2024, Joshi et al., 2024, Brandon et al., 2024, O'Neill et al., 16 Jan 2026).