Query-Key Biasing in Attention Models
- Query-Key Biasing (QKb) is an approach that modifies standard attention by introducing compositional, learnable, and data-driven biases in query-key interactions.
- It extends conventional self-attention using techniques such as key-only attention, additive biasing, grouped query allocation, and activation-based retrieval to enhance efficiency.
- Empirical results show that QKb methods improve image classification accuracy, reduce memory requirements in language models, and boost ASR performance while cutting computational costs.
Query-Key Biasing (QKb) encompasses a family of architectural manipulations within attention-based sequence models in which the standard, uniform interaction between queries and keys is replaced or augmented with compositional, learnable, or data-driven biases. These modifications introduce inductive structure, facilitate efficient retrieval, or condition model computations on external side information. QKb represents both a unifying abstraction across a range of model architectures and a practical methodology, extending from key-only attention in vision transformers and activation-driven retrieval in LLMs to diarization-based conditioning in speech recognition.
1. Principles and Theoretical Foundations
Conventional self-attention computes interactions via the scaled dot product $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top/\sqrt{d}\right)V$, where $Q \in \mathbb{R}^{n \times d}$ (queries), $K \in \mathbb{R}^{n \times d}$ (keys), and $V \in \mathbb{R}^{n \times d}$ (values). This formulation incurs $O(n^2 d)$ compute and $O(n^2)$ memory complexity, which is impractical for long sequences or large feature maps.
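For reference, a minimal NumPy sketch of this quadratic baseline (shapes and the toy example are illustrative, not drawn from any of the cited papers):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Baseline scaled dot-product attention: O(n^2 d) compute, O(n^2) memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) pairwise query-key logits
    weights = softmax(scores, axis=-1)   # row-wise attention distribution
    return weights @ V                   # (n, d) attended values

# toy example: n = 6 tokens, d = 4 dimensions
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
print(standard_attention(Q, K, V).shape)  # (6, 4)
```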
QKb intervenes at the $QK^\top$ interaction, introducing biases (learned, fixed, or data-adaptive) that:
- Replace pairwise dot products (key-only attention, single global gates)
- Modulate them (additive biases from side information)
- Select or reweight candidate KV-pairs (activation-aware or key-norm-driven grouping)

These mechanisms reduce computational burden, encode inductive priors, or adapt the model’s attention in accordance with auxiliary signals, all while preserving compatibility with Transformer-style workflows.
2. Model Instantiations and Algorithmic Mechanisms
2.1 Key-Only and Parametric Bias Variants
In "Rethinking Query-Key Pairwise Interactions in Vision Transformers" (Li et al., 2022), standard is eliminated entirely in favor of a key-weighted saliency gate: This produces linear complexity with respect to sequence length, and empirically preserves high task accuracy in vision domains. The design admits generalizations:
- Low-rank or factorized interactions (approximating the full pairwise $QK^\top$ matrix)
- Additive or multiplicative bias terms (e.g., entrywise bias matrices applied to the attention logits)

These provide a spectrum from purely global key gating to partial restoration of fine-grained query-key expressivity (Li et al., 2022).
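As a rough illustration of the key-only idea, the NumPy sketch below scores each token from its key alone and forms a single global context vector; the gate parameterization and the way the context is combined with $V$ are assumptions for illustration, not the exact formulation of Li et al. (2022).

```python
import numpy as np

def key_only_attention(K, V, w):
    """Key-only saliency gate: per-token scores come from keys alone -> O(n d) cost.

    w is a learned scoring vector; no pairwise Q.K^T matrix is ever formed.
    """
    scores = K @ w                                   # (n,) one scalar per token
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                      # global saliency distribution
    context = alpha @ V                              # (d,) single global context vector
    return V + context                               # broadcast context to every token (assumed combination)

rng = np.random.default_rng(1)
K, V = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
w = rng.standard_normal(4)
print(key_only_attention(K, V, w).shape)  # (8, 4)
```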
2.2 Key-Norm and Grouped Query Allocation
Key-driven query allocation is central to the Dynamic Key-Distributed Grouped Query Attention (DGQA) mechanism (Khan et al., 2024). Given grouped keys $K_1, \dots, K_G$, one computes the per-group norms $\lVert K_g \rVert$ and adapts the assignment of query heads to each key group according to normalized importances, e.g. via min–max scaling $\tilde{n}_g = (\lVert K_g \rVert - \min_{g'} \lVert K_{g'} \rVert)/(\max_{g'} \lVert K_{g'} \rVert - \min_{g'} \lVert K_{g'} \rVert)$ or an exponential moving average of the group norms. Adaptive scheduling via EMA or windowed differences allows temporal evolution of the grouping, making the attention distribution responsive to both current and historical key statistics. Such biasing increases capacity for highly informative regions and reduces redundant computation for low-importance ones, resulting in empirical gains of up to +8% in vision classification tasks (Khan et al., 2024).
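A minimal NumPy sketch of key-norm-driven head allocation under min–max scaling; the grouping granularity and the rounding rule are illustrative assumptions, and the actual DGQA schedule in Khan et al. (2024) may differ.

```python
import numpy as np

def allocate_query_heads(grouped_keys, num_query_heads):
    """Assign query heads to key groups in proportion to min-max-scaled key norms."""
    norms = np.array([np.linalg.norm(K_g) for K_g in grouped_keys])
    lo, hi = norms.min(), norms.max()
    scaled = (norms - lo) / (hi - lo + 1e-8)          # min-max scaling to [0, 1]
    importance = scaled / (scaled.sum() + 1e-8)       # normalized importances
    alloc = np.maximum(1, np.round(importance * num_query_heads).astype(int))
    return alloc  # more heads for higher-norm groups; rounding may need a final adjustment

rng = np.random.default_rng(2)
groups = [rng.standard_normal((16, 64)) * s for s in (0.5, 1.0, 2.0)]
print(allocate_query_heads(groups, num_query_heads=8))
```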
2.3 Activation-Biased Retrieval in Long-Context LLMs
Activation-aware probe-query retrieval (Xiao et al., 19 Feb 2025) constructs a “probe-query” via an activation-bias metric. This probe-query is then used to select historical key-value pairs by similarity, focusing the memory budget on tokens exhibiting high “activation outlier” scores. At decoding, a dynamic key-value cut-off allocates KV retention per layer by entropy, balancing representational richness and efficiency (Xiao et al., 19 Feb 2025).
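The retrieval step can be sketched as follows: a probe-query that upweights activation-outlier dimensions scores the cached keys, and the top-k key-value pairs are retained. The outlier weighting used here is a stand-in assumption, not the exact activation-bias metric of Xiao et al. (19 Feb 2025).

```python
import numpy as np

def retrieve_kv(query, cached_keys, cached_values, k=4):
    """Select top-k historical KV pairs for an activation-biased probe-query.

    The per-dimension outlier weighting is an assumed stand-in for the paper's metric.
    """
    outlier_weight = np.abs(query) / (np.abs(query).mean() + 1e-8)  # emphasize outlier dimensions
    probe = query * outlier_weight                                  # biased probe-query
    sims = cached_keys @ probe                                      # similarity to each cached key
    top = np.argsort(sims)[-k:]                                     # indices of the top-k matches
    return cached_keys[top], cached_values[top]

rng = np.random.default_rng(3)
q = rng.standard_normal(8)
Ks, Vs = rng.standard_normal((32, 8)), rng.standard_normal((32, 8))
sel_k, sel_v = retrieve_kv(q, Ks, Vs)
print(sel_k.shape, sel_v.shape)  # (4, 8) (4, 8)
```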
2.4 Diarization-Conditioned Attention Bias
In DiCoW TS-ASR (Polok et al., 2024), QKb is realized as a learnable, diarization-driven bias added to the attention logits, $s_{ij} = q_i^\top k_j/\sqrt{d} + b_j$, where $b_j$ takes one value for target-speaker frames and another for non-target frames. This achieves soft attention masking toward the diarized target speaker, with the bias learnable during fine-tuning. Unlike hard masking or waveform zeroing, QKb allows the model to retain some interaction with non-target frames, which can improve disambiguation under overlap (Polok et al., 2024).
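A minimal sketch of the additive bias, assuming a single scalar offset per head applied to target-speaker frames; the sign convention and exact placement are illustrative rather than the precise DiCoW formulation.

```python
import numpy as np

def diarization_biased_attention(Q, K, V, target_mask, beta):
    """Soft-bias attention toward diarized target-speaker frames.

    target_mask: (n,) array, 1.0 where a frame belongs to the target speaker.
    beta: learnable scalar (one per head, assumed) added to logits of target frames.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                    # standard scaled dot-product logits
    logits = logits + beta * target_mask[None, :]    # soft bias, not a hard mask
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(4)
Q, K, V = (rng.standard_normal((10, 16)) for _ in range(3))
mask = (rng.random(10) > 0.5).astype(float)          # toy diarization output for the target speaker
print(diarization_biased_attention(Q, K, V, mask, beta=2.0).shape)  # (10, 16)
```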
2.5 Inter-Layer and Cross-Modal Query–Key Biasing
In WCTC-Biasing for ASR keyword adaptation (Nakagome et al., 2 Jun 2025), a wildcard-CTC keyword-spotting pass (the “query” stage) triggers inter-layer injection of bias vectors (the “key/value” stage) via a convex combination of the current model output and a keyword-derived one-hot vector, mapped back to hidden space for subsequent layers. This framing generalizes readily to encoder/decoder and cross-modal setups by defining Q and K/V as linear projections or contextual embeddings appropriate to the task (Nakagome et al., 2 Jun 2025).
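A schematic of the injection step under simplifying assumptions: at frames where the keyword-spotting stage fires, the layer's output distribution is mixed with a keyword-derived one-hot via a convex combination and projected back to hidden space. The back-projection matrix, mixing weight `lam`, and alignment inputs are hypothetical placeholders, not the actual WCTC-Biasing components.

```python
import numpy as np

def inject_keyword_bias(hidden, vocab_probs, keyword_ids, trigger_frames, W_back, lam=0.7):
    """Inter-layer keyword-bias injection at triggered frames.

    hidden:         (T, d) layer output
    vocab_probs:    (T, V) current model output distribution over the vocabulary
    keyword_ids:    token id per triggered frame (hypothetical keyword alignment)
    trigger_frames: frame indices where the wildcard-CTC query stage fired
    W_back:         (V, d) back-projection from vocabulary space to hidden space (assumed)
    """
    biased = hidden.copy()
    V = vocab_probs.shape[-1]
    for t, tok in zip(trigger_frames, keyword_ids):
        one_hot = np.zeros(V)
        one_hot[tok] = 1.0
        mixed = lam * one_hot + (1.0 - lam) * vocab_probs[t]  # convex combination
        biased[t] = mixed @ W_back                            # map back to hidden space
    return biased

rng = np.random.default_rng(5)
T, d, V = 12, 8, 20
hidden = rng.standard_normal((T, d))
probs = rng.dirichlet(np.ones(V), size=T)
W_back = rng.standard_normal((V, d))
print(inject_keyword_bias(hidden, probs, keyword_ids=[3], trigger_frames=[5], W_back=W_back).shape)  # (12, 8)
```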
3. Task-Specific Applications
3.1 Vision
Key-only attention, DGQA, KDGQA, and their variants have shown effectiveness in high-resolution image tasks including ImageNet classification, COCO object detection, and ADE20K segmentation. By explicitly biasing the receptive field and query allocation toward salient visual features, these approaches circumvent bottlenecks imposed by quadratic cost, allowing global context modeling at all network stages (Li et al., 2022, Khan et al., 2024).
3.2 Language Modeling
Long-context LLMs leverage QKb in the form of probe-query or key-norm-based KV caching to mitigate the impracticality of storing and attending to all past tokens. By adaptively focusing attention on information-dense or activation-outlier positions and discarding distractors, these methods improve resource efficiency (up to a 16× reduction in stored KV pairs and a 10.4% accuracy gain over “full-context” retrieval) without the need for retraining (Xiao et al., 19 Feb 2025).
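A toy illustration of entropy-driven per-layer budgeting, assuming that layers with more diffuse (higher-entropy) attention receive a larger share of the KV budget; the normalization and exact budget rule of Xiao et al. (19 Feb 2025) may differ.

```python
import numpy as np

def per_layer_kv_budget(attn_weights_per_layer, total_budget):
    """Split a global KV-retention budget across layers by mean attention entropy.

    attn_weights_per_layer: list of (n, n) row-stochastic attention matrices, one per layer.
    """
    entropies = np.array([
        -(W * np.log(W + 1e-12)).sum(axis=-1).mean()  # mean row entropy of the layer
        for W in attn_weights_per_layer
    ])
    share = entropies / entropies.sum()               # higher entropy -> larger share
    return np.maximum(1, np.round(share * total_budget).astype(int))

rng = np.random.default_rng(6)
layers = [rng.dirichlet(np.ones(16), size=16) for _ in range(4)]
print(per_layer_kv_budget(layers, total_budget=64))   # e.g. roughly equal shares for similar layers
```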
3.3 Speech Recognition
In ASR, both diarization-conditioned QKb (Polok et al., 2024) and inter-layer biasing (Nakagome et al., 2 Jun 2025) realize QKb as plug-and-play contextual adaptation. The result is enhanced recognition of rare or out-of-vocabulary keywords (+29% OOV F1 in Japanese ASR (Nakagome et al., 2 Jun 2025)) and targeted speaker transcription, while maintaining robustness on single-speaker inputs. Injection of external knowledge or trigger-derived bias is achieved with minimal architectural modification and often without retraining.
4. Computational Complexity and Trade-Offs
A principal motivation for QKb is the reduction of compute and memory cost:
- Key-only attention: $O(nd)$ time and $O(n)$ memory, versus $O(n^2 d)$ time and $O(n^2)$ memory for standard attention (Li et al., 2022).
- Grouped query attention: Key-norm-driven grouping matches the GQA parameter count with negligible overhead, while PGQA adds minimal per-forward-pass cost for noise generation (Khan et al., 2024).
- Activation-aware retrieval: Memory and retrieval costs are dynamically budgeted per-layer by entropy, leading to up to 80% GPU memory savings and improved factual accuracy (Xiao et al., 19 Feb 2025).
- Biasing in ASR: Additive attention bias introduces only a single extra learnable parameter per head, making the method efficiently scalable to deep, multi-headed architectures (Polok et al., 2024).
These computational gains must be balanced against possible loss of fine-grained pairwise interactions; hybrid strategies (low-rank, sparse, or mixed bias) provide intermediate trade-offs (Li et al., 2022).
5. Empirical Performance and Limitations
QKb mechanisms, across tasks and modalities, consistently demonstrate substantial empirical improvement when compared to uniform or unstructured alternatives:
- In vision: LinGlo-b2 achieves 82.7% ImageNet top-1 (vs 75.1% for PVT-Tiny), and DGQA increases ViT-L accuracy by up to +8% on Tiny ImageNet (Li et al., 2022, Khan et al., 2024).
- In language modeling: ActQKV achieves 49.40% (Long-Bench) vs. 36.18–47.24% for strong baselines at a fraction of memory cost (Xiao et al., 19 Feb 2025).
- In speech/ASR: WCTC-Biasing achieves a 29% improvement in OOV F1 and QKb in DiCoW attains large WER reductions; however, a strong initial bias can cause hallucinations or attention blind spots without additional architectural adaptations (e.g., positional embedding shifts) (Nakagome et al., 2 Jun 2025, Polok et al., 2024).
Noted limitations include dependence on the quality of auxiliary signals (e.g., diarization in TS-ASR), slower convergence or a lower performance ceiling compared to more expressive (but heavier) mechanisms, and the fixed nature of key-only bias for tasks requiring nuanced context (Polok et al., 2024).
6. Generalizations and Future Directions
QKb frameworks are readily extended to diverse architectures:
- Transformer encoders/decoders, RNN-Transducer, and Masked CTC models (Nakagome et al., 2 Jun 2025)
- Any setting where a “query” can trigger the injection or reweighting of a contextual “key/value” (e.g., via cross-modal signals, external memory, or side-information)

Key design choices include:
- Efficient trigger and bias computation (Q stage)
- Projection and injection strategy for bias features (K/V stage)
- Gating and scheduling (e.g., convex-combination weighting, dynamic budget allocation)
- Choice of layer(s) for bias insertion
Work remains in learning QKb parameters from soft or continuous signals, optimizing efficiency in very deep multi-headed networks, and integrating QKb end-to-end with other adaptation, regularization, or memory modules (Polok et al., 2024). The QKb abstraction supports a systematic taxonomy of attention biasing methodologies and is anticipated to drive future research into plug-and-play contextual adaptation across vision, language, and speech domains.