Keyformer: Efficient Transformer Enhancements

Updated 18 March 2026

Keyformer is a transformer enhancement paradigm that identifies salient tokens, frames, or entities to reduce compute load and memory use while enhancing task performance.
It achieves efficiency gains in generative inference by restricting the key-value cache to the most important tokens, cutting memory usage by up to 50% and speeding up processing over twofold.
In speech recognition and document extraction, Keyformer employs key frame selection and coarse-to-fine retrieval, leading to significant speedups and measurable improvements in error rates and F1 scores.

Keyformer refers to a family of transformer model enhancements that identify or select salient "key" elements (frames, tokens, or entities) to improve computational efficiency and task performance. The term covers distinct but conceptually related innovations in three domains: efficient end-to-end speech recognition, memory-reduced transformer inference, and key-value pair extraction from document images. This entry surveys the principal research lines, formal methods, empirical results, and commonalities among Keyformer approaches, with primary reference to (Fan et al., 2023), (Adnan et al., 2024), and (Hu et al., 2023).

1. Keyformer for KV Cache Reduction in Generative Inference

Keyformer, as described by (Adnan et al., 2024), targets the memory and compute bottlenecks in LLM autoregressive inference by restricting the key-value (KV) cache to only the most salient past tokens ("key tokens"). Empirical analysis revealed that a minority of past tokens (∼40%) account for approximately 90% of the total attention mass during generation.

Formally, at each decode step $t$, the attention scores are $S_t(i) = \mathrm{softmax}_i\left(Q_t K^\top / \sqrt{d_k}\right)$ for $i = 1, \dots, m$, where $m$ is the current cache size. The smallest index set $K_t \subset \{1, \dots, m\}$ with $\sum_{i \in K_t} S_t(i) \approx 0.9$ has size $|K_t| \approx 0.4m$, so keeping only those tokens retains most of the attention mass.
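The covering set $K_t$ can be illustrated with a minimal sketch in plain Python. The function names and the example logits are hypothetical and chosen only for illustration; the greedy "largest score first" rule is simply one direct way to find the smallest set reaching a given attention mass, not code from (Adnan et al., 2024).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smallest_covering_set(logits, mass=0.9):
    """Return the fewest token indices whose attention sums to at least `mass`."""
    probs = softmax(logits)
    # Taking tokens in order of decreasing attention gives the smallest set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        if covered >= mass:
            break
        chosen.append(i)
        covered += probs[i]
    return sorted(chosen), covered

# Hypothetical attention logits for a cache of m = 10 tokens.
logits = [4.0, 0.1, 3.5, 0.2, 0.0, 3.0, 0.1, 0.3, 0.2, 0.1]
key_tokens, covered = smallest_covering_set(logits, mass=0.9)
print(key_tokens, round(covered, 3))
```

In this toy example three of the ten tokens already carry more than 90% of the attention mass, which mirrors the skew reported in the text.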

Key tokens are selected by perturbing the unnormalized logits $x_t(i)$ with Gumbel noise and applying a softmax with an annealed temperature; the resulting values are accumulated into scores $f_{KF}(i)$. A "recent window" ensures that local context is always kept. Pruning maintains $k$ tokens in the cache (the most recent $w$ plus the top $(k-w)$ by $f_{KF}$), reducing cache size and bandwidth per step from $O(n)$ to $O(k)$ without retraining or architectural changes.
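The scoring and pruning rule can be sketched as follows. This is an illustrative sketch rather than the reference implementation: the helper names, the linear temperature schedule, and the random logits are assumptions, and a real system would apply the rule per head and per layer on tensors.

```python
import math
import random

def gumbel(rng):
    """Standard Gumbel sample via inverse transform: -log(-log(U))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep U strictly in (0, 1)
    return -math.log(-math.log(u))

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def update_scores(acc, logits, tau, rng):
    """Add one decode step's noisy, temperature-scaled softmax to the accumulators."""
    noisy = [(x + gumbel(rng)) / tau for x in logits]
    return [a + p for a, p in zip(acc, softmax(noisy))]

def select_cache(acc, k, w):
    """Keep the w most recent tokens plus the top (k - w) older tokens by score."""
    n = len(acc)
    if n <= k:
        return list(range(n))
    recent = list(range(n - w, n))
    older = range(n - w)
    top = sorted(older, key=lambda i: acc[i], reverse=True)[:k - w]
    return sorted(top) + recent

rng = random.Random(0)
acc = [0.0] * 8
for step in range(4):
    tau = 1.0 + 0.25 * step  # assumed linear annealing schedule
    logits = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for x_t(i)
    acc = update_scores(acc, logits, tau, rng)
kept = select_cache(acc, k=5, w=2)
print(kept)
```

The returned indices are positions in the current cache; the two most recent positions are always present, and the remaining slots go to the highest-scoring older tokens.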

2. Key Frame Mechanisms for Efficient Conformer ASR

In end-to-end automatic speech recognition, Keyformer describes a dual-encoder architecture that identifies "key frames" using intermediate Connectionist Temporal Classification (CTC) outputs (Fan et al., 2023).

Given per-frame logits $C \in \mathbb{R}^{T \times V}$ from the first encoder, a frame is marked as a key frame when its highest-scoring CTC label is not "blank": $P = \{ t \mid \operatorname*{arg\,max}_v\, C_{t,v} \neq \text{blank} \}$. Consecutive non-blank frames are collapsed per CTC rules. Both self-attention and acoustic feature processing in downstream encoder blocks are then restricted:

Key Frame-based Self-Attention (KFSA): Masks the attention so that each query attends only to positions near key frames and to key-frame positions globally. With local context width $w$, the mask is $M_{t_1, t_2} = 1$ if $|t_1 - t_p| \leq w$ for any key frame $t_p$ or if $t_2 \in P$, and $0$ otherwise.
Key Frame-based Downsampling (KFDS): Drops every frame that does not lie within a small window around a key frame. The resulting sequence length is at most $(2w+1)U$, where $U = |P|$.
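Key-frame detection and KFDS can be sketched in a few lines of plain Python. The function names and the toy logits are hypothetical, and one detail is an assumption: when several consecutive frames carry the same non-blank label, this sketch keeps the first frame of the run as the key frame, which is one plausible reading of "collapsed per CTC rules".

```python
def find_key_frames(logits, blank=0):
    """Return frame indices whose argmax label is non-blank, one per label run."""
    key_frames = []
    prev = blank
    for t, row in enumerate(logits):
        label = max(range(len(row)), key=lambda v: row[v])
        # Assumption: a run of identical non-blank labels yields one key frame,
        # placed at the first frame of the run.
        if label != blank and label != prev:
            key_frames.append(t)
        prev = label
    return key_frames

def kfds_indices(key_frames, num_frames, w):
    """Indices of frames kept by KFDS: those within w frames of a key frame."""
    keep = set()
    for p in key_frames:
        keep.update(range(max(0, p - w), min(num_frames, p + w + 1)))
    return sorted(keep)

def one_hot(label, vocab_size):
    return [1.0 if v == label else 0.0 for v in range(vocab_size)]

# Toy per-frame CTC labels (0 is blank), expanded to logits of shape T x V.
labels = [0, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 0]
logits = [one_hot(label, 3) for label in labels]

P = find_key_frames(logits)
kept = kfds_indices(P, num_frames=len(logits), w=1)
print(P, kept)
```

With $w = 1$ and $U = 2$ key frames, the kept sequence has 6 frames, which meets the $(2w+1)U$ bound exactly; overlapping windows or windows clipped at the sequence edges would give fewer.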

Empirically, on AISHELL-1 and LibriSpeech benchmarks, dropping over 60% of frames via KFDS yields up to 2.5–3× speedup at no loss, and sometimes a slight gain, in recognition accuracy.

3. Keyformer for Key-Value Pair Extraction in Form Documents

KVPFormer—termed "KeyFormer" in (Hu et al., 2023)—addresses key-value pair extraction from form-like document images. The process is cast as a question-answering task over spatially and semantically structured entities detected in the document.

Entities $E_1...E_n$ are embedded using a pre-trained LayoutLM or LayoutXLM backbone.
A spatial-aware transformer encoder, with attention biasing based on 2D geometry ( $r_{ij} \in \mathbb{R}^{18}$ ), classifies entities into keys (questions) and others.
A DETR-style transformer decoder takes key embeddings as queries and predicts values from all entity embeddings with cross-attention.
A two-stage answer prediction module: a coarse stage selects $K$ candidate values per key via a sigmoid-activated MLP, and a fine stage applies a softmax-activated MLP among candidates.
Loss is the sum of cross-entropy on question classification and answer stages.
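The coarse-to-fine answer prediction can be sketched as below. This is a simplified illustration, not the KVPFormer implementation: the two MLPs are replaced by precomputed logit lists, and all names and numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_to_fine(coarse_logits, fine_logits, top_k):
    """Pick the value entity for one key.

    coarse_logits[j] and fine_logits[j] stand in for the outputs of the
    coarse and fine MLPs for candidate entity j.
    """
    # Coarse stage: independent sigmoid scores, keep the top_k candidates.
    scores = [sigmoid(z) for z in coarse_logits]
    candidates = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_k]
    # Fine stage: softmax restricted to the surviving candidates.
    probs = softmax([fine_logits[j] for j in candidates])
    best = max(range(len(candidates)), key=lambda c: probs[c])
    return candidates[best], dict(zip(candidates, probs))

coarse_logits = [-2.0, 1.5, 0.3, 2.0, -1.0]
fine_logits = [0.0, 2.5, 3.0, 1.0, 4.0]
answer, candidate_probs = coarse_to_fine(coarse_logits, fine_logits, top_k=2)
print(answer, candidate_probs)
```

Entity 4 has the largest fine-stage logit in this example but is never considered, because the coarse stage filtered it out; the fine softmax only ranks the retained candidates.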

This architecture, including spatial compatibility attention bias and coarse-to-fine answer prediction, outperforms strong baselines by 3–7 F1 points on FUNSD and 1.85 F1 points (multi-task) or 0.87 points (zero-shot, average) on XFUND.

4. Computational and Architectural Implications

For all Keyformer variants, the central concept is model sparsification along a task-relevant dimension (frame, token, or entity) identified via weakly or strongly supervised intermediate signals (CTC, accumulated attention, entity type). In generative transformers (Adnan et al., 2024), memory and bandwidth savings are proportional to the KV reduction ratio $(1 - k/n)$. In speech models (Fan et al., 2023), quadratic-to-linear (KFDS) or quadratic-to-subquadratic (KFSA) attention complexity reductions are achieved. For document extraction (Hu et al., 2023), computation is focused by entity preselection and coarse-to-fine answer filtering.
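To make the $(1 - k/n)$ ratio concrete, the following sketch computes KV cache sizes with the standard accounting of one key and one value tensor per layer. The model dimensions, sequence lengths, and 2-byte values are illustrative assumptions and are not taken from the cited papers.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2, batch=1):
    """Size of a KV cache: one key and one value vector per token, head, and layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * batch

# Hypothetical decoder configuration (assumed, for illustration only).
config = dict(layers=28, kv_heads=16, head_dim=256)

n, k = 4096, 2048  # full context length vs. retained key tokens
full = kv_cache_bytes(n, **config)
pruned = kv_cache_bytes(k, **config)
saving = 1 - pruned / full  # equals 1 - k/n

print(f"full: {full / 2**30:.2f} GiB, pruned: {pruned / 2**30:.2f} GiB, saving: {saving:.0%}")
```

Because the cache size is linear in the number of stored tokens, the saving depends only on $k/n$ and not on the model dimensions.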

The KV cache method is agnostic to the position-encoding scheme: Keyformer (Adnan et al., 2024) explicitly addresses RoPE, learnable, and ALiBi schemes, requiring only that each retained token keep its original position tag (for ALiBi, the original positional distances must be preserved).
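Why original positions matter can be seen with ALiBi, whose bias is linear in the query-key distance. The sketch below is a hypothetical illustration (the function name, slope, and kept indices are assumptions): it compares the bias computed from original position tags with the bias obtained if the pruned cache were naively re-indexed.

```python
def alibi_bias(query_pos, key_positions, slope):
    """ALiBi-style bias: a penalty proportional to the query-key distance."""
    return [-slope * (query_pos - p) for p in key_positions]

original_positions = list(range(10))  # cache holds tokens at positions 0..9
kept = [0, 3, 4, 8, 9]                # hypothetical indices surviving pruning
kept_positions = [original_positions[i] for i in kept]

# The next query sits at original position 10.
preserved = alibi_bias(10, kept_positions, slope=0.5)

# Naive re-indexing treats the kept tokens as positions 0..4 and the query as 5.
reindexed = alibi_bias(len(kept), range(len(kept)), slope=0.5)

print(preserved)
print(reindexed)
```

Re-indexing shrinks the apparent distance to older tokens and so changes the attention bias the model was trained with, which is why the pruned cache carries the original position tags along with the keys and values.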

5. Empirical Evaluation and Benchmarks

Keyformer (KV Cache)

Model/Task	Quality Impact (vs. Full)	Max KV Reduction	Latency Speedup	Throughput Speedup
GPT-J, MPT, C-GPT	≥99% ROUGE, accuracy	50%	2.1×	2.4×

Key Frame Mechanism (ASR)

Method	% Frames Dropped	CER (AISHELL-1)	WER (LibriSpeech)
Vanilla	0	4.75%	3.18% / 8.72%
KFDS (w=1)	64.8%	4.52%	3.09% / 7.96%

KVPFormer (Doc Extraction)

Benchmark	Strong Baseline F1	KVPFormer F1	∆ F1
FUNSD	78.84	82.23	+3.4
XFUND (multi)	92.57	94.42	+1.85

6. Limitations and Applicability

Keyformer (KV reduction) incurs an $O(n)$ per-step maintenance cost for the score accumulators, but this remains negligible for typical head and dimension sizes compared to the $O(kd)$ matrix multiplications. For ASR, placement of the intermediate CTC layer is critical; placing it too early degrades CER by up to 0.2 percentage points. All methods assume fixed or pre-parsed context. The KV cache variant is applicable to any decoder-only transformer architecture (including multi-query and grouped-query attention), with no retraining or fine-tuning required for generative inference (Adnan et al., 2024).

Keyformer approaches exemplify the broader trend in neural model optimization towards competitive accuracy with lower resource budgets, accomplished by identifying and operating on the substructure most relevant to the modeled dependencies. Methods akin to Keyformer could guide sparse-attention training (Adnan et al., 2024) or inform efficient pruning schemes. In multimodal or multilingual document processing, entity-centric attention biasing, as with KVPFormer, augments transformer flexibility in structured extraction tasks (Hu et al., 2023). In speech, the integration of CTC-based intermediate signals for sparsification opens avenues for further model acceleration (Fan et al., 2023).


References

Fan et al. (2023). Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition.

Adnan et al. (2024). Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference.

Hu et al. (2023). A Question-Answering Approach to Key Value Pair Extraction from Form-like Document Images.
