
Frozen-QK Transformer Model

Updated 15 November 2025
  • The paper shows that freezing the key and query weights still allows induction heads to form, supporting effective in-context reasoning.
  • The methodology uses fixed random feature maps for Q/K with trainable value projections and MLPs, achieving 24–32% faster training.
  • Empirical results confirm that Frozen-QK models perform close to standard transformers on language, algorithmic, and retrieval tasks, consistent with the universal approximation guarantee.

The Frozen-QK model is a transformer variant characterized by freezing the key and query projection matrices at random initialization, while leaving the value projections and all MLP weights trainable. This architectural design challenges the conventional understanding that fully trainable attention is essential for effective sequence modeling. Empirical and theoretical results demonstrate that Frozen-QK transformers are capable of forming induction heads, support strong in-context reasoning, and maintain competitive performance on language modeling and algorithmic tasks, all while offering practical computational benefits.

1. Formulation of Frozen-QK Attention

In standard multi-head self-attention, a hidden state matrix $h_\ell \in \mathbb{R}^{n\times m}$ is projected to query, key, and value representations for each attention head: $Q_\ell = W^Q_\ell\,h_\ell$, $K_\ell = W^K_\ell\,h_\ell$, $V_\ell = W^V_\ell\,h_\ell$, where $W^Q_\ell, W^K_\ell \in \mathbb{R}^{n_h\times n}$ and $W^V_\ell \in \mathbb{R}^{n_h\times n}$, with $H$ heads and $H n_h = n$. The per-head attention output is

$$\mathrm{Attn}(h_\ell) = \mathrm{Softmax}\bigl(Q_\ell^\top K_\ell/\sqrt{n_h}\bigr)\,V_\ell^\top \in \mathbb{R}^{m\times n_h},$$

with outputs concatenated over heads.

In the Frozen-QK model, both $W^Q_\ell$ and $W^K_\ell$ are sampled once from $\mathcal{N}(0,1/n)$ at initialization and held fixed. Only $W^V_\ell$ and all MLP weights remain trainable. This restriction turns the attention mechanism into a form of random feature map, while preserving standard optimization over value projections and the residual–MLP pathway.
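
To make the setup concrete, the following is a minimal PyTorch sketch of a Frozen-QK attention layer, not the authors' released implementation: the class name `FrozenQKAttention`, the causal mask, and the output projection are illustrative assumptions; only the frozen $W^Q$/$W^K$ draws from $\mathcal{N}(0,1/n)$ and the trainable $W^V$ follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenQKAttention(nn.Module):
    """Multi-head attention with frozen random Q/K projections and trainable V."""
    def __init__(self, n: int, num_heads: int):
        super().__init__()
        assert n % num_heads == 0
        self.num_heads, self.n_h = num_heads, n // num_heads

        # W^Q, W^K ~ N(0, 1/n), sampled once and never updated (requires_grad=False).
        self.wq = nn.Parameter(torch.randn(num_heads, self.n_h, n) / n ** 0.5,
                               requires_grad=False)
        self.wk = nn.Parameter(torch.randn(num_heads, self.n_h, n) / n ** 0.5,
                               requires_grad=False)
        # W^V and the output projection remain trainable.
        self.wv = nn.Parameter(torch.randn(num_heads, self.n_h, n) / n ** 0.5)
        self.out_proj = nn.Linear(n, n)  # assumed; the paper may omit or fold this in

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, n)
        b, m, _ = h.shape
        q = torch.einsum("hdn,bmn->bhmd", self.wq, h)   # (b, H, m, n_h)
        k = torch.einsum("hdn,bmn->bhmd", self.wk, h)
        v = torch.einsum("hdn,bmn->bhmd", self.wv, h)

        scores = q @ k.transpose(-1, -2) / self.n_h ** 0.5          # (b, H, m, m)
        mask = torch.triu(torch.ones(m, m, dtype=torch.bool, device=h.device), 1)
        attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, m, -1)          # concat heads
        return self.out_proj(out)
```

Because the frozen matrices carry `requires_grad=False`, only the value projection, output projection, and downstream MLP weights accumulate gradients during training.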

2. Mechanistic Insights and Induction Heads

Despite the “frozenness” of $W^Q_\ell$ and $W^K_\ell$, the network retains the ability to form induction heads, a key mechanism underlying in-context learning and copying in transformers. This emerges because the intermediate representations $h_\ell$ evolve throughout training, so $Q_\ell$ and $K_\ell$ remain input-dependent via $h_\ell$, even with fixed projections. Empirically, in retrieval and multi-hop induction tasks, Frozen-QK networks develop heads that capture token-copying behavior: for instance, in a two-layer, two-head setting, distinct heads learn to attend backward along tokens and propagate relevant context forward. These learned circuits approach the performance of fully trainable transformers on sequence modeling tasks where copying and induction are critical.
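
As an illustration of the kind of probe used to surface this behavior, the snippet below generates a simple synthetic retrieval/induction task (plant a key–value pair, repeat the key at the end, predict the value). The vocabulary size, sequence length, and sampling scheme are assumptions for illustration, not the paper's exact task specification.

```python
import torch

def make_induction_batch(batch_size: int, seq_len: int, vocab: int, device="cpu"):
    """Sequences of random tokens with one planted (key, value) pair; the key is
    repeated as the final token and the target is the planted value."""
    seq = torch.randint(0, vocab, (batch_size, seq_len), device=device)
    key = torch.randint(0, vocab, (batch_size,), device=device)
    val = torch.randint(0, vocab, (batch_size,), device=device)
    # Place the pair somewhere in the first half of the sequence.
    pos = torch.randint(1, seq_len // 2, (batch_size,), device=device)
    rows = torch.arange(batch_size, device=device)
    seq[rows, pos] = key
    seq[rows, pos + 1] = val
    seq[:, -1] = key  # an induction head should attend back to position pos + 1
    # Accidental collisions between filler tokens and the key are ignored here
    # for simplicity; a large vocabulary makes them rare.
    return seq, val   # evaluate next-token prediction at the final position
```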

3. Theoretical Expressivity

The expressivity of Frozen-QK transformers is formalized by a universal approximation theorem:

Theorem (Universal Approximation of Frozen-QK):

Let $K \subset (\mathbb{R}^n)^m$ be compact. Any continuous causal function $f: K \rightarrow (\mathbb{R}^n)^m$, where the $i$th output depends only on inputs up to position $i$, can be approximated arbitrarily well in sup-norm by a one-layer multi-head attention + MLP network with fixed random $W^Q$, $W^K$ and trainable $W^V$ and MLP weights.

The proof leverages the connection to random feature expansions: each Frozen-QK attention head defines a feature map $g_i(h) = h\,\mathrm{Softmax}\bigl((W^Q_i h)^\top(W^K_i h)/\sqrt{n_h}\bigr)$, with the concatenation over heads serving as the input to a trainable MLP readout. By classical results on random feature ridge regression, as the number of heads increases, such random-feature networks inherit universal approximation properties from their fully trainable counterparts. Thus, even with random attention scores, the architecture remains capable of representing any continuous causal sequence function, provided sufficient capacity.
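
A sketch of this random-feature view, assuming an arbitrary head count and MLP width: each frozen head produces $g_i(h)$ as defined above, and only the MLP readout over the concatenated features is trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def frozen_head_features(h, wq, wk, n_h):
    # h: (m, n) for a single sequence (transposed relative to the paper's n x m
    # convention); wq, wk: (H, n_h, n) frozen random draws.
    q = torch.einsum("hdn,mn->hmd", wq, h)                    # (H, m, n_h)
    k = torch.einsum("hdn,mn->hmd", wk, h)
    attn = F.softmax(q @ k.transpose(-1, -2) / n_h ** 0.5, dim=-1)  # (H, m, m)
    feats = attn @ h                                          # g_i(h): (H, m, n)
    return feats.permute(1, 0, 2).reshape(h.shape[0], -1)     # (m, H*n) concatenation

# Fixed random projections; only the MLP readout below is trained.
n, n_h, H, m = 64, 16, 8, 32
wq = torch.randn(H, n_h, n) / n ** 0.5
wk = torch.randn(H, n_h, n) / n ** 0.5
readout = nn.Sequential(nn.Linear(H * n, 256), nn.GELU(), nn.Linear(256, n))

h = torch.randn(m, n)                                         # toy hidden states
y_hat = readout(frozen_head_features(h, wq, wk, n_h))         # (m, n)
```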

4. Empirical Performance Across Tasks

Frozen-QK models have been systematically evaluated on in-context reasoning, algorithmic, language modeling, and memorization tasks. Key experimental findings include:

| Task | Standard | Frozen-QK | MixiT |
|---|---|---|---|
| Retrieval (acc %) | 100 | 97.0 | 11.2 |
| Hop$_k$ Induction (acc %) | 99.99 | 96.7 | 48.6 |
| Wikitext-103 (log-ppl) | 2.78 | 3.07 | 3.73 |
| Fineweb-edu (log-ppl) | 3.05 | 3.16 | 4.08 |
| Decimal addition, 10-digit (acc %) | 100 | 100 | varies with head count (see Section 5) |
| Dyck-1 Parentheses (acc %) | 95.8 | 97.4 | 96.2 |
| Yelp sentiment (acc %) | 90.6 | 90.9 | 92.6 |
  • On language modeling (Wikitext-103, Fineweb-edu), Frozen-QK achieves log-perplexities within 10% of the standard transformer.
  • In key retrieval and multi-hop induction tasks, Frozen-QK accuracy is within a few percent of the standard, indicating robust in-context reasoning.
  • For tasks such as decimal and modular addition, all models except MixiT reach perfect accuracy.
  • Bits-per-parameter in memorization: Standard (2.98), Frozen-QK (2.25), Frozen-MLP (1.13), MixiT (2.18), establishing that MLPs provide the main memory substrate, with complementary improvements from trainable or structured attention.
  • Training throughput improves substantially: Frozen-QK and MixiT train 24–32% faster on language modeling benchmarks, due to reduced gradient computation for frozen weights (see the sketch after this list).
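
The throughput gain is mechanical: frozen tensors need no gradients and no optimizer state. The self-contained sketch below (dimensions and the choice of AdamW are arbitrary assumptions) shows how this looks in practice.

```python
import torch
import torch.nn as nn

class TinyFrozenQKBlock(nn.Module):
    """One illustrative block: frozen Q/K projections, trainable V projection and MLP."""
    def __init__(self, n: int = 512, n_h: int = 64):
        super().__init__()
        self.wq = nn.Parameter(torch.randn(n_h, n) / n ** 0.5, requires_grad=False)
        self.wk = nn.Parameter(torch.randn(n_h, n) / n ** 0.5, requires_grad=False)
        self.wv = nn.Parameter(torch.randn(n_h, n) / n ** 0.5)
        self.mlp = nn.Sequential(nn.Linear(n, 4 * n), nn.GELU(), nn.Linear(4 * n, n))

block = TinyFrozenQKBlock()
trainable = [p for p in block.parameters() if p.requires_grad]
frozen = [p for p in block.parameters() if not p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")
print(f"frozen parameters:    {sum(p.numel() for p in frozen):,}")

# Only the trainable tensors are handed to the optimizer, so no gradients, no
# backward accumulation, and no Adam moment buffers exist for the frozen Q/K.
optimizer = torch.optim.AdamW(trainable, lr=3e-4)
```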

Increasing the number of heads in MixiT (another random-attention model with static mixing matrices) directly enhances its performance on algorithmic tasks where static patterns suffice, but fails to match Frozen-QK or standard attention on complex context-dependent tasks.

5. Comparison with MixiT and Other Random Attention Models

The MixiT architecture fixes the entire attention pattern to a random, input-independent matrix adjusted to maintain row sums of one. Its attention operation is $\mathrm{Attn}(h_\ell) = W^V_\ell h_\ell (I_m + W^M_\ell - \overline{W}^M_\ell)$, with $W^V_\ell$ trainable, $W^M_\ell$ sampled from $\mathcal{N}(0,1/\sqrt{nm})$, and $\overline{W}^M_\ell$ enforcing causal-style row normalization. Theoretical analysis shows that MixiT’s empirical covariance evolves according to a specifically scaled SDE, ensuring robust signal propagation at depth, unlike naïve random frozen transformers, which exhibit rapid rank collapse.
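
A loose sketch of such a static-mixing layer, written in the (sequence, feature) convention, is given below. The exact construction of $\overline{W}^M_\ell$ in the paper may differ from the causal row-mean shift used here, which simply makes each row of the fixed mixer sum to one.

```python
import torch
import torch.nn as nn

class MixiTAttention(nn.Module):
    """Static random mixing over positions: trainable values, frozen mixing matrix."""
    def __init__(self, n: int, n_h: int, seq_len: int):
        super().__init__()
        self.wv = nn.Parameter(torch.randn(n_h, n) / n ** 0.5)   # trainable W^V
        # W^M with std (n * m)^{-1/4}, i.e. variance 1/sqrt(nm), restricted to the
        # causal (lower-triangular) part.
        causal = torch.tril(torch.ones(seq_len, seq_len))
        wm = (torch.randn(seq_len, seq_len) / (n * seq_len) ** 0.25) * causal
        # \bar{W}^M (one possible reading): subtract each row's causal mean so that
        # W^M - \bar{W}^M has zero row sums; adding I_m then makes every row sum to one.
        counts = causal.sum(dim=1, keepdim=True)
        wm_bar = (wm.sum(dim=1, keepdim=True) / counts) * causal
        self.register_buffer("mixer", torch.eye(seq_len) + wm - wm_bar)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, n) -> (batch, seq_len, n_h); the mixing matrix is the
        # same for every input, unlike softmax attention.
        v = h @ self.wv.T
        return self.mixer @ v
```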

Empirically, MixiT:

  • Struggles with tasks requiring precise in-context retrieval (e.g., scoring 11.2% accuracy on retrieval).
  • Approaches standard models on algorithmic tasks, especially as the number of random heads increases (e.g., decimal addition 35%→92% as heads scale 4→256).
  • Avoids representation collapse at large depth, maintaining 100% accuracy on decimal addition up to 16 layers.

Frozen-QK is markedly more flexible than MixiT for language and context-sensitive tasks, because the residual stream and the learned value/MLP layers compensate for the frozen attention projections by exploiting the input-dependence of the queries and keys through $h_\ell$.

6. Functional and Architectural Implications

The Frozen-QK results elucidate the specialized roles within the transformer architecture:

  • Attention (especially trainable Q/K): Essential for complex in-context reasoning and high-fidelity language modeling via flexible, input-adaptive selection circuits.
  • MLP layers: Principal mechanism for memory storage, performing arbitrary memorization in the absence of trainable attention circuits.
  • Residual Streams and Circuit Formation: The capacity for forming induction circuits via the residual pathway and learned value/MLP parameters persists even when queries/keys are random, indicating an intrinsic architectural tendency (inductive bias) towards compositional and specialized “circuit” formation.

Notably, these observations suggest that “freeze-most” regimes may yield architectural and computational benefits for large-scale training, and that the emergence of circuits is a property of the architecture itself rather than a consequence of uniformly trainable parameters.

7. Limitations, Practical Gains, and Future Directions

While performance on many tasks remains competitive, Frozen-QK models show a modest loss in perplexity on open-domain language modeling (e.g., roughly 10% worse log-perplexity on Wikitext-103), and performance on complex reasoning or retrieval tasks does not fully match that of fully trainable transformers. Still, “freeze-most” approaches yield a 24–32% training speedup, with less computational overhead and the potential to simplify inference by avoiding dynamic key–value caches when the attention weights themselves are static.

Open research directions include:

  • Extending the analysis to sequence modeling tasks demanding rich, open-ended reasoning such as chain-of-thought.
  • Hybrid curricula where Q/K or MLP weights are selectively unfrozen over training.
  • Mechanistic interpretability studies probing circuits that emerge under different trainability regimes.
  • Efficient hardware implementations leveraging attention weight freezing.

These findings clarify the internal division of labor in transformers and support the design of architectures that balance expressivity, efficiency, and interpretability by tailoring which components are trainable.
