- The paper demonstrates that softmax-attended language compressibility emerges from data rather than from the model’s architectural design.
- It employs rigorous spectral analysis (via SVD) across transformer heads, showing that nearly all variance of the data-generated logit field is captured in just a few singular vectors, while the learned weight spectra remain flat.
- Results imply that only data-adaptive compression schemes, not static low-rank projections, can effectively optimize KV-cache memory usage.
Compressibility of Softmax-Attended Language: Data Origin and Implications
Overview
The paper "Compressible Softmax-Attended Language under Incompressible Attention" (2604.04384) undertakes a thorough spectral analysis of softmax attention in LLMs, disentangling the contributions of model architecture from natural language data structure. The study demonstrates a significant spectral gap between the attention mechanism's learned interaction matrix and the attention logit field generated from real text, establishing that compressibility is a property emergent from data rather than from architectural constraints.
Decomposition of Attention Mechanism and Energy Field
The authors rigorously separate the attention logit field Ẽ, applied to specific language tokens during inference, into two principal components: one generated by the interaction of the input (token embeddings, transformed by the sequence context), and another encoded by the learned query and key weight matrices W_Q and W_K. The distinction enables isolation of compressibility as a phenomenon: does it arise from low effective rank encoded in the architecture, or from data-dependent activation patterns in the embedding manifold?
Formally, for an attention head with head dimension d_h, the learned interaction matrix is M = W_Q^T W_K (weight-based), while the row-centered logit matrix Ẽ (data-based) reflects query-key interactions from a particular context. SVD analysis on both reveals their spectral properties.
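The decomposition can be made concrete with a minimal sketch, assuming a single head with projection matrices W_Q and W_K of shape (d_model, d_h) and contextualized embeddings X of shape (n, d_model) for one context; the paper's exact preprocessing (e.g. logit scaling by 1/sqrt(d_h)) may differ.

```python
import numpy as np

def head_spectra(W_Q, W_K, X):
    """Spectra of the weight-based interaction matrix M = W_Q^T W_K and the
    data-based, row-centered logit matrix E~ for one attention head.

    W_Q, W_K : (d_model, d_h) learned projection matrices.
    X        : (n, d_model) contextualized token embeddings for one context.
    """
    # Weight-based object: fixed by the architecture, independent of the input.
    M = W_Q.T @ W_K                                 # (d_h, d_h)

    # Data-based object: pre-softmax attention logits for this context.
    Q, K = X @ W_Q, X @ W_K                         # (n, d_h) each
    E = Q @ K.T                                     # (n, n) logit field
    # Row-centering: softmax is invariant to a per-row constant shift.
    E_tilde = E - E.mean(axis=1, keepdims=True)

    # Singular values summarize the effective dimensionality of each object.
    s_M = np.linalg.svd(M, compute_uv=False)
    s_E = np.linalg.svd(E_tilde, compute_uv=False)
    return s_M, s_E
```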
Spectral Findings Across Models and Contexts
Extensive experiments across five transformer LLMs (GPT-2, LLaMA-3.2-3B, LLaMA-3.2-1B, Qwen-3B, and Mistral-7B), involving 5,888 key-value heads and multiple context samples, yield robust and stable results:
- Generated logit energy field Ẽ: For all models and nearly all attention heads, at least 90% of the variance concentrates in just 2–11 singular vectors. The spectrum of Ẽ decays sharply, indicating that for any given input, the effective dimensionality of attention is minuscule relative to d_h.
- Learned interaction matrix M: In contrast, the learned weight spectrum is flat. Capturing 90% of the variance of M requires 38–75 components, indicating no intrinsic architectural bias toward low-rank subspaces. The effective rank ratio between M and Ẽ ranges from roughly 5× (LLaMA-1B) up to 25× (GPT-2); both summary metrics are sketched in code below.
These results generalize across text domains, genres, and model scales, with low standard deviation across multiple samples.
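Both head-level statistics reduce to functions of the singular values alone. A sketch of the two summary metrics used above, with the caveat that the entropy-based effective rank below is an assumption, since the paper's exact definition is not restated in this summary:

```python
import numpy as np

def components_for_variance(s, threshold=0.90):
    """Smallest k such that the top-k singular vectors capture `threshold`
    of the total variance (sum of squared singular values)."""
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

def effective_rank(s, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution."""
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

# Usage with the spectra from the previous sketch:
# s_M, s_E = head_spectra(W_Q, W_K, X)
# components_for_variance(s_E)   # small (2-11) for the data spectrum
# components_for_variance(s_M)   # large (38-75) for the weight spectrum
```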
Theoretical Implications
Incompressible Architectural Capacity
The flat spectrum of M establishes that the capacity of the attention layer is fully distributed: no fixed, input-independent projection can substantially compress the head's operational subspace without significant loss. Discarding any direction in M proportionally reduces capacity. The attention mechanism is, in this sense, spectrally incompressible.
Data-Driven Compressibility
Conversely, the sharp drop in Ẽ's spectrum is not a consequence of the mechanism's design but arises from the low intrinsic dimensionality of contextualized embeddings produced by the language data and prior network layers. As a result, the true attention interaction on any context occupies a subspace of dramatically lower rank than d_h, even as d_h varies or head count grows.
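A synthetic illustration of this point, not an experiment from the paper: pairing a full-rank, flat-spectrum interaction matrix with embeddings confined to an r-dimensional subspace produces a logit field whose rank collapses to roughly r, far below the head dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_h, n, r = 512, 64, 256, 8     # r: assumed intrinsic dimension of the data

# Full-rank head weights: no architectural low-rank bias.
W_Q = rng.standard_normal((d_model, d_h)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_h)) / np.sqrt(d_model)

# Embeddings restricted to an r-dimensional subspace of the residual stream.
basis = rng.standard_normal((r, d_model))
X = rng.standard_normal((n, r)) @ basis   # (n, d_model), rank r

E = (X @ W_Q) @ (X @ W_K).T               # (n, n) logit field
E_tilde = E - E.mean(axis=1, keepdims=True)

s_M = np.linalg.svd(W_Q.T @ W_K, compute_uv=False)
s_E = np.linalg.svd(E_tilde, compute_uv=False)
print("numerical rank of E~:", int((s_E > 1e-8 * s_E[0]).sum()))   # ~r, far below d_h
print("numerical rank of M :", int((s_M > 1e-8 * s_M[0]).sum()))   # ~d_h, i.e. full
```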
Robustness Under Softmax
Since the softmax operation is nonlinear and can accentuate small differences at peak positions, the paper proves a stability result: as long as the relevant singular vectors are delocalized (empirically shown for real inputs), the attention probability distributions resulting from low-rank approximations of Ẽ remain close to the exact ones, with the error bounded proportionally to the tail of the singular spectrum.
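A quick empirical check of this stability claim, as a sketch rather than the paper's formal bound: truncate Ẽ to its top-k singular components and compare the resulting per-row attention distributions with the exact ones, alongside the relative spectral tail that the bound is stated in terms of.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # numerical stability
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

def truncation_effect(E_tilde, k):
    """Worst-row total-variation distance between exact attention and attention
    computed from the best rank-k approximation of the logit matrix, together
    with the relative energy of the discarded spectral tail."""
    U, s, Vt = np.linalg.svd(E_tilde, full_matrices=False)
    E_k = (U[:, :k] * s[:k]) @ Vt[:k]             # best rank-k approximation
    tv = 0.5 * np.abs(softmax_rows(E_tilde) - softmax_rows(E_k)).sum(axis=1).max()
    tail = np.sqrt((s[k:] ** 2).sum() / (s ** 2).sum())
    return tv, tail
```

On logit fields with sharply decaying spectra, a small k already makes the tail negligible; the paper's result is that the attention error then stays small as well.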
Practical Consequences for Attention Compression
Key-value (KV) cache compression is essential for cost-effective inference, especially for long contexts, where cache memory grows linearly with sequence length. The findings directly challenge methods based on fixed low-rank projections or static head pruning: since the architectural spectra are flat, such input-independent schemes discard capacity rather than redundancy. Instead, only data-adaptive compression schemes, such as dynamic token merging or context-specific projection learning, can robustly exploit input compressibility.
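To make the contrast concrete, here is a hypothetical sketch of a data-adaptive scheme in this spirit (not a method from the paper): a per-context basis is fitted from the keys already in the cache, and keys and values are stored in that low-dimensional basis. Reusing the key basis for the values is an additional simplification of this sketch.

```python
import numpy as np

class LowRankKVCache:
    """Hypothetical context-adaptive KV compression for one attention head:
    keys and values are projected onto the top-r right singular vectors
    estimated from this context's own keys."""

    def __init__(self, K, V, r):
        # K, V: (n, d_h) cached keys/values; r: per-context rank budget.
        _, _, Vt = np.linalg.svd(K, full_matrices=False)
        self.B = Vt[:r].T                  # (d_h, r) context-specific basis
        self.K_c = K @ self.B              # compressed keys   (n, r)
        self.V_c = V @ self.B              # compressed values (n, r)

    def attend(self, q):
        # q: (d_h,) query; logits are computed in the compressed basis.
        logits = self.K_c @ (self.B.T @ q)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        return (w @ self.V_c) @ self.B.T   # approximate attention output, (d_h,)
```

A static scheme would fix B once for all inputs; the spectral findings above imply that only a context-specific basis like this one can be both small and faithful.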
Moreover, the role of "idle" capacity (unused spectral dimensions in a particular context) is interpreted as a form of flexibility, allowing the same architectural substrate to support a wide array of context-dependent activations.
Open Questions and Future Directions
The universality of low effective rank for Ẽ in language is conjectured, supported by empirical evidence across both architecture type and scale. However, several directions remain unresolved:
- Extension to non-NLP modalities: Whether similar spectral gaps appear in vision or sequential protein models is an important line of inquiry, as it would clarify to what degree compressibility is a generic property of structured data input.
- Accumulation of compression error: While per-head, per-layer fidelity is quantified, how approximation errors propagate across stacked layers and whether they affect end-to-end metrics such as perplexity are left for future experimental analysis.
- Optimality of data-driven projections: Designing effective and efficient on-the-fly adaptive compression schemes tailored to persistent data manifold structure is a key practical frontier.
Conclusion
This study demonstrates that the compressibility of softmax-attended language in transformer models is not encoded in the mechanism's parameterization but is a direct result of the low intrinsic dimension of natural language embeddings. The decoupling of input-driven and architecture-driven spectra provides a rigorous basis for focusing future KV-cache compression work on context-adaptive methods. The results prompt both further theoretical study of manifold structures in large model activations and practical innovation in scalable inference algorithms for LLMs (2604.04384).