
Slash-Dominant Heads (SDHs) in Transformers

Updated 14 January 2026
  • SDHs are attention heads in decoder-only Transformers that concentrate attention along fixed sub-diagonals, representing consistent token lags.
  • They emerge from nearly rank-one query/key matrices and the dominant influence of medium- and high-frequency components in Rotary Position Embedding.
  • SDHs contribute to parameter compression, improved long-range context tracking, and mechanistic interpretability in large language models.

Slash-Dominant Heads (SDHs) are attention heads in decoder-only Transformers, particularly LLMs, that exhibit attention score concentration along a fixed sub-diagonal ("slash line") of the attention score matrix. This manifests as persistent patterns where, for a given offset $\Delta$, an attention head focuses disproportionately on positions that are exactly $\Delta$ tokens behind the current token. These phenomena, intrinsic to the model architecture and the use of Rotary Position Embedding (RoPE), have been empirically documented across a broad range of models and explained through theoretical analysis (Cheng et al., 13 Jan 2026).

1. Formal Definition and Notation

Let $S(P)\in\mathbb{R}^{N\times N}$ denote the attention score matrix (after row-wise softmax) for a prompt $P$ of context length $N$ in a causal attention head. For an integer offset $\Delta\ge 0$ and threshold $\kappa\in[0,1]$, the average slash score at lag $\Delta$ is defined by

$$\mathrm{SlashScore}(\Delta) = \mathbb{E}_{P\sim\mathcal{D}} \left[ \frac{1}{N(P)-\Delta} \sum_{i=\Delta+1}^{N(P)} S_{i,\,i-\Delta}(P) \right].$$

A head is termed *$(\kappa,\Delta)$-slash-dominant* if $\mathrm{SlashScore}(\Delta)\ge\kappa$. Geometrically, such a head concentrates attention along the $\Delta$-th sub-diagonal of $S(P)$, i.e., it attends with notable probability to tokens at fixed relative position $-\Delta$.
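As an illustration, the slash score of a single row-softmaxed causal attention matrix is simply the mean of its $\Delta$-th sub-diagonal; averaging over a prompt set approximates the expectation above. A minimal NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def slash_score(S: np.ndarray, delta: int) -> float:
    """Average attention mass on the delta-th sub-diagonal of a
    row-softmaxed causal attention matrix S of shape (N, N)."""
    N = S.shape[0]
    if delta >= N:
        return 0.0
    # Entries S[i, i - delta] for i = delta .. N-1 (N - delta rows)
    diag = np.diagonal(S, offset=-delta)
    return float(diag.mean())

def is_slash_dominant(S: np.ndarray, delta: int, kappa: float) -> bool:
    """(kappa, delta)-slash-dominance test for one head on one prompt."""
    return slash_score(S, delta) >= kappa
```

In practice the score would be averaged over many prompts from $\mathcal{D}$ (and, per the paper's observation, over i.i.d. random-token prompts as a content-free control).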

2. Empirical Properties: Prevalence and OOD Robustness

Empirical analyses reveal that SDHs are ubiquitous in open-source LLMs, including Gemma-7B, Llama3-8B, and Qwen2.5-7B. Numerous heads with $\Delta=0,1,2,\ldots$ exhibit $\mathrm{SlashScore}(\Delta)\gtrsim 0.1$ for $\Delta<5$, and even at long range ($\Delta>500$), dozens of SDHs per model can be identified for $\kappa\approx 10^{-3}$. Notably, when inputs are replaced by i.i.d. random tokens (sampled uniformly over the vocabulary), the slash-dominance of the same heads persists with comparable scores. Thus, SDHs are a consequence of model architecture rather than content-driven or semantic effects (Cheng et al., 13 Jan 2026).

3. Architectural Conditions Inducing SDHs

SDHs are closely associated with two structural properties of the model:

(a) Nearly Rank-One Queries and Keys:

Given hidden states $h_1,\ldots,h_N\in\mathbb{R}^d$ and attention weight matrices $W_Q, W_K$, define $Q=(q_1,\ldots,q_N)^\top$ and $K=(k_1,\ldots,k_N)^\top$ with $q_i=h_i^\top W_Q$, $k_j=h_j^\top W_K$. The spectral concentration is quantified by

$$r_1(X) = \frac{\sigma_1^2(X)}{\sum_i \sigma_i^2(X)}, \qquad R_{0.95}(X) = \min\left\{ \ell : \sum_{i=1}^{\ell} r_i(X)\ge 0.95 \right\},$$

where $\sigma_i(X)$ denotes the $i$-th singular value of $X$ and $r_i(X)=\sigma_i^2(X)/\sum_j \sigma_j^2(X)$. For SDHs, typically $r_1(Q)\gtrsim 0.9$ or $r_1(K)\gtrsim 0.9$, and $R_{0.95}(Q), R_{0.95}(K) \ll d$. Thus $q_i\approx\alpha_i u$ and $k_j\approx\beta_j v$ for nearly fixed unit vectors $u, v$, with directional and norm variation of at most ${\sim}10\%$.
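These spectral quantities are straightforward to estimate from a sampled $Q$ or $K$ matrix via SVD. A small sketch (helper name is illustrative, not from the paper):

```python
import numpy as np

def spectral_stats(X: np.ndarray, tau: float = 0.95):
    """Return r_1(X), the fraction of squared singular-value mass on the
    top singular value, and R_tau(X), the number of singular values
    needed to cover a tau fraction of that mass."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    ratios = s2 / s2.sum()
    r1 = float(ratios[0])
    R_tau = int(np.searchsorted(np.cumsum(ratios), tau) + 1)
    return r1, R_tau

# A nearly rank-one matrix (synthetic stand-in for a head's Q):
rng = np.random.default_rng(0)
Q = rng.normal(size=(128, 1)) @ rng.normal(size=(1, 64)) \
    + 0.01 * rng.normal(size=(128, 64))
```

For such a matrix, $r_1\approx 1$ and $R_{0.95}=1$, matching the regime reported for SDHs.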

(b) RoPE Dominated by Medium- and High-Frequency Components:

With RoPE, position $i$ is encoded by $R_\vartheta(i) = \mathrm{diag}(\rho(i\theta_1),\dots,\rho(i\theta_{d/2}))$, where $\rho(\phi)$ is a $2\times 2$ block rotation and $\{\theta_\ell\}$ are monotonically decreasing frequencies. The "slash-dominance frequency condition" requires that, for the $d_b/2$ largest frequencies,

$$\left| \sum_{s=1}^{d_b/2} \cos(\theta_s x) - C_1\delta_0(x) - C_2 \right| \le \epsilon \quad \text{for } |x|\le N,$$

with $\epsilon\ll 1$. In practice, the active frequencies in SDHs are medium/high, producing a "pulse" frequency response aligned with particular lags $\Delta$. Removal of low-frequency components does not disrupt the slash peak, whereas removal of medium/high frequencies abolishes it.
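The pulse behavior can be reproduced numerically with an idealized band of equispaced medium/high frequencies (an illustrative assumption, not the exact learned spectrum): the cosine sum spikes at lag $x=0$, where all terms align, and stays well below the peak elsewhere, approximating $C_1\delta_0(x)+C_2$.

```python
import numpy as np

m, N = 64, 100
theta = np.linspace(0.5, np.pi, m)   # idealized medium/high-frequency band (assumption)
x = np.arange(N + 1)                 # lags 0 .. N

# Frequency response: sum of cosines over the band at each lag
response = np.cos(np.outer(x, theta)).sum(axis=1)

# response[0] = m (all cosines align at x = 0); for x > 0 the terms
# dephase and the sum stays far below the peak -- a pulse at x = 0.
```

Replacing the band with very low frequencies ($\theta\ll 1/N$) gives $\cos(\theta_s x)\approx 1$ at every lag, so the response is flat and carries no pulse, mirroring the observation that removing low-frequency components leaves the slash peak intact.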

4. Mathematical Decomposition of SDH Attention

Under the preceding conditions, the pre-softmax attention logit for $i\ge j$ simplifies to

$$\tilde{q}_i^\top\tilde{k}_j = q_i^\top R_\vartheta(i) R_\vartheta(j)^\top k_j \approx u^\top R_\vartheta(i) R_\vartheta(j)^\top v = \sum_{\ell=1}^{d/2} A_\ell \cos(\theta_\ell(i-j) + \varphi_\ell),$$

with amplitudes $A_\ell$ and phases $\varphi_\ell$ depending on $u$ and $v$. Thus the logit, viewed as a function of the lag $i-j$, is a sum of cosines, and constructive interference at integer offsets $\Delta$ gives rise to the observed slash peaks in attention:

$$\mathrm{AttnLogit}(i,j) = \sum_{\ell=1}^{d/2} A_\ell \cos\left(\theta_\ell (i-j) + \varphi_\ell\right).$$

The Fourier-like structure explains the emergent concentration at specific sub-diagonals.
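This mechanism can be simulated end-to-end under simplifying assumptions (unit amplitudes $A_\ell$ and an idealized equispaced frequency band, neither taken from the paper): choosing phases $\varphi_\ell=-\theta_\ell\Delta$ makes the cosines interfere constructively at lag $\Delta$, and row-wise softmax over a causal mask then concentrates attention on the $\Delta$-th sub-diagonal.

```python
import numpy as np

N, m, delta = 128, 64, 3
theta = np.linspace(0.5, np.pi, m)   # illustrative frequency band (assumption)
phi = -theta * delta                 # phases that align constructively at lag = delta
lags = np.arange(N)

# The logit depends only on the lag i - j:
profile = np.cos(np.outer(lags, theta) + phi).sum(axis=1)

logits = np.full((N, N), -np.inf)    # causal mask: only j <= i allowed
i, j = np.tril_indices(N)
logits[i, j] = profile[i - j]

# Row-wise softmax
S = np.exp(logits - logits.max(axis=1, keepdims=True))
S /= S.sum(axis=1, keepdims=True)

# Attention mass piles onto the delta-th sub-diagonal:
slash_score = np.diagonal(S, offset=-delta).mean()
```

The resulting `slash_score` is close to 1, i.e., the simulated head is $(\kappa,\Delta)$-slash-dominant for any reasonable $\kappa$.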

5. Theoretical Results on SDH Emergence

A theoretical treatment considers a shallow, two-layer, single-head, disentangled Transformer with RoPE, trained via gradient descent for in-context linear regression:

  • Input: Prompts of the form $P=(x_1,y_1,\dots,x_n,y_n,x_q)$ with embeddings on a cone plus a semantic subspace.
  • Layer 1: Queries and keys restricted to the cone axis, effectively enforcing rank-one structure; RoPE applied at every layer.
  • Training protocol: Two-stage gradient descent. Stage I updates the first-layer queries to focus on the immediately preceding token ($\Delta=1$), converging to a 1-slash pattern. Stage II updates the second-layer queries to match features between training examples and the query $x_q$.

Key Lemmas:

During Stage I, the Layer-1 logit margin $A_{i,i-1} - \max_{j\neq i-1} A_{i,j}$ grows at an accelerating rate, establishing the slash pattern at $j=i-1$. In Stage II, attention solidifies on the appropriate feature-matching token.

Main Theorem:

After $\tilde{O}(KN\log N)$ steps in Stage I and $\tilde{O}(K^2\log K)$ steps in Stage II, the first layer becomes a (1-1) SDH, the second layer performs accurate feature matching, and the squared loss is $O(1)$. The resulting SDHs are observed to generalize to out-of-distribution (OOD) data (Cheng et al., 13 Jan 2026).

6. Implications and Applications

  • Intrinsic Architectural Effect:

SDHs arise systematically from the combination of near rank-one $Q, K$ matrices and the induction structure of high/medium-frequency RoPE, independent of semantic content.

  • Parameter Compression:

The near rank-one property enables aggressive low-rank factorization of $Q, K$ without significant loss of model performance.
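As a sketch of the compression argument (synthetic weights, not the paper's procedure): a projection whose output is nearly rank-one can be replaced by a rank-$r$ factorization with tiny reconstruction error, cutting its parameter count from $d\cdot d_h$ to $r(d+d_h)$.

```python
import numpy as np

def low_rank_factor(W: np.ndarray, r: int):
    """Truncated-SVD factorization W ~= A @ B, A: (d, r), B: (r, d_h)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]

# Synthetic near-rank-one projection (hypothetical stand-in for an SDH's W_Q):
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 64)) \
    + 0.01 * rng.normal(size=(256, 64))

A, B = low_rank_factor(W, r=1)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
# Rank-1 storage: 256 + 64 values instead of 256 * 64.
```

For genuinely near rank-one heads, the relative Frobenius error of such a factorization is on the order of the residual spectral mass, i.e., a few percent.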

  • Length Generalization:

Modifying or reweighting low-frequency RoPE components can improve model generalization to longer contexts without degrading slash behavior.

  • Mechanistic Interpretability:

The structure of SDHs provides insight into how positional encodings in LLMs direct information flow, elucidating the dominant, fixed-lag attention implemented by certain heads.

A plausible implication is that SDHs are essential to reliable long-range context tracking, and interventions on RoPE or the parameterization of queries/keys can be leveraged to manipulate or enhance a model's information propagation capabilities.

7. Summary

Slash-Dominant Heads are a robust and reproducible artifact of attention architectures with Rotary Position Embedding, induced by the near rank-one structure of queries and keys and the frequency response of RoPE. Their identification in modern LLMs, theoretical characterization, and implications for parameter compression, length generalization, and interpretability underscore their significance in transformer research. SDHs emerge independently of semantic data, reflecting architectural and training-driven mechanisms that systematically enforce fixed-lag relative attention (Cheng et al., 13 Jan 2026).
