
Slash-Dominant Heads (SDHs) in Transformers

Updated 14 January 2026
  • SDHs are attention heads in decoder-only Transformers that concentrate attention along fixed sub-diagonals, representing consistent token lags.
  • They emerge from nearly rank-one query/key matrices and the dominant influence of medium- and high-frequency components in Rotary Position Embedding.
  • SDHs contribute to parameter compression, improved long-range context tracking, and mechanistic interpretability in large language models.

Slash-Dominant Heads (SDHs) are attention heads in decoder-only Transformers, particularly LLMs, that exhibit attention score concentration along a fixed sub-diagonal ("slash line") of the attention score matrix. This manifests as persistent patterns where, for a given offset $\Delta$, an attention head focuses disproportionately on positions that are exactly $\Delta$ tokens behind the current token. These phenomena, intrinsic to the model architecture and the use of Rotary Position Embedding (RoPE), have been empirically documented across a broad range of models and explained through theoretical analysis (Cheng et al., 13 Jan 2026).

1. Formal Definition and Notation

Let $S(P)\in\mathbb{R}^{N\times N}$ denote the attention score matrix (after row-wise softmax) for a prompt $P$ of context length $N$ in a causal attention head. For an integer offset $\Delta\ge 0$ and threshold $\kappa\in[0,1]$, the average slash score at lag $\Delta$ is defined by

$$\mathrm{SlashScore}(\Delta) = \mathbb{E}_{P\sim\mathcal{D}} \left[ \frac{1}{N(P)-\Delta} \sum_{i=\Delta+1}^{N(P)} S_{i,\,i-\Delta}(P) \right].$$

A head is termed *$(\kappa,\Delta)$-slash-dominant* if $\mathrm{SlashScore}(\Delta)\ge\kappa$. Geometrically, such a head concentrates attention along the $\Delta$-th sub-diagonal of $S(P)$, i.e., it attends with notable probability to tokens at fixed relative position $-\Delta$.
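As an illustration, the slash score of a single row-softmaxed causal attention matrix is simply the mean of its $\Delta$-th sub-diagonal; averaging over a prompt set approximates the expectation above. A minimal NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def slash_score(S: np.ndarray, delta: int) -> float:
    """Average attention mass on the delta-th sub-diagonal of a
    row-softmaxed causal attention matrix S of shape (N, N)."""
    N = S.shape[0]
    if delta >= N:
        return 0.0
    # Entries S[i, i - delta] for i = delta .. N-1 (N - delta rows)
    diag = np.diagonal(S, offset=-delta)
    return float(diag.mean())

def is_slash_dominant(S: np.ndarray, delta: int, kappa: float) -> bool:
    """(kappa, delta)-slash-dominance test for one head on one prompt."""
    return slash_score(S, delta) >= kappa
```

In practice the score would be averaged over many prompts from $\mathcal{D}$ (and, per the paper's observation, over i.i.d. random-token prompts as a content-free control).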

2. Empirical Properties: Prevalence and OOD Robustness

Empirical analyses reveal that SDHs are ubiquitous in open-source LLMs, including Gemma-7B, Llama3-8B, and Qwen2.5-7B. Numerous heads with $\Delta=0,1,2,\ldots$ exhibit $\mathrm{SlashScore}(\Delta)\gtrsim 0.1$ for $\Delta<5$, and even at long range ($\Delta>500$), dozens of SDHs per model can be identified for $\kappa\approx 10^{-3}$. Notably, when inputs are replaced by i.i.d. random tokens (sampled uniformly over the vocabulary), the slash-dominance of the same heads persists with comparable scores. Thus, SDHs are a consequence of model architecture rather than content-driven or semantic effects (Cheng et al., 13 Jan 2026).

3. Architectural Conditions Inducing SDHs

SDHs are closely associated with two structural properties of the model:

(a) Nearly Rank-One Queries and Keys:

Given hidden states $h_1,\ldots,h_N\in\mathbb{R}^d$ and attention weight matrices $W_Q, W_K$, define $Q=(q_1,\ldots,q_N)^\top$ and $K=(k_1,\ldots,k_N)^\top$ with $q_i=h_i^\top W_Q$, $k_j=h_j^\top W_K$. The spectral concentration is quantified by

$$r_1(X) = \frac{\sigma_1^2(X)}{\sum_i \sigma_i^2(X)}, \qquad R_{0.95}(X) = \min\left\{ \ell : \sum_{i=1}^{\ell} r_i(X)\ge 0.95 \right\},$$

where $\sigma_i(X)$ denotes the $i$-th singular value of $X$ and $r_i(X)=\sigma_i^2(X)/\sum_j \sigma_j^2(X)$. For SDHs, typically $r_1(Q)\gtrsim 0.9$ or $r_1(K)\gtrsim 0.9$, and $R_{0.95}(Q), R_{0.95}(K) \ll d$. Thus $q_i\approx\alpha_i u$ and $k_j\approx\beta_j v$ for nearly fixed unit vectors $u, v$, with directional and norm variation of at most ${\sim}10\%$.
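These spectral quantities are straightforward to estimate from a sampled $Q$ or $K$ matrix via SVD. A small sketch (helper name is illustrative, not from the paper):

```python
import numpy as np

def spectral_stats(X: np.ndarray, tau: float = 0.95):
    """Return r_1(X), the fraction of squared singular-value mass on the
    top singular value, and R_tau(X), the number of singular values
    needed to cover a tau fraction of that mass."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    ratios = s2 / s2.sum()
    r1 = float(ratios[0])
    R_tau = int(np.searchsorted(np.cumsum(ratios), tau) + 1)
    return r1, R_tau

# A nearly rank-one matrix (synthetic stand-in for a head's Q):
rng = np.random.default_rng(0)
Q = rng.normal(size=(128, 1)) @ rng.normal(size=(1, 64)) \
    + 0.01 * rng.normal(size=(128, 64))
```

For such a matrix, $r_1\approx 1$ and $R_{0.95}=1$, matching the regime reported for SDHs.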

(b) RoPE Dominated by Medium- and High-Frequency Components:

With RoPE, position $i$ is encoded by $R_\vartheta(i) = \mathrm{diag}(\rho(i\theta_1),\dots,\rho(i\theta_{d/2}))$, where $\rho(\phi)$ is a $2\times 2$ block rotation and $\{\theta_\ell\}$ are monotonically decreasing frequencies. The "slash-dominance frequency condition" requires that, for the $d_b/2$ largest frequencies,

$$\left| \sum_{s=1}^{d_b/2} \cos(\theta_s x) - C_1\delta_0(x) - C_2 \right| \le \epsilon \quad \text{for } |x|\le N,$$

with $\epsilon\ll 1$. In practice, the active frequencies in SDHs are medium/high, producing a "pulse" frequency response aligned with particular lags $\Delta$. Removal of low-frequency components does not disrupt the slash peak, whereas removal of medium/high frequencies abolishes it.
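The pulse behavior can be reproduced numerically with an idealized band of equispaced medium/high frequencies (an illustrative assumption, not the exact learned spectrum): the cosine sum spikes at lag $x=0$, where all terms align, and stays well below the peak elsewhere, approximating $C_1\delta_0(x)+C_2$.

```python
import numpy as np

m, N = 64, 100
theta = np.linspace(0.5, np.pi, m)   # idealized medium/high-frequency band (assumption)
x = np.arange(N + 1)                 # lags 0 .. N

# Frequency response: sum of cosines over the band at each lag
response = np.cos(np.outer(x, theta)).sum(axis=1)

# response[0] = m (all cosines align at x = 0); for x > 0 the terms
# dephase and the sum stays far below the peak -- a pulse at x = 0.
```

Replacing the band with very low frequencies ($\theta\ll 1/N$) gives $\cos(\theta_s x)\approx 1$ at every lag, so the response is flat and carries no pulse, mirroring the observation that removing low-frequency components leaves the slash peak intact.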

4. Mathematical Decomposition of SDH Attention

Under the preceding conditions, the pre-softmax attention logit for $i\ge j$ simplifies to

$$\tilde{q}_i^\top\tilde{k}_j = q_i^\top R_\vartheta(i) R_\vartheta(j)^\top k_j \approx u^\top R_\vartheta(i) R_\vartheta(j)^\top v = \sum_{\ell=1}^{d/2} A_\ell \cos(\theta_\ell(i-j) + \varphi_\ell),$$

with amplitudes $A_\ell$ and phases $\varphi_\ell$ depending on $u$ and $v$. Thus the logit, viewed as a function of the lag $i-j$, is a sum of cosines, and constructive interference at integer offsets $\Delta$ gives rise to the observed slash peaks in attention:

$$\mathrm{AttnLogit}(i,j) = \sum_{\ell=1}^{d/2} A_\ell \cos\left(\theta_\ell (i-j) + \varphi_\ell\right).$$

The Fourier-like structure explains the emergent concentration at specific sub-diagonals.
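This mechanism can be simulated end-to-end under simplifying assumptions (unit amplitudes $A_\ell$ and an idealized equispaced frequency band, neither taken from the paper): choosing phases $\varphi_\ell=-\theta_\ell\Delta$ makes the cosines interfere constructively at lag $\Delta$, and row-wise softmax over a causal mask then concentrates attention on the $\Delta$-th sub-diagonal.

```python
import numpy as np

N, m, delta = 128, 64, 3
theta = np.linspace(0.5, np.pi, m)   # illustrative frequency band (assumption)
phi = -theta * delta                 # phases that align constructively at lag = delta
lags = np.arange(N)

# The logit depends only on the lag i - j:
profile = np.cos(np.outer(lags, theta) + phi).sum(axis=1)

logits = np.full((N, N), -np.inf)    # causal mask: only j <= i allowed
i, j = np.tril_indices(N)
logits[i, j] = profile[i - j]

# Row-wise softmax
S = np.exp(logits - logits.max(axis=1, keepdims=True))
S /= S.sum(axis=1, keepdims=True)

# Attention mass piles onto the delta-th sub-diagonal:
slash_score = np.diagonal(S, offset=-delta).mean()
```

The resulting `slash_score` is close to 1, i.e., the simulated head is $(\kappa,\Delta)$-slash-dominant for any reasonable $\kappa$.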

5. Theoretical Results on SDH Emergence

A theoretical treatment considers a shallow, two-layer, single-head, disentangled Transformer with RoPE, trained via gradient descent for in-context linear regression:

  • Input: Prompts of the form $P=(x_1,y_1,\dots,x_n,y_n,x_q)$ with embeddings on a cone plus a semantic subspace.
  • Layer 1: Queries and keys restricted to the cone axis, effectively enforcing rank-one structure; RoPE applied at every layer.
  • Training protocol: Two-stage gradient descent. Stage I updates the first-layer queries to focus on the immediately preceding token ($\Delta=1$), converging to a 1-slash pattern. Stage II updates the second-layer queries to match features between training examples and the query $x_q$.

Key Lemmas:

During Stage I, the Layer-1 logit margin $A_{i,i-1} - \max_{j\neq i-1} A_{i,j}$ grows at an accelerating rate, establishing the slash pattern at $j=i-1$. In Stage II, attention solidifies on the appropriate feature-matching token.

Main Theorem:

After $\tilde{O}(KN\log N)$ steps in Stage I and $\tilde{O}(K^2\log K)$ steps in Stage II, the first layer becomes a (1-1) SDH, the second layer performs accurate feature matching, and the squared loss is $O(1)$. The resulting SDHs are observed to generalize to out-of-distribution (OOD) data (Cheng et al., 13 Jan 2026).

6. Implications and Applications

  • Intrinsic Architectural Effect:

SDHs arise systematically from the combination of near rank-one $Q, K$ matrices and the induction structure of high/medium-frequency RoPE, independent of semantic content.

  • Parameter Compression:

The near rank-one property enables aggressive low-rank factorization of $Q, K$ without significant loss of model performance.
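As a sketch of the compression argument (synthetic weights, not the paper's procedure): a projection whose output is nearly rank-one can be replaced by a rank-$r$ factorization with tiny reconstruction error, cutting its parameter count from $d\cdot d_h$ to $r(d+d_h)$.

```python
import numpy as np

def low_rank_factor(W: np.ndarray, r: int):
    """Truncated-SVD factorization W ~= A @ B, A: (d, r), B: (r, d_h)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]

# Synthetic near-rank-one projection (hypothetical stand-in for an SDH's W_Q):
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 64)) \
    + 0.01 * rng.normal(size=(256, 64))

A, B = low_rank_factor(W, r=1)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
# Rank-1 storage: 256 + 64 values instead of 256 * 64.
```

For genuinely near rank-one heads, the relative Frobenius error of such a factorization is on the order of the residual spectral mass, i.e., a few percent.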

  • Length Generalization:

Modifying or reweighting low-frequency RoPE components can improve model generalization to longer contexts without degrading slash behavior.

  • Mechanistic Interpretability:

The structure of SDHs provides insight into how positional encodings in LLMs direct information flow, elucidating the dominant, fixed-lag attention implemented by certain heads.

A plausible implication is that SDHs are essential to reliable long-range context tracking, and interventions on RoPE or the parameterization of queries/keys can be leveraged to manipulate or enhance a model's information propagation capabilities.

7. Summary

Slash-Dominant Heads are a robust and reproducible artifact of attention architectures with Rotary Position Embedding, induced by the near rank-one structure of queries and keys and the frequency response of RoPE. Their identification in modern LLMs, theoretical characterization, and implications for parameter compression, length generalization, and interpretability underscore their significance in transformer research. SDHs emerge independently of semantic data, reflecting architectural and training-driven mechanisms that systematically enforce fixed-lag relative attention (Cheng et al., 13 Jan 2026).
