
Idiom Heads in Transformers

Updated 27 November 2025
  • Idiom Heads are specialized attention heads in transformer models that recover an idiom's figurative meaning by acting as functional bottlenecks for non-compositional semantics.
  • They are identified using quantitative ablation and causal tracing protocols that measure performance drops and highlight their recurrence across various idiomatic expressions.
  • This mechanism supports idiom disambiguation by routing figurative interpretations separately from literal meanings, and points toward a scalable, mechanistically interpretable account of non-compositional language processing.

An Idiom Head is a specialized attention head within transformer-based LLMs that is causally necessary for recovering an idiom’s figurative meaning in an intermediate layer and recurs across the mechanistic circuits for multiple idiomatic expressions (Gomes, 20 Nov 2025). Idiom processing in transformers is characterized by non-compositional semantics, with figurative and literal meanings running in parallel computational paths. Idiom Heads serve as functional bottlenecks that inject, amplify, and route figurative interpretation while allowing literal meanings to bypass this specialized processing. These heads are discovered through quantitative ablation and causal tracing protocols, and they underpin the transformer’s ability to competently represent and disambiguate idioms.

1. Formal Definition and Identification

Idiom Heads are rigorously defined by their causal role in the transformer’s computational graph. For an idiomatic string $I$ at layer $L$, the idiom circuit $H_I$ is the subgraph of nodes (attention heads and post-MLP residual streams) responsible for encoding the figurative meaning. The impact of a head $h$ is measured by the performance drop:

$$d_h = \cos \theta_{M,H} - \cos \theta_{M,H \setminus \{h\}}$$

where $\cos \theta_{M,H}$ is the cosine similarity between the meaning embedding $x_m$ and the circuit-patched hidden state $x_H$ (Gomes, 20 Nov 2025). A head is classified as an Idiom Head if it recurs ($\geq 3$ times) in the $H_I$ circuits across multiple idioms and exhibits $|d_h| > \tau_{\text{idiom}}$, with $\tau_{\text{idiom}}$ empirically set in the range $0.004$ to $0.008$.
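
As a concrete illustration, the sketch below computes $d_h$ from a meaning embedding and two circuit-patched hidden states, and applies the threshold check. The random tensors and the $\tau_{\text{idiom}}$ value of 0.006 are placeholders (a midpoint of the reported range); producing the real patched states requires the path-patching protocol described in Section 2.

```python
import torch
import torch.nn.functional as F

def head_drop(x_m: torch.Tensor, x_H: torch.Tensor, x_H_minus_h: torch.Tensor) -> float:
    """d_h = cos(theta_{M,H}) - cos(theta_{M,H\{h}}): how much figurative-meaning
    alignment is lost when head h is removed from the patched circuit."""
    cos_full = F.cosine_similarity(x_m, x_H, dim=-1)
    cos_ablated = F.cosine_similarity(x_m, x_H_minus_h, dim=-1)
    return (cos_full - cos_ablated).item()

# Toy usage: random vectors stand in for the meaning embedding and the two
# circuit-patched hidden states (the real ones come from path patching).
x_m, x_H, x_H_minus_h = torch.randn(3, 768)
d_h = head_drop(x_m, x_H, x_H_minus_h)
significant = abs(d_h) > 0.006   # tau_idiom: illustrative midpoint of 0.004-0.008
```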

2. Circuit Discovery via Path Patching

Idiom Head identification relies on modified path patching methodologies:

  • Graph construction: Nodes represent per-token attention heads and post-MLP residuals; edges represent Q, K, V connections, as well as intra-layer residual-to-attention associations.
  • Single-corruption tracing: Activations are iteratively patched and edges ablated, with the resulting drop $d_e$ quantified for each edge to isolate causal components.
  • Threshold sweep and merging: Circuits are aggregated across corrupted variants, and heads are pruned if lacking substantive upstream causal connections (Gomes, 20 Nov 2025).

This protocol delineates idiom-specific computational circuits, revealing that certain heads act as cross-token routers for figurative cues.
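
The full protocol (graph construction over Q/K/V edges, single-corruption tracing, threshold sweeps, circuit merging) is specific to the cited work. As a rough, runnable stand-in, the sketch below performs single-head activation patching in TransformerLens; the model, prompts, and patched head are illustrative assumptions, and the logit difference is only a crude proxy for the cosine-based $d_h$.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")   # stand-in model, not the paper's

clean = model.to_tokens("He finally kicked the bucket")    # idiomatic prompt
corrupt = model.to_tokens("He finally kicked the ball")    # literal control (same length)
assert clean.shape == corrupt.shape

_, corrupt_cache = model.run_with_cache(corrupt)

def patch_head(layer: int, head: int) -> torch.Tensor:
    """Run the clean prompt while overwriting one head's output (hook_z)
    with its activation from the corrupted run."""
    name = utils.get_act_name("z", layer)        # "blocks.{layer}.attn.hook_z"
    def hook(z, hook):                           # z: [batch, pos, head, d_head]
        z[:, :, head, :] = corrupt_cache[name][:, :, head, :]
        return z
    return model.run_with_hooks(clean, fwd_hooks=[(name, hook)])

# Example: how much does corrupting head (layer 1, head 5) move the final-token logits?
clean_logits = model(clean)
patched_logits = patch_head(layer=1, head=5)
delta = (clean_logits[0, -1] - patched_logits[0, -1]).abs().max().item()
```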

3. Quantitative Profiling of Idiom Heads

Empirical results on canonical idioms show strong performance drops for a highly constrained set of heads:

| Idiom | Layer 1 Heads | Layer 2 Heads | Max $d$ ($\times 10^{-2}$) |
| --- | --- | --- | --- |
| kicked the bucket | 5*, 14 | 20* | 0.20 |
| piece of cake | 15 | 2, 2 | 0.15 |
| hit the sack | 2* | 15 | 0.15 |
| pulling your leg | 3, 3 | 11 | 0.11 |

Heads (2,0), (1,2), (1,3), and (1,5) repeatedly serve as Idiom Heads across diverse idioms, displaying large $d_h$ values and a near absence of antagonistic (negative-drop) heads (Gomes, 20 Nov 2025). Early MLPs and MHSA sublayers (Layers 0–3) also play critical roles in boosting figurative signals and suppressing literal interpretations (Oh et al., 2 Jun 2025).
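
A minimal sketch of the recurrence half of the Idiom Head criterion, assuming the per-idiom circuits $H_I$ have already been discovered; the example circuits below are hypothetical and merely reuse head indices mentioned above.

```python
from collections import Counter

def recurring_heads(circuits: dict[str, set[tuple[int, int]]],
                    min_recurrence: int = 3) -> dict[tuple[int, int], int]:
    """Count how often each (layer, head) node appears across the per-idiom
    circuits H_I and keep those recurring in >= min_recurrence idioms (the
    recurrence condition, applied alongside the |d_h| > tau_idiom check)."""
    counts = Counter(head for heads in circuits.values() for head in heads)
    return {head: n for head, n in counts.items() if n >= min_recurrence}

# Hypothetical circuits keyed by idiom; real ones come from the path-patching runs.
example = {
    "kicked the bucket": {(1, 5), (2, 0)},
    "piece of cake":     {(1, 5), (1, 2)},
    "hit the sack":      {(1, 5), (2, 0)},
    "pulling your leg":  {(1, 5), (1, 3)},
}
print(recurring_heads(example))   # {(1, 5): 4} under these made-up sets
```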

4. Mechanistic Role: Figurative versus Literal Routing

Transformer architectures encode idiom semantics via dual-path routing:

  • Intermediate (figurative) path: Early MHSA heads bind idiom tokens; subsequent MLP transformations create high-dimensional figurative slots. From layer $\ell \approx 4$ onward, positions like the "because" token shift toward alignment with figurative paraphrase embeddings (Oh et al., 2 Jun 2025).
  • Bypass (literal) route: Literal compositional interpretations are preserved via direct attention edges from the idiom span to the final token, effectively bypassing figurative gating. Selective edge masking can block either route, isolating the flow of meaning (e.g., blocking the idiom → "because" edge at $\ell = 5$ kills the figurative boost, while blocking the idiom → last-token edge at $\ell \approx 11, 13$ eliminates the literal bypass) (Oh et al., 2 Jun 2025).
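
A hedged sketch of this kind of selective edge masking in TransformerLens follows; the model, prompt, layer, and token positions are illustrative assumptions rather than the cited experimental setup.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")       # stand-in model
tokens = model.to_tokens("Tom kicked the bucket because")      # idiom span, then "because"

def block_edges(layer: int, query_pos: int, key_positions: list[int]):
    """Return a (hook_name, hook_fn) pair that sets the pre-softmax attention
    scores from query_pos to key_positions to -inf at the given layer, so those
    edges carry no attention in the patched forward pass."""
    name = f"blocks.{layer}.attn.hook_attn_scores"   # shape [batch, head, query, key]
    def hook(scores, hook):
        scores[:, :, query_pos, key_positions] = float("-inf")
        return scores
    return (name, hook)

# Illustrative positions only (with the default BOS token, the idiom span sits
# roughly at positions 2-4 and "because" at 5): cut idiom -> "because" at layer 5.
logits = model.run_with_hooks(
    tokens, fwd_hooks=[block_edges(layer=5, query_pos=5, key_positions=[2, 3, 4])]
)
```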

5. Augmented Reception and Cross-Token Dynamics

"Augmented reception" describes the phenomenon where early token representations (even function words) causally influence the Query or Key vectors of an idiom’s final token via upstream attention edges. For a head hh at layer \ell targeting token jj, the attention weight is

$$\alpha_{j \to i}^{(\ell,h)} \propto \exp\!\left(Q_j^{(\ell,h)} \cdot K_i^{(\ell,h)}\right)$$

Incoming Q-edges from $k$ to $j$ mean that $Q_j$ depends on $x_k^{(\ell-1)}$. This mechanism sharpens cross-token coupling, enabling the final idiom token to "receive" context-influenced attention, which is quantifiably necessary for figurative resolution (Gomes, 20 Nov 2025).
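
A toy computation of these attention weights, with the standard $1/\sqrt{d_{\text{head}}}$ scaling added (the proportionality above leaves it implicit); shapes and values are illustrative.

```python
import torch

def attention_weights(Q_j: torch.Tensor, K: torch.Tensor, d_head: int) -> torch.Tensor:
    """alpha_{j->i} proportional to exp(Q_j . K_i): softmax over key positions i
    of token j's scaled query-key dot products (causal masking omitted here)."""
    scores = (K @ Q_j) / d_head ** 0.5     # one score per key position i
    return torch.softmax(scores, dim=-1)    # normalised alpha_{j->i}

d_head = 64
Q_j = torch.randn(d_head)        # query of the idiom's final token j; under augmented
                                 # reception it already depends on earlier tokens x_k^{(l-1)}
K = torch.randn(6, d_head)       # keys of the preceding tokens i
alpha = attention_weights(Q_j, K, d_head)   # sums to 1 over i
```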

6. Computational Efficiency, Redundancy, and Implications

Transformers deploy Idiom Heads with functional specialization analogous to Induction or Mover heads, but for non-compositional semantics. Redundancy—multiple heads handling similar cues—balances computational efficiency and generalization robustness. Idiom-specific directions in Q–K space allow shared heads to serve multiple idioms without semantic collapse (Gomes, 20 Nov 2025).

This specialization suggests a scalable mechanism for tackling figurative, metaphorical, or sarcastic language via mechanistic interpretability. Monitoring Idiom Head activation or intermediate-path alignment (e.g., at the "because" token) has diagnostic potential for real-time disambiguation or model auditing (Oh et al., 2 Jun 2025).
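
One possible (purely illustrative) monitoring sketch, assuming a TransformerLens-style model and treating head (1, 5) as the candidate Idiom Head; the prompts, head choice, and norm-based score are assumptions, not a validated diagnostic.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")   # stand-in model

def idiom_head_activity(prompt: str, layer: int = 1, head: int = 5) -> float:
    """Cheap per-prompt signal: the norm of one head's output at the final token."""
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    z = cache[f"blocks.{layer}.attn.hook_z"]      # [batch, pos, head, d_head]
    return z[0, -1, head].norm().item()

figurative = idiom_head_activity("He finally kicked the bucket")
literal = idiom_head_activity("He finally kicked the ball")
# A consistent gap between such scores across idioms would be in line with
# idiom-specific routing; a single pair of prompts proves nothing by itself.
```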

7. Limitations and Pathways for Extension

Current findings are established for 1B-parameter autoregressive transformers and short-context idioms. A plausible implication is that extending the analysis to larger models, richer discourse contexts, and continuous-time state probes (CCA, RSA) could resolve even more granular mechanisms involved in non-compositional language processing (Oh et al., 2 Jun 2025). Further research is expected to adapt these mechanistic interpretability tools to more complex grammatical phenomena, additional idiom types, and linguistic ambiguity beyond idioms.
