Talking-Heads Attention in Transformers

Updated 11 May 2026

Talking-Heads Attention is a variant of multi-head attention that introduces learned linear projections before and after softmax to enable cross-head interactions.
It employs W_pre and W_post matrices to mix attention logits and weights, leading to consistent improvements in perplexity, F1, and accuracy across various models.
Empirical evaluations on T5, BERT, and ALBERT demonstrate that this approach mitigates bottlenecks in small head dimensions and scales with minimal extra computation.

Talking-Heads Attention is a generalization of standard multi-head attention in Transformer architectures wherein linear projections are inserted across the attention-heads dimension, immediately before and after the softmax operation. This modification creates information pathways among attention heads, enabling richer cross-head interactions. The construction, computational implications, empirical results, and theoretical motivations are distinct from standard multi-head formulations, yet involve a relatively minor increase in parametrization and compute burden. Talking-Heads Attention has demonstrated consistent improvements on large-scale masked language modeling and downstream transfer-learning tasks (Shazeer et al., 2020).

1. Standard Multi-Head Attention: Preliminaries

In multi-head self-attention, the input $X \in \mathbb{R}^{n \times d}$ (with sequence length $n$ and model dimension $d$ ) is projected via matrices $W^Q, W^K, W^V$ into $Q, K, V$ representations, which are reshaped into $h$ head-parallel blocks: $Q[i], K[i], V[i] \in \mathbb{R}^{n \times k}$ for each $i=1\dots h$ . Each attention head computes the scaled softmax-dot-product as:

$A[i] = \text{softmax}(Q[i]K[i]^T/\sqrt{k}) \, V[i] \in \mathbb{R}^{n \times v}$

The outputs across heads are concatenated and linearly projected by $W^O$ , yielding the final multi-head attention output. In this structure, heads' scoring and weighting operate in isolation, without interaction until the output merger.

2. Architectural Modifications: Head-Dimension Projections

Talking-Heads Attention introduces two learned matrices, $n$ 0 and $n$ 1, which project across logit and value head dimensions, respectively. The typical configuration sets $n$ 2, making both $n$ 3 and $n$ 4 square matrices of dimension $n$ 5.

The computation proceeds as follows:

Standard projections generate $n$ 6 and $n$ 7.
Raw attention logits $n$ 8 are computed via pairwise dot products: $n$ 9.
Logit heads are mixed: $d$ 0, where for each $d$ 1, $d$ 2.
Softmax is applied sequence-wise: $d$ 3.
Weight heads are further mixed: $d$ 4.
$d$ 5 is contracted with $d$ 6 to yield per-head outputs, which are then projected as in the standard approach.

In tensor notation,

$d$ 7

3. Computational and Parameter Complexity

Relative to standard multi-head attention, Talking-Heads Attention adds $d$ 8 parameters (two matrices of size $d$ 9 under $W^Q, W^K, W^V$ 0) and $W^Q, W^K, W^V$ 1 arithmetic operations per layer. By contrast, conventional multi-head attention requires $W^Q, W^K, W^V$ 2 multiplies per layer and $W^Q, W^K, W^V$ 3 parameters when $W^Q, W^K, W^V$ 4 is assumed for simplicity. The increase in operations and parameters is moderate when $W^Q, W^K, W^V$ 5 and is justified by empirical performance gains. The extra overhead arises specifically from the head-mixing tensor contractions and is proportional to $W^Q, W^K, W^V$ 6, $W^Q, W^K, W^V$ 7 (sequence dimensions), and $W^Q, W^K, W^V$ 8.

4. Empirical Evaluations and Ablation Analyses

Extensive experiments were conducted using T5 (12-layer encoder-decoder, $W^Q, W^K, W^V$ 9), ALBERT (12-layer parameter-shared encoder), and BERT (12-layer, with relative positions, up to 768 heads).

Quantitative results on T5 (Table 1):

At 12 heads, $Q, K, V$ 0, standard multi-head: ln PPL = 1.678; Talking-Heads: ln PPL = 1.641.
At 24 heads, $Q, K, V$ 1, multi-head: 1.669; Talking-Heads: 1.624.
SQuAD v1.1 F1 improved from 90.87 (multi-head) to 91.38 (Talking-Heads) with 12 heads, and to 91.83 with 24 heads.
MNLI-m accuracy increased from 86.20 (multi-head) to 87.42 (Talking-Heads, 24 heads).

ALBERT (Table 7) revealed multi-head stagnation as head count increases, but Talking-Heads maintained or improved accuracy (e.g., avg accuracy rises from 80.78 to 81.44 as heads increase from 12 to 48).

BERT ablations (Table 9) using up to 768 heads ( $Q, K, V$ 2) showed continuous gains: SQuAD1.1 F1 improved from 88.51 (12 heads) to 90.5 (768 heads), MNLI-m from 82.6 to 84.2. Ablation studies underscored the necessity of both pre- and post-softmax projections, and that most downstream gains were recoverable by applying Talking-Heads only to encoder self-attention.

Summary Table: Perplexity and F1/Accuracy Gains

Model	Heads ( $Q, K, V$ 3)	Multi-Head (PPL / F1 / Acc)	Talking-Heads (PPL / F1 / Acc)
T5	12	ln PPL = 1.678 / F1 = 90.87	ln PPL = 1.641 / F1 = 91.38
T5	24	ln PPL = 1.669 / F1 = 91.83	ln PPL = 1.624 / F1 = 91.83
BERT	12	F1 = 88.51 / Acc = 82.6	F1 = 90.5 / Acc = 84.2

5. Mechanistic Interpretation and Information Flow

In standard multi-head attention, each head’s Q–K scoring and subsequent value weighting are isolated, which creates an information bottleneck, especially pronounced when $Q, K, V$ 4 is small relative to $Q, K, V$ 5. The insertion of $Q, K, V$ 6 before softmax permits each head’s logits to incorporate information from every other head, forming cross-head compatibility patterns. $Q, K, V$ 7, applied after softmax, enables recombination of weight distributions prior to aggregation with $Q, K, V$ 8. Visualization of $Q, K, V$ 9 and $h$ 0 (see Figure 1 in (Shazeer et al., 2020)) demonstrates their dense, well-conditioned structure, with little evidence of dominance by any diagonal component. This suggests that head cross-talk is both active and nontrivial, mitigating bottlenecks observed in large-head, small-dimension settings.

6. Implementation Notes and Open Research Questions

The additional matrix multiplications required by head-mixing operations can be inefficient on hardware optimized for large GEMMs, raising the question of whether hardware or algorithmic innovations—such as locality-aware or memory-compressed attention—may further accelerate the method. A “dynamic” variant, in which $h$ 1 and $h$ 2 receive small input-dependent offsets, was briefly explored; while it further improved pretraining perplexity, downstream accuracy benefits were not conclusively observed, suggesting the potential for future research into more sophisticated dynamic projections or data-dependent head mixing.

7. Broader Impact and Extensions

Talking-Heads Attention establishes a lightweight, extensible alternative to standard multi-head attention by introducing only $h$ 3 additional parameters and moderate computational overhead. The consistent improvements in pretraining log-perplexity and transfer learning performance across major Transformer variants highlight its practical relevance. While the gains are robust across a range of architectural hyperparameters, further work remains on optimizing small-matrix multiplication efficiency and exploring data-dependent projection schemes. The approach demonstrates that enhancing inter-head communication can compensate for limitations imposed by head dimension bottlenecks and offers a template for further architectural innovation within attention-based models (Shazeer et al., 2020).

Markdown Report Issue Upgrade to Chat

References (1)

Talking-Heads Attention (2020)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Talking-Heads Attention.