
PKV Residual Attention in Transformers

Updated 28 November 2025
  • PKV Residual Attention is a Transformer modification that adds the previous layer's pre-softmax attention logits to the current layer's scores, improving quality and training stability without extra parameters.
  • It has been integrated into backbones such as BERT, ETC, and ADMIN, where the residual connection on logits improves convergence, training stability, and attention sparsity.
  • Empirical results show faster convergence, higher accuracy, and reduced divergence in deep settings across a range of NLP tasks.

PKV Residual Attention, also described as "Residual Attention" in the RealFormer architecture, is a modification to the standard Transformer multi-head attention mechanism wherein a skip connection is applied directly to the pre-softmax attention logits. This approach enables each attention head in a given layer to incorporate its corresponding logits from the previous layer, yielding improvements in model performance, training stability, and attention sparsity across diverse Transformer backbones including BERT, ETC, and ADMIN. The technique introduces no new learnable parameters and is compatible with both Post-LayerNorm (Post-LN) and Pre-LayerNorm (Pre-LN) architectures (He et al., 2020).

1. Mathematical Formulation

In the canonical Transformer, the scaled dot-product attention for head i at layer l starts from the raw attention logits

S^{(l)}_i = \frac{Q^{(l)}_i (K^{(l)}_i)^{T}}{\sqrt{d_k}}

where Q, K, and V are the standard query, key, and value projections and d_k is the key/query dimension. These logits undergo a softmax normalization,

A^{(l)}_i = \mathrm{softmax}(S^{(l)}_i)

followed by the output computation

O^{(l)}_i = A^{(l)}_i V^{(l)}_i

In PKV Residual Attention, the core change is

\widetilde{S}^{(l)}_i = \frac{Q^{(l)}_i (K^{(l)}_i)^{T}}{\sqrt{d_k}} + S^{(l-1)}_i

A^{(l)}_i = \mathrm{softmax}(\widetilde{S}^{(l)}_i)

O^{(l)}_i = A^{(l)}_i V^{(l)}_i

For a stack of h heads, all logits are stored as S^{(l)} \in \mathbb{R}^{h \times L \times L}. No additional learnable parameters are introduced; the only change is the elementwise addition of S^{(l-1)}_i to S^{(l)}_i prior to the softmax.
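
As a concrete shape check, the following minimal sketch (with arbitrarily chosen tensor sizes, not taken from any reference implementation) performs the elementwise addition of the previous layer's logits before the softmax:

import torch

h, L, d_k = 8, 16, 64                       # heads, sequence length, head dimension
Q = torch.randn(h, L, d_k)                  # per-head queries at layer l
K = torch.randn(h, L, d_k)                  # per-head keys at layer l
prev_scores = torch.randn(h, L, L)          # S^{(l-1)}: logits carried from layer l-1

scores_new = Q @ K.transpose(-2, -1) / d_k ** 0.5   # raw logits S^{(l)}, shape [h, L, L]
scores = scores_new + prev_scores                   # residual connection on the logits
attn = torch.softmax(scores, dim=-1)                # A^{(l)}, rows sum to 1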

2. Sequence of Operations and Q/K/V–PKV Interaction

For each layer l and head i:

  1. Input activations H^{(l-1)} undergo layer normalization and linear projections to produce Q^{(l)}_i, K^{(l)}_i, V^{(l)}_i:

Q^{(l)}_i = H^{(l-1)} W^Q_i, \quad K^{(l)}_i = H^{(l-1)} W^K_i, \quad V^{(l)}_i = H^{(l-1)} W^V_i

  2. Compute new raw logits:

S_\text{new} = \frac{Q^{(l)}_i (K^{(l)}_i)^{T}}{\sqrt{d_k}}

  3. Retrieve the previous layer's logits S^{(l-1)}_i (taken as zero for the first layer).
  4. Add the residual logits:

\widetilde{S} = S_\text{new} + S^{(l-1)}_i

  5. Apply softmax and aggregate values:

A^{(l)}_i = \mathrm{softmax}(\widetilde{S}), \quad O^{(l)}_i = A^{(l)}_i V^{(l)}_i

  6. Store \widetilde{S} as S^{(l)}_i for use in layer l+1.

The outputs from all heads are concatenated, projected, and passed through the usual residual and FFN blocks. This sequence forms a “Residual MultiHead” mechanism.

3. Pseudocode and Implementation Sketch

An implementation of a single attention block supporting residual logits (“PKV Residual Attention”) is:

import math
import torch
import torch.nn as nn


class ResidualMultiHead(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, prev_scores=None):
        # x: [batch, L, d_model]
        # prev_scores: [batch, heads, L, L], or None for the first layer
        batch, L, d_model = x.shape
        h, d_k = self.num_heads, self.d_k

        x_norm = self.norm(x)
        Q = self.Wq(x_norm).view(batch, L, h, d_k).transpose(1, 2)
        K = self.Wk(x_norm).view(batch, L, h, d_k).transpose(1, 2)
        V = self.Wv(x_norm).view(batch, L, h, d_k).transpose(1, 2)

        # Raw attention logits for this layer
        scores_new = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        # Residual connection on the pre-softmax logits
        scores = scores_new if prev_scores is None else scores_new + prev_scores
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)
        out = out.transpose(1, 2).contiguous().view(batch, L, d_model)
        o = self.Wo(out)

        return o, scores
The module returns both the usual outputs and the new logits, which are passed to the next layer.
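
A hypothetical usage sketch (not taken from the paper's codebase) shows how the logits thread through a stack of such blocks; the FFN sublayer here follows the Pre-LN arrangement of the module above rather than the Post-LN default discussed below:

import torch.nn as nn

class ResidualEncoder(nn.Module):
    # Hypothetical stack of ResidualMultiHead blocks; "scores" carries the
    # pre-softmax logits from each layer into the next one.
    def __init__(self, num_layers, d_model, num_heads):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            ResidualMultiHead(d_model, num_heads) for _ in range(num_layers))
        self.ffn_layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model),
                          nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model)) for _ in range(num_layers))

    def forward(self, x):
        scores = None                                # no logits before the first layer
        for attn, ffn in zip(self.attn_layers, self.ffn_layers):
            attn_out, scores = attn(x, scores)       # scores feed the next layer
            x = x + attn_out                         # residual on hidden states
            x = x + ffn(x)                           # FFN sublayer
        return x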

4. Architectural Properties and Hyperparameter Considerations

  • No additional trainable parameters are introduced; all weight matrices (WQW^Q, WKW^K, WVW^V, WOW^O) are unchanged.
  • The standard RealFormer configuration uses Post-LN (layer normalization after each sublayer), exactly as in BERT. Pre-LN can also be used; for Pre-LN, the GPT-2 initialization for projection weights is recommended.
  • All architectural hyperparameters (number of heads, hidden size, FFN width, dropout, etc.) remain identical to the baseline model.
  • For deep networks (e.g., 36+ layers), logits may be accumulated using a running mean instead of an unbounded sum (effectively adding a temperature factor to the softmax); a sketch of this variant follows the list.
  • The approach is compatible with encoders, decoders, and encoder–decoder models; only the multi-head attention module requires modification.
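
For the running-mean variant mentioned above, one plausible realization (an interpretation of the description, not the paper's reference code) keeps the carried tensor as the mean of all logits computed so far:

def accumulate_scores(prev_scores, scores_new, layer_index):
    # prev_scores: running mean of pre-softmax logits over layers 1..layer_index-1
    #              (None at the first layer)
    # scores_new:  raw logits of the current layer
    # layer_index: 1-based index of the current layer
    if prev_scores is None:
        return scores_new
    # Incremental mean update; keeps the carried logits bounded, which acts like
    # a temperature on the softmax compared with an unbounded running sum.
    return prev_scores + (scores_new - prev_scores) / layer_index

Under this interpretation, the tensor returned for layer l serves both as that layer's pre-softmax scores and as the value carried to layer l+1.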

5. Empirical Performance

PKV Residual Attention has demonstrated consistent improvements across a spectrum of NLP tasks and backbone architectures. Selected results:

Model / Task                       Baseline (Post-LN/other)   RealFormer
BERT-Small, MLM accuracy           61.57% (Post-LN)           61.70%
BERT-Base, MLM accuracy            70.20% (Post-LN)           70.42%
BERT-Large, MLM accuracy           73.64% (Post-LN)           73.94%
BERT-xLarge, MLM accuracy          73.72% (Post-LN)           74.76%
GLUE overall (BERT-Large)          84.01 (Post-LN)            84.53
SQuAD v1.1 F1/EM                   91.68/85.15 (Post-LN)      91.93/85.58
ADMIN, WMT'14 En→De (12 layers)    28.58 BLEU                 29.06 BLEU
ETC-Large, WikiHop accuracy        78.92                      79.21

Additional findings:

  • In limited-budget pretraining, RealFormer at 500k steps rivals or outperforms Post-LN at 1M steps on GLUE and SQuAD.
  • In deep or unstable settings, RealFormer prevents divergence and enables higher learning rates (2×10⁻⁴) without loss of stability.

6. Training Dynamics: Stability, Convergence, and Attention Sparsity

  • Stability: RealFormer eliminates divergence issues prevalent in large Post-LN models such as BERT-xLarge.
  • Convergence: Demonstrated faster convergence; for the same step count, RealFormer attains higher development set accuracy.
  • Attention Sparsity: RealFormer yields attention maps with lower per-token, per-head entropy in upper layers and reduced variance across heads, indicating increased sparsity; a diagnostic sketch follows this list. The entropy of an attention distribution A_i^{(l)} is computed as

H(A_i^{(l)}) = -\sum_j A_{i,j}^{(l)} \log A_{i,j}^{(l)}

  • Cross-layer Correlation: The Jensen–Shannon Divergence between attention maps of adjacent layers is lower, implying vertically persistent attention patterns for each head index. This suggests cross-layer coherence, potentially regularizing model behavior.
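
A minimal sketch of how these two diagnostics (per-token attention entropy and adjacent-layer Jensen-Shannon divergence) might be computed from saved attention maps; the [heads, L, L] layout and the helper names are illustrative assumptions:

import torch

def attention_entropy(attn, eps=1e-12):
    # attn: [heads, L, L]; each row is a probability distribution over keys.
    # Returns per-head, per-token entropy, shape [heads, L].
    return -(attn * (attn + eps).log()).sum(dim=-1)

def adjacent_layer_jsd(attn_a, attn_b, eps=1e-12):
    # Jensen-Shannon divergence between corresponding attention rows of two
    # adjacent layers (same head index), shape [heads, L].
    m = 0.5 * (attn_a + attn_b)
    kl_a = (attn_a * ((attn_a + eps).log() - (m + eps).log())).sum(dim=-1)
    kl_b = (attn_b * ((attn_b + eps).log() - (m + eps).log())).sum(dim=-1)
    return 0.5 * (kl_a + kl_b)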

7. Integration and Implementation Guidelines

Adapting existing codebases to incorporate PKV Residual Attention requires:

  • Modifying the MultiHeadAttention module to accept and return the logits ("prev_scores").
  • Inserting the addition operation immediately after logits computation and before the softmax:
    scores = scores_new + prev_scores
    attn = torch.softmax(scores, dim=-1)
    out  = attn @ V
  • No new hyperparameters or normalization placements are necessary; existing Transformer hyperparameters are retained.

This minimal change facilitates integration into any Transformer-based model, including encoders, decoders, and cross-attention layers (He et al., 2020).
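
As one illustration of the cross-attention case, the sketch below adapts the same pattern to an encoder-decoder attention module, assuming (as an illustrative choice, not a detail confirmed by the source) that each attention type carries its own residual score stream:

import math
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    # Hypothetical cross-attention variant: queries come from the decoder state,
    # keys/values from the encoder memory; its score stream is kept separate
    # from the self-attention stream.
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, x, memory, prev_scores=None):
        # x: [batch, Lq, d_model], memory: [batch, Lk, d_model]
        b, Lq, d_model = x.shape
        Lk = memory.size(1)
        h, d_k = self.num_heads, self.d_k
        Q = self.Wq(x).view(b, Lq, h, d_k).transpose(1, 2)
        K = self.Wk(memory).view(b, Lk, h, d_k).transpose(1, 2)
        V = self.Wv(memory).view(b, Lk, h, d_k).transpose(1, 2)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
        if prev_scores is not None:
            scores = scores + prev_scores            # residual on cross-attention logits
        attn = torch.softmax(scores, dim=-1)
        out = attn.matmul(V).transpose(1, 2).contiguous().view(b, Lq, d_model)
        return self.Wo(out), scores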

References

He, R., Ravula, A., Kanagal, B., and Ainslie, J. (2020). RealFormer: Transformer Likes Residual Attention.
