
PKV Residual Attention in Transformers

Updated 28 November 2025
  • PKV Residual Attention is a Transformer modification that adds the previous layer's pre-softmax attention logits to the current layer's scores, improving quality and training stability without extra parameters.
  • It has been integrated into backbones such as BERT, ETC, and ADMIN, where the residual connection on logits improves convergence, training stability, and attention sparsity.
  • Empirical results show faster convergence, higher accuracy, and reduced divergence in deep settings across a range of NLP tasks.

PKV Residual Attention, also described as "Residual Attention" in the RealFormer architecture, is a modification to the standard Transformer multi-head attention mechanism wherein a skip connection is applied directly to the pre-softmax attention logits. This approach enables each attention head in a given layer to incorporate its corresponding logits from the previous layer, yielding improvements in model performance, training stability, and attention sparsity across diverse Transformer backbones including BERT, ETC, and ADMIN. The technique introduces no new learnable parameters and is compatible with both Post-LayerNorm (Post-LN) and Pre-LayerNorm (Pre-LN) architectures (He et al., 2020).

1. Mathematical Formulation

In the canonical Transformer, the scaled dot-product attention for head i at layer l starts from the raw attention logits

S^{(l)}_i = \frac{Q^{(l)}_i (K^{(l)}_i)^{T}}{\sqrt{d_k}}

where Q, K, and V are the standard query, key, and value projections and d_k is the key/query dimension. These logits undergo a softmax normalization,

A^{(l)}_i = \mathrm{softmax}(S^{(l)}_i)

followed by the output computation

O^{(l)}_i = A^{(l)}_i V^{(l)}_i

In PKV Residual Attention, the core change is

\widetilde{S}^{(l)}_i = \frac{Q^{(l)}_i (K^{(l)}_i)^{T}}{\sqrt{d_k}} + S^{(l-1)}_i

A^{(l)}_i = \mathrm{softmax}(\widetilde{S}^{(l)}_i)

O^{(l)}_i = A^{(l)}_i V^{(l)}_i

For a stack of h heads, all logits are stored as S^{(l)} \in \mathbb{R}^{h \times L \times L}. No additional learnable parameters are introduced; the only change is the elementwise addition of S^{(l-1)}_i to S^{(l)}_i prior to the softmax.
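
As a concrete shape check, the following minimal sketch (with arbitrarily chosen tensor sizes, not taken from any reference implementation) performs the elementwise addition of the previous layer's logits before the softmax:

import torch

h, L, d_k = 8, 16, 64                       # heads, sequence length, head dimension
Q = torch.randn(h, L, d_k)                  # per-head queries at layer l
K = torch.randn(h, L, d_k)                  # per-head keys at layer l
prev_scores = torch.randn(h, L, L)          # S^{(l-1)}: logits carried from layer l-1

scores_new = Q @ K.transpose(-2, -1) / d_k ** 0.5   # raw logits S^{(l)}, shape [h, L, L]
scores = scores_new + prev_scores                   # residual connection on the logits
attn = torch.softmax(scores, dim=-1)                # A^{(l)}, rows sum to 1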

2. Sequence of Operations and Q/K/V–PKV Interaction

For each layer l and head i:

  1. Input activations H^{(l-1)} undergo layer normalization and linear projections to produce Q^{(l)}_i, K^{(l)}_i, V^{(l)}_i:

Q^{(l)}_i = H^{(l-1)} W^Q_i, \quad K^{(l)}_i = H^{(l-1)} W^K_i, \quad V^{(l)}_i = H^{(l-1)} W^V_i

  2. Compute new raw logits:

S_\text{new} = \frac{Q^{(l)}_i (K^{(l)}_i)^{T}}{\sqrt{d_k}}

  3. Retrieve the previous layer's logits S^{(l-1)}_i (taken as zero for the first layer).
  4. Add the residual logits:

\widetilde{S} = S_\text{new} + S^{(l-1)}_i

  5. Apply softmax and aggregate values:

A^{(l)}_i = \mathrm{softmax}(\widetilde{S}), \quad O^{(l)}_i = A^{(l)}_i V^{(l)}_i

  6. Store \widetilde{S} as S^{(l)}_i for use in layer l+1.

The outputs from all heads are concatenated, projected, and passed through the usual residual and FFN blocks. This sequence forms a “Residual MultiHead” mechanism.

3. Pseudocode and Implementation Sketch

An implementation of a single attention block supporting residual logits (“PKV Residual Attention”) is:

import math
import torch
import torch.nn as nn


class ResidualMultiHead(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, prev_scores=None):
        # x: [batch, L, d_model]
        # prev_scores: [batch, heads, L, L], or None for the first layer
        batch, L, d_model = x.shape
        h, d_k = self.num_heads, self.d_k

        x_norm = self.norm(x)
        Q = self.Wq(x_norm).view(batch, L, h, d_k).transpose(1, 2)
        K = self.Wk(x_norm).view(batch, L, h, d_k).transpose(1, 2)
        V = self.Wv(x_norm).view(batch, L, h, d_k).transpose(1, 2)

        # Raw attention logits for this layer
        scores_new = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        # Residual connection on the pre-softmax logits
        scores = scores_new if prev_scores is None else scores_new + prev_scores
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)
        out = out.transpose(1, 2).contiguous().view(batch, L, d_model)
        o = self.Wo(out)

        return o, scores
The module returns both the usual outputs and the new logits, which are passed to the next layer.
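
A hypothetical usage sketch (not taken from the paper's codebase) shows how the logits thread through a stack of such blocks; the FFN sublayer here follows the Pre-LN arrangement of the module above rather than the Post-LN default discussed below:

import torch.nn as nn

class ResidualEncoder(nn.Module):
    # Hypothetical stack of ResidualMultiHead blocks; "scores" carries the
    # pre-softmax logits from each layer into the next one.
    def __init__(self, num_layers, d_model, num_heads):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            ResidualMultiHead(d_model, num_heads) for _ in range(num_layers))
        self.ffn_layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model),
                          nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model)) for _ in range(num_layers))

    def forward(self, x):
        scores = None                                # no logits before the first layer
        for attn, ffn in zip(self.attn_layers, self.ffn_layers):
            attn_out, scores = attn(x, scores)       # scores feed the next layer
            x = x + attn_out                         # residual on hidden states
            x = x + ffn(x)                           # FFN sublayer
        return x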

4. Architectural Properties and Hyperparameter Considerations

  • No additional trainable parameters are introduced; all weight matrices (WQW^Q, WKW^K, WVW^V, WOW^O) are unchanged.
  • The standard RealFormer configuration uses Post-LN (layer normalization after each sublayer), exactly as in BERT. Pre-LN can also be used; for Pre-LN, the GPT-2 initialization for projection weights is recommended.
  • All architectural hyperparameters (number of heads, hidden size, FFN width, dropout, etc.) remain identical to the baseline model.
  • For deep networks (e.g., 36+ layers), logits may be accumulated using a running mean instead of an unbounded sum (effectively adding a temperature factor to the softmax); a sketch of this variant follows the list.
  • The approach is compatible with encoders, decoders, and encoder–decoder models; only the multi-head attention module requires modification.
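
For the running-mean variant mentioned above, one plausible realization (an interpretation of the description, not the paper's reference code) keeps the carried tensor as the mean of all logits computed so far:

def accumulate_scores(prev_scores, scores_new, layer_index):
    # prev_scores: running mean of pre-softmax logits over layers 1..layer_index-1
    #              (None at the first layer)
    # scores_new:  raw logits of the current layer
    # layer_index: 1-based index of the current layer
    if prev_scores is None:
        return scores_new
    # Incremental mean update; keeps the carried logits bounded, which acts like
    # a temperature on the softmax compared with an unbounded running sum.
    return prev_scores + (scores_new - prev_scores) / layer_index

Under this interpretation, the tensor returned for layer l serves both as that layer's pre-softmax scores and as the value carried to layer l+1.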

5. Empirical Performance

PKV Residual Attention has demonstrated consistent improvements across a spectrum of NLP tasks and backbone architectures. Selected results:

Model / Task                       Baseline (Post-LN/other)   RealFormer
BERT-Small, MLM accuracy           61.57% (Post-LN)           61.70%
BERT-Base, MLM accuracy            70.20% (Post-LN)           70.42%
BERT-Large, MLM accuracy           73.64% (Post-LN)           73.94%
BERT-xLarge, MLM accuracy          73.72% (Post-LN)           74.76%
GLUE overall (BERT-Large)          84.01 (Post-LN)            84.53
SQuAD v1.1 F1/EM                   91.68/85.15 (Post-LN)      91.93/85.58
ADMIN, WMT'14 En→De (12 layers)    28.58 BLEU                 29.06 BLEU
ETC-Large, WikiHop accuracy        78.92                      79.21

Additional findings:

  • In limited-budget pretraining, RealFormer at 500k steps rivals or outperforms Post-LN at 1M steps on GLUE and SQuAD.
  • In deep or unstable settings, RealFormer prevents divergence and enables higher learning rates (2×10⁻⁴) without loss of stability.

6. Training Dynamics: Stability, Convergence, and Attention Sparsity

  • Stability: RealFormer eliminates divergence issues prevalent in large Post-LN models such as BERT-xLarge.
  • Convergence: Demonstrated faster convergence; for the same step count, RealFormer attains higher development set accuracy.
  • Attention Sparsity: RealFormer yields attention maps with lower per-token, per-head entropy in upper layers and reduced variance across heads, indicating increased sparsity; a diagnostic sketch follows this list. The entropy of an attention distribution A_i^{(l)} is computed as

H(A_i^{(l)}) = -\sum_j A_{i,j}^{(l)} \log A_{i,j}^{(l)}

  • Cross-layer Correlation: The Jensen–Shannon Divergence between attention maps of adjacent layers is lower, implying vertically persistent attention patterns for each head index. This suggests cross-layer coherence, potentially regularizing model behavior.
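
A minimal sketch of how these two diagnostics (per-token attention entropy and adjacent-layer Jensen-Shannon divergence) might be computed from saved attention maps; the [heads, L, L] layout and the helper names are illustrative assumptions:

import torch

def attention_entropy(attn, eps=1e-12):
    # attn: [heads, L, L]; each row is a probability distribution over keys.
    # Returns per-head, per-token entropy, shape [heads, L].
    return -(attn * (attn + eps).log()).sum(dim=-1)

def adjacent_layer_jsd(attn_a, attn_b, eps=1e-12):
    # Jensen-Shannon divergence between corresponding attention rows of two
    # adjacent layers (same head index), shape [heads, L].
    m = 0.5 * (attn_a + attn_b)
    kl_a = (attn_a * ((attn_a + eps).log() - (m + eps).log())).sum(dim=-1)
    kl_b = (attn_b * ((attn_b + eps).log() - (m + eps).log())).sum(dim=-1)
    return 0.5 * (kl_a + kl_b)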

7. Integration and Implementation Guidelines

Adapting existing codebases to incorporate PKV Residual Attention requires:

  • Modifying the MultiHeadAttention module to accept and return the logits ("prev_scores").
  • Inserting the addition operation immediately after logits computation and before the softmax:
    scores = scores_new + prev_scores
    attn = torch.softmax(scores, dim=-1)
    out  = attn @ V
  • No new hyperparameters or normalization placements are necessary; existing Transformer hyperparameters are retained.

This minimal change facilitates integration into any Transformer-based model, including encoders, decoders, and cross-attention layers (He et al., 2020).
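
As one illustration of the cross-attention case, the sketch below adapts the same pattern to an encoder-decoder attention module, assuming (as an illustrative choice, not a detail confirmed by the source) that each attention type carries its own residual score stream:

import math
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    # Hypothetical cross-attention variant: queries come from the decoder state,
    # keys/values from the encoder memory; its score stream is kept separate
    # from the self-attention stream.
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, x, memory, prev_scores=None):
        # x: [batch, Lq, d_model], memory: [batch, Lk, d_model]
        b, Lq, d_model = x.shape
        Lk = memory.size(1)
        h, d_k = self.num_heads, self.d_k
        Q = self.Wq(x).view(b, Lq, h, d_k).transpose(1, 2)
        K = self.Wk(memory).view(b, Lk, h, d_k).transpose(1, 2)
        V = self.Wv(memory).view(b, Lk, h, d_k).transpose(1, 2)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
        if prev_scores is not None:
            scores = scores + prev_scores            # residual on cross-attention logits
        attn = torch.softmax(scores, dim=-1)
        out = attn.matmul(V).transpose(1, 2).contiguous().view(b, Lq, d_model)
        return self.Wo(out), scores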

References

He, R., Ravula, A., Kanagal, B., and Ainslie, J. (2020). RealFormer: Transformer Likes Residual Attention.
