
Centered Self-Attention Layers

Updated 30 January 2026
  • Centered self-attention layers are a variant of the Transformer mechanism that incorporate a learnable Gaussian bias to dynamically predict attention centers and local windows.
  • They adjust query-specific attention by integrating a bias that favors local neighborhoods in lower layers, leading to significant improvements in BLEU scores for machine translation.
  • This approach effectively balances short-range context in early layers with global dependency capture in deeper layers, enhancing phrasal and overall translation quality.

Centered self-attention layers, as introduced by Yang et al. in "Modeling Localness for Self-Attention Networks" (Yang et al., 2018), modify the standard self-attention mechanism by biasing the attention distribution toward a local neighborhood centered at a dynamically predicted position. This is achieved by integrating a learnable Gaussian bias into the attention computation, enabling the model to focus on local context in lower layers while preserving the ability to capture global dependencies in higher layers. The approach yields substantial improvements in BLEU scores for neural machine translation tasks, demonstrating both quantitative and qualitative gains in modeling phrasal and short-range dependencies.

1. Mathematical Formulation of Gaussian Localness Bias

In standard Transformer self-attention, the input hidden states $H^{l-1} = [h_1, \ldots, h_I] \in \mathbb{R}^{I \times d}$ are projected to queries $Q$, keys $K$, and values $V$ of shape $\mathbb{R}^{I \times d}$. The logit for query $i$ and key $j$ is given by

$$e_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d}},$$

and attention weights are computed via

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}).$$

Centered self-attention augments this mechanism by incorporating a learnable Gaussian bias $b_{ij}$:

$$b_{ij} = -\frac{(j - c_i)^2}{2\sigma_i^2},$$

where $c_i$ is the center position and $\sigma_i$ controls the scope. The modified logit becomes

$$e'_{ij} = e_{ij} + b_{ij},$$

and the resulting attention weights are

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij} + b_{ij}).$$

In the authors' notation, $c_i$ is written $P_i$ (the predicted center), and $\sigma_i$ is tied to the window size $D_i$ via $\sigma_i = D_i/2$.
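To make the effect of the bias concrete, here is a small NumPy sketch for one query: the logits $e_{ij}$ and bias $b_{ij}$ follow the equations above, but the center $c_i$ and scope $\sigma_i$ are set by hand rather than predicted, and the toy dimensions are arbitrary:

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

I, d = 6, 4                      # toy sequence length and head dimension
rng = np.random.default_rng(0)
Q = rng.normal(size=(I, d))
K = rng.normal(size=(I, d))

i = 2                            # query position under consideration
e_i = Q[i] @ K.T / np.sqrt(d)    # standard logits e_{ij}, shape [I]

c_i, sigma_i = 2.0, 1.0          # hand-picked center and scope (learned in the paper)
j = np.arange(I)
b_i = -(j - c_i) ** 2 / (2 * sigma_i ** 2)   # Gaussian bias b_{ij}

alpha_plain = softmax(e_i)       # unbiased attention weights
alpha_local = softmax(e_i + b_i) # centered attention weights

# The biased distribution concentrates mass near position c_i = 2.
print(alpha_plain.round(3))
print(alpha_local.round(3))
```

Since $b_{ij} = 0$ at $j = c_i$ and is negative elsewhere, the biased softmax always shifts probability mass toward the center.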

2. Parameterization and Learning of Local Window

The center $P_i$ and scope $\sigma_i$ are learned in a query-specific manner. For each query $Q_i$, two scalar scores are computed:

$$p_i = U_p^T \tanh(W_p Q_i), \quad z_i = U_d^T \tanh(W_p Q_i),$$

where $W_p \in \mathbb{R}^{d \times d}$ and $U_p, U_d \in \mathbb{R}^d$ are learned parameters (the hidden projection $W_p$ is shared between the two scores).

These are projected into valid ranges:

$$[P_i; D_i] = I \cdot \mathrm{sigmoid}([p_i; z_i]),$$

ensuring $P_i, D_i \in (0, I)$, with $\sigma_i = D_i/2$. This formulation allows each token's query to define its own attention center and window width.
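A minimal NumPy sketch of this query-specific prediction (randomly initialized $W_p$, $U_p$, $U_d$ stand in for trained parameters; dimensions are toy values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

I, d = 10, 8                          # toy sequence length and model dimension
rng = np.random.default_rng(1)
Q   = rng.normal(size=(I, d))         # queries for one head
W_p = rng.normal(size=(d, d)) * 0.1   # shared hidden projection
U_p = rng.normal(size=d)              # scores the center position
U_d = rng.normal(size=d)              # scores the window size

h = np.tanh(Q @ W_p.T)                # shared hidden state, shape [I, d]
p = h @ U_p                           # scalar score p_i per query
z = h @ U_d                           # scalar score z_i per query

P = I * sigmoid(p)                    # predicted centers, each in (0, I)
D = I * sigmoid(z)                    # predicted window sizes, each in (0, I)
sigma = D / 2.0                       # scope of the Gaussian

print(P.round(2), D.round(2))
```

The sigmoid-and-scale step is what keeps every predicted center and window inside the sentence boundaries.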

Alternative parameterizations include:

  • Layer-specific: a single window size $D$ per layer, predicted from the mean of all keys via $z = U_d^T \tanh(W_d\, \mathrm{mean}(K))$.
  • Fixed: a constant window size $D$ with the center fixed at the query's own position, $P_i = i$ (a local window around each position).
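For the fixed variant the bias matrix can be written down in closed form; a small sketch with sequence length $I = 5$ and window $D = 2$ (so $\sigma = 1$), centering each row on its own position:

```python
import numpy as np

I, D = 5, 2.0
sigma = D / 2.0
pos = np.arange(I)
# G[i, j] = -(j - i)^2 / (2 sigma^2): each query i is centered on itself.
G = -(pos[None, :] - pos[:, None]) ** 2 / (2 * sigma ** 2)
print(G)
```

The result is a symmetric band matrix with zeros on the diagonal and increasingly negative entries away from it, i.e. a hard-coded locality prior.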

3. Integration into Transformer Architecture

The Gaussian bias is incorporated into the scaled dot-product attention as

$$G_{ij} = -\frac{(j - P_i)^2}{2\sigma_i^2}, \quad \alpha_{ij} = \mathrm{softmax}_j\!\left( \frac{Q_i \cdot K_j}{\sqrt{d}} + G_{ij} \right),$$

with output

$$H^l_i = \sum_{j=1}^{I} \alpha_{ij} V_j.$$

In implementation, for layers $l \leq L_{\text{local}}$ (typically $L_{\text{local}} = 3$), the Gaussian bias matrix $G \in \mathbb{R}^{I \times I}$ is added to the attention logits; higher layers use standard self-attention without the localness bias.

4. Layer-wise Application and Empirical Justification

Ablation results indicate that restricting the localness modeling to lower layers (layers 1–3 in a six-layer stack) provides the largest improvements in BLEU score. Applying the bias to all layers instead leads to minor performance degradation; limiting it to upper layers is less effective. Visualization of predicted centers and window sizes shows that lower layers predominantly focus on short-range context, while higher layers naturally expand the scope, capturing long-range dependencies.

This stratified use of Gaussian bias aligns with the notion that short-range context is crucial at initial processing stages, whereas global context emerges within deeper layers (Yang et al., 2018).

5. Algorithmic Workflow

The forward pass can be summarized by the following pseudocode:

```
for each batch:
    compute Q, K, V                        # [B, I, d]
    for each layer l = 1..L:
        if l <= L_local:                   # e.g. L_local = 3
            for i in 1..I:
                p_i = U_p^T tanh(W_p Q[i])
                z_i = U_d^T tanh(W_p Q[i])
                [P_i, D_i] = I_seq * sigmoid([p_i, z_i])   # I_seq: sequence length
                sigma_i = D_i / 2
            for i, j in 1..I:
                G[i, j] = -(j - P_i)^2 / (2 * sigma_i^2)
            logits = (Q @ K.T) / sqrt(d) + G
        else:
            logits = (Q @ K.T) / sqrt(d)
        alpha = softmax(logits, dim=2)     # normalize over keys j
        H^l = alpha @ V
        # feed-forward, residual, layer-norm, etc.
```

At inference time, the same network structure is used, with $P_i$ and $\sigma_i$ predicted from each $Q_i$ exactly as during training.
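The layer-wise loop of Section 5 can be made concrete as a runnable NumPy forward pass. This is a sketch under simplifying assumptions: a single attention head, no batching, no residual/feed-forward/layer-norm sublayers, and randomly initialized matrices standing in for trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stability shift
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def centered_attention(Q, K, V, W_p, U_p, U_d, use_bias):
    """One self-attention layer, optionally with the Gaussian localness bias."""
    I, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)                  # [I, I]
    if use_bias:
        h = np.tanh(Q @ W_p.T)                     # [I, d]
        P = I * sigmoid(h @ U_p)                   # centers, each in (0, I)
        sigma = (I * sigmoid(h @ U_d)) / 2.0       # scopes sigma_i = D_i / 2
        j = np.arange(I)
        G = -(j[None, :] - P[:, None]) ** 2 / (2 * sigma[:, None] ** 2)
        logits = logits + G
    return softmax(logits, axis=-1) @ V            # [I, d]

rng = np.random.default_rng(2)
I, d, L, L_local = 8, 16, 6, 3
H = rng.normal(size=(I, d))
for l in range(1, L + 1):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    W_p = rng.normal(size=(d, d)) * 0.1
    U_p, U_d = rng.normal(size=d), rng.normal(size=d)
    # Gaussian bias only in the lower L_local layers, as in the paper.
    H = centered_attention(H @ Wq, H @ Wk, H @ Wv,
                           W_p, U_p, U_d, use_bias=(l <= L_local))
print(H.shape)
```

In a real implementation the bias would be computed per head and added before masking, but the layer-gating logic (`l <= L_local`) is the same.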

6. Implementation Details and Practical Configuration

Experiments were conducted using the standard Transformer architecture (Base: $d=512$, 8 heads; Big: $d=1024$, 16 heads), with 6 layers in both encoder and decoder, feed-forward size 2048, dropout 0.1, and label smoothing 0.1. Optimization used Adam ($\beta_1=0.9$, $\beta_2=0.98$, $\epsilon=10^{-9}$) with the original warm-up schedule and peak learning rate $\approx 1.0$. Batches consisted of $\sim$4096 tokens per GPU across 8 GPUs. The vocabulary was generated by BPE with 32,000 merge operations; the maximum sentence length was 50. Newly introduced parameters ($W_p$, $U_p$, $W_d$, $U_d$) followed standard Xavier/Glorot initialization. No additional regularization was applied beyond the Transformer's dropout.
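The "original warm-up schedule" referenced above is the base Transformer's inverse-square-root schedule; a sketch of it, assuming `warmup_steps = 4000` (the value from the original Transformer recipe, not stated in this section):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Inverse-square-root schedule: linear warm-up, then step**-0.5 decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly during warm-up, peaks at warmup_steps, then decays.
print(transformer_lr(100), transformer_lr(4000), transformer_lr(40000))
```

The "peak learning rate $\approx 1.0$" in the text corresponds to the global scale multiplying this schedule, not to the raw value it emits.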

7. Experimental Results and Qualitative Analysis

The centered self-attention mechanism was evaluated on Chinese→English (WMT17, 20.62M sentence pairs) and English→German (WMT14, 4.56M sentence pairs) translation. BLEU scores were calculated using case-sensitive multi-BLEU.pl. Key findings include:

| Model | Zh→En BLEU | En→De BLEU |
| --- | --- | --- |
| Transformer-Base | 24.13 | 27.64 |
| + Gaussian localness | 24.77 | 28.11 |
| + both (localness + relative position) | 24.96 | 28.54 |
| Transformer-Big | 24.56 | 28.58 |
| + localness | 25.03 | 28.89 |
| + both | 25.28 | 29.18 |

Qualitative visualization indicates that head-specialized window sizes distribute attention from narrow to wide contexts. Distribution plots of $(P_i, D_i)$ across layers show that upper layers predict larger window sizes, whereas lower layers maintain focus on local contexts. N-gram BLEU analysis demonstrates greater relative gains on longer n-grams (i.e., phrases), confirming that the approach sharpens phrasal attention.

8. Significance, Interpretation, and Implications

The centered self-attention mechanism provides a principled bias toward short-range context in the early stages of deep attention networks while retaining long-range capacity. This suggests improved modeling of phrasal structure, which is particularly relevant for machine translation. The query-specific prediction of attention center and window size enables fine-grained context adaptation, while limiting the bias to lower layers avoids over-constraining the receptive field at deeper levels. A plausible implication is that such targeted localness bias may be broadly beneficial in other sequence modeling tasks where local context is paramount.
