Centered Self-Attention Layers
- Centered self-attention layers are a variant of the Transformer attention mechanism that incorporates a learnable Gaussian bias whose center and local window are dynamically predicted from each query.
- They adjust query-specific attention by integrating a bias that favors local neighborhoods in lower layers, leading to significant improvements in BLEU scores for machine translation.
- This approach effectively balances short-range context in early layers with global dependency capture in deeper layers, enhancing phrasal and overall translation quality.
Centered self-attention layers, as introduced by Yang et al. in "Modeling Localness for Self-Attention Networks" (Yang et al., 2018), modify the standard self-attention mechanism by biasing the attention distribution toward a local neighborhood centered at a dynamically predicted position. This is achieved by integrating a learnable Gaussian bias into the attention computation, enabling the model to focus on local context in lower layers while preserving the ability to capture global dependencies in higher layers. The approach yields substantial improvements in BLEU scores for neural machine translation tasks, demonstrating both quantitative and qualitative gains in modeling phrasal and short-range dependencies.
1. Mathematical Formulation of Gaussian Localness Bias
In standard Transformer self-attention, the input hidden states are projected to queries $Q$, keys $K$, and values $V$ of shape $I \times d$, where $I$ is the sequence length and $d$ is the dimension per head. The logit for query $q_i$ and key $k_j$ is given by

$$e_{ij} = \frac{q_i k_j^\top}{\sqrt{d}},$$

and attention weights are computed via

$$\alpha_{ij} = \operatorname{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j'=1}^{I} \exp(e_{ij'})}.$$
Centered self-attention augments this mechanism by incorporating a learnable Gaussian bias $G_{ij}$:

$$G_{ij} = -\frac{(j - P_i)^2}{2\sigma_i^2},$$

where $P_i$ is the center position and $\sigma_i$ controls the scope. The modified logit becomes

$$e'_{ij} = e_{ij} + G_{ij},$$

and the resulting attention weights are

$$\alpha_{ij} = \operatorname{softmax}_j(e'_{ij}) = \frac{\exp(e'_{ij})}{\sum_{j'=1}^{I} \exp(e'_{ij'})}.$$

In the authors' notation, $P_i$ is the predicted center, and $\sigma_i$ is tied to the window size $D_i$ via $\sigma_i = D_i / 2$.
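The Gaussian bias above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name `gaussian_bias` and the example values are my own, not from the paper); it builds the $I \times I$ bias matrix from per-query centers and scopes:

```python
import numpy as np

def gaussian_bias(centers, sigmas, seq_len):
    """G[i, j] = -(j - P_i)^2 / (2 * sigma_i^2).

    `centers` holds P_i and `sigmas` holds sigma_i, one scalar per query;
    the result is an [I, I] matrix added to the attention logits.
    """
    positions = np.arange(seq_len)                 # key positions j = 0..I-1
    diff = positions[None, :] - centers[:, None]   # (j - P_i), broadcast to [I, I]
    return -(diff ** 2) / (2.0 * sigmas[:, None] ** 2)

# Example: query 0 centered at position 2 (narrow window),
# query 1 centered at position 5 (wider window).
G = gaussian_bias(np.array([2.0, 5.0]), np.array([1.0, 3.0]), seq_len=8)
```

The bias is zero at $j = P_i$ and increasingly negative away from it, so after the softmax the attention mass concentrates around the predicted center.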
2. Parameterization and Learning of Local Window
The center $P_i$ and scope $\sigma_i$ are learned in a query-specific manner. For each query $q_i$, two scalar scores are computed:

$$p_i = U_p^\top \tanh(W_p q_i), \qquad z_i = U_d^\top \tanh(W_p q_i),$$

where $W_p \in \mathbb{R}^{d \times d}$ and $U_p, U_d \in \mathbb{R}^{d}$ are learned parameters (the hidden projection $W_p$ is shared between the two scores).

These are projected into valid ranges:

$$\begin{bmatrix} P_i \\ D_i \end{bmatrix} = I \cdot \operatorname{sigmoid}\!\left(\begin{bmatrix} p_i \\ z_i \end{bmatrix}\right),$$

ensuring $0 < P_i, D_i < I$, with $\sigma_i = \frac{D_i}{2}$. This formulation allows each token's query to define its own attention center and window width.
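The query-specific prediction can be sketched as follows. This is an illustrative NumPy fragment, not the authors' code: the parameters are randomly initialized stand-ins for learned weights, and shapes follow the notation above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, I = 4, 10                      # per-head dimension, sequence length
Q = rng.normal(size=(I, d))       # query vectors for one head

# Stand-ins for the learned parameters W_p, U_p, U_d.
W_p = rng.normal(size=(d, d))
U_p = rng.normal(size=d)
U_d = rng.normal(size=d)

hidden = np.tanh(Q @ W_p.T)       # shared hidden projection
p = hidden @ U_p                  # scalar score for the center, per query
z = hidden @ U_d                  # scalar score for the window, per query

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P = I * sigmoid(p)                # predicted centers, in (0, I)
D = I * sigmoid(z)                # predicted window sizes, in (0, I)
sigma = D / 2.0                   # Gaussian standard deviation
```

Scaling by the sentence length $I$ keeps both the center and the window inside the sequence regardless of the raw score magnitudes.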
Alternative parameterizations include:
- Layer-specific: A single window size per layer, predicted from the mean of all keys via $z = U_d^\top \tanh(W_d \bar{K})$, where $\bar{K}$ is the mean of the key vectors.
- Fixed: Assign a constant window size $D$ and set $P_i = i$ (a local window around each position).
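The fixed variant is the simplest of the three: with $P_i = i$ and a constant $D$, the bias depends only on the offset $j - i$, giving a static band around the diagonal. A small sketch (function name and example values are illustrative):

```python
import numpy as np

def fixed_local_bias(seq_len, D):
    """Fixed variant: P_i = i and constant window size D,
    yielding a static Gaussian band around the diagonal."""
    sigma = D / 2.0
    pos = np.arange(seq_len)
    diff = pos[None, :] - pos[:, None]       # (j - i) for every query/key pair
    return -(diff ** 2) / (2.0 * sigma ** 2)

G_fixed = fixed_local_bias(seq_len=6, D=2.0)
```

Because the bias depends only on $|j - i|$, the matrix is shared across all queries (and can be shared across layers), trading the adaptivity of the query-specific variant for zero extra parameters.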
3. Integration into Transformer Architecture
The Gaussian bias is incorporated into the scaled dot-product attention as:

$$\operatorname{ATT}(Q, K) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + G\right),$$

with output

$$O = \operatorname{ATT}(Q, K)\, V.$$
In implementation, for the lower layers $l \le L_{\text{local}}$ (typically $L_{\text{local}} = 3$ in a six-layer stack), the Gaussian bias matrix $G \in \mathbb{R}^{I \times I}$ is added to the attention logits. For higher layers, the standard self-attention is used without the localness bias.
4. Layer-wise Application and Empirical Justification
Ablation results indicate that restricting the localness modeling to lower layers (layers 1–3 in a six-layer stack) provides the largest improvements in BLEU score. Applying the bias to all layers instead leads to minor performance degradation; limiting it to upper layers is less effective. Visualization of predicted centers and window sizes shows that lower layers predominantly focus on short-range context, while higher layers naturally expand the scope, capturing long-range dependencies.
This stratified use of Gaussian bias aligns with the notion that short-range context is crucial at initial processing stages, whereas global context emerges within deeper layers (Yang et al., 2018).
5. Algorithmic Workflow
Pseudocode for training or inference includes the following steps:
```
for each batch:
    for each layer l = 1..L:
        compute Q, K, V from the layer input    # [B, I, d]
        if l <= L_local:                        # e.g. L_local = 3
            for i in 1..I:
                p_i = U_p^T tanh(W_p Q[i])
                z_i = U_d^T tanh(W_p Q[i])      # hidden projection W_p is shared
                [P_i, D_i] = I * sigmoid([p_i, z_i])
                sigma_i = D_i / 2
            for i, j in 1..I:
                G[i, j] = -(j - P_i)^2 / (2 * sigma_i^2)
            logits = (Q @ K.T) / sqrt(d) + G
        else:
            logits = (Q @ K.T) / sqrt(d)
        alpha = softmax(logits, over keys)
        H^l = alpha @ V
        # feed-forward, residual, layer-norm, etc.
```
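The per-position loops above can be vectorized. Below is a minimal runnable single-head NumPy sketch of one centered self-attention layer; parameter names follow the notation above, and the random initialization stands in for trained weights (illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def centered_attention(Q, K, V, W_p, U_p, U_d, local=True):
    """One head of centered self-attention (vectorized sketch)."""
    I, d = Q.shape
    logits = (Q @ K.T) / np.sqrt(d)
    if local:
        hidden = np.tanh(Q @ W_p.T)
        P = I / (1.0 + np.exp(-(hidden @ U_p)))      # centers, in (0, I)
        D = I / (1.0 + np.exp(-(hidden @ U_d)))      # window sizes, in (0, I)
        sigma = D / 2.0
        j = np.arange(I)
        G = -((j[None, :] - P[:, None]) ** 2) / (2.0 * sigma[:, None] ** 2)
        logits = logits + G                          # bias logits toward the center
    alpha = softmax(logits, axis=-1)                 # normalize over keys
    return alpha @ V, alpha

rng = np.random.default_rng(1)
I, d = 7, 4
Q, K, V = (rng.normal(size=(I, d)) for _ in range(3))
W_p = rng.normal(size=(d, d))
U_p, U_d = rng.normal(size=d), rng.normal(size=d)
out, alpha = centered_attention(Q, K, V, W_p, U_p, U_d)
```

Since $G$ is added before the softmax, each row of `alpha` still sums to one; the bias only reshapes where the probability mass goes.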
6. Implementation Details and Practical Configuration
Experiments were conducted using the standard Transformer architecture (Base: $d_{\text{model}} = 512$, 8 heads; Big: $d_{\text{model}} = 1024$, 16 heads), with 6 layers in both encoder and decoder, feed-forward size 2048, dropout 0.1, and label smoothing 0.1. Optimization used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) with the original warm-up learning-rate schedule. Batches consisted of 4096 tokens per GPU across 8 GPUs. Vocabulary was generated by BPE with 32,000 merge operations; maximum sentence length was 50. Newly introduced parameters ($W_p$, $W_d$, $U_p$, $U_d$) followed standard Xavier/Glorot initialization. No additional regularization was applied beyond Transformer dropout.
7. Experimental Results and Qualitative Analysis
The centered self-attention mechanism was evaluated on Chinese→English (WMT17, 20.62M sentence pairs) and English→German (WMT14, 4.56M sentence pairs) translation. BLEU scores were calculated using the case-sensitive `multi-bleu.perl` script. Key findings include:
| Model | Zh→En BLEU | En→De BLEU |
|---|---|---|
| Transformer-Base | 24.13 | 27.64 |
| + Gaussian localness | 24.77 | 28.11 |
| + both (local + rel) | 24.96 | 28.54 |
| Transformer-Big | 24.56 | 28.58 |
| + localness | 25.03 | 28.89 |
| + both | 25.28 | 29.18 |
Qualitative visualization indicates that head-specialized window sizes distribute attention from narrow to wide contexts. Distribution plots of the predicted window size $D_i$ across layers show that upper layers predict larger window sizes, whereas lower layers maintain focus on local contexts. N-gram BLEU analysis demonstrates greater relative performance gains on longer n-grams (i.e., phrases), confirming the efficacy of the approach in sharpening phrasal attention.
8. Significance, Interpretation, and Implications
The centered self-attention mechanism provides a principled bias toward short-range context in early stages of deep attention networks while retaining long-range capacity. This suggests improved modeling of phrasal structure particularly relevant for machine translation. The query-specific prediction of attention center and window size enables fine-grained context adaptation, while limiting application to lower layers avoids over-constraining the receptive field at deeper levels. A plausible implication is that such targeted localness bias may be more broadly beneficial in other sequence modeling tasks where local context is paramount.