Centered Self-Attention Layers
- Centered self-attention layers are a variant of the Transformer attention mechanism that incorporates a learnable Gaussian bias whose center and local window are dynamically predicted from each query.
- They adjust query-specific attention by integrating a bias that favors local neighborhoods in lower layers, leading to significant improvements in BLEU scores for machine translation.
- This approach effectively balances short-range context in early layers with global dependency capture in deeper layers, enhancing phrasal and overall translation quality.
Centered self-attention layers, as introduced by Yang et al. in "Modeling Localness for Self-Attention Networks" (Yang et al., 2018), modify the standard self-attention mechanism by biasing the attention distribution toward a local neighborhood centered at a dynamically predicted position. This is achieved by integrating a learnable Gaussian bias into the attention computation, enabling the model to focus on local context in lower layers while preserving the ability to capture global dependencies in higher layers. The approach yields substantial improvements in BLEU scores for neural machine translation tasks, demonstrating both quantitative and qualitative gains in modeling phrasal and short-range dependencies.
1. Mathematical Formulation of Gaussian Localness Bias
In standard Transformer self-attention, the input hidden states are projected to queries $Q$, keys $K$, and values $V$ of shape $I \times d$, where $I$ is the sequence length and $d$ is the dimension per head. The logit for query $q_i$ and key $k_j$ is given by

$$e_{ij} = \frac{q_i k_j^\top}{\sqrt{d}},$$

and attention weights are computed via

$$\alpha_{ij} = \operatorname{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j'=1}^{I} \exp(e_{ij'})}.$$
Centered self-attention augments this mechanism by incorporating a learnable Gaussian bias $G_{ij}$:

$$G_{ij} = -\frac{(j - P_i)^2}{2\sigma_i^2},$$

where $P_i$ is the center position and $\sigma_i$ controls the scope. The modified logit becomes

$$e'_{ij} = e_{ij} + G_{ij},$$

and the resulting attention weights are

$$\alpha_{ij} = \operatorname{softmax}_j(e'_{ij}) = \frac{\exp(e'_{ij})}{\sum_{j'=1}^{I} \exp(e'_{ij'})}.$$

In the authors' notation, $P_i$ is the predicted center, and $\sigma_i$ is tied to the window size $D_i$ via $\sigma_i = D_i / 2$.
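The Gaussian bias above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name `gaussian_bias` and the example values are my own, not from the paper); it builds the $I \times I$ bias matrix from per-query centers and scopes:

```python
import numpy as np

def gaussian_bias(centers, sigmas, seq_len):
    """G[i, j] = -(j - P_i)^2 / (2 * sigma_i^2).

    `centers` holds P_i and `sigmas` holds sigma_i, one scalar per query;
    the result is an [I, I] matrix added to the attention logits.
    """
    positions = np.arange(seq_len)                 # key positions j = 0..I-1
    diff = positions[None, :] - centers[:, None]   # (j - P_i), broadcast to [I, I]
    return -(diff ** 2) / (2.0 * sigmas[:, None] ** 2)

# Example: query 0 centered at position 2 (narrow window),
# query 1 centered at position 5 (wider window).
G = gaussian_bias(np.array([2.0, 5.0]), np.array([1.0, 3.0]), seq_len=8)
```

The bias is zero at $j = P_i$ and increasingly negative away from it, so after the softmax the attention mass concentrates around the predicted center.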
2. Parameterization and Learning of Local Window
The center $P_i$ and scope $\sigma_i$ are learned in a query-specific manner. For each query $q_i$, two scalar scores are computed:

$$p_i = U_p^\top \tanh(W_p q_i), \qquad z_i = U_d^\top \tanh(W_p q_i),$$

where $W_p \in \mathbb{R}^{d \times d}$ and $U_p, U_d \in \mathbb{R}^{d}$ are learned parameters (the hidden projection $W_p$ is shared between the two scores).

These are projected into valid ranges:

$$\begin{bmatrix} P_i \\ D_i \end{bmatrix} = I \cdot \operatorname{sigmoid}\!\left(\begin{bmatrix} p_i \\ z_i \end{bmatrix}\right),$$

ensuring $0 < P_i, D_i < I$, with $\sigma_i = \frac{D_i}{2}$. This formulation allows each token's query to define its own attention center and window width.
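The query-specific prediction can be sketched as follows. This is an illustrative NumPy fragment, not the authors' code: the parameters are randomly initialized stand-ins for learned weights, and shapes follow the notation above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, I = 4, 10                      # per-head dimension, sequence length
Q = rng.normal(size=(I, d))       # query vectors for one head

# Stand-ins for the learned parameters W_p, U_p, U_d.
W_p = rng.normal(size=(d, d))
U_p = rng.normal(size=d)
U_d = rng.normal(size=d)

hidden = np.tanh(Q @ W_p.T)       # shared hidden projection
p = hidden @ U_p                  # scalar score for the center, per query
z = hidden @ U_d                  # scalar score for the window, per query

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P = I * sigmoid(p)                # predicted centers, in (0, I)
D = I * sigmoid(z)                # predicted window sizes, in (0, I)
sigma = D / 2.0                   # Gaussian standard deviation
```

Scaling by the sentence length $I$ keeps both the center and the window inside the sequence regardless of the raw score magnitudes.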
Alternative parameterizations include:
- Layer-specific: A single window size per layer, predicted from the mean of all keys via $z = U_d^\top \tanh(W_d \bar{K})$, where $\bar{K}$ is the mean of the key vectors.
- Fixed: Assign a constant window size $D$ and set $P_i = i$ (a local window around each position).
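The fixed variant is the simplest of the three: with $P_i = i$ and a constant $D$, the bias depends only on the offset $j - i$, giving a static band around the diagonal. A small sketch (function name and example values are illustrative):

```python
import numpy as np

def fixed_local_bias(seq_len, D):
    """Fixed variant: P_i = i and constant window size D,
    yielding a static Gaussian band around the diagonal."""
    sigma = D / 2.0
    pos = np.arange(seq_len)
    diff = pos[None, :] - pos[:, None]       # (j - i) for every query/key pair
    return -(diff ** 2) / (2.0 * sigma ** 2)

G_fixed = fixed_local_bias(seq_len=6, D=2.0)
```

Because the bias depends only on $|j - i|$, the matrix is shared across all queries (and can be shared across layers), trading the adaptivity of the query-specific variant for zero extra parameters.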
3. Integration into Transformer Architecture
The Gaussian bias is incorporated into the scaled dot-product attention as:

$$\operatorname{ATT}(Q, K) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + G\right),$$

with output

$$O = \operatorname{ATT}(Q, K)\, V.$$
In implementation, for the lower layers $l \le L_{\text{local}}$ (typically $L_{\text{local}} = 3$ in a six-layer stack), the Gaussian bias matrix $G \in \mathbb{R}^{I \times I}$ is added to the attention logits. For higher layers, the standard self-attention is used without the localness bias.
4. Layer-wise Application and Empirical Justification
Ablation results indicate that restricting the localness modeling to lower layers (layers 1–3 in a six-layer stack) provides the largest improvements in BLEU score. Applying the bias to all layers instead leads to minor performance degradation; limiting it to upper layers is less effective. Visualization of predicted centers and window sizes shows that lower layers predominantly focus on short-range context, while higher layers naturally expand the scope, capturing long-range dependencies.
This stratified use of Gaussian bias aligns with the notion that short-range context is crucial at initial processing stages, whereas global context emerges within deeper layers (Yang et al., 2018).
5. Algorithmic Workflow
Pseudocode for training or inference includes the following steps:
```
for each batch:
    for each layer l = 1..L:
        compute Q, K, V from the layer input    # [B, I, d]
        if l <= L_local:                        # e.g. L_local = 3
            for i in 1..I:
                p_i = U_p^T tanh(W_p Q[i])
                z_i = U_d^T tanh(W_p Q[i])      # hidden projection W_p is shared
                [P_i, D_i] = I * sigmoid([p_i, z_i])
                sigma_i = D_i / 2
            for i, j in 1..I:
                G[i, j] = -(j - P_i)^2 / (2 * sigma_i^2)
            logits = (Q @ K.T) / sqrt(d) + G
        else:
            logits = (Q @ K.T) / sqrt(d)
        alpha = softmax(logits, over keys)
        H^l = alpha @ V
        # feed-forward, residual, layer-norm, etc.
```
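The per-position loops above can be vectorized. Below is a minimal runnable single-head NumPy sketch of one centered self-attention layer; parameter names follow the notation above, and the random initialization stands in for trained weights (illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def centered_attention(Q, K, V, W_p, U_p, U_d, local=True):
    """One head of centered self-attention (vectorized sketch)."""
    I, d = Q.shape
    logits = (Q @ K.T) / np.sqrt(d)
    if local:
        hidden = np.tanh(Q @ W_p.T)
        P = I / (1.0 + np.exp(-(hidden @ U_p)))      # centers, in (0, I)
        D = I / (1.0 + np.exp(-(hidden @ U_d)))      # window sizes, in (0, I)
        sigma = D / 2.0
        j = np.arange(I)
        G = -((j[None, :] - P[:, None]) ** 2) / (2.0 * sigma[:, None] ** 2)
        logits = logits + G                          # bias logits toward the center
    alpha = softmax(logits, axis=-1)                 # normalize over keys
    return alpha @ V, alpha

rng = np.random.default_rng(1)
I, d = 7, 4
Q, K, V = (rng.normal(size=(I, d)) for _ in range(3))
W_p = rng.normal(size=(d, d))
U_p, U_d = rng.normal(size=d), rng.normal(size=d)
out, alpha = centered_attention(Q, K, V, W_p, U_p, U_d)
```

Since $G$ is added before the softmax, each row of `alpha` still sums to one; the bias only reshapes where the probability mass goes.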
6. Implementation Details and Practical Configuration
Experiments were conducted using the standard Transformer architecture (Base: $d_{\text{model}} = 512$, 8 heads; Big: $d_{\text{model}} = 1024$, 16 heads), with 6 layers in both encoder and decoder, feed-forward size 2048, dropout 0.1, and label smoothing 0.1. Optimization used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) with the original warm-up learning-rate schedule. Batches consisted of 4096 tokens per GPU across 8 GPUs. Vocabulary was generated by BPE with 32,000 merge operations; maximum sentence length was 50. Newly introduced parameters ($W_p$, $W_d$, $U_p$, $U_d$) followed standard Xavier/Glorot initialization. No additional regularization was applied beyond Transformer dropout.
7. Experimental Results and Qualitative Analysis
The centered self-attention mechanism was evaluated on Chinese→English (WMT17, 20.62M sentence pairs) and English→German (WMT14, 4.56M sentence pairs) translation. BLEU scores were calculated using the case-sensitive `multi-bleu.perl` script. Key findings include:
| Model | Zh→En BLEU | En→De BLEU |
|---|---|---|
| Transformer-Base | 24.13 | 27.64 |
| + Gaussian localness | 24.77 | 28.11 |
| + both (local + rel) | 24.96 | 28.54 |
| Transformer-Big | 24.56 | 28.58 |
| + localness | 25.03 | 28.89 |
| + both | 25.28 | 29.18 |
Qualitative visualization indicates that head-specialized window sizes distribute attention from narrow to wide contexts. Distribution plots of the predicted window size $D_i$ across layers show that upper layers predict larger window sizes, whereas lower layers maintain focus on local contexts. N-gram BLEU analysis demonstrates greater relative performance gains on longer n-grams (i.e., phrases), confirming the efficacy of the approach in sharpening phrasal attention.
8. Significance, Interpretation, and Implications
The centered self-attention mechanism provides a principled bias toward short-range context in early stages of deep attention networks while retaining long-range capacity. This suggests improved modeling of phrasal structure particularly relevant for machine translation. The query-specific prediction of attention center and window size enables fine-grained context adaptation, while limiting application to lower layers avoids over-constraining the receptive field at deeper levels. A plausible implication is that such targeted localness bias may be more broadly beneficial in other sequence modeling tasks where local context is paramount.