HySAN: Branch-masked Self-Attention
- The paper introduces HySAN, which uses multiple masked self-attention branches to enforce local context and directionality, leading to measurable BLEU improvements in NMT.
- It employs a gated fusion mechanism that efficiently combines outputs from global, local, forward, and backward masks with minimal additional parameters.
- Ablation studies reveal that HySAN recovers strong translation performance even without positional encodings, underscoring its robustness in capturing sequence order.
Branch-masked Self-Attention, formalized as the Hybrid Self-Attention Network (HySAN), is an augmentation of standard self-attention for neural sequence modeling, specifically developed to address the lack of explicit local context sensitivity and content-dependent relative positional awareness in Transformer architectures. HySAN introduces a set of masked self-attention branches—each enforcing a structural prior such as locality or directionality—followed by a parameter-efficient fusion mechanism, yielding statistically significant gains in neural machine translation (NMT) across multiple benchmarks (Song et al., 2018).
1. Motivation and Problem Formulation
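The Transformer model uses scaled dot-product self-attention, computing for each token a global weighting over all other tokens via

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

with $Q, K, V \in \mathbb{R}^{n \times d_k}$ for sequence length $n$. This architecture captures global dependencies but exhibits two notable limitations: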
- Relative Position Insensitivity: Even when supplemented with absolute positional encodings, the dot-product attention mechanism treats all pairwise token relationships as symmetric, precluding explicit modeling of “left” versus “right” context.
- Absence of Explicit Local Focus: Self-attention does not naturally bias toward local structures, unlike convolution or recurrence, making it less effective in capturing immediate semantic dependencies essential for linguistic phenomena in NMT.
HySAN remedies these via branch-masking, where each self-attention branch is masked to a specific contextual region (global, local, forward, backward), facilitating content- and order-sensitive representations.
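As a point of reference for the limitations listed above, the following is a minimal NumPy sketch of plain (unmasked) scaled dot-product self-attention. The function names, the toy dimensions, and the use of the raw input `X` as queries, keys, and values (no learned projections) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise content scores
    weights = softmax(scores)         # each row is a distribution over all n tokens
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d_k = 5, 8                         # toy sequence length and dimension (assumed)
X = rng.normal(size=(n, d_k))
output, weights = self_attention(X, X, X)
print(weights.round(3))
```

Every entry of `weights` is strictly positive and no position index enters the computation, so each token mixes in all others regardless of distance or direction. This is the behaviour the masked branches are meant to constrain.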
2. Architecture: Branch-Masked Self-Attention
All branches operate on the same projected $Q, K, V$, diverging solely via an additive mask $M^{b}$. For branch $b$,

$$H^{b} = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M^{b}\right)V.$$
Branches and Corresponding Masks:
| Branch | Mask Description (Encoder) | Effect |
|---|---|---|
| Global (G) | $M_{ij} = 0$ for all $i, j$ | Standard global SAN |
| Local (L$k$) | $M_{ij} = 0$ if $\lvert i - j \rvert \le k$, $-\infty$ otherwise | Attend within window $k$ |
| Forward (FW) | $M_{ij} = 0$ if $j \le i$, $-\infty$ otherwise | Attend to tokens at positions $j \le i$ |
| Backward (BW) | $M_{ij} = 0$ if $j \ge i$, $-\infty$ otherwise | Attend to tokens at positions $j \ge i$ |
- In the decoder, local branches are masked causally: $M_{ij} = 0$ if $\lvert i - j \rvert \le k$ and $j \le i$, $-\infty$ otherwise (ensuring no information leakage).
This architecture enables extraction of global, local, and directionally-aware contextual representations in parallel via different branches.
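The masks and the per-branch attention can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names `build_mask` and `branch_outputs` are hypothetical, a large negative constant stands in for $-\infty$, and the forward branch is assumed to cover positions $j \le i$ (the backward branch $j \ge i$).

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -infinity; keeps the softmax free of NaNs


def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def build_mask(n, kind, k=None, causal=False):
    """Additive (n, n) mask: 0 where query i may attend to key j, NEG_INF elsewhere."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    if kind == "global":
        allowed = np.ones((n, n), dtype=bool)
    elif kind == "local":
        allowed = np.abs(i - j) <= k
    elif kind == "forward":    # assumed convention: current and earlier positions
        allowed = j <= i
    elif kind == "backward":   # assumed convention: current and later positions
        allowed = j >= i
    else:
        raise ValueError(f"unknown branch kind: {kind}")
    if causal:                 # decoder-style restriction: never look at the future
        allowed = allowed & (j <= i)
    return np.where(allowed, 0.0, NEG_INF)


def branch_outputs(Q, K, V, masks):
    """Compute softmax(S + M_b) V for every branch, sharing one score matrix S."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    return {name: softmax(S + M) @ V for name, M in masks.items()}


rng = np.random.default_rng(0)
n, d = 6, 8  # toy sizes (assumed)
X = rng.normal(size=(n, d))
masks = {
    "G": build_mask(n, "global"),
    "L1": build_mask(n, "local", k=1),
    "FW": build_mask(n, "forward"),
    "BW": build_mask(n, "backward"),
}
outputs = branch_outputs(X, X, X, masks)
causal_local = build_mask(n, "local", k=1, causal=True)
```

Because the score matrix is computed once and only the additive mask differs, the extra cost per branch in this sketch is one softmax and one matrix product.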
3. Squeeze Gate Fusion: Gated Sum
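Each branch yields an output $H^{b}$. Fusion across the $B$ branches is performed via a lightweight “gated sum”:

$$H = \sum_{b=1}^{B} g^{b} \odot H^{b},$$

where $g^{b}$ is a per-token gate, defined by:
- Squeeze via $W_1^{b}$ ($\mathbb{R}^{d \times d/r}$), ReLU, then expand via $W_2^{b}$ ($\mathbb{R}^{d/r \times d}$)
- $g^{b}_i = \sigma\big(\mathrm{ReLU}(H^{b}_i W_1^{b})\, W_2^{b}\big)$ for each position $i$
- $\sigma$ is the sigmoid function.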
Alternative fusion schemes—simple sum or concatenation plus projection—were empirically inferior in BLEU improvement and parameter efficiency. The squeeze gate adds two feed-forward layers per branch with minimal overhead.
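A minimal sketch of the gated sum, assuming a squeeze-and-expand gate of shape $d \to d/r \to d$ applied to each branch output: the function names, the random initialization, and the toy sizes are illustrative assumptions, and the gate input is taken to be the branch output itself.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def squeeze_gate(H, W1, W2):
    """Per-token gate: squeeze d -> d/r, ReLU, expand d/r -> d, then sigmoid."""
    hidden = np.maximum(H @ W1, 0.0)   # (n, d/r)
    return sigmoid(hidden @ W2)        # (n, d), values in (0, 1)


def gated_sum(branch_outputs, gate_params):
    """Fuse branch outputs as sum_b gate_b * H_b (element-wise product)."""
    fused = np.zeros_like(branch_outputs[0])
    for H_b, (W1, W2) in zip(branch_outputs, gate_params):
        fused += squeeze_gate(H_b, W1, W2) * H_b
    return fused


rng = np.random.default_rng(0)
n, d, r, B = 6, 16, 8, 4  # toy sizes; r = 8 is one reduction ratio named in the text
branch_outputs = [rng.normal(size=(n, d)) for _ in range(B)]
gate_params = [
    (0.1 * rng.normal(size=(d, d // r)), 0.1 * rng.normal(size=(d // r, d)))
    for _ in range(B)
]
fused = gated_sum(branch_outputs, gate_params)

# Gate parameters added by this sketch (biases omitted): two matrices per branch.
extra_params = B * 2 * d * (d // r)
```

With these toy sizes the gates add 256 weights in total, which illustrates why a bottleneck of width $d/r$ keeps the fusion cheap compared with a concatenation followed by a $Bd \times d$ projection.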
4. Integration into Transformer Layers
HySAN replaces the self-attention module in each Transformer encoder layer (and the self-attention sublayer in each decoder layer) with a multi-head hybrid SAN. Each head $h$ computes:
- Per-head projections: $Q_h = XW_h^{Q}$, $K_h = XW_h^{K}$, $V_h = XW_h^{V}$.
- Shared score matrix $S_h = Q_h K_h^\top / \sqrt{d_k}$.
- For each branch $b$:
  - $S_h^{b} = S_h + M^{b}$
  - $A_h^{b} = \mathrm{softmax}(S_h^{b})$
  - $H_h^{b} = A_h^{b} V_h$
- Fuse across $b$ by gated-sum to produce $H_h$.
- Concatenate the outputs $H_h$ of all heads and apply the output linear transformation $W^{O}$.
Standard residual connections, layer normalization, and position-wise feed-forward layers follow as in the canonical Transformer. In the decoder, only global, causal local, and forward branches are utilized to maintain autoregressive masking.
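Putting the pieces together, an encoder-side multi-head hybrid self-attention sublayer might look like the sketch below. It omits residual connections, layer normalization, dropout, and biases. The function names, the initialization, the toy dimensions, and the forward/backward conventions ($j \le i$ and $j \ge i$) are assumptions made for illustration, not the authors' code.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -infinity


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def encoder_masks(n, radii=(1, 2, 5)):
    """Additive masks for the G, FW, BW and local (L1, L2, L5) encoder branches."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = {"G": np.ones((n, n), dtype=bool), "FW": j <= i, "BW": j >= i}
    for k in radii:
        allowed[f"L{k}"] = np.abs(i - j) <= k
    return {name: np.where(a, 0.0, NEG_INF) for name, a in allowed.items()}


def init_params(d_model, num_heads, branch_names, r, rng):
    """Random projections, per-branch gate weights, and the output matrix."""
    d_head = d_model // num_heads

    def mat(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)

    heads = []
    for _ in range(num_heads):
        heads.append({
            "Wq": mat(d_model, d_head),
            "Wk": mat(d_model, d_head),
            "Wv": mat(d_model, d_head),
            "gates": {b: (mat(d_head, d_head // r), mat(d_head // r, d_head))
                      for b in branch_names},
        })
    return {"heads": heads, "Wo": mat(d_model, d_model)}


def hybrid_head(X, head, masks):
    """One head: shared scores, one masked softmax per branch, gated-sum fusion."""
    Q, K, V = X @ head["Wq"], X @ head["Wk"], X @ head["Wv"]
    S = Q @ K.T / np.sqrt(Q.shape[-1])       # computed once, reused by all branches
    fused = np.zeros_like(V)
    for name, M in masks.items():
        H_b = softmax(S + M) @ V
        W1, W2 = head["gates"][name]
        gate = sigmoid(np.maximum(H_b @ W1, 0.0) @ W2)
        fused += gate * H_b
    return fused


def hybrid_san(X, params, masks):
    """Concatenate the fused head outputs and apply the output projection."""
    heads = [hybrid_head(X, head, masks) for head in params["heads"]]
    return np.concatenate(heads, axis=-1) @ params["Wo"]


rng = np.random.default_rng(0)
n, d_model, num_heads, r = 7, 16, 4, 2       # toy sizes (assumed)
masks = encoder_masks(n)
params = init_params(d_model, num_heads, list(masks), r, rng)
X = rng.normal(size=(n, d_model))
Y = hybrid_san(X, params, masks)
```

A decoder variant of this sketch would pass only the G, FW, and causal local masks, each further restricted to $j \le i$.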
5. Hyperparameterization and Implementation
Key hyperparameters for HySAN are as follows:
| Model | $d_{\text{model}}$ | # Layers | # Heads | Encoder Branches | Decoder Branches |
|---|---|---|---|---|---|
| Small (IWSLT14) | 256 | 2 | 4 | G, FW, BW, L1, L2, L5 | G, FW, L1 |
| Base (WMT) | 512 | 6 | 8 | G, FW, BW, L1, L2, L5 | G, FW, L1 |
| Big (WMT17) | 1024 | 6 | 16 | G, FW, BW, L1, L2, L5 | G, FW, L1 |
- Local window radii $k$ evaluated at 1, 2, 5.
- Squeeze-gate reduction ratio $r$; 8 is among the values evaluated.
- Regularization utilizes standard attention and feed-forward dropout.
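The table above can be captured in a small configuration object. The class name `HySANConfig`, its field names, and the `CONFIGS` dictionary are hypothetical; the values come from the table, and the default `gate_reduction=8` is just one of the ratios mentioned, chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HySANConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    d_model: int
    num_layers: int
    num_heads: int
    encoder_branches: Tuple[str, ...] = ("G", "FW", "BW", "L1", "L2", "L5")
    decoder_branches: Tuple[str, ...] = ("G", "FW", "L1")
    gate_reduction: int = 8  # assumed default; one of the ratios named in the text

    @property
    def d_head(self) -> int:
        """Per-head dimension, assuming d_model is split evenly across heads."""
        return self.d_model // self.num_heads


CONFIGS = {
    "small": HySANConfig(d_model=256, num_layers=2, num_heads=4),
    "base": HySANConfig(d_model=512, num_layers=6, num_heads=8),
    "big": HySANConfig(d_model=1024, num_layers=6, num_heads=16),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.d_head, len(cfg.encoder_branches), len(cfg.decoder_branches))
```

Under the even-split assumption, all three model sizes end up with the same per-head dimension of 64.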
6. Experimental Results and Ablation Studies
HySAN was evaluated on IWSLT14 German→English, WMT14 English→German, and WMT17 Chinese→English. Optimization used Adam with a scaled learning-rate schedule; decoding used beam size 4 and a length penalty of 0.6.
Notable findings:
- Ablation (IWSLT14, small):
  - Global (baseline): BLEU = 31.27
  - +FW: 31.50 (+0.23)
  - +BW: 31.83 (+0.56)
  - +Local 2: 31.55 (+0.28)
  - All branches with gated sum: 32.28 (+1.01)
- Without positional embeddings, the baseline reaches BLEU 15.6 while HySAN reaches 30.7 (about +15), indicating that the local and directional masks encode order information partly independently of explicit embeddings.
- WMT14 En→De:
  - Base: Transformer = 27.3, HySAN = 27.9 (+0.6)
  - Big: Transformer = 28.4, HySAN = 28.8 (+0.4)
- WMT17 Zh→En (big):
  - Transformer = 24.2, HySAN = 25.27 (+1.07)
The consistent improvements across tasks and model sizes underscore the effectiveness of branch-masked mechanisms and gated fusion for NMT (Song et al., 2018).
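The reported gains can be re-derived from the baseline and HySAN scores quoted above. The short script below only repeats that arithmetic; the dictionary keys are descriptive labels chosen here, not names from the paper.

```python
# (baseline BLEU, HySAN BLEU) pairs as quoted in this section.
reported = {
    "IWSLT14 De-En (small, all branches)": (31.27, 32.28),
    "IWSLT14 De-En (no positional embeddings)": (15.6, 30.7),
    "WMT14 En-De (base)": (27.3, 27.9),
    "WMT14 En-De (big)": (28.4, 28.8),
    "WMT17 Zh-En (big)": (24.2, 25.27),
}

# Absolute BLEU difference, rounded to two decimals to hide float noise.
deltas = {name: round(hysan - base, 2) for name, (base, hysan) in reported.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} BLEU")
```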
7. Significance and Empirical Observations
By introducing structured prior constraints through masking at each self-attention sublayer and selectively fusing diverse contextual views, HySAN addresses two core deficiencies of canonical self-attention: insufficient local context aggregation and indifference to relative order. Its ability to recover strong performance without any positional encoding, as evidenced by the ablation with “NoPos” input, shows that directional and local attention can capture ordering cues on their own. The approach achieves these gains with minimal architectural modification and parameter increase, making it an efficient and generalizable extension to Transformer-based models in sequence transduction tasks (Song et al., 2018).
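The claim that masks alone carry ordering cues can be illustrated with a toy check, sketched here under assumptions: inputs have no positional encodings, queries, keys, and values are the raw inputs, and the forward mask admits positions $j \le i$. Reordering the tokens merely reorders the outputs of unmasked attention, whereas the forward-masked outputs change because each token now sees a different set of predecessors.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -infinity


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(X, mask):
    """Self-attention over raw inputs X with an additive mask (no projections)."""
    S = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(S + mask) @ X


rng = np.random.default_rng(0)
n, d = 6, 8                       # toy sizes (assumed)
X = rng.normal(size=(n, d))       # token contents only, no positional encoding
perm = np.arange(n)[::-1]         # reverse the token order

idx = np.arange(n)
global_mask = np.zeros((n, n))
forward_mask = np.where(idx[None, :] <= idx[:, None], 0.0, NEG_INF)

# Unmasked attention: permuting the input only permutes the output rows.
global_order_blind = np.allclose(
    attend(X[perm], global_mask), attend(X, global_mask)[perm]
)

# Forward-masked attention: the same tokens in a new order give new outputs.
forward_order_blind = np.allclose(
    attend(X[perm], forward_mask), attend(X, forward_mask)[perm]
)

print("global branch ignores order: ", global_order_blind)
print("forward branch ignores order:", forward_order_blind)
```

In this sketch the global branch is blind to the reordering while the forward branch is not, which is consistent with the NoPos ablation reported above.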