
Context-Aware Cross-Attention for NAT

Updated 17 April 2026
  • The paper introduces CCAN, which blends global and local cross-attention signals to enhance token-level predictions in NAT models.
  • It employs a symmetric windowing mechanism to restrict attention to nearby source tokens, thereby sharpening alignment precision.
  • Empirical results demonstrate that CCAN reduces attention diffuseness and improves BLEU scores across multiple translation benchmarks.

Context-aware cross-attention for non-autoregressive translation (NAT) refers to a Transformer-based architectural modification designed to address the challenge of effectively modeling source context in NAT models. In NAT, the decoder predicts the full target sequence in parallel, which accelerates inference but results in a loss of autoregressive target dependencies. Consequently, the cross-attention mechanism from decoder to encoder must capture both broad and fine-grained alignment between source and target, but in practice standard global cross-attention in NAT yields diffuse attention and limited exploitation of local context. To remedy this, the context-aware cross-attention (CCAN) mechanism supplements conventional cross-attention with a localness-restricted signal and adaptively fuses global and local cues to enhance token-level predictions and overall translation fidelity (Ding et al., 2020).

1. The Localness Perception Problem in NAT

In the standard NAT paradigm, given a source sequence $\mathbf{x} = [f_1,\ldots,f_n]$ and target $\mathbf{y} = [e_1,\ldots,e_m]$, each target-side decoder query $Q_i$ (at position $i$) attends to all encoder keys $K_j$ via the conventional scaled dot-product, written here with the scaling factor omitted:

$$\psi_{i,j} = Q_{i}K_j^{\top}$$

The attention distribution is then $\alpha_{i,j} = \mathrm{softmax}(\boldsymbol{\psi}_i)_{j}$, and the attended value is:

$$\mathrm{Att}(Q_i, K, V) = \sum_{j=1}^{n}\alpha_{i,j} V_j$$
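The global cross-attention defined above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names `softmax` and `global_cross_attention`, the toy shapes, and the random inputs are all assumptions made for the example, and the scaling factor is omitted to match the formula as written.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def global_cross_attention(Q, K, V):
    """Q: (m, d) decoder queries, K: (n, d) encoder keys, V: (n, d_v) values."""
    psi = Q @ K.T                   # psi[i, j] = Q_i K_j^T
    alpha = softmax(psi, axis=-1)   # each row is a distribution over source positions
    return alpha @ V, alpha, psi    # Att(Q_i, K, V) = sum_j alpha[i, j] * V_j

# Toy example (hypothetical sizes): m = 4 target positions, n = 6 source positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
att, alpha, psi = global_cross_attention(Q, K, V)
```

Every target position receives a weighted average over all six source vectors, which is the property that makes the attention prone to spreading out when no autoregressive signal is available.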

NAT outputs each target token independently and lacks an autoregressive signal, which leads to “diffuse” attention weights $\alpha_{i,j}$ (higher locality entropy): attention is spread across many source positions, impairing precise token-to-source alignment (Ding et al., 2020). Empirically, on WMT14 En→De, NAT yielded a locality entropy (LE) of 1.66 compared to 1.46 for autoregressive (AT) models, correlating with lower BLEU.
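To make "diffuse" concrete, the sketch below measures how spread out an attention matrix is. The text here does not spell out how LE is computed, so this assumes a plain Shannon entropy per target position, averaged over positions; the function name `attention_entropy` and the two toy distributions are hypothetical, and the values it produces are not comparable to the 1.66 and 1.46 figures quoted above.

```python
import numpy as np

def attention_entropy(alpha, eps=1e-12):
    """Mean Shannon entropy (in nats) of the rows of an attention matrix.

    Assumption: this stands in for a locality-entropy style measure;
    the exact definition used in the paper may differ.
    """
    alpha = np.asarray(alpha, dtype=float)
    row_entropy = -(alpha * np.log(alpha + eps)).sum(axis=-1)
    return float(row_entropy.mean())

# A sharply peaked row versus a uniform (maximally diffuse) row.
sharp = np.array([[0.97, 0.01, 0.01, 0.01]])
diffuse = np.full((1, 4), 0.25)

sharp_entropy = attention_entropy(sharp)
diffuse_entropy = attention_entropy(diffuse)
```

Under this assumed measure, the uniform row attains the maximum entropy for four positions, $\log 4$, while the peaked row scores much lower.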

2. Context-Aware Cross-Attention Architecture

The CCAN mechanism extends each multi-head cross-attention module in the decoder by explicitly blending local and global source cues for each target token. For a target position ii:

  1. Compute standard global attention scores $\psi_{i,j} = Q_i K_j^{\top}$ for all source positions $j$.
  2. Identify the most strongly aligned source position $j^{*} = \arg\max_{j} \psi_{i,j}$.
  3. Define a symmetric window (radius $w$, e.g., 9) around $j^{*}$.
  4. Set localness-constrained scores $\hat{\psi}_{i,j}$:

$$\hat{\psi}_{i,j} = \begin{cases} \psi_{i,j}, & |j - j^{*}| \le w \\ -\infty, & \text{otherwise} \end{cases}$$

and obtain the local softmax $\hat{\alpha}_{i,j} = \mathrm{softmax}(\hat{\boldsymbol{\psi}}_i)_{j}$.

  5. Compute global ($\mathrm{Att}^{G}_i = \sum_{j}\alpha_{i,j} V_j$) and local ($\mathrm{Att}^{L}_i = \sum_{j}\hat{\alpha}_{i,j} V_j$) attended values and interpolate:

$$\mathrm{Att}^{C}_i = (1-\lambda_i)\,\mathrm{Att}^{G}_i + \lambda_i\,\mathrm{Att}^{L}_i$$

$$\lambda_i = \mathrm{sigmoid}(Q_i W)$$

where $W$ is a learned gate parameter, shared across heads.

The final attended representation per target token, $\mathrm{Att}^{C}_i$, is thus an adaptive blend of global and local context, with the gate $\lambda_i$ learned end-to-end as a function of the query (Ding et al., 2020).
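The steps above can be sketched end to end as follows. This is a minimal single-head NumPy illustration under stated assumptions, not the reference implementation: the function name `ccan_attention`, the gate weight `W_gate` with shape `(d, 1)`, the sigmoid form of the gate, and the choice to weight the local branch by the gate are all assumptions made for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Stable softmax; positions scored -inf receive exactly zero weight.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def ccan_attention(Q, K, V, W_gate, w=9):
    """Blend global and window-restricted cross-attention.

    Q: (m, d) queries, K: (n, d) keys, V: (n, d_v) values,
    W_gate: (d, 1) hypothetical gate projection, w: window radius.
    """
    n = K.shape[0]
    psi = Q @ K.T                                   # step 1: global scores (m, n)
    center = psi.argmax(axis=-1)                    # step 2: strongest source position
    positions = np.arange(n)[None, :]
    in_window = np.abs(positions - center[:, None]) <= w   # step 3: symmetric window
    psi_local = np.where(in_window, psi, -np.inf)   # step 4: mask scores outside it
    alpha_global = softmax(psi)
    alpha_local = softmax(psi_local)
    att_global = alpha_global @ V
    att_local = alpha_local @ V
    # Assumed gate: sigmoid of a linear projection of the query, one scalar per position.
    lam = 1.0 / (1.0 + np.exp(-(Q @ W_gate)))       # (m, 1)
    blended = (1.0 - lam) * att_global + lam * att_local
    return blended, alpha_local, lam

# Toy example (hypothetical sizes) with a small radius so the mask is visible.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(30, 16))
V = rng.normal(size=(30, 16))
W_gate = rng.normal(size=(16, 1))
blended, alpha_local, lam = ccan_attention(Q, K, V, W_gate, w=2)
```

Because the window is centred on the argmax of the global scores, at least one position per row always survives the mask, so the local softmax is well defined. With `w=2`, each row of `alpha_local` has at most five nonzero entries.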

3. Training Objective and Implementation

The training procedure remains identical to the underlying Conditional Masked Language Model (CMLM) NAT objective, specifically:

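  • A cross-entropy term for predicting the target sequence length $m$ given $\mathbf{x}$.
  • Masked tokens $\mathbf{y}_{\mathrm{mask}}$ in each target sequence $\mathbf{y}$, with independent prediction of the masked positions given the observed tokens $\mathbf{y}_{\mathrm{obs}}$.
  • The loss is

$$\mathcal{L} = -\log P(m \mid \mathbf{x}) - \sum_{e_t \in \mathbf{y}_{\mathrm{mask}}} \log P(e_t \mid \mathbf{y}_{\mathrm{obs}}, \mathbf{x})$$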
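The two loss terms described above can be sketched as a length cross-entropy plus a cross-entropy over masked positions only. This is an illustrative NumPy version with hypothetical names (`cmlm_loss`, `token_logits`, `length_logits`); summing rather than averaging over masked tokens is an assumption, and real implementations typically operate on batches.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Stable log-softmax via the log-sum-exp trick.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cmlm_loss(token_logits, targets, masked, length_logits, true_length):
    """token_logits: (m, vocab), targets: (m,) token ids,
    masked: (m,) booleans marking positions to predict,
    length_logits: (max_len,), true_length: index of the reference length."""
    token_logp = log_softmax(token_logits)
    picked = token_logp[np.arange(len(targets)), targets]  # log P of each reference token
    token_nll = -(picked * masked).sum()                   # only masked positions count
    length_nll = -log_softmax(length_logits)[true_length]  # length prediction term
    return float(length_nll + token_nll)

# Toy check with uniform predictions: vocabulary of 10, 8 candidate lengths,
# 4 target tokens of which 2 are masked.
token_logits = np.zeros((4, 10))
targets = np.array([1, 2, 3, 4])
masked = np.array([True, False, True, False])
length_logits = np.zeros(8)
loss = cmlm_loss(token_logits, targets, masked, length_logits, true_length=4)
```

With uniform logits the result reduces to $\log 8 + 2\log 10$, which gives a quick sanity check that unmasked positions contribute nothing.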

CCAN layers replace every standard cross-attention module in the 6-layer decoder of the CMLM NAT Transformer. The window size ($w$) and full layerwise application were empirically validated as optimal.

4. Analysis of Local and Global Information Fusion

By construction, the original global weights $\alpha_{i,j}$ allow exploitation of long-range dependencies, while the local weights $\hat{\alpha}_{i,j}$ enable precise modeling of localized, linguistically meaningful source patches. The learned per-token gate $\lambda_i$ balances these components, and ablation studies indicate:

  • Higher contribution of local attention in lower decoder layers, suggesting early-stage benefit from local phrase alignment.
  • Increased use of global information in upper layers for resolving broader syntactic/semantic context.
  • Improvements in $n$-gram translation accuracy over baselines across the evaluated range of $n$, indicative of stronger phrasal modeling.
  • Reduction in locality entropy for NAT cross-attention (e.g., 1.66 → 1.62 on En→De), narrowing the gap to autoregressive models (Ding et al., 2020).

5. Empirical Results and Performance Characteristics

Experimental evaluation of CCAN was conducted on WMT16 Ro→En (0.6M pairs), WMT14 En→De (4.5M), WMT17 Zh→En (20M), and WAT17 Ja→En (2M), with BPE tokenization (32K), sequence-level knowledge distillation from a Transformer-Big teacher, and standard CMLM NAT hyperparameters (encoder/decoder 6 layers, model dim 512, 8 heads, FFN 2048).

Key findings:

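| Task  | CMLM-NAT (BLEU) | +CCAN (BLEU) | Δ    |
|-------|-----------------|--------------|------|
| Ro→En | 33.3            | 33.7         | +0.4 |
| En→De | 27.0            | 27.5         | +0.5 |
| Zh→En | 24.0            | 24.6         | +0.6 |
| Ja→En | 28.9            | 29.4         | +0.5 |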

All increments are statistically significant compared to the baseline. These gains are robust to the window size and to the application of CCAN across all decoder layers; the best BLEU on En→De was achieved with the validated window size $w$ and CCAN applied in layers 1–6.

6. Linguistic and Structural Analysis

Fine-grained probing revealed that:

  • Sentence representations from encoder+CCAN NAT preserve higher-level linguistic properties, including syntactic (e.g., sentence tree depth) and semantic (e.g., bigram shift) features.
  • CCAN reduces the “diffuseness” of attention in NAT, producing sharper and more interpretable alignments.
  • The gated fusion mechanism aligns with linguistic intuition: lower layers focus on local phrases, while upper layers capture syntactic structure and integrate meaning.

These analyses suggest that explicitly incorporating a localness-aware channel in cross-attention benefits both alignment quality and global phrase consistency, thus improving the adequacy and fluency of NAT outputs (Ding et al., 2020).

7. Broader Implications and Future Directions

The context-aware cross-attention mechanism exemplified by CCAN provides a framework for balancing local and global sequence alignment in parallel decoding architectures. While developed in the NAT context, the principles of local-global interpolation and adaptive gating may inform future sequence-to-sequence models, including non-monotonic generation, document-level translation, and cross-modal retrieval. The empirical reduction in attention entropy and the BLEU improvements indicate that source-local context remains critical even under highly parallel regimes. A plausible implication is that further advances in NAT may require more nuanced modeling of alignment uncertainty and context sensitivity.
