
Distance Attention Network (DAN) Overview

Updated 2 March 2026
  • The paper introduces DAN, an attention-based neural architecture for NLI that leverages directional and distance masks to capture both local and global dependencies.
  • It uses a masked multi-head self-attention mechanism and a fusion gate to generate robust fixed-length sentence representations from premise and hypothesis pairs.
  • Empirical results show that the inclusion of a distance mask improves SNLI accuracy by up to 0.3 pp overall and by over 2 pp for longer sentences.

The Distance-based Self-Attention Network (DAN) is a fully attention-based neural architecture for sentence encoding, developed for Natural Language Inference (NLI). It augments Transformer-style attention with both directional and distance-based masking, so that it can model global and local dependencies within a sentence without giving up the parallelism of attention models. DAN was introduced to address the lack of a local-context bias in prior attention-based models, and it reports higher accuracy than those models, particularly on longer inputs (Im et al., 2017).

1. Architectural Overview

DAN implements an encoder-centric approach for sentence pair modelling in NLI. Given a premise and hypothesis, each sentence passes through an identical encoder, ultimately yielding fixed-length representations $\mathbf{u}$ and $\mathbf{v}$. These representations are combined using feature-wise concatenation, absolute difference, and element-wise product: $[\mathbf{u};\mathbf{v};|\mathbf{u}-\mathbf{v}|;\mathbf{u}\odot\mathbf{v}]$. The resulting vector is processed by a single-layer 300-dimensional ReLU network, followed by a softmax, to predict the NLI labels: entailment, neutral, or contradiction.
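
The feature combination described above can be sketched in a few lines of plain Python. The function name `combine_features` and the toy 3-dimensional vectors are illustrative assumptions, not part of the paper; real encoder outputs are far larger.

```python
def combine_features(u, v):
    """Build [u; v; |u - v|; u * v] from two equal-length sentence vectors."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]  # |u - v|, element-wise
    product = [a * b for a, b in zip(u, v)]        # u (.) v, element-wise
    # Concatenate the four parts; the result has length 4 * len(u).
    return list(u) + list(v) + abs_diff + product


# Toy example with hypothetical 3-dimensional encodings.
u = [1.0, -2.0, 0.5]
v = [0.5, 1.0, 0.5]
features = combine_features(u, v)
```

The combined vector is four times the size of a single sentence representation, and it is this vector that the ReLU layer and softmax classifier consume.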

Within the encoder, the data flow comprises:

  • Word Embedding: GloVe 300D, fixed.
  • Masked Multi-Head Self-Attention: $h=5$ heads, $d_e=300$.
  • Fusion Gate: Highway network merges embedding and attention outputs.
  • Feed-Forward Network: Position-wise FFN with inner dimension $d_{ff}=1200$ ($4d_e$).
  • Pooling: Concatenated multi-dimensional source-to-token attention and max-pooling.
  • Normalization/Residuals: Layer normalization and residual connections wrap attention, fusion, and FFN sublayers.
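
The overview does not spell out the fusion gate's parameterization, so the following is only a sketch of one common highway-style gate, under the assumption that a sigmoid gate interpolates element-wise between the embedding vector `s` and the attention output `h`. The names `fusion_gate`, `W_s`, `W_h`, and `b`, and the choice of which input the gate favors, are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fusion_gate(s, h, W_s, W_h, b):
    """Highway-style merge (assumed form): g * s + (1 - g) * h, element-wise.

    s: token embedding, h: attention output for the same token.
    The gate g = sigmoid(W_s s + W_h h + b) is computed per dimension.
    """
    gate = [
        sigmoid(a + c + bias)
        for a, c, bias in zip(matvec(W_s, s), matvec(W_h, h), b)
    ]
    return [g * si + (1.0 - g) * hi for g, si, hi in zip(gate, s, h)]


# With all-zero gate parameters the gate is 0.5 everywhere,
# so the output is the plain average of the two inputs.
zeros = [[0.0, 0.0], [0.0, 0.0]]
merged = fusion_gate([1.0, 2.0], [3.0, 4.0], zeros, zeros, [0.0, 0.0])
```

A gate of this kind lets each dimension choose how much of the raw embedding to keep versus how much contextual information to take from attention.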

2. Masked Multi-Head Self-Attention with Directional and Distance Masks

DAN’s attention core adopts the scaled dot-product paradigm. For a sequence $X\in\mathbb{R}^{n\times d_e}$, attention is computed with $Q=K=V=X$. Each attention head is

$$\text{head}_i = \mathrm{MaskedAttention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

and

$$\mathrm{MaskedMultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O.$$

The masked attention adds two $n\times n$ masks to the scaled dot-product logits before the softmax:

$$\mathrm{MaskedAttention}(Q,K,V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{dir}} + \alpha M_{\text{dis}}\right)V.$$

  • Directional Mask ($M_{\text{dir}}$): Restricts each position to attend only forward or backward, blocking irrelevant directions and self-attention (diagonal masked). Implements “forward” and “backward” passes as in DiSAN.
  • Distance Mask ($M_{\text{dis}}$): Imposes an additive bias based on word position distance: $M_{\text{dis}}[i,j] = -|i-j|$. The scalar hyperparameter $\alpha=1.5$ modulates its influence.

The distance mask acts as a soft penalty on attending to distant tokens rather than a hard cutoff: long-range connections remain possible but are downweighted.
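
A minimal single-head sketch of the two masks and the masked softmax follows, in plain Python without learned projections. Several details are assumptions made for illustration: "forward" is taken to mean that position i attends only to earlier positions (j < i), blocked entries are represented as negative infinity, and a row with every position blocked returns zeros. All function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def directional_mask(n, direction):
    """Additive n x n mask: 0.0 where attention is allowed, -inf where blocked.

    Assumed convention: "forward" lets position i attend only to earlier
    positions (j < i); "backward" only to later ones (j > i). The diagonal
    is blocked in both cases.
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            allowed = j < i if direction == "forward" else j > i
            row.append(0.0 if allowed else NEG_INF)
        mask.append(row)
    return mask


def distance_mask(n):
    """Additive n x n mask with entries -|i - j|."""
    return [[-float(abs(i - j)) for j in range(n)] for i in range(n)]


def softmax(logits):
    """Numerically stable softmax; a fully blocked row yields all zeros."""
    peak = max(logits)
    if peak == NEG_INF:
        return [0.0] * len(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K, m_dir, m_dis, alpha=1.5):
    """Row-wise softmax(Q K^T / sqrt(d_k) + M_dir + alpha * M_dis)."""
    n, d_k = len(Q), len(Q[0])
    weights = []
    for i in range(n):
        logits = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            + m_dir[i][j]
            + alpha * m_dis[i][j]
            for j in range(n)
        ]
        weights.append(softmax(logits))
    return weights


def masked_attention(Q, K, V, m_dir, m_dis, alpha=1.5):
    """Weighted sum of the rows of V under the masked attention weights."""
    weights = attention_weights(Q, K, m_dir, m_dis, alpha)
    d_v = len(V[0])
    return [
        [sum(w * V[j][c] for j, w in enumerate(row)) for c in range(d_v)]
        for row in weights
    ]


# Toy input: 4 tokens with hypothetical 2-dimensional vectors, Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
n = len(X)
forward_out = masked_attention(X, X, X, directional_mask(n, "forward"), distance_mask(n))
backward_out = masked_attention(X, X, X, directional_mask(n, "backward"), distance_mask(n))
```

Under this convention the first token of the forward pass and the last token of the backward pass have nothing to attend to, so the sketch returns a zero vector for them. That choice is a simplification made here, not a statement about the original implementation.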

3. Modelling Local and Global Dependencies

Adding the negative linear distance penalty $-|i-j|$ to the pre-softmax logits shifts attention probability mass toward the diagonal, that is, toward local context. Unlike convolutional architectures, which enforce hard locality through a fixed window, this approach keeps connectivity dense: even distant tokens can receive significant attention when the content scores warrant it. Empirical analyses reveal that attention matrices in DAN concentrate on local token neighborhoods yet still permit “long-range jumps” to salient words, supporting both fine-grained local context integration and global reasoning.
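
To see the size of this bias, consider the simplified case (an assumption made only for illustration) in which all content scores $QK^T/\sqrt{d_k}$ in row $i$ are equal and positions $j$ and $j'$ are both allowed by the directional mask. The attention weights $a_{ij}$ then depend on distance alone:

```latex
a_{ij} = \frac{\exp(-\alpha\,|i-j|)}{\sum_{k} \exp(-\alpha\,|i-k|)},
\qquad
\frac{a_{ij}}{a_{ij'}} = \exp\bigl(-\alpha\,(|i-j| - |i-j'|)\bigr).
```

With $\alpha = 1.5$, each extra position of distance divides the weight by $e^{1.5} \approx 4.48$, and two extra positions divide it by $e^{3} \approx 20.1$. A distant token can still win attention, but only if its content score exceeds a nearer token's by more than $\alpha$ per position of extra distance.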

4. Hyperparameters and Implementation Details

Key hyperparameters and implementation specifics include:

  • Embedding Size ($d_e$): 300 (GloVe, frozen)
  • Attention Heads ($h$): 5 (per-head dimensions $d_k=d_v=60$)
  • Distance Bias Multiplier ($\alpha$): 1.5
  • Dropout: 0.1 (post-attention and pre-gate activations)
  • Feedforward Inner Size ($d_{ff}$): 1200
  • Layer Normalization: Applied to all linear projections
  • Batch Size: 64
  • Optimizer: Adam, learning rate 0.001
  • No Absolute Position Embeddings: All positional information is controlled by the design of $M_{\text{dir}}$ and $M_{\text{dis}}$.
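
The listed values can be collected into a small configuration and checked for internal consistency. This is a sketch only: the dictionary keys and the helper names `head_dim` and `check_config` are hypothetical, while the numbers are the ones listed above.

```python
DAN_HYPERPARAMS = {
    "d_e": 300,              # embedding size (GloVe, frozen)
    "num_heads": 5,          # attention heads h
    "alpha": 1.5,            # distance bias multiplier
    "dropout": 0.1,
    "d_ff": 1200,            # feed-forward inner size
    "batch_size": 64,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "absolute_position_embeddings": False,
}


def head_dim(config):
    """Per-head dimension d_k = d_v = d_e / h; requires an even split."""
    d_e, h = config["d_e"], config["num_heads"]
    if d_e % h != 0:
        raise ValueError("d_e must be divisible by num_heads")
    return d_e // h


def check_config(config):
    """Verify the relations stated in the text: d_k = 60 and d_ff = 4 * d_e."""
    return head_dim(config) == 60 and config["d_ff"] == 4 * config["d_e"]


config_ok = check_config(DAN_HYPERPARAMS)
```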

5. Empirical Performance and Analysis

DAN reports state-of-the-art SNLI accuracy among sentence-encoding-only models:

  • SNLI Test Accuracy:

| Model                    | #Params | Time/epoch (s) | Accuracy (%) |
|--------------------------|---------|----------------|--------------|
| 600D DiSAN               | 2.4 M   | 587            | 85.6         |
| DAN w/o distance mask    | 4.7 M   | 687            | 86.0         |
| DAN (with distance mask) | 4.7 M   | 693            | 86.3         |

The overall gain on SNLI from the distance mask is small (+0.3 pp), but it is more pronounced for long sentence pairs: for average lengths above 25 tokens, accuracy stays at about 85% with the distance mask and is more than 2 pp lower without it.

On MultiNLI,

  • Matched: DAN = 74.1% (vs. DiSAN 71.0%)
  • Mismatched: DAN = 72.9% (vs. DiSAN 71.4%)

These results indicate an advantage for DAN on longer and varied-length inputs. The distance mask itself adds no parameters and about 1% to the time per epoch (687 s to 693 s). Relative to DiSAN, however, DAN has roughly twice the parameters (4.7 M vs. 2.4 M) and takes about 18% longer per epoch (693 s vs. 587 s) (Im et al., 2017).

6. Ablation Studies and Interpretations

Ablations show that removing the distance mask ($\alpha=0$) lowers overall SNLI accuracy by 0.3 pp and lowers accuracy on long sentences by more than 2 pp. Attention heatmaps for long examples show that, without the distance mask, distributions become diffuse and unselective; distant tokens are attended to indiscriminately, degrading the mechanism’s ability to prioritize local context.

This suggests that the additive $-|i-j|$ mask is essential for preserving locality bias, especially in longer sequences, enabling DAN to maintain precision in modeling local syntactic and semantic dependencies while remaining globally expressive.

7. Significance, Context, and Comparative Perspective

DAN’s hybrid approach, leveraging both local bias and global flexibility, addresses limitations of previous attention architectures such as DiSAN, which employed only directional masking and did not incorporate relative token distances. Because DAN uses no absolute position embeddings, all positional information is carried by $M_{\text{dir}}$ and $M_{\text{dis}}$, which underscores their effectiveness. The architectural design and empirical results position DAN as an effective, parallelizable alternative to both RNN-based and traditional Transformer encoders for NLI, with demonstrated robustness for long and complex sentences (Im et al., 2017).
