Distance Attention Network (DAN) Overview

Updated 2 March 2026

The paper introduces DAN, an attention-based neural architecture for NLI that leverages directional and distance masks to capture both local and global dependencies.
It uses a masked multi-head self-attention mechanism and a fusion gate to generate robust fixed-length sentence representations from premise and hypothesis pairs.
Empirical results show that the inclusion of a distance mask improves SNLI accuracy by up to 0.3 pp overall and by over 2 pp for longer sentences.

The Distance-based Self-Attention Network (DAN) is a fully attention-based neural architecture for sentence encoding, developed specifically for Natural Language Inference (NLI). It augments Transformer-style attention with both directional and distance-based masking, which lets it model global and local dependencies within a sentence while keeping the parallelism of attention models. DAN was introduced to address the lack of a local-context bias in prior attention-based models, and it reports higher accuracy than those models, particularly on longer sentences and documents (Im et al., 2017).

1. Architectural Overview

DAN implements an encoder-centric approach for sentence pair modelling in NLI. Given a premise and hypothesis, each sentence passes through an identical encoder, ultimately yielding fixed-length representations $\mathbf{u}$ and $\mathbf{v}$ . These representations are combined using feature-wise concatenation, absolute difference, and element-wise product: $[\mathbf{u};\mathbf{v};|\mathbf{u}-\mathbf{v}|;\mathbf{u}\odot\mathbf{v}]$ . The resulting vector is processed by a single-layer 300-dimensional ReLU network, followed by a softmax, to predict the NLI labels: entailment, neutral, or contradiction.
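The feature combination and classifier described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the random weights, and the sentence-vector size of 600 (two pooled 300-dimensional vectors concatenated) are assumptions made for the example.

```python
import numpy as np

def combine_features(u, v):
    """Build [u; v; |u - v|; u * v] from two sentence vectors."""
    return np.concatenate([u, v, np.abs(u - v), u * v], axis=-1)

def classify(u, v, W1, b1, W2, b2):
    """One ReLU hidden layer followed by a softmax over the three NLI labels."""
    x = combine_features(u, v)
    hidden = np.maximum(0.0, x @ W1 + b1)
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 600        # assumed sentence-vector size (not stated in the text)
hidden = 300   # hidden width given in the text
u, v = rng.normal(size=d), rng.normal(size=d)
W1 = rng.normal(scale=0.01, size=(4 * d, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(hidden, 3))
b2 = np.zeros(3)

# Probabilities for (entailment, neutral, contradiction); untrained, so near-uniform.
probs = classify(u, v, W1, b1, W2, b2)
```

The absolute difference and element-wise product give the classifier direct access to how the two sentence vectors disagree and agree per feature, which a plain concatenation would leave the hidden layer to learn.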

Within the encoder, the data flow comprises:

Word Embedding: GloVe 300D, fixed.
Masked Multi-Head Self-Attention: $h=5$ heads, $d_e=300$ .
Fusion Gate: Highway network merges embedding and attention outputs.
Feed-Forward Network: Position-wise FFN with inner dimension $d_{ff}=1200$ ( $4d_e$ ).
Pooling: Concatenated multi-dimensional source-to-token attention and max-pooling.
Normalization/Residuals: Layer normalization and residual connections wrap attention, fusion, and FFN sublayers.
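
The text does not spell out the fusion gate's equations, so the following is only a sketch of one common highway-style formulation: a sigmoid gate computed from both inputs that interpolates between the word embeddings and the attention output. The function name `fusion_gate`, the parameter names, and the exact gating form are assumptions, not details confirmed by the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_gate(x, a, W_x, W_a, b):
    """Gate between embeddings x and attention output a, both of shape (n, d_e).

    Assumed form: g = sigmoid(x W_x + a W_a + b), output = g * x + (1 - g) * a.
    """
    g = sigmoid(x @ W_x + a @ W_a + b)  # per-feature gate values in (0, 1)
    return g * x + (1.0 - g) * a

rng = np.random.default_rng(0)
n, d_e = 4, 300
x = rng.normal(size=(n, d_e))   # stand-in for word embeddings
a = rng.normal(size=(n, d_e))   # stand-in for attention output
W_x = rng.normal(scale=0.05, size=(d_e, d_e))
W_a = rng.normal(scale=0.05, size=(d_e, d_e))
b = np.zeros(d_e)

fused = fusion_gate(x, a, W_x, W_a, b)
```

Because the gate is a per-feature convex combination, every output value lies between the corresponding embedding and attention values.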

2. Masked Multi-Head Self-Attention with Directional and Distance Masks

DAN’s attention core adopts the scaled dot-product paradigm. For a sequence $X\in\mathbb{R}^{n\times d_e}$ , attention is computed with $Q=K=V=X$ . Each attention head is

$\text{head}_i = \mathrm{MaskedAttention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$

and

$\mathrm{MaskedMultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O.$

The masked attention introduces two additive $n\times n$ masks:

$\mathrm{MaskedAttention}(Q,K,V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{dir}} + \alpha M_{\text{dis}}\right)V.$

Directional Mask ( $M_{\text{dir}}$ ): Restricts each position to attend only forward or backward, blocking irrelevant directions and self-attention (diagonal masked). Implements “forward” and “backward” passes as in DiSAN.
Distance Mask ( $M_{\text{dis}}$ ): Imposes an additive bias based on word position distance: $M_{\text{dis}}[i,j] = -|i-j|$ . The scalar hyperparameter $\alpha=1.5$ modulates its influence.

This introduces a soft penalty for attending to distant tokens, with no hard zeroing—long-range connections remain feasible but are comparatively downweighted.
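
A single masked attention head with both masks might look like the NumPy sketch below. Several choices here are assumptions for illustration: the helper names, using a large negative constant in place of $-\infty$, and the convention that "forward" means position $i$ attends only to earlier positions $j < i$ (the text does not fix which side is which). The per-head projections $W_i^Q, W_i^K, W_i^V$ are omitted for brevity.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -inf so fully masked rows do not produce NaN

def directional_mask(n, direction="forward"):
    """Additive mask: 0 where attention is allowed, NEG_INF elsewhere.

    Assumed convention: 'forward' allows j < i, 'backward' allows j > i.
    The diagonal (self-attention) is blocked in both cases.
    """
    ones = np.ones((n, n), dtype=bool)
    if direction == "forward":
        allowed = np.tril(ones, k=-1)
    elif direction == "backward":
        allowed = np.triu(ones, k=1)
    else:
        raise ValueError(f"unknown direction: {direction}")
    return np.where(allowed, 0.0, NEG_INF)

def distance_mask(n):
    """M_dis[i, j] = -|i - j|."""
    idx = np.arange(n)
    return -np.abs(idx[:, None] - idx[None, :]).astype(float)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, m_dir, m_dis, alpha=1.5):
    """Softmax(QK^T / sqrt(d_k) + M_dir + alpha * M_dis) V."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + m_dir + alpha * m_dis
    weights = softmax(logits, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 4, 60
X = rng.normal(size=(n, d_k))
out, weights = masked_attention(
    X, X, X, directional_mask(n, "forward"), distance_mask(n)
)
```

One edge case of this sketch: with the diagonal blocked, the first row of a forward mask (and the last row of a backward mask) has no allowed position. With the finite `NEG_INF`, such a row falls back to a softmax over the remaining terms instead of producing NaN; a real implementation would need to decide how to handle it.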

3. Modelling Local and Global Dependencies

Adding the negative linear distance penalty $-|i-j|$ to the pre-softmax logits shifts attention probability mass toward the diagonal, that is, toward local context. Unlike convolutional architectures, which enforce hard spatial locality through windowing, this approach maintains dense connectivity: even distant tokens can receive significant attention if the content scores warrant it. Empirical analyses show that attention matrices in DAN become sharply focused on local token neighborhoods yet still permit "long-range jumps" to salient words, supporting both fine-grained local context integration and global reasoning.
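
As a worked illustration of the size of this bias, ignore the directional mask and assume, hypothetically, that all content scores $QK^T/\sqrt{d_k}$ in row $i$ are equal. The attention weight from position $i$ to position $j$ then depends only on distance:

$a_{ij} = \frac{e^{-\alpha|i-j|}}{\sum_{k} e^{-\alpha|i-k|}}.$

Each additional step of distance multiplies the weight by $e^{-\alpha}$. With $\alpha=1.5$ this factor is $e^{-1.5}\approx 0.22$, and a token two steps further away receives $e^{-3}\approx 0.05$ times the weight. Conversely, a token at distance 5 matches the weight of an adjacent token (distance 1) only if its content score is higher by $\alpha(5-1)=6$, which is how long-range connections stay possible but need strong content evidence.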

4. Hyperparameters and Implementation Details

Key hyperparameters and implementation specifics include:

Embedding Size ( $d_e$ ): 300 (GloVe, frozen)
Attention Heads ( $h$ ): 5 (per-head dimensions $d_k=d_v=60$ )
Distance Bias Multiplier ( $\alpha$ ): 1.5
Dropout: 0.1 (post-attention and pre-gate activations)
Feedforward Inner Size ( $d_{ff}$ ): 1200
Layer Normalization: Applied to all linear projections
Batch Size: 64
Optimizer: Adam, learning rate 0.001
No Absolute Position Embeddings: All positional information is controlled by the design of $M_{\text{dir}}$ and $M_{\text{dis}}$ .
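
For reference, the settings listed above can be gathered into one configuration object. The class and field names below are hypothetical; only the values come from the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DANConfig:
    """Hyperparameters as listed in the text; names are illustrative."""
    d_e: int = 300                   # embedding size (GloVe, frozen)
    num_heads: int = 5               # attention heads h
    alpha: float = 1.5               # distance bias multiplier
    dropout: float = 0.1
    d_ff: int = 1200                 # feed-forward inner size (4 * d_e)
    batch_size: int = 64
    learning_rate: float = 1e-3      # Adam
    freeze_embeddings: bool = True
    use_absolute_positions: bool = False

    def __post_init__(self):
        if self.d_e % self.num_heads != 0:
            raise ValueError("d_e must be divisible by num_heads")

    @property
    def d_head(self) -> int:
        """Per-head dimension d_k = d_v."""
        return self.d_e // self.num_heads

config = DANConfig()
```

Deriving `d_head` from `d_e` and `num_heads` keeps the per-head dimension of 60 consistent with the other two values instead of storing it separately.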

5. Empirical Performance and Analysis

DAN reports state-of-the-art SNLI accuracy among sentence-encoding models:

SNLI Test Accuracy:

| Model | #Params | Time/ep (s) | Accuracy (%) |
|---|---|---|---|
| 600D DiSAN | 2.4 M | 587 | 85.6 |
| DAN w/o distance mask | 4.7 M | 687 | 86.0 |
| DAN (with distance mask) | 4.7 M | 693 | 86.3 |

The overall gain from the distance mask on SNLI is small (+0.3 pp), but the effect is more pronounced for long sentence pairs: for average lengths $>25$ tokens, accuracy remains ≈85% with the distance mask, while it drops by over 2 pp without it.

On MultiNLI,

Matched: DAN = 74.1% (vs. DiSAN 71.0%)
Mismatched: DAN = 72.9% (vs. DiSAN 71.4%)

These results indicate an advantage for DAN over DiSAN on longer and varied-length inputs. Per the SNLI table, the distance mask itself adds no parameters and about 1% to per-epoch training time (693 s vs. 687 s), although DAN has roughly twice as many parameters as 600D DiSAN (4.7 M vs. 2.4 M) (Im et al., 2017).

6. Ablation Studies and Interpretations

Ablations show that removing the distance mask ( $\alpha=0$ ) lowers overall SNLI accuracy by 0.3 pp (86.3% to 86.0%) and lowers accuracy on long sentences by more than 2 pp. Attention heatmaps for long examples show that, without the distance mask, distributions become diffuse and unselective; distant tokens are attended to indiscriminately, degrading the mechanism's ability to prioritize local context.

This suggests that the additive $-|i-j|$ mask is essential for preserving locality bias, especially in longer sequences, enabling DAN to maintain precision in modeling local syntactic and semantic dependencies while remaining globally expressive.

7. Significance, Context, and Comparative Perspective

DAN’s hybrid approach, leveraging both local bias and global flexibility, addresses limitations of previous attention architectures such as DiSAN—which employed only directional masking and failed to incorporate relative token distances. The absence of absolute position embeddings places the entire burden of positional information representation on $M_{\text{dir}}$ and $M_{\text{dis}}$ , highlighting their efficacy. The architectural design and empirical results position DAN as an effective, parallelizable alternative to both RNN-based and traditional Transformer encoders for NLI, with demonstrated robustness for long and complex sentences (Im et al., 2017).

References

Im et al. (2017). Distance-based Self-Attention Network for Natural Language Inference.
