Self-Attention for Non-Local Dependency
- Self-Attention for Non-Local Dependency is a mechanism that uses self-attention to capture and propagate long-distance dependencies across tokens, patches, or graph nodes.
- Architectural variants such as contextualized projections, structural masks, and graph-augmented modules enhance the model’s efficiency and ability to incorporate global contextual signals.
- Empirical findings demonstrate improved performance in tasks like machine translation and visual recognition, while innovations address the challenges of quadratic computational complexity.
Self-attention for non-local dependency refers to the use of self-attention mechanisms that explicitly or implicitly enable a model to represent and propagate context over arbitrarily long spatial, sequential, or graph distances. While the canonical self-attention of the Transformer offers the flexibility to relate each input token or patch to any other, a large body of work explores architectural, algorithmic, and inductive modifications to more effectively or efficiently model non-local dependencies across linguistic, visual, or other data. This article reviews the mathematical foundations, key mechanisms, design variants, modeling strategies, and empirical findings on self-attention for non-local dependency.
1. Mathematical and Conceptual Foundation
In the standard Transformer self-attention, every token representation $x_i$ is projected to a query $q_i$ and attends over projected keys $k_j$ for all $j$, generating context vectors as weighted sums of values $v_j$. The raw attention weight between tokens $i$ and $j$ is the scaled dot product
$$e_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d_k}}, \qquad \alpha_{ij} = \mathrm{softmax}_j(e_{ij}), \qquad o_i = \sum_j \alpha_{ij}\, v_j.$$
This leads to a fully connected attention graph, where each output is a non-local (potentially global) aggregation of inputs. However, each pairwise affinity depends only on the $q_i$ and $k_j$ representations, which may lack sufficient signal from global or hierarchical structure (e.g., sentence meaning or syntactic depth) (Yang et al., 2019). Additionally, the quadratic computational complexity ($O(n^2)$ in sequence/image length $n$) motivates more tractable approximations for scalability (Qin et al., 2023).
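To make the formulation concrete, a minimal NumPy sketch of the fully connected attention graph is given below; the token count, dimensions, and random projections are purely illustrative.

```python
# Minimal NumPy sketch of standard scaled dot-product self-attention.
# Shapes and names (n tokens, model dim d) are illustrative only.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) token representations; W*: (d, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise affinities e_ij
    A = softmax(logits, axis=-1)              # each row sums to 1
    return A @ V                              # every output mixes all n inputs

n, d = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)           # (n, d); cost is O(n^2 * d)
```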
Non-local dependency modeling in self-attention thus encompasses:
- Altering the affinity or masking structure to steer or constrain dependencies
- Injecting contextual signals (global, hierarchical, syntactic, or visual) into the attention computation
- Modifying normalization or aggregation to prevent collapse of attention to a sparse subset
- Hybridizing with other relational operators or inductive biases to balance locality and globality
2. Mechanisms for Enriching Non-local Dependency Modeling
2.1 Contextualized Query and Key Projections
Context-aware self-attention explicitly parameterizes the query and key projections as context functions. In “Context-Aware Self-Attention Networks,” global context (layer average), deep context (lower-layer concatenation), and deep-global context (concatenated averages) are constructed as context vectors, stacked into a matrix $C$. Each query/key projection is then contextually gated, e.g. $\hat{Q} = (1-\lambda_Q)\,Q + \lambda_Q\,(C U_Q)$, where $\lambda_Q$ is a learned sigmoid gate conditioned on $Q$ and $C$; the same applies to the keys $K$, and the attention weight becomes $\mathrm{softmax}\big(\hat{Q}\hat{K}^{\top}/\sqrt{d_k}\big)$. This direct injection of multi-layer or global signals remedies the isolation of vanilla self-attention pairings, producing BLEU gains of up to +1.0 on WMT translation (Yang et al., 2019).
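A hedged sketch of the gating idea follows; the gate parameterization (the names `U`, `V_x`, `V_c` and the sigmoid form) is an assumption for illustration rather than the precise formulation of Yang et al. (2019).

```python
# Hedged sketch of context-gated query/key projections in the spirit of
# context-aware self-attention. The specific gate parameterization below is an
# illustrative assumption, not the exact formulation from the cited paper.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_gated_projection(Q, c, U, V_x, V_c):
    """Q: (n, d) plain projections; c: (d,) global context (e.g., layer mean).
    Returns a gated mix of the plain and context-conditioned projections."""
    C = np.broadcast_to(c, Q.shape)          # (n, d) repeated context
    lam = sigmoid(Q @ V_x + C @ V_c)         # (n, 1) per-token gate
    return (1.0 - lam) * Q + lam * (C @ U)   # interpolate toward the context

n, d = 5, 16
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d)) * d**-0.5
Q = X @ Wq
c = X.mean(axis=0)                           # "global context" = layer average
U = rng.normal(size=(d, d)) * d**-0.5
V_x, V_c = rng.normal(size=(d, 1)), rng.normal(size=(d, 1))
Q_hat = context_gated_projection(Q, c, U, V_x, V_c)   # same recipe for K
```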
2.2 Structural Masks and Dependency Constraints
Self-attention can be structurally modulated at the logits level via masking or element-wise scaling:
- Masked/Directional Attention: Heads with forward and backward directional masks (each restricting attention to one temporal direction) or local-window masks (e.g., $|i-j| \le w$) impose causal or localized order, as in HySAN (Song et al., 2018), Bi-BloSAN (Shen et al., 2018), and DB-SAN (Im et al., 2017).
- Dependency-Scaled Attention: External linguistic structures, such as syntactic parse trees, are quantified into path-distance matrices (e.g., $D_{ij}$ = length of the path between tokens $i$ and $j$ in the parse tree) and Gaussian-scaled as $G_{ij} = \exp\!\big(-D_{ij}^2 / 2\sigma^2\big)$. Element-wise multiplication with the attention logits ($E \odot G$) reweights by syntactic closeness before the softmax, focusing attention appropriately (Peng et al., 2021, Bugliarello et al., 2019); see the sketch after this list.
- Supervised/Parameter-free Dependency Heads: Attention heads are directly supervised to imitate dependency adjacency matrices (child and parent distributions), driving designated heads to encode long-range syntactic arcs (Wang et al., 2019).
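The following sketch illustrates the dependency-scaled variant from the list above: a toy path-distance matrix `D` and bandwidth `sigma` (both assumptions) are mapped through a Gaussian kernel and multiplied into the logits before the softmax.

```python
# Illustrative sketch of dependency-scaled attention: a syntactic path-distance
# matrix D is passed through a Gaussian kernel and multiplied into the logits.
# The toy parse distances and sigma are assumptions for illustration.
import numpy as np

def dependency_scaled_attention(logits, D, sigma=1.0):
    """logits: (n, n) raw attention scores; D: (n, n) parse-tree path distances."""
    G = np.exp(-(D ** 2) / (2.0 * sigma ** 2))   # closeness in the parse tree
    scaled = logits * G                          # element-wise reweighting
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    A = np.exp(scaled)
    return A / A.sum(axis=-1, keepdims=True)     # softmax over reweighted logits

n = 4
rng = np.random.default_rng(2)
logits = rng.normal(size=(n, n))
# Toy symmetric path-distance matrix for a 4-token sentence.
D = np.array([[0, 1, 2, 3],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0]], dtype=float)
A = dependency_scaled_attention(logits, D)       # rows sum to 1
```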
2.3 Relation-aware and Graph-blended Attention
- Relation-attention: In the Dependency-Transformer (Ma et al., 2022), each head computes both standard semantic self-attention and relation-attention using learned edge-type embeddings from the dependency tree. A gating network convexly combines these streams at the logit level, preserving parallelizability while encoding non-local structure; see the sketch after this list.
- Graph-augmented Self-Attention: CN³ (Liu et al., 2018) interleaves dense non-local attention with GNN-style message passing, dynamically constructing task-specific weighted graphs using both learned representations and injected edge attributes (e.g., POS, syntactic distance), allowing flexible non-local propagation and robust local smoothing.
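A minimal sketch of the logit-level blending described for the Dependency-Transformer appears below; the edge-type embedding table, scoring vector, and fixed gate value are illustrative assumptions rather than the paper's exact architecture.

```python
# Hedged sketch of blending semantic and relation-aware attention at the logit
# level. The edge-type embedding table, scoring vector, and constant gate are
# illustrative assumptions, not the Dependency-Transformer's exact modules.
import numpy as np

def blended_logits(sem_logits, edge_types, rel_emb, rel_score, gate):
    """sem_logits: (n, n) semantic affinities; edge_types: (n, n) int ids into
    rel_emb: (num_types, d_r); rel_score: (d_r,); gate in [0, 1]."""
    rel_logits = rel_emb[edge_types] @ rel_score      # (n, n) relation affinities
    return gate * sem_logits + (1.0 - gate) * rel_logits  # convex combination

n, num_types, d_r = 4, 3, 8
rng = np.random.default_rng(3)
sem_logits = rng.normal(size=(n, n))
edge_types = rng.integers(0, num_types, size=(n, n))  # e.g., nsubj/obj/none ids
rel_emb = rng.normal(size=(num_types, d_r))
rel_score = rng.normal(size=(d_r,))
logits = blended_logits(sem_logits, edge_types, rel_emb, rel_score, gate=0.7)
```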
2.4 Global Normalization Schemes
Standard self-attention can underuse some tokens ("explaining away"). Doubly-normalized attention (DNAS) applies one column-wise and one row-wise normalization to the unnormalized affinity scores, guaranteeing that every input token contributes at least $1/S$ (where $S$ is the sequence length) to the total attention. This preserves non-local signal, especially for weak but globally relevant features (Ding et al., 2020).
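A minimal sketch of one possible column-then-row double normalization is shown below; the exact scheme and its $1/S$ guarantee follow DNAS (Ding et al., 2020), while this toy version only illustrates the mechanics.

```python
# Minimal sketch of a doubly-normalized attention variant: exponentiated
# affinities are first normalized over queries (columns), then over keys (rows),
# so no input token's contribution can be fully explained away.
import numpy as np

def doubly_normalized_attention(logits):
    """logits: (n, n) unnormalized affinities; rows = queries, cols = keys."""
    E = np.exp(logits - logits.max())
    E = E / E.sum(axis=0, keepdims=True)   # column-wise: each key's mass over queries sums to 1
    A = E / E.sum(axis=1, keepdims=True)   # row-wise: each query's weights sum to 1
    return A

n = 5
rng = np.random.default_rng(4)
logits = rng.normal(size=(n, n))
A = doubly_normalized_attention(logits)
col_mass = A.sum(axis=0)   # attention mass received per input token stays spread out
```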
2.5 Multi-hop/Belief Propagation Refinements
Transformers may display "attention localization," where each head focuses on a sparse subset and fails to propagate multi-hop dependencies, especially in compact models. Self-Attention One-step Belief Propagation (SAOBP) injects multi-hop signals via a one-step belief propagation update with repulsive Potts priors, increasing the indirect (multi-hop) propagation and entropy across layers (Lee et al., 2025).
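The sketch below does not reproduce SAOBP's Potts-prior belief propagation; it only illustrates, under simplifying assumptions, how a one-step refinement that mixes in the two-hop matrix $AA$ spreads attention mass beyond each head's sparse focus.

```python
# Illustrative sketch only: SAOBP's belief-propagation update with repulsive
# Potts priors is NOT reproduced here. This shows the underlying idea that a
# one-step refinement can inject two-hop (indirect) mass into an attention map.
import numpy as np

def one_step_multihop_refine(A, alpha=0.3):
    """A: (n, n) row-stochastic attention. Mixes in the two-hop matrix A @ A."""
    A2 = A @ A                                  # mass reachable via one intermediate token
    R = (1.0 - alpha) * A + alpha * A2
    return R / R.sum(axis=-1, keepdims=True)    # keep rows normalized

n = 5
rng = np.random.default_rng(5)
A = rng.random((n, n))
A /= A.sum(axis=-1, keepdims=True)
R = one_step_multihop_refine(A)                 # spreads mass beyond each row's sparse focus
```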
3. Architectural Variants for Efficiency and Scalability
3.1 Block/Hierarchical and Factorized Schemes
- Block-based schemes: Bi-BloSAN partitions sequences into blocks of length $r$, applies self-attention locally within each block (intra-block, $O(nr)$ memory), then aggregates globally via inter-block self-attention over the block summaries ($O((n/r)^2)$). This structure achieves subquadratic memory scaling while maintaining bidirectional non-locality (Shen et al., 2018); a minimal sketch follows this list.
- Factorized Attention for Vision: Factorization Self-Attention (FaSA) decomposes the full attention matrix into sub-attention matrices over groups of channels, each using dilated and cross-window key sampling. This recovers global context while paying only the cost of local-window self-attention (linear in the number of tokens), enabling long-range dependencies in high-resolution images (Qin et al., 2023).
- Hybrid/Hierarchical Networks: UniFormer combines local convolutional attention in shallow layers with global multi-head self-attention at deeper stages, effectively covering both short-range and non-local dependencies in images and videos (Li et al., 2022).
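As referenced in the block-based item above, a minimal sketch of intra-block plus inter-block attention follows; the mean-pooled block summaries and additive fusion are simplifying assumptions, not Bi-BloSAN's exact summarization and fusion-gate design.

```python
# Hedged sketch of block-based attention in the spirit of Bi-BloSAN: attend
# within each block, summarize blocks (here by mean pooling, an assumption),
# then attend across the block summaries for global context.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def block_attention(X, r):
    """X: (n, d) with n divisible by the block length r."""
    n, d = X.shape
    blocks = X.reshape(n // r, r, d)
    local = np.stack([attend(b, b, b) for b in blocks])     # intra-block, O(n*r)
    summaries = local.mean(axis=1)                          # (n//r, d) block summaries
    global_ctx = attend(summaries, summaries, summaries)    # inter-block, O((n/r)^2)
    fused = local + global_ctx[:, None, :]                  # broadcast global signal back
    return fused.reshape(n, d)

n, d, r = 12, 8, 4
X = np.random.default_rng(6).normal(size=(n, d))
Y = block_attention(X, r)
```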
3.2 Reversed Attention and Hierarchical Dependency Trees
Reversed attention, as in DependencyViT, transposes the standard attention adjacency to implement "send" (child-to-parent) rather than "gather" (parent-to-child) semantics. Stacking these layers induces explicit dependency trees from leaves to root in a fully unsupervised way, supporting non-local compositionality for part-object hierarchies and scene parsing. Layers of reversed attention are computationally comparable to standard self-attention but can be pruned in deeper layers for efficiency (Ding et al., 2023).
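A hedged sketch of the "send" semantics is given below: the softmax is taken over the query axis so each child distributes its message across candidate parents, from which a parent assignment can be read off. This illustrates the reversed-adjacency idea, not DependencyViT's full module.

```python
# Hedged sketch of "send" (reversed) attention: the softmax runs over the query
# axis, so each child token distributes its message across candidate parents
# rather than each parent gathering freely. Shapes and projections are toy.
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reversed_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[-1])      # rows = parents, cols = children
    A_send = softmax(logits, axis=0)             # each child's column sums to 1
    # The argmax over each column picks that child's (soft) parent token, from
    # which a dependency tree can be read off layer by layer.
    return A_send @ V, A_send.argmax(axis=0)

n, d = 6, 8
rng = np.random.default_rng(7)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
out, parent_of = reversed_attention(X, Wq, Wk, Wv)
```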
3.3 Patchwise and Visual Extensions
Self-attention mechanisms generalized to vision scale non-local dependency modeling in the spatial domain. Patch-wise or image-grid self-attention, with or without explicit positional encoding, enables features in distant spatial zones (e.g., intersection corners) to interact directly, supporting robust visual scene topology extraction (Nakata et al., 2022).
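A minimal sketch of patch-wise self-attention over an image grid follows; the patch size, channel counts, and random projections are illustrative.

```python
# Minimal sketch of patch-wise self-attention: the image is split into
# non-overlapping patches that become tokens, so distant spatial positions can
# interact in a single attention step. Sizes are illustrative.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

h, w, d, p = 8, 8, 16, 4                       # grid size, model dim, patch size
rng = np.random.default_rng(8)
img = rng.normal(size=(h, w, 3))
# Flatten non-overlapping p x p patches into tokens of length p*p*3.
tokens = (img.reshape(h // p, p, w // p, p, 3)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, p * p * 3))
W_embed = rng.normal(size=(p * p * 3, d)) * (p * p * 3) ** -0.5
X = tokens @ W_embed                           # (num_patches, d)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))              # every patch attends to every patch
ctx = A @ V                                    # distant spatial zones interact directly
```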
4. Empirical Evidence and Evaluation of Non-local Dependency Modeling
Robust non-local modeling is consistently validated by analysis on:
- Translation quality for long sentences: Context-aware, dependency-scaled, or supervised-attention models yield +0.6–1.0 BLEU over baselines, with much larger gains (+2–3 BLEU) for sentences >50 tokens (Yang et al., 2019, Bugliarello et al., 2019, Peng et al., 2021).
- Sentence representation and NLI: On SICK-R, SICK-E, SNLI, and MultiNLI, dependency-encoded and distance-masked self-attention models outperform tree-based and vanilla sentence encoders, with improved stability for long or complex sentences (Im et al., 2017, Ma et al., 2022).
- Visual recognition and robustness: Self-attention–augmented or factorized models (e.g., FaViT) surpass CNN or local-window ViT baselines on both vanilla classification tasks (ImageNet) and OOD/corrupted domains, highlighting the value of long-range aggregation (Qin et al., 2023, Li et al., 2022, Nakata et al., 2022).
- Small model regime: SAOBP specifically improves accuracy and perplexity in compact models (BERT-Mini, BERT-Small), which otherwise exhibit entropy collapse and poor global context maintenance (Lee et al., 2025).
Specialized analyses:
- Gate and mask analysis: Function (non-content) words show higher context gating values, confirming that syntactic connectors require more non-local information (Yang et al., 2019).
- Length bin ablations: Gains concentrate in long or complex structural settings, such as cross-sentence dependencies or long-range vision tasks (Qin et al., 2023).
5. Limitations, Trade-offs, and Open Challenges
- Computational Complexity: Quadratic cost in length or area remains prohibitive for full-attention models; block/factorized/patchwise and pruning approaches are essential for scalability (Shen et al., 2018, Qin et al., 2023, Ding et al., 2023).
- Parser/rule noise: Approaches that rely on external syntactic parses (dependency-scaling, explicit adjacency) are vulnerable to parser errors and overfitting; knowledge “sparsing” and regularization mitigate, but do not eliminate, this risk (Peng et al., 2021, Bugliarello et al., 2019).
- Localization and entropy collapse: Attention heads can become narrowly focused in deep or small models, requiring normalization, structured propagation, or explicit penalty schemes to preserve non-local propagation (Ding et al., 2020, Lee et al., 2025).
- Hybrid vs. pure self-attention: Blending self-attention with learned or fixed graphs (GNN-hybrids) yields best performance empirically, but increases architectural complexity and parameter count; dynamic graph construction remains an active area (Liu et al., 2018).
6. Connections and Comparative Perspectives
Self-attention’s non-local capability distinguishes it fundamentally from CNN and RNN models, and extensive hybridization with structural, syntactic, and spatial priors has led to broadly superior performance on tasks requiring global context. Nevertheless, the empirical dominance of these mechanisms is realized only when architectural, regularization, and sparsity strategies are carefully aligned to the modeling domain, inductive structure, and scale constraints.
A summary comparison of representative strategies appears below:
| Approach | Mechanism | Key Benefit |
|---|---|---|
| Context-Aware Self-Attention (Yang et al., 2019) | Contextualized Q/K projections, gating | Layer-/sentence-global information in affinities |
| Dependency/Relation-Scaled (Peng et al., 2021; Bugliarello et al., 2019; Ma et al., 2022) | Tree/path-based scaling or relation-aware embedding | Encodes non-local linguistic topology |
| Factored/Hierarchical (Qin et al., 2023; Shen et al., 2018; Li et al., 2022) | Block, factorization, hybrid convolution/attention | Linear scaling, global context at tractable cost |
| Supervised Attention (Wang et al., 2019) | Head-level syntactic supervision | Accurate long-range dependency encoding |
| Multi-hop/Refinements (Lee et al., 2025; Ding et al., 2020) | Multi-hop BP, doubly-normalized attention | Prevents attention collapse, ensures distributed context |
Collectively, these developments reinforce the view that self-attention, augmented with explicit non-local dependency modeling and appropriate inductive bias, offers an expressive and adaptable paradigm for sequence and grid-structured data across modalities.