
Self-Attention for Non-Local Dependency

Updated 13 December 2025
  • Self-Attention for Non-Local Dependency is a mechanism that uses self-attention to capture and propagate long-distance dependencies across tokens, patches, or graph nodes.
  • Architectural variants such as contextualized projections, structural masks, and graph-augmented modules enhance the model’s efficiency and ability to incorporate global contextual signals.
  • Empirical findings demonstrate improved performance in tasks like machine translation and visual recognition, while innovations address the challenges of quadratic computational complexity.

Self-attention for non-local dependency refers to the use of self-attention mechanisms that explicitly or implicitly enable a model to represent and propagate context over arbitrarily long spatial, sequential, or graph distances. While the canonical self-attention of the Transformer offers the flexibility to relate each input token or patch to any other, a large body of work explores architectural, algorithmic, and inductive modifications to more effectively or efficiently model non-local dependencies across linguistic, visual, or other data. This article reviews the mathematical foundations, key mechanisms, design variants, modeling strategies, and empirical findings on self-attention for non-local dependency.

1. Mathematical and Conceptual Foundation

In the standard Transformer self-attention, every token representation $h_i$ is projected to a query $q_i$ and attends over projected keys $k_j$ for all $j$, generating context vectors as weighted sums of values $v_j$. The raw attention weight between tokens $i$ and $j$ is the scaled dot product:

$$\alpha_{ij} = \frac{\exp(q_i^\top k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i^\top k_{j'} / \sqrt{d_k})}$$

This leads to a fully connected attention graph, where each output is a non-local (potentially global) aggregation of inputs. However, each pairwise affinity depends only on the $q_i$ and $k_j$ representations, which may lack sufficient signal from global or hierarchical structure (e.g., sentence meaning or syntactic depth) (Yang et al., 2019). Additionally, the quadratic computational complexity ($O(n^2)$) in sequence/image length motivates more tractable approximations for scalability (Qin et al., 2023).
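
As a point of reference, the following minimal NumPy sketch implements this fully connected (non-local) attention; the parameter names `W_q`, `W_k`, `W_v` and the square projection shapes are illustrative choices, not tied to any specific paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_q, W_k, W_v):
    """Vanilla scaled dot-product self-attention over n token representations.

    H: (n, d) inputs; W_q, W_k, W_v: (d, d) projection matrices.
    Every output row aggregates over all n values, i.e., a fully connected
    (potentially global) attention graph with O(n^2) pairwise affinities.
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)            # (n, n) scaled dot products
    alpha = softmax(logits, axis=-1)           # row-wise normalization over keys
    return alpha @ V                           # context vectors, (n, d)

# Toy usage: 6 tokens of width 8
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(H, W_q, W_k, W_v)         # (6, 8)
```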

Non-local dependency modeling in self-attention thus encompasses:

  • Altering the affinity or masking structure to steer or constrain dependencies
  • Injecting contextual signals (global, hierarchical, syntactic, or visual) into the attention computation
  • Modifying normalization or aggregation to prevent collapse of attention to a sparse subset
  • Hybridizing with other relational operators or inductive biases to balance locality and globality

2. Mechanisms for Enriching Non-local Dependency Modeling

2.1 Contextualized Query and Key Projections

Context-aware self-attention explicitly parameterizes the query and key projections as context functions. In “Context-Aware Self-Attention Networks,” global context $c^\ell$ (layer average), deep context (lower-layer concatenation), and deep-global context (concatenated averages) are constructed as vectors $C_i$. Each $q_i$/$k_j$ projection is then contextually gated:

$$\begin{aligned} \lambda_{Q_i} &= \sigma(q_i^\top v_Q^H + C_i^\top U_Q V_Q^C) \\ q'_i &= (1 - \lambda_{Q_i})\, q_i + \lambda_{Q_i}\, U_Q C_i \end{aligned}$$

The same applies to keys $k_j$, and the attention weight becomes $\alpha'_{ij} = \operatorname{softmax}_j(q_i'^\top k_j' / \sqrt{d_k})$. This direct injection of multi-layer or global signals remedies the isolation of vanilla self-attention pairings, producing BLEU gains of up to +1.0 on WMT translation (Yang et al., 2019).
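
The gated projection for queries can be sketched as below under simplifying assumptions: a single global-context vector taken as the layer mean, and illustrative parameter names and shapes rather than the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gated_queries(Q, H, v_Q, U_Q, V_Q):
    """Blend each query with a projected global context via a learned gate.

    Q:   (n, d) per-token queries
    H:   (n, d) the layer's hidden states; their mean serves as global context C
    v_Q: (d,), U_Q: (d, d), V_Q: (d,)  -- illustrative gate/projection parameters
    """
    C = H.mean(axis=0)                        # global context (layer average)
    ctx = C @ U_Q                             # context mapped into query space, (d,)
    lam = sigmoid(Q @ v_Q + ctx @ V_Q)        # per-token gate lambda_{Q_i}, shape (n,)
    lam = lam[:, None]
    return (1.0 - lam) * Q + lam * ctx        # gated queries q'_i, (n, d)

# Keys are gated analogously with their own parameters before the usual
# softmax(q'_i k'_j / sqrt(d_k)) attention.
```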

2.2 Structural Masks and Dependency Constraints

Self-attention can be structurally modulated at the logits level via masking or element-wise scaling:

  • Masked/Directional Attention: Heads with forward ($j \leq i$), backward ($i \leq j$), or local-window (e.g., $|i-j| \leq k$) masks impose causal or localized order, as in HySAN (Song et al., 2018), Bi-BloSAN (Shen et al., 2018), and DB-SAN (Im et al., 2017).
  • Dependency-Scaled Attention: External linguistic structures, such as syntactic parse trees, are quantified into path-distance matrices $d_{ij}$ and Gaussian-scaled as $M_{ij} = \mathrm{GaussDist}(d_{ij})$. Element-wise multiplication with the attention logits $S_{ij}$ reweights by syntactic closeness before the softmax, focusing attention appropriately; a sketch follows this list (Peng et al., 2021, Bugliarello et al., 2019).
  • Supervised/Parameter-free Dependency Heads: Attention heads are directly supervised to imitate dependency adjacency matrices (child and parent distributions), driving designated heads to encode long-range syntactic arcs (Wang et al., 2019).
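
The masking and dependency-scaling ideas above can be sketched schematically as follows, reusing the `softmax` helper from the first code block; the Gaussian bandwidth `sigma` and the mask construction are illustrative choices rather than any single paper's exact formulation.

```python
import numpy as np

def directional_mask(n, mode="forward", window=None):
    """Boolean mask of allowed (query i, key j) pairs for masked attention heads."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    if mode == "forward":
        return j <= i                      # causal / forward heads
    if mode == "backward":
        return i <= j                      # backward heads
    return np.abs(i - j) <= window         # local-window heads, |i - j| <= k

def dependency_scaled_attention(logits, dep_dist, sigma=1.0, mask=None):
    """Rescale attention logits S_ij by syntactic closeness, then optionally mask.

    logits:   (n, n) raw attention scores
    dep_dist: (n, n) parse-tree path distances d_ij
    """
    M = np.exp(-(dep_dist ** 2) / (2.0 * sigma ** 2))   # GaussDist(d_ij)
    scaled = logits * M                                  # element-wise reweighting
    if mask is not None:
        scaled = np.where(mask, scaled, -1e9)            # drop disallowed pairs
    return softmax(scaled, axis=-1)
```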

2.3 Relation-aware and Graph-blended Attention

  • Relation-attention: In the Dependency-Transformer (Ma et al., 2022), each head computes both standard semantic self-attention and relation-attention using learned edge-type embeddings $r_{ij}$ from the dependency tree. A gating network $g_{ij}^{(h)}$ convexly combines these streams at the logit level, preserving parallelizability while encoding non-local structure (a sketch follows this list).
  • Graph-augmented Self-Attention: CN³ (Liu et al., 2018) interleaves dense non-local attention with GNN-style message passing, dynamically constructing task-specific weighted graphs using both learned representations and injected edge attributes (e.g., POS, syntactic distance), allowing flexible non-local propagation and robust local smoothing.
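
The gating between semantic and relation streams can be sketched as below; the per-pair gate, tensor shapes, and the way relation embeddings enter the logits are assumptions for illustration, not the Dependency-Transformer's exact parameterization (the `sigmoid` helper from the earlier sketch is reused).

```python
import numpy as np

def relation_blended_logits(Q, K, R, gate_w):
    """Convexly combine semantic and relation-aware logits at each (i, j) pair.

    Q, K:   (n, d) queries and keys
    R:      (n, n, d) edge-type embeddings r_ij derived from the dependency tree
    gate_w: (2 * d,) parameters of an illustrative per-pair gate
    """
    d = Q.shape[-1]
    sem = Q @ K.T / np.sqrt(d)                              # standard semantic logits
    rel = np.einsum("id,ijd->ij", Q, R) / np.sqrt(d)        # relation-attention logits
    pair = np.concatenate(
        [np.broadcast_to(Q[:, None, :], R.shape), R], axis=-1)  # (n, n, 2d)
    g = sigmoid(pair @ gate_w)                              # gate g_ij in (0, 1)
    return g * sem + (1.0 - g) * rel                        # blended logits before softmax
```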

2.4 Global Normalization Schemes

Standard self-attention can underuse some tokens ("explaining away"). Doubly-normalized attention (DNAS) applies one column-wise and one row-wise normalization to the unnormalized affinity scores, guaranteeing every input token contributes at least $1/S$ to the total attention. This preserves non-local signal, especially for weak but globally relevant features (Ding et al., 2020).
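
A compact sketch of the double normalization, reusing `softmax` from the first code block; this is a schematic rendering of the idea (normalize over queries first, then over keys) rather than the exact DNAS formulation.

```python
import numpy as np

def doubly_normalized_attention(logits):
    """Column-wise then row-wise normalization of the affinity scores.

    The column-wise softmax forces every key/input token to distribute a full
    unit of attention mass across queries, so no token can be entirely
    "explained away"; the row-wise step restores rows that sum to one.
    """
    col = softmax(logits, axis=0)                    # each column (key) sums to 1
    row = col / col.sum(axis=1, keepdims=True)       # renormalize each query row
    return row
```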

2.5 Multi-hop/Belief Propagation Refinements

Transformers may display "attention localization," where each head focuses on a sparse subset and fails to propagate multi-hop dependencies, especially in compact models. Self-Attention One-step Belief Propagation (SAOBP) injects multi-hop signals via a one-step belief propagation update with repulsive Potts priors, increasing the indirect (multi-hop) propagation and entropy across layers (Lee et al., 9 Sep 2025).
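
SAOBP's exact belief-propagation update with Potts priors is not reproduced here; the sketch below only conveys the general flavor of injecting multi-hop signal by mixing the one-step attention matrix with its own two-step composition, a deliberate simplification rather than the authors' algorithm.

```python
def multi_hop_mixed_attention(alpha, beta=0.3):
    """Blend direct (one-hop) attention with its two-hop composition.

    alpha: (n, n) row-stochastic NumPy attention matrix
    beta:  mixing weight for the indirect, two-hop path alpha @ alpha
    The result stays row-stochastic and spreads mass beyond each head's
    sparse focus, one crude proxy for multi-hop propagation.
    """
    return (1.0 - beta) * alpha + beta * (alpha @ alpha)
```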

3. Architectural Variants for Efficiency and Scalability

3.1 Block/Hierarchical and Factorized Schemes

  • Block-based schemes: Bi-BloSAN partitions sequences into blocks, applying self-attention locally (intra-block, $O(m r^2)$ for $m$ blocks of size $r$), then aggregates globally via inter-block self-attention ($O(m^2)$). This structure achieves subquadratic memory scaling while maintaining bidirectional non-locality; see the sketch after this list (Shen et al., 2018).
  • Factorized Attention for Vision: Factorization Self-Attention (FaSA) decomposes the full attention matrix into $G$ sub-attention matrices over groups of channels, each using dilated and cross-window key sampling. This reconstructs global context at the cost of local-window self-attention ($O(n)$), enabling long-range dependencies in high-resolution images (Qin et al., 2023).
  • Hybrid/Hierarchical Networks: UniFormer combines local convolutional attention in shallow layers with global multi-head self-attention at deeper stages, effectively covering both short-range and non-local dependencies in images and videos (Li et al., 2022).
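
A schematic of the block-based scheme, reusing the `self_attention` helper with square projections from the first sketch; Bi-BloSAN itself uses separate intra-/inter-block parameters, directional masks, and fusion gates that are omitted here.

```python
import numpy as np

def block_self_attention(H, W_q, W_k, W_v, block=4):
    """Intra-block attention plus attention over block summaries (schematic).

    H: (n, d) with n divisible by `block` (pad otherwise, an assumption here).
    Intra-block cost is O(m * r^2) for m blocks of size r; the inter-block
    pass over m summaries costs O(m^2), restoring global (non-local) reach.
    """
    n, d = H.shape
    m = n // block
    chunks = H.reshape(m, block, d)
    # intra-block: tokens attend only within their own chunk
    local = np.stack([self_attention(c, W_q, W_k, W_v) for c in chunks])
    # inter-block: mean-pooled block summaries attend to each other
    summaries = local.mean(axis=1)                      # (m, d)
    ctx = self_attention(summaries, W_q, W_k, W_v)      # (m, d)
    # broadcast each block's global context back onto its tokens
    return (local + ctx[:, None, :]).reshape(n, d)
```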

3.2 Reversed Attention and Hierarchical Dependency Trees

Reversed attention, as in DependencyViT, transposes the standard attention adjacency to implement "send" (child-to-parent) rather than "gather" (parent-to-child) semantics. Stacking these layers induces explicit dependency trees from leaves to root in a fully unsupervised way, supporting non-local compositionality for part-object hierarchies and scene parsing. Layers of reversed attention are computationally comparable to standard self-attention but can be pruned in deeper layers for efficiency (Ding et al., 2023).
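
The "send" semantics can be sketched by transposing the normalized attention matrix before aggregation; this is a simplification of DependencyViT's reversed attention (its message-controller and pruning components are omitted), and it reuses the `softmax` helper from the first code block.

```python
import numpy as np

def reversed_attention(H, W_q, W_k, W_v):
    """Child-to-parent ("send") aggregation instead of parent-to-child gathering.

    Each token i normalizes its outgoing weights over candidate parents j, so
    row i is child i's distribution over parents; transposing lets each parent
    aggregate exactly the mass its children send, which stacks into a soft
    dependency hierarchy over layers.
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    send = softmax(logits, axis=-1)       # row i: child i's parent distribution
    return send.T @ V                     # parent j sums the contributions sent to it
```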

3.3 Patchwise and Visual Extensions

Self-attention mechanisms generalized to vision scale non-local dependency modeling in the spatial domain. Patch-wise or image-grid self-attention, with or without explicit positional encoding, enables features in distant spatial zones (e.g., intersection corners) to interact directly, supporting robust visual scene topology extraction (Nakata et al., 2022).

4. Empirical Evidence and Evaluation of Non-local Dependency Modeling

Robust non-local modeling is consistently validated by analysis on:

  • Translation quality for long sentences: Context-aware, dependency-scaled, or supervised-attention models yield +0.6–1.0 BLEU over baselines, with much larger gains (+2–3 BLEU) for sentences >50 tokens (Yang et al., 2019, Bugliarello et al., 2019, Peng et al., 2021).
  • Sentence representation and NLI: On SICK-R, SICK-E, SNLI, and MultiNLI, dependency-encoded and distance-masked self-attention models outperform tree-based and vanilla sentence encoders, with improved stability for long or complex sentences (Im et al., 2017, Ma et al., 2022).
  • Visual recognition and robustness: Self-attention–augmented or factorized models (e.g., FaViT) surpass CNN or local-window ViT baselines on both vanilla classification tasks (ImageNet) and OOD/corrupted domains, highlighting the value of long-range aggregation (Qin et al., 2023, Li et al., 2022, Nakata et al., 2022).
  • Small-model regime: SAOBP yields gains in both accuracy and perplexity specifically for compact models (BERT-Mini, Small), which otherwise exhibit entropy collapse and poor maintenance of global context (Lee et al., 9 Sep 2025).

Specialized analyses:

  • Gate and mask analysis: Function and non-content words show higher context gating values, confirming that syntactic connectors require more non-local information (Yang et al., 2019).
  • Length bin ablations: Gains concentrate in long or complex structural settings, such as cross-sentence dependencies or long-range vision tasks (Qin et al., 2023).

5. Limitations, Trade-offs, and Open Challenges

  • Computational Complexity: Quadratic cost in length or area remains prohibitive for full-attention models; block/factorized/patchwise and pruning approaches are essential for scalability (Shen et al., 2018, Qin et al., 2023, Ding et al., 2023).
  • Parser/rule noise: Approaches that rely on external syntactic parses (dependency-scaling, explicit adjacency) are vulnerable to parser errors and overfitting; knowledge “sparsing” and regularization mitigate, but do not eliminate, this risk (Peng et al., 2021, Bugliarello et al., 2019).
  • Localization and entropy collapse: Attention heads can become narrowly focused in deep or small models, requiring normalization, structured propagation, or explicit penalty schemes to preserve non-local propagation (Ding et al., 2020, Lee et al., 9 Sep 2025).
  • Hybrid vs. pure self-attention: Blending self-attention with learned or fixed graphs (GNN-hybrids) yields best performance empirically, but increases architectural complexity and parameter count; dynamic graph construction remains an active area (Liu et al., 2018).

6. Connections and Comparative Perspectives

Self-attention’s non-local capability distinguishes it fundamentally from CNN and RNN models, and extensive hybridization with structural, syntactic, and spatial priors has led to broadly superior performance on tasks requiring global context. Nevertheless, the empirical dominance of these mechanisms is realized only when architectural, regularization, and sparsity strategies are carefully aligned to the modeling domain, inductive structure, and scale constraints.

A summary comparison of representative strategies appears below:

| Approach | Mechanism | Key Benefit |
| --- | --- | --- |
| Context-Aware Self-Attention (Yang et al., 2019) | Contextualized Q/K projections, gating | Layer-/sentence-global information in affinities |
| Dependency/Relation-Scaled (Peng et al., 2021; Bugliarello et al., 2019; Ma et al., 2022) | Tree/path-based scaling or relation-aware embedding | Encodes non-local linguistic topology |
| Factored/Hierarchical (Qin et al., 2023; Shen et al., 2018; Li et al., 2022) | Block, factorization, hybrid convolution/attention | Linear scaling, global context at tractable cost |
| Supervised Attention (Wang et al., 2019) | Head-level syntactic supervision | Accurate long-range dependency encoding |
| Multi-hop/Refinements (Lee et al., 9 Sep 2025; Ding et al., 2020) | Multi-hop BP, doubly-normalized attention | Prevents attention collapse, ensures distributed context |

Collectively, these developments reinforce the view that self-attention, augmented with explicit non-local dependency modeling and appropriate inductive bias, offers an expressive and adaptable paradigm for sequence and grid-structured data across modalities.
