
Non-Local Block in Neural Networks

Updated 22 April 2026
  • Non-Local Block is a neural network module that captures long-range dependencies by relating features across all spatial positions.
  • Variants such as SANL, Pyramid NL, and DNL optimize performance and efficiency by integrating multi-scale processing and noise reduction strategies.
  • These blocks are crucial in tasks like video classification, semantic segmentation, and compressed sensing, delivering state-of-the-art results with reduced computational cost.

A non-local block is a neural network module designed to capture long-range dependencies by directly relating features at all pairs of positions in a feature map, extending beyond the local receptive fields of standard convolutions or recurrences. This class of operations has become foundational in computer vision architectures and is widely adopted in tasks such as video classification, object recognition, semantic segmentation, compressed sensing, and entropy modeling. Variants of the non-local block address limitations of the vanilla design, improve efficiency, and incorporate domain knowledge or task-specific priors.

1. General Formulation and Mathematical Definition

Let $X \in \mathbb{R}^{N \times C}$ denote an input feature map with $N$ spatial (or spatio-temporal) positions and $C$ channels. A non-local block computes, for each output position $i$:

$$Y_i = \frac{1}{C(X)} \sum_{j=1}^{N} f(X_i, X_j) \cdot g(X_j)$$

where:

  • $f$ is a pairwise affinity function producing a scalar similarity between query $i$ and key $j$.
  • $g$ is a value embedding (typically a linear transformation: $g(X_j) = W_g X_j$).
  • $C(X)$ is a normalization factor, e.g., $C(X) = \sum_{j} f(X_i, X_j)$ (softmax normalization).

Popular affinity choices include:

  • Gaussian: $f(X_i, X_j) = e^{X_i^\top X_j}$
  • Embedded Gaussian: $f(X_i, X_j) = e^{\theta(X_i)^\top \phi(X_j)}$, with $\theta, \phi$ implemented as $1 \times 1$ convolutions
  • Dot-product: $f(X_i, X_j) = \theta(X_i)^\top \phi(X_j)$, with $C(X) = N$

The output is often wrapped in a residual structure, $Z_i = W_z Y_i + X_i$, with $W_z$ a $1 \times 1$ convolution, ensuring stable training and compatibility with pretrained backbones (Wang et al., 2017).
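As a concrete illustration, the embedded-Gaussian block above can be sketched in a few lines of NumPy, treating the feature map as a flattened $(N, C)$ matrix so the $1 \times 1$ convolutions reduce to matrix multiplies (the function and weight names here are illustrative, not from any reference implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(X, W_theta, W_phi, W_g, W_z):
    """Embedded-Gaussian non-local block with residual output (sketch).

    X: (N, C) feature map flattened over spatial positions.
    W_theta, W_phi, W_g: (C, C_b) bottleneck projections.
    W_z: (C_b, C) output projection restoring the channel count.
    """
    theta = X @ W_theta                   # queries, (N, C_b)
    phi = X @ W_phi                       # keys,    (N, C_b)
    g = X @ W_g                           # values,  (N, C_b)
    A = softmax(theta @ phi.T, axis=-1)   # (N, N) row-normalized affinities
    Y = A @ g                             # aggregated values, (N, C_b)
    return X + Y @ W_z                    # residual connection
```

A common trick with the residual form is to initialize the output projection near zero, so the block starts as an identity and can be dropped into a pretrained backbone without disturbing it.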

2. Design Variants and Efficiency Enhancements

a. Query-Independent and Global Context Simplifications

It has been empirically observed that attention maps in non-local networks often collapse to near query-independence, wherein all queries share almost identical weights. This finding motivates simplified blocks such as SNL and the Global Context (GC) block, which replace per-query attention with a shared, globally-pooled attention and strong channel-wise fusion, drastically reducing computational and memory complexity while maintaining accuracy (Cao et al., 2019).
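Under the query-independence observation, the per-query attention map can be replaced by a single shared map. A minimal NumPy sketch of a GC-style block follows (it omits the layer normalization the published block uses inside the fusion path; all names are illustrative):

```python
import numpy as np

def gc_block(X, w_k, W_v1, W_v2):
    """Query-independent global-context block (GC-style sketch).

    X: (N, C) flattened features.
    w_k: (C,) key projection producing one attention logit per position.
    W_v1: (C, C_r), W_v2: (C_r, C): bottleneck channel-fusion transform.
    Cost is O(N*C) rather than the O(N^2*C) of dense pairwise attention.
    """
    a = X @ w_k                                  # (N,) shared attention logits
    a = np.exp(a - a.max())
    a = a / a.sum()                              # one softmax map for all queries
    context = a @ X                              # (C,) globally pooled context
    delta = np.maximum(context @ W_v1, 0.0) @ W_v2  # ReLU bottleneck fusion
    return X + delta                             # broadcast-add to every position
```

Because every position receives the same context vector, the $N \times N$ affinity matrix disappears entirely, which is the source of the large memory savings reported for these simplified blocks.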

b. Pyramid and Multi-Scale Non-Local Blocks

Several architectures extend the vanilla block to operate efficiently at multiple spatial resolutions:

  • Pyramid Non-Local Module (PNL): Splits feature channels into groups, applies pooling at a different scale to each group, computes non-local responses at every scale with shared projections, and fuses them via a learned scale attention, yielding substantial reductions in compute and memory versus standard NL (Xu et al., 2020).
  • Asymmetric Pyramid Non-Local Block (APNB): Down-samples key/value embeddings via pooled pyramids, such as average pooling at multiple grid sizes, reducing the number of reference positions from $N$ to $S \ll N$ and compressing the attention map from $N \times N$ to $N \times S$ (Zhu et al., 2019).
  • Pyramid Non-Local Block (PNB): Computes asymmetric cross-scale affinities: high-resolution queries attend to multi-scale references, and outputs are concatenated across scales before a linear projection. This design substantially reduces memory versus standard NL for low-level image tasks (Zhu et al., 2020).
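The pooled-pyramid idea behind these asymmetric designs can be sketched as follows: keys and values are average-pooled over several grid sizes, so queries attend to only $S = \sum_g g^2$ reference vectors instead of all $N$ positions. This is a simplified NumPy sketch with illustrative names; the published blocks also apply learned projections to queries and references:

```python
import numpy as np

def pooled_references(X, H, W, grid_sizes=(1, 2, 4)):
    """Build a reduced key/value set by average pooling (APNB-style sketch).

    X: (H*W, C) flattened map. Each grid size g pools the map into g*g cells;
    pooled vectors from all scales are stacked, giving S = sum(g*g) references.
    """
    Xmap = X.reshape(H, W, -1)
    refs = []
    for g in grid_sizes:
        hs, ws = H // g, W // g
        for i in range(g):
            for j in range(g):
                cell = Xmap[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
                refs.append(cell.mean(axis=(0, 1)))   # average-pool the cell
    return np.stack(refs)                             # (S, C)

def asymmetric_attention(X, refs):
    """Full-resolution queries attend to the pooled reference set, shrinking
    the affinity matrix from (N, N) to (N, S)."""
    A = X @ refs.T
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ refs
```

For grid sizes (1, 2, 4) this gives $S = 1 + 4 + 16 = 21$ references regardless of input resolution, so the attention cost grows only linearly in $N$.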

c. Disentangled and Denoised Non-Local Blocks

  • Disentangled NL (DNL): Separates the affinity into a whitened pairwise term (capturing intra-region consistency) and a unary saliency term (capturing boundary structure). Each is normalized and fused additively, restoring term-specific specialization and yielding large consistent accuracy gains in segmentation, detection, and action recognition (Yin et al., 2020).
  • Denoised Non-Local Block: Addresses noisy attention through two auxiliary operations: global rectifying (GR), which suppresses inter-class affinity based on coarse per-pixel class predictions, and local retention (LR), which sharpens intra-class regions via local sliding-window reweighting of the affinity. This approach further elevates segmentation accuracy beyond prior NL variants (Song et al., 2021).
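The disentangled affinity can be sketched directly, assuming the commonly described DNL form: the pairwise term is computed on mean-subtracted (whitened) embeddings, the unary term is a query-independent softmax over keys, and the two normalized maps are summed (all names illustrative):

```python
import numpy as np

def _softmax_rows(Z):
    """Row-wise numerically stable softmax."""
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def disentangled_affinity(theta, phi, u):
    """DNL-style affinity (sketch): whitened pairwise term plus a
    query-independent unary term, each softmax-normalized, then added.

    theta, phi: (N, C_b) query/key embeddings; u: (C_b,) unary projection.
    """
    # Whitened pairwise term: subtract per-set means before the dot product,
    # isolating intra-region consistency from global saliency.
    pairwise = _softmax_rows((theta - theta.mean(0)) @ (phi - phi.mean(0)).T)
    # Unary saliency term: one weight per key, shared by every query.
    unary = _softmax_rows((phi @ u)[None, :])       # (1, N), broadcast over rows
    return pairwise + unary                         # (N, N) combined weights
```

Because each term is normalized separately, neither can dominate the other during training, which is the specialization effect the disentangled design restores.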

3. Integrating Spatial Awareness and Task Priors

The spatial-aware non-local (SANL) block integrates an external spatial prior into the pairwise similarity computation: input embeddings are gated by the prior before the affinity is computed, so non-local attention is evaluated on these spatially-gated embeddings. The prior is derived from task-specific class activation maps (e.g., Grad-CAM) and supplied at each stride of a feature pyramid, focusing the global context on semantically relevant regions while simultaneously preserving low-level boundary and high-level semantic cues (Li et al., 2019).
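The exact gating function is not specified here; a minimal sketch under the assumption that the prior multiplicatively scales each position's features before the affinity is computed (all names illustrative):

```python
import numpy as np

def spatial_gated_affinity(X, prior):
    """Spatial-aware gating sketch (assumed multiplicative form).

    X: (N, C) features; prior: (N,) per-position saliency in [0, 1]
    (e.g., a flattened Grad-CAM map). Positions with low prior contribute
    weakly to every pairwise similarity, biasing attention toward
    semantically relevant regions.
    """
    Xg = X * prior[:, None]                    # (N, C) gated embeddings
    A = Xg @ Xg.T                              # (N, N) affinities on gated features
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    return A / A.sum(axis=-1, keepdims=True)   # row-normalized attention
```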

4. Application Domains and Benchmark Results

Non-local blocks and their derivatives have demonstrated consistent state-of-the-art improvements across diverse modalities:

| Application | Backbone/Framework | Gain from NL | Gain from variant block | Source |
| --- | --- | --- | --- | --- |
| Video classification | ResNet-50, Kinetics | +2.0% top-1 | +2–3% for PNL, SNL, DNL | Wang et al., 2017; Xu et al., 2020; Yin et al., 2020 |
| Fashion landmark | ResNet-101 + FPN, DeepFashion-C | −0.0012 NE (NL) | −0.0033 NE (SANL) | Li et al., 2019 |
| COCO detection | Mask R-CNN, ResNet-50 | +1.0 AP | +1.1 AP (GC), +0.7–1.3 AP (DNL) | Cao et al., 2019; Yin et al., 2020 |
| Segmentation | ResNet-101, Cityscapes | +2.7% mIoU | +4.7% (DNL), +5.6% (Denoised NL) | Yin et al., 2020; Song et al., 2021 |
| Image restoration | ResNet-based, smoothing/denoising | +1 dB PSNR | +0.09–0.32 dB (PNB) | Zhu et al., 2020 |

In compressed sensing and entropy modeling, non-local blocks enable networks to exploit non-local self-similarity for higher-fidelity reconstructions, with explicit domain-specific affinities and auxiliary loss terms to enforce affinity symmetry (Cui et al., 2021, Li et al., 2020).

5. Computational Considerations and Implementation Strategies

A vanilla non-local block with $N$ positions and $C$ channels incurs $O(N^2 C)$ compute and $O(N^2)$ memory due to the dense $N \times N$ affinity matrix. Critical strategies for tractability include:

  • Bottlenecking channel dimensions via $1 \times 1$ convolutions (e.g., projecting $C$ down to $C/2$)
  • Subsampling spatially (downsampled key/value sets or pyramidal reference sets)
  • Query-independent (global pooling) simplifications
  • Carefully placing blocks at middle or top layers of deep networks for maximal gain per cost

Pyramid and asymmetric designs allow insertion into high-resolution, low-level tasks that would otherwise be intractable. Simplified blocks (GC, SNL) are suitable for deep or resource-constrained scenarios, as they match or exceed NL performance with substantially less computational overhead (Cao et al., 2019, Xu et al., 2020, Zhu et al., 2020, Zhu et al., 2019).

6. Theoretical Perspectives and Broader Implications

Non-local blocks admit unification under the framework of fully connected graph filters, being interpretable as spectral graph operators with learned affinity kernels. Through Chebyshev polynomial approximation, one can generalize NL to higher-order spectral blocks (SNL), enabling flexible and stable graph-based filtering (Zhu et al., 2021). The addition of external priors, local/global separation, and scale-adaptive fusions extends the block's reach beyond classical self-attention paradigms.
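The Chebyshev-polynomial view can be illustrated as a graph filter over the affinity matrix. The sketch below is a generic Chebyshev graph filter under the standard $\lambda_{\max} \approx 2$ rescaling of the normalized Laplacian, not the exact formulation of the cited work (function and variable names are illustrative):

```python
import numpy as np

def chebyshev_graph_filter(X, A, thetas):
    """Order-k Chebyshev graph filter over an affinity graph (sketch).

    X: (N, C) node features; A: (N, N) symmetric non-negative affinities;
    thetas: list of k >= 2 filter coefficients. Applies
    sum_k thetas[k] * T_k(L_hat) @ X using the Chebyshev recurrence
    T_k = 2 * L_hat @ T_{k-1} - T_{k-2}.
    """
    n = len(A)
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(n) - Dinv @ A @ Dinv    # symmetric normalized Laplacian
    L_hat = L - np.eye(n)              # rescale spectrum into [-1, 1] (lambda_max ~ 2)
    Tk_prev, Tk = np.eye(n), L_hat     # T_0 = I, T_1 = L_hat
    out = thetas[0] * X + thetas[1] * (Tk @ X)
    for k in range(2, len(thetas)):
        Tk_prev, Tk = Tk, 2 * L_hat @ Tk - Tk_prev   # Chebyshev recurrence
        out = out + thetas[k] * (Tk @ X)
    return out
```

With coefficients $(1, 0, \dots)$ the filter is the identity; higher-order terms mix information over progressively longer graph paths, which is the sense in which such filters generalize first-order non-local aggregation.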

Integrating data-driven or task-specific spatial priors into the similarity computation focuses network capacity on semantically or spatially important regions, which has garnered robust empirical gains and suggests a new design pattern for attention-based modules across vision models (Li et al., 2019).

7. Summary Table: Notable Non-Local Block Variants

| Variant | Key architectural feature(s) | Principal gains | Reference |
| --- | --- | --- | --- |
| Vanilla NL | Bidirectional, dense pairwise affinity | Long-range dependency, SOTA on video/CV | Wang et al., 2017 |
| SANL | Spatial map prior (Grad-CAM) injects bias | Sharpened, task-driven attention | Li et al., 2019 |
| Pyramid NL | Multi-scale, pooled references | Efficient regional/global correlation | Xu et al., 2020; Zhu et al., 2020 |
| APNB/AFNB | Asymmetric spatial pyramid or fusion | Large resource savings | Zhu et al., 2019 |
| DNL | Disentangles pairwise and unary streams | Improved semantic specialization | Yin et al., 2020 |
| Denoised NL | Denoises via class gating + local fusion | SOTA semantic segmentation accuracy | Song et al., 2021 |
| GC/SNL | Global pooling, query independence, graph view | Maintains gains at markedly lower cost | Cao et al., 2019; Zhu et al., 2021 |

The continued evolution of non-local blocks evidences their significance as universal operators for context modeling in deep networks, enabling long-range interaction, adaptability, and efficiency across a spectrum of vision tasks.
