Non-Local Block in Neural Networks
- Non-Local Block is a neural network module that captures long-range dependencies by relating features across all spatial positions.
- Variants such as SANL, Pyramid NL, and DNL improve accuracy and efficiency through spatial priors, multi-scale processing, and disentangling or denoising strategies.
- These blocks are crucial in tasks like video classification, semantic segmentation, and compressed sensing, delivering state-of-the-art results with reduced computational cost.
A non-local block is a neural network module designed to capture long-range dependencies by directly relating features at all pairs of positions in a feature map, extending beyond the local receptive fields of standard convolutions or recurrences. This class of operations has become foundational in computer vision architectures and is widely adopted in tasks such as video classification, object recognition, semantic segmentation, compressed sensing, and entropy modeling. Variants of the non-local block address limitations of the vanilla design, improve efficiency, and incorporate domain knowledge or task-specific priors.
1. General Formulation and Mathematical Definition
Let $X = \{x_i\}_{i=1}^{N}$ denote an input feature map with $N$ spatial (or spatio-temporal) positions and $C$ channels. A non-local block computes, for each output position $i$,

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),$$

where:
- $f(x_i, x_j)$ is a pairwise affinity function producing a scalar similarity between query $x_i$ and key $x_j$.
- $g(x_j)$ is a value embedding (typically a linear transformation: $g(x_j) = W_g x_j$).
- $\mathcal{C}(x)$ is a normalization factor, e.g., $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$ (softmax normalization).
Popular affinity choices include:
- Gaussian: $f(x_i, x_j) = e^{x_i^\top x_j}$
- Embedded Gaussian: $f(x_i, x_j) = e^{\theta(x_i)^\top \phi(x_j)}$, with $\theta$, $\phi$ implemented as $1 \times 1$ convolutions
- Dot-product: $f(x_i, x_j) = \theta(x_i)^\top \phi(x_j)$; $\mathcal{C}(x) = N$
The output is often wrapped in a residual structure: $z_i = W_z y_i + x_i$, with $W_z$ a $1 \times 1$ convolution, ensuring stable training and compatibility with pretrained backbones (Wang et al., 2017).
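For concreteness, the following is a minimal PyTorch sketch of the embedded-Gaussian block with the residual wrapper described above. The channel bottleneck to $C/2$ and the zero-initialized output projection mirror common practice, but class and argument names are illustrative rather than taken from any reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block with a residual connection (sketch)."""

    def __init__(self, in_channels: int, inter_channels: int = None):
        super().__init__()
        # Channel bottleneck: C' = C / 2 is a common (assumed) choice.
        self.inter_channels = inter_channels or max(in_channels // 2, 1)
        # theta / phi / g: 1x1 convolutions producing query, key, and value embeddings.
        self.theta = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.g = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        # W_z projects back to C channels; zero init keeps the block an identity at start.
        self.w_z = nn.Conv2d(self.inter_channels, in_channels, kernel_size=1)
        nn.init.zeros_(self.w_z.weight)
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        n = h * w
        q = self.theta(x).view(b, self.inter_channels, n).permute(0, 2, 1)  # (B, N, C')
        k = self.phi(x).view(b, self.inter_channels, n)                     # (B, C', N)
        v = self.g(x).view(b, self.inter_channels, n).permute(0, 2, 1)      # (B, N, C')
        # Dense pairwise affinity f(x_i, x_j) = exp(theta^T phi), softmax-normalized over j.
        attn = F.softmax(torch.bmm(q, k), dim=-1)                           # (B, N, N)
        y = torch.bmm(attn, v).permute(0, 2, 1).reshape(b, self.inter_channels, h, w)
        return self.w_z(y) + x                                              # z = W_z y + x
```

Inserting such a block after a mid-level stage of a pretrained backbone is the typical usage; the zero-initialized $W_z$ makes the insertion initially behave as an identity mapping.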
2. Design Variants and Efficiency Enhancements
a. Query-Independent and Global Context Simplifications
It has been empirically observed that attention maps in non-local networks often collapse to near query-independence, wherein all queries share almost identical weights. This finding motivates simplified blocks such as the Simplified Non-Local (SNL) and Global Context (GC) blocks, which replace per-query attention with a shared, globally-pooled attention and strong channel-wise fusion, drastically reducing computational and memory complexity while maintaining accuracy (Cao et al., 2019).
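A minimal sketch of this query-independent simplification, in the spirit of the GC block: a single shared attention map pools a global context vector, which is fused back through a bottlenecked channel transform. The bottleneck ratio and normalization placement are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextBlock(nn.Module):
    """Query-independent global-context block (sketch in the spirit of GC)."""

    def __init__(self, channels: int, ratio: int = 16):
        super().__init__()
        # One shared attention map for all queries (query independence).
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        hidden = max(channels // ratio, 1)
        # Channel-wise fusion of the pooled context vector.
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Shared attention weights over the H*W positions: (B, 1, N).
        weights = F.softmax(self.attn(x).view(b, 1, h * w), dim=-1)
        # Globally pooled context vector: (B, C, 1, 1).
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2)).view(b, c, 1, 1)
        # Broadcast fusion back onto every position.
        return x + self.transform(context)
```

Because there is only one attention map per image instead of one per query, compute and memory drop from quadratic to linear in the number of positions.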
b. Pyramid and Multi-Scale Non-Local Blocks
Several architectures extend the vanilla block to operate efficiently at multiple spatial resolutions:
- Pyramid Non-Local Module (PNL): Splits the feature channels into groups, applies pooling at a different scale to each group, computes non-local responses at every scale with shared projections, and fuses them via a learned scale-attention, yielding substantial compute/memory reductions versus standard NL (Xu et al., 2020).
- Asymmetric Pyramid Non-Local Block (APNB): Down-samples key/value embeddings via pooled pyramids, such as average pooling at multiple grid sizes, reducing the number of reference positions from $N$ to a small set of $S \ll N$ anchors and compressing the attention map from $N \times N$ to $N \times S$ (Zhu et al., 2019); a minimal sketch of this asymmetric design follows this list.
- Pyramid Non-Local Block (PNB): Computes asymmetric cross-scale affinities: high-resolution queries attend to multi-scale references, and outputs are concatenated across scales before a linear projection, yielding substantial memory reductions versus standard NL in low-level image tasks (Zhu et al., 2020).
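The asymmetric design referenced above can be sketched as follows: queries stay at full resolution while keys and values are pooled to a small set of anchor positions. The pyramid bin sizes below are assumptions chosen for illustration, not the exact configuration of APNB.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricNonLocal2D(nn.Module):
    """APNB-style non-local block with pyramid-pooled keys/values (sketch)."""

    def __init__(self, in_channels: int, inter_channels: int = None,
                 pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.inter_channels = inter_channels or max(in_channels // 2, 1)
        self.pool_sizes = pool_sizes
        self.theta = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.g = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.w_z = nn.Conv2d(self.inter_channels, in_channels, kernel_size=1)

    def _pyramid_pool(self, feat: torch.Tensor) -> torch.Tensor:
        # Pool to each grid size and flatten: (B, C', S) with S = sum of bin counts.
        b, c = feat.shape[:2]
        pooled = [F.adaptive_avg_pool2d(feat, size).view(b, c, -1)
                  for size in self.pool_sizes]
        return torch.cat(pooled, dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        n = h * w
        q = self.theta(x).view(b, self.inter_channels, n).permute(0, 2, 1)  # (B, N, C')
        k = self._pyramid_pool(self.phi(x))                                 # (B, C', S)
        v = self._pyramid_pool(self.g(x)).permute(0, 2, 1)                  # (B, S, C')
        # Affinity is now N x S instead of N x N, with S << N.
        attn = F.softmax(torch.bmm(q, k), dim=-1)                           # (B, N, S)
        y = torch.bmm(attn, v).permute(0, 2, 1).reshape(b, self.inter_channels, h, w)
        return self.w_z(y) + x
```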
c. Disentangled and Denoised Non-Local Blocks
- Disentangled NL (DNL): Separates the affinity into a whitened pairwise term (capturing intra-region consistency) and a unary saliency term (capturing boundary structure). Each is normalized and fused additively, restoring term-specific specialization and yielding large, consistent accuracy gains in segmentation, detection, and action recognition (Yin et al., 2020); a sketch of the disentangled affinity follows this list.
- Denoised Non-Local Block: Addresses noisy attention through two auxiliary operations: global rectifying (GR), which suppresses inter-class affinity based on coarse per-pixel class predictions, and local retention (LR), which sharpens intra-class regions via local sliding-window reweighting of the affinity. This approach further elevates segmentation accuracy beyond prior NL variants (Song et al., 2021).
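The disentangled affinity of DNL, referenced above, can be sketched as a whitened pairwise term plus a query-independent unary term, each normalized separately and then summed. Tensor shapes and the source of the unary logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def disentangled_affinity(q: torch.Tensor, k: torch.Tensor,
                          unary_logits: torch.Tensor) -> torch.Tensor:
    """DNL-style affinity (sketch): whitened pairwise term + unary saliency term.

    q, k:          (B, N, C') query/key embeddings.
    unary_logits:  (B, N) per-key saliency logits (e.g., from a 1x1 conv branch).
    """
    # Whitening: subtract per-image means so the pairwise term captures
    # purely relative similarity (intra-region consistency).
    q_w = q - q.mean(dim=1, keepdim=True)
    k_w = k - k.mean(dim=1, keepdim=True)
    pairwise = F.softmax(torch.bmm(q_w, k_w.transpose(1, 2)), dim=-1)  # (B, N, N)
    # Unary term: the same saliency distribution is shared by every query.
    unary = F.softmax(unary_logits, dim=-1).unsqueeze(1)               # (B, 1, N)
    return pairwise + unary                                            # additive fusion
```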
3. Integrating Spatial Awareness and Task Priors
The spatial-aware non-local (SANL) block integrates an external spatial prior map into the pairwise similarity computation, gating the query/key embeddings with the prior before evaluating non-local attention on these spatially-gated embeddings. The prior is derived from task-specific class activation maps (e.g., Grad-CAM) and supplied at each stride of a feature pyramid, focusing the global context on semantically relevant regions while preserving both low-level boundary and high-level semantic cues (Li et al., 2019).
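As a rough illustration of the spatial gating idea (the exact point at which Li et al. (2019) apply the prior may differ; the function below simply rescales query/key embeddings by a prior map before computing the affinity, and all names are hypothetical):

```python
import torch

def spatially_gated_affinity(q: torch.Tensor, k: torch.Tensor,
                             prior: torch.Tensor) -> torch.Tensor:
    """Non-local affinity on spatially gated embeddings (sketch).

    q, k:   (B, N, C') query/key embeddings.
    prior:  (B, N) external spatial prior in [0, 1], e.g., a Grad-CAM map
            resized and flattened to the feature resolution.
    """
    gate = prior.unsqueeze(-1)                 # (B, N, 1)
    q_g, k_g = q * gate, k * gate              # attenuate embeddings outside salient regions
    return torch.softmax(torch.bmm(q_g, k_g.transpose(1, 2)), dim=-1)  # (B, N, N)
```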
4. Application Domains and Benchmark Results
Non-local blocks and their derivatives have demonstrated consistent state-of-the-art improvements across diverse modalities:
| Application | Backbone/Framework | Gain from NL | Gain from Variant Block | Source |
|---|---|---|---|---|
| Video Classification | ResNet-50, Kinetics | +2.0% top-1 | +2–3% for PNL, SNL, DNL | (Wang et al., 2017, Xu et al., 2020, Yin et al., 2020) |
| Fashion Landmark | ResNet-101+FPN, DeepFashion-C | -0.0012 normalized error (NL) | -0.0033 normalized error (SANL) | (Li et al., 2019) |
| COCO Detection | Mask R-CNN, ResNet-50 | +1.0 AP | +1.1 AP (GC), +0.7–1.3 AP (DNL) | (Cao et al., 2019, Yin et al., 2020) |
| Segmentation | ResNet-101, Cityscapes | +2.7% mIoU | +4.7% (DNL), +5.6% (Denoised NL) | (Yin et al., 2020, Song et al., 2021) |
| Image Restoration | ResNet+, Smoothing/Denoising | +1 dB PSNR | +0.09–0.32 dB (PNB) | (Zhu et al., 2020) |
In compressed sensing and entropy modeling, non-local blocks enable networks to exploit non-local self-similarity for higher-fidelity reconstructions, with explicit domain-specific affinities and auxiliary loss terms to enforce affinity symmetry (Cui et al., 2021, Li et al., 2020).
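One simple way to realize such a symmetry constraint is an auxiliary penalty on the affinity matrix; the sketch below is an assumed formulation of the idea, not the exact loss used in the cited works.

```python
import torch

def affinity_symmetry_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalize asymmetry of the pairwise affinity matrix (sketch).

    attn: (B, N, N) affinity matrices; mutual self-similarity suggests
    f(x_i, x_j) should roughly equal f(x_j, x_i).
    """
    return ((attn - attn.transpose(1, 2)) ** 2).mean()
```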
5. Computational Considerations and Implementation Strategies
A vanilla non-local block with $N$ positions and $C$ channels incurs $\mathcal{O}(N^2 C)$ compute and $\mathcal{O}(N^2)$ memory due to the dense $N \times N$ affinity matrix. Critical strategies for tractability include (a rough worked example follows the list):
- Bottlenecking channel dimensions via $1 \times 1$ convolutions (e.g., reducing the embedding width to $C/2$)
- Subsampling spatially (downsampled key/value embeddings $\phi$, $g$, or pyramidal reference sets)
- Query-independent (global pooling) simplifications
- Carefully placing blocks at middle or top layers of deep networks for maximal gain per cost
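As a rough worked example of how these strategies compound (feature-map size and channel counts are illustrative assumptions, not values from the cited papers):

$$
\begin{aligned}
&N = 64 \times 64 = 4096,\ C = 1024: && \text{the dense affinity has } N^2 \approx 1.7 \times 10^{7} \text{ entries};\\
&\text{bottleneck to } C' = C/2 = 512: && \text{halves the } \mathcal{O}(N^2 C') \text{ aggregation cost};\\
&\text{pooled references, } S = 110 \ll N: && \text{the affinity shrinks to } N \times S \approx 4.5 \times 10^{5} \text{ entries}.
\end{aligned}
$$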
Pyramid and asymmetric designs allow insertion into high-resolution, low-level tasks that would otherwise be intractable. Simplified blocks (GC, SNL) are suitable for deep or resource-constrained scenarios, as they match or exceed NL performance at a fraction of the computational overhead (Cao et al., 2019, Xu et al., 2020, Zhu et al., 2020, Zhu et al., 2019).
6. Theoretical Perspectives and Broader Implications
Non-local blocks admit unification under the framework of fully connected graph filters, being interpretable as spectral graph operators with learned affinity kernels. Through Chebyshev polynomial approximation, one can generalize NL to higher-order spectral blocks (SNL), enabling flexible and stable graph-based filtering (Zhu et al., 2021). The addition of external priors, local/global separation, and scale-adaptive fusions extends the block's reach beyond classical self-attention paradigms.
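Concretely, writing the row-normalized affinity as the adjacency matrix $A$ of a fully connected graph over positions (the notation below is assumed for illustration and is not taken verbatim from the cited work), the non-local aggregation and its spectral generalization can be summarized as:

$$
Y = A\, X\, W_g \qquad\longrightarrow\qquad Y_K = \sum_{k=0}^{K} T_k\!\left(\tilde{L}\right) X\, W_k,
$$

where $\tilde{L}$ is a rescaled graph Laplacian built from $A$ and $T_k$ are Chebyshev polynomials; the vanilla block corresponds to a low-order special case of this filter family.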
Integrating data-driven or task-specific spatial priors into the similarity computation focuses network capacity on semantically or spatially important regions, which has garnered robust empirical gains and suggests a new design pattern for attention-based modules across vision models (Li et al., 2019).
7. Summary Table: Notable Non-Local Block Variants
| Variant | Key Architectural Feature(s) | Principal Gains | Reference |
|---|---|---|---|
| Vanilla NL | Bidirectional, dense pairwise affinity | Long-range dependency, SOTA on video/CV | (Wang et al., 2017) |
| SANL | Spatial map prior (Grad-CAM) injects bias | Sharpened, task-driven attention | (Li et al., 2019) |
| Pyramid NL | Multi-scale, pooled references | Efficient regional/global correlation | (Xu et al., 2020, Zhu et al., 2020) |
| APNB/AFNB | Asymmetric pyramid-pooled keys/values; cross-level fusion | Large compute/memory savings | (Zhu et al., 2019) |
| DNL | Disentangles pairwise and unary streams | Improved semantic specialization | (Yin et al., 2020) |
| Denoised NL | Denoises via class gating + local fusion | SOTA semantic segmentation accuracy | (Song et al., 2021) |
| GC/SNL | Global pooling, query independence, graph-based filtering | Maintains gains at much lower cost | (Cao et al., 2019, Zhu et al., 2021) |
The continued evolution of non-local blocks evidences their significance as universal operators for context modeling in deep networks, enabling long-range interaction, adaptability, and efficiency across a spectrum of vision tasks.