Non-Local Block in Neural Networks
- Non-Local Block is a neural network module that captures long-range dependencies by relating features across all spatial positions.
- Variants such as SANL, Pyramid NL, and DNL improve accuracy and efficiency through spatial priors, multi-scale processing, and disentangling or denoising strategies.
- These blocks are crucial in tasks like video classification, semantic segmentation, and compressed sensing, delivering state-of-the-art results with reduced computational cost.
A non-local block is a neural network module designed to capture long-range dependencies by directly relating features at all pairs of positions in a feature map, extending beyond the local receptive fields of standard convolutions or recurrences. This class of operations has become foundational in computer vision architectures and is widely adopted in tasks such as video classification, object recognition, semantic segmentation, compressed sensing, and entropy modeling. Variants of the non-local block address limitations of the vanilla design, improve efficiency, and incorporate domain knowledge or task-specific priors.
1. General Formulation and Mathematical Definition
Let $X = \{x_i\}_{i=1}^{N}$ denote an input feature map with $N$ spatial (or spatio-temporal) positions and $C$ channels. A non-local block computes, for each output position $i$,

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),$$

where:
- $f(x_i, x_j)$ is a pairwise affinity function producing a scalar similarity between query $x_i$ and key $x_j$.
- $g(x_j)$ is a value embedding (typically a linear transformation: $g(x_j) = W_g x_j$).
- $\mathcal{C}(x)$ is a normalization factor, e.g., $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$ (softmax normalization).
Popular affinity choices include:
- Gaussian: $f(x_i, x_j) = e^{x_i^\top x_j}$
- Embedded Gaussian: $f(x_i, x_j) = e^{\theta(x_i)^\top \phi(x_j)}$, with $\theta$, $\phi$ implemented as $1 \times 1$ convolutions
- Dot-product: $f(x_i, x_j) = \theta(x_i)^\top \phi(x_j)$; $\mathcal{C}(x) = N$
The output is often wrapped in a residual structure: $z_i = W_z y_i + x_i$, with $W_z$ a $1 \times 1$ convolution, ensuring stable training and compatibility with pretrained backbones (Wang et al., 2017).
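For concreteness, the following is a minimal PyTorch sketch of the embedded-Gaussian block with the residual wrapper described above. The channel bottleneck to $C/2$ and the zero-initialized output projection mirror common practice, but class and argument names are illustrative rather than taken from any reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block with a residual connection (sketch)."""

    def __init__(self, in_channels: int, inter_channels: int = None):
        super().__init__()
        # Channel bottleneck: C' = C / 2 is a common (assumed) choice.
        self.inter_channels = inter_channels or max(in_channels // 2, 1)
        # theta / phi / g: 1x1 convolutions producing query, key, and value embeddings.
        self.theta = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.g = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        # W_z projects back to C channels; zero init keeps the block an identity at start.
        self.w_z = nn.Conv2d(self.inter_channels, in_channels, kernel_size=1)
        nn.init.zeros_(self.w_z.weight)
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        n = h * w
        q = self.theta(x).view(b, self.inter_channels, n).permute(0, 2, 1)  # (B, N, C')
        k = self.phi(x).view(b, self.inter_channels, n)                     # (B, C', N)
        v = self.g(x).view(b, self.inter_channels, n).permute(0, 2, 1)      # (B, N, C')
        # Dense pairwise affinity f(x_i, x_j) = exp(theta^T phi), softmax-normalized over j.
        attn = F.softmax(torch.bmm(q, k), dim=-1)                           # (B, N, N)
        y = torch.bmm(attn, v).permute(0, 2, 1).reshape(b, self.inter_channels, h, w)
        return self.w_z(y) + x                                              # z = W_z y + x
```

Inserting such a block after a mid-level stage of a pretrained backbone is the typical usage; the zero-initialized $W_z$ makes the insertion initially behave as an identity mapping.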
2. Design Variants and Efficiency Enhancements
a. Query-Independent and Global Context Simplifications
It has been empirically observed that attention maps in non-local networks often collapse to near query-independence, wherein all queries share almost identical weights. This finding motivates simplified blocks such as the Simplified Non-Local (SNL) and Global Context (GC) blocks, which replace per-query attention with a shared, globally-pooled attention and strong channel-wise fusion, drastically reducing computational and memory complexity while maintaining accuracy (Cao et al., 2019).
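A minimal sketch of this query-independent simplification, in the spirit of the GC block: a single shared attention map pools a global context vector, which is fused back through a bottlenecked channel transform. The bottleneck ratio and normalization placement are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextBlock(nn.Module):
    """Query-independent global-context block (sketch in the spirit of GC)."""

    def __init__(self, channels: int, ratio: int = 16):
        super().__init__()
        # One shared attention map for all queries (query independence).
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        hidden = max(channels // ratio, 1)
        # Channel-wise fusion of the pooled context vector.
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Shared attention weights over the H*W positions: (B, 1, N).
        weights = F.softmax(self.attn(x).view(b, 1, h * w), dim=-1)
        # Globally pooled context vector: (B, C, 1, 1).
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2)).view(b, c, 1, 1)
        # Broadcast fusion back onto every position.
        return x + self.transform(context)
```

Because there is only one attention map per image instead of one per query, compute and memory drop from quadratic to linear in the number of positions.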
b. Pyramid and Multi-Scale Non-Local Blocks
Several architectures extend the vanilla block to operate efficiently at multiple spatial resolutions:
- Pyramid Non-Local Module (PNL): Splits the feature channels into groups, applies pooling at a different scale to each group, computes non-local responses at every scale with shared projections, and fuses them via a learned scale-attention, yielding substantial compute/memory reductions versus standard NL (Xu et al., 2020).
- Asymmetric Pyramid Non-Local Block (APNB): Down-samples key/value embeddings via pooled pyramids, such as average pooling at multiple grid sizes, reducing the number of reference positions from $N$ to a small set of $S \ll N$ anchors and compressing the attention map from $N \times N$ to $N \times S$ (Zhu et al., 2019); a minimal sketch of this asymmetric design follows this list.
- Pyramid Non-Local Block (PNB): Computes asymmetric cross-scale affinities: high-resolution queries attend to multi-scale references, and outputs are concatenated across scales before a linear projection, yielding substantial memory reductions versus standard NL in low-level image tasks (Zhu et al., 2020).
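The asymmetric design referenced above can be sketched as follows: queries stay at full resolution while keys and values are pooled to a small set of anchor positions. The pyramid bin sizes below are assumptions chosen for illustration, not the exact configuration of APNB.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricNonLocal2D(nn.Module):
    """APNB-style non-local block with pyramid-pooled keys/values (sketch)."""

    def __init__(self, in_channels: int, inter_channels: int = None,
                 pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.inter_channels = inter_channels or max(in_channels // 2, 1)
        self.pool_sizes = pool_sizes
        self.theta = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.g = nn.Conv2d(in_channels, self.inter_channels, kernel_size=1)
        self.w_z = nn.Conv2d(self.inter_channels, in_channels, kernel_size=1)

    def _pyramid_pool(self, feat: torch.Tensor) -> torch.Tensor:
        # Pool to each grid size and flatten: (B, C', S) with S = sum of bin counts.
        b, c = feat.shape[:2]
        pooled = [F.adaptive_avg_pool2d(feat, size).view(b, c, -1)
                  for size in self.pool_sizes]
        return torch.cat(pooled, dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        n = h * w
        q = self.theta(x).view(b, self.inter_channels, n).permute(0, 2, 1)  # (B, N, C')
        k = self._pyramid_pool(self.phi(x))                                 # (B, C', S)
        v = self._pyramid_pool(self.g(x)).permute(0, 2, 1)                  # (B, S, C')
        # Affinity is now N x S instead of N x N, with S << N.
        attn = F.softmax(torch.bmm(q, k), dim=-1)                           # (B, N, S)
        y = torch.bmm(attn, v).permute(0, 2, 1).reshape(b, self.inter_channels, h, w)
        return self.w_z(y) + x
```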
c. Disentangled and Denoised Non-Local Blocks
- Disentangled NL (DNL): Separates the affinity into a whitened pairwise term (capturing intra-region consistency) and a unary saliency term (capturing boundary structure). Each is normalized and fused additively, restoring term-specific specialization and yielding large, consistent accuracy gains in segmentation, detection, and action recognition (Yin et al., 2020); a sketch of the disentangled affinity follows this list.
- Denoised Non-Local Block: Addresses noisy attention through two auxiliary operations: global rectifying (GR), which suppresses inter-class affinity based on coarse per-pixel class predictions, and local retention (LR), which sharpens intra-class regions via local sliding-window reweighting of the affinity. This approach further elevates segmentation accuracy beyond prior NL variants (Song et al., 2021).
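The disentangled affinity of DNL, referenced above, can be sketched as a whitened pairwise term plus a query-independent unary term, each normalized separately and then summed. Tensor shapes and the source of the unary logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def disentangled_affinity(q: torch.Tensor, k: torch.Tensor,
                          unary_logits: torch.Tensor) -> torch.Tensor:
    """DNL-style affinity (sketch): whitened pairwise term + unary saliency term.

    q, k:          (B, N, C') query/key embeddings.
    unary_logits:  (B, N) per-key saliency logits (e.g., from a 1x1 conv branch).
    """
    # Whitening: subtract per-image means so the pairwise term captures
    # purely relative similarity (intra-region consistency).
    q_w = q - q.mean(dim=1, keepdim=True)
    k_w = k - k.mean(dim=1, keepdim=True)
    pairwise = F.softmax(torch.bmm(q_w, k_w.transpose(1, 2)), dim=-1)  # (B, N, N)
    # Unary term: the same saliency distribution is shared by every query.
    unary = F.softmax(unary_logits, dim=-1).unsqueeze(1)               # (B, 1, N)
    return pairwise + unary                                            # additive fusion
```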
3. Integrating Spatial Awareness and Task Priors
The spatial-aware non-local (SANL) block integrates an external spatial prior map into the pairwise similarity computation, gating the query/key embeddings with the prior before evaluating non-local attention on these spatially-gated embeddings. The prior is derived from task-specific class activation maps (e.g., Grad-CAM) and supplied at each stride of a feature pyramid, focusing the global context on semantically relevant regions while preserving both low-level boundary and high-level semantic cues (Li et al., 2019).
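As a rough illustration of the spatial gating idea (the exact point at which Li et al. (2019) apply the prior may differ; the function below simply rescales query/key embeddings by a prior map before computing the affinity, and all names are hypothetical):

```python
import torch

def spatially_gated_affinity(q: torch.Tensor, k: torch.Tensor,
                             prior: torch.Tensor) -> torch.Tensor:
    """Non-local affinity on spatially gated embeddings (sketch).

    q, k:   (B, N, C') query/key embeddings.
    prior:  (B, N) external spatial prior in [0, 1], e.g., a Grad-CAM map
            resized and flattened to the feature resolution.
    """
    gate = prior.unsqueeze(-1)                 # (B, N, 1)
    q_g, k_g = q * gate, k * gate              # attenuate embeddings outside salient regions
    return torch.softmax(torch.bmm(q_g, k_g.transpose(1, 2)), dim=-1)  # (B, N, N)
```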
4. Application Domains and Benchmark Results
Non-local blocks and their derivatives have demonstrated consistent state-of-the-art improvements across diverse modalities:
| Application | Backbone/Framework | Gain from NL | Gain from Variant Block | Source |
|---|---|---|---|---|
| Video Classification | ResNet-50, Kinetics | +2.0% top-1 | +2–3% for PNL, SNL, DNL | (Wang et al., 2017, Xu et al., 2020, Yin et al., 2020) |
| Fashion Landmark | ResNet-101+FPN, DeepFashion-C | -0.0012 normalized error (NL) | -0.0033 normalized error (SANL) | (Li et al., 2019) |
| COCO Detection | Mask R-CNN, ResNet-50 | +1.0 AP | +1.1 AP (GC), +0.7–1.3 AP (DNL) | (Cao et al., 2019, Yin et al., 2020) |
| Segmentation | ResNet-101, Cityscapes | +2.7% mIoU | +4.7% (DNL), +5.6% (Denoised NL) | (Yin et al., 2020, Song et al., 2021) |
| Image Restoration | ResNet+, Smoothing/Denoising | +1 dB PSNR | +0.09–0.32 dB (PNB) | (Zhu et al., 2020) |
In compressed sensing and entropy modeling, non-local blocks enable networks to exploit non-local self-similarity for higher-fidelity reconstructions, with explicit domain-specific affinities and auxiliary loss terms to enforce affinity symmetry (Cui et al., 2021, Li et al., 2020).
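One simple way to realize such a symmetry constraint is an auxiliary penalty on the affinity matrix; the sketch below is an assumed formulation of the idea, not the exact loss used in the cited works.

```python
import torch

def affinity_symmetry_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalize asymmetry of the pairwise affinity matrix (sketch).

    attn: (B, N, N) affinity matrices; mutual self-similarity suggests
    f(x_i, x_j) should roughly equal f(x_j, x_i).
    """
    return ((attn - attn.transpose(1, 2)) ** 2).mean()
```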
5. Computational Considerations and Implementation Strategies
A vanilla non-local block with $N$ positions and $C$ channels incurs $\mathcal{O}(N^2 C)$ compute and $\mathcal{O}(N^2)$ memory due to the dense $N \times N$ affinity matrix. Critical strategies for tractability include (a rough worked example follows the list):
- Bottlenecking channel dimensions via $1 \times 1$ convolutions (e.g., reducing the embedding width to $C/2$)
- Subsampling spatially (downsampled key/value embeddings $\phi$, $g$, or pyramidal reference sets)
- Query-independent (global pooling) simplifications
- Carefully placing blocks at middle or top layers of deep networks for maximal gain per cost
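As a rough worked example of how these strategies compound (feature-map size and channel counts are illustrative assumptions, not values from the cited papers):

$$
\begin{aligned}
&N = 64 \times 64 = 4096,\ C = 1024: && \text{the dense affinity has } N^2 \approx 1.7 \times 10^{7} \text{ entries};\\
&\text{bottleneck to } C' = C/2 = 512: && \text{halves the } \mathcal{O}(N^2 C') \text{ aggregation cost};\\
&\text{pooled references, } S = 110 \ll N: && \text{the affinity shrinks to } N \times S \approx 4.5 \times 10^{5} \text{ entries}.
\end{aligned}
$$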
Pyramid and asymmetric designs allow insertion into high-resolution, low-level tasks that would otherwise be intractable. Simplified blocks (GC, SNL) are suitable for deep or resource-constrained scenarios, as they match or exceed NL performance at a fraction of the computational overhead (Cao et al., 2019, Xu et al., 2020, Zhu et al., 2020, Zhu et al., 2019).
6. Theoretical Perspectives and Broader Implications
Non-local blocks admit unification under the framework of fully connected graph filters, being interpretable as spectral graph operators with learned affinity kernels. Through Chebyshev polynomial approximation, one can generalize NL to higher-order spectral blocks (SNL), enabling flexible and stable graph-based filtering (Zhu et al., 2021). The addition of external priors, local/global separation, and scale-adaptive fusions extends the block's reach beyond classical self-attention paradigms.
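Concretely, writing the row-normalized affinity as the adjacency matrix $A$ of a fully connected graph over positions (the notation below is assumed for illustration and is not taken verbatim from the cited work), the non-local aggregation and its spectral generalization can be summarized as:

$$
Y = A\, X\, W_g \qquad\longrightarrow\qquad Y_K = \sum_{k=0}^{K} T_k\!\left(\tilde{L}\right) X\, W_k,
$$

where $\tilde{L}$ is a rescaled graph Laplacian built from $A$ and $T_k$ are Chebyshev polynomials; the vanilla block corresponds to a low-order special case of this filter family.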
Integrating data-driven or task-specific spatial priors into the similarity computation focuses network capacity on semantically or spatially important regions, which has garnered robust empirical gains and suggests a new design pattern for attention-based modules across vision models (Li et al., 2019).
7. Summary Table: Notable Non-Local Block Variants
| Variant | Key Architectural Feature(s) | Principal Gains | Reference |
|---|---|---|---|
| Vanilla NL | Bidirectional, dense pairwise affinity | Long-range dependency, SOTA on video/CV | (Wang et al., 2017) |
| SANL | Spatial map prior (Grad-CAM) injects bias | Sharpened, task-driven attention | (Li et al., 2019) |
| Pyramid NL | Multi-scale, pooled references | Efficient regional/global correlation | (Xu et al., 2020, Zhu et al., 2020) |
| APNB/AFNB | Asymmetric pyramid-pooled keys/values; cross-level fusion | Large compute/memory savings | (Zhu et al., 2019) |
| DNL | Disentangles pairwise and unary streams | Improved semantic specialization | (Yin et al., 2020) |
| Denoised NL | Denoises via class gating + local fusion | SOTA semantic segmentation accuracy | (Song et al., 2021) |
| GC/SNL | Global pooling, query independence, graph-based filtering | Maintains gains at much lower cost | (Cao et al., 2019, Zhu et al., 2021) |
The continued evolution of non-local blocks evidences their significance as universal operators for context modeling in deep networks, enabling long-range interaction, adaptability, and efficiency across a spectrum of vision tasks.