
Cross-Scale Neighborhood Attention

Updated 25 November 2025
  • Cross-scale neighborhood attention is a framework that aggregates representations over diverse neighborhoods defined by multiple spatial, structural, or hierarchical scales.
  • It leverages per-scale Q/K/V projections and adaptive fusion techniques, such as soft-attention, to balance computational efficiency with enriched feature expressivity.
  • The approach has demonstrated performance gains in applications like graph classification, object detection, LiDAR compression, and video localization.

Cross-Scale Neighborhood Attention is a unifying paradigm for local and global context aggregation in neural architectures, wherein attention mechanisms are applied not over fixed-span neighborhoods but across multiple spatial, structural, or hierarchical scales. This formulation enables models to adaptively capture fine- and coarse-grained dependencies, integrate multi-stage information, and balance computational efficiency against representational expressivity. Cross-scale neighborhood attention is implemented across diverse domains—including graph representation learning, vision transformers, LiDAR geometry compression, crowd video understanding, and foundation model feature upsampling—by leveraging explicit multi-scale Q/K/V projections, alternating or compositional local/global attention, and adaptive scale-mixing via parameterized fusion or attention gating.

1. Mathematical Foundations and General Mechanisms

At the core of cross-scale neighborhood attention is the aggregation of representations over neighborhoods defined at multiple scales—where scale may refer to spatial, topological, temporal, or hierarchical extents, depending on the signal domain. Across all implementations, three core elements recur:

  • Neighborhood Definition: For graphs, this is the $h$-hop neighborhood $\mathcal{N}_i^{(h)} = \{v_j : \text{dist}(v_i, v_j) \leq h\}$, covering increasing structural radius (Li et al., 2022). In images, neighborhoods range from local windows (radius $r$) to strided/dilated variants (dilation $\delta$) or Gaussian-sampled patches (variance $\gamma$) (Wang et al., 2021, Hassani et al., 2022, Li et al., 2021).
  • Per-Scale Q/K/V Projections: For each scale $h$, distinct $Q_h$, $K_h$, $V_h$ are computed by projecting feature representations—often after an explicit multi-scale embedding or pooling operation (Li et al., 2022, Wang et al., 2023, Chambon et al., 23 Nov 2025, Shang et al., 2023).
  • Attention Computation and Aggregation: Attention within scale $h$ is computed as a standard scaled dot-product, but over the defined neighborhood rather than globally. The outputs across all scales are then fused, typically using learned weights, soft-attention, or summation (Li et al., 2022, Wang et al., 2021).

The general formula, seen for graphs, vision, and upsampling, is
$$\text{Attn}^{(h)}_i = \sum_{j \in \mathcal{N}_i^{(h)}} \alpha_{ij}^{(h)} V_j^{(h)}, \qquad \alpha_{ij}^{(h)} = \mathrm{softmax}_{j \in \mathcal{N}_i^{(h)}}\left(\frac{Q_i^{(h)} \cdot K_j^{(h)}}{\sqrt{d'}}\right).$$

For cross-scale fusion, the per-scale outputs $\{z_i^{(h)}\}_{h=0}^{H}$ are combined via adaptive soft-attention:
$$H_i = \sum_{h=0}^{H} \beta_i^{(h)} z_i^{(h)},$$
where $\boldsymbol{\beta}_i$ are normalized scale-importance weights, e.g., obtained via softmax or learned slot attention (Li et al., 2022).
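
As a concrete illustration, the following is a minimal NumPy sketch of this two-step computation: per-scale attention restricted to a precomputed neighborhood index set, followed by soft-attention fusion of the scale outputs. The function names, toy shapes, and the tanh nonlinearity (standing in for the $\sigma$ in the gating formula) are illustrative assumptions, not taken from any of the cited implementations.

```python
import numpy as np

def neighborhood_attention(X, neighbors, Wq, Wk, Wv):
    """Scaled dot-product attention for one scale, restricted to each
    node/pixel's neighborhood. X: (N, d) features; neighbors[i] is the
    index array of j in N_i^(h)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # per-scale projections
    d_prime = Q.shape[1]
    out = np.zeros_like(V)
    for i, nbrs in enumerate(neighbors):
        logits = K[nbrs] @ Q[i] / np.sqrt(d_prime)    # (|N_i|,)
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                          # softmax over the neighborhood only
        out[i] = alpha @ V[nbrs]                      # weighted sum of neighbor values
    return out

def soft_scale_fusion(Z, W, w):
    """Adaptive fusion of per-scale outputs.
    Z: (H+1, N, d) stacked scale outputs; W: (d, d); w: (d,)."""
    e = np.tanh(Z @ W) @ w                            # (H+1, N) scale-importance logits
    beta = np.exp(e - e.max(axis=0, keepdims=True))
    beta /= beta.sum(axis=0, keepdims=True)           # softmax over scales, per node
    return (beta[..., None] * Z).sum(axis=0)          # (N, d) fused representation

# Toy usage: 6 tokens, 8-dim features, two scales with different neighborhoods.
# For brevity the same projection weights are reused across scales here;
# the cited models learn separate Q/K/V projections per scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
near = [np.array([i, (i + 1) % 6]) for i in range(6)]    # small neighborhood
far = [np.arange(6) for _ in range(6)]                   # larger neighborhood
Z = np.stack([neighborhood_attention(X, n, Wq, Wk, Wv) for n in (near, far)])
H = soft_scale_fusion(Z, rng.normal(size=(8, 8)), rng.normal(size=8))
print(H.shape)  # (6, 8)
```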

2. Domain-Specific Implementations

Graphs: Multi-Neighborhood Attention for Adaptive Structural Encoding

In graph transformers, Multi-Neighborhood Attention (MNA-GT) (Li et al., 2022) employs explicit $h$-hop adjacency powers to generate input matrices $X^{(h)} = \bar{A}^h X$, with per-scale Q/K/V projections. Multiple parallel attention heads per scale yield $z_i^{(h)}$ per node, which are fused through a slot-attention-derived vector $\boldsymbol{\beta}_i$. Soft-attention-based fusion outperforms naive sum/mean/concat by 0.8–2.1% accuracy, and models with 2–3 scales (hops) achieve optimal structural representational capacity.
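
A minimal sketch of the multi-hop input construction, assuming $\bar{A}$ is the symmetrically normalized adjacency with self-loops (MNA-GT's exact normalization may differ); each $X^{(h)}$ would then feed a scale-specific Q/K/V projection as above.

```python
import numpy as np

def multi_hop_inputs(A, X, H=3):
    """Build per-scale input matrices X^(h) = A_bar^h X from adjacency A (N, N)
    and node features X (N, d), using a normalized adjacency with self-loops."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_bar = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    inputs, Xh = [X], X
    for _ in range(1, H + 1):
        Xh = A_bar @ Xh                            # propagate one more hop
        inputs.append(Xh)
    return inputs                                  # [X^(0), X^(1), ..., X^(H)]

# Toy graph: 4 nodes on a path, 5-dim features, scales h = 0..2.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 5))
scales = multi_hop_inputs(A, X, H=2)
print([s.shape for s in scales])  # [(4, 5), (4, 5), (4, 5)]
```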

Vision Transformers: Local–Dilated and Cross-Scale Windows

Vision models leverage spatial locality and global context through several cross-scale attention variants:

  • Dilated Neighborhood Attention (DiNA) alternates dense ($\delta = 1$) local windows with dilated (sparse, $\delta > 1$) neighborhoods, granting an exponentially growing receptive field with linear complexity (Hassani et al., 2022). Alternation strictly improves accuracy over purely local or purely dilated stacking (a sketch of dilated neighborhood indexing follows this list).
  • CrossFormer/CrossFormer++ (Wang et al., 2021, Wang et al., 2023) introduces a Cross-Scale Embedding Layer (CEL) to explicitly concatenate multi-scale convolutions (kernel sizes $\{4, 8, 16, 32\}$) per stage, followed by alternating Short-Distance Attention (local $G \times G$ windows) and Long-Distance Attention (strided global non-local groups). Group sizes are managed by a Progressive Group Size (PGS) heuristic.
  • Multi-Stage Cross-Scale Attention (MSCSA) (Shang et al., 2023) pools and concatenates features across backbone stages, then applies attention over multiple downsampled versions (e.g., $1\times$, $2\times$, $3\times$) as K/V banks—enabling both intra-scale and inter-scale message passing.
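
To make the local/dilated alternation concrete, below is a small, self-contained sketch (not the fused CUDA kernels used by the official DiNA implementation) of how a dilated neighborhood index set can be generated on a 1D token grid; the same idea extends to 2D windows.

```python
def dilated_neighborhood(i, length, radius, dilation):
    """Indices of the neighborhood of token i on a 1D grid of `length` tokens:
    2*radius + 1 neighbors spaced `dilation` apart, clamped to the sequence
    (a simplification of boundary handling)."""
    idx = [i + dilation * o for o in range(-radius, radius + 1)]
    return [min(max(j, 0), length - 1) for j in idx]

# Dense (delta=1) vs. dilated (delta=3) neighborhoods for token 10 of 32:
print(dilated_neighborhood(10, 32, radius=2, dilation=1))  # [8, 9, 10, 11, 12]
print(dilated_neighborhood(10, 32, radius=2, dilation=3))  # [4, 7, 10, 13, 16]
```

The neighbor count stays fixed at $2r+1$ while the spatial span grows to roughly $(2r+1)\delta$, which is what lets alternating dense and dilated layers expand the receptive field without raising per-token cost.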

Point Clouds: Adaptive KNN Attention in Dyadic Octree Hierarchies

In LiDAR geometry compression, cross-scale neighborhood attention is realized through stacked Neighborhood Point Attention (NPA) modules (Xue et al., 2022). At each octree scale, $k$-nearest neighbors per point are aggregated via multi-head attention, and upsampled features condition occupancy predictions at finer scales. Cross-scale information flows via auxiliary sparse transposed convolutions, with NPA accounting for both same-scale and cross-scale dependencies. Ablation reveals $k=16$ and $H=4$ heads as optimal; NPA reduces BD-rate by 17% and achieves a 640× decoding speedup over full-attention octree codecs.
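
A minimal sketch of the kNN gathering and single-head attention step such a module relies on, using a brute-force distance computation for clarity (practical codecs use sparse-tensor data structures and multi-head projections); the helper names and shapes are illustrative only.

```python
import numpy as np

def knn_indices(points, k):
    """Brute-force k-nearest neighbors (including the point itself).
    points: (N, 3) coordinates. Returns (N, k) neighbor indices."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    return np.argsort(d2, axis=1)[:, :k]

def knn_attention(feats, points, Wq, Wk, Wv, k=16):
    """Single-head neighborhood point attention over each point's kNN set."""
    nbrs = knn_indices(points, k)                       # (N, k)
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    d = Q.shape[1]
    logits = np.einsum('nd,nkd->nk', Q, K[nbrs]) / np.sqrt(d)
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)           # softmax over each kNN set
    return np.einsum('nk,nkd->nd', alpha, V[nbrs])      # aggregated per-point features

rng = np.random.default_rng(0)
pts = rng.uniform(size=(100, 3))
feats = rng.normal(size=(100, 32))
W = [rng.normal(size=(32, 32)) for _ in range(3)]
out = knn_attention(feats, pts, *W, k=16)
print(out.shape)  # (100, 32)
```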

Video and Spatiotemporal: Multi-Focus Gaussian Neighborhood Attention

In video crowd localization, Gaussian Neighborhood Attention (GNA) (Li et al., 2021) utilizes spatially local, randomly sampled neighborhoods with multi-focus Gaussian kernels (scales $\gamma_1, \ldots, \gamma_F$), averaging the outputs to handle scale variation due to perspective. GNA is applied in scene modeling and context cross-attention, combining spatial–temporal neighborhoods for framewise and sequence-level reasoning.
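
A rough sketch of the sampling idea (offsets drawn from Gaussians of increasing variance, one per focus level, with downstream attention outputs averaged across levels); the function names, fixed sample count, and example variances are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def gaussian_neighborhood(center, shape, gamma, n_samples, rng):
    """Sample neighbor coordinates around `center` (row, col) from an isotropic
    Gaussian with standard deviation `gamma`, clamped to the frame bounds."""
    offsets = rng.normal(scale=gamma, size=(n_samples, 2))
    coords = np.round(np.asarray(center) + offsets).astype(int)
    return np.clip(coords, [0, 0], [shape[0] - 1, shape[1] - 1])

def multi_focus_coords(center, shape, gammas=(2.0, 4.0, 8.0), n_samples=16, seed=0):
    """One sampled neighbor set per focus level; attention outputs computed over
    these sets would then simply be averaged, as in the multi-focus design."""
    rng = np.random.default_rng(seed)
    return [gaussian_neighborhood(center, shape, g, n_samples, rng) for g in gammas]

sets = multi_focus_coords(center=(60, 80), shape=(120, 160))
print([s.shape for s in sets])  # [(16, 2), (16, 2), (16, 2)]
```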

Feature Upsampling: Zero-Shot Cross-Scale Neighborhood Filtering

Neighborhood Attention Filtering (NAF) (Chambon et al., 23 Nov 2025) addresses the generalization bottleneck in VFM upsampling by attending per-pixel (high-res) queries over a cross-scale local window in low-res feature space. RoPE is applied to both queries (from high-res guidance) and keys (low-res grid), with aggregation restricted to a (2R+1)2(2R+1)^2 window around the parent mapping ⌊p/s⌋\lfloor p/s\rfloor. This architectural choice decouples NAF from VFM internals, enabling zero-shot upsampling transferable across foundation models.
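
The cross-scale window construction can be sketched as follows: each high-res pixel $p$ is mapped to its parent low-res cell $\lfloor p/s \rfloor$ and attends only to the $(2R+1)^2$ low-res cells around it. This is a simplified illustration (omitting the RoPE terms and learned query/key projections), and all names and the example stride are placeholders.

```python
import numpy as np

def cross_scale_window(p, s, R, lowres_shape):
    """Low-res (row, col) coordinates attended to by high-res pixel p.
    p: (row, col) in the high-res grid; s: down-scaling factor;
    R: window radius in low-res cells; lowres_shape: (H_L, W_L)."""
    cy, cx = p[0] // s, p[1] // s                  # parent mapping floor(p / s)
    rows = np.clip(np.arange(cy - R, cy + R + 1), 0, lowres_shape[0] - 1)
    cols = np.clip(np.arange(cx - R, cx + R + 1), 0, lowres_shape[1] - 1)
    return [(r, c) for r in rows for c in cols]    # (2R+1)^2 key/value positions

# High-res pixel (37, 52) with an assumed ViT-style stride s=14 and R=1:
window = cross_scale_window((37, 52), s=14, R=1, lowres_shape=(16, 16))
print(len(window))  # 9 low-res cells around parent cell (2, 3)
```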

3. Computational Properties and Complexity Analysis

Cross-scale neighborhood attention is explicitly designed to trade off the quadratic cost of full self-attention against the restrictive bias of single-scale convolutions or sliding windows. Local windowing, strided/dilated selection, and sampled Gaussian neighborhoods all reduce the effective receptive-field cardinality per token from $O(N)$ (full) to $O(k)$, while still mixing information across scales (Hassani et al., 2022, Li et al., 2021). For example, DiNA's window with radius $r$ and dilation factor $\delta$ yields $k = (2r+1)^2$ neighbors, but the spatial span grows as $(2r+1)\delta$; similarly, NAF's per-pixel cost is $O(N_H k^2 (C+d))$ against $O(N_H N_L C)$ for full attention (Chambon et al., 23 Nov 2025).
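
As a back-of-the-envelope comparison (the constants below are illustrative, not drawn from the cited papers), the snippet contrasts the per-token neighbor count of a windowed scheme against full attention on a typical feature map.

```python
# Illustrative cost comparison: windowed vs. full attention on a 64x64 token grid.
N = 64 * 64                      # tokens in the feature map
r, delta = 3, 4                  # window radius and dilation factor
k = (2 * r + 1) ** 2             # neighbors per token: 49, independent of dilation
span = (2 * r + 1) * delta       # spatial span covered by the dilated window: 28

pairs_full = N * N               # token pairs scored by full self-attention
pairs_local = N * k              # token pairs scored by neighborhood attention
print(k, span, pairs_full // pairs_local)  # 49 28 83  (~83x fewer attention scores)
```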

Adaptive fusion further reuses learned soft-attention weights across scales, incurring only a minor overhead (few parameters) relative to the main attention block (Li et al., 2022).

4. Scale Fusion and Adaptive Weighting

A central motif is the adaptive fusion of multiple scale-specific representations. Techniques include:

  • Slot Attention/Self-Attention Over Scales: Producing scalar importance weights $\beta_i^{(h)}$ per node (or per spatial location), learned via feedforward projections or shallow attention (e.g., a softmax over $e_i^{(h)} = \sigma(z_i^{(h)} W)\, w^T$) (Li et al., 2022).
  • Multi-Focus Averaging: In GNA, scale outputs $A_f(x)$ are simply averaged; this suggests that, for video aggregation tasks with strong domain priors, uniform weighting across focus levels is sufficient (Li et al., 2021).
  • Implicit Fusion via Window Alternation: Some models (e.g., DiNAT) alternate local and dilated layers, relying on the residual pathway of hierarchical transformers for implicit multi-scale aggregation (Hassani et al., 2022, Wang et al., 2021).

Ablations consistently show that learned fusion (soft-attention) outperforms mean/sum/concat by 0.8–2.1% on standard benchmarks (Li et al., 2022).

5. Applications and Empirical Outcomes

Cross-scale neighborhood attention enables models to:

  • Capture interactions at variable distances (e.g., across graph hops, image patch sizes, point cloud levels, or temporal/spatial video extents).
  • Generalize to both local fine-grained and long-range semantic cues without incurring the full computational overhead of global attention.
  • Support plug-and-play architectural enhancements, as seen with MSCSA and NAF, which consistently yield 0.5–2.5% improvement in key downstream metrics (e.g., mIoU, AP) with a modest (<13%) increase in FLOPs (Shang et al., 2023, Chambon et al., 23 Nov 2025).
  • Achieve new state-of-the-art results in graph classification, object detection, segmentation, LiDAR geometry compression, crowd localization, and feature upsampling, outperforming both convolutional and previous transformer baselines (Li et al., 2022, Hassani et al., 2022, Chambon et al., 23 Nov 2025, Xue et al., 2022, Li et al., 2021).

| Architecture | Domain | Scales/Fusion | Key Gain |
| --- | --- | --- | --- |
| MNA-GT (Li et al., 2022) | Graphs | h-hop (0–3) | +1–3% accuracy |
| DiNAT (Hassani et al., 2022) | Vision | Local/dilated | +1.6 box AP |
| CrossFormer++ (Wang et al., 2023) | Vision | Embedding+/PGS | +0.7–1.0% mIoU |
| NPAFormer (Xue et al., 2022) | Point cloud | kNN cross-scale | –17% BD-rate, 640× speed |
| GNA (Li et al., 2021) | Video | Multi-focus γ | +3.3% mAP (localization) |
| NAF (Chambon et al., 23 Nov 2025) | Upsampling | Local window, RoPE | Zero-shot generalization |

6. Domain-Specific Nuances, Limitations, and Practicalities

  • For graphs, fusion of more than 2–3 hop-scales introduces redundancy and hurts accuracy; the learned fusion is critical (Li et al., 2022).
  • For vision, CrossFormer's CEL provides explicit multi-scale context, but the associated large-kernel convolutions must be dimensioned efficiently to avoid excess FLOPs (Wang et al., 2023).
  • For LiDAR and point clouds, kNN-based NPA scales efficiently but requires support for irregular sparse tensors and adaptive neighborhood search (Xue et al., 2022).
  • For feature upsampling, using high-res guidance with cross-scale local attention via RoPE achieves sub-pixel accuracy and is decoupled from the underlying feature distribution, enabling plug-and-play upsampling (Chambon et al., 23 Nov 2025).
  • For temporal tasks, the multi-focus mechanism addresses scale variation due to perspective or motion, with random sampling controlling memory overhead (Li et al., 2021).

This suggests that deployment details—such as choice of neighborhood size, dilation, number of heads, and fusion strategy—must be tuned to data modality and application for optimal performance.

7. Theoretical and Practical Impact

Cross-scale neighborhood attention mechanisms unify a spectrum of local-to-global context modeling strategies while retaining computational tractability. They have demonstrated empirical benefits across vision, graphs, point clouds, and multimodal upsampling, supporting both end-to-end training and modular integration. The framework naturally accommodates advances in positional encoding (e.g., RoPE for relative displacement sensitivity (Chambon et al., 23 Nov 2025)) and flexible architectural depth (alternating or stackable layers), and is robust to common artifacts such as amplitude explosion and scale redundancy via explicit normalization and adaptive fusion (Wang et al., 2023). As new domains demand richer multi-scale interactions at scale, cross-scale neighborhood attention remains a central mechanism in modern deep learning architectures.
