Interleaved Local-Global Attention

Updated 9 April 2026
  • Interleaved local-global attention is a hybrid mechanism that combines local neighborhood focus with global context aggregation to model fine-grained and long-range dependencies.
  • It employs sequential, parallel, and hierarchical interleaving strategies with fusion techniques such as softmax-gated weighted sums to balance computational cost and modeling power.
  • Empirical evaluations demonstrate improved accuracy and efficiency across varied applications including vision transformers, graph networks, and audio models.

Interleaved local-global attention comprises a class of attention mechanisms and network designs that explicitly model both fine-grained local dependencies and broad global relationships within and across layers, blocks, or attention heads. This paradigm has been instantiated across vision, language, graph, audio, and multimodal domains to overcome the inherent limitations of purely local or purely global attention—balancing inductive bias, spatial/temporal resolution, modeling power, and computational tractability.

1. Core Principles and Architectural Strategies

A fundamental principle of interleaved local-global attention is the decomposition and recombination of context: local attention modules restrict interactions to neighborhood subsets or spatially coherent windows, exploiting local structure for efficient computation; global attention modules allow tokens or features to aggregate information from non-local or all positions, supporting long-range dependency modeling. Interleaving refers to their sequential, parallel, or hybrid integration at the network, block, or head level.

Several interleaving strategies appear in the literature: sequential stacking of local and global layers, parallel branches or per-head splits within a block, and hierarchical schemes that alternate across resolution levels or network stages.

Fusion of local and global representations typically involves linear interpolation, softmax-gated weighted sums, or cross-attention layers; adaptive mechanisms based on data quality, feature norm, or learned scaling often modulate the relative contributions.
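
To make the gating concrete, the following is a minimal PyTorch sketch of a softmax-gated weighted sum of a local and a global attention output; the module name GatedFusion, the per-token gate predicted from concatenated branch features, and all shapes are illustrative assumptions rather than the formulation of any cited paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Softmax-gated weighted sum of local and global attention outputs (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        # Predicts two gate logits per token from the concatenated branch outputs.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, local_out: torch.Tensor, global_out: torch.Tensor) -> torch.Tensor:
        # local_out, global_out: (batch, tokens, dim)
        logits = self.gate(torch.cat([local_out, global_out], dim=-1))  # (B, N, 2)
        w = logits.softmax(dim=-1)               # convex per-token mixing weights
        return w[..., :1] * local_out + w[..., 1:] * global_out

# Usage: fuse two branch outputs for 196 tokens of width 64.
fusion = GatedFusion(dim=64)
y = fusion(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
print(y.shape)  # torch.Size([2, 196, 64])
```

Adaptive variants replace the learned linear gate with weights derived from feature statistics, as noted above.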

2. Mathematical Formulations

The attention computation in interleaved schemes generally builds on the standard multi-head self-attention mechanism:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d_k}} + B \right) V$$

where $Q, K, V$ are the queries, keys, and values, $d_k$ is the per-head dimension, and $B$ is a position bias.

Local attention restricts $Q$ and $K, V$ to neighborhoods (windows of size $s_p \times s_p$ in vision, or region graphs in point clouds), using binary masks or windowed partitioning:

$$Q_i = f_q\left(x^1_{W_i}\right), \qquad K_i, V_i = f_k\left(x^1_{\mathcal{N}(W_i)}\right),\; f_v\left(x^1_{\mathcal{N}(W_i)}\right)$$

with $\mathcal{N}(W_i)$ denoting the neighborhood of window $W_i$ (Yang et al., 2021, Zhou et al., 2021, Zhang et al., 2022).
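
As a point of reference, below is a minimal sketch of non-overlapping windowed self-attention on a flattened 2D feature map, instantiating the restriction above for a single head; the relative position bias $B$ is omitted, and the window_attention helper and all shapes are illustrative assumptions, not code from the cited works.

```python
import torch
import torch.nn as nn

def window_attention(x: torch.Tensor, qkv: nn.Linear, window: int) -> torch.Tensor:
    """Single-head local self-attention within non-overlapping window x window patches.

    x: (B, H, W, C) feature map with H and W divisible by `window`;
    qkv: a Linear(C, 3*C) projection shared across windows.
    """
    B, H, W, C = x.shape
    # Partition the map into (B * num_windows, window*window, C) token groups.
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    q, k, v = qkv(x).chunk(3, dim=-1)
    attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)   # softmax(QK^T / sqrt(d)) V, per window
    out = attn.softmax(dim=-1) @ v
    # Reverse the window partition back to (B, H, W, C).
    out = out.view(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Usage: an 8x8 map, 4x4 windows, 64 channels.
proj = nn.Linear(64, 3 * 64)
y = window_attention(torch.randn(2, 8, 8, 64), proj, window=4)
print(y.shape)  # torch.Size([2, 8, 8, 64])
```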

Global attention can take various forms: full self-attention over all positions, attention over pooled coarse-scale summaries (Yang et al., 2021), axial stripe attention along rows and columns (Zhang et al., 2022), or linear global attention across all nodes of a graph (Wang et al., 18 Sep 2025).

Fusion is often implemented as a linear interpolation or softmax-gated weighted sum of the local and global attention outputs; more advanced forms use adaptive weights derived from feature norms or scale statistics (Yu et al., 2024, Shao, 2024).

Parallel or per-head interleaving assigns each attention head its own context: within a block, different heads compute attention over different window sizes, with some heads local and others effectively global (Yadav et al., 2023, Zhang et al., 2022).
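
A minimal sketch of such per-head interleaving on a 1D token sequence follows; each head is masked to a band of keys within a head-specific distance (None meaning a global head). The masking scheme and the head-to-window assignment are illustrative assumptions, not the exact construction of the cited models.

```python
import torch

def multi_window_attention(q, k, v, windows):
    """Self-attention where head h only attends to keys within windows[h] positions.

    q, k, v: (B, num_heads, N, d_head); windows: list of ints or None (global head).
    """
    B, H, N, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / (d ** 0.5)       # (B, H, N, N)
    idx = torch.arange(N, device=q.device)
    dist = (idx[:, None] - idx[None, :]).abs()            # |i - j| key distances
    for h, w in enumerate(windows):
        if w is not None:                                  # None => global head
            scores[:, h].masked_fill_(dist > w, float("-inf"))
    return scores.softmax(dim=-1) @ v                      # (B, H, N, d_head)

# Usage: 4 heads -- two local (windows 2 and 8) and two global.
B, N, d = 2, 32, 16
q, k, v = (torch.randn(B, 4, N, d) for _ in range(3))
out = multi_window_attention(q, k, v, windows=[2, 8, None, None])
print(out.shape)  # torch.Size([2, 4, 32, 16])
```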

3. Domain-Specific Implementations

3.1 Vision Transformers

  • Focal Transformer: Each layer creates multi-scale representations, attending locally at fine granularity and globally at coarser pooled scales, with all context sets concatenated per query and jointly normalized by softmax (Yang et al., 2021).
  • Axially Expanded Windows (AEWin): Heads are partitioned to compute local window self-attention, horizontal stripe (row-wise), and vertical stripe (column-wise) attention in parallel, fusing outputs at each block (Zhang et al., 2022).
  • Global-Local Fusion via Cross-Attention: Models such as DALG use separate local (windowed) and global (whole-image) branches whose outputs are fused via a stack of cross-attention modules (Song et al., 2022).
  • Early Local-Global Integration: Some models, such as Locally Shifted Attention (LSA) (Sheynin et al., 2021), first aggregate small, locally shifted patches per location via local self-attention, then apply global self-attention on the resulting virtual patch tokens; a generic sketch of this local-then-global alternation follows after this list.
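
The alternation shared by several of these designs can be sketched generically; the following stack alternates windowed attention layers with full global attention layers on a 1D token sequence. It is a schematic of sequential interleaving under assumed shapes and layer choices, not a reimplementation of Focal Transformer, AEWin, DALG, or LSA.

```python
import torch
import torch.nn as nn

class InterleavedBlockStack(nn.Module):
    """Alternates local (windowed) and global multi-head attention layers (sketch).

    Even layers attend within fixed non-overlapping windows, odd layers attend globally.
    """
    def __init__(self, dim: int, depth: int, heads: int, window: int):
        super().__init__()
        self.window = window
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C), N % window == 0
        B, N, C = x.shape
        for i, (attn, norm) in enumerate(zip(self.layers, self.norms)):
            h = norm(x)
            if i % 2 == 0:  # local layer: fold windows into the batch dimension
                hw = h.reshape((B * N) // self.window, self.window, C)
                out, _ = attn(hw, hw, hw)
                out = out.reshape(B, N, C)
            else:           # global layer: full attention over all tokens
                out, _ = attn(h, h, h)
            x = x + out     # residual connection
        return x

# Usage: 4-layer stack (local, global, local, global) on 64 tokens of width 96.
stack = InterleavedBlockStack(dim=96, depth=4, heads=4, window=8)
print(stack(torch.randn(2, 64, 96)).shape)  # torch.Size([2, 64, 96])
```

Real designs differ in where the alternation happens (stage, block, or head level) and in how windows are shifted or pooled between layers.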

3.2 Graph Networks

  • Global-to-Local Attention in Graph Transformers: G2LFormer applies global linear attention first (across all nodes), then local neighborhood GNN layers per node, with cross-layer fusion (NOSAF) mixing global and local representations hierarchically for each node (Wang et al., 18 Sep 2025).

3.3 Multimodal and LLMs

  • Vision-LLMs: AGLA (Assembly of Global and Local Attention) computes global image features and prompt-guided local features, combining them by fusing logit distributions at each decoding step to mitigate hallucinations (An et al., 2024).
  • Relation Classification: GLA-BiGRU merges global attention (full context) and local attention (dependency-path tokens or learned localization) via weighted interpolation, with soft/hard localization to extract salient components (Sun, 2024).

3.4 Audio Transformers

  • Multi-Window Masked Autoencoder (MW-MAE): In each transformer block, heads are assigned distinct attention windows; some attend locally, others globally, decoupled at the head level, yielding parallel multi-context fusion during decoding (Yadav et al., 2023).

3.5 Object Detection and Small Object Recognition

  • Adaptive Blockwise Interleaving: The ILGA (Interleaved Local-Global Attention) block combines multi-scale convolution, local and global self-attention, and positional encoding, and learns to fuse local/global outputs per feature map via trainable gating (Shao, 2024).
  • Unified Interaction Modeling in ViTs: EI-ViT interleaves aggressive convolutional pooling (local-global mixing) and conceptual attention transformation (global semantic token exchange) before standard self-attention in each transformer block (Nguyen et al., 2024).

4. Computational Complexity and Efficiency

Interleaved local-global attention mitigates the prohibitive quadratic $O(N^2)$ cost of full global attention on high-dimensional data (a rough numeric comparison is sketched after the list below):

  • Windowed/Local Only: cost linear in the number of tokens $N$, on the order of $O(N \cdot w)$ per layer, where $w$ is the fixed local window size.
  • Global Only: $O(N^2)$.
  • Interleaved / Focal / Multi-window: near-linear in $N$ per block, since each query attends to a fixed number of fine local tokens plus pooled global summaries (Yang et al., 2021). In MW-MAE, each head's cost is governed by its assigned window, so the total per-block cost is summed over heads rather than uniformly quadratic (Yadav et al., 2023).
  • Hybrid Global-Local (FIT): choosing the block size appropriately yields sub-quadratic overall complexity, enabling gigapixel-scale training (Chen et al., 2023).
  • Object Detection (ILGA, AEWin): Parallel or fused block designs keep additional FLOPs minimal (<0.2 GFLOPs per attention block) relative to backbone compute (Shao, 2024, Zhang et al., 2022).
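
For a rough sense of these scalings, the snippet below counts only the query-key score FLOPs for global, windowed, and focal-style interleaved attention at one illustrative setting; the helper names, the token/window counts, and the simplified FLOP model (ignoring projections and the value aggregation) are all assumptions for illustration.

```python
# Rough attention-score FLOP counts (QK^T multiply-accumulates only); the
# constants below are illustrative, not taken from any of the cited papers.
def global_flops(n, d):
    return n * n * d                      # every query scores every key

def local_flops(n, d, w):
    return n * w * d                      # each query scores only the w keys in its window

def focal_style_flops(n, d, w, m):
    # interleaved: each query scores w fine local keys plus m pooled global summaries
    return n * (w + m) * d

n, d, w, m = 16_384, 64, 256, 512         # tokens, head dim, local window, pooled summaries
print(f"global only:         {global_flops(n, d):.2e}")             # ~1.7e+10
print(f"local only (w=256):  {local_flops(n, d, w):.2e}")           # ~2.7e+08
print(f"interleaved (w + m): {focal_style_flops(n, d, w, m):.2e}")  # ~8.1e+08
```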

5. Empirical Results and Advantages

Systematic empirical evaluation across domains establishes the advantage of interleaved local-global attention:

| Domain | Model | SOTA/Metric Gains | Key Effect |
|---|---|---|---|
| ImageNet Classification | Focal, AEWin | +1–2% Top-1 accuracy vs. window-only | Recover long-range semantics |
| Object Detection | Focal, AEWin, ILGA | +1–5 mAP, +0.7 mAP@50-95 | Robust to multi-scale, small objects |
| Image Retrieval | DALG, GLAM, ILGA | +1–10% mAP over local/global only | Joint spatial/semantic descriptors |
| Graphs | G2LFormer | Outperforms local+global and parallel baselines | Preserves local structure while propagating global context |
| Audio Modeling | MW-MAE | +1–2 points across 10 tasks | Diverse receptive field per head/block |
| Vision-Language | AGLA | -8% CHAIR_S hallucination, best on POPE/MME | Integrates prompt-relevant visual evidence |
| Medical 3D Segmentation | nnFormer | -4 mm HD95, +15% DSC vs. prior ViT | Spatially consistent volumetric predictions |

Empirical ablations consistently demonstrate that purely local or purely global attention is suboptimal. Fusing both modes, whether via parallel branches (Song et al., 2022, Yu et al., 2024), per-block gating (Shao, 2024), or per-head mixing (Yadav et al., 2023), recovers complementary information, improves convergence and robustness to missing or deformed inputs, and regularizes model behavior (e.g., less hallucination, higher recall of rare or small-category objects).

6. Theoretical Justifications and Limitations

Theoretical justifications for interleaved local-global attention are rooted in:

  • The need to preserve both locality-sensitive inductive bias and global context modeling.
  • Overcoming limitations such as over-smoothing (from excessive local pooling) and loss of fine detail (from full global integration).
  • Mitigating the cost/computation trade-off for scalable architectures.

Limitations remain:

  • Overly deep pooling stages can degrade spatial localization (Nguyen et al., 2024).
  • Increasing fusion complexity can add parameter/memory overhead, especially with high-dimensional features or many local branches (local crop count, head count).
  • Adaptive weighting or gating, when not aligned with data quality, may not yield uniform gains; certain tasks rely more on one context scale (Yu et al., 2024, Shao, 2024).

Ongoing research investigates dynamically learnable interleaving schedules, adaptive window selection, and decomposition into content-aware rather than fixed local/global groups.

7. Representative Applications and Future Directions

  • High-Resolution and Multi-Scale Tasks: Robustness in detection of small/rare objects and AI-generated imagery forensics is achieved by stratified sampling and parallel local-global aggregation (Han, 1 Jan 2026).
  • Medical Imaging: Interleaved volumetric self-attention delivers superior spatial consistency and structure preservation in 3D segmentation (Zhou et al., 2021).
  • Vision-Language Grounding: Adaptive fusion of context-dependent local features and general global features mitigates LVLM object hallucination (An et al., 2024).
  • Audio and Multimodal Representation: Multiple per-head window sizes enable fine-to-coarse spectro-temporal attribute modeling in a single transformation (Yadav et al., 2023).

Future work includes dynamic interleaving and feature-based window adaptivity, efficient cross-modal (e.g. prompt-to-image) interleaving, low-rank or sketch-based global context integration, and extensions to spatio-temporal and graph domains.


References:

(Yang et al., 2021, Zhang et al., 2022, Song et al., 2022, Shao, 2024, Nguyen et al., 2024, Chen et al., 2023, Wang et al., 18 Sep 2025, Yadav et al., 2023, Yu et al., 2024, Zhou et al., 2021, Song et al., 2021, Sheynin et al., 2021, Han, 1 Jan 2026, Li et al., 2024, An et al., 2024, Sun, 2024).
