Interleaved Local-Global Attention

Updated 9 April 2026
  • Interleaved local-global attention is a hybrid mechanism that combines local neighborhood focus with global context aggregation to model fine-grained and long-range dependencies.
  • It employs sequential, parallel, and hierarchical interleaving strategies with fusion techniques such as softmax-gated weighted sums to balance computational cost and modeling power.
  • Empirical evaluations demonstrate improved accuracy and efficiency across varied applications including vision transformers, graph networks, and audio models.

Interleaved local-global attention comprises a class of attention mechanisms and network designs that explicitly model both fine-grained local dependencies and broad global relationships within and across layers, blocks, or attention heads. This paradigm has been instantiated across vision, language, graph, audio, and multimodal domains to overcome the inherent limitations of purely local or purely global attention—balancing inductive bias, spatial/temporal resolution, modeling power, and computational tractability.

1. Core Principles and Architectural Strategies

A fundamental principle of interleaved local-global attention is the decomposition and recombination of context: local attention modules restrict interactions to neighborhood subsets or spatially coherent windows, exploiting local structure for efficient computation; global attention modules allow tokens or features to aggregate information from non-local or all positions, supporting long-range dependency modeling. Interleaving refers to their sequential, parallel, or hybrid integration at the network, block, or head level.

Several interleaving strategies appear in the literature: sequential stacking of local and global layers, parallel branches or per-head splits within a block, and hierarchical schemes that alternate across resolution levels or network stages.

Fusion of local and global representations typically involves linear interpolation, softmax-gated weighted sums, or cross-attention layers; adaptive mechanisms based on data quality, feature norm, or learned scaling often modulate the relative contributions.
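
To make the gating concrete, the following is a minimal PyTorch sketch of a softmax-gated weighted sum of a local and a global attention output; the module name GatedFusion, the per-token gate predicted from concatenated branch features, and all shapes are illustrative assumptions rather than the formulation of any cited paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Softmax-gated weighted sum of local and global attention outputs (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        # Predicts two gate logits per token from the concatenated branch outputs.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, local_out: torch.Tensor, global_out: torch.Tensor) -> torch.Tensor:
        # local_out, global_out: (batch, tokens, dim)
        logits = self.gate(torch.cat([local_out, global_out], dim=-1))  # (B, N, 2)
        w = logits.softmax(dim=-1)               # convex per-token mixing weights
        return w[..., :1] * local_out + w[..., 1:] * global_out

# Usage: fuse two branch outputs for 196 tokens of width 64.
fusion = GatedFusion(dim=64)
y = fusion(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
print(y.shape)  # torch.Size([2, 196, 64])
```

Adaptive variants replace the learned linear gate with weights derived from feature statistics, as noted above.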

2. Mathematical Formulations

The attention computation in interleaved schemes generally builds on the standard multi-head self-attention mechanism:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d_k}} + B \right) V$$

where $Q, K, V$ are the queries, keys, and values, $d_k$ is the per-head dimension, and $B$ is a position bias.

Local attention restricts $Q$ and $K, V$ to neighborhoods (windows of size $s_p \times s_p$ in vision, or region graphs in point clouds), using binary masks or windowed partitioning:

$$Q_i = f_q\left(x^1_{W_i}\right), \qquad K_i, V_i = f_k\left(x^1_{\mathcal{N}(W_i)}\right),\; f_v\left(x^1_{\mathcal{N}(W_i)}\right)$$

with $\mathcal{N}(W_i)$ denoting the neighborhood of window $W_i$ (Yang et al., 2021, Zhou et al., 2021, Zhang et al., 2022).
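
As a point of reference, below is a minimal sketch of non-overlapping windowed self-attention on a flattened 2D feature map, instantiating the restriction above for a single head; the relative position bias $B$ is omitted, and the window_attention helper and all shapes are illustrative assumptions, not code from the cited works.

```python
import torch
import torch.nn as nn

def window_attention(x: torch.Tensor, qkv: nn.Linear, window: int) -> torch.Tensor:
    """Single-head local self-attention within non-overlapping window x window patches.

    x: (B, H, W, C) feature map with H and W divisible by `window`;
    qkv: a Linear(C, 3*C) projection shared across windows.
    """
    B, H, W, C = x.shape
    # Partition the map into (B * num_windows, window*window, C) token groups.
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    q, k, v = qkv(x).chunk(3, dim=-1)
    attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)   # softmax(QK^T / sqrt(d)) V, per window
    out = attn.softmax(dim=-1) @ v
    # Reverse the window partition back to (B, H, W, C).
    out = out.view(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Usage: an 8x8 map, 4x4 windows, 64 channels.
proj = nn.Linear(64, 3 * 64)
y = window_attention(torch.randn(2, 8, 8, 64), proj, window=4)
print(y.shape)  # torch.Size([2, 8, 8, 64])
```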

Global attention can take various forms: full self-attention over all positions, attention over pooled coarse-scale summaries (Yang et al., 2021), axial stripe attention along rows and columns (Zhang et al., 2022), or linear global attention across all nodes of a graph (Wang et al., 18 Sep 2025).

Fusion is often implemented as a linear interpolation or softmax-gated weighted sum of the local and global attention outputs; more advanced forms use adaptive weights derived from feature norms or scale statistics (Yu et al., 2024, Shao, 2024).

Parallel or per-head interleaving assigns each attention head its own context: within a block, different heads compute attention over different window sizes, with some heads local and others effectively global (Yadav et al., 2023, Zhang et al., 2022).
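
A minimal sketch of such per-head interleaving on a 1D token sequence follows; each head is masked to a band of keys within a head-specific distance (None meaning a global head). The masking scheme and the head-to-window assignment are illustrative assumptions, not the exact construction of the cited models.

```python
import torch

def multi_window_attention(q, k, v, windows):
    """Self-attention where head h only attends to keys within windows[h] positions.

    q, k, v: (B, num_heads, N, d_head); windows: list of ints or None (global head).
    """
    B, H, N, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / (d ** 0.5)       # (B, H, N, N)
    idx = torch.arange(N, device=q.device)
    dist = (idx[:, None] - idx[None, :]).abs()            # |i - j| key distances
    for h, w in enumerate(windows):
        if w is not None:                                  # None => global head
            scores[:, h].masked_fill_(dist > w, float("-inf"))
    return scores.softmax(dim=-1) @ v                      # (B, H, N, d_head)

# Usage: 4 heads -- two local (windows 2 and 8) and two global.
B, N, d = 2, 32, 16
q, k, v = (torch.randn(B, 4, N, d) for _ in range(3))
out = multi_window_attention(q, k, v, windows=[2, 8, None, None])
print(out.shape)  # torch.Size([2, 4, 32, 16])
```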

3. Domain-Specific Implementations

3.1 Vision Transformers

  • Focal Transformer: Each layer creates multi-scale representations, attending locally at fine granularity and globally at coarser pooled scales, with all context sets concatenated per query and jointly normalized by softmax (Yang et al., 2021).
  • Axially Expanded Windows (AEWin): Heads are partitioned to compute local window self-attention, horizontal stripe (row-wise), and vertical stripe (column-wise) attention in parallel, fusing outputs at each block (Zhang et al., 2022).
  • Global-Local Fusion via Cross-Attention: Models such as DALG use separate local (windowed) and global (whole-image) branches whose outputs are fused via a stack of cross-attention modules (Song et al., 2022).
  • Early Local-Global Integration: Some models, such as Locally Shifted Attention (LSA) (Sheynin et al., 2021), first aggregate small, locally shifted patches per location via local self-attention, then apply global self-attention on the resulting virtual patch tokens; a generic sketch of this local-then-global alternation follows after this list.
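
The alternation shared by several of these designs can be sketched generically; the following stack alternates windowed attention layers with full global attention layers on a 1D token sequence. It is a schematic of sequential interleaving under assumed shapes and layer choices, not a reimplementation of Focal Transformer, AEWin, DALG, or LSA.

```python
import torch
import torch.nn as nn

class InterleavedBlockStack(nn.Module):
    """Alternates local (windowed) and global multi-head attention layers (sketch).

    Even layers attend within fixed non-overlapping windows, odd layers attend globally.
    """
    def __init__(self, dim: int, depth: int, heads: int, window: int):
        super().__init__()
        self.window = window
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C), N % window == 0
        B, N, C = x.shape
        for i, (attn, norm) in enumerate(zip(self.layers, self.norms)):
            h = norm(x)
            if i % 2 == 0:  # local layer: fold windows into the batch dimension
                hw = h.reshape((B * N) // self.window, self.window, C)
                out, _ = attn(hw, hw, hw)
                out = out.reshape(B, N, C)
            else:           # global layer: full attention over all tokens
                out, _ = attn(h, h, h)
            x = x + out     # residual connection
        return x

# Usage: 4-layer stack (local, global, local, global) on 64 tokens of width 96.
stack = InterleavedBlockStack(dim=96, depth=4, heads=4, window=8)
print(stack(torch.randn(2, 64, 96)).shape)  # torch.Size([2, 64, 96])
```

Real designs differ in where the alternation happens (stage, block, or head level) and in how windows are shifted or pooled between layers.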

3.2 Graph Networks

  • Global-to-Local Attention in Graph Transformers: G2LFormer applies global linear attention first (across all nodes), then local neighborhood GNN layers per node, with cross-layer fusion (NOSAF) mixing global and local representations hierarchically for each node (Wang et al., 18 Sep 2025).

3.3 Multimodal and LLMs

  • Vision-LLMs: AGLA (Assembly of Global and Local Attention) computes global image features and prompt-guided local features, combining them by fusing logit distributions at each decoding step to mitigate hallucinations (An et al., 2024).
  • Relation Classification: GLA-BiGRU merges global attention (full context) and local attention (dependency-path tokens or learned localization) via weighted interpolation, with soft/hard localization to extract salient components (Sun, 2024).

3.4 Audio Transformers

  • Multi-Window Masked Autoencoder (MW-MAE): In each transformer block, heads are assigned distinct attention windows; some attend locally, others globally, decoupled at the head level, yielding parallel multi-context fusion during decoding (Yadav et al., 2023).

3.5 Object Detection and Small Object Recognition

  • Adaptive Blockwise Interleaving: The ILGA (Interleaved Local-Global Attention) block combines multi-scale convolution, local and global self-attention, and positional encoding, and learns to fuse local/global outputs per feature map via trainable gating (Shao, 2024).
  • Unified Interaction Modeling in ViTs: EI-ViT interleaves aggressive convolutional pooling (local-global mixing) and conceptual attention transformation (global semantic token exchange) before standard self-attention in each transformer block (Nguyen et al., 2024).

4. Computational Complexity and Efficiency

Interleaved local-global attention mitigates the prohibitive quadratic $O(N^2)$ cost of full global attention on high-dimensional data (a rough numeric comparison is sketched after the list below):

  • Windowed/Local Only: cost linear in the number of tokens $N$, on the order of $O(N \cdot w)$ per layer, where $w$ is the fixed local window size.
  • Global Only: $O(N^2)$.
  • Interleaved / Focal / Multi-window: near-linear in $N$ per block, since each query attends to a fixed number of fine local tokens plus pooled global summaries (Yang et al., 2021). In MW-MAE, each head's cost is governed by its assigned window, so the total per-block cost is summed over heads rather than uniformly quadratic (Yadav et al., 2023).
  • Hybrid Global-Local (FIT): choosing the block size appropriately yields sub-quadratic overall complexity, enabling gigapixel-scale training (Chen et al., 2023).
  • Object Detection (ILGA, AEWin): Parallel or fused block designs keep additional FLOPs minimal (<0.2 GFLOPs per attention block) relative to backbone compute (Shao, 2024, Zhang et al., 2022).
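
For a rough sense of these scalings, the snippet below counts only the query-key score FLOPs for global, windowed, and focal-style interleaved attention at one illustrative setting; the helper names, the token/window counts, and the simplified FLOP model (ignoring projections and the value aggregation) are all assumptions for illustration.

```python
# Rough attention-score FLOP counts (QK^T multiply-accumulates only); the
# constants below are illustrative, not taken from any of the cited papers.
def global_flops(n, d):
    return n * n * d                      # every query scores every key

def local_flops(n, d, w):
    return n * w * d                      # each query scores only the w keys in its window

def focal_style_flops(n, d, w, m):
    # interleaved: each query scores w fine local keys plus m pooled global summaries
    return n * (w + m) * d

n, d, w, m = 16_384, 64, 256, 512         # tokens, head dim, local window, pooled summaries
print(f"global only:         {global_flops(n, d):.2e}")             # ~1.7e+10
print(f"local only (w=256):  {local_flops(n, d, w):.2e}")           # ~2.7e+08
print(f"interleaved (w + m): {focal_style_flops(n, d, w, m):.2e}")  # ~8.1e+08
```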

5. Empirical Results and Advantages

Systematic empirical evaluation across domains establishes the advantage of interleaved local-global attention:

| Domain | Model | SOTA/Metric Gains | Key Effect |
|---|---|---|---|
| ImageNet Classification | Focal, AEWin | +1–2% Top-1 accuracy vs. window-only | Recover long-range semantics |
| Object Detection | Focal, AEWin, ILGA | +1–5 mAP, +0.7 mAP@50-95 | Robust to multi-scale, small objects |
| Image Retrieval | DALG, GLAM, ILGA | +1–10% mAP over local/global only | Joint spatial/semantic descriptors |
| Graphs | G2LFormer | Outperforms local+global and parallel baselines | Preserves local structure while propagating global context |
| Audio Modeling | MW-MAE | +1–2 points across 10 tasks | Diverse receptive field per head/block |
| Vision-Language | AGLA | -8% CHAIR_S hallucination, best on POPE/MME | Integrates prompt-relevant visual evidence |
| Medical 3D Segmentation | nnFormer | -4 mm HD95, +15% DSC vs. prior ViT | Spatially consistent volumetric predictions |

Empirical ablations consistently demonstrate that purely local or purely global attention is suboptimal. Fusing both modes, whether via parallel branches (Song et al., 2022, Yu et al., 2024), per-block gating (Shao, 2024), or per-head mixing (Yadav et al., 2023), recovers complementary information, improves convergence and robustness to missing or deformed inputs, and regularizes model behavior (e.g., less hallucination, higher recall of rare or small-category objects).

6. Theoretical Justifications and Limitations

Theoretical justifications for interleaved local-global attention are rooted in:

  • The need to preserve both locality-sensitive inductive bias and global context modeling.
  • Overcoming limitations such as over-smoothing (from excessive local pooling) and loss of fine detail (from full global integration).
  • Mitigating the cost/computation trade-off for scalable architectures.

Limitations remain:

  • Overly deep pooling stages can degrade spatial localization (Nguyen et al., 2024).
  • Increasing fusion complexity can add parameter/memory overhead, especially with high-dimensional features or many local branches (local crop count, head count).
  • Adaptive weighting or gating, when not aligned with data quality, may not yield uniform gains; certain tasks rely more on one context scale (Yu et al., 2024, Shao, 2024).

Ongoing research investigates dynamically learnable interleaving schedules, adaptive window selection, and decomposition into content-aware rather than fixed local/global groups.

7. Representative Applications and Future Directions

  • High-Resolution and Multi-Scale Tasks: Robustness in detection of small/rare objects and AI-generated imagery forensics is achieved by stratified sampling and parallel local-global aggregation (Han, 1 Jan 2026).
  • Medical Imaging: Interleaved volumetric self-attention delivers superior spatial consistency and structure preservation in 3D segmentation (Zhou et al., 2021).
  • Vision-Language Grounding: Adaptive fusion of context-dependent local features and general global features mitigates LVLM object hallucination (An et al., 2024).
  • Audio and Multimodal Representation: Multiple per-head window sizes enable fine-to-coarse spectro-temporal attribute modeling in a single transformation (Yadav et al., 2023).

Future work includes dynamic interleaving and feature-based window adaptivity, efficient cross-modal (e.g. prompt-to-image) interleaving, low-rank or sketch-based global context integration, and extensions to spatio-temporal and graph domains.


References:

(Yang et al., 2021, Zhang et al., 2022, Song et al., 2022, Shao, 2024, Nguyen et al., 2024, Chen et al., 2023, Wang et al., 18 Sep 2025, Yadav et al., 2023, Yu et al., 2024, Zhou et al., 2021, Song et al., 2021, Sheynin et al., 2021, Han, 1 Jan 2026, Li et al., 2024, An et al., 2024, Sun, 2024).
