
Alternating Sparse Attention (ASA)

Updated 28 February 2026
  • ASA is a transformer attention scheme that alternates between local sliding-window and global compressed sparse attention to efficiently handle long-range dependencies.
  • It incorporates latent attention enhancements like Multi-head Latent Attention (MLA) and Group-head Latent Attention (GLA) to boost performance and parameter efficiency.
  • Empirical results show that ASA outperforms previous methods in long-context reasoning and retrieval tasks while halving KV-cache memory requirements.

Alternating Sparse Attention (ASA) is a transformer attention scheme designed to address the computational and memory challenges inherent in modeling extremely long sequences. By alternating local and global sparse attention regimes across layers and leveraging latent attention enhancements, ASA efficiently propagates information over long contexts while reducing memory overhead, outperforming both full attention and prior sparse attention baselines in empirical evaluations (Hu et al., 2 Nov 2025).

1. Formal Definition and Motivation

Conventional transformer attention has $O(T^2)$ complexity per layer for sequence length $T$, which is prohibitive for contexts in the range $10^4$–$10^5$ tokens. Native Sparse Attention (NSA) decomposes attention into three concurrent branches—sliding-window (local), compressed (coarse global), and selective (sparse global)—within each layer. However, concurrent application creates interference between local and global signals, and uniform branch usage across all layers degrades long-range information propagation.

ASA restructures these dynamics by alternating between local-only and global-only layers:

  • Local layers: Implement sliding-window attention, capturing short-range dependencies and supporting common-sense reasoning.
  • Global layers: Combine compressed and selective attention, enabling propagation of long-range signals.

This architectural alternation forces attention heads to specialize, reduces destructive branch interference, and accelerates global signal propagation across layers. Additionally, ASA halves the key-value (KV) cache memory footprint since only global layers store full KV states (Hu et al., 2 Nov 2025).

2. Layer-Wise Architectural Design

ASA’s transformer stack consists of $L$ layers, partitioned as follows:

  • Odd-indexed layers ($2\ell + 1$): Local context is modeled using sliding-window attention, enhanced with Multi-head Latent Attention (MLA), wherein each query head attends to a small pool of learned latent states $c \in \mathbb{R}^{T \times d_c}$.
  • Even-indexed layers ($2\ell + 2$): Global context is managed with compressed attention (block-wise means) plus sparse selective attention (top-$K$ informative blocks), both augmented with Group-head Latent Attention (GLA), where groups of heads share key and value projections to minimize parameters and facilitate efficient sparse computation.

In summary, half the layers are dedicated to high-bandwidth local modeling, and the other half to long-range, memory-efficient global retrieval.
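The alternation rule can be sketched as a simple mapping from layer index to attention regime; the function name `asa_layer_types` and the 1-indexed convention are illustrative assumptions, not part of the paper:

```python
def asa_layer_types(num_layers):
    """Map each 1-indexed layer to its ASA regime: odd layers are
    local (sliding-window + MLA), even layers are global
    (compressed + selective attention with GLA)."""
    return ["local" if i % 2 == 1 else "global"
            for i in range(1, num_layers + 1)]


# Only the "global" half of the layers keeps full KV states,
# which is where the halved KV-cache footprint comes from.
```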

3. Mathematical Formulation

Let $x \in \mathbb{R}^{T \times d}$ denote the input representation.

3.1 Local (Sliding-Window) Attention with Latent Augmentation

Define window radius $s$. For position $t$:

  • $K_{\text{sw}}(t) = [k_{t-s}, \dots, k_t]$, $V_{\text{sw}}(t) = [v_{t-s}, \dots, v_t]$.
  • Standard sliding-window attention:

$$o^{\text{sw}}_{h, t} = \mathrm{Softmax}\left(\frac{q^h_t K_{\text{sw}}(t)^\top}{\sqrt{d_k}}\right) V_{\text{sw}}(t)$$

The MLA branch attends over the latent pool $c_{\leq t}$:

$$o^{\mathrm{ml}}_t = \sum_{h=1}^H \mathrm{Softmax}\left(q_{h, t} (c_{\leq t} W_k)^\top\right)(c_{\leq t} W_v) W_o$$

or equivalently,

$$o^{\mathrm{ml}}_t = \mathrm{Softmax}\left(x_t W_q W_k^\top c_{\leq t}^\top\right) c_{\leq t} (W_v W_o)$$
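The sliding-window branch can be sketched in NumPy for a single head. This is a dense, masked reference implementation for clarity only: a production kernel would never materialize the full $T \times T$ score matrix, and the function name is hypothetical.

```python
import numpy as np


def sliding_window_attention(q, k, v, s):
    """Causal sliding-window attention for one head.

    q, k, v: (T, d) arrays; position t attends to keys t-s .. t.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) dense scores
    idx = np.arange(T)
    # Keep only keys inside the causal window [t-s, t].
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] >= idx[:, None] - s)
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because position 0 can attend only to itself, its output equals `v[0]` exactly, which gives a quick sanity check.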

3.2 Global (Compressed + Selective) Attention with Group-Head Latent Attention

Let $B$ be the block size and $m = \lfloor t/B \rfloor - 1$. Compressed keys and values:

  • $\hat K_t = [\operatorname{cmp}(k_{1:B}), \dots, \operatorname{cmp}(k_{mB-B+1:mB})]$
  • $\hat V_t$ analogously

Scoring:

  • $g^{\text{cmp}}_t = \mathrm{Softmax}(q_t \hat K_t^\top)$ (compressed)
  • $g^{\text{slc}}_t = \mathrm{Softmax}(q_t \hat K_t^\top)$, from which the top-$K$ blocks $I_t$ are selected for selective attention

Outputs:

$$o^{\text{cmp}}_t = \mathrm{Attn}(q_t, \hat K_t, \hat V_t)$$

$$o^{\text{slc}}_t = \mathrm{Attn}(q_t, K_{I_t}, V_{I_t})$$

$$o_t = g^{\text{cmp}}_t o^{\text{cmp}}_t + g^{\text{slc}}_t o^{\text{slc}}_t$$
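The compressed and selective branches can be sketched in NumPy for a single query position. Block-wise means stand in for $\operatorname{cmp}(\cdot)$, the function names are hypothetical, and the gated combination of the two outputs is omitted for brevity:

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def compressed_selective_attention(q_t, K, V, B, topk):
    """Global-branch sketch for one query (single head).

    q_t: (d,) query; K, V: (m*B, d) past keys/values in m blocks of size B.
    Returns the compressed and selective outputs (o_cmp, o_slc).
    """
    d = K.shape[1]
    m = K.shape[0] // B
    # Block-wise mean compression: one summary key/value per block.
    K_hat = K.reshape(m, B, d).mean(axis=1)            # (m, d)
    V_hat = V.reshape(m, B, d).mean(axis=1)            # (m, d)
    # Compressed attention over the block summaries.
    p = softmax(q_t @ K_hat.T / np.sqrt(d))            # (m,) block scores
    o_cmp = p @ V_hat
    # Selective attention over the top-K scoring blocks only.
    I = np.argsort(p)[-topk:]
    sel = np.concatenate([np.arange(i * B, (i + 1) * B) for i in I])
    w = softmax(q_t @ K[sel].T / np.sqrt(d))
    o_slc = w @ V[sel]
    return o_cmp, o_slc
```

Note that the same block-summary scores drive both the compressed output and the top-$K$ block selection, mirroring the shared $\mathrm{Softmax}(q_t \hat K_t^\top)$ term in the formulas above.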

Group-head Latent Attention partitions heads into groups, each group sharing $W_k, W_v$, with an individual $W_o$ per head:

$$o^{\text{gla}}_{iG+j, t} = \mathrm{Softmax}\left(q_{iG+j, t} (c_{\leq t} W_{j, k})^\top\right)(c_{\leq t} W_{j,v}) W_{iG+j,o}$$
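The head grouping might look like the following single-position sketch; the shapes, the group-indexing convention (head $h$ in group $h \mathbin{//} G$), and the function name are assumptions for illustration:

```python
import numpy as np


def gla_output(q, c, Wk, Wv, Wo, G):
    """Group-head Latent Attention sketch at one position.

    q:  (H, d)                per-head queries
    c:  (T, d_c)              latent states c_{<=t}
    Wk, Wv: (H // G, d_c, d)  key/value projections shared per group
    Wo: (H, d, d_model)       per-head output projections
    """
    H, d = q.shape
    out = np.zeros(Wo.shape[2])
    for h in range(H):
        g = h // G                 # all G heads in a group share Wk[g], Wv[g]
        K = c @ Wk[g]              # (T, d) shared keys
        V = c @ Wv[g]              # (T, d) shared values
        scores = q[h] @ K.T
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out += (w @ V) @ Wo[h]     # per-head output projection, summed
    return out
```

Sharing $W_k, W_v$ across a group cuts projection parameters by roughly a factor of $G$ on the key/value side while the per-head $W_o$ preserves head diversity.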

3.3 Complexity and Memory

Let $L$ be the number of layers and $T$ the sequence length.

| Variant | Compute | KV-Cache Memory |
|---|---|---|
| Full Attention | $O(T^2 d_k)$ per layer | $O(L T d_v)$ |
| NSA | $O(T (s + T/B + K B) d_k)$ per layer | $O(L T d_v)$ |
| ASA | $O\left(\frac{L T (s + T/B + K B) d_k}{2}\right)$ total | $O\left(\frac{L T d_v}{2}\right)$ |

ASA reduces both compute and KV-cache memory by roughly half compared to NSA, primarily because only the global half of the layers maintains full KV caches and runs the compressed and selective branches.
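The KV-cache halving is simple arithmetic. The sketch below uses illustrative model dimensions (not figures from the paper) and a hypothetical helper name:

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   bytes_per_elem=2, global_layers_only=False):
    """Back-of-envelope KV-cache size: 2 tensors (K and V) per caching
    layer. With global_layers_only=True, only half the layers (the
    global ones, as in ASA) store KV states."""
    layers = num_layers // 2 if global_layers_only else num_layers
    return 2 * layers * seq_len * num_kv_heads * head_dim * bytes_per_elem


# Illustrative 24-layer model at 64K context, fp16 cache:
full = kv_cache_bytes(24, 65536, 8, 128)
asa = kv_cache_bytes(24, 65536, 8, 128, global_layers_only=True)
# asa == full // 2: only the global half of the layers caches KV states.
```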

4. Implementation Structure

The following PyTorch-style pseudocode summarizes ASA's layer alternation (Hu et al., 2 Nov 2025):

for l in 1..L:
    if l is odd:
        # Local layer (sliding window + MLA)
        o = ASA_sliding_window_attention(x, ...)
    else:
        # Global layer (compressed + selective, GLA)
        o = ASA_compressed_selected_attention(x, ...)
    x = x + Dropout(o)  # attention residual
    x = x + MLP(LayerNorm(x))
return x
Inside ASA_sliding_window_attention, queries and latents are constructed, and the sliding-window kernel is invoked. Global layers perform block compression, sparse selection, and call hardware-optimized sparse kernels.

5. Empirical Performance and Evaluation

Empirical benchmarks were conducted with Llama-style models at 340M and 1.3B parameters on the SlimPajama corpus (15B and 100B tokens). ASA was compared to full attention (GQA) and NSA across three key long-context evaluation categories (Hu et al., 2 Nov 2025).

| Task | 340M ASA | 340M NSA | 340M GQA | 1.3B ASA | 1.3B NSA | 1.3B GQA |
|---|---|---|---|---|---|---|
| Common-Sense Reasoning | 44.06 | 43.80 | 43.24 | 53.10 | 52.96 | 51.45 |
| In-Context Retrieval (8K) | 52.6% | 11.6% | 33.0% | 62.0% | 65.0% | 64.4% |
| Long-Context Understanding | 12.67 | 11.02 | 10.75 | 18.25 | 16.78 | 16.49 |

ASA matches or outperforms GQA and NSA on most reasoning and retrieval tasks, with a particularly pronounced improvement in long-context retrieval (8K context) at 340M, where ASA achieves 52.6% versus NSA's 11.6%.

A further advantage is the 50% reduction in KV-cache memory, which makes deployment in long-sequence inference scenarios considerably more feasible.

6. Context Within Sparse Attention Research

ASA builds directly on the NSA framework, addressing its main bottlenecks:

  • Specialization via alternation: Local/global layer alternation prevents destructive feature mixing and enables more efficient, rapid global information flow.
  • Latent Attention augmentation: The replacement of GQA with MLA and GLA branches increases context modeling effectiveness and parameter efficiency.
  • Hardware and kernel alignment: The ASA design supports direct use of optimized GPU kernels for both sliding-window and block-sparse computation.

A plausible implication is that layerwise modular sparse patterns, rather than uniform or static masking, offer substantial benefits for long-sequence transformer variants.

7. Practical Impact and Scalability

ASA is a practically scalable and memory-efficient solution for long-context transformer models. In large sequence length settings, typical of document-level or code understanding and multi-modal applications, ASA reliably delivers superior retrieval, reasoning performance, and drastically reduced memory footprint compared to both full and prior sparse attention implementations (Hu et al., 2 Nov 2025).

By imposing structural alternation, enhancing attention with latent mechanisms, and optimizing for hardware, ASA provides a robust blueprint for future transformer architectures operating in the long-context regime.
