Zigzag Attention in Transformers

Updated 28 January 2026
  • Zigzag Attention is a technique that restructures Transformer attention to enable efficient long-context inference while minimizing degradation in model quality.
  • The ZigzagAttention variant employs layer-exclusive retrieval and streaming modes to reduce computational complexity, achieving significant speedups and memory savings compared to full attention.
  • Blockwise Zigzag in LoZA uses structured sparse patterns for near-linear scaling, delivering improved latency and decoding efficiency in long-context scenarios.

Zigzag Attention encompasses two families of techniques that restructure attention mechanisms in Transformer architectures to enable efficient long-context inference while minimizing degradation in model quality. Both families—exemplified by ZigzagAttention with exclusive retrieval and streaming heads, and by blockwise ZigZag attention as instantiated in LongCat ZigZag Attention (LoZA)—directly address computational and memory bottlenecks observed in LLMs with context lengths extending beyond tens of thousands of tokens. These techniques achieve significant reductions in inference latency and memory footprint by replacing traditional full attention either via exclusive per-layer sparsity or through structured, blockwise sparse attention patterns that preserve essential connectivity for information retrieval and reasoning at scale (Liu et al., 17 Aug 2025, Zhang et al., 30 Dec 2025).

1. Motivation and Problem Statement

Standard Transformers exhibit $\mathcal{O}(n^2)$ attention complexity, with $n$ denoting input sequence length. For long-context LLMs, especially in autoregressive decoding, this manifests as rapidly increasing GPU memory use for the key–value (KV) cache, with size $\mathcal{O}(n \cdot d)$, where $d$ is the hidden dimension. As $n \gg 1000$, the operational cost becomes prohibitive. Prior sparse attention patterns—such as sliding window, block-sparse, or strided variants—either inadequately preserve long-range dependency or still incur $\mathcal{O}(n w)$ cost with window size $w$ scaling with $n$ for quality retention. Zigzag Attention methods are explicitly designed to resolve these trade-offs by either assigning entire layers to exclusive “retrieval” or “streaming” attention modes (ZigzagAttention), or by introducing blockwise alternating global-local connectivity at the kernel level (LoZA), thereby achieving near-linear scaling and robust long-context performance (Liu et al., 17 Aug 2025, Zhang et al., 30 Dec 2025).

2. ZigzagAttention: Layer-Exclusive Retrieval and Streaming Heads

ZigzagAttention addresses the cost of attention under long-context settings by leveraging the observation that not all attention heads contribute equally to global retrieval. Each head $(i, j)$ (with $i\in\{1,\dots,L\}$ indexing layers and $j\in\{1,\dots,H\}$ indexing heads per layer) is assigned a gating score $\alpha_{i,j}\in[0,1]$ learned by distillation on synthetic long-context data. Mixed per-head attention is realized as

$\mathrm{attention}_{i,j} = \alpha_{i,j} \cdot \text{full\_attention} + (1 - \alpha_{i,j}) \cdot \text{streaming\_attention}.$

Heads are sorted by $\alpha_{i,j}$, and a user-defined sparsity quantile $s$ designates the top $1-s$ fraction as “retrieval” heads and the bottom $s$ fraction as “streaming” heads. Crucially, ZigzagAttention enforces an exclusivity constraint per layer:

$\forall i\in\{1,\dots,L\},\quad \sum_{j=1}^H z_{i,j}\in\{0,H\}$

where $z_{i,j}=1$ for retrieval and $z_{i,j}=0$ for streaming. Thus, every layer comprises either all retrieval or all streaming heads. Assignment is determined via a transport-based enumeration that minimizes perturbation from the original gating scores, weighted by a hyperparameter $\omega$ controlling the penalty for promoting streaming layers back to retrieval mode. Only one attention pass (either full or streaming) is performed per layer, eliminating the two-pass overhead characteristic of prior methods (e.g., DuoAttention) (Liu et al., 17 Aug 2025).
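
To make the mixing concrete, the following minimal NumPy sketch interpolates between full causal attention and a streaming pattern for a single head, mirroring the distillation objective above. It is an illustration rather than the authors' implementation: the StreamingLLM-style sink-plus-window mask and all names (head_attention, mixed_head_output, sink, window) are assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(q, k, v, mask):
    # q, k, v: [n, d]; mask: [n, n] boolean, True where attention is allowed.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

def mixed_head_output(q, k, v, alpha, sink=4, window=256):
    # alpha in [0, 1] interpolates between full causal attention and a streaming
    # pattern restricted to a few sink tokens plus a recent window (assumed form).
    n = q.shape[0]
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    causal = j <= i
    streaming = causal & ((j < sink) | (i - j < window))
    return alpha * head_attention(q, k, v, causal) + (1.0 - alpha) * head_attention(q, k, v, streaming)

After distillation, heads designated as retrieval keep full attention and streaming heads keep only the streaming mask, so at inference a single attention pass is performed per head and, under the exclusivity constraint, per layer.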

3. Blockwise ZigZag Attention in LoZA

LongCat ZigZag Attention (LoZA) implements a blockwise sparse attention pattern optimized for long-context scaling. The input sequence of length $n$ is divided into $B=n/b$ contiguous blocks of size $b$. For every “sparse” (streaming) layer $d$, each block $t$ attends to:

  • A local radius of blocks ($t-l$ to $t+l$)
  • A global “sink” block whose index $\alpha_d = d \bmod B$ zigzags across layers

Mathematically, for block $t$, the keys and values attended are

$K'_{d, t} = \mathrm{concat}(K_{t-l:t+l}, K_{\alpha_d}), \qquad V'_{d, t} = \mathrm{concat}(V_{t-l:t+l}, V_{\alpha_d})$

and the result is

$O_{d, t} = \operatorname{softmax}\left(Q_t K'_{d, t}^\top / \sqrt{d_k}\right)V'_{d, t}$

Across successive layers, by varying $\alpha_d$, the pattern guarantees that any block can reach all others within $\mathcal{O}(\#\text{layers})$ hops, while per-token computational cost remains constant. In LoZA, typically 50% of attention layers are converted to this streaming sparse pattern: a calibration phase attaches a trainable $\alpha_i\in[0,1]$ to each layer's output, layers are ranked by the calibrated scores, and the bottom half is statically converted to the sparse pattern after mid-training (Zhang et al., 30 Dec 2025).
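
For intuition about the connectivity, the short helper below (an illustrative sketch in the notation above, not LoZA's kernel; the function name is hypothetical) enumerates which key/value blocks a query block attends to in a given sparse layer, showing how the sink index cycles with the layer index.

def zigzag_block_indices(t, d, B, l):
    # Blocks attended by query block t in sparse layer d, out of B blocks total:
    # a local radius t-l .. t+l (clipped to the valid range) plus the global
    # sink block alpha_d = d mod B, which zigzags across layers.
    local = {min(max(t + i, 0), B - 1) for i in range(-l, l + 1)}
    return sorted(local | {d % B})

# Example with B=8 blocks and radius l=1: the sink index cycles with the layer
# index, so stacking enough sparse layers connects every pair of blocks in a few hops.
for d in range(4):
    print(d, zigzag_block_indices(t=5, d=d, B=8, l=1))
# 0 [0, 4, 5, 6]
# 1 [1, 4, 5, 6]
# 2 [2, 4, 5, 6]
# 3 [3, 4, 5, 6]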

4. Quantitative Results and Empirical Analysis

Decoding and Prefill Latency

  • ZigzagAttention achieves per-token decoding speedups of approximately 37% over LLaMA-3-8B and 10–15% over DuoAttention for output lengths from 1k to 32k tokens, without degradation in prefilling speed (Liu et al., 17 Aug 2025).
  • LoZA’s blockwise kernel executes at less than 10% of the cost of a full-attention kernel at $n=128$k tokens, yielding over 50% speedup in prefill and approximately 30% decode cost savings at $n=256$k tokens (Zhang et al., 30 Dec 2025).

Model Quality

  • On LongBench (50% sparsity), ZigzagAttention's average score (38.44) is within about 1.3 points of the LLaMA-3 baseline (39.78) and marginally underperforms DuoAttention (39.45). Needle-in-a-Haystack evaluations (40k–280k context) show no loss in retrieval performance (Liu et al., 17 Aug 2025).
  • LoZA matches or surpasses the quality of LongCat-Flash-Base/Chat on general benchmarks (MMLU, GSM8K, code) and improves on long-context tasks (LongEval 95.7→99.3, MRCR at context lengths up to one million tokens) (Zhang et al., 30 Dec 2025).

Ablation and Sensitivity

  • The penalty parameter $\omega$ in ZigzagAttention reveals a trade-off: $\omega=0.1$ gives the best LongBench average, with higher $\omega$ eroding long-context performance.
  • Naïve interleaving of sparse layers (converting every other layer) in LoZA erases long-context quality (LongEval 95.7→54.1), while calibrated assignment preserves it (89.6), with full recovery through continued sparse training (Zhang et al., 30 Dec 2025).

5. Algorithmic Details and Complexity

ZigzagAttention Assignment Algorithm (Layer-exclusive Mode)

import itertools

def assign_layer_modes(alpha, s, omega):
    # Layer-exclusive assignment of retrieval vs. streaming modes.
    # alpha: gating scores with shape [L, H]; s: sparsity ratio; omega: penalty weight.
    # Returns the set S of streaming-layer indices; the remaining L - p layers stay in retrieval mode.
    L, H = len(alpha), len(alpha[0])
    p = round(s * L)                                 # number of streaming layers
    best_cost, best_assignment = float("inf"), None
    for S in itertools.combinations(range(L), p):    # S = candidate streaming-layer indices
        streaming = set(S)
        cost = 0.0
        for i in range(L):
            for j in range(H):
                if i in streaming:
                    cost += alpha[i][j]              # streaming layers accumulate raw gating scores
                else:
                    cost += omega * alpha[i][j]      # retrieval layers accumulate omega-weighted scores
        if cost < best_cost:
            best_cost, best_assignment = cost, streaming
    return best_assignment
(Liu et al., 17 Aug 2025)

Blockwise ZigZag Sparse Layer in LoZA

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def zigzag_attention_layer(X, W_q, W_k, W_v, b, l, alpha):
    # Blockwise ZigZag sparse layer: each query block attends to its local radius
    # of blocks plus this layer's "zigzagging" sink block (index alpha = d mod B).
    n = X.shape[0]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # [n, d_q], [n, d_k], [n, d_v]
    d_k, d_v = K.shape[1], V.shape[1]
    B = n // b                                       # number of blocks (assumes b divides n)
    Qb, Kb, Vb = Q.reshape(B, b, -1), K.reshape(B, b, d_k), V.reshape(B, b, d_v)
    out_blocks = []
    for t in range(B):
        local_idxs = [min(max(t + i, 0), B - 1) for i in range(-l, l + 1)]
        idxs = sorted(set(local_idxs + [alpha]))     # local radius + sink block
        K_sel = np.concatenate([Kb[j] for j in idxs], axis=0)
        V_sel = np.concatenate([Vb[j] for j in idxs], axis=0)
        scores = softmax((Qb[t] @ K_sel.T) / np.sqrt(d_k), axis=-1)
        out_blocks.append(scores @ V_sel)
    return np.concatenate(out_blocks, axis=0)        # O: [n, d_v]
(Zhang et al., 30 Dec 2025)

Computational Complexity

Scheme | Cost per layer | Scaling with $n$
--- | --- | ---
Full attention | $\mathcal{O}(n^2 d_k)$ | Quadratic
Zigzag sparse | $\mathcal{O}(nC)$, with $C=(2l+1+s)\,b$ | Linear

Interleaving full and Zigzag layers in LoZA yields an overall cost of $L_f n^2 + L_s n C$, with $n \gg \sqrt{L_s C / L_f}$ ensuring the sparse layers' cost is negligible. Practical speedups of up to $2\times$ in the attention-dominated regime are reported; kernel-level Zigzag achieves a $\sim 90\%$ cost reduction for sparse layers (Zhang et al., 30 Dec 2025).
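
As a back-of-the-envelope check of this scaling (an illustration with assumed constants, not measured kernel costs; the function name and FLOP model are hypothetical), the snippet below compares the two terms of the interleaved cost at a long context length.

def interleaved_attention_cost(n, L_f, L_s, b=128, l=8, s=1, d_k=128):
    # Rough operation counts: L_f full layers at ~n^2 * d_k each, plus L_s
    # zigzag-sparse layers at ~n * C * d_k each, with C = (2l + 1 + s) * b.
    C = (2 * l + 1 + s) * b
    return L_f * n * n * d_k, L_s * n * C * d_k

# Hypothetical 32-layer model with half its layers sparse, at n = 128k tokens:
full, sparse = interleaved_attention_cost(n=128_000, L_f=16, L_s=16)
print(sparse / (full + sparse))   # ~0.02: the sparse layers are a negligible share of attention cost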

6. Integration, Practical Considerations, and Limitations

For ZigzagAttention, only a one-time “transport” assignment (about 7 minutes of wall time) is needed to fix layer roles; no architectural or inference changes are required beyond the adjusted layer modes. KV cache memory scales down almost linearly with the streaming-head fraction (e.g., 50% streaming heads yields roughly a 50% memory reduction) (Liu et al., 17 Aug 2025). LoZA is calibrated during mid-training with learnable $\alpha_i$, followed by sparsification and curriculum-based long-context retraining. Post-training, task-specific supervised, DPO, or reinforcement finetuning can be applied (Zhang et al., 30 Dec 2025).
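
The KV-cache saving follows from simple arithmetic. The sketch below is illustrative only: the configuration numbers and the fixed sink-plus-window token budget assumed for streaming layers are not values taken from the papers.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   streaming_frac=0.5, streaming_budget=1024, bytes_per_elem=2):
    # Per layer: 2 tensors (K and V) * cached tokens * kv_heads * head_dim * bytes.
    # Retrieval layers cache all n_ctx tokens; streaming layers keep only a
    # fixed sink-plus-window budget of tokens (assumed behavior).
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    stream_layers = round(n_layers * streaming_frac)
    full_layers = n_layers - stream_layers
    return (full_layers * n_ctx + stream_layers * min(streaming_budget, n_ctx)) * per_token

# Hypothetical 8B-class config: 32 layers, 8 KV heads, head_dim 128, fp16, 128k context.
dense = kv_cache_bytes(128_000, 32, 8, 128, streaming_frac=0.0)
zigzag = kv_cache_bytes(128_000, 32, 8, 128, streaming_frac=0.5)
print(dense / 2**30, zigzag / 2**30)   # ~15.6 GiB vs ~7.9 GiB: close to a 50% reduction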

Optimal performance requires careful choice of block size $b$, local radius $l$, sink count $s$, and sparsity ratio. For LoZA, $b=128$, $l\in[4,16]$, $s=1$, and 50% sparsity have been found effective. For short contexts ($n\leq 4$k), ZigZag overhead may outweigh the benefits, suggesting defaulting to full attention in these regimes. Layer-level sparsity ensures uniform GPU allocation, while per-head sparsity (as in prior work) may introduce computational imbalances (Zhang et al., 30 Dec 2025).

A plausible implication is that the design principles of Zigzag Attention can be generalized to other settings where trade-offs between memory, compute, and retrieval accuracy need to be balanced without major architectural overhaul or kernel reengineering.

7. Applications and Impact

ZigzagAttention and LoZA primarily target LLM deployment scenarios requiring efficient inference over very long contexts, such as retrieval-augmented generation, document-level question answering, and tool-based agentic reasoning, where context lengths often exceed 100k tokens. These methods enable scalable model serving by curbing both per-token latency and KV cache growth without sacrificing core LLM capabilities on established benchmarks. Their “drop-in” nature makes them relevant for production environments where model retraining costs or architecture changes are undesirable, and their impact is most pronounced in prefill-intensive and decode-intensive workloads with large input/output history (Liu et al., 17 Aug 2025, Zhang et al., 30 Dec 2025).

Key limitations involve minor accuracy loss on certain long-context benchmarks at very high sparsity ratios, and inefficiency or computational overhead at short context lengths due to metadata handling. These trade-offs can be tuned, and further research is ongoing regarding the optimal scheduling of sparse/full layers and dynamic adaptation mechanisms.

In summary, Zigzag Attention constitutes a principled, empirically validated family of sparse attention designs capable of preserving LLM utility at million-token scale while dramatically reducing the operational burdens associated with quadratic attention.
