Zigzag Attention in Transformers
- Zigzag Attention refers to techniques that restructure Transformer attention to enable efficient long-context inference while minimizing degradation in model quality.
- ZigzagAttention assigns each layer exclusively to a retrieval or a streaming attention mode, reducing computational complexity and achieving significant speedups and memory savings compared to full attention.
- Blockwise ZigZag attention, as instantiated in LoZA, uses structured sparse patterns for near-linear scaling, delivering improved prefill latency and decoding efficiency in long-context scenarios.
Zigzag Attention encompasses two families of techniques that restructure attention mechanisms in Transformer architectures to enable efficient long-context inference while minimizing degradation in model quality. Both families—exemplified by ZigzagAttention with exclusive retrieval and streaming heads, and by blockwise ZigZag attention as instantiated in LongCat ZigZag Attention (LoZA)—directly address computational and memory bottlenecks observed in LLMs with context lengths extending beyond tens of thousands of tokens. These techniques achieve significant reductions in inference latency and memory footprint by replacing traditional full attention either via exclusive per-layer sparsity or through structured, blockwise sparse attention patterns that preserve essential connectivity for information retrieval and reasoning at scale (Liu et al., 17 Aug 2025, Zhang et al., 30 Dec 2025).
1. Motivation and Problem Statement
Standard Transformers exhibit $O(n^2)$ attention complexity, with $n$ denoting input sequence length. For long-context LLMs, especially in autoregressive decoding, this manifests as rapidly increasing GPU memory use for the key–value (KV) cache, whose size grows as $O(n \cdot d)$ per layer, where $d$ is the hidden dimension. As $n$ grows, the operational cost becomes prohibitive. Prior sparse attention patterns—such as sliding window, block-sparse, or strided variants—either inadequately preserve long-range dependency or still incur $O(n \cdot w)$ cost with window size $w$ scaling with $n$ for quality retention. Zigzag Attention methods are explicitly designed to resolve these trade-offs by either assigning entire layers to exclusive “retrieval” or “streaming” attention modes (ZigzagAttention), or by introducing blockwise alternating global-local connectivity at the kernel level (LoZA), thereby achieving near-linear scaling and robust long-context performance (Liu et al., 17 Aug 2025, Zhang et al., 30 Dec 2025).
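For concreteness, the back-of-the-envelope sketch below estimates full-attention score FLOPs and KV-cache memory, assuming a hypothetical grouped-query configuration resembling LLaMA-3-8B (32 layers, 32 query heads, 8 KV heads of dimension 128, fp16 cache); the figures are illustrative, not measurements from either paper.

```python
def full_attention_costs(n, n_layers=32, n_q_heads=32, n_kv_heads=8,
                         head_dim=128, bytes_per_elem=2):
    """Rough cost model for full causal attention at sequence length n (illustrative only)."""
    # QK^T plus scores@V: roughly 2 * n^2 * head_dim FLOPs per query head per layer -> quadratic in n.
    score_flops = 2 * n * n * head_dim * n_q_heads * n_layers
    # KV cache: one key and one value vector per token, per KV head, per layer.
    kv_bytes = 2 * n * n_kv_heads * head_dim * n_layers * bytes_per_elem
    return score_flops, kv_bytes

for n in (8_192, 131_072, 1_048_576):
    flops, kv_bytes = full_attention_costs(n)
    print(f"n={n:>9,}  score FLOPs ~{flops:.1e}  KV cache ~{kv_bytes / 2**30:.1f} GiB")
```

At 128k tokens this hypothetical configuration already requires on the order of 16 GiB of KV cache alone, which is the regime both methods target.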
2. ZigzagAttention: Layer-Exclusive Retrieval and Streaming Heads
ZigzagAttention addresses the quadratic cost of attention under long-context settings by leveraging the observation that not all attention heads contribute equally to global retrieval. Each head $(i, j)$ (with $L$ layers and $H$ heads per layer) is assigned a gating score $\alpha_{i,j} \in [0, 1]$ learned by distillation on synthetic long-context data. Mixed per-head attention is realized as
$\operatorname{Attn}_{i,j} = \alpha_{i,j}\,\operatorname{Attn}^{\text{full}}_{i,j} + (1 - \alpha_{i,j})\,\operatorname{Attn}^{\text{stream}}_{i,j}.$
Heads are sorted by $\alpha_{i,j}$, and a user-defined sparsity quantile $s$ designates the top $1-s$ fraction as “retrieval” heads and the bottom $s$ fraction as “streaming” heads. Crucially, ZigzagAttention enforces an exclusivity constraint per layer:
$\alpha_{i,1} = \alpha_{i,2} = \cdots = \alpha_{i,H} \in \{0, 1\} \quad \text{for every layer } i,$
where the shared value is $1$ for retrieval and $0$ for streaming. Thus, every layer comprises either all retrieval or all streaming heads. Assignment is determined via a transport-based enumeration that minimizes perturbation from the original gating scores, weighted by a hyperparameter $\omega$ controlling the penalty for promoting streaming layers back to retrieval mode. Only one attention pass (either full or streaming) is performed per layer, eliminating the two-pass overhead characteristic of prior methods (e.g., DuoAttention) (Liu et al., 17 Aug 2025).
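A minimal sketch of these two ingredients follows, assuming the per-head full and streaming attention outputs have already been computed elsewhere; the array shapes, function names, and the mean-score rounding used for layer exclusivity are illustrative simplifications of the transport-based assignment described above (the full enumeration appears in Section 5).

```python
import numpy as np

def mixed_head_attention(full_out, stream_out, alpha):
    """Gated per-head mixture of full and streaming attention outputs.
    full_out, stream_out: [H, n, d_head]; alpha: [H] gates in [0, 1]."""
    g = alpha[:, None, None]
    return g * full_out + (1.0 - g) * stream_out

def layer_exclusive_modes(alpha, sparsity):
    """Collapse per-head gates alpha [L, H] to one binary mode per layer:
    the round(sparsity * L) layers with the lowest mean gate become streaming (0),
    the remaining layers retrieval (1). Simplified stand-in for the transport step."""
    L = alpha.shape[0]
    n_stream = round(sparsity * L)
    streaming = np.argsort(alpha.mean(axis=1))[:n_stream]
    modes = np.ones(L, dtype=int)
    modes[streaming] = 0
    return modes
```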
3. Blockwise ZigZag Attention in LoZA
LongCat ZigZag Attention (LoZA) implements a blockwise sparse attention pattern optimized for long-context scaling. The input sequence of length $n$ is divided into $B = n / b$ contiguous blocks of size $b$. For every “sparse” (streaming) layer $d$, each block $t$ attends to:
- A local window of radius $l$ blocks (block indices $t-l$ through $t+l$, clipped to the valid range)
- A global “sink” block whose index $\sigma_d$ zigzags across layers
Mathematically, for block $t$ in sparse layer $d$, the keys and values attended are
$K'_{d,t} = \operatorname{concat}\left(K_{\sigma_d}, K_{t-l}, \ldots, K_{t+l}\right), \qquad V'_{d,t} = \operatorname{concat}\left(V_{\sigma_d}, V_{t-l}, \ldots, V_{t+l}\right),$
and the result is
$O_{d, t} = \operatorname{softmax}\left(Q_t K'_{d, t}^\top / \sqrt{d_k}\right)V'_{d, t}$
Across successive layers, varying $\sigma_d$ guarantees that any block can reach every other block within a small number of hops, while per-token computational cost remains constant. In LoZA, typically 50% of attention layers are converted to this streaming sparse pattern: a calibration phase attaches a trainable gate to each layer’s attention output, layers are ranked by the calibrated gate values, and the bottom half is statically converted to the sparse pattern after mid-training (Zhang et al., 30 Dec 2025).
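As a minimal sketch of the resulting block-level connectivity (not the LoZA kernel itself), the snippet below lists which block indices a given block attends to at each sparse layer, assuming a simple back-and-forth schedule for the sink index $\sigma_d$; the actual schedule used in LoZA may differ.

```python
def zigzag_sink_index(layer, num_blocks):
    """Hypothetical zigzag schedule: the sink block sweeps forward then back across layers."""
    period = 2 * (num_blocks - 1) if num_blocks > 1 else 1
    pos = layer % period
    return pos if pos < num_blocks else period - pos

def attended_blocks(t, layer, num_blocks, radius):
    """Block indices attended by block t at a sparse layer: local window plus zigzag sink."""
    local = {min(max(t + i, 0), num_blocks - 1) for i in range(-radius, radius + 1)}
    return sorted(local | {zigzag_sink_index(layer, num_blocks)})

# Example: 16 blocks, local radius 1, first four sparse layers.
for layer in range(4):
    print(layer, [attended_blocks(t, layer, 16, 1) for t in (0, 7, 15)])
```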
4. Quantitative Results and Empirical Analysis
Decoding and Prefill Latency
- ZigzagAttention achieves per-token decoding speedups of approximately 37% over LLaMA-3-8B and 10–15% over DuoAttention for output lengths from 1k to 32k tokens, without degradation in prefilling speed (Liu et al., 17 Aug 2025).
- LoZA's blockwise kernel executes at less than 10% of the cost of a full-attention kernel at long context lengths, yielding over 50% prefill speedup and approximately 30% decode cost savings in the same long-context regime (Zhang et al., 30 Dec 2025).
Model Quality
- On LongBench (50% sparsity), ZigzagAttention's average score (38.44) is about 1.3 points below the LLaMA-3-8B baseline (39.78) and marginally underperforms DuoAttention (39.45). Needle-in-a-Haystack evaluations (40k–280k context) show no loss in retrieval performance (Liu et al., 17 Aug 2025).
- LoZA matches or surpasses the quality of LongCat-Flash-Base/Chat on general benchmarks (MMLU, GSM8K, code) and improves on long-context tasks (LongEval 95.7→99.3, MRCR at contexts up to one million tokens) (Zhang et al., 30 Dec 2025).
Ablation and Sensitivity
- The penalty parameter $\omega$ in ZigzagAttention reveals a trade-off: a properly tuned $\omega$ gives the best LongBench average, with higher values of $\omega$ eroding long-context performance.
- Naïve interleaving of sparse layers (every other layer) in LoZA erases long-context quality (LongEval 95.7→54.1), while calibrated layer assignment largely preserves it (89.6), with full recovery after continued sparse training (Zhang et al., 30 Dec 2025).
5. Algorithmic Details and Complexity
ZigzagAttention Assignment Algorithm (Layer-exclusive Mode)
```python
from itertools import combinations

def assign_streaming_layers(alpha, s, omega):
    """Layer-exclusive assignment: choose which layers become streaming.
    alpha: [L][H] per-head gating scores, s: target sparsity, omega: penalty weight."""
    L, H = len(alpha), len(alpha[0])
    p = round(s * L)                                  # number of streaming layers (L - p stay retrieval)
    best_cost, best_assignment = float("inf"), None
    for S in combinations(range(L), p):               # S = candidate streaming layer indices
        S_set = set(S)
        cost = 0.0
        for i in range(L):
            for j in range(H):
                if i in S_set:
                    cost += alpha[i][j]               # penalty for demoting head (i, j) to streaming
                else:
                    cost += -omega * alpha[i][j]      # weighted reward for keeping head (i, j) retrieval
        if cost < best_cost:
            best_cost, best_assignment = cost, S_set
    return best_assignment
```
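A hypothetical invocation with random gate scores (the real gates come from the distillation described in Section 2):

```python
import random

random.seed(0)
L, H = 8, 4
alpha = [[random.random() for _ in range(H)] for _ in range(L)]
streaming_layers = assign_streaming_layers(alpha, s=0.5, omega=0.1)
print("streaming layers:", sorted(streaming_layers))
# Because this cost decomposes per layer, the optimum here is simply the round(s*L)
# layers with the smallest summed gate scores; the exhaustive search mirrors the
# description above rather than being required for this particular objective.
```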
Blockwise ZigZag Sparse Layer in LoZA
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def zigzag_attention_layer(X, W_q, W_k, W_v, b, l, sink_idx):
    """One blockwise ZigZag sparse attention layer: a local window of radius l plus a
    global sink block whose index (sink_idx, i.e. sigma_d for this layer) zigzags across
    layers. Assumes the sequence length n is divisible by the block size b."""
    n = X.shape[0]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                         # [n, d_q], [n, d_k], [n, d_v]
    d_k = K.shape[1]
    B = n // b                                                  # number of blocks
    Qb, Kb, Vb = (Z.reshape(B, b, -1) for Z in (Q, K, V))
    out_blocks = []
    for t in range(B):
        # Local block indices t-l .. t+l (clipped to the valid range) plus the sink block.
        local = [min(max(t + i, 0), B - 1) for i in range(-l, l + 1)]
        idxs = sorted(set(local + [sink_idx]))
        K_sel = np.concatenate([Kb[j] for j in idxs], axis=0)   # [len(idxs)*b, d_k]
        V_sel = np.concatenate([Vb[j] for j in idxs], axis=0)   # [len(idxs)*b, d_v]
        scores = softmax(Qb[t] @ K_sel.T / np.sqrt(d_k), axis=-1)
        out_blocks.append(scores @ V_sel)                       # [b, d_v]
    return np.concatenate(out_blocks, axis=0)                   # [n, d_v]
```
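A quick shape check with random weights (illustrative only; a production implementation would use a fused blockwise kernel rather than explicit per-block gathers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 1024, 64, 32
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) * 0.1 for _ in range(3))
O = zigzag_attention_layer(X, W_q, W_k, W_v, b=128, l=1, sink_idx=0)
print(O.shape)  # (1024, 32)
```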
Computational Complexity
| Scheme | Cost per layer | Scaling with $n$ |
|---|---|---|
| Full attention | $O(n^2 d)$ | Quadratic |
| Zigzag sparse | $O(n \cdot b \cdot d)$ ($b \ll n$) | Linear |
Interleaving full and Zigzag layers in LoZA yields an overall cost dominated by the remaining full-attention layers, with $b \ll n$ ensuring that the sparse layers’ contribution is negligible by comparison. Substantial practical speedups are reported in the attention-dominated regime; at the kernel level, the Zigzag pattern reduces sparse-layer attention cost to a small fraction of the full-attention kernel (Zhang et al., 30 Dec 2025).
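A rough per-layer, per-head cost comparison under these assumptions (hypothetical $b$ and $l$; constants and non-attention work ignored):

```python
def attention_score_flops(n, d, b=None, l=1, full=True):
    """Approximate per-layer, per-head attention FLOPs (illustrative cost model only)."""
    if full:
        return 2 * 2 * n * n * d                   # QK^T plus scores @ V, every token vs every token
    blocks_attended = (2 * l + 1) + 1              # local window plus one sink block
    return 2 * 2 * n * blocks_attended * b * d     # each token attends to ~(2l+2)*b tokens

n, d, b = 262_144, 128, 4_096
ratio = attention_score_flops(n, d, b=b, l=1, full=False) / attention_score_flops(n, d)
print(f"sparse / full ~ {ratio:.1%}")              # ~6% at this n, b, l
```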
6. Integration, Practical Considerations, and Limitations
For ZigzagAttention, only a one-time “transport” assignment (roughly 7 minutes of wall time) is needed to fix layer roles; no inference or architectural changes are required beyond the adjusted layer modes. KV cache memory scales down almost linearly with the fraction of streaming heads (Liu et al., 17 Aug 2025). LoZA is calibrated during mid-training with learnable per-layer gates, followed by sparsification and curriculum-based long-context retraining. Post-training, task-specific supervised, DPO, or reinforcement finetuning can be applied (Zhang et al., 30 Dec 2025).
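To illustrate the memory effect, the sketch below estimates KV-cache size as a function of the streaming-layer fraction, assuming streaming layers retain only a fixed recent window plus a few sink tokens (as in DuoAttention-style streaming heads); the window and sink sizes, like the model configuration, are hypothetical.

```python
def kv_cache_bytes(n, n_layers, stream_frac, n_kv_heads=8, head_dim=128,
                   window=1024, n_sink=4, bytes_per_elem=2):
    """Estimated KV-cache bytes with a fraction of layers in streaming mode (illustrative)."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem        # one K and one V vector
    full_layers = round(n_layers * (1 - stream_frac))
    stream_layers = n_layers - full_layers
    full_bytes = full_layers * n * per_token                      # retrieval layers cache all tokens
    stream_bytes = stream_layers * min(n, window + n_sink) * per_token
    return full_bytes + stream_bytes

for frac in (0.0, 0.5, 0.75):
    gib = kv_cache_bytes(131_072, 32, frac) / 2**30
    print(f"streaming fraction {frac:.2f}: ~{gib:.1f} GiB")
```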
Optimal performance requires careful choice of block size $b$, local radius $l$, sink count, and sparsity ratio; for LoZA, roughly 50% layer sparsity together with appropriately chosen $b$, $l$, and sink count has been found effective. For short contexts, the ZigZag overhead may outweigh its benefits, suggesting defaulting to full attention in those regimes. Layer-level sparsity ensures uniform GPU allocation, while per-head sparsity (as in prior work) may introduce computational imbalance (Zhang et al., 30 Dec 2025).
A plausible implication is that the design principles of Zigzag Attention can be generalized to other settings where trade-offs between memory, compute, and retrieval accuracy need to be balanced without major architectural overhaul or kernel reengineering.
7. Applications and Impact
ZigzagAttention and LoZA primarily target LLM deployment scenarios requiring efficient inference over very long contexts, such as retrieval-augmented generation, document-level question answering, and tool-based agentic reasoning, where context lengths often exceed 100k tokens. These methods enable scalable model serving by curbing both per-token latency and KV cache growth without sacrificing core LLM capabilities on established benchmarks. Their “drop-in” nature makes them relevant for production environments where model retraining costs or architecture changes are undesirable, and their impact is most pronounced in prefill-intensive and decode-intensive workloads with large input/output history (Liu et al., 17 Aug 2025, Zhang et al., 30 Dec 2025).
Key limitations involve minor accuracy loss on certain long-context benchmarks at very high sparsity ratios, and inefficiency or computational overhead at short context lengths due to metadata handling. These trade-offs can be tuned, and further research is ongoing regarding the optimal scheduling of sparse/full layers and dynamic adaptation mechanisms.
In summary, Zigzag Attention constitutes a principled, empirically validated family of sparse attention designs capable of preserving LLM utility at million-token scale while dramatically reducing the operational burdens associated with quadratic attention.