Striped Attention: Efficient Transformer Mechanism

Updated 13 May 2026

Striped Attention is a method that refines transformer attention using fine-grained stripe partitions to selectively compute high-utility regions in the attention matrix.
It significantly increases sparsity, achieving up to 76.6% at around 90% recall, and reduces computational cost compared to traditional block-sparse approaches.
By interleaving tokens across devices, it balances computational loads and offers theoretical speedups up to 2× in distributed transformer models.

Striped Attention refers to a family of architectural and algorithmic techniques for efficient transformer attention, characterized by the use of "stripes": either fine-grained regions of the attention matrix selected for sparse computation, or a partitioning strategy that balances distributed computation across multiple devices. Striped approaches aim to minimize unnecessary computation and maximize hardware utilization, particularly in LLMs operating on long sequences. Two primary lines of research are prominent: (1) stripe-granular sparse attention for efficient inference, as exemplified by AnchorAttention, and (2) Striped Attention as a distributed partitioning strategy that improves causal attention scaling in multi-device settings.

1. Stripe Granularity in Sparse Attention

Traditionally, block-sparse attention divides the $N \times N$ attention matrix into large contiguous blocks of size $b_q \times b_k$. Attention is computed only for a selected subset of these blocks, leading to coarse sparsity. Stripe granularity refines this by collapsing the block's row dimension, yielding stripes of size $(1, b_k)$; e.g., with $(b_q, b_k) = (128, 128)$, each stripe covers $(1, 128)$ entries rather than a full $(128, 128)$ block.

Empirical analysis of attention heatmaps in ultra-long contexts shows that most queries derive high-value attention from sparse, shared "columns" (key positions), rather than dense local neighborhoods. As a result, block-sparse approaches waste compute on low-impact rows within active blocks. Stripe granularity circumvents this inefficiency by restricting computation to high-utility columns, dramatically increasing overall sparsity for a fixed recall level. For example, at ∼90% recall, stripe-sparse attention achieves 76.6% sparsity, as opposed to 56.3% for block-sparse (Table 1 in (Zhang et al., 29 May 2025)).
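To make the granularity argument concrete, the following sketch builds a synthetic, non-causal attention map whose mass sits in a few shared key columns and measures the sparsity each granularity reaches at 90% recall. Everything in it is an illustrative assumption: the helper `sparsity_at_recall`, the greedy highest-mass-first selection, the toy matrix, and the stripe orientation (one key column across a query block, following the column pattern described above). The numbers it prints are properties of the toy data, not the paper's measurements.

```python
import numpy as np

def sparsity_at_recall(attn, unit_shape, target_recall):
    """Greedily keep the highest-mass units until the target recall is met.

    Returns (sparsity, recall): the fraction of entries skipped and the
    fraction of total attention mass retained.
    """
    n_rows, n_cols = attn.shape
    ur, uc = unit_shape
    assert n_rows % ur == 0 and n_cols % uc == 0
    # Sum the attention mass inside every (ur x uc) unit.
    unit_mass = attn.reshape(n_rows // ur, ur, n_cols // uc, uc).sum(axis=(1, 3))
    ordered = np.sort(unit_mass.ravel())[::-1]          # largest units first
    cumulative = np.cumsum(ordered) / attn.sum()
    n_kept = min(int(np.searchsorted(cumulative, target_recall)) + 1, ordered.size)
    sparsity = 1.0 - n_kept / ordered.size              # all units have equal area
    return sparsity, float(cumulative[n_kept - 1])

# Toy attention map: faint background plus eight "hot" key columns
# that every query attends to (hypothetical data).
rng = np.random.default_rng(0)
N, b = 256, 32
attn = 0.002 * rng.random((N, N))
hot_columns = [3, 17, 70, 90, 130, 150, 200, 220]
attn[:, hot_columns] = 1.0

block_sparsity, block_recall = sparsity_at_recall(attn, (b, b), 0.90)
stripe_sparsity, stripe_recall = sparsity_at_recall(attn, (b, 1), 0.90)
print(f"block : sparsity={block_sparsity:.3f} recall={block_recall:.3f}")
print(f"stripe: sparsity={stripe_sparsity:.3f} recall={stripe_recall:.3f}")
```

Because each hot column drags a whole block into the block-granular selection, the block variant has to compute many low-mass entries to reach the same recall, which is the inefficiency the paragraph above describes.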

2. AnchorAttention: Stripe-Sparse Dynamic Sparse Attention

AnchorAttention (Zhang et al., 29 May 2025) implements stripe-sparse attention as a three-stage algorithm:

Pattern-based Anchor Computation: For each query block, an array of "anchor" logits is estimated cheaply by taking the maximum scaled dot-product over a small set of initial keys and a local window of keys. Given $Q \in \mathbb{R}^{N \times d}$ and $K \in \mathbb{R}^{N \times d}$, anchors are computed as

$x_a = \max_{j \in \{\text{init}, \text{w}\}} \frac{Q [K_{\text{init}} \,\|\, K_{\text{w}}]^T}{\sqrt{d}}$

for a small window size $b_\mathrm{w}$, typically $b_\mathrm{w} \ll N$, incurring negligible computational cost.

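Difference-aware Stripe Sparsity Identification: Each query block (represented by its mean) is compared to the anchor logit $x_a$; differences are thresholded to yield a binary mask over possible query-block/key-stripe pairs:

$M_{i,j} = \mathbb{1}\!\left[\, x_{a,i} - \frac{\bar{q}_i k_j^T}{\sqrt{d}} \le \theta \,\right]$

where $\bar{q}_i$ is the mean query of block $i$, $x_{a,i}$ is that block's anchor logit, $k_j$ is the $j$-th key, and $\theta$ is the difference threshold.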

The mask defines the active, high-utility stripes for which full attention should be computed.

Fine-grained Sparse Computation: Attention is computed only for those stripe positions flagged by the sparsity mask, loading discrete key and value stripes. Since discontiguous stripes can be loaded in parallel, GPU throughput is preserved. The final output uses the standard softmax across the selected stripes.
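The three stages above can be emulated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' kernel: it materializes the dense score matrix purely for readability, and the function name `anchor_attention_sketch`, the parameters `n_init`, `window`, and `theta`, the use of a block-mean anchor, and the direction of the threshold test are all hypothetical choices made for illustration.

```python
import numpy as np

def anchor_attention_sketch(Q, K, V, block=4, n_init=2, window=4, theta=2.0):
    """Dense-masked emulation of anchor -> stripe mask -> sparse softmax."""
    N, d = Q.shape
    assert N % block == 0
    scores = Q @ K.T / np.sqrt(d)                 # dense only for clarity
    pos = np.arange(N)
    causal = pos[None, :] <= pos[:, None]         # key j visible to query i

    # Stage 1: anchor = max logit over the initial keys and a local causal window.
    near = (pos[None, :] < n_init) | (pos[:, None] - pos[None, :] < window)
    anchors = np.where(causal & near, scores, -np.inf).max(axis=1)   # (N,)

    # Stage 2: compare block-mean queries against block-mean anchors (assumption).
    q_mean = Q.reshape(N // block, block, d).mean(axis=1)            # (N/block, d)
    a_mean = anchors.reshape(N // block, block).mean(axis=1)         # (N/block,)
    approx = q_mean @ K.T / np.sqrt(d)                               # (N/block, N)
    stripe_mask = (a_mean[:, None] - approx) <= theta                # keys kept per query block

    # Stage 3: softmax over the selected stripes plus the anchor region.
    keep = (np.repeat(stripe_mask, block, axis=0) | near) & causal
    masked = np.where(keep, scores, -np.inf)
    masked -= masked.max(axis=1, keepdims=True)   # diagonal is always kept, so max is finite
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, keep

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
out, keep = anchor_attention_sketch(Q, K, V)
print(out.shape, keep.mean())   # output shape and fraction of query-key pairs evaluated
```

Setting `theta` to infinity keeps every causal key and recovers ordinary causal attention, which is a convenient sanity check; a real implementation would gather only the selected key and value stripes instead of masking a dense matrix.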

The net complexity becomes $O(N k_s d)$, where $k_s$ is the average number of key positions computed per query, with $k_s \ll N$ typically, keeping total cost well below the $O(N^2 d)$ of full attention. AnchorAttention excels where attention is globally structured but highly sparse, such as in long documents. Its design targets the transformer prefill phase, with decode-time behavior an open area for further investigation.

3. Striped Attention as Parallel Causal Attention Partitioning

A distinct but related methodology, Striped Attention (Brandon et al., 2023), addresses workload imbalance in distributed causal self-attention (i.e., GPT-style models) under Ring Attention. In Ring Attention, the sequence is split into $P$ contiguous blocks, one for each of $P$ devices. Due to the triangular causal mask, at each ring-exchange round only a subset of devices has significant work to do, leading to suboptimal hardware utilization.

Striped Attention replaces contiguous partitioning with an interleaved "stripe" permutation: device $i$ is assigned all tokens $t$ such that $t \bmod P = i$, where $P$ is the number of devices. This results in each device's local block being distributed uniformly throughout the sequence.

Under this assignment, the causal mask for every query-block/key-block pair is upper-triangular with respect to the original sequence order, meaning every device always processes roughly half of its $c \times c$ attention interactions per round (with $c$ the number of tokens per device), avoiding fully masked or fully dense blocks. As a result, all devices have an evenly balanced computational load, and the per-round attention FLOPs on the busiest device are reduced by nearly $2\times$ relative to Ring Attention in the large-$c$ regime.
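The balancing effect can be checked by counting unmasked query-key pairs for every (query device, key device) combination under both partitionings. The sketch below is illustrative only; the helper names (`striped_assignment`, `contiguous_assignment`, `causal_block_mask`, `unmasked_counts`) and the toy sizes are assumptions, not part of the paper's code.

```python
import numpy as np

def striped_assignment(n_tokens, n_devices):
    """Token indices per device under round-robin striping (t % P == i)."""
    assert n_tokens % n_devices == 0
    return [np.arange(i, n_tokens, n_devices) for i in range(n_devices)]

def contiguous_assignment(n_tokens, n_devices):
    """Token indices per device under Ring Attention's contiguous split."""
    assert n_tokens % n_devices == 0
    c = n_tokens // n_devices
    return [np.arange(i * c, (i + 1) * c) for i in range(n_devices)]

def causal_block_mask(q_tokens, k_tokens):
    """mask[a, b] is True when query q_tokens[a] may attend to key k_tokens[b]."""
    return k_tokens[None, :] <= q_tokens[:, None]

def unmasked_counts(assignment):
    """counts[i, j] = unmasked pairs between device i's queries and device j's keys."""
    P = len(assignment)
    return np.array([[int(causal_block_mask(assignment[i], assignment[j]).sum())
                      for j in range(P)] for i in range(P)])

n_tokens, n_devices = 16, 4
striped_counts = unmasked_counts(striped_assignment(n_tokens, n_devices))
contiguous_counts = unmasked_counts(contiguous_assignment(n_tokens, n_devices))
print("striped:\n", striped_counts)
print("contiguous:\n", contiguous_counts)
```

With contiguous blocks the per-pair work ranges from nothing to a full dense block, whereas with striping every pair is a triangular block of nearly the same size, so no device idles while another is saturated. The total amount of unmasked work is identical in both cases; only its distribution changes.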

4. Algorithmic Details and Complexity

For AnchorAttention (Zhang et al., 29 May 2025), the three-stage procedure is formalized by key equations and can be implemented as straightforward kernel code (see Algorithms 1–3 in the source). The fine-grained computation pipeline ensures that memory access patterns are hardware-efficient despite stripe irregularity. A comparison of computational costs is summarized as follows, where $k_b$ and $k_s$ denote the average number of key positions computed per query under block and stripe selection, respectively:

Attention Type	Complexity	Sparsity (at ∼90% recall)	Speedup (128K)
Full	$O(N^2 d)$	0%	1×
Block-Sparse	$O(N k_b d)$	56.3%	n/a
Stripe-Sparse	$O(N k_s d)$	76.6%	1.44×–4.6× over SOTA
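As a worked reading of the sparsity column (simple arithmetic on the figures in the table, not an additional result from the paper): a method with sparsity $s$ evaluates a fraction $1 - s$ of the query-key pairs, so

$1 - 0.563 = 0.437 \quad \text{(block-sparse)}, \qquad 1 - 0.766 = 0.234 \quad \text{(stripe-sparse)}$

$\frac{0.437}{0.234} \approx 1.87, \qquad \frac{1}{0.234} \approx 4.27$

That is, at comparable recall the stripe-sparse variant evaluates roughly $1.87\times$ fewer score entries than the block-sparse one and roughly $4.27\times$ fewer than full attention. These ratios bound the attainable compute savings only loosely, since they ignore the cost of anchor computation, mask identification, and irregular memory access.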

For Striped Attention (Brandon et al., 2023), algorithmic changes are confined to a one-time permutation of tokens and positional IDs, followed by mask adjustment. The primary variation from Ring Attention is the uniform application of upper-triangular masks across all device blocks. With $c$ tokens per device, the busiest device's compute per round drops from on the order of $c^2 d$ FLOPs (Ring) to about $c^2 d / 2$ FLOPs (Striped), with total communication and memory cost unchanged:

At 256k tokens with 8 A100 GPUs, Striped Attention achieves up to $1.45\times$ speedup over Ring Attention.
On 16 TPUv4 chips at 786k tokens, the observed speedup increases to $1.65\times$.
The theoretical maximum speedup approaches $2\times$ as block sizes increase, limited in practice by nonzero tile granularity overhead and partial overlap of computation and communication.
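The $2\times$ ceiling can be reproduced with a small counting model. The sketch below assumes that each ring round takes as long as the busiest device needs, that cost is proportional to the number of unmasked query-key pairs, and that communication and tile-granularity effects are ignored; the function names and the symbols `c` (tokens per device) and `P` (devices) are illustrative, not taken from the paper.

```python
def critical_path_entries(c, P, striped):
    """Sum over ring rounds of the busiest device's unmasked query-key pairs."""
    tri_incl = c * (c + 1) // 2      # triangular block including its diagonal
    if striped:
        # In every round at least one device sees a triangular block
        # with the diagonal included, and no device sees more than that.
        return P * tri_incl
    # Contiguous (Ring): round 0 is triangular on every device; in each later
    # round some device holds keys that all precede its queries (a full block).
    return tri_incl + (P - 1) * c * c

def ideal_speedup(c, P):
    """Idealized Ring-to-Striped ratio of critical-path work."""
    return critical_path_entries(c, P, False) / critical_path_entries(c, P, True)

for c in (1, 16, 1024, 32768):
    print(c, round(ideal_speedup(c, P=8), 4))
```

Under this model the ratio grows with `c` toward $2 - 1/P$, which itself approaches $2$ as the device count increases. Measured speedups sit below this bound because real kernels process masks at tile granularity and overlap only part of the communication with computation.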

5. Experimental Results

Reported benchmarks show gains for stripe-based approaches in both settings.

AnchorAttention (Zhang et al., 29 May 2025) achieves 91.2% recall and 76.6% overall sparsity on LLaMA-3.1-8B at 128k sequence length. At this context, it runs $1.44\times$ faster than FlexPrefill (block-sparse SOTA) and $4.6\times$ faster than FlashAttention (full-attention) at equivalent or better recall.
Striped Attention (Brandon et al., 2023) reports speedups over Ring Attention of up to $1.45\times$ on 8 A100 GPUs at 256k tokens and $1.65\times$ on 16 TPUv4 chips at 786k tokens in distributed GPT-style transformer training on long sequences, with performance benefits scaling with sequence length and parallelism degree.

Both methods leave memory usage unchanged relative to their underlying block or ring baselines.

6. Architectural Implications and Trade-offs

Stripe-based granularity exploits empirically observed attention sparsity: in long-context language modeling, only a moderate number of key positions are significant for each query. Striped computation eliminates wasted FLOPs in block regions with low attention scores, achieving higher sparsity at equal recall. Hardware utilization is optimized, as fine-grained parallel loading enables high GPU or TPU throughput.

Trade-offs include the overhead of real-time pattern computation (anchor selection, stripe mask identification) and, in distributed Striped Attention, the need for up-front sequence permutation and consistent adjustment of positional encodings. The current state of AnchorAttention focuses exclusively on the prefill phase; its adaptability to the autoregressive decode phase remains to be evaluated. Further work is suggested for scaling to larger architectures and differing task regimes.

7. Connections to Broader Attention Research

Striped Attention and stripe-sparse approaches form part of a broader taxonomy of efficient transformer methods addressing the quadratic complexity inherent to global self-attention. Compared to coarse block strategies, stripe-based methods realize a finer balance between recall and computational/memory efficiency. Striped partitioning, as a parallel hardware solution, addresses bottlenecks unmitigated by algorithmic sparsity alone.

A plausible implication is that further integration of stripe granularity with fused attention kernels (e.g., FlashAttention-style) and hardware-aware tiling may support even greater speedups and more widespread adoption in ultra-long-context LLM deployment.

For full mathematical derivation, pseudocode, and practical implementation considerations, see the original sources (Zhang et al., 29 May 2025, Brandon et al., 2023).

References

Zhang et al. (29 May 2025). AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity.

Brandon et al. (2023). Striped Attention: Faster Ring Attention for Causal Transformers.
