Compressed Sparse Attention (CSA)
- CSA is a multi-stage attention mechanism that compresses the full context into a lightweight proxy and then selectively applies exact attention on a reduced set of tokens or blocks.
- It includes various implementations—such as token-level, chunk-level, block-level, and representative-based methods—that balance computational cost, memory efficiency, and model performance.
- Innovative techniques like Hadamard transforms and learned gist tokens enable CSA to achieve significant speedups and scalability on long-document tasks while preserving key information.
Compressed Sparse Attention (CSA) denotes a class of attention mechanisms for long-context Transformers in which the model first compresses the historical context into a cheaper proxy, then uses that proxy to identify a small, query-relevant subset of tokens, chunks, or blocks for higher-fidelity attention. In the canonical dense formulation, self-attention computes
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$
which for sequence length $n$ forms an $n \times n$ score matrix per head and incurs $O(n^2 d)$ compute together with $O(n^2)$ softmax and memory traffic; in autoregressive decoding, each new token scans $n$ keys and values, so per-token latency and KV bandwidth dominate at long context (Yan et al., 21 Oct 2025). CSA addresses this by replacing full-resolution scoring with compressed surrogates or summaries that are cheap to evaluate, selecting only a small set of candidates, and then executing exact or hybrid attention only where it matters. Recent work presents CSA both as an inference-time drop-in acceleration strategy and as a trainable long-context modeling paradigm, with token-level, chunk-level, block-level, and representative-based variants (Yan et al., 21 Oct 2025, Mao et al., 22 Apr 2026, Liu et al., 16 Dec 2025).
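To make the dense baseline concrete, the following minimal NumPy sketch computes single-head softmax attention and the memory a fully materialized score matrix would need. The function names, the symbol `n` for sequence length, and the absence of a causal mask are illustrative assumptions, not details taken from the cited papers.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Dense softmax attention for one head: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def score_matrix_bytes(n, bytes_per_elem=4):
    """Bytes needed to materialise one head's n x n score matrix."""
    return n * n * bytes_per_elem

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_attention(Q, K, V)                    # shape (n, d)
```

The quadratic term is what CSA methods try to avoid: doubling `n` quadruples both the score-matrix size and the work needed to fill it.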
1. Definition, scope, and terminology
CSA is best understood as a two-stage or three-stage decomposition of attention. First, the context is compressed into a representation that is much cheaper to score than the full KV cache. Second, a sparse pattern is determined from that compressed representation, typically by top-$k$ or top-$p$ selection. Third, the model performs exact attention on the retained subset, or combines exact sparse attention with a compressed residual for the discarded tail. This general recipe appears explicitly in Adamas, which compresses queries and keys into 2-bit surrogates for token-level top-$k$ selection (Yan et al., 21 Oct 2025); in SSA, which trains gist tokens to act as routing summaries for chunk unfolding (Mao et al., 22 Apr 2026); and in UniSparse, which pools tokens into composite tokens and then derives a block-sparse mask from compressed-space attention (Liu et al., 16 Dec 2025).
The term is not fully uniform across papers. Some works use CSA as a general category, while others introduce named mechanisms that fit the same pattern without foregrounding the acronym. One paper also introduces CSAttention, short for Centroid-Scoring Attention, and explicitly notes that in that paper the acronym CSA does not mean Compressed Sparse Attention (Song et al., 30 Mar 2026). This ambiguity is terminological rather than conceptual: CSAttention still compresses the retrieval structure into fixed-size offline tables and enforces sparsity online by selecting only a small subset of keys per step, so it falls naturally within the broader CSA design space (Song et al., 30 Mar 2026).
A useful distinction within the literature is between compression for scoring and compression for output formation. In Adamas, UniSparse, StreamIndex, and Lookahead Sparse Attention, compressed representations are primarily used to decide where exact attention should be spent; once a subset is chosen, the retained attention is exact on the selected keys or blocks (Yan et al., 21 Oct 2025, Liu et al., 16 Dec 2025, Jaber et al., 4 May 2026, Wang et al., 8 Jun 2026). By contrast, SPLA explicitly compresses the unselected long tail into a residual linear-attention state so that discarded blocks still contribute to the output (Wang et al., 29 Jan 2026). CBSA goes further and replaces dense attention with a representative-based factorization in which token-to-representative and representative-to-representative interactions are the only dense operations (Wen et al., 21 Sep 2025).
2. Core computational pattern
The common CSA pipeline can be written schematically as follows. A compression operator $\mathcal{C}$ maps the original context into a lower-cost representation, producing compressed keys, compressed blocks, gist tokens, centroids, or representatives. A selection score $s$ is then computed in that compressed space for the current query $q$, and a sparse set of indices $\mathcal{S}$ is chosen:
$$\mathcal{S} = \underset{j}{\operatorname{top-}k}\; s\big(q, \mathcal{C}(K)_j\big)$$
The final attention then runs only on the selected original tokens or blocks. Adamas instantiates this pattern at token granularity by transforming $Q$ and $K$ with a Hadamard matrix, bucketizing into four levels, compressing to 2-bit codes, estimating relevance by negative Manhattan distance, and then computing
$$\mathrm{softmax}\!\left(\frac{q K_{\mathcal{S}}^\top}{\sqrt{d}}\right) V_{\mathcal{S}}$$
only on the selected subset (Yan et al., 21 Oct 2025). SSA applies the same logic at chunk granularity: each chunk is paired with a learnable gist token during continued pretraining, and decoding first scores only the gist keys,
$$s_j = q^\top k^{\mathrm{gist}}_j,$$
before unfolding the top-$k$ chunks and attending over the selected gists plus their raw tokens (Mao et al., 22 Apr 2026).
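The token-level recipe described for Adamas can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the bucket threshold `t`, the function names, and the use of unpacked `int8` codes (real 2-bit packing and fused kernels are omitted) are all choices made here for readability.

```python
import numpy as np

def hadamard(d):
    """Orthonormal Hadamard matrix via Sylvester's construction (d a power of two)."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def bucketize(x, t=0.5):
    """Map each value to one of four levels {0, 1, 2, 3}; the threshold t is assumed."""
    return np.digitize(x, [-t, 0.0, t]).astype(np.int8)

def select_topk(q, K, k, t=0.5):
    """Rank keys by negative Manhattan distance between four-level codes."""
    H = hadamard(q.shape[-1])
    q_code = bucketize(q @ H, t)                       # (d,)
    k_code = bucketize(K @ H, t)                       # (n, d); cached in practice
    relevance = -np.abs(k_code - q_code).sum(axis=-1)  # higher means more relevant
    k = min(k, K.shape[0])
    return np.sort(np.argpartition(-relevance, k - 1)[:k])

def sparse_attention(q, K, V, idx):
    """Exact softmax attention restricted to the selected indices."""
    s = K[idx] @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
idx = select_topk(q, K, k)
out = sparse_attention(q, K, V, idx)
```

Because the normalized Hadamard matrix is orthogonal, `(K @ H) @ (q @ H)` equals `K @ q` before quantization, which is why ranking in the transformed domain is a reasonable proxy for ranking by the true dot product.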
The principal algorithmic difficulty is ranking fidelity. Compression is only useful if it preserves the ordering of the keys or blocks that matter for the current query. Different methods solve this differently. Adamas relies on orthogonality of the Hadamard transform to preserve dot products before quantization, then uses bucketized geometry in the Hadamard domain to approximate relevance (Yan et al., 21 Oct 2025). UniSparse compresses both query and key sequences by average pooling into $\tilde{Q}$ and $\tilde{K}$, optionally compresses heads, computes compressed-space attention
$$\tilde{A} = \mathrm{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^\top}{\sqrt{d}}\right),$$
aggregates it to block scores, and selects blocks until a Top-$p$ mass criterion is satisfied (Liu et al., 16 Dec 2025). CSAttention partitions head space into subspaces, clusters prefill queries rather than keys, and precomputes centroid-to-key scores so that decode-time selection becomes a fixed-size lookup plus reduce-by-key accumulation (Song et al., 30 Mar 2026). SPLA uses a second-order Taylor proxy of block attention mass based on per-block means $\mu_b$ and covariances $\Sigma_b$, schematically
$$\sum_{j \in b} e^{q^\top k_j} \approx |b|\, e^{q^\top \mu_b}\left(1 + \tfrac{1}{2}\, q^\top \Sigma_b\, q\right),$$
and then complements the selected blocks with residual linear attention over the unselected context (Wang et al., 29 Jan 2026).
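The pooled scoring and mass-based selection described for UniSparse can be illustrated with a small NumPy sketch. It assumes the pooling factor `c` equals the block size, omits head compression and causal masking, and uses hypothetical function names; it is meant only to show how a top-p criterion turns compressed-space attention into a block mask.

```python
import numpy as np

def avg_pool(X, c):
    """Average-pool a (n, d) sequence into n // c composite tokens."""
    n, d = X.shape
    assert n % c == 0, "sequence length must be divisible by the pooling factor"
    return X.reshape(n // c, c, d).mean(axis=1)

def top_p_block_mask(Q, K, c, p):
    """Return a boolean block mask and the compressed-space attention it came from."""
    Qc, Kc = avg_pool(Q, c), avg_pool(K, c)
    s = Qc @ Kc.T / np.sqrt(Q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)
    A = np.exp(s)
    A /= A.sum(axis=-1, keepdims=True)             # compressed-space attention
    order = np.argsort(-A, axis=-1)                # key blocks by descending mass
    sorted_mass = np.take_along_axis(A, order, axis=-1)
    cum = np.cumsum(sorted_mass, axis=-1)
    # Keep a block while the mass accumulated before it is still below p.
    keep_sorted = (cum - sorted_mass) < p
    mask = np.zeros_like(A, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask, A

rng = np.random.default_rng(0)
n, d, c = 64, 16, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
mask, A = top_p_block_mask(Q, K, c, p=0.9)         # (8, 8) block mask
```

Unlike a fixed top-k budget, the number of retained blocks here adapts per query block: peaked rows keep few blocks, flat rows keep many.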
These mechanisms differ in granularity. Token-level CSA maximizes selection precision but requires extremely cheap estimators and compressed codes to keep the full scan practical (Yan et al., 21 Oct 2025). Chunk-level and block-level CSA improve hardware efficiency by preserving contiguous memory access and reducing index overhead (Mao et al., 22 Apr 2026, Liu et al., 16 Dec 2025, Yuan et al., 16 Feb 2025). Representative-based methods such as CBSA compress all tokens into a small set of $m$ representatives and implement attention as a structured low-rank factorization built from token-to-representative and representative-to-representative interactions, which no longer materializes dense token-token interactions (Wen et al., 21 Sep 2025).
3. Principal design families
The current CSA literature spans several distinct but related constructions.
| Method | Compressed object | Sparse or hybrid execution |
|---|---|---|
| Adamas | Hadamard-domain 2-bit token codes | Token-level top-$k$, then exact attention (Yan et al., 21 Oct 2025) |
| SSA / H-SSA | Learned gist tokens and meta-gists | Query scores gists, unfolds top-$k$ chunks (Mao et al., 22 Apr 2026) |
| UniSparse | Composite tokens from sequence/head pooling | Top-$p$ block mask, block-sparse FlashAttention-style execution (Liu et al., 16 Dec 2025) |
| NSA | Learned compressed blocks plus local window | Compression branch, selected fine-grained blocks, gated fusion (Yuan et al., 16 Feb 2025) |
| CSAttention | Query-centric centroid tables | Table lookup, reduce-by-key, sparse exact attention (Song et al., 30 Mar 2026) |
| SPLA | Selected exact blocks plus recurrent residual state | Exact sparse attention plus residual linear attention (Wang et al., 29 Jan 2026) |
| CBSA | Representatives contracted from tokens | Representative attention and broadcast factorization (Wen et al., 21 Sep 2025) |
Token-level compressed surrogates prioritize fine ranking under extremely small budgets. Adamas is the clearest example: Hadamard smoothing redistributes variance, four-level bucketization with 2-bit packing yields ultra-compact codes, and a Manhattan-distance ($\ell_1$) estimator on packed integers supports token-level top-$k$ selection with small cache overhead. The method is explicitly training-free and model-agnostic (Yan et al., 21 Oct 2025).
Learned routing summaries train the model to write salient information into dedicated summary tokens. SSA inserts one gist token after each chunk and constrains later tokens to access prior chunks only through those gists; H-SSA recursively adds meta-gists after groups of gist-chunk pairs, enabling coarse-to-fine routing with logarithmic decoding complexity (Mao et al., 22 Apr 2026). This family shifts part of CSA from inference-time approximation to representation learning.
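The routing step of gist-based selection can be sketched as follows. In SSA the gist tokens are learned during continued pretraining; since no trained model is available here, the sketch substitutes per-chunk means of the keys and values as stand-in gists, which is purely an assumption for illustration. Function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gist_routed_attention(q, K, V, gist_K, gist_V, chunk, k):
    """Score gists, unfold the top-k chunks, attend over gists plus raw tokens."""
    d = q.shape[-1]
    gist_scores = gist_K @ q / np.sqrt(d)            # one score per chunk
    k = min(k, gist_K.shape[0])
    top = np.sort(np.argsort(-gist_scores)[:k])      # chunks to unfold
    # Raw-token indices belonging to the unfolded chunks.
    tok = np.concatenate([np.arange(c * chunk, (c + 1) * chunk) for c in top])
    keys = np.concatenate([gist_K[top], K[tok]])
    vals = np.concatenate([gist_V[top], V[tok]])
    w = softmax(keys @ q / np.sqrt(d))
    return w @ vals, top

rng = np.random.default_rng(0)
n, d, chunk, k = 32, 8, 4, 2
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
K[8:12] = 5.0 * q                                    # plant a relevant chunk (index 2)
# Stand-in gists: chunk means (assumption; real gists are learned tokens).
gist_K = K.reshape(n // chunk, chunk, d).mean(axis=1)
gist_V = V.reshape(n // chunk, chunk, d).mean(axis=1)
out, top = gist_routed_attention(q, K, V, gist_K, gist_V, chunk, k)
```

Only `n // chunk` gist keys are scored per step, and only `k * chunk` raw tokens are fetched, which is the source of the cache-residency savings discussed for this family.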
Block-sparse compressed evaluation emphasizes global selection quality together with GPU alignment. UniSparse performs compressed-space global evaluation and then broadcasts the selected mask back to full-resolution attention, while NSA combines compressed global context, sparse fine-grained selection of original blocks, and a sliding local window, with branch outputs fused by learned gates (Liu et al., 16 Dec 2025, Yuan et al., 16 Feb 2025). Both are hardware-conscious, but NSA is explicitly designed to be natively trainable end-to-end (Yuan et al., 16 Feb 2025).
Fixed-size lookup structures front-load computation into reusable indexes. CSAttention clusters queries offline in subspaces where 7 lives, precomputes centroid-to-key partial inner products, and at decode time replaces a full-context scan with nearest-centroid lookup and sparse score accumulation (Song et al., 30 Mar 2026). Lookahead Sparse Attention for FlashMemory-DeepSeek-V4 extends this indexing logic to GPU residency control: a neural memory indexer predicts which CSA chunks will be needed in the next 8 steps, keeps only those chunks in HBM, and lets the native Lightning Indexer perform finer-grained scoring inside the recalled set (Wang et al., 8 Jun 2026).
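A fixed-size lookup structure of the kind described for CSAttention might look roughly like the sketch below. It is a simplified illustration under several assumptions: centroids are sampled from the prefill queries instead of being produced by a clustering algorithm, each centroid keeps only its `top_l` highest-scoring keys, and all names and parameters are hypothetical.

```python
import numpy as np

def build_tables(Qs, K, n_sub, n_cent, top_l, rng):
    """Offline: per-subspace centroids and fixed-size centroid-to-key tables."""
    d = K.shape[1]
    assert d % n_sub == 0, "head dimension must split evenly into subspaces"
    w = d // n_sub
    tables = []
    for s in range(n_sub):
        sl = slice(s * w, (s + 1) * w)
        # Stand-in for clustering: sample prefill queries as centroids (assumption).
        cent = Qs[rng.choice(len(Qs), n_cent, replace=False), sl]
        partial = cent @ K[:, sl].T                     # (n_cent, n) partial inner products
        idx = np.argsort(-partial, axis=1)[:, :top_l]   # keep top_l keys per centroid
        val = np.take_along_axis(partial, idx, axis=1)
        tables.append((sl, cent, idx, val))
    return tables

def select_keys(q, tables, n_keys, k):
    """Online: nearest-centroid lookup, then reduce-by-key accumulation."""
    acc = np.zeros(n_keys)
    hit = np.zeros(n_keys, dtype=bool)
    for sl, cent, idx, val in tables:
        c = np.argmin(((cent - q[sl]) ** 2).sum(axis=1))  # nearest centroid in this subspace
        np.add.at(acc, idx[c], val[c])                    # sum partial scores per key
        hit[idx[c]] = True
    cand = np.flatnonzero(hit)                            # keys seen in at least one table row
    order = np.argsort(-acc[cand])[:k]
    return np.sort(cand[order])

rng = np.random.default_rng(0)
n, d = 128, 16
K = rng.standard_normal((n, d))
Qs = rng.standard_normal((32, d))                         # prefill queries
tables = build_tables(Qs, K, n_sub=4, n_cent=8, top_l=16, rng=rng)
q = rng.standard_normal(d)
sel = select_keys(q, tables, n_keys=n, k=8)
```

The decode-time cost depends on `n_sub * top_l`, not on the context length, which is the point of front-loading the work into tables. Selection quality then hinges on how well the centroids cover the decode-time query distribution.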
Exact-plus-compressed hybrids address a common weakness of pure sparsification: the long tail is often discarded. SPLA preserves the selected peaks with exact block-sparse attention while retaining the complement through residual linear attention computed, schematically, as
$$\mathrm{LA}_{\mathrm{res}}(q) = \mathrm{LA}_{\mathrm{all}}(q) - \mathrm{LA}_{\mathrm{sel}}(q),$$
that is, as the difference between linear attention over the whole context and linear attention over the selected blocks, so that unselected blocks need not be explicitly reloaded (Wang et al., 29 Jan 2026). This suggests a broader view of CSA in which compression is not merely a routing device but also a complementary memory of what was not selected.
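The reason a residual over unselected blocks can be obtained without touching those blocks is linearity: a linear-attention state is a sum of per-block states, so the complement is a subtraction. The sketch below checks this numerically. It assumes an unnormalized linear attention with an `elu(x) + 1` feature map, both chosen here for illustration and not confirmed as SPLA's exact formulation.

```python
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice for illustration)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def block_states(K, V, block):
    """Per-block linear-attention states: sum over the block of phi(k) v^T."""
    n, d = K.shape
    assert n % block == 0, "sequence length must be divisible by the block size"
    fk = phi(K).reshape(n // block, block, d)
    vb = V.reshape(n // block, block, V.shape[1])
    return np.einsum("bjd,bje->bde", fk, vb)

def residual_output(q, S, selected):
    """Unnormalised linear attention over the unselected blocks, by subtraction."""
    S_all = S.sum(axis=0)                  # in practice a running state, not a re-sum
    S_res = S_all - S[selected].sum(axis=0)
    return phi(q) @ S_res

rng = np.random.default_rng(0)
n, d, block = 32, 4, 8
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
q = rng.standard_normal(d)
S = block_states(K, V, block)
selected = np.array([1, 3])                # blocks handled by exact sparse attention
res = residual_output(q, S, selected)

# Direct computation over the unselected tokens (blocks 0 and 2) for comparison.
tok = np.concatenate([np.arange(0, 8), np.arange(16, 24)])
direct = phi(q) @ (phi(K[tok]).T @ V[tok])
```

Here `res` and `direct` agree up to floating-point error, while only the global state and the states of the already-loaded selected blocks are read.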
4. Systems, kernels, and memory behavior
A central theme in CSA is that asymptotic sparsity is insufficient without bandwidth-aware implementation. Several papers explicitly target the dominant systems bottleneck rather than only the algorithmic attention formula. Adamas emphasizes that the selection pass scans all prior keys, but it does so on 2-bit packed data with vectorized integer operations; the authors implement fused CUDA kernels for Hadamard transform, bucketization, compression, and the Manhattan-distance ($\ell_1$) estimator so that selection overhead is negligible relative to the dense floating-point path it replaces (Yan et al., 21 Oct 2025). UniSparse similarly packs its method into a drop-in library in which compression, softmax, aggregation, and Top-$p$ selection are fused, and the resulting mask is consumed by block-sparse FlashAttention-style kernels (Liu et al., 16 Dec 2025).
SSA and H-SSA shift the systems emphasis toward KV-cache residency. Only the compact gist cache must remain resident; raw K/V for selected chunks are fetched on demand. Prefill uses a key-column permutation and a block-sparse FlexAttention operator, while decoding uses a three-kernel sparse flash-decode path that materializes only the selected indices, never the full mask or full KV (Mao et al., 22 Apr 2026). NSA makes a related argument from arithmetic intensity: training and prefilling are compute-bound, whereas decoding is memory-bound, so blockwise contiguous access and group-shared selection under GQA or MQA are essential. Its Triton kernel loads all query heads in a GQA group, fetches shared sparse KV blocks sequentially, and performs attention accumulation on SRAM-resident tiles (Yuan et al., 16 Feb 2025).
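Group-shared selection under GQA can be illustrated with a short sketch: per-head block importance scores are pooled within each group so that every query head in the group fetches the same KV blocks. Aggregating by summation and the function name are assumptions made for this example.

```python
import numpy as np

def shared_block_selection(block_scores, group_size, k):
    """Pick one shared set of k KV blocks per GQA group.

    block_scores: (n_heads, n_blocks) importance of each KV block per query head.
    """
    h, nb = block_scores.shape
    assert h % group_size == 0, "heads must divide evenly into groups"
    # Pool scores over the heads of each group (summation is an assumed choice).
    grouped = block_scores.reshape(h // group_size, group_size, nb).sum(axis=1)
    k = min(k, nb)
    top = np.argsort(-grouped, axis=-1)[:, :k]
    return np.sort(top, axis=-1)           # ascending, so blocks are fetched in order

scores = np.zeros((4, 6))
scores[0, 5] = 3.0
scores[1, 1] = 2.0
scores[2, 0] = 1.0
scores[3, 2] = 4.0
scores[3, 0] = 0.5
sel = shared_block_selection(scores, group_size=2, k=2)   # shape (2 groups, 2 blocks)
```

Sharing the index set within a group means each KV block is loaded once per group instead of once per head, which matters when decoding is memory-bound.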
StreamIndex isolates a specific systems failure mode in DeepSeek-style CSA: the indexer itself can exceed HBM before sparse attention ever runs. In the reference path, a lightning indexer materializes a score tensor of shape 2, which at 3, 4, 5, 6, and FP32 occupies approximately 7 GiB, exceeding a single NVIDIA H200’s 8 GB HBM (Jaber et al., 4 May 2026). StreamIndex exploits the separability of the indexer score to perform chunked partition-merge top-9 without materializing the full intermediate, running the same indexer to 0 with 1 GB peak HBM (Jaber et al., 4 May 2026). The contribution is explicitly limited to the indexer step; it makes no claim of a faster attention kernel or real-checkpoint end-to-end behavior (Jaber et al., 4 May 2026).
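The idea of a chunked partition-and-merge top-k can be shown on a single query. The sketch below keeps only a running set of `k` winners while scanning the keys chunk by chunk, so the full score vector is never held at once. It is a simplified, hypothetical illustration of the general technique, not the StreamIndex kernel, which operates on a multi-dimensional indexer score tensor.

```python
import numpy as np

def full_topk(q, K, k):
    """Reference path: materialise all n scores, then take the top k."""
    scores = K @ q
    return set(np.argsort(-scores)[:k].tolist())

def streaming_topk(q, K, k, chunk):
    """Chunked path: score one chunk at a time and merge with the running winners."""
    best_s = np.empty(0)
    best_i = np.empty(0, dtype=np.int64)
    for start in range(0, K.shape[0], chunk):
        s = K[start:start + chunk] @ q                  # scores for this chunk only
        i = np.arange(start, start + s.shape[0])
        cand_s = np.concatenate([best_s, s])            # at most k + chunk candidates
        cand_i = np.concatenate([best_i, i])
        keep = np.argsort(-cand_s)[:k]
        best_s, best_i = cand_s[keep], cand_i[keep]
    return set(best_i.tolist())

rng = np.random.default_rng(0)
n, d, k = 1000, 16, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
same = streaming_topk(q, K, k, chunk=128) == full_topk(q, K, k)
```

Peak working memory in the chunked path scales with `k + chunk` instead of `n`, while the selected set matches the materialized reference whenever scores have no ties at the cutoff.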
Lookahead Sparse Attention extends the systems viewpoint from scoring cost to physical KV residency. FlashMemory-DeepSeek-V4 keeps HCA layers and the local 8K window permanently resident, but dynamically pages global CSA chunks between CPU and GPU according to a dual-encoder memory indexer that runs every 2 steps with threshold 3 (Wang et al., 8 Jun 2026). The result is not merely sparse computation but sparse memory presence, a distinction that becomes decisive at 4K to 5K contexts.
5. Empirical performance across tasks and modalities
CSA methods are typically evaluated on long-document summarization, single-document and multi-document QA, perplexity, retrieval stress tests, synthetic long-context benchmarks, and increasingly multimodal workloads. On LongBench, PG19, and passkey retrieval, Adamas reports that with token budgets as small as 6 it matches the accuracy of full attention across evaluated tasks, that at 7 it is near-lossless, that it supports up to 8 higher sparsity than prior state-of-the-art sparse methods, and that it delivers up to 9 self-attention and 0 end-to-end speedups on 32K sequences; on PG19 with LongChat-7b-v1.5-32k at 32K length, its perplexity closely tracks and can be lower than full attention as 1 increases (Yan et al., 21 Oct 2025).
SSA reports a different empirical pattern because it changes the representation during continued pretraining. Under the same compression ratio on LongBench with Qwen2-7B-Instruct, SSA scores 2 at 3 compression versus ActivationBeacon 4 and UniGist 5, near Full-PT 6; in retrieval-augmented generation with Llama-3.2-1B, SSA at 7 compression after continued pretraining reaches 8 average, over 9 points better than KVLink 0 and UniGist 1, and even higher than the vanilla model 2 and Full-PT 3. The paper also reports end-to-end decoding speedups versus Flash-Decoding up to 4 at 5K tokens for SSA and 6 for H-SSA, with compression ratios explored up to 7 (Mao et al., 22 Apr 2026).
UniSparse emphasizes near-full-accuracy block sparsification and multimodal breadth. On HELMET with Llama-3.1-8B-Instruct, UniSparse-0.95 achieves 8 versus 9 for FlashAttention, approximately 0, and on RULER 1 versus 2, approximately 3, at approximately 4 sparsity. On Video-MME with Qwen2.5-VL-7B-Instruct and subtitles, UniSparse-0.9 achieves 5 overall with 6 sparsity and even surpasses FlashAttention in some settings, while end-to-end attention is reported as up to 7 faster than FlashAttention at 8K (Liu et al., 16 Dec 2025).
Training-based CSA variants often report benefits beyond inference acceleration. NSA, pretrained with a 27B-parameter backbone, reports a LongBench average of 9 versus Full 0, perfect 64k needle-in-a-haystack retrieval, and Triton-kernel speedups up to 1 forward and 2 backward at 3k, with expected decoding speedup up to 4 at 5k (Yuan et al., 16 Feb 2025). SPLA reports that it closes the performance gap in continual pretraining and surpasses dense attention models on RULER, with scores of 6 versus dense 7 at 8k and 9 versus dense 0 at 1k; the ablation SPA, which removes residual linear attention, collapses beyond 2k, indicating that compressing the long tail rather than discarding it is crucial (Wang et al., 29 Jan 2026).
CSAttention and FlashMemory-DeepSeek-V4 show that CSA can also improve serving performance under reusable or ultra-long contexts. CSAttention reports near-identical accuracy to full attention under 3 sparsity and up to 4 inference speedup over the most accurate baseline at a context length of 5K (Song et al., 30 Mar 2026). FlashMemory-DeepSeek-V4 reports that across LongBench-v2, LongMemEval, and RULER the average physical KV cache footprint is reduced to 6 of the DS-V4-Flash baseline while average accuracy increases from 7 to 8; on LongBench-v2-L at 9K, memory drops from 00 GB to 01 GB and accuracy increases from 02 to 03 (Wang et al., 8 Jun 2026). StreamIndex complements these results at the kernel level by showing bit-exact set parity with the materialized DeepSeek indexer where both fit in memory and a 04 regime extension in sequence length (Jaber et al., 4 May 2026).
6. Limitations, trade-offs, and unresolved problems
CSA does not eliminate all long-context costs. Several methods still scan the entire compressed history at selection time. Adamas explicitly notes that per-step selection still scans all prior keys, though on 2-bit data with coalesced loads, and that the main savings come from avoiding dense floating-point $QK^\top$ scoring and value aggregation over all keys (Yan et al., 21 Oct 2025). UniSparse reports that it works best for prefill and keeps dense decode, maintaining the full KV cache during decoding (Liu et al., 16 Dec 2025). StreamIndex shows that even when the final sparse attention is efficient, the upstream indexer can be the dominant memory bottleneck if naively materialized (Jaber et al., 4 May 2026).
Another trade-off is where the method pays its cost. Training-free inference plugins such as Adamas, UniSparse, CSAttention, and FlashMemory are attractive because they do not require retraining or fine-tuning, but they rely on the fidelity of compressed scoring proxies or offline-built tables (Yan et al., 21 Oct 2025, Liu et al., 16 Dec 2025, Song et al., 30 Mar 2026, Wang et al., 8 Jun 2026). SSA, H-SSA, and NSA instead learn or co-train the compressed representation, which can improve routing quality but introduces continued pretraining or architecture-specific training pipelines (Mao et al., 22 Apr 2026, Yuan et al., 16 Feb 2025). CBSA frames efficient attention as an optimization-driven representative contraction, but its approximation quality depends on representative choice and on the softmax replacement of the exact inverse in the contraction step (Wen et al., 21 Sep 2025).
Failure modes are also method-specific. Adamas notes that very short sequences may not benefit because dense attention overhead is already small (Yan et al., 21 Oct 2025). UniSparse warns that very fine-grained long-range dependencies or rare token interactions can be smoothed out by compression, especially at large compression factors or with head compression (Liu et al., 16 Dec 2025). SSA is sensitive to chunking and compression budgets, and H-SSA can be modestly slower during prefill at short contexts (Mao et al., 22 Apr 2026). FlashMemory reports that MRCR remains difficult: accuracy drops from 07 for the baseline to 08, and even oracle selection with the top 09 of golden chunks is insufficient, indicating a dense global memory dependency beyond the present indexer capacity (Wang et al., 8 Jun 2026).
A broader conceptual divide remains between selection fidelity and tail preservation. Pure sparse selection can miss critical context, while pure compression can wash out sharp peaks. SPLA’s results suggest that exact sparse attention plus a compressed residual is one principled way to reconcile these objectives (Wang et al., 29 Jan 2026). A plausible implication is that future CSA systems will combine multiple mechanisms already present in the literature: low-bit or pooled surrogates for cheap global scoring, hardware-aligned block or token selection, and a complementary state that preserves the contribution of discarded context.