StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Abstract: DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer to S=1,048,576 with 6.21 GB peak HBM, a 32x regime extension. Set-overlap recall against the materialize ground truth is bit-exact at small S where both fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 with min recall at least 0.9980 in every cell. The chunked driver composes with TileLang's pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs while the chunked indexer paired with the same attention runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: https://github.com/RightNow-AI/StreamIndex.
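The 256 GB figure follows directly from the tensor shape. A quick check, assuming batch size B=1 and a compressed length T = S/m (neither is spelled out in the abstract, and the function name is purely illustrative):

```python
def materialized_score_bytes(S, H_I=64, m=4, B=1, bytes_per_elem=4):
    """Bytes needed for a full [B, S, H_I, T] FP32 score tensor.

    Assumptions (not stated in the abstract): T = S // m and B = 1.
    """
    T = S // m  # number of compressed keys
    return B * S * H_I * T * bytes_per_elem


size = materialized_score_bytes(65_536)
print(size / 2**30)  # 256.0 (GiB)
```

Because T grows with S, the tensor grows quadratically: doubling S quadruples the intermediate.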
Explain it Like I'm 14
What this paper is about (big picture)
Imagine you’re trying to read a book with a million pages and answer a question about page 1. A regular (dense) attention system tries to compare page 1 with every other page, which is far too much work and memory. Compressed Sparse Attention (CSA) speeds this up by first asking a small “indexer” which pages are most likely to matter, then only reading those pages closely.
The problem: most public CSA code builds a huge “score table” (how relevant each page is) before picking the top results. That table gets gigantic and doesn’t fit on a single GPU for very long texts.
This paper introduces StreamIndex, a way to compute those scores in small pieces (chunks) and keep only the best results as you go, so you never create the huge table. It lets the indexer step work at million‑token lengths on one GPU without running out of memory.
The main questions the paper asks
- Can we avoid building the massive score table in CSA’s indexer step?
- Can we pick the top‑k best matches “on the fly” (streaming) without changing the final answer?
- Does this make long contexts (like 1,000,000 tokens) practical on a single high‑end GPU?
- Will this plug into existing attention kernels without breaking anything?
- How accurate is it compared to the standard (materialized) method?
How their method works (plain language with an analogy)
Think of a talent show with millions of contestants and a judge who only needs the top 512. You don’t write every score on a giant billboard. Instead, you:
- Watch contestants in small groups (chunks).
- Keep a small “top‑512” scoreboard that you update whenever someone new beats the lowest score on the board.
- Move to the next group and repeat.
- At the end, you have the same top 512 you would have gotten if you had written down every score—but you never had to store them all at once.
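The scoreboard idea can be sketched in a few lines of Python. This is only an illustration of the analogy using the standard-library `heapq` module; the function name and the toy scores are made up, and the real system does this on the GPU.

```python
import heapq


def streaming_top_k(score_groups, k):
    """Keep the k best (score, contestant_id) pairs while scanning groups."""
    board = []  # min-heap: board[0] is the weakest entry still on the board
    for group in score_groups:
        for contestant_id, score in group:
            entry = (score, contestant_id)
            if len(board) < k:
                heapq.heappush(board, entry)
            elif entry > board[0]:
                # Newcomer beats the lowest score on the board: swap them.
                heapq.heapreplace(board, entry)
    return sorted(board, reverse=True)


groups = [[(0, 7.5), (1, 9.1)], [(2, 3.3), (3, 8.8)], [(4, 9.7)]]
print(streaming_top_k(groups, k=3))
# [(9.7, 4), (9.1, 1), (8.8, 3)]
```

At no point does the function hold more than k entries plus the group it is currently looking at.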
That’s exactly what StreamIndex does for CSA’s indexer:
- It breaks the work into tiles of queries (the “questions” tokens ask) and compressed keys (the “candidate pages”).
- For each tile, it computes scores on the GPU, applies the “no peeking into the future” rule (causal mask), takes the local top‑k, and merges them into a running top‑k buffer.
- Because you only ever hold a small tile of scores in memory, you avoid the giant intermediate that causes out‑of‑memory errors.
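The tile loop above can be sketched in plain Python and checked against a "compute everything first" reference. This is a toy under stated assumptions, not the paper's Triton code: the scoring formula (a weighted sum of ReLU dot products over heads), the causal rule for compressed keys, and all function names are assumptions made for illustration.

```python
import heapq
import random


def score(q_row, w_row, key):
    """Assumed toy indexer score: sum over heads of w_h * ReLU(q_h . key)."""
    total = 0.0
    for q_h, w_h in zip(q_row, w_row):
        dot = sum(a * b for a, b in zip(q_h, key))
        total += w_h * max(dot, 0.0)
    return total


def is_legal(s, t, m):
    """Assumed causal rule: compressed key t summarizes tokens
    t*m .. (t+1)*m - 1 and is visible once its last token is <= s."""
    return (t + 1) * m - 1 <= s


def materialized_top_k(q, w, keys, k, m):
    """Reference path: score every legal key, then take the top-k per query."""
    out = []
    for s in range(len(q)):
        row = [(score(q[s], w[s], keys[t]), t)
               for t in range(len(keys)) if is_legal(s, t, m)]
        out.append({t for _, t in heapq.nlargest(k, row)})
    return out


def chunked_top_k(q, w, keys, k, m, c_S, c_T):
    """Streaming path: only one c_S x c_T tile of scores exists at a time."""
    S, T = len(q), len(keys)
    running = [[] for _ in range(S)]  # per-query running top-k buffer
    for s0 in range(0, S, c_S):
        for t0 in range(0, T, c_T):
            for s in range(s0, min(s0 + c_S, S)):
                tile = [(score(q[s], w[s], keys[t]), t)
                        for t in range(t0, min(t0 + c_T, T))
                        if is_legal(s, t, m)]  # mask built per tile
                local = heapq.nlargest(k, tile)  # local top-k
                running[s] = heapq.nlargest(k, running[s] + local)  # merge
    return [{t for _, t in buf} for buf in running]


random.seed(0)
S, m, H, d, k = 32, 4, 2, 4, 3
T = S // m
q = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(H)] for _ in range(S)]
w = [[random.random() for _ in range(H)] for _ in range(S)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

reference = materialized_top_k(q, w, keys, k, m)
streamed = chunked_top_k(q, w, keys, k, m, c_S=8, c_T=3)
print(streamed == reference)  # True
```

The two paths agree because both rank `(score, index)` pairs under the same total order, so splitting the keys into tiles cannot change which k pairs come out on top. Early queries with fewer than k legal keys simply return a smaller set.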
Key ideas explained simply:
- Top‑k: “Keep only the best k items so far.”
- Streaming top‑k: “Process in batches, update the best‑so‑far list, never store everything.”
- Causal masking: “Only look at past pages, not future ones.” They build this mask per tile, not for the entire sequence, saving lots of memory.
- Triton: A language for writing fast GPU kernels (think “hand‑tuned GPU code” without having to write raw CUDA).
What they found and why it matters
- No more huge score table: The usual CSA indexer builds an intermediate of size [batch × sequence × heads × compressed‑sequence], which can be hundreds of gigabytes. StreamIndex never builds it, so memory stays small.
- Big jump in maximum length: On an NVIDIA H200 GPU:
  - The standard “materialize everything” path runs out of memory at 65,536 tokens.
  - StreamIndex runs up to 1,048,576 tokens (about 1 million) with only about 6.21 GB peak memory for the indexer step.
  - That’s a 32× extension in the usable sequence length for the indexer, counted from the longest length the standard path can still handle.
- Accuracy matches the standard method: Where both methods fit in memory (shorter sequences), StreamIndex’s chosen top‑k set is the same as the materialized version (bit‑exact set match in tests).
- Plays nicely with existing attention: When both are paired with a production‑quality sparse attention kernel (TileLang) at 262,144 tokens, the standard indexer runs out of memory, while StreamIndex finishes in about 1.97 seconds with ~18.56 GB peak memory.
- Tuning tips:
  - Larger key tiles (processing more candidates per chunk) are faster when memory allows.
  - There’s a “sweet spot” for the query tile size around 2,048.
  - Using FP32 (single precision) for score accumulation is needed for exact‑match results; FP16 is faster and uses less memory but can drop a few correct entries (a trade‑off some might accept in practice).
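One way to see why lower precision can change the selected set: two scores that are distinct in FP32 can round to the same FP16 value, so their ranking becomes a coin flip. The sketch below uses Python's `struct` module to round to each format; the scores are invented, and rounding a finished score is only a simplified stand-in for error that builds up during accumulation.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


a, b = 1000.1, 1000.2  # two made-up scores close to each other

print(to_fp32(a) < to_fp32(b))  # True: FP32 still tells them apart
print(to_fp16(a), to_fp16(b))   # 1000.0 1000.0: FP16 cannot
```

Near 1000, neighboring FP16 values are 0.5 apart, so any two scores inside the same half-unit bucket tie.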
Why this is important: It shifts the bottleneck. Lots of work has optimized the attention step, but this shows the indexer step was the real memory gate for CSA at long context. Fixing the indexer unlocks much longer inputs without changing the model itself or the downstream attention kernel.
How they tested it (in simple terms)
- They used realistic synthetic data (numbers shaped and scaled like the real model’s outputs) because the full DeepSeek V4 checkpoints are too big to fit on a single GPU.
- They checked:
  - Do the streaming and materialized versions pick the same top‑k keys at sizes where both can run?
  - How far can each method go in length before running out of memory?
  - How fast and memory‑heavy is StreamIndex across different chunk sizes and top‑k values?
  - Does StreamIndex work with a widely used attention kernel (TileLang)?
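The "same top‑k keys" check is a set comparison. A minimal sketch of such a recall metric follows; the function name and the use of -1 as a padding marker are assumptions, not the paper's API.

```python
def set_overlap_recall(selected, reference, sentinel=-1):
    """Fraction of reference indices that also appear in `selected`.

    Padding entries (assumed here to be -1) are ignored on both sides,
    and order within each list does not matter.
    """
    ref = {i for i in reference if i != sentinel}
    if not ref:
        return 1.0  # nothing to recover, so nothing was missed
    sel = {i for i in selected if i != sentinel}
    return len(sel & ref) / len(ref)


print(set_overlap_recall([5, 2, 9, 7], [2, 5, 7, 9]))  # 1.0  (same set, different order)
print(set_overlap_recall([5, 2, 9, 1], [2, 5, 7, 9]))  # 0.75 (one key missed)
print(set_overlap_recall([3, -1, -1], [3, -1, -1]))    # 1.0  (padding ignored)
```

A recall of 1.0 means the two methods picked exactly the same keys, even if they list them in a different order.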
Takeaways and impact
- StreamIndex removes the memory roadblock in CSA’s indexer step, enabling million‑token contexts on one GPU where the standard approach fails.
- It doesn’t change the model’s math—just how the top‑k is computed—so it can be dropped into existing pipelines.
- It complements, rather than replaces, other attention optimizations (like FlashAttention or paged KV caches).
- This makes long‑document tasks (big codebases, long reports, extensive chats) more practical.
What’s next:
- End‑to‑end tests with actual model weights and benchmarks (QA and retrieval over long documents).
- Multi‑GPU setups and possibly fusing the indexer and attention steps for even better performance.
- Validation on different hardware (H100, A100, MI300X).
In short: The paper shows a simple but powerful idea—streaming top‑k—that fixes the indexer memory spike in CSA. That unlocks very long contexts with high accuracy, using hardware people already have.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper leaves several concrete questions and gaps that future work could address:
- End-to-end validation with real checkpoints
  - No experiments with actual DeepSeek-V4/V3.2 weights; it remains unknown whether the chunked indexer preserves end-to-end task quality (perplexity, RULER/LongBench retrieval, SWE-bench-Verified, etc.) at long context compared to materialize.
  - Unclear whether set-level equivalence of selected indices is sufficient in practice for downstream attention across diverse tasks and prompts.
- Input distribution realism
  - All parity and scaling tests use synthetic-but-“spec-matched” distributions; selection stability and numerical behavior under real hidden-state distributions from trained checkpoints are unverified.
  - How robust is the method to model- and layer-specific score distributions (e.g., more ties, heavier tails, clustered scores)?
- Deterministic tie-breaking and reproducibility
  - The pipeline relies on torch.topk with unspecified tie-breaking; no deterministic comparator is implemented. This leaves open deterministic reproducibility across hardware/backends and strict equivalence to a defined order (beyond set parity).
  - How often do ties occur under real quantized or low-precision regimes (e.g., FP8/FP4 paths), and do they meaningfully impact selection consistency?
- Precision trade-offs
  - FP16 accumulation meaningfully degrades perfect-recall rows (to ~92%); BF16 or mixed-precision alternatives were not evaluated. What precision configurations minimize recall loss while preserving most of the speed/memory gains?
  - No analysis of sensitivity to quantization noise in q/K/weights (e.g., FP4/FP8), especially at 1M context, where small numerical errors could accumulate.
- Generality of masking and attention variants
  - The per-chunk causal mask is implemented; it is unclear how the approach generalizes to non-causal/bidirectional attention, cross-attention, or mixed patterns (e.g., prefix/sink tokens, sliding windows).
  - Integration with the “Window(t)” component (TopK ∪ Window) is not detailed; memory and performance impact of adding fixed/dynamic windows to the streaming top‑k path is unreported.
- Throughput in prefill vs. decoding and incremental updates
  - The method targets the indexer in a batch prefill regime; how to maintain and update per-query top‑k incrementally during token-by-token decoding without recomputing from scratch remains unexplored.
  - What is the latency impact in online decoding settings, and can the running buffers be reused efficiently across steps?
- Batch size and shape scalability
  - All results appear to use B=1; scaling behavior with larger batches, heterogeneous sequence lengths within a batch, and padding/ragged batching is not reported.
  - Sensitivity to head count H_I, head dim d_h, and wider k ranges (e.g., k≫1024) is limited to V4-Flash/Pro‑like settings.
- Hardware portability and multi-GPU
  - Validation is limited to a single H200; performance/peak-memory behavior on H100, A100, MI300X, and consumer GPUs is untested.
  - Multi-GPU composition (TP/PP/SP) and interaction with sequence/tensor parallel ring/pipeline schemes are not evaluated; how to partition chunks across devices to maximize overlap and bandwidth remains open.
- Integration with serving stacks
  - No integration/benchmarks with production inference stacks (e.g., vLLM, TensorRT-LLM) that manage KV paging, continuous batching, and heterogeneous workloads; the scheduling and memory implications of chunked indexer kernels under such systems are unknown.
  - Auto-detection threshold (1 GB) and chunk-size heuristics are simplistic; an online autotuner that adapts c_S/c_T/k to device memory and workload dynamics is not provided.
- Driver and kernel engineering
  - The Python driver introduces launch overhead for small c_S/c_T; a C++/CUDA/Triton-fused driver or single-kernel design to remove per-tile Python overhead is not attempted.
  - Theoretical and empirical analysis of the merge step’s complexity (e.g., O(c_T log k) per tile) and its optimization (heapless selection, radix/K‑select, warp-level top‑k) is not provided.
- Memory traffic and IO analysis
  - The paper asserts similar total HBM IO as materialize but does not directly measure bandwidth, cache hit rates, or DRAM traffic. A precise IO profile (including L2/L1 utilization and overlap with compute) is missing.
  - Potential benefits of fusing indexer and attention to reuse K_C loads are acknowledged but unimplemented; the achievable end-to-end bandwidth savings and kernel scheduling strategies remain open.
- Training-time applicability
  - Only inference is considered; viability for training (forward+backward) with streaming top‑k (e.g., gradients through ReLU+weights and index selection) is unexplored.
  - Stability and efficiency of backward pass with chunked tiles and top‑k selection are untested, as are memory checkpoints/recomputation strategies.
- Robustness across compression and sparsity regimes
  - The approach is evaluated for m=4 and k in {512, 1024}; behavior for other compression ratios, larger/smaller k, adaptive k per query, or alternative compressed-key schemes is not characterized.
  - Failure modes when c_T<k are partially ablated (skip is harmful), but broader guidance for extreme ratios (e.g., very small or very large c_T vs. k) and nonuniform k across queries is limited.
- Full-pipeline scaling to 1M with production attention
  - Composition with TileLang attention is shown up to S=262k; a full pipeline run at S≈1M with a production-quality attention kernel is not demonstrated. Peak memory and latency at 1M in the full pipeline remain unknown.
- Determinism and cross-version stability
  - The reliance on PyTorch/Triton versions and their top‑k implementations may affect results; determinism across software versions, kernels, and compiler flags is not established or tested.
- Error analysis and selection drift
  - Beyond recall statistics, there is no analysis of which keys get displaced under FP16 or different chunkings, and how such errors impact downstream attention weights (e.g., missed “heavy hitters” vs. marginal keys).
  - No sensitivity study on the impact of chunk ordering permutations, despite the theoretical invariance (practical differences could arise from floating-point non-associativity and tie behavior).
- Support for heterogeneous masks and constraints
  - Real-world deployments often combine causal masks with document boundaries, prefix tokens, or blockwise legal ranges; general APIs to program such constraints in the per-tile mask and their performance impact are not described.
- Benchmarking breadth and fairness
  - No comparisons to alternative indexer elimination/replacement approaches (e.g., NSA) in terms of memory, speed, and quality trade-offs for long contexts; a broader landscape evaluation is absent.
- Release maturity and usability
  - Guidance for choosing c_S/c_T under diverse memory budgets, batch sizes, and k values is heuristic; a principled tuner or ruleset is not provided.
  - The auto-detection logic for switching between materialize and chunked paths could be improved with cost models; current policy may be suboptimal in borderline regimes.
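For orientation, a threshold-based switch of the kind criticized above might look like the sketch below. It is a guess at the policy's shape, not the released code: the function name, the assumption that the 1 GB threshold is compared against the size of the full FP32 score tensor, and the `<=` comparison are all hypothetical.

```python
GIB = 2**30


def choose_indexer_path(S, H_I, m, B=1, threshold_bytes=1 * GIB):
    """Pick 'materialize' when the full [B, S, H_I, T] FP32 score tensor
    fits under the threshold, otherwise 'chunked'.

    Assumptions: T = S // m, and the threshold applies to the score
    tensor alone (inputs, outputs, and workspace are ignored).
    """
    T = S // m
    full_bytes = B * S * H_I * T * 4
    return "materialize" if full_bytes <= threshold_bytes else "chunked"


print(choose_indexer_path(S=2_048, H_I=64, m=4))   # materialize (256 MiB)
print(choose_indexer_path(S=65_536, H_I=64, m=4))  # chunked (256 GiB)
```

A cost model would replace the single byte comparison with an estimate of run time for each path, which matters most for sequence lengths sitting just on either side of the threshold.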
Practical Applications
Immediate Applications
The following applications can be deployed now using the open-source StreamIndex implementation, primarily in settings that already use Compressed Sparse Attention (CSA) or are prototyping DeepSeek-V3.2/V4-style indexers.
- Memory-safe long-context CSA indexer for single-GPU inference (Software/AI infrastructure)
  - What it enables: Run the CSA indexer to 1M tokens on a single GPU without OOM by avoiding the [B,S,H_I,T] FP32 intermediate; bit-exact set parity with materialized top-k at small S. The paper validates this on an H200 only; H100, A100, and MI300X-class GPUs are plausible targets but untested.
  - Tools/workflows: Drop-in indexer replacement in CSA pipelines; compose with TileLang’s pipelined attention; auto-detect path (materialize if fits, chunked otherwise). Deployment recipe: set c_T as large as memory allows (ideally single T-tile), c_S≈2048 knee, FP32 score accumulation for exactness; optionally FP16 for throughput with small recall drift.
  - Assumptions/dependencies: Your model uses V4-style CSA and lightning indexer; end-to-end weights may still require TP/offloading; tested parity uses synthetic-but-realistic inputs; tie-breaking is set-level, not order-level.
- Longer-context RAG, summarization, and QA without model changes (Enterprise software, Media, Legal, Finance)
  - What it enables: Practical 256K–1M token contexts for document-heavy use cases (e.g., 10-K/annual reports, legal contracts, earnings calls, compliance logs, meeting transcripts, long-form articles) on a single high-memory GPU by fixing the indexer bottleneck.
  - Tools/workflows: Replace indexer in CSA-based deployments; integrate with existing vLLM/TGI-like serving stacks via a custom operator; enable “whole-document/whole-session” summarization and QA without chunk-fragmentation heuristics.
  - Assumptions/dependencies: Your LLM/checkpoint supports CSA (or a compatible indexer); attention kernel accepts per-query top-k indices; end-to-end throughput depends on attention and KV-cache kernels.
- Whole-repository/codebase assistants with million-token windows (Software engineering)
  - What it enables: Index and attend across entire monorepos in a single pass for navigation, refactoring, or cross-file reasoning by removing indexer OOMs at 64K+ context.
  - Tools/workflows: CSA-based code assistants with StreamIndex-backed indexer; batch- or session-level long-context operations (e.g., test failures + logs + code) without manual paging.
  - Assumptions/dependencies: CSA model weights available; GPU VRAM still must host weights or use offloading/TP; I/O patterns may dominate in practice.
- Healthcare and life sciences document analysis at scale (Healthcare)
  - What it enables: Long EHR narratives, multi-visit chart reviews, clinical trial protocols, or literature packs in a single context window with CSA.
  - Tools/workflows: Deploy StreamIndex in on-prem inference for privacy-sensitive data; use exact FP32 mode to avoid selection drift.
  - Assumptions/dependencies: Regulatory constraints require on-prem hardware and vetted models; CSA-equipped medical LLMs or adapters are needed.
- Customer support and contact-center analytics over entire conversation histories (Customer experience)
  - What it enables: End-to-end conversation+knowledge base windows at 100K–1M tokens for escalation prediction, QA, and resolution summarization.
  - Tools/workflows: StreamIndex indexer + existing sparse attention; simplifies workflow by eliminating manual windowing and sink-token hacks for long histories.
  - Assumptions/dependencies: SLA-sensitive deployments must validate latency; may need FP16 mode trade-off for throughput with small recall imperfections.
- Academic benchmarking and ablation at long context (Academia/Research)
  - What it enables: Reproduce/extend long-context experiments (needle-in-a-haystack, RULER, LongBench) without OOM at the indexer step; study k, c_S, c_T, and precision trade-offs.
  - Tools/workflows: Use open-source code for design sweeps, ablations, and kernel-level comparisons; integrate with TileLang attention to isolate indexer effects.
  - Assumptions/dependencies: Results in the paper use synthetic distributions; real-checkpoint validation is recommended.
- Cost and capacity optimization for GPU fleets (Cloud/DevOps)
  - What it enables: Fit longer contexts on fewer/more affordable GPUs by reducing peak HBM from O(S·H_I·T) to O(c_S·c_T); delay multi-GPU scale-out for long-context trials.
  - Tools/workflows: Capacity planning with StreamIndex-enabled CSA; auto-tune c_T up to the memory budget and pick c_S around the performance knee (~2048).
  - Assumptions/dependencies: Total I/O remains similar; small-S jobs may be marginally slower than materialized indexer but auto-detection mitigates this.
- Framework/plugin integration (Ecosystem enablement)
  - What it enables: A CSA-indexer operator for frameworks (e.g., vLLM, TensorRT-LLM, Hugging Face Transformers, OpenAI Triton/TorchInductor backends).
  - Tools/workflows: Package StreamIndex as a pip installable kernel extension; expose runtime knobs (k, c_S, c_T, precision) and auto-tuner.
  - Assumptions/dependencies: Kernel portability to ROCm/MI300X may require minor adaptations; CI for multiple GPUs/driver stacks.
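The deployment recipe in the first item (fix c_S near 2048, then make c_T as large as memory allows) can be turned into a small helper. This is a rough sketch under a strong simplifying assumption: only one c_S × c_T FP32 score tile counts against the budget, matching the O(c_S·c_T) figure quoted in this list. The function name and the budget parameter are hypothetical, and a real tuner would also account for inputs, the running top-k buffers, and kernel workspace.

```python
def pick_tiles(T, budget_bytes, c_S=2048, bytes_per_score=4):
    """Return (c_S, c_T) with the largest c_T whose score tile fits the budget.

    Assumption: peak tile memory is c_S * c_T * bytes_per_score.
    c_T is capped at T, which corresponds to a single key tile.
    """
    max_c_T = budget_bytes // (c_S * bytes_per_score)
    if max_c_T < 1:
        raise ValueError("budget too small for even one key per tile")
    return c_S, min(T, max_c_T)


T = 1_048_576 // 4  # compressed keys at S = 1M with m = 4

print(pick_tiles(T, budget_bytes=1 * 2**30))  # (2048, 131072)
print(pick_tiles(T, budget_bytes=4 * 2**30))  # (2048, 262144), a single key tile
```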
Long-Term Applications
These opportunities need further research, engineering, or ecosystem alignment (e.g., model weights, training support, or fused kernels).
- Production-grade 1M-token CSA inference with multi-GPU weight sharding (Software/AI infrastructure)
  - What it enables: True end-to-end long-context serving with StreamIndex indexer + optimized sparse attention + TP/offloading for large checkpoints.
  - Potential products/workflows: Enterprise LLM endpoints offering 512K–1M context SLAs; “long-context mode” in serving platforms with automatic chunk scheduling.
  - Assumptions/dependencies: Distributed weight/activation management; latency budgets that accommodate chunked merges; full-pipeline validation on real tasks.
- Fused indexer-attention kernels for shared K/V loads (Compiler/Systems)
  - What it enables: Reduce memory traffic by jointly computing indexer scores and downstream sparse attention over selected keys within a single CUDA pipeline.
  - Potential products/workflows: FA3-style warp-specialized fused kernels; Triton/CUDA library primitives for streaming top-k + attention; improved end-to-end throughput.
  - Assumptions/dependencies: Kernel/compiler advances (warp specialization patterns), careful scheduling, and accuracy validation.
- CSA training with streaming indexer (Training systems/Academia)
  - What it enables: Train or fine-tune CSA-based models with longer contexts without prohibitive activation peaks in the indexer step.
  - Potential products/workflows: Curriculum strategies that grow S during training; mixed-precision accumulation policies; checkpointing+recompute around chunked indexer.
  - Assumptions/dependencies: Backprop-safe streaming top-k implementations; optimizer/gradient memory planning; empirical stability vs materialized baselines.
- Hardware-agnostic support and standardized APIs (Ecosystem/Policy & Standards)
  - What it enables: Cross-vendor kernels (CUDA/ROCm/oneAPI) and a standard API for memory-bounded CSA indexers, improving portability and procurement flexibility.
  - Potential products/workflows: ONNX/TensorRT-LLM/TVM ops for streaming top-k indexers; MLPerf-style long-context benchmarks that include indexer memory metrics.
  - Assumptions/dependencies: Vendor collaboration; reproducible tie-breaking semantics if order-level determinism is required for certification.
- Edge/PC long-context assistants (Consumer/Edge AI)
  - What it enables: Long-context LLM features on high-end desktops/workstations or edge servers with limited HBM by keeping indexer peaks low.
  - Potential products/workflows: Offline assistants that digest large local corpora (knowledge bases, project archives) with reduced memory requirements.
  - Assumptions/dependencies: Smaller CSA-capable models or quantized checkpoints; storage/CPU-GPU bandwidth constraints may dominate.
- Domain-specific long-context verticals (Healthcare, Legal, Finance, Scientific R&D)
  - What it enables:
    - Healthcare: Cohort-level longitudinal reasoning across multi-visit EHRs.
    - Legal: Full deal room analysis, contract families, and negotiation histories.
    - Finance: Multi-year filings, call transcripts, and regulatory dossiers.
    - Science/Engineering: Cross-paper literature synthesis with methods, code, and data supplements.
  - Potential products/workflows: “Whole-context” vertical copilots that avoid chunk heuristics; compliance/audit trails based on exact key selections.
  - Assumptions/dependencies: Data governance and privacy controls; domain-tuned CSA models; throughput optimizations for production loads.
- Energy- and cost-aware inference policy (Policy/Operations)
  - What it enables: Operational policies that prefer memory-bounded indexers to reduce peak HBM requirements, enabling denser GPU sharing and lower energy per request.
  - Potential products/workflows: Scheduler knobs that trade FP32 vs FP16 indexer precision under SLOs; admission control that selects chunk sizes based on load and memory.
  - Assumptions/dependencies: Fleet telemetry and SLO-aware auto-tuners; acceptance of minor recall deviations in FP16 modes for certain workloads.
- Generalized streaming top-k operators beyond CSA (Search/Retrieval, Vision, Multimodal)
  - What it enables: Apply partition-merge streaming top-k to other separable scoring stages (e.g., page/segment ranking before attention, video/text chunk ranking).
  - Potential products/workflows: Retrieval front-ends that maintain top-k candidates online over large corpora; multimodal long-sequence attention with streamed index selection.
  - Assumptions/dependencies: Score separability across keys/chunks; careful masking semantics; validation of recall/quality vs monolithic baselines.
- Co-design with learned sparsity (NSA/MInference/Quest hybrids) (Research)
  - What it enables: Combine streaming top-k with learned/dynamic sparsity to further lower memory or improve accuracy-speed trade-offs.
  - Potential products/workflows: Controllers that modulate k or chunk sizes on-the-fly; hybrid indexers that fall back to learned patterns when T is tiny.
  - Assumptions/dependencies: New training objectives; calibration to avoid recall loss; added complexity in serving stacks.
Cross-cutting assumptions and dependencies
- Model compatibility: Benefits accrue to CSA-style indexers (DeepSeek V3.2/V4 lightning indexer). Other sparsity methods may need adaptation.
- End-to-end validation: Paper demonstrates layer-level parity and memory scaling; production claims require task-level evaluation with real checkpoints.
- Hardware/software stack: Triton 3.7 + CUDA tested; ROCm/other GPUs may need kernel tuning. Integration with attention kernels (e.g., TileLang, FlashAttention variants) is required.
- Determinism: Set-level top-k parity is validated; order-level determinism requires an explicit comparator if downstream expects ordered indices.
- Throughput trade-offs: Chunked path re-reads inputs; at small S, materialized indexer can be faster—use auto-detection to select the path dynamically.
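On the determinism point: an explicit comparator can be as simple as ranking by score and breaking ties by index. The sketch below is one possible convention, not the paper's implementation; the function name and the choice that the lower index wins a tie are assumptions, and NaN scores are not handled.

```python
def deterministic_top_k(scores, k):
    """Top-k indices under a strict total order:
    higher score first, ties broken by lower index (assumed convention)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


scores = [0.5, 0.9, 0.5, 0.9, 0.1]
print(deterministic_top_k(scores, 3))  # [1, 3, 0]
```

Because every pair of entries is strictly ordered, the result is the same ordered list no matter how the keys are chunked or which backend runs the selection.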
Glossary
- argtop: The set of the top-k elements under a specified order, used to denote the result of top-k selection mathematically. "We write this set as ."
- autotuned Triton kernel: A GPU kernel written in Triton whose parameters are automatically tuned for performance. "A single autotuned Triton kernel computes"
- BF16: Brain floating-point 16-bit format, a reduced-precision numerical type used in deep learning. "All runs are on a single NVIDIA H200 SXM with 140~GB HBM3e, BF16 precision, CUDA 13, Triton 3.7, PyTorch 2.13 nightly."
- causal masking: A constraint that prevents attention to future positions, limiting legal indices based on sequence order. "Causal masking restricts the legal index range."
- chunked partition-merge: A streaming approach that processes data in chunks, taking per-chunk top-k and merging them to reduce memory usage. "a chunked partition-merge top-k driver"
- Compressed Sparse Attention (CSA): An attention mechanism that compresses keys and selects a sparse subset via an indexer to reduce compute and memory. "DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA)"
- compression ratio: The factor by which tokens are compressed into fewer key/value blocks. "the V4-Flash compression ratio m=4"
- CUDA: NVIDIA’s parallel computing platform and programming model for GPUs. "CUDA 13"
- effective top-k: The actual number of top-k results after applying legality constraints like causality. "Effective top-k"
- einsum: A generalized tensor contraction notation and operation (as in PyTorch) used to implement scoring efficiently. "index_score = torch.einsum(\"bshd,btd->bsht\", q, kv_cache_slice)"
- FlashAttention: An IO-aware attention algorithm that fuses operations to avoid materializing large score matrices. "FlashAttention~\cite{flashattn1,flashattn2,flashattn3} fuses softmax with the matmul"
- FP16: IEEE half-precision floating-point format used for faster computation and lower memory, with potential numerical tradeoffs. "A3 FP16 score accumulation"
- FP32: IEEE single-precision floating-point format used for higher numerical accuracy. "Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor"
- FP4: 4-bit floating point quantization for higher compression in model inference. "quantized to FP4~\cite{microscaling}"
- FP8: 8-bit floating point format used to reduce memory footprint for large models. "the 270~GB FP8~\cite{fp8} V4-Flash weights"
- Hadamard-rotated: Applying a Hadamard transform-based rotation to vectors (e.g., queries) to improve quantization. "Hadamard-rotated~\cite{quarot}"
- HBM (High-Bandwidth Memory): High-throughput GPU memory critical for large-scale attention workloads. "exceeding any single-GPU high-bandwidth-memory (HBM) budget."
- HBM3e: A specific generation of high-bandwidth memory with enhanced bandwidth, as used on NVIDIA H200 GPUs. "All runs are on a single NVIDIA H200 SXM with 140~GB HBM3e"
- heavy-hitter eviction: A KV-cache management strategy that retains or removes entries based on heavy-hitter statistics. "H2O~\cite{h2o} retains heavy hitters."
- indexer heads: Multiple parallel learned projections used by the indexer to score compressed keys. "With H_I=64 indexer heads"
- indexer-step intermediate: The large intermediate score tensor produced before reduction in the indexer step, which can cause OOM. "The CSA indexer-step intermediate () is unaddressed"
- IO-aware attention: Attention implementations optimized for input/output and memory movement, not just FLOPs. "IO-aware attention."
- key-tile size: The tile size along the compressed key axis used in chunked processing to manage memory and performance. "three 5-point design-space sweeps (chunk size, key-tile size, top-k)"
- KV cache: Cached key/value tensors used to accelerate attention across long contexts. "PagedAttention (vLLM)~\cite{vllm} chunks the KV cache."
- lexicographic comparator: A total order comparator that breaks score ties by index, used for deterministic top-k. "the lexicographic comparator inlined"
- lightning indexer: A learned scoring projection that ranks compressed keys per query to enable sparse attention. "A lightning indexer scores compressed keys"
- linear attention: An approximation to softmax attention with linear complexity in sequence length. "and linear attention~\cite{linearattn} approximate softmax attention with low-rank, locality-sensitive-hashing (LSH), or kernel structure."
- locality-sensitive-hashing (LSH): A technique for approximate nearest neighbors used in attention approximations. "locality-sensitive-hashing (LSH)"
- materialize indexer: The reference path that fully computes and stores the score tensor before top-k selection. "The materialize indexer OOMs at "
- Mixture-of-experts: An architecture that routes inputs to specialized expert sub-networks for efficiency and capacity. "Mixture-of-experts work~\cite{mixtral} influences the broader V4 architecture"
- Multi-Latent Attention (MLA): A KV factorization approach used in DeepSeek models to reduce memory/compute. "introduced Multi-Latent Attention (MLA) with a low-rank KV factorization."
- multi-query attention: An attention variant sharing keys/values across heads to reduce KV-cache size. "Multi-query attention~\cite{mqa}"
- Native Sparse Attention (NSA): A method that learns a top-k sparsity pattern natively, replacing the indexer. "Native Sparse Attention (NSA)~\cite{nsa}"
- out-of-memory (OOM): A failure state where GPU memory is exceeded during execution. "runs out of memory (OOMs)"
- PagedAttention: A technique that pages KV cache blocks to manage memory during attention. "PagedAttention (vLLM)~\cite{vllm}"
- partition-merge invariance: The property that top-k results are unchanged by partitioning and merging partial top-k results for separable scores. "Theorem (partition-merge invariance, idealized deterministic top-k)."
- pipelined attention kernel: An attention implementation that pipelines operations for throughput, as in TileLang. "TileLang's pipelined sparse attention kernel"
- RoPE (Rotary Position Embedding): A positional encoding technique that rotates query/key vectors in attention. "RoPE~\cite{rope}"
- running top-k buffer: A per-query buffer that maintains the current top-k scores/indices across tiles during streaming. "It maintains a running top-k buffer per query"
- sentinel padding: Padding top-k outputs with a special value when fewer legal entries than k exist. "Sentinel padding (used when $T_{\mathrm{legal}} < k$) is post-processing and is excluded from set comparisons."
- separable scoring function: A scoring function where each key’s score is independent of others, enabling streaming top-k. "Partition-merge invariance for top-k over a separable scoring function is folklore;"
- sequence-parallelizes: Distributing computation across devices by splitting the sequence dimension. "Ring attention~\cite{ringattn} sequence-parallelizes."
- set-overlap recall: A metric comparing sets of selected indices (e.g., top-k) between two methods. "Set-overlap recall against the materialize ground truth"
- state-space models: Sequence models (e.g., Mamba, Hyena) that replace attention with learned state transitions. "State-space models such as Mamba~\cite{mamba} and Hyena~\cite{hyena} avoid attention entirely."
- strict total order: An ordering where any two elements are comparable and the order is transitive and antisymmetric. "This is a strict total order."
- streaming top-k: Performing top-k selection incrementally over streamed partitions without materializing full scores. "the streaming top-k invariance"
- tensor parallelism (TP): Model-parallel execution that splits tensors (e.g., along hidden dimensions) across multiple GPUs. "multi-GPU tensor parallelism (TP)"
- TileLang: A tiled programming model and library providing reference kernels for CSA attention. "TileLang~\cite{tilelang} provides reference CSA kernels"
- top-k: Selecting the k largest elements (by score), often per query in attention. "the top-k are selected per query"
- Triton: A programming language and compiler for writing high-performance GPU kernels. "Triton~\cite{triton} is the kernel language we use."
- warp-specialization patterns: GPU execution strategies that assign different warps specialized roles to improve performance. "the FlashAttention-3 (FA3)-style~\cite{flashattn3} warp-specialization patterns"