
Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel (2508.18224v1)

Published 25 Aug 2025 in cs.DC and cs.LG

Abstract: Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in LLMs. Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), which includes an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied smaller GQA group sizes on modern GPUs. Compared to vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5$\times$ and on average 1.6$\times$ kernel-level latency reduction, (ii) up to 1.25$\times$ and 1.09$\times$ on average end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36$\times$ and 1.11$\times$ on average end-to-end prefill speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.


Summary

  • The paper demonstrates a novel GPU kernel optimization by reordering computation loops, significantly reducing latency and computational cost.
  • It achieves up to 3.5× kernel-level speedup and reduces memory access and FLOPs compared to conventional NSA.
  • The approach maintains model accuracy while enabling faster training and inference for long-context LLMs.

Flash Sparse Attention: Efficient Native Sparse Attention Kernel for Modern LLMs

Introduction

The paper introduces Flash Sparse Attention (FSA), a hardware-efficient kernel implementation of Native Sparse Attention (NSA) tailored to modern LLMs with small Grouped Query Attention (GQA) group sizes. NSA, while algorithmically effective for long-context LLMs, suffers significant inefficiencies on GPUs when GQA group sizes are small, a common configuration in contemporary LLMs. FSA addresses this by reordering the kernel computation loops and introducing system-level optimizations, yielding substantial reductions in kernel latency and in end-to-end training and inference time, without compromising model accuracy.

Background and Motivation

Full attention mechanisms in LLMs incur quadratic time and memory complexity with respect to sequence length, making them prohibitive for long-context scenarios. Sparse attention methods, such as NSA, reduce this cost by allowing each query to attend to a subset of keys. NSA achieves this via three parallel modules: compressed, selected, and sliding attention. However, NSA’s kernel implementation is only efficient when GQA group sizes are large, as it relies on batching query heads that share the same key/value heads. For small GQA group sizes, NSA must pad queries to meet hardware requirements, leading to wasted computation and memory bandwidth.
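As a toy illustration of the selected-attention branch described above (a NumPy reference, not the paper's GPU kernel; causal masking and the compressed/sliding branches are omitted), each query attends only to its own chosen KV blocks:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def selected_attention(Q, K, V, selected_blocks, B_K):
    """Toy reference for NSA-style selected attention: query t attends
    only to the KV blocks listed in selected_blocks[t]."""
    out = np.empty_like(Q)
    d = Q.shape[-1]
    for t, blocks in enumerate(selected_blocks):
        # Gather the key/value rows belonging to this query's blocks.
        idx = np.concatenate(
            [np.arange(b * B_K, (b + 1) * B_K) for b in blocks])
        scores = Q[t] @ K[idx].T / np.sqrt(d)
        out[t] = softmax(scores) @ V[idx]
    return out
```

When every query selects all blocks, this reduces to full attention, which is a convenient sanity check for the sparse path.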

FSA Kernel Design and Implementation

FSA’s core innovation is the inversion of the two-level loop in the NSA selected attention kernel. Instead of looping over queries in the outer loop and key-value (KV) blocks in the inner loop (as in NSA), FSA loops over KV blocks in the outer loop and queries in the inner loop. This design enables batching of non-contiguous queries that attend to the same KV block, eliminating the need for padding and reducing unnecessary memory access and FLOPs (Figure 1).

Figure 1: FSA kernel design inverts the NSA loop order, batching queries per KV block and storing partial results for later reduction.
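The batching enabled by the inverted loop order can be pictured with a plain-Python sketch (the function name and data layout are illustrative, not taken from the paper’s Triton code): from each query’s selected blocks, build for each KV block the list of queries that attend to it — the unit FSA’s outer loop processes.

```python
def queries_per_kv_block(selected_blocks, num_kv_blocks):
    """For each KV block, collect the (generally non-contiguous) query
    indices that selected it. An FSA-style outer loop walks KV blocks
    and batches exactly these queries, so no padding rows are needed."""
    buckets = [[] for _ in range(num_kv_blocks)]
    for q_idx, blocks in enumerate(selected_blocks):
        for b in blocks:
            buckets[b].append(q_idx)
    return buckets
```

The early-return optimization mentioned later corresponds to stopping a KV block’s inner loop as soon as its bucket of queries is exhausted.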

This reordering introduces two main challenges: (1) non-contiguous memory access for queries, which can degrade GPU cache efficiency, and (2) the need for correct online softmax and accumulation across multiple KV blocks. FSA addresses these by:

  • Using index tensors to batch non-contiguous queries and enable early return when all relevant queries are processed.
  • Decoupling online softmax statistics computation and result accumulation into separate pre-computation and reduction kernels, respectively.
  • Implementing all kernels in Triton, with fine-grained control over warp assignment and buffer management to minimize overhead.
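The decoupling in the second bullet can be sketched numerically — a minimal NumPy sketch of the standard online-softmax merge, not the paper’s actual Triton kernels. The pre-computation step emits, per KV block, a block-local max, an exp-sum, and an unnormalized output; the reduction step rescales each partial to the global max and combines them.

```python
import numpy as np

def partial_block(q, K_blk, V_blk):
    """Pre-computation step (sketch): unnormalized partial output for
    one KV block, plus the softmax statistics needed to merge later."""
    s = q @ K_blk.T / np.sqrt(q.shape[-1])
    m = s.max()                    # block-local max for stability
    p = np.exp(s - m)
    return m, p.sum(), p @ V_blk   # (max, exp-sum, unnormalized out)

def reduce_blocks(partials):
    """Reduction step (sketch): rescale each block's partial to the
    global max, then normalize, as in online softmax."""
    g = max(m for m, _, _ in partials)
    denom = sum(np.exp(m - g) * l for m, l, _ in partials)
    numer = sum(np.exp(m - g) * o for m, _, o in partials)
    return numer / denom
```

Merging this way yields results identical to running softmax attention over all the selected keys at once, which is why splitting the work across KV blocks does not change the output.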

Analytical and Empirical Performance

Theoretical Analysis

FSA achieves lower memory access volume and FLOPs than NSA, especially for small GQA group sizes. For example, with GQA group size 4, block size B_K = 64, and top-k T = 16, FSA reduces memory access to 21.3% and FLOPs to 56.2% of NSA’s requirements. This is achieved by eliminating padding and ensuring that all loaded KV data is used in computation (Figure 2).

Figure 2: FSA achieves lower normalized memory access and FLOPs than NSA across GQA group sizes.
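The intuition behind these savings can be shown with a back-of-the-envelope utilization calculation. Note the assumptions: the 16-row tile is an illustrative tensor-core constraint, not a figure from the paper, and this is not the paper’s exact cost model — it only shows why padding a small query group is wasteful.

```python
import math

def padded_utilization(gqa_group_size, tile_rows=16):
    """Fraction of computed rows that are real queries when the
    query-group dimension must be padded up to `tile_rows`
    (illustrative assumption, not a value from the paper)."""
    padded = math.ceil(gqa_group_size / tile_rows) * tile_rows
    return gqa_group_size / padded
```

With a GQA group of 4 and a 16-row tile, only a quarter of the computed rows carry real queries; FSA sidesteps this waste by batching queries across positions instead of padding within a group.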

Kernel Profiling

Empirical profiling on NVIDIA H20 and H200 GPUs demonstrates that FSA consistently outperforms NSA and full attention kernels across a range of GQA group sizes and sequence lengths. FSA achieves up to 3.5× kernel-level speedup over NSA and up to 6.4× over full attention, with the largest gains observed for small GQA group sizes and long sequences (Figure 3).

Figure 3: FSA kernel execution latency is significantly lower than NSA across GPU types and GQA group sizes.

Figure 4: FSA outperforms NSA and full attention in kernel latency across various block and top-k configurations.

End-to-End System Performance

Training and Inference Latency

FSA delivers up to 1.25× (average 1.09×) end-to-end training speedup and up to 1.36× (average 1.11×) prefill speedup over NSA, with even larger gains over full attention. These improvements are robust across Llama3-8B, Qwen3-14B, and Qwen2.5-32B models, and are more pronounced for longer context lengths and on higher-end hardware (Figure 5).

Figure 5: FSA reduces end-to-end training latency compared to NSA and full attention.

Figure 6: FSA achieves lower prefill latency in inference compared to NSA and full attention.

Detailed Breakdown and Ablation

Forward/Backward and Attention Module Analysis

FSA’s performance gains are most significant in the selected attention phase, which dominates total attention computation time. In forward and backward passes, FSA achieves up to 2.36× and 4.32× speedup over NSA, respectively. Ablation studies confirm that each FSA optimization — inner-loop reordering and early return — contributes substantially to performance (Figure 7).

Figure 7: FSA outperforms NSA and full attention in both forward and backward attention computation.

Figure 8: Selected attention dominates attention overhead; FSA achieves the largest speedup in this phase.

Figure 9: Disabling FSA optimizations (inner loop, early return) degrades kernel performance.

Correctness and Training Dynamics

Loss curves for Llama3-8B fine-tuning show that FSA matches NSA and full attention in convergence behavior, validating the correctness of the FSA kernel (Figure 10).

Figure 10: FSA, NSA, and full attention achieve similar loss convergence in Llama3-8B training.

End-to-End Latency Breakdown

FSA’s end-to-end speedup is driven primarily by attention computation, where it achieves up to 1.4× lower latency than NSA and up to 3.87× lower latency than full attention (Figure 11).

Figure 11: FSA’s end-to-end training speedup is driven by attention computation improvements.

Implementation Considerations

  • Hardware Requirements: FSA is designed for modern NVIDIA GPUs (H20, H200), leveraging Triton for kernel implementation. The additional buffer overhead is manageable given current GPU memory capacities.
  • Scalability: FSA’s design is robust to varying sequence lengths, GQA group sizes, and model scales, making it suitable for deployment in production LLM systems.
  • Limitations: FSA’s non-contiguous memory access, while mitigated by early return and index batching, may still underutilize cache compared to fully contiguous access patterns. However, empirical results show that the reduction in redundant computation outweighs this cost.

Implications and Future Directions

FSA demonstrates that algorithm–system co-design is essential for realizing the practical benefits of sparse attention in LLMs. By aligning kernel design with hardware constraints and LLM architectural trends (small GQA group sizes), FSA enables efficient long-context modeling without accuracy loss. This work opens avenues for further research in:

  • Extending FSA-like optimizations to other forms of structured or learned sparsity.
  • Adapting FSA for emerging hardware architectures with different memory and compute characteristics.
  • Exploring dynamic kernel scheduling and further buffer management optimizations for even larger models and longer contexts.

Conclusion

Flash Sparse Attention (FSA) provides an efficient, hardware-aligned kernel for NSA, enabling practical deployment of sparse attention in modern LLMs with small GQA group sizes. Through loop reordering, memory access optimizations, and decoupled softmax/reduction, FSA achieves substantial speedups in both kernel and end-to-end performance, validated across multiple models and hardware platforms. The open-sourced implementation facilitates further research and adoption, and the approach exemplifies the importance of system-aware algorithm design in scaling LLMs to ever longer contexts.
