Papers
Topics
Authors
Recent
Search
2000 character limit reached

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Published 9 Apr 2026 in cs.CL | (2604.07815v1)

Abstract: Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA, and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.

Summary

  • The paper introduces a hierarchical two-level sparse attention that combines fast block-level filtering with precise token-level selection for efficient long-context inference.
  • It achieves parity with full attention accuracy while delivering up to 4.0x speedups and significant throughput gains over conventional sparse methods.
  • The asynchronous KV cache offloading engine overlaps computation and memory transfer, reducing PCIe bandwidth demands and minimizing memory overhead.

AsyncTLS: Asynchronous Two-Level Sparse Attention for Efficient Long-Context LLM Inference

Introduction and Motivation

The quadratic complexity of self-attention and the linear growth of Key-Value (KV) cache memory fundamentally limit the scalability of LLMs in long-context inference. With sequence lengths expanding to 100k+ tokens, the memory footprint frequently exceeds GPU capacity, necessitating expensive transfers to CPU memory. Existing sparse attention methods typically operate at the token or block level—token-level sparsity offers higher fidelity but comes at substantial indexing and computation costs, while block-level sparsity enables efficient hardware utilization but sacrifices precision due to the inclusion of irrelevant tokens.

AsyncTLS introduces a hierarchical approach, combining fast block-level filtering with precise token-level selection and a specialized asynchronous KV cache offloading engine leveraging temporal locality, to address both computational and memory bottlenecks in long-context generative inference.

Hierarchical Two-Level Sparse Attention

AsyncTLS leverages hierarchical indices for KV blocks at two granularities:

  • Block-Level Indexing: Coarse filtering via block importance scoring rapidly eliminates irrelevant sequence segments.
  • Token-Level Indexing: Within retained blocks, fine-grained token selection provides precise contextual retrieval, preserving the accuracy benefits of high-resolution sparse attention.

This staged selection drastically limits the search space for fine-grained token selection, minimizing indexing overhead while retaining the critical tokens needed for model fidelity. Figure 1

Figure 1: Illustration of hierarchical indices in AsyncTLS; block-level filtering precedes token-level selection for sparse attention computation.

Asynchronous KV Cache Offloading Engine

Beyond attention costs, long-context inference requires substantial KV cache storage. AsyncTLS integrates an asynchronous offloading pipeline:

  • Prefilling: Construction of hierarchical indices to segment the input, retention of critical KV pairs in GPU-resident cache, offloading the remainder to CPU memory.
  • Decoding: Sparse attention is computed via token-level selection from the GPU cache; concurrently, block-level indices trigger asynchronous prefetches of additional KV blocks from host memory, utilizing prior timestep predictions to anticipate future access needs.
  • Incremental Block Transmission: Only the difference between consecutive selection steps is transferred, exploiting temporal similarity and minimizing PCIe bandwidth requirements. Figure 2

    Figure 2: AsyncTLS workflow; attention and KV transfer operations are overlapped, maximizing temporal locality and reducing redundant transmission.

Experimental Results and Numerical Findings

AsyncTLS is evaluated on Qwen3 (8B/14B) and GLM-4.7-Flash models across GQA and MLA architectures. The key findings:

  • Accuracy: AsyncTLS achieves results indistinguishable from Full Attention (FA) under equivalent token budgets, outperforming block-level methods (Quest) and matching token-level approaches (DS/Double-Sparsity).
  • Operator Speedups: Delivers 1.2×1.2\times--10.0×10.0\times operator speedups versus FA, with up to 4.0×4.0\times improvements over token-level sparse attention.
  • End-to-End Throughput: Achieves 1.3×1.3\times--4.7×4.7\times throughput gains on 48k–96k contexts compared to FA, supporting larger batch sizes without memory exhaustion.

Performance under varying token budgets demonstrates robust retention of accuracy; AsyncTLS consistently outperforms block-level sparse attention and aligns with token-level baselines across retrieval, summarization, and code completion tasks. Figure 3

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3: AsyncTLS maintains high task scores under token budget constraints for Qwen3-14B.

Figure 4

Figure 4

Figure 4

Figure 4

Figure 4

Figure 4

Figure 4

Figure 4

Figure 4: Latency comparison across attention mechanisms; AsyncTLS provides substantial gains at all batch sizes and sequence lengths.

Figure 5

Figure 5

Figure 5: End-to-end latency; AsyncTLS significantly reduces inference times relative to FA, DS, and Quest.

Figure 6

Figure 6

Figure 6: Throughput benchmarks; AsyncTLS enables high batch processing for long contexts.

Architectural and Practical Implications

AsyncTLS demonstrates hardware-agnostic compatibility (GQA, MLA), with no dependency on retraining or fine-tuning. Its hierarchical attention design and cache management pipeline make it suitable for production-scale LLM deployments where ultra-long context and efficient hardware utilization are essential. The asynchronous offloading engine directly addresses PCIe bottlenecks, supporting scalable inference with minimized memory overhead.

From a theoretical standpoint, the staged selection pipeline suggests a general strategy for balancing granularity and computational efficiency, applicable beyond Transformer-based attention. The exploitation of temporal locality in KV selection paves the way for further research into predictive cache management and pipeline optimization.

Future Directions

Potential future extensions include:

  • Integration with learned adaptive selection strategies for block/token granularity rather than fixed heuristics.
  • Hardware-specific kernel optimization to further maximize performance on emerging accelerator architectures.
  • Combining hierarchical attention with other KV compression paradigms (e.g., latent or semantic clustering) for additional memory savings.
  • Application to retrieval-augmented generation and streaming inference scenarios.

Conclusion

AsyncTLS introduces a training-free, hardware-efficient sparse attention architecture that reconciles the accuracy of fine-grained token-level selection with the efficiency of block-level processing. Its asynchronous KV cache offloading engine overlaps computation and memory transfer, exploiting temporal locality to minimize bandwidth requirements. Comprehensive experiments validate that AsyncTLS achieves parity with full attention accuracy while offering substantial operator speedups and throughput improvements for contexts up to 96k tokens. This work establishes a scalable foundation for practical long-context LLM deployment, and its principles are extensible to future developments in memory-efficient generative AI.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.