Prefix RadixAttention: Accelerating LLM Inference

Updated 5 April 2026
  • Prefix RadixAttention is a method that uses radix trees to efficiently exploit shared prefixes in LLM inference, thereby reducing redundant computation.
  • It integrates algorithmic refactorings and specialized GPU kernel strategies, achieving up to 22.7× speedups and reducing attention latency by over 67%.
  • The approach incorporates scheduling innovations like k-LPM to optimize query order, effectively balancing computational reuse with fairness in high-throughput deployments.

Prefix RadixAttention refers to a class of methods for LLM inference that accelerates autoregressive decoding by exploiting shared input prefixes among requests. This involves (1) data structures, notably the radix tree, for efficient prefix matching and key/value (KV) caching; (2) algorithmic refactorings of attention computation that allow partial reuse of previously cached transformer key/value tensors; and (3) specialized batched scheduling or GPU kernel strategies that minimize redundant computation and bandwidth utilization. As LLM deployments scale, practical workloads display significant hierarchical prompt overlap (e.g., system prompts, templates, retrieved materials), and Prefix RadixAttention substantially improves time-to-first-token (TTFT), time-per-output-token (TPOT), and memory efficiency through both architectural and systems-level innovations (Yi et al., 27 Nov 2025, Dexter et al., 7 Feb 2025).

1. RadixAttention: Data Structures and Transformer Integration

RadixAttention centers on the construction and dynamic management of a radix tree (also known as a compressed prefix tree) built over the LLM token vocabulary $\Sigma$. Each prompt $x = x_1 x_2 \dots x_n$ is mapped onto this tree by greedily tracing the longest prefix shared with previously seen input strings. At every node $v$ at depth $d$, the corresponding cached key $K[v] \in \mathbb{R}^{d \times d_k}$ and value $V[v] \in \mathbb{R}^{d \times d_v}$ tensors capture the cumulative transformer activations for the prefix $s(v)$.
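The lookup structure can be illustrated with a minimal sketch. It assumes a plain token-level trie rather than a compressed radix tree (a real radix tree would merge single-child chains into one edge), and the names `Node`, `PrefixTree`, and `kv_slot` are hypothetical, not taken from the cited papers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One token-level node; a compressed radix tree would merge single-child chains."""
    children: Dict[int, "Node"] = field(default_factory=dict)
    kv_slot: Optional[int] = None  # hypothetical handle to the cached K/V for this prefix


class PrefixTree:
    def __init__(self) -> None:
        self.root = Node()
        self._next_slot = 0

    def match(self, tokens: List[int]) -> int:
        """Return the length of the longest stored prefix of `tokens`."""
        node, depth = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, depth = child, depth + 1
        return depth

    def insert(self, tokens: List[int]) -> int:
        """Insert `tokens`; return how many new nodes (novel tokens) were created."""
        node, added = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(kv_slot=self._next_slot)
                self._next_slot += 1
                added += 1
            node = node.children[tok]
        return added


tree = PrefixTree()
system_prompt = [101, 7, 7, 42]          # made-up token ids
tree.insert(system_prompt + [5, 6])
print(tree.match(system_prompt + [9, 9]))  # 4: only the system prompt is shared
```

Only the tokens past the matched depth would need a transformer forward pass; the first `match(...)` tokens can reuse cached key/value rows.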

Upon the arrival of a new prompt $x$, the tree is traversed to identify the deepest node $v$ matching a prefix $x_1 \cdots x_p$ of $x$. The novel suffix $x_{p+1} \cdots x_n$ is then processed by the transformer, yielding new $K_{\text{new}}$ and $V_{\text{new}}$. The full self-attention computation for the new tokens, with queries $Q_{\text{new}}$, is efficiently factored as follows:

$$\mathrm{Attn}(Q_{\text{new}}) = \mathrm{softmax}\!\left(\frac{Q_{\text{new}}\,[K[v];\,K_{\text{new}}]^{\top}}{\sqrt{d_k}}\right)[V[v];\,V_{\text{new}}]$$

where $[\,\cdot\,;\,\cdot\,]$ denotes row-concatenation and the usual causal mask applies among the new tokens. This partitioning reduces the attention computation from $O(n^2)$ to $O((n-p)\,n)$ operations, an approximate $n/(n-p)$ speedup per query, with greater gains for workloads with deeper prefix sharing (Dexter et al., 7 Feb 2025).
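The reuse step can be checked numerically with a small single-head sketch in pure Python. It assumes causal attention, so the key/value rows of a prefix do not depend on later tokens. The helper names (`attend`, `causal_attention`) and the toy matrices are illustrative assumptions, not the cited implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def attend(q: List[float], keys: Matrix, values: Matrix) -> List[float]:
    """Single-query softmax attention over the given key/value rows."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def causal_attention(Q: Matrix, K: Matrix, V: Matrix, offset: int = 0) -> Matrix:
    """Row i of Q sits at absolute position offset + i and sees keys 0..offset+i."""
    return [attend(q, K[: offset + i + 1], V[: offset + i + 1]) for i, q in enumerate(Q)]


# Toy per-token projections for a 5-token prompt (made-up numbers).
Q = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
K = [[0.2 * (i - j) for j in range(4)] for i in range(5)]
V = [[float(i), float(i * i)] for i in range(5)]

p = 3                                    # prefix length already cached
full = causal_attention(Q, K, V)         # baseline: every row recomputed
K_cached, V_cached = K[:p], V[:p]        # stand-in for rows read from the tree
suffix = causal_attention(Q[p:], K_cached + K[p:], V_cached + V[p:], offset=p)

# The suffix rows agree with the corresponding rows of the full computation.
assert all(
    math.isclose(a, b) for row_a, row_b in zip(full[p:], suffix) for a, b in zip(row_a, row_b)
)
```

Only `len(Q) - p` query rows are evaluated in the reuse path, while the keys and values still span the whole sequence via concatenation.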

2. Computational Complexity and Storage Overhead

The time cost of processing an input sequence $x$ of length $n$ in RadixAttention consists of $O(n)$ for prefix matching, $O(m)$ for insertion of novel nodes ($m$ is the suffix length), and $O(m\,(p+m))$ for attention (with $p$ the prefix length, so that $n = p + m$). Thus, the overall per-prompt cost is $O(n + m\,(p+m))$, eliminating $O(p\,n)$ of the typical $O(n^2)$ transformer attention complexity for large prefix overlaps.
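As a rough worked example of the attention term only, the sketch below counts query-key score computations with and without a cached prefix. It ignores causal-mask savings, constant factors, and the number of layers and heads; the function names are hypothetical.

```python
def score_pairs_full(n: int) -> int:
    """Query-key pairs scored when all n tokens are recomputed."""
    return n * n


def score_pairs_with_prefix(n: int, p: int) -> int:
    """Pairs scored when the first p tokens are cached: n - p queries over n keys."""
    if not 0 <= p <= n:
        raise ValueError("prefix length must lie in [0, n]")
    return (n - p) * n


n, p = 4096, 3072                  # e.g. a 3072-token shared template in a 4096-token prompt
full = score_pairs_full(n)
reused = score_pairs_with_prefix(n, p)
print(full, reused, full / reused)  # 16777216 4194304 4.0
```

The difference `full - reused` equals `p * n`, which is the work avoided by reading the prefix from the cache.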

Space requirements are dictated by storing per-prefix key and value matrices at each node, so worst-case memory use grows with both the prompt count and the maximum prompt length. In implementation, the prefix tree is compressed via variable-length edge labels, and LRU eviction controls cache size (Dexter et al., 7 Feb 2025).
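LRU eviction can be sketched with the standard library's `OrderedDict`. This is a simplified illustration that keys entries by whole prefix tuples; a tree-based cache would more plausibly evict leaf nodes and protect entries that in-flight requests still reference. `LRUPrefixCache` is a hypothetical name.

```python
from collections import OrderedDict
from typing import Optional, Tuple

Prefix = Tuple[int, ...]


class LRUPrefixCache:
    """Keeps at most `capacity` prefix entries; evicts the least recently used one."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[Prefix, object]" = OrderedDict()

    def get(self, prefix: Prefix) -> Optional[object]:
        """Return the cached value and mark the entry as most recently used."""
        if prefix not in self._entries:
            return None
        self._entries.move_to_end(prefix)
        return self._entries[prefix]

    def put(self, prefix: Prefix, kv: object) -> Optional[Prefix]:
        """Store `kv`; return the evicted prefix if the cache overflowed."""
        self._entries[prefix] = kv
        self._entries.move_to_end(prefix)
        if len(self._entries) > self.capacity:
            evicted, _ = self._entries.popitem(last=False)  # oldest entry
            return evicted
        return None


cache = LRUPrefixCache(capacity=2)
cache.put((1, 2), "kv-a")
cache.put((1, 3), "kv-b")
cache.get((1, 2))                # touch (1, 2), so (1, 3) becomes the oldest
print(cache.put((4,), "kv-c"))   # (1, 3)
```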

3. Query Scheduling with Prefix Reuse: Theory and Algorithms

The integration of RadixAttention into production LLM serving workloads raises sophisticated scheduling challenges. Each query is defined by its prompt and arrival time, and the objective is to minimize the TTFT and TPOT metrics while leveraging prefix reuse for computational savings.

RadixAttention modifies scheduling theory because the processing order affects the attainable KV reuse. The key result establishes that, under prefix reuse and TTFT deadlines, finding a feasible schedule is strongly NP-hard; Dexter et al. (7 Feb 2025) demonstrate this via a reduction from 3-PARTITION.

To address this, the $k$-LPM (Longest-Prefix-Match with window size $k$) algorithm is proposed: after processing the oldest query, it greedily processes a bounded number of pending queries, set by $k$, that have maximal prefix overlap with the last one served. At the smallest window size it recovers FCFS; as $k \to \infty$ it becomes LPM. Theoretical analysis proves that for common user-doc workloads (a user prefix repeated a fixed number of times plus a unique doc suffix), $k$-LPM provides deterministic TTFT bounds expressed in terms of the request rate, the prefix length, and the doc suffix length. Empirically, $k$-LPM with small $k$ robustly outperforms both FCFS and LPM, especially in moderate-to-high throughput regimes (Dexter et al., 7 Feb 2025).
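One possible reading of this scheduling rule is sketched below. It assumes that all queries are already pending (no online arrivals) and that each round serves the oldest query followed by up to `k` greedy longest-prefix-match picks. Under this sketch's convention `k = 0` reduces to FCFS; the exact indexing in Dexter et al. may differ. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Query = Tuple[float, Sequence[int]]  # (arrival_time, prompt tokens)


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that `a` and `b` share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def k_lpm_order(queries: List[Query], k: int) -> List[Query]:
    """Serve the oldest query, then up to k greedy longest-prefix-match picks; repeat."""
    pending = sorted(queries, key=lambda q: q[0])  # oldest first
    order: List[Query] = []
    while pending:
        last = pending.pop(0)                      # fairness step: oldest pending query
        order.append(last)
        for _ in range(k):
            if not pending:
                break
            # max() returns the first maximal element, so ties go to the older query.
            best = max(pending, key=lambda q: common_prefix_len(q[1], last[1]))
            pending.remove(best)
            order.append(best)
            last = best
    return order


queries = [(0.0, [1, 1, 5]), (1.0, [2, 2, 7]), (2.0, [1, 1, 6]), (3.0, [2, 2, 8])]
print([t for t, _ in k_lpm_order(queries, k=0)])  # [0.0, 1.0, 2.0, 3.0]
print([t for t, _ in k_lpm_order(queries, k=1)])  # [0.0, 2.0, 1.0, 3.0]
```

With `k=1`, queries sharing the `[1, 1]` prefix are served back to back, which raises cache reuse, while the periodic oldest-first step bounds how long an unrelated query can wait.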

4. Prefix-Aware Attention Kernel Implementation: PAT

The "Prefix-Aware Attention Kernel" (PAT) (Yi et al., 27 Nov 2025) advances these ideas to the CUDA kernel level, tightly coupling prefix reuse with GPU resource scheduling. PAT operates in a pack–forward–merge fashion:

  1. Pack: Incoming queries are grouped by maximal shared prefix, forming a forest structure where each group (CTA in CUDA terminology) minimizes redundant KV cache reads. Optimal packing schemes are selected by maximizing a profit ratio for each prefix node.
  2. Forward: Each CTA is processed by a multi-tile, resource-adaptive kernel. Tile sizes are chosen (subject to shared memory/register and bandwidth constraints on A100 hardware) to fit the group’s query/KV shape exactly. Multiple CUDA streams and KV splitting are used to balance work across SMs and avoid stragglers.
  3. Merge: Partial attention results (maximum score, log-sum-exp accumulator, value sum) are merged via a lightweight online-softmax reduction. The merge overhead is […] of total kernel time.
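The merge step relies on a standard online-softmax identity, sketched here in pure Python for a single query. This is not the PAT kernel: it runs on the CPU, and it stores the plain sum of exponentials instead of a log-sum-exp, which is an assumption made for brevity. The names `partial_attention`, `merge`, and `finalize` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (max score, sum of exp(score - max), exp-weighted value sum)
Partial = Tuple[float, float, List[float]]


def partial_attention(scores: Sequence[float], values: Sequence[Sequence[float]]) -> Partial:
    """Unnormalized attention over one slice of the KV cache."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, sum(weights), acc


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial results by rescaling both to a common maximum."""
    m = max(a[0], b[0])
    scale_a, scale_b = math.exp(a[0] - m), math.exp(b[0] - m)
    total = a[1] * scale_a + b[1] * scale_b
    acc = [x * scale_a + y * scale_b for x, y in zip(a[2], b[2])]
    return m, total, acc


def finalize(p: Partial) -> List[float]:
    """Apply the softmax normalization once, after all slices are merged."""
    return [x / p[1] for x in p[2]]


scores = [0.5, 2.0, -1.0, 3.0]                              # made-up query-key scores
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

whole = finalize(partial_attention(scores, values))
split = finalize(merge(partial_attention(scores[:2], values[:2]),
                       partial_attention(scores[2:], values[2:])))
assert all(math.isclose(x, y) for x, y in zip(whole, split))
```

Because `merge` is associative, a shared-prefix slice and per-request suffix slices can be processed independently and combined in any order.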

PAT integrates as a plugin into vLLM and demonstrates a kernel speedup of up to […] over FlashAttention, along with a consistent reduction in TPOT and TTFT across representative real-world and synthetic workloads, particularly where large-scale prefix overlap is common (Yi et al., 27 Nov 2025).

5. Comparative Benchmarks and System Impact

Empirical studies have benchmarked Prefix RadixAttention approaches against standard baselines such as FlashAttention, FlashInfer, FastTree, and RelayAttention++. The PAT kernel reduces attention latency by 67.4% on average and TPOT by 13.6%–83.4% across diverse batch shapes and LLMs (LLaMA-3-8B, Qwen-3-8B, 8–32K tokens) (Yi et al., 27 Nov 2025). End-to-end TTFT improvements range from 7.9% to 99.6% in streaming settings. In query scheduling experiments with 2,100 queries and personalized user prefixes, $k$-LPM achieves 10–30% lower P99 TTFT under high throughput, in accordance with theoretical bounds (Dexter et al., 7 Feb 2025).

The table summarizes representative performance results:

| Method | P99 TTFT at 200 req/s | Kernel Latency Speedup | TPOT Reduction |
|---|---|---|---|
| FCFS | 300 ms | — | — |
| LPM | 220 ms | — | — |
| $k$-LPM ($k$ = […]) | 160 ms | — | — |
| PAT | — | […] vs FlashAttention | […] |

TTFT and TPOT are as defined above; kernel speedups refer to attention step only.

6. Architectural and Practical Considerations

Prefix RadixAttention demonstrates that the combination of hierarchical data structures (radix trees for prefix factorization) and bespoke GPU kernel design (prefix packing, multi-stream parallelism) enables LLM servers to approach hardware memory bandwidth limits even under highly non-uniform batch and prefix distributions (Yi et al., 27 Nov 2025). The avoidance of redundant KV reads, adaptive tiling per CTA, and cross-CTA concurrency are all critical for saturation of available GPU resources in the decode path. Memory overhead remains manageable via cache compression and LRU eviction.

Challenges remain in managing worst-case prefix tree growth and schedule fairness under adversarial traffic, but both theoretical and empirical findings indicate robust performance in practical, heavily prefix-redundant scenarios.

7. Relationship to Broader Prefix Reuse Paradigms

Prefix RadixAttention generalizes earlier approaches to prefix reuse in transformer inference by providing a formalized, data-structure-centric approach and integrating system-level scheduling optimization. It supports efficient batched inference in high-throughput, low-latency settings, and connects scheduling policy (through $k$-LPM) with algorithmic and hardware-aware kernel implementations. A plausible implication is that, as LLM workloads grow more complex and personalized, Prefix RadixAttention methodologies will become foundational for state-of-the-art serving systems, especially in environments where TTFT and TPOT directly affect user experience and infrastructure cost (Dexter et al., 7 Feb 2025, Yi et al., 27 Nov 2025).
