Prefix RadixAttention: Accelerating LLM Inference

Updated 5 April 2026

Prefix RadixAttention is a method that uses radix trees to efficiently exploit shared prefixes in LLM inference, thereby reducing redundant computation.
It integrates algorithmic refactorings and specialized GPU kernel strategies, achieving up to 22.7× speedups and reducing attention latency by over 67%.
The approach incorporates scheduling innovations like k-LPM to optimize query order, effectively balancing computational reuse with fairness in high-throughput deployments.

Prefix RadixAttention refers to a class of methods for LLM inference that accelerates autoregressive decoding by exploiting shared input prefixes among requests. This involves (1) data structures, notably the radix tree, for efficient prefix matching and key/value (KV) caching; (2) algorithmic refactorings of attention computation that allow partial reuse of previously cached transformer key/value tensors; and (3) specialized batched scheduling or GPU kernel strategies that minimize redundant computation and bandwidth utilization. As LLM deployments scale, practical workloads display significant hierarchical prompt overlap (e.g., system prompts, templates, retrieved materials), and Prefix RadixAttention substantially improves time-to-first-token (TTFT), time-per-output-token (TPOT), and memory efficiency through both architectural and systems-level innovations (Yi et al., 27 Nov 2025, Dexter et al., 7 Feb 2025).

1. RadixAttention: Data Structures and Transformer Integration

RadixAttention centers on the construction and dynamic management of a radix tree (also known as a compressed prefix tree) built over the LLM token vocabulary $\Sigma$. Each prompt $x = x_1 x_2 \dots x_n$ is mapped onto this tree by greedily tracing the longest prefix shared with previously seen input strings. At every node $v$ at depth $d$, the corresponding cached key $K[v] \in \mathbb{R}^{d \times d_k}$ and value $V[v] \in \mathbb{R}^{d \times d_v}$ tensors capture the cumulative transformer activations for the prefix $s(v)$.
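The prefix-matching structure described above can be sketched as a small compressed radix tree over token ids. This is a minimal illustration, not the implementation from the cited work: the names `Node`, `RadixTree`, `match_prefix`, and `insert` are hypothetical, and the per-node KV tensors are omitted so that only the matching and edge-splitting logic is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Edge label leading into this node: a run of token ids (hypothetical layout).
    label: tuple = ()
    # Children keyed by the first token of their edge label.
    children: dict = field(default_factory=dict)


def _common_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixTree:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already stored in the tree."""
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            k = _common_len(child.label, tokens[matched:])
            matched += k
            if k < len(child.label):
                break  # diverged (or ran out of tokens) inside an edge
            node = child
        return matched

    def insert(self, tokens):
        """Insert a token sequence, splitting an edge where it diverges."""
        tokens = tuple(tokens)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(label=tokens[i:])
                return
            k = _common_len(child.label, tokens[i:])
            if k < len(child.label):
                # Split the edge: keep the shared part, push the remainder down.
                mid = Node(label=child.label[:k])
                child.label = child.label[k:]
                mid.children[child.label[0]] = child
                node.children[tokens[i]] = mid
                child = mid
            node, i = child, i + k


tree = RadixTree()
tree.insert([1, 2, 3, 4])
tree.insert([1, 2, 5])
```

In a serving system each node would also hold (or point to) the cached key/value blocks for its edge, so the value returned by `match_prefix` tells the engine how many tokens can skip prefill.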

Upon the arrival of a new prompt $x$, the tree is traversed to identify the deepest node $v$ matching the longest cached prefix $x_1 \cdots x_p$ of $x$. The novel suffix $x_{p+1} \cdots x_n$ is then processed by the transformer, yielding new $K_{\text{new}}$ and $V_{\text{new}}$. The full self-attention computation for the new tokens' queries $Q_{\text{new}}$ is efficiently factored as follows: $\mathrm{Attn}(Q_{\text{new}}) = \mathrm{softmax}\left(Q_{\text{new}} [K[v]; K_{\text{new}}]^{\top} / \sqrt{d_k}\right) [V[v]; V_{\text{new}}]$, where $[\cdot\,; \cdot]$ denotes row-concatenation and the usual causal mask applies within the suffix. This partitioning reduces the attention computation from $O(n^2)$ to $O((n-p)\,n)$ operations, achieving an approximate $n/(n-p)$ speedup per query, with even greater gains for workloads with deeper prefix sharing (Dexter et al., 7 Feb 2025).
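To make the reuse step concrete, the following pure-Python sketch checks that attending from the suffix queries to the concatenation of cached and new keys/values reproduces the corresponding rows of full causal attention. It is an illustration under assumptions: a single head, random toy matrices, and hypothetical helper names (`softmax`, `causal_attention`, `rand_matrix`); real kernels operate on batched GPU tensors.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V, offset=0):
    """Row i of Q sits at absolute position offset + i and attends to keys 0..offset+i."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for i, q in enumerate(Q):
        visible = offset + i + 1
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in range(visible)]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(visible)) for c in range(len(V[0]))])
    return out


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


n, p, d = 6, 4, 3  # toy sizes: 6 tokens, 4 of them covered by the cached prefix
Q, K, V = rand_matrix(n, d), rand_matrix(n, d), rand_matrix(n, d)

# Baseline: attention over the whole prompt.
full = causal_attention(Q, K, V)

# Prefix reuse: keys/values for the first p tokens come from the cache,
# and only the n - p suffix queries are evaluated.
K_cached, V_cached = K[:p], V[:p]
K_new, V_new = K[p:], V[p:]
reused = causal_attention(Q[p:], K_cached + K_new, V_cached + V_new, offset=p)
```

Here `reused` has only `n - p` rows, which is where the saving comes from: the prefix rows are never recomputed.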

2. Computational Complexity and Storage Overhead

The time cost of processing an input sequence $x$ of length $n$ in RadixAttention consists of $O(n)$ for prefix matching, $O(m)$ for insertion of novel nodes ($m$ is the suffix length), and $O(m(p+m))$ for attention (with $p$ the prefix length, so that $n = p + m$). Thus, the overall per-prompt cost is $O(n + m(p+m))$, eliminating $O(pn)$ of the typical $O(n^2)$ transformer attention complexity for large prefix overlaps.
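As a rough worked example of the attention saving, the sketch below counts query-key score evaluations for causal attention with and without a cached prefix. The function name `score_count` and the chosen sizes are illustrative assumptions; constant factors such as head dimension and layer count are ignored.

```python
def score_count(n, p=0):
    """Query-key score evaluations for causal attention over n tokens
    when the first p tokens already have cached keys/values."""
    if not 0 <= p <= n:
        raise ValueError("need 0 <= p <= n")
    # The token at 1-based position i attends to i keys;
    # only positions p+1..n have to be computed.
    return sum(range(p + 1, n + 1))


full = score_count(1000)           # no reuse
reused = score_count(1000, p=900)  # 900-token prefix served from cache
ratio = full / reused
```

For a 1000-token prompt with a 900-token cached prefix, the count drops from 500500 to 95050, a bit more than a fivefold reduction in this simplified model.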

Space requirements are dictated by storing per-prefix key and value matrices at each node, resulting in a worst-case memory use of $O(N L (d_k + d_v))$, where $N$ is the prompt count and $L$ the maximum prompt length. In implementation, the prefix tree is compressed with variable-length edge labels, and LRU eviction controls cache size (Dexter et al., 7 Feb 2025).
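The LRU policy mentioned above can be illustrated with a flat cache keyed by prefix. This is a deliberately simplified sketch: the class name `LRUPrefixCache` and its token-count budget are assumptions, and a real radix-tree cache would evict leaf nodes rather than whole flat entries.

```python
from collections import OrderedDict


class LRUPrefixCache:
    """Flat stand-in for a KV cache with a budget measured in cached tokens."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0
        self._entries = OrderedDict()  # prefix tuple -> cached payload

    def get(self, prefix):
        key = tuple(prefix)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, prefix, payload):
        key = tuple(prefix)
        if key in self._entries:
            self.used -= len(key)
            del self._entries[key]
        self._entries[key] = payload
        self.used += len(key)
        # Evict least recently used entries until the budget is met.
        # A single oversized entry is kept rather than evicted immediately.
        while self.used > self.max_tokens and len(self._entries) > 1:
            old_key, _ = self._entries.popitem(last=False)
            self.used -= len(old_key)


cache = LRUPrefixCache(max_tokens=5)
cache.put((1, 2), "kv-a")
cache.put((3, 4), "kv-b")
cache.get((1, 2))          # touch (1, 2) so that (3, 4) becomes the oldest
cache.put((5, 6), "kv-c")  # exceeds the budget and evicts (3, 4)
```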

3. Query Scheduling with Prefix Reuse: Theory and Algorithms

The integration of RadixAttention into production LLM serving workloads raises sophisticated scheduling challenges. Each query is defined by its prompt and arrival time, and the objective is to minimize TTFT and TPOT while leveraging prefix reuse for computational savings.

Prefix reuse changes the scheduling problem because the processing order determines how much of the KV cache can be reused. The key result establishes that, under prefix reuse and TTFT deadlines, finding a feasible schedule is strongly NP-hard, shown by a reduction from 3-PARTITION (Dexter et al., 7 Feb 2025).

To address this, the $k$-LPM (Longest-Prefix-Match with window size $k$) algorithm is proposed: after processing the oldest query, it greedily processes up to $k$ pending queries with maximal prefix overlap with the last served. For $k = 0$ it recovers FCFS; as $k \to \infty$ it becomes LPM. Theoretical analysis proves that for common user-doc workloads (a user prefix repeated across multiple requests, each with a unique doc suffix), $k$-LPM provides deterministic TTFT bounds expressed in terms of the request rate, the prefix length, and the doc-suffix length. Empirically, $k$-LPM with small $k$ robustly outperforms both FCFS and LPM, especially in moderate-to-high throughput regimes (Dexter et al., 7 Feb 2025).
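The ordering rule described above can be sketched offline as follows. This is a simplified reading of the description, not the algorithm as published: it assumes all queries are already pending, measures overlap only against the last served prompt rather than against the whole cache, and uses hypothetical names (`common_prefix_len`, `k_lpm_order`). Characters stand in for tokens.

```python
def common_prefix_len(a, b):
    """Number of leading elements shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def k_lpm_order(prompts, k):
    """Return indices of `prompts` (listed in arrival order) in service order."""
    pending = list(range(len(prompts)))
    order = []
    while pending:
        # FCFS step: serve the oldest pending query.
        last = pending.pop(0)
        order.append(last)
        # Up to k greedy steps: serve the pending query that shares the
        # longest prefix with the one just served (ties go to the oldest).
        for _ in range(k):
            if not pending:
                break
            best = max(pending, key=lambda i: common_prefix_len(prompts[i], prompts[last]))
            pending.remove(best)
            order.append(best)
            last = best
    return order


prompts = ["uA doc1", "uB doc2", "uA doc3", "uB doc4"]
fcfs_like = k_lpm_order(prompts, k=0)
windowed = k_lpm_order(prompts, k=1)
```

With `k=0` the sketch degenerates to arrival order, while `k=1` pulls the second `uA` request forward to follow the first, which is the reuse-versus-fairness trade-off the window size controls.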

4. Prefix-Aware Attention Kernel Implementation: PAT

The "Prefix-Aware Attention Kernel" (PAT) (Yi et al., 27 Nov 2025) advances these ideas to the CUDA kernel level, tightly coupling prefix reuse with GPU resource scheduling. PAT operates in a pack–forward–merge fashion:

Pack: Incoming queries are grouped by maximal shared prefix, forming a forest structure in which each group (a CTA in CUDA terminology) minimizes redundant KV cache reads. Packing schemes are selected by maximizing a profit ratio computed for each prefix node.
Forward: Each CTA is processed by a multi-tile, resource-adaptive kernel. Tile sizes are chosen, subject to shared-memory, register, and bandwidth constraints on A100 hardware, to fit the group's query/KV shape exactly. Multiple CUDA streams and KV splitting balance work across SMs and avoid stragglers.
Merge: Partial attention results (maximum score, log-sum-exp accumulator, value sum) are merged via a lightweight online-softmax reduction, whose overhead is a small fraction of total kernel time.
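The merge step relies on the standard online-softmax identity: attention over a split KV range can be recovered from per-chunk running statistics. The sketch below shows that identity in plain Python for a single query. It is not the PAT kernel; the function names are hypothetical, and it tracks a sum of exponentials rather than its logarithm for brevity.

```python
import math


def partial_attention(scores, values):
    """State for one KV chunk: (max score, sum of exp, exp-weighted value sum)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return m, sum(weights), acc


def merge(a, b):
    """Combine two chunk states by rescaling both to a common maximum."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(acc_a, acc_b)]


def finalize(state):
    """Normalize the accumulated value sum to get the attention output."""
    _, l, acc = state
    return [x / l for x in acc]


scores = [0.5, -1.0, 2.0, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

# Attention computed in one pass over all four keys.
whole = finalize(partial_attention(scores, values))

# The same result from two chunks (e.g. shared prefix and per-query suffix).
merged = finalize(merge(partial_attention(scores[:2], values[:2]),
                        partial_attention(scores[2:], values[2:])))
```

Because the merge only touches three small quantities per query, the shared-prefix part can be computed once per group and combined with each query's private part afterwards.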

PAT integrates as a plugin into vLLM and demonstrates kernel speedups over FlashAttention, along with consistent reductions in TPOT and TTFT across representative real-world and synthetic workloads, particularly where large-scale prefix overlap is common (Yi et al., 27 Nov 2025).

5. Comparative Benchmarks and System Impact

Empirical studies have benchmarked Prefix RadixAttention approaches against standard baselines such as FlashAttention, FlashInfer, FastTree, and RelayAttention++. The PAT kernel reduces attention latency by 67.4% on average and TPOT by 13.6%–83.4% across diverse batch shapes and LLMs (LLaMA-3-8B, Qwen-3-8B, 8–32K-tokens) (Yi et al., 27 Nov 2025). End-to-end TTFT improvements range from 7.9% to 99.6% in streaming settings. In query scheduling experiments with 2,100 queries and personalized user prefixes, $k$-LPM achieves 10–30% lower P99 TTFT under high throughput, in accordance with theoretical bounds (Dexter et al., 7 Feb 2025).

The table summarizes representative performance results:

Method	P99 TTFT at 200 req/s	Kernel Latency Speedup	TPOT Reduction
FCFS	300 ms	n/a	n/a
LPM	220 ms	n/a	n/a
$k$-LPM (small $k$)	160 ms	n/a	n/a
PAT	n/a	[…] vs FlashAttention	13.6%–83.4%

TTFT and TPOT are as defined above; kernel speedups refer to the attention step only.

6. Architectural and Practical Considerations

Prefix RadixAttention demonstrates that the combination of hierarchical data structures (radix trees for prefix factorization) and bespoke GPU kernel design (prefix packing, multi-stream parallelism) enables LLM servers to approach hardware memory bandwidth limits even under highly non-uniform batch and prefix distributions (Yi et al., 27 Nov 2025). The avoidance of redundant KV reads, adaptive tiling per CTA, and cross-CTA concurrency are all critical for saturation of available GPU resources in the decode path. Memory overhead remains manageable via cache compression and LRU eviction.

Challenges remain in managing worst-case prefix tree growth and schedule fairness under adversarial traffic, but both theoretical and empirical findings indicate robust performance in practical, heavily prefix-redundant scenarios.

7. Relationship to Broader Prefix Reuse Paradigms

Prefix RadixAttention generalizes earlier approaches to prefix reuse in transformer inference by providing a formalized, data-structure-centric approach and integrating system-level scheduling optimization. It supports efficient batched inference in high-throughput, low-latency settings, and connects scheduling policy (through $k$-LPM) with algorithmic and hardware-aware kernel implementations. A plausible implication is that, as LLM workloads grow more complex and personalized, Prefix RadixAttention methodologies will become foundational for state-of-the-art serving systems, especially in environments where TTFT and TPOT directly affect user experience and infrastructure cost (Dexter et al., 7 Feb 2025, Yi et al., 27 Nov 2025).


References

Yi et al. PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel (27 Nov 2025)

Dexter et al. LLM Query Scheduling with Prefix Reuse and Latency Constraints (7 Feb 2025)