Targeted Pruning for Prefill-Decode Disaggregation

Updated 25 September 2025
  • The paper demonstrates that stage-aware pruning achieves over 20% inference speedup by pruning the prefill stage conservatively and the decode stage aggressively.
  • It introduces block- and circuit-level pruning using cosine similarity and simulated annealing to refine removal sets with minimal impact on validation objectives.
  • KV cache pruning with token- and layer-awareness yields nearly 5× bandwidth reduction, effectively addressing communication bottlenecks in distributed LLM serving.

Targeted pruning for prefill-decode disaggregation is an inference- and system-level optimization paradigm in LLM serving that aims to maximize efficiency by selectively removing redundant computational pathways, parameters, model blocks, or KV cache entries, with techniques tailored to the distinct properties—and sensitivities—of the prefill and decode stages. The core motivation is to reduce computational cost, memory usage, and inter-node bandwidth, particularly in high-throughput or distributed inference where prefill (context processing) and decode (autoregressive token generation) are scheduled or executed separately, often on dedicated hardware or disjoint resources.

1. Disaggregated Prefill and Decode: Motivation and System Challenges

Prefill-decode (PD) disaggregation splits inference into two logical phases: the prefill phase (processing the input prompt, building the initial KV cache, and emitting the first token) and the decode phase (generating each output token autoregressively by consuming and extending the KV cache). In distributed and high-performance inference systems, this separation is key to workload balance—prefill is compute-bound and batch-friendly, while decode is memory-bandwidth limited and inherently sequential (Agrawal et al., 2023, Zhong et al., 18 Jan 2024). However, naively splitting these phases can create two major issues:

  1. Sensitivity Asymmetry: The prefill phase is more sensitive to accuracy loss—errors or artifacts propagate through the entire KV cache and affect all subsequent generations. Decode, by contrast, is more robust to localized approximation or pruning.
  2. Bandwidth Cost: Full disaggregation requires transferring large KV caches between nodes, which can dominate end-to-end latency and limit scalability (Zhang et al., 29 Aug 2025). Optimally, one wants to prune away as much unnecessary computation and state as possible to minimize both latency and communication overhead without degrading output quality.

2. Block and Circuit-Level Pruning Strategies

Targeted block and circuit pruning address computational redundancy within the transformer architecture by identifying entire layers, blocks, or parameter subgraphs that can be removed with minimal impact. The leading approach constructs pruning and distillation sets based on redundancy metrics—for instance, the cosine similarity between each block’s input and output states, $r_i = \cos(h_{i-1}, h_i)$, or grouped cosines for consecutive block pairs (Zhang et al., 29 Aug 2025). The overall procedure:

  1. Compute redundancy scores for all blocks.
  2. Form a pruning set of the top redundant blocks.
  3. Define a distillation set of block pairs suitable for merging based on aggregate cosine similarity.
  4. Use simulated-annealing–like iterative optimization to refine the removal set, evaluating candidate sets by their impact on a validation-set objective (the scoring and set construction of steps 1–3 are sketched in code below).
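
The redundancy scoring and candidate-set construction of steps 1–3 can be made concrete with a short sketch. This is not the authors' implementation: the function names are hypothetical, and it assumes per-block hidden states have already been collected on a small calibration set.

```python
import torch.nn.functional as F

def block_redundancy_scores(hidden_states):
    """Cosine-similarity redundancy r_i = cos(h_{i-1}, h_i) for each block.

    hidden_states: list [h_0, h_1, ..., h_L] where h_i is the output of block i
    (h_0 is the embedding output), each a tensor of shape (num_tokens, d_model)
    gathered on a small calibration set. Returns one score per block, averaged
    over tokens; high similarity means the block changes its input little.
    """
    scores = []
    for h_prev, h_curr in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_prev, h_curr, dim=-1)  # per-token cosine
        scores.append(cos.mean().item())
    return scores  # scores[i] is the redundancy of block i + 1

def build_candidate_sets(scores, num_prune, num_distill_pairs):
    """Top redundant blocks form the pruning set; the most similar consecutive
    pairs (by grouped/average cosine) form the distillation (merging) set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pruning_set = set(order[:num_prune])
    pair_scores = [(i, (scores[i] + scores[i + 1]) / 2) for i in range(len(scores) - 1)]
    pair_scores.sort(key=lambda t: t[1], reverse=True)
    distill_set = [i for i, _ in pair_scores[:num_distill_pairs]]
    return pruning_set, distill_set
```

The candidate sets produced here are then handed to the stage-specific annealing refinement described in Section 4.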

Unlike global uniform pruning, this approach can be stage-aware: it is feasible to prune more aggressively in the decode pathway than in prefill, as errors in decode do not snowball across many tokens.

Parallel advances in attribution-guided methods use Layer-wise Relevance Propagation (LRP) to rank neurons, heads, or parameters with respect to importance for specific tasks or stages, enabling circuit extraction: targeted removal of task-irrelevant subgraphs or even correction/removal of circuits underlying undesirable model behavior (Hatefi et al., 16 Jun 2025).

3. KV Cache Pruning with Token- and Layer-Awareness

KV cache pruning addresses transmission and memory bottlenecks imposed by PD disaggregation. Rather than transferring or storing the full KV cache for every sequence, targeted mechanisms leverage empirical attention patterns:

  • In the decode stage, KV cache entries for the first and last portions of the context (e.g., the first and last $p \cdot N$ tokens) often receive the highest attention weight.
  • By computing aggregate attention scores per layer and per head, one can select only those KV cache entries and/or layers for transmission or retention, using a combined mean-variance metric such as $\mathrm{LayerScore}_l = \mu_l \cdot \left(1 - \sigma_l/(\mu_l+\varepsilon)\right)$ (Zhang et al., 29 Aug 2025); see the sketch after this list.
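
As a concrete illustration of this selection rule, the following sketch keeps the first and last $p \cdot N$ tokens and the highest-scoring layers before transferring the cache to a decode node. The function name, tensor layouts, and default ratios are illustrative assumptions, not the paper's code.

```python
import torch

def select_kv_for_transfer(attn_scores, kv_cache, p=0.1, layer_keep=0.5, eps=1e-6):
    """Token- and layer-aware KV cache selection before cross-node transfer.

    attn_scores: (num_layers, num_heads, seq_len) aggregate attention received
    by each cached token, collected during prefill.
    kv_cache: list of per-layer (K, V) tensors, each shaped (seq_len, ...).
    Returns {layer_idx: (K_kept, V_kept)} containing only the selected entries.
    """
    num_layers, _, seq_len = attn_scores.shape

    # Layer selection: mean-variance score favouring layers whose attention is
    # both high and stable: LayerScore_l = mu_l * (1 - sigma_l / (mu_l + eps)).
    per_layer = attn_scores.mean(dim=1)          # (num_layers, seq_len)
    mu, sigma = per_layer.mean(dim=-1), per_layer.std(dim=-1)
    layer_score = mu * (1 - sigma / (mu + eps))
    keep_layers = torch.topk(layer_score, max(1, int(layer_keep * num_layers))).indices

    # Token selection: keep the first and last p*N tokens of the context.
    p_n = max(1, int(p * seq_len))
    keep_tokens = torch.cat([torch.arange(p_n), torch.arange(seq_len - p_n, seq_len)])

    pruned = {}
    for l in keep_layers.tolist():
        k, v = kv_cache[l]
        pruned[l] = (k[keep_tokens], v[keep_tokens])
    return pruned
```

In a real system the kept token indices would also be transmitted so the decode node can place the surviving entries at the correct positions.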

This token-aware, layer-selective approach results in up to 4.95× bandwidth reduction without notable accuracy loss, as shown by experimental results in LLM PD inference, while retaining all KV cache information during prefill.

4. Iterative Optimization and Pruning-Stage Decoupling

Stage-specific iterative pruning refines the removal set for each inference stage independently, often using simulated annealing to swap candidate blocks or pairs in and out of the removal set, guided by objective-function improvements and controlled by a cooling schedule ($T \rightarrow \alpha T$ per iteration).

A summary of this process:

  • For each candidate removal or swap, calculate the difference in validation-set objective ($f_{\text{new}}$ vs. $f_{\text{current}}$).
  • Accept the new removal set with probability $P_{\text{accept}} = \exp(-\Delta f / T)$.
  • The process runs independently for prefill and decode, allowing more conservative block retention in prefill and more aggressive pruning in decode, based on validation/performance studies that quantify error amplification across multiple decode steps (Zhang et al., 29 Aug 2025); a minimal annealing loop is sketched below.
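
The acceptance rule above can be written as a minimal, self-contained annealing loop. The helper names and the validation objective (e.g., held-out perplexity with the candidate blocks skipped) are assumptions for illustration, not the paper's implementation; the loop is run once per stage.

```python
import math
import random

def anneal_removal_set(candidates, initial_set, objective, T0=1.0, alpha=0.95, steps=200):
    """Refine a block-removal set for one stage (prefill or decode).

    candidates: prunable block indices; initial_set: starting removal set;
    objective(removal_set) -> validation loss with those blocks skipped
    (lower is better). Returns the best removal set found and its objective.
    """
    current, f_current = set(initial_set), objective(set(initial_set))
    best, f_best = set(current), f_current
    T = T0
    for _ in range(steps):
        # Propose a swap: drop one block from the set, add another candidate.
        proposal = set(current)
        if proposal:
            proposal.discard(random.choice(sorted(proposal)))
        proposal.add(random.choice(candidates))

        delta = objective(proposal) - f_current
        # Always accept improvements; accept worse sets with prob exp(-delta / T).
        if delta < 0 or random.random() < math.exp(-delta / T):
            current, f_current = proposal, f_current + delta
            if f_current < f_best:
                best, f_best = set(current), f_current
        T *= alpha  # cooling schedule: T -> alpha * T
    return best, f_best
```

Running this once per stage, with a stricter objective threshold or smaller candidate pool for prefill, yields the asymmetric (conservative prefill, aggressive decode) removal sets described above.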

5. Impact on Inference Speed, Bandwidth, and System Design

Targeted pruning for PD disaggregation brings significant performance improvements:

  • Inference acceleration: Experimental results demonstrate 20.56% end-to-end inference speedups due to a reduced number of executed blocks and less KV cache overhead in both prefill and, especially, decode (Zhang et al., 29 Aug 2025).
  • Communication reduction: Token-aware, layer-selective cache pruning reduces cross-node data transfer bandwidth by nearly 5× under standard PD disaggregation benchmarks.
  • Adaptation to hardware topology: Coupling these techniques with optimized workload partitioning (e.g., overlapping execution and proportional assignment of compute across heterogeneous GPU clusters (Liu et al., 22 Sep 2025)) enables throughput and latency gains even in non-uniform hardware environments.

The combination of block- and cache-level pruning synergizes with existing inference system frameworks, including microserving and programmable routers (Jin et al., 17 Dec 2024), allowing fine-grained dynamic orchestration routines that adapt to varying bottlenecks, resource capacities, and workload profiles.

6. Relations to Other Targeted Pruning Paradigms

Targeted pruning in PD disaggregation is compatible with a variety of model compression strategies:

  • Row-wise metric-driven pruning as in TRIM (Dong et al., 19 May 2025) assigns per-output-dimension sparsity, focusing preservation on dimensions that most impact decoding performance.
  • Activation sparsity and structured N:M sparsity (e.g., Amber Pruner (An et al., 4 Aug 2025)) apply structured pruning, particularly in the compute-intensive prefill phase, leaving the sensitive decode pathway less affected; a generic N:M sketch follows this list.
  • Attribution-guided and circuit-level pruning (Hatefi et al., 16 Jun 2025) further enable correction and refinement of model behavior, including for safety-critical use cases (e.g., filtering undesired outputs during decoding).
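
As a concrete (and deliberately generic) example of the N:M pattern mentioned above, the sketch below zeroes all but the n largest-magnitude weights in every group of m along a linear layer's input dimension. This illustrates the general 2:4 sparsity format exploited by structured-sparsity hardware, not Amber Pruner's specific selection criterion.

```python
import torch

def apply_nm_sparsity(weight, n=2, m=4):
    """Generic N:M structured sparsity on a (out_features, in_features) weight.

    In every group of m consecutive input weights, keep only the n entries with
    the largest magnitude; hardware with structured-sparsity support can then
    skip the zeroed entries in the compute-bound prefill GEMMs.
    """
    out_f, in_f = weight.shape
    assert in_f % m == 0, "in_features must be divisible by m"
    groups = weight.reshape(out_f, in_f // m, m)
    # Indices of the (m - n) smallest-magnitude weights in each group are zeroed.
    _, drop_idx = groups.abs().topk(m - n, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_f, in_f)
```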

7. Limitations and Future Research Directions

While targeted pruning enables sizable gains, several open areas remain:

  • Dynamic memory management for constructing pruned subgraphs on the fly without incurring excessive allocation overhead.
  • MoE and beyond-dense architectures require specialized pruning mechanisms that account for gating and expert selection in PD-disaggregated settings.
  • Adaptive and workload-aware pruning: Future systems may adjust pruning levels at runtime, responding to changes in traffic, model architecture, or system constraints.
  • Integration with quantization and other compression methods, finding joint optima across multiple axes of efficiency.

Comprehensive system integration of these strategies will be required to fully realize performance gains across diverse model types, hardware architectures, and serving workloads.


In summary, targeted pruning for prefill–decode disaggregation unifies sensitivity-aware block removal, token/layer-aware KV cache reduction, and iterative optimization. When deployed in modern LLM inference systems, these techniques not only accelerate serving by over 20% but also drastically reduce data transfer bottlenecks, especially under distributed or heterogeneous-resource deployment. Empirical evidence indicates that leveraging stage-level characteristics is critical: conservative pruning in prefill, aggressive and token-aware strategies in decode, and dynamic orchestration driven by system workload all contribute to an efficient, scalable, and adaptable LLM serving stack (Zhang et al., 29 Aug 2025, Singh et al., 25 Dec 2024, Dong et al., 19 May 2025, Liu et al., 22 Sep 2025).
