
LazyLLM: Dynamic Token Pruning

Updated 12 August 2025
  • The paper introduces a dynamic, layerwise token pruning approach that adaptively reduces computation while preserving model accuracy.
  • The methodology leverages attention scores at each layer to compute token importance, enabling reversible pruning through an auxiliary cache.
  • Empirical evaluations demonstrate up to 3× speedup in TTFT on long-context tasks with less than 1% accuracy drop, showcasing significant efficiency gains.

LazyLLM: Dynamic Token Pruning refers to a set of inference techniques that adaptively prune less important tokens during the execution of large transformer models, particularly in tasks involving long-context LLMs. The core principle is to minimize unnecessary computation and memory usage by identifying, at each model layer and decoding step, which tokens are most relevant to the next prediction, allowing more efficient use of resources without significant degradation in output quality. LazyLLM is characterized by dynamic, stepwise, and layer-wise decisions, making token selection reversible and contextually adaptive, a departure from static, one-shot pruning approaches. This paradigm is central both to efficient LLM inference and to scalable deployment in resource-constrained environments.

1. Motivation and Problem Context

The exponential increase in transformer model size and context window length has substantially raised the computational and memory demands of LLM inference. In conventional transformer-based LLMs, two phases are involved in processing a long prompt: (1) the prefilling stage, which requires computing the key-value (KV) cache for the entire prompt, and (2) the decoding stage, which generates tokens autoregressively. For long-context scenarios, the time-to-first-token (TTFT)—dominated by the cost of prefilling the full prompt—often becomes the primary latency bottleneck, as all tokens must be processed, even though many are not critical to the immediate output.
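
To make the latency breakdown concrete, the following minimal sketch (not the paper's implementation; model_prefill and model_decode_step are hypothetical placeholders) shows where TTFT is incurred in a two-phase generation loop:

import time

def generate(model_prefill, model_decode_step, prompt_ids, max_new_tokens):
    # Prefilling: process the full prompt and build the KV cache; cost grows with prompt length.
    start = time.perf_counter()
    kv_cache, first_token = model_prefill(prompt_ids)
    ttft = time.perf_counter() - start  # time-to-first-token is dominated by this phase
    # Decoding: generate the remaining tokens autoregressively, reusing the KV cache.
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = model_decode_step(tokens[-1], kv_cache)
        tokens.append(next_token)
    return tokens, ttft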

Static pruning approaches, which remove fixed subsets of tokens or compress the prompt prior to inference, are limited in their ability to adapt to token-level context relevance and may cause notable accuracy degradation or eliminate information required downstream. These methods cannot revive pruned tokens if they become important in later steps, and they do not address the evolving importance of tokens at different depths of the model.

LazyLLM addresses these limitations by introducing dynamic token pruning strategies, where token selection is performed at each step and layer based on current attentional relevance, enabling reversible pruning and adaptive resource allocation throughout the inference process (Fu et al., 19 Jul 2024).

2. Dynamic Token Pruning Methodology

LazyLLM’s central mechanism is a dynamic, attention-guided, progressive pruning procedure that “lazily” constructs the KV cache. At each transformer layer and at each generation step (during both prefilling and decoding), attention maps are used to compute quantitative importance scores for all context tokens relative to the next-token prediction.

Let $A^{l}_{h,i,N}$ denote the attention probability that head $h$ in layer $l$ assigns from the current prediction position $N$ to token $t_i$. The importance score for token $t_i$ at layer $l$ is computed as:

$s^{l}_i = \frac{1}{H} \sum_{h=1}^{H} A^{l}_{h,i,N}$

where $H$ is the number of attention heads.

Tokens are then selected based on their percentile ranking in $s^{l}_i$: a top-$k^l$ percentile threshold is applied, so tokens below the threshold are pruned from computation at the next layer $(l+1)$, while higher-scoring tokens are retained. The threshold $k^l$ can be set to keep almost all tokens in early layers (e.g., $k^1 \approx 100\%$) and to prune more aggressively in deeper layers, exploiting the empirical observation that deeper layers are less sensitive to token removal (Fu et al., 19 Jul 2024).
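
As a toy illustration of this selection rule (a sketch with made-up scores, not values from the paper), consider six context tokens at a layer with $k^l = 50\%$:

import numpy as np

# Head-averaged importance scores s^l_i for six context tokens (illustrative values).
s = np.array([0.02, 0.30, 0.05, 0.25, 0.08, 0.30])
keep_percent = 50  # k^l: retain the top 50% of tokens at this layer

# Tokens at or above the (100 - k^l)-th percentile of scores are retained.
threshold = np.percentile(s, 100 - keep_percent)
keep_mask = s >= threshold
print(keep_mask)  # [False  True False  True False  True]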

A distinguishing innovation is the auxiliary cache: pruned tokens' hidden states are stored so that, if they become relevant at a subsequent layer (e.g., due to a shift in attentional focus), their computation need not be re-executed. Each token is thus computed at most once per layer, and the worst-case runtime is never worse than full inference without pruning. The procedure is therefore efficient by construction and non-destructive.

Pseudocode Sketch

import numpy as np

H = number_of_heads  # attention heads per layer
for l in range(L):
    # Head-averaged attention that the prediction position (index N-1) pays
    # to each of the N context tokens: the importance score s^l_i.
    s = np.zeros(N)
    for h in range(H):
        A = attention_map[l][h]  # shape: N x N (rows = queries, cols = keys)
        s += A[N - 1, :]         # attention from the prediction position to every token
    s /= H
    # Retain tokens in the top-k^l percentile; prune the rest before layer l+1.
    keep_mask = s >= np.percentile(s, 100 - topk_percentile[l])
    # Forward only the retained tokens to the next layer.
    x_next = compute_next_layer(x_current[keep_mask])
    # Stash pruned hidden states (with their original positions) in the auxiliary cache.
    aux_cache[l + 1] = (np.where(~keep_mask)[0], x_current[~keep_mask])
    # Revive cached tokens if a later step attends to them again.
    x_current = restore_tokens(x_next, aux_cache[l + 1])

Here, restore_tokens brings a pruned token back into the working set if it becomes relevant again at a subsequent layer or generation step.
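
A minimal sketch of what restore_tokens might look like is given below. It assumes the auxiliary cache stores each pruned token's original position alongside its hidden state (as in the loop above) and takes two extra bookkeeping arguments that the pseudocode omits for brevity; it is illustrative, not the paper's implementation:

import numpy as np

def restore_tokens(x_next, aux_entry, kept_positions, needed_positions):
    """Rebuild the working set of hidden states, reviving cached tokens on demand."""
    pruned_positions, pruned_states = aux_entry
    states = {p: x_next[i] for i, p in enumerate(kept_positions)}
    for i, p in enumerate(pruned_positions):
        if p in needed_positions:
            # Served from the auxiliary cache: the token is never recomputed at this layer.
            states[p] = pruned_states[i]
    # Return hidden states in the original token order.
    return np.stack([states[p] for p in sorted(states)])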

3. Comparison to Static and Prior Dynamic Pruning Approaches

Conventional static pruning—such as prompt compression, static segmentation, or "token dropping"—removes a fixed subset of tokens or prunes at a single layer based on global criteria (e.g., average attention scores across the prompt), and this mask is used throughout inference. If a token is dropped and later found to be important, it cannot be recovered without reprocessing, leading to either degraded accuracy or increased complexity (e.g., “pre-profiling” the entire prompt with several passes, which negates efficiency gains) (Fu et al., 19 Jul 2024).

LazyLLM differs fundamentally:

  • Dynamic selection: Token retention/pruning decisions are made per-layer and per-step.
  • Revivability: Pruned tokens can be recalled at later layers if their attentional importance grows.
  • Progressive depthwise pruning: Deeper layers use more aggressive pruning.
  • No fine-tuning required: Integration is direct and parameter-free, operating entirely at the inference level.

Empirical results show that static methods tend either to cause significant accuracy loss or fail to improve TTFT when “pre-profiling” is required (Fu et al., 19 Jul 2024). LazyLLM’s dynamic approach achieves better accuracy-efficiency trade-offs by adapting to evolving token importance throughout inference and across input instances.

4. Experimental Evaluation and Efficiency Gains

Comprehensive benchmarking of LazyLLM has been conducted on 16 datasets spanning multi-document QA, summarization, code completion, and various few-shot learning settings for long prompts (Fu et al., 19 Jul 2024). Key findings include:

  • In multi-document QA with Llama 2 7B, LazyLLM achieves a 2.34× speedup in the prefilling stage (TTFT) while maintaining baseline accuracy.
  • Across other tasks, LazyLLM yields TTFT speedups in the range of 1.3× to over 3×, with typical accuracy drops below 1%.
  • Over the course of generation, only 63%–88% of prompt tokens are ultimately processed, yielding computational savings beyond the prefilling stage alone.
  • Per-task ablation studies demonstrate that later-layer, more aggressive pruning is effective, and auxiliary cache revivals are essential to avoid performance degradation.

A summary of results is given in the following table:

Task            | Model     | TTFT Speedup | Accuracy Drop
Multi-doc QA    | Llama2-7B | 2.34×        | <0.1%
Summarization   | Llama2-7B | 1.5×–2.2×    | <0.7%
Code completion | XGen-7B   | 1.7×         | <1%

This table demonstrates that adaptive pruning yields substantial latency reduction without violating performance constraints.

5. Practical Integration and Implementation

LazyLLM is designed for compatibility with existing transformer architectures and deployment pipelines:

  • Training-free: No fine-tuning, re-training, or additional model parameters are needed. The technique applies as a modification to the KV cache construction logic used during inference, requiring only access to per-layer attention scores.
  • Auxiliary cache management: Pruning and possible revival of tokens is handled via bookkeeping in the auxiliary cache, ensuring tokens are only computed once per layer.
  • No architectural constraints: Unlike some prior methods, LazyLLM is applicable to any pre-trained transformer model that exposes intermediate attention maps, including widely adopted LLMs such as Llama2 and XGen.
  • Application domains: Immediate benefits are seen in multi-document QA and any task with long prompt contexts, but the method is generalizable to summarization, code generation, and few-shot learning with extensive prompt chaining.

The direct inference-level implementation makes LazyLLM appropriate for real-world LLM-serving scenarios, especially where prompt length and first-token latency are dominant bottlenecks.
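
As a concrete illustration of the "only access to per-layer attention scores" requirement, the scoring step from Section 2 can be reproduced with a Hugging Face Transformers causal LM that returns attention maps. The sketch below covers only the scoring, not the pruning or cache logic, and the model name is a placeholder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM that exposes attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Eager attention so that attention weights are materialized and returned.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")

inputs = tokenizer("A long prompt ...", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, each of shape (batch, heads, seq_len, seq_len)
scores = []
for attn in out.attentions:
    # Head-averaged attention from the last (prediction) position to every prompt token,
    # i.e., the per-layer importance scores s^l_i that would drive keep/prune decisions.
    scores.append(attn[0, :, -1, :].mean(dim=0))  # shape: (seq_len,)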

6. Trade-offs, Limitations, and Future Directions

Adaptive, dynamic token selection sets a new Pareto frontier in the efficiency-accuracy trade-off. Nevertheless, several considerations remain:

  • Aggressiveness of pruning: Overly aggressive pruning in earlier layers can cause irrecoverable loss of context; optimal percentile thresholds often require careful tuning per model and task.
  • Auxiliary cache storage: While each token is only computed once per layer, the auxiliary cache for pruned tokens can increase transient memory usage; the net effect is still positive given the reduction in total KV cache and processed prompt tokens.
  • Layerwise adaptation: Static percentile thresholds per layer may not be optimal for all input distributions or LLM architectures; more sophisticated adaptive scheduling, or even reinforcement-learning-guided threshold determination, remains an open research area (a simple depth-dependent schedule is sketched after this list).
  • Extensions: Integration with memory compression, cache-aware masking, or synergy with structured channel pruning could further enhance gains, as indicated by related research (Federici et al., 2 Dec 2024, Joo et al., 28 May 2025, Tao et al., 6 Apr 2025).
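
As one simple illustration of such a schedule (a sketch with illustrative values, not a tuning recommendation from the paper), the keep percentage $k^l$ can be made to decay linearly with depth:

import numpy as np

def linear_keep_schedule(num_layers, keep_first=100.0, keep_last=30.0):
    """Keep percentage k^l per layer: lenient in early layers, aggressive in deep ones."""
    return np.linspace(keep_first, keep_last, num_layers)

print(linear_keep_schedule(8))
# [100.  90.  80.  70.  60.  50.  40.  30.]

More sophisticated schedules, whether adaptive per input or learned, could replace the linear ramp without changing how the thresholds are consumed by the pruning loop.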

Future work may investigate fully differentiable, jointly optimized dynamic pruning schedules, lightweight attention proxy mechanisms, and integration with parallel or asynchronous decoding strategies to maximize throughput and quality.

7. Broader Impact and Relationship to Other Dynamic Pruning Paradigms

LazyLLM’s dynamic pruning is aligned with a broader movement toward adaptive inference in modern deep learning—where computation is allocated not statically per model or input, but adaptively at runtime. Similar principles have been explored in vision transformers (e.g., dynamic token selection, token idling, and “soft pruning” with token aggregation (Kong et al., 2021, Xu et al., 2023)), as well as in multimodal models where cross-modal importance guides token retention (Cao et al., 5 Mar 2024, Yu et al., 2 Sep 2024).

Unlike some vision-centric approaches, which may rely on lightweight MLP selectors or auxiliary heads to decide token utility, LazyLLM integrates pruning at the level of transformer attention, which is naturally present in LLM design and captures instance-specific relevance. The dynamic, layerwise, reversible pruning strategy embodied in LazyLLM, together with its auxiliary cache management, marks a principled progression from static to contextual, resource-aware inference in transformer-based LLMs.

In summary, LazyLLM: Dynamic Token Pruning provides a robust, empirically validated, and implementation-friendly approach to adaptive token selection in LLM inference. It achieves substantial latency and compute reductions for long-context tasks, integrates seamlessly with existing models and serving pipelines without retraining, and serves as a foundation for future advances in efficient, context-adaptive transformer inference (Fu et al., 19 Jul 2024).