Window-Based Token Pruning & Caching Techniques
- The paper presents window-based token pruning and caching strategies that exploit temporal and spatial locality to reduce compute and memory costs in transformers.
- It details methodologies such as DiffScore-based token selection, sliding-window caching, dynamic window sizing, and quantized token management to minimize redundant computations.
- Empirical evaluations in diffusion and language models demonstrate significant speedups and memory reductions with minimal impact on quality when hyperparameters are finely tuned.
Window-based token pruning and caching refer to a family of inference-time optimizations for large transformer-based models—especially diffusion models and LLMs—that exploit temporal or structural locality to reduce compute and memory costs by (1) pruning tokens, attention heads, or cache entries outside a dynamic window of influence, and (2) aggressively reusing previously computed intermediate representations. These methods leverage the empirical observation that, within each decoding or denoising phase, only a small subset of tokens exhibits significant dynamism or relevance, while the remaining tokens can be pruned, compressed, or replaced with cached values without degrading predictive performance. Key instantiations span text-to-image diffusion (e.g., Stable Diffusion), video diffusion transformers, and diffusion language models (DLMs), as well as autoregressive LLMs with extended context.
1. Foundations: Structural Locality and Redundant Computation
Token pruning and feature caching build on the insight that transformer inference is dominated by redundant computations across both tokens (spatial/sequence positions) and steps (temporal iterations). For instance, in Stable Diffusion, the denoising U-Net applies expensive self-attention on high-resolution feature maps where many tokens change slowly across steps. Similarly, in DLMs such as LLaDA and Dream, iterative denoising updates only a handful of undecoded tokens at any step, but traditional full-sequence attention needlessly recomputes representations for almost all positions (Zhang et al., 2024, Zuo et al., 28 Jan 2026).
Empirical analyses reveal that:
- Most computation is spent on self-attention in shallow up/down blocks for vision diffusion (Zhang et al., 2024), or on recomputing full-sequence representations at every iteration for language diffusion (Zuo et al., 28 Jan 2026).
- Token-level dynamics are highly localized: a small spatial, temporal, or prefix window contains almost all of the computationally salient activity.
- Outside a dynamic window, tokens' representations are quasi-static and can be efficiently refreshed from cached or quantized values.
2. Methodologies: Window-Based Pruning and Caching Algorithms
Numerous algorithmic paradigms have been developed to operationalize window-based pruning and caching, tailored to specific domains.
2.1 Dynamics-Aware Token Pruning (DaTo, Stable Diffusion)
DaTo selects informative tokens at each step by quantifying the per-token "DiffScore," computed as channel-mean absolute difference of feature maps across timesteps. In each patch, only the most dynamic token is selected as the base; other tokens are pruned based on maximal cosine similarity to base tokens. Self-attention is run only on the survivors, and features for pruned tokens are copied from their most similar base (Zhang et al., 2024).
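A minimal NumPy sketch of this selection step (helper names such as `diffscore` and `select_and_prune` are illustrative, not from the paper):

```python
import numpy as np

def diffscore(feat_t, feat_prev):
    """Per-token DiffScore: channel-mean absolute difference across timesteps."""
    return np.abs(feat_t - feat_prev).mean(axis=-1)  # shape: (num_tokens,)

def select_and_prune(feat_t, feat_prev, patch_size):
    """Pick the most dynamic token per patch as a 'base'; map every other
    token to its most cosine-similar base for later feature recovery."""
    scores = diffscore(feat_t, feat_prev)
    num_tokens = feat_t.shape[0]
    base_idx = np.array([p + np.argmax(scores[p:p + patch_size])
                         for p in range(0, num_tokens, patch_size)])
    # Cosine similarity of every token to each base token.
    f = feat_t / (np.linalg.norm(feat_t, axis=-1, keepdims=True) + 1e-8)
    sim = f @ f[base_idx].T                       # (num_tokens, num_bases)
    nearest_base = base_idx[np.argmax(sim, axis=-1)]
    return base_idx, nearest_base

# Toy usage: 8 tokens, 4 channels, patches of 4 tokens.
rng = np.random.default_rng(0)
prev = rng.normal(size=(8, 4))
cur = prev.copy()
cur[2] += 5.0   # token 2 is highly dynamic
cur[6] += 5.0   # token 6 is highly dynamic
bases, recover_from = select_and_prune(cur, prev, patch_size=4)
```

Attention then runs only on the base tokens; each pruned token later copies its feature from the base index recorded in `recover_from`.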
2.2 Sliding-Window Caching (Feature/Phase/KV-Cache)
Both vision and LLMs implement computation windows that slide over tokens or timesteps:
- In DaTo, feature caching follows a windowed schema: after every "fresh" computation, intermediate features are cached and reused for a fixed number of subsequent timesteps before cache-refresh (Zhang et al., 2024).
- In DLMs, windowed caching divides the sequence into phases/windows: a local computation window is maintained, buffer tokens are cached and periodically refreshed, and far-field (pruned) tokens are omitted entirely from each stage (Zuo et al., 28 Jan 2026).
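The fixed-interval refresh schedule common to both settings can be sketched in a few lines; `compute_block` stands in for the expensive transformer block and is purely illustrative:

```python
def run_with_cache(num_steps, refresh_every, compute_block):
    """Recompute features every `refresh_every` steps; reuse the cache otherwise."""
    cache = None
    outputs, fresh_steps = [], []
    for t in range(num_steps):
        if t % refresh_every == 0:       # "fresh" step: recompute and cache
            cache = compute_block(t)
            fresh_steps.append(t)
        outputs.append(cache)            # cache hit on intermediate steps
    return outputs, fresh_steps

outs, fresh = run_with_cache(10, refresh_every=3,
                             compute_block=lambda t: f"feat@{t}")
# Full compute happens only at steps 0, 3, 6, 9; cached features are reused elsewhere.
```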
2.3 Dynamic Window Sizing and Output-Aware Pruning (UniCP, OBCache)
- UniCP (video diffusion) implements Error-Aware Dynamic Cache Windows (EDCW): the cache window at each block and timestep is dynamically set by measuring the distance between current and cached attention outputs/maps, following a U-shaped error curve (Sun et al., 6 Feb 2025).
- OBCache (LLMs) formulates token eviction as structured pruning, quantifying the saliency of keys/values by their direct effect on recent attention outputs via second-order Taylor expansion (OBD). Recent tokens within a protected (window) tranche are never pruned; older tokens are evicted based on minimal output perturbation (Gu et al., 9 Oct 2025).
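A simplified, brute-force proxy for output-aware eviction (scoring each candidate by the exact output perturbation rather than OBCache's closed-form Taylor approximation; all names are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def evict_one(q, K, V, protected):
    """Evict the cached (k, v) pair whose removal least changes the attention
    output for query q; tokens in the protected recent window are never pruned."""
    full = softmax(q @ K.T) @ V
    best, best_err = None, np.inf
    for i in range(len(K)):
        if i in protected:
            continue
        keep = [j for j in range(len(K)) if j != i]
        out = softmax(q @ K[keep].T) @ V[keep]
        err = np.linalg.norm(full - out)   # output perturbation if i is dropped
        if err < best_err:
            best, best_err = i, err
    return best

rng = np.random.default_rng(1)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
victim = evict_one(q, K, V, protected={4, 5})  # last two tokens are protected
```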
2.4 Quantized Sliding-Window KV Pruning
Quantized pruning addresses the token-precision trade-off in cache compression, maintaining more tokens at reduced precision within a sliding window. This approach surpasses pure token-pruning or pure quantization in memory-constrained scenarios, especially in retrieval-heavy tasks (Zhang et al., 2024).
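A sketch of this token-precision trade-off, assuming simple per-tensor uniform quantization and a hypothetical `compress_cache` helper (real systems typically use per-channel or per-group scales):

```python
import numpy as np

def quantize_kv(x, bits=4):
    """Per-tensor uniform quantization, returned in dequantized form."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)
    return codes * scale + lo            # lossy reconstruction of x

def compress_cache(kv, window):
    """Keep the last `window` entries at full precision; quantize the rest."""
    old, recent = kv[:-window], kv[-window:]
    return np.concatenate([quantize_kv(old), recent], axis=0)

rng = np.random.default_rng(2)
kv = rng.normal(size=(16, 8))
compressed = compress_cache(kv, window=4)
# Older entries carry bounded quantization error; the recent window is exact.
```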
2.5 Dynamic Per-Layer, Per-Step Token Selection (LazyLLM)
LazyLLM introduces layerwise, stepwise token pruning, where per-layer attention heatmaps identify the most influential tokens for the next prediction. Tokens are revived ("lazy computation") from an auxiliary cache if future steps judge them important, and all pruning is progressive down the model depth (Fu et al., 2024).
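A toy sketch of one such pruning step, using attention weights toward the last position as the saliency signal (the names and keep-ratio policy are illustrative):

```python
import numpy as np

def lazy_prune(hidden, attn_to_last, keep_ratio, aux_cache):
    """Keep the tokens most attended by the last position; stash the rest in
    an auxiliary cache so later steps can revive them if needed."""
    n = hidden.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(attn_to_last)[-k:])   # top-k most attended
    for i in range(n):
        if i not in keep:
            aux_cache[i] = hidden[i]                # revivable, not discarded
    return hidden[keep], keep

rng = np.random.default_rng(3)
hidden = rng.normal(size=(8, 4))
attn = np.array([0.30, 0.02, 0.25, 0.01, 0.20, 0.02, 0.15, 0.05])
aux = {}
kept_hidden, kept_idx = lazy_prune(hidden, attn, keep_ratio=0.5, aux_cache=aux)
# Deeper layers prune progressively further; `aux` enables lazy revival.
```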
3. Core Algorithms and Implementation Patterns
A window-based system typically includes the following algorithmic elements:
| Component | Function | Notable Instantiations |
|---|---|---|
| Pruning metric | Quantifies token dynamics/relevance | DiffScore, attention weights, OBD |
| Window definition | Sets dynamic window over tokens/timesteps/blocks | Patch, prefix, cache phase |
| Caching policy | Determines what representations to reuse/refresh | Block-wide (DaTo), KV (LLMs) |
| Refresh strategy | Frequency and conditions for cache update | Fixed N steps, error-based (EDCW) |
| Recovery/fallback | Recomputes or reconstructs pruned features | Copy from base/auxiliary cache |
A prototype DaTo-Sampling pseudocode, combining window-based pruning and caching in diffusion, is as follows (Zhang et al., 2024):
```
for t in T...1:
    for l in block_indices:
        if l < d_t:
            # Fresh compute with DaTo-pruning
            compute features via convolution + DaTo-attention
        else:
            # Cache hit
            reuse cached features
        if l == first_upblock:
            # Compute DiffScore, select base/prune
            # Run attention on survivors, recover pruned
        if l == d_t - 1:
            # Refill cache here
            cache[l] = features
```
In DLMs with windowed pruning (Zuo et al., 28 Jan 2026):
- Begin a phase by computing full representations within the external window and refreshing the cache.
- During normal steps, restrict full compute to active and buffer tokens; far-field tokens omitted.
- Periodically advance/redefine window by re-centering on the decoding frontier.
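The three-step loop above can be sketched schematically; `compute` abstracts the actual DLM forward pass, and all parameter names are illustrative:

```python
def windowed_decode(seq_len, window, active, phase_len, num_steps, compute):
    """Phase-based windowed decoding: full-window refresh at phase starts,
    active-token-only compute otherwise; far-field positions are never touched."""
    frontier, cache, log = 0, {}, []
    for step in range(num_steps):
        hi = min(frontier + window, seq_len)
        if step % phase_len == 0:
            todo = range(frontier, hi)                    # phase start: refresh window
        else:
            todo = range(frontier, min(frontier + active, hi))  # active/buffer only
        for pos in todo:
            cache[pos] = compute(step, pos)
            log.append((step, pos))
        frontier = min(frontier + 1, seq_len)             # window re-centers on frontier
    return cache, log

cache, log = windowed_decode(seq_len=12, window=6, active=2,
                             phase_len=4, num_steps=4,
                             compute=lambda s, p: (s, p))
```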
4. Empirical Results and Comparative Analysis
Quantitative evaluations demonstrate major gains in efficiency with negligible or even improved quality under careful tuning:
| Application | Speedup | Quality Impact | Domain | Paper |
|---|---|---|---|---|
| Stable Diffusion (DaTo) | up to 9× | FID improvement up to 0.33 (ImageNet); FID = 2.17 (COCO) | Image | (Zhang et al., 2024) |
| Diffusion language (windowed caching) | up to 99× | ~1% accuracy drop | Language | (Zuo et al., 28 Jan 2026) |
| Video DiT (UniCP) | 1.1–1.6× | ≤0.10 drop in LPIPS/SSIM/PSNR | Video | (Sun et al., 6 Feb 2025) |
| LazyLLM (Llama 2) | 2.34× (prefilling) | ~1% accuracy drop | Language | (Fu et al., 2024) |
| Quantized KV pruning | — | 4× tokens at 4-bit outperforms 1× at 16-bit | Retrieval | (Zhang et al., 2024) |
| OBCache (LongBench) | — | +2–4% score; +10–20 pp retrieval accuracy | Language | (Gu et al., 9 Oct 2025) |
Across domains, window-based methods consistently:
- Reduce compute and memory costs substantially, with the exact factor depending on window size and application.
- Maintain or improve accuracy/FID/SSIM, provided hyperparameters (window size, number of survivors, refresh interval) are tuned to data/task-specific localities.
5. Theoretical Aspects and Trade-Offs
Theoretical analyses frame window-based pruning and caching as constrained resource allocation problems:
- For cache compression, maximize downstream task performance subject to a fixed memory budget, jointly optimizing the number of retained tokens and their numerical precision (Zhang et al., 2024).
- OBCache derives closed-form saliency scores for joint KV pruning using a second-order Taylor expansion (Optimal Brain Damage); this approach theoretically dominates heuristics based only on accumulated attention weights (Gu et al., 9 Oct 2025).
- Window length and refresh-interval hyperparameters must be matched to the empirical locality of the task, e.g., by finding where accuracy plateaus as window size grows, or by error-thresholding on attention-output norms (Zuo et al., 28 Jan 2026, Sun et al., 6 Feb 2025).
A practically significant implication is that trading a modest drop in precision for more tokens in the cache yields better retrieval and task performance than reducing the token count, down to a minimum precision threshold (on the order of 4 bits for quantized caches) (Zhang et al., 2024).
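One way to formalize this budget-constrained allocation is as a small constrained program; the notation below ($n$ retained tokens, $b$ bits per value, $d$ per-token footprint, $B$ the memory budget) is illustrative rather than taken from the cited papers:

```latex
\max_{n,\; b}\ \mathrm{Perf}(n, b)
\qquad \text{s.t.} \qquad n \cdot b \cdot d \le B, \quad b \ge b_{\min},
```

where the empirical finding is that increasing $n$ at reduced $b$ dominates until $b$ reaches the precision floor $b_{\min}$.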
6. Limitations, Hyperparameter Sensitivity, and Extensions
Known limitations include:
- Hyperparameter sensitivity: windows that are too small, or refreshes that are too infrequent, can accumulate approximation error or truncate global dependencies, degrading performance (Sun et al., 6 Feb 2025, Zuo et al., 28 Jan 2026).
- For DaTo, the search for the cache-depth schedule d_t is computationally intensive (NSGA-II search, 15–20 GPU-hours) (Zhang et al., 2024).
- Fixed patch sizes (DaTo) or static window lengths may miss local context variations; adaptive strategies—window length by output error or confidence threshold, layer-specific budgets, or RL-based scheduling—are active directions (Zhang et al., 2024, Zuo et al., 28 Jan 2026, Sun et al., 6 Feb 2025).
- In highly nonlinear models, new tokens' post-decode transient can produce suboptimal cached states, requiring more frequent refresh (Zuo et al., 28 Jan 2026).
Potential extensions and research avenues:
- Integration with quantization, distillation, and other model compression techniques (Zhang et al., 2024, Zuo et al., 28 Jan 2026).
- Automatic refresh scheduling based on online metrics (e.g., key-value cosine stability).
- Generalization to other sequence models or dynamic memory systems (autoregressive LLM prefix caching, adaptive retrieval).
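For the refresh-scheduling direction, an online signal based on key cosine stability could look like the following sketch (the threshold and helper name are illustrative):

```python
import numpy as np

def needs_refresh(cached_keys, current_keys, threshold=0.98):
    """Refresh when the mean per-token cosine similarity between cached and
    current keys drops below a stability threshold."""
    a = cached_keys / np.linalg.norm(cached_keys, axis=-1, keepdims=True)
    b = current_keys / np.linalg.norm(current_keys, axis=-1, keepdims=True)
    return float((a * b).sum(axis=-1).mean()) < threshold

rng = np.random.default_rng(4)
keys = rng.normal(size=(8, 16))
drifted = keys + rng.normal(scale=2.0, size=keys.shape)
stable = needs_refresh(keys, keys)       # keys unchanged: no refresh needed
unstable = needs_refresh(keys, drifted)  # large drift: trigger a refresh
```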
7. Impact and Broader Significance
Window-based token pruning and caching methods mark a shift toward inference-time adaptivity, exploiting temporal and structural redundancy not available to static, pre-trained model parameters. These methods combine principled output-aware pruning, error-aware dynamic caching, and aggressive quantization to dramatically reduce compute and memory requirements across domains—without requiring retraining or model structure modification. They are integral to scaling diffusion generation, efficient video synthesis, and enabling long-context LLM inference on constrained hardware, and are likely to inform the design of future sparse, locality-aware neural architectures (Zhang et al., 2024, Zuo et al., 28 Jan 2026, Gu et al., 9 Oct 2025, Fu et al., 2024, Zhang et al., 2024, Sun et al., 6 Feb 2025).