Papers
Topics
Authors
Recent
Search
2000 character limit reached

Graph-Guided Adaptive Channel Elimination for KV Cache Compression

Published 18 Apr 2026 in eess.SP | (2604.16983v1)

Abstract: LLMs have revolutionized natural language processing, achieving unprecedented success across a vast range of tasks. However, their practical application in long-context scenarios is severely hampered by the formidable memory footprint of the Key-Value cache. While channel pruning has emerged as a promising compression strategy, existing methods evaluate channel importance in isolation, fundamentally ignoring the inter-channel interactions that collectively dictate model performance. This oversight leads to suboptimal pruning decisions. To address this, we introduce \textbf{GRACE} (\textbf{GR}aph-guided \textbf{A}daptive \textbf{C}hannel \textbf{E}limination), a novel framework that reframes KV cache compression as a graph-based optimization problem. GRACE models channels as nodes and their interactions as weighted edges, enabling the identification of a near-optimal channel subset for pruning by minimizing the reconstruction error of the attention weight matrix. Furthermore, GRACE incorporates an adaptive protection mechanism that shields salient key channels from removal, ensuring a robust autoregressive decoding process. Extensive experiments show that GRACE can reduce KV cache size by 60\% with negligible performance degradation, consistently outperforming the state-of-the-art method.

Summary

  • The paper introduces GRACE, a novel graph-based method that reduces KV cache size by up to 60% with negligible performance drop.
  • It combines statistical profiling for outlier protection with a greedy Minimum Incremental Error Selection algorithm to optimize channel pruning.
  • Empirical evaluations on LLaMA-3-8B and Mistral-7B show lower attention reconstruction error and improved performance under memory constraints.

Graph-Guided Adaptive Channel Elimination for KV Cache Compression

Introduction and Motivation

Key-Value (KV) cache size remains a severe bottleneck for long-context inference in LLMs, as its memory footprint scales linearly with context length. While token-level and model-level approaches (e.g., token eviction, shared KV heads, quantization) help address this issue, channel-level pruning strategies offer an orthogonal and highly promising avenue for memory-efficient inference. However, existing channel pruning techniques inadequately model the collective effect of pruning decisions: they treat the importance of each channel independently, neglecting the strong inter-channel dependencies observed in real LLMs.

GRACE (Graph-guided Adaptive Channel Elimination) targets this critical limitation by reformulating KV cache pruning as a graph-based optimization. Each channel is explicitly modeled as a node, and channel interactions as weighted edges, enabling a principled selection of pruning candidates that minimizes the degradation of the attention mechanism. A further innovation is the explicit protection of "outlier" channels—key dimensions with disproportionately large activations—against pruning-induced instability. Figure 1

Figure 1: Magnitude visualization of key cache. A small subset of channels concentrates the majority of the energy.

The GRACE Framework

The GRACE framework applies a two-step channel elimination process:

  1. Salient Channel Protection: Statistical analysis identifies and shields outlier channels, ensuring those with the highest mean and variance (most likely to be critical for model stability) are always retained.
  2. Graph-Theoretic Pruning: For the remaining candidate channels, a complete weighted graph is constructed, where nodes represent channels and edge weights encode cross-channel interaction strength.

A greedy algorithm—Minimum Incremental Error Selection (MIES)—is then employed. In each iteration, MIES selects the channel whose pruning yields the smallest increase to the cumulative attention reconstruction error (as defined by the Frobenius norm difference between the full and pruned attention matrices). Crucially, the algorithm recomputes channel scores at each step, reflecting the updated state of both self-importance and the cumulative impact of pruned channel interactions. Figure 2

Figure 2: An overview of the proposed GRACE framework.

Formally, the pruning error for a subset SS of channels is decomposed as a quadratic set function involving both self and interaction terms:

f(S)=∑i∈S∣∣qiki⊤∣∣F2+∑i≠j∈S(qi⊤qj)(ki⊤kj)f(S) = \sum_{i \in S} ||\mathbf{q}_i \mathbf{k}_i^\top||_F^2 + \sum_{i \neq j \in S} (\mathbf{q}_i^\top \mathbf{q}_j)(\mathbf{k}_i^\top \mathbf{k}_j)

This explicit modeling of the off-diagonal interaction terms is empirically validated by the highly non-negligible values observed in the interaction heatmap. Figure 3

Figure 3: Heatmap of interaction terms between channels, where heatmapi,j=ki⊤kj×qi⊤qj\mathrm{heatmap}_{i,j} = \mathbf{k}_{i}^\top \mathbf{k}_{j} \times \mathbf{q}_{i}^\top \mathbf{q}_{j}. Off-diagonal terms are non-negligible.

Saliency-Driven Protection: Addressing Query Drift

LLM decoding involves nonstationary query distributions—channels important during cache construction may be essential for future queries. Statistical profiling of query vector activations in GRACE reveals that high-magnitude (outlier) channels also exhibit pronounced volatility. Pruning such channels, even if statistically rare, severely degrades attention fidelity.

To mitigate this, GRACE applies a dynamic, data-driven threshold (mean plus one standard deviation, clamped to a suitable range) to always retain the most volatile and highest-energy dimensions. Figure 4

Figure 4: Statistical profiling of query vector channels. Channels with higher mean amplitudes also exhibit higher variance.

This addresses key theoretical weaknesses in prior methods and empirically stabilizes downstream generation, especially under context drift.

Theoretical Analysis

The GRACE optimization objective is a combinatorial quadratic minimization over the subset of candidate channels. GRACE's greedy approximator (MIES) offers the following guarantee: under a restricted eigenvalue condition for the interaction matrix, the selected set achieves a pruning error bounded by the ratio of the maximal to minimal eigenvalues (κ\kappa) of the interaction matrix, relative to the (unknown) optimum. Salient channel protection further stabilizes κ\kappa, improving the concentration of the bound and ensuring practical near-optimality.

Empirical Evaluation

Extensive evaluation on LLaMA-3-8B-Instruct and Mistral-7B-Instruct LLMs was conducted using LongBench and the Needle In A Haystack benchmark. GRACE consistently outperformed THINK—its main competitor—by achieving lower attention reconstruction error and stronger downstream task performance at equivalent or higher pruning ratios.

GRACE reduced the KV cache size by 60% with negligible performance drop and even surpassed the performance of THINK on 10/16 LongBench subtasks for LLaMA-3-8B at 50% pruning. In Mistral-7B-Instruct, GRACE achieved higher average scores under both 256-token and 512-token KV budgets.

Notably, for the highly compressive Needle In A Haystack setting (96-token KV, 60% pruning), GRACE achieved a retrieval score of 0.828 vs. THINK's 0.804, demonstrating its effectiveness in preserving sparse, critical information under extreme memory constraints.

GRACE further demonstrates robust reduction in attention reconstruction error across decoder layers, especially in the shallowest and deepest layers, where it shows a 30–40% relative reduction compared to THINK. Figure 5

Figure 5: Reduction in reconstruction error for our method versus the THINK across decoder layers.

The computational overhead for GRACE (Time to First Token, Time per Output Token) is negligible, making it practical for real-time or large-batch inference regimes.

Practical and Theoretical Implications

GRACE's graph-theoretic pruning strategy introduces a general, training-free, and plug-and-play framework for channel-level cache compression. It can be straightforwardly combined with model-level cache sharing, quantization, or token-level caches to produce multiplicative gains in inference memory efficiency. The explicit handling of inter-channel correlations provides a template for future cache management algorithms in LLMs and other attention-based architectures.

Theoretically, GRACE's insights link KV cache redundancy to second-order statistics in channel activations, aligning with broader trends in model compression emphasizing the criticality of outlier dimensions. Its greedy solution and protection mechanism suggest avenues for more sophisticated, possibly higher-order or contextual interaction modeling.

Future Directions

Potential extensions include adaptation to other cache compression domains (e.g., value caches, multi-modal attention) and integration with adaptive quantization. Further research into the dynamics of salient channel preservation during instruction-following, multi-turn dialogue, and grounding tasks could inform task-specific cache management strategies. An additional area is making the GRACE framework differentiable and amenable to end-to-end tuning.

Conclusion

GRACE presents a robust solution for KV cache compression in LLMs by explicitly encoding inter-channel dependencies and dynamically protecting salient activations. It offers substantial memory savings with minimal accuracy tradeoff, outperforming established baselines in both typical and memory-constrained settings. Its graph-based methodology and theoretical guarantees mark a significant advance in practical long-context inference for large-scale LLMs (2604.16983).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.