
Inference-Time Hyper-Scaling with KV Cache Compression (2506.05345v1)

Published 5 Jun 2025 in cs.LG and cs.CL

Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.

Inference-Time Hyper-Scaling with KV Cache Compression: An Evaluation

The paper "Inference-Time Hyper-Scaling with KV Cache Compression" explores techniques for improving inference efficiency and accuracy in Transformer-based LLMs. The authors address a central limitation of inference-time compute scaling: generation cost is bottlenecked by the size of the key-value (KV) cache rather than by the number of generated tokens. Their work introduces Dynamic Memory Sparsification (DMS), which compresses the KV cache so that more tokens can be generated within the same compute budget without sacrificing accuracy (a rough cache-sizing sketch follows below).
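To make the bottleneck concrete, the following back-of-envelope calculation estimates KV cache memory for a hypothetical decoder configuration; the layer count, head count, head dimension, and sequence length are illustrative assumptions, not figures taken from the paper.

```python
# Back-of-envelope KV cache sizing, illustrating why cache size (not token
# count) dominates decoding cost. All model dimensions are hypothetical.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (fp16/bf16 by default)."""
    # Factor of 2 accounts for storing both K and V per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 32B-class decoder with grouped-query attention.
full = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=32_768)
compressed = full / 8  # 8x KV compression, as reported for DMS

print(f"full cache:    {full / 2**30:.2f} GiB per sequence")
print(f"8x compressed: {compressed / 2**30:.2f} GiB per sequence")
```

For this assumed configuration, the full cache occupies about 8 GiB per 32K-token sequence, while 8x compression brings it to about 1 GiB, which is what makes longer or more parallel generation affordable within the same memory budget.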

Key Contributions

The paper's central proposal is inference-time hyper-scaling: improving reasoning accuracy by scaling generation on top of a compressed KV cache. The authors show that, provided compression retains the essential information, a model can generate longer sequences or run more parallel reasoning threads than the original within the same budget. The key innovation is DMS, which delays token eviction rather than discarding cached entries immediately, implicitly merging their representations so that critical information is preserved; a toy sketch of this delayed-eviction idea follows below.
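The sketch below is a minimal, simplified illustration of delayed eviction, assuming a policy where each decoding step may nominate one cached position for removal. The real DMS learns its eviction decisions during a short retraining phase, which is not modeled here; this only shows the bookkeeping that keeps nominated entries attendable for a few more steps before they are dropped.

```python
from collections import deque

class DelayedEvictionKVCache:
    """Toy delayed-eviction cache; not the authors' implementation."""

    def __init__(self, delay: int):
        self.live = {}          # position -> (key, value), still attendable
        self.pending = deque()  # (step_marked, position) awaiting removal
        self.delay = delay      # steps to wait before an eviction takes effect

    def step(self, step: int, position: int, key, value, evict_position=None):
        """Add the new token's KV pair, then process overdue evictions."""
        self.live[position] = (key, value)
        if evict_position is not None and evict_position in self.live:
            self.pending.append((step, evict_position))
        # Only drop entries whose eviction was decided at least `delay` steps
        # ago, so recently marked tokens stay visible to attention meanwhile.
        while self.pending and step - self.pending[0][0] >= self.delay:
            _, victim = self.pending.popleft()
            self.live.pop(victim, None)
        return self.live        # KV pairs attention may use at this step
```

With `delay=0` this reduces to immediate eviction; a larger delay keeps marked tokens visible to attention for additional steps, mirroring the paper's claim that postponing eviction lets their information be implicitly merged into later representations before removal.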

DMS requires minimal retraining overhead: only 1K training steps are needed to reach 8x compression while maintaining better accuracy than training-free sparse-attention baselines such as TOVA and H2O. The paper also reports consistent improvements on task-specific metrics across model sizes; for instance, the authors report average gains of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench for the Qwen-R1 32B model at comparable compute budgets. The rough arithmetic below shows how such a compression ratio translates into extra token budget.
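As a rough illustration of why compression buys scaling headroom, the snippet below converts an 8x cache compression into token budget under the simplifying assumption that KV memory is the binding constraint; the uncompressed budget is a hypothetical figure, not taken from the paper.

```python
# Rough budget arithmetic; assumes KV memory is the binding constraint and
# ignores runtime effects. The baseline budget below is hypothetical.

baseline_cached_tokens = 32_768   # tokens affordable without compression (assumed)
compression_ratio = 8             # compression reported for DMS in the paper

# The freed memory can be spent on a longer chain of thought...
longer_sequence = baseline_cached_tokens * compression_ratio   # 262,144 tokens
# ...or on more parallel reasoning threads of the original length.
parallel_threads = compression_ratio                           # 8 threads

print(longer_sequence, parallel_threads)
```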

Evaluation and Results

The paper provides extensive benchmarks, evaluating DMS against vanilla LLMs and other compression strategies on well-recognized datasets such as MATH-500, AIME 2024, GPQA Diamond, and LiveCodeBench. These evaluations show that DMS improves accuracy under constrained inference-time budgets, pushing out the accuracy-efficiency Pareto frontier and outperforming baselines in both runtime and memory efficiency.

Moreover, the authors present evidence that KV cache compression improves reasoning ability under a fixed compute budget in practice, not just in principle. Performance differences across datasets indicate that DMS particularly benefits tasks requiring extended reasoning, consistent with the gains coming from generating more tokens at inference time.

Implications and Future Directions

From a theoretical perspective, DMS offers a promising avenue for scaling inference-time compute efficiently, enabling more sophisticated reasoning without additional hardware. Practically, the work highlights the potential for deploying LLMs in memory-constrained environments, such as edge devices, while maintaining high accuracy. Combining DMS with complementary compression techniques such as quantization or tensor decomposition might further reduce the memory footprint without significant accuracy trade-offs.

The proposed methodology opens promising directions for future work, including extending its application beyond LLMs to other Transformer-based architectures where KV caching is a bottleneck. Additionally, evaluating how DMS composes with inference-time verifier models could yield more robust solutions for context-aware reasoning tasks.

In conclusion, DMS is more than an incremental refinement of inference-time scaling: it adapts existing Transformer architectures to align compute efficiency with improved reasoning, providing a practical tool for optimizing inference under strict resource constraints.

Authors (4)
  1. Adrian Łańcucki (12 papers)
  2. Konrad Staniszewski (6 papers)
  3. Piotr Nawrot (7 papers)
  4. Edoardo M. Ponti (24 papers)