dKV-Cache: The Cache for Diffusion Language Models (2505.15781v1)

Published 21 May 2025 in cs.CL

Abstract: Diffusion LLMs (DLMs) have been seen as a promising competitor for autoregressive LLMs. However, diffusion LLMs have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which caches aggressively with a reduced cache lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. In the end, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between ARs and DLMs. We evaluate dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical reasoning, and code-generation tasks. Experiments demonstrate that a KV cache can also be used in DLMs, even in a training-free manner with current models.

Summary

A Critical Examination of dKV-Cache: A Novel Approach for Accelerating Diffusion LLMs

This paper introduces a caching mechanism, termed delayed KV-Cache (dKV-Cache), to address the slow inference that constrains Diffusion LLMs (DLMs). The motivation is a key limitation of DLMs: unlike autoregressive models (ARs), they cannot naturally leverage key-value caching during decoding because of their non-autoregressive, bidirectional attention.

The authors propose a delayed caching system that capitalizes on the unique token representation dynamics observed across the diffusion process. Traditional key-value cache approaches, successful in autoregressive settings, hinge on the linear, left-to-right generation sequence typical of autoregressive models, where once a token's key-value pair is computed, it remains unchanged and can be reused efficiently. However, due to the simultaneous bidirectional attention across all tokens in DLMs, key and value states are interdependent and evolve with each timestep, precluding straightforward cache reuse.
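
To make this failure mode concrete, the following is a minimal PyTorch sketch, not drawn from the paper, in which a single bidirectional attention block stands in for a DLM layer; the module and variable names are illustrative. It shows that revealing one masked position changes the key states of positions decoded earlier, so a naively cached key would be stale.

```python
# Toy illustration (not from the paper): why a naive KV cache breaks under
# bidirectional attention. Module and variable names here are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, mask_id = 16, 0

embed = nn.Embedding(100, d_model)              # toy vocabulary of 100 ids
block = nn.TransformerEncoderLayer(             # one bidirectional attention block
    d_model=d_model, nhead=4, dropout=0.0, batch_first=True)
W_k = nn.Linear(d_model, d_model, bias=False)   # stand-in key projection

def key_states(tokens):
    """Run the bidirectional block and project its outputs to key states."""
    h = block(embed(tokens))                    # every position attends to every other
    return W_k(h)

# Denoising step t: positions 3-5 are still masked.
step_t  = torch.tensor([[5, 7, 9, mask_id, mask_id, mask_id]])
# Step t+1: position 3 has been decoded to token id 4.
step_t1 = torch.tensor([[5, 7, 9, 4,       mask_id, mask_id]])

k_t, k_t1 = key_states(step_t), key_states(step_t1)

# Decoding position 3 also shifts the keys of positions 0-2, because their
# hidden states depend on the full bidirectional context.
drift = (k_t[0, :3] - k_t1[0, :3]).abs().max().item()
print(f"max change in 'cached' keys after one decoding step: {drift:.4f}")  # > 0
```

In an autoregressive decoder with causal attention, the same experiment would leave the keys of earlier positions unchanged, which is precisely what makes the standard KV cache valid there.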

Methodological Innovation and Variants

The paper introduces two main variants of the dKV-Cache system:

  1. dKV-Cache-Decode: This variant aims to retain the acceleration benefits of traditional KV-caching while accommodating bidirectional attention. The key tactic is to delay caching a decoded token's key and value states by one additional timestep, caching them only once the token's representation has stabilized after decoding. This avoids most of the accuracy loss a naive cache would incur while retaining the speedup (a schematic sketch follows this list).
  2. dKV-Cache-Greedy: A more aggressive strategy that caches with a reduced lifespan and recomputes key-value states only for a subset of positions, namely the recently and currently decoded tokens plus a local neighborhood window. This reduces overall decoding to quadratic time complexity in sequence length, at the cost of some performance degradation.
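
The step-by-step bookkeeping behind dKV-Cache-Decode can be illustrated with a small sketch. This reflects a plain reading of the mechanism described above, not the authors' implementation; the class name, the periodic refresh interval, and the method signature are illustrative assumptions.

```python
# Schematic sketch of delayed KV caching for iterative (diffusion-style) decoding.
# This follows a plain reading of dKV-Cache-Decode; the class name, the periodic
# refresh interval, and the method signature are illustrative, not the authors' code.
from dataclasses import dataclass, field

@dataclass
class DelayedKVCache:
    refresh_every: int = 8                          # recompute all cached states every N steps
    cached_k: dict = field(default_factory=dict)    # position -> key tensor
    cached_v: dict = field(default_factory=dict)    # position -> value tensor
    pending: set = field(default_factory=set)       # decoded last step, not yet cached

    def step(self, t, decoded_now, fresh_k, fresh_v):
        """Update the cache after denoising step t.

        decoded_now: positions revealed at this step.
        fresh_k, fresh_v: key/value states computed at this step for the positions
            the model actually ran attention over (all positions on a refresh step).
        Returns the set of positions whose cached K/V can be reused at the next step.
        """
        refresh = self.refresh_every and t > 0 and t % self.refresh_every == 0
        if refresh:
            # Periodic refresh: everything was recomputed this step, so overwrite
            # all cached entries to keep their staleness bounded.
            for pos in self.cached_k:
                self.cached_k[pos] = fresh_k[pos].detach()
                self.cached_v[pos] = fresh_v[pos].detach()

        # Delayed caching: positions decoded at the *previous* step have had one more
        # denoising step for their representations to settle; cache them now.
        for pos in self.pending:
            self.cached_k[pos] = fresh_k[pos].detach()
            self.cached_v[pos] = fresh_v[pos].detach()

        self.pending = set(decoded_now)             # these wait one step before caching
        return set(self.cached_k)                   # positions reusable at the next step
```

At each denoising step, the attention layers would then recompute key and value states only for positions not in the returned set and read the remainder from the cache; the one-step delay reflects the paper's observation that a token's representation still shifts at the step in which it is decoded and stabilizes afterwards.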

Results and Implications

The experimental results are compelling. Across benchmarks spanning language understanding, code generation, and mathematical problem-solving, dKV-Cache delivers a 2-10× improvement in inference speed. Notably, performance degradation is minimal, and the dKV-Cache-Decode variant in particular remains close to the uncached baseline.

The implications are far-reaching, particularly for high-throughput text generation and real-time applications where inference latency is critical. On the theoretical side, the work shows that caching strategies, which bidirectional denoising was thought to preclude, can be adapted to the diffusion decoding process, pointing to further efficiency gains in architectures whose inference has been bounded by per-step recomputation.

Future Directions and Limitations

While the advancements illustrated in this paper are significant, the authors highlight several directions for future research. There remains a need to explore the integration of algorithmic optimizations with system-level strategies, like advanced parallelism and memory management, which could further improve the model's efficiency.

Moreover, the evaluation of dKV-Cache is primarily focused on existing open-source models, such as LLaDA and Dream, within controlled benchmarks. Extending research to newer or more varied DLMs across diverse infrastructures could provide insights into the generalizability of the proposed caching mechanisms.

In conclusion, dKV-Cache is a meaningful advance in accelerating diffusion-based models, paving the way for more practical deployment of DLMs. The work underscores the room left for refining diffusion decoding and caching strategies, and for further narrowing the efficiency gap between diffusion and autoregressive models, with significant implications for natural language processing and beyond.
