LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference (2510.09665v1)

Published 8 Oct 2025 in cs.LG

Abstract: Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant computation by reusing KV caches across queries and to increase GPU utilization by disaggregating a single query to different engines, their promises cannot be realized without efficiently offloading and communicating KV cache across LLM inference engines and queries. We present LMCache, the first and so far the most efficient open-source KV caching solution, which extracts and stores KV caches generated by modern LLM engines (vLLM and SGLang) and shares the KV caches across engines and queries. LMCache exposes KV caches in the LLM engine interface, effectively transforming LLM engines from individual token processors to a collection of engines with KV cache as the storage and communication medium. In particular, it supports both cache offloading (prefix reuse across queries) and prefill-decode disaggregation (cross-engine cache transfer). LMCache's high performance and wide adoption stem from the following contributions: highly optimized KV cache data movement with performance optimizations including batched data movement operations, compute and I/O pipelining; a modular KV cache connector component, decoupling LMCache from the rapid evolution of inference engines; a first-class control API, such as pinning, lookup, cleanup, movement, and compression, for flexible cache orchestration across GPU, CPU, storage, and network layers. Evaluation shows that combining LMCache with vLLM achieves up to 15x improvement in throughput across diverse workloads. With a growing community, LMCache has seen dramatic growth in adoption by enterprise inference systems, which provides valuable lessons for future KV caching solutions. The source code of LMCache is at: https://github.com/LMCache/LMCache.

Summary

  • The paper presents LMCache as an efficient KV cache layer that reduces redundant computations through cross-query context reuse.
  • It leverages batched operations, compute-I/O overlapping, and zero-copy transfers to minimize latency and maximize throughput.
  • The architecture integrates a data plane and a control plane to manage cache movement across GPU, CPU, storage, and network layers, demonstrating scalability in enterprise environments.

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Introduction

The paper "LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference" (2510.09665) addresses significant inefficiencies in current LLM inference systems. Typically, each user query is processed independently by a single inference engine instance, leading to redundant computations and underutilized resources. LMCache introduces an efficient open-source solution for KV caching, allowing for resource sharing across queries and engines, thereby optimizing performance and reducing latency. Figure 1

Figure 1: LMCache supports both context caching (KV cache offloading and sharing across queries) and PD disaggregation (cross-engine transfer of KV caches).

Motivation

The proliferation of enterprise applications built on LLMs has shifted the bottleneck from training costs to inference costs and delays. The paper argues that inference becomes the primary limitation as input and output lengths grow, driven by richer user interactions, multimodal inputs, and newer LLMs with long context windows. Recomputing the same prompt prefix across different queries adds to these costs, and in interactive applications, reducing Time-to-First-Token (TTFT) is essential for maintaining a quality user experience.

KV Caching as a Solution

The paper proposes extending KV caching beyond single-query lifecycles to enable cross-query context reuse and prefill-decode (PD) disaggregation. These optimizations rely on efficiently moving KV cache segments across different memory hierarchies, including GPU memory, CPU DRAM, and networks such as Ethernet and NVLink.
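
To make cross-query reuse concrete, the following is a minimal sketch of prefix-chunk keying: token prefixes are hashed chunk by chunk so that identical prefixes from different queries map to the same cache entries. The chunk size, class names, and in-memory dictionary here are illustrative assumptions, not LMCache's actual data structures.

```python
import hashlib
from typing import Dict, List

CHUNK_TOKENS = 256  # illustrative chunk size; a real system would make this configurable

def chunk_keys(token_ids: List[int]) -> List[str]:
    """Hash each complete prefix chunk; each key depends on the entire prefix up to
    that chunk, so two queries share a key only if they share the whole prefix."""
    keys, rolling = [], hashlib.sha256()
    full_chunks = len(token_ids) // CHUNK_TOKENS
    for i in range(full_chunks):
        chunk = token_ids[i * CHUNK_TOKENS:(i + 1) * CHUNK_TOKENS]
        rolling.update(str(chunk).encode())
        keys.append(rolling.hexdigest())
    return keys

class PrefixKVStore:
    """Toy cross-query KV store mapping prefix-chunk keys to opaque KV blobs."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def longest_hit(self, token_ids: List[int]) -> int:
        """Number of leading chunks whose KV cache is already stored."""
        hits = 0
        for key in chunk_keys(token_ids):
            if key not in self._store:
                break
            hits += 1
        return hits

    def put(self, token_ids: List[int], kv_blobs: List[bytes]) -> None:
        """Store one KV blob per hashed prefix chunk."""
        for key, blob in zip(chunk_keys(token_ids), kv_blobs):
            self._store.setdefault(key, blob)
```

Under this scheme, the prefill path would skip attention computation for the first `longest_hit` chunks and load their KV from whichever tier currently holds them.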

LMCache makes several contributions:

  1. Optimized Performance: Incorporates batched data-movement operations, compute-I/O pipelining, and zero-copy transfers to achieve high throughput and efficiency.
  2. Standardized Interface: A modular KV cache connector maintains compatibility with evolving inference engines.
  3. Management Interface: Offers a first-class control API, covering operations such as pinning, lookup, cleanup, movement, and compression, for flexible cache orchestration across GPU, CPU, storage, and network layers (Figure 2); a hypothetical sketch of such an API follows the figure.

    Figure 2: LMCache sits between LLM inference engines and heterogeneous storage/network devices.
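
The control operations listed above can be pictured as verbs over cache entries that live in different tiers. The sketch below is hypothetical and is not the LMCache API; the names, signatures, and in-memory bookkeeping are assumptions made only to illustrate the kind of orchestration such an interface enables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional

class Tier(Enum):
    GPU = "gpu"
    CPU = "cpu"
    DISK = "disk"
    REMOTE = "remote"

@dataclass
class CacheEntry:
    key: str                 # prefix-chunk hash
    tier: Tier               # where the KV bytes currently live
    compressed: bool = False
    pinned: bool = False

class CacheController:
    """Hypothetical control-plane client exposing the verbs the paper names:
    pinning, lookup, cleanup, movement, and compression."""

    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def lookup(self, key: str) -> Optional[CacheEntry]:
        return self._entries.get(key)

    def pin(self, key: str) -> None:
        """Protect a hot prefix (e.g. a shared system prompt) from eviction."""
        if key in self._entries:
            self._entries[key].pinned = True

    def move(self, key: str, dst: Tier) -> None:
        """Ask the data plane to migrate an entry, e.g. CPU -> GPU ahead of reuse."""
        entry = self._entries.get(key)
        if entry is not None:
            entry.tier = dst  # a real data plane would also copy the bytes

    def compress(self, key: str) -> None:
        """Mark an entry as compressed; a real data plane would re-encode the KV bytes."""
        entry = self._entries.get(key)
        if entry is not None:
            entry.compressed = True

    def cleanup(self, tier: Tier) -> int:
        """Evict unpinned entries from one tier; returns the number dropped."""
        victims = [k for k, e in self._entries.items() if e.tier is tier and not e.pinned]
        for k in victims:
            del self._entries[k]
        return len(victims)
```

In a real deployment these calls would go to the LMCache Controller rather than mutating local state; the point here is only the shape of the orchestration surface.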

Architecture and Implementation

LMCache's architecture integrates with LLM inference engines to manage KV cache movement, as depicted in the end-to-end system workflow (Figure 3). It features a data plane (the LMCache Worker) and a control plane (the LMCache Controller), providing standardized interfaces for optimized data movement and resource management.

Figure 3: End-to-end system workflow for LMCache.

To handle I/O inefficiencies, LMCache employs batched operations with configurable chunk sizes, layer-wise pipelining to overlap computation with data transfer, and zero-copy operations to reduce memory overhead (Figure 4); a minimal pipelining sketch follows the figure.

Figure 4: Illustration of dynamic offloading in LMCache.
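
The layer-wise pipelining idea can be illustrated with PyTorch streams: while the engine consumes layer i's KV, the next layer's tensors are copied host-to-device on a side stream. The function below is a sketch under assumed inputs (a list of pinned-memory CPU tensors and an `attend` callback standing in for the engine's per-layer attention); it is not LMCache's loader.

```python
import torch

def load_kv_layerwise(cpu_kv_layers, attend, device="cuda"):
    """Overlap KV transfer with compute: prefetch layer i+1 on a copy stream
    while `attend` runs over layer i on the default stream.

    Assumes `cpu_kv_layers` is a list of pinned-memory CPU tensors
    (non_blocking copies only overlap when the host memory is pinned) and
    that `attend(i, kv)` stands in for the engine's per-layer attention."""
    copy_stream = torch.cuda.Stream()
    gpu_kv = [None] * len(cpu_kv_layers)

    # Start the first transfer before any compute.
    with torch.cuda.stream(copy_stream):
        gpu_kv[0] = cpu_kv_layers[0].to(device, non_blocking=True)

    for i in range(len(cpu_kv_layers)):
        # Order compute on layer i after its copy on the side stream.
        torch.cuda.current_stream().wait_stream(copy_stream)

        # Prefetch the next layer while layer i is being consumed.
        if i + 1 < len(cpu_kv_layers):
            with torch.cuda.stream(copy_stream):
                gpu_kv[i + 1] = cpu_kv_layers[i + 1].to(device, non_blocking=True)

        attend(i, gpu_kv[i])
```

Batched chunk transfers and zero-copy handoff into the engine's KV buffers would sit underneath the same loop; they are omitted here for brevity.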

Performance Evaluation

The evaluation shows that LMCache reduces latency and improves throughput across diverse scenarios compared with existing systems and commercial offerings. In single-node CPU offloading, LMCache achieves significantly higher throughput and lower TTFT, and with centralized storage servers and PD disaggregation it delivers substantial improvements in efficiency and resource utilization.

Among the highlights:

  • Single-Node Evaluation: LMCache reduces mean latency and achieves higher throughput by leveraging CPU offloading and optimized cache management.
  • Remote Backend Evaluation: Utilizes hierarchical storage to maximize cache hit ratios and minimize time-to-load delays from remote servers.
  • PD Disaggregation: Outperforms native PD implementations by effectively managing KV cache transfers between prefiller and decoder stages.

Real-World Application

In production environments, LMCache has been adopted for large-scale deployments, including emerging use cases such as recommendation systems and chatbots. The paper notes the importance of adaptive strategies for KV cache loading, since whether loading a cached prefix beats recomputing it depends on available bandwidth and context length; a back-of-the-envelope version of that decision is sketched below.
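
As a rough illustration of that adaptive decision (not a formula from the paper), loading a cached prefix pays off only when shipping the KV bytes is expected to be faster than re-running prefill over the same tokens. All constants below are illustrative assumptions.

```python
def should_load_kv(
    cached_tokens: int,
    kv_bytes_per_token: float = 160e3,     # model-dependent KV footprint; illustrative value
    link_bytes_per_s: float = 25e9,        # effective storage/network bandwidth; illustrative
    prefill_tokens_per_s: float = 10e3,    # GPU prefill throughput; illustrative
) -> bool:
    """Load the cached KV instead of recomputing when the transfer is expected to be faster."""
    transfer_s = cached_tokens * kv_bytes_per_token / link_bytes_per_s
    recompute_s = cached_tokens / prefill_tokens_per_s
    return transfer_s < recompute_s

# With these numbers, a 32k-token cached context is worth loading over a fast link
# (~0.2 s transfer vs ~3.2 s prefill) but not over a ~1 GB/s link (~5.1 s transfer).
print(should_load_kv(32_000))                        # True
print(should_load_kv(32_000, link_bytes_per_s=1e9))  # False
```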

Conclusion

LMCache presents a robust and efficient solution for KV caching in LLM inference, significantly enhancing performance by addressing inefficiencies in current systems. Its open-source availability and rapid adoption in industry highlight its practical value and potential as a foundation for next-generation LLM infrastructures that consider AI-native data, such as KV caches, as integral components.

The work contributes a compelling argument for the broader adoption of AI-native data as a substrate for scalable LLM inference, facilitating a shift toward persistent, cache-aware computation models that ensure efficient, reliable, and cost-effective AI service delivery.
