Prompt Cache in LLMs
- Prompt cache is a method of storing key-value representations generated during LLM inference to reduce redundant computation and lower latency.
- It employs modular, semantic, and distributed caching techniques that can cut time-to-first-token by up to 70x while minimizing costs.
- Advanced management strategies, including compression, eviction, and load-balancing, address memory constraints and privacy concerns in production environments.
Prompt cache refers to the storage, reuse, and management of key–value (KV) representations generated by LLMs during the processing of input prompts or related context segments. Prompt caching is central to the acceleration of LLM inference, optimization of memory usage, cost minimization in production environments, and the design of efficient retrieval-augmented and agentic systems. The precise design space includes classic KV cache reuse, modular or semantic segment caching, prefix and chunk-level reuse schemes, compression and eviction policies, distributed system scheduling for cache affinity, and prompt cache mechanisms for both language and diffusion models.
1. Fundamental Principles of Prompt Cache in LLMs
Prompt caching in LLMs is built around two inference phases:
- Prompt (Prefill) Phase: The entire prompt (length n) is processed through all model layers to output the first token, emitting Query (Q), Key (K), and Value (V) projections at each layer, with K and V stored in a KV-cache.
- Extension (Decoding) Phase: Each additional token leverages the cached K and V entries, requiring only the new token’s projections, and thereby greatly lowers computational cost.
Formally, the KV-cache for causal self-attention is

K = [k_1, …, k_n], V = [v_1, …, v_n],

with each k_i = W_K h_i and v_i = W_V h_i, where h_i is the hidden state at position i (Cho et al., 2024).
This cache grows linearly with prompt length n and enables efficient computation of attention during decoding as

Attention(q_t, K, V) = softmax(q_t Kᵀ / √d + M) V,

where M is the causal mask. Prompt caching reduces the time-to-first-token (TTFT) and overall inference latency by eliminating redundant computation for overlapping or repeated prompt segments.
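The prefill/decode split can be illustrated with a minimal single-head attention sketch. The dimensions and random projections below are purely illustrative, not taken from any cited system:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # hidden/head dimension (illustrative)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # The causal mask is implicit here: K and V only hold past positions.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def prefill(H):
    """Prompt phase: project the whole prompt once and store K, V."""
    K, V = H @ W_k, H @ W_v                  # cached for the decode phase
    return K, V, attend(H[-1] @ W_q, K, V)   # attention output for first token

def decode_step(h_new, K, V):
    """Decode phase: one new token adds one K/V row; the rest is reused."""
    K = np.vstack([K, h_new @ W_k])
    V = np.vstack([V, h_new @ W_v])
    return K, V, attend(h_new @ W_q, K, V)

H = rng.standard_normal((5, d))              # a 5-token prompt
K, V, out0 = prefill(H)
K, V, out1 = decode_step(rng.standard_normal(d), K, V)
```

Each decode step costs one projection plus attention over the cache, rather than reprocessing the whole prompt.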
2. Advanced Prompt Cache Schemes and Algorithms
2.1 Modular and Schema-Based Attention Reuse
The "Prompt Cache" system introduces the notion of prompt modules—explicitly marked spans of tokens (e.g., system messages, document templates)—precomputed and cached for reuse via a dedicated schema (Prompt Markup Language, PML). Modules are defined at the schema level, prefilled with fixed position IDs, and concatenated at inference alongside new prompt content, so prefill attention need only be computed for the fresh text rather than the full concatenated prompt. Empirical gains include substantial TTFT reduction with negligible quality loss (Gim et al., 2023).
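The module-reuse idea can be sketched as a cache keyed by module ID and position, where each module's KV states are prefilled once and only fresh text incurs new work. The `prefill_kv` function is a stand-in for the real per-layer computation, and the data shapes are toy placeholders:

```python
def prefill_kv(tokens, start_pos):
    # Stand-in for running model prefill over `tokens` placed at absolute
    # positions start_pos, start_pos+1, ...; returns fake per-token KV rows.
    return [(f"K[{t}@{start_pos + i}]", f"V[{t}@{start_pos + i}]")
            for i, t in enumerate(tokens)]

class ModuleCache:
    def __init__(self):
        self.store = {}                      # (module_id, start_pos) -> KV rows

    def get_module(self, module_id, tokens, start_pos):
        key = (module_id, start_pos)
        if key not in self.store:            # prefill once, reuse afterwards
            self.store[key] = prefill_kv(tokens, start_pos)
        return self.store[key]

def assemble(cache, modules, fresh_tokens):
    """Concatenate cached module KV rows with freshly prefilled user text."""
    kv, pos = [], 0
    for mid, toks in modules:
        kv += cache.get_module(mid, toks, pos)   # cache hit after first use
        pos += len(toks)
    kv += prefill_kv(fresh_tokens, pos)          # only this part is new work
    return kv

cache = ModuleCache()
system = ("sys", ["you", "are", "helpful"])
kv1 = assemble(cache, [system], ["hi"])
kv2 = assemble(cache, [system], ["bye"])         # reuses the cached system module
```

The fixed `start_pos` in the cache key mirrors the schema's fixed position IDs: a module's KV entries are only valid at the positions they were prefilled for.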
2.2 Parallel and Distributed Prompt Cache Generation
KV-Runahead parallelizes cache construction in the prompt phase. Worker processes each handle contiguous prompt segments, passing partial KV-caches via point-to-point messages and eventually assembling the full KV-cache on the final worker. This pipeline minimizes TTFT, with substantial speedups observed on Llama-7B/Falcon-7B and robust performance under network noise. Work allocation is load-balanced using an offline hierarchical grid search, yielding an asymptotic efficiency improvement over tensor/sequence-parallel baselines (Cho et al., 2024).
2.3 Semantic, Prefix, and Chunk-Level Caching in Production APIs
In LLM APIs, prompt caching can follow several strategies:
- Full-context (naïve) caching: every prompt is eligible if sufficiently long, risking inefficient cache writes.
- System-prompt-only caching: only static system instructions are cached, using explicit cache breakers (UUIDs).
- Exclude-tool-results caching: multiple cache breakers isolate tool call outputs, maximizing cache efficiency in agentic setups.
Strategic placement of dynamic content, adherence to provider cache policies, and explicit cache boundaries are essential for maximizing cost and TTFT savings (45–80% cost reduction, 13–31% TTFT reduction) (Lumer et al., 9 Jan 2026).
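The placement principle can be sketched with plain prompt assembly: order segments from most to least stable so the longest possible prefix is cacheable, with an explicit breaker marking where the stable region ends. The `<cache-breaker:…>` marker syntax here is illustrative only, not any specific provider's API:

```python
import uuid

def build_prompt(system, tools, history, user_msg, breaker=None):
    """Stable content (system prompt, tool definitions) goes first;
    per-turn content goes after an explicit cache boundary."""
    breaker = breaker or str(uuid.uuid4())
    stable = [system] + tools                 # static across requests: cacheable
    dynamic = history + [user_msg]            # changes every turn
    return ("\n".join(stable)
            + f"\n<cache-breaker:{breaker}>\n"
            + "\n".join(dynamic))

p1 = build_prompt("SYSTEM: be terse", ["TOOL: search"], [], "hi", breaker="b1")
p2 = build_prompt("SYSTEM: be terse", ["TOOL: search"],
                  ["hi", "hello"], "more?", breaker="b1")
shared_prefix = p1.split("<cache-breaker")[0]   # identical across both turns
```

Because the two turns share a byte-identical prefix up to the breaker, a prefix-matching cache can serve that region from stored KV states on the second request.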
3. Prompt Cache Compression and Management under Constraints
Coping with unbounded KV-cache growth and hardware limits has produced various compression and management schemes:
3.1 Token/Span-based Eviction and Compression
Compression policies include:
- Position-based (sliding window, attention sinks).
- Attention-based (H2O, TOVA: select tokens with highest attention mass).
- Embedding-based (K-norm).
- Hybrid (SnapKV: combine windows/recent-queries for scoring).
Pitfalls are prominent in multi-instruction or multi-system prompts—e.g., system prompt leakage, uneven degradation rates, and instruction order bias. Mitigation techniques include token whitelisting and per-span fair eviction, balancing retention between instruction spans and preserving safety-critical directives (Chen et al., 30 Sep 2025).
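The attention-based eviction idea combined with a safety whitelist can be sketched as: score each cached position by accumulated attention mass, then evict the lowest-scoring positions that are not whitelisted. This is a simplified, single-head H2O-style policy; the scores and budget are illustrative:

```python
import numpy as np

def evict(attn_mass, budget, whitelist=frozenset()):
    """Keep `budget` cached positions: every whitelisted one, plus the
    highest attention-mass positions among the rest (H2O-style scoring)."""
    n = len(attn_mass)
    keep = set(i for i in whitelist if i < n)          # safety-critical tokens
    free = budget - len(keep)
    candidates = [i for i in range(n) if i not in keep]
    candidates.sort(key=lambda i: attn_mass[i], reverse=True)
    keep.update(candidates[:max(free, 0)])
    return sorted(keep)

mass = np.array([0.30, 0.05, 0.20, 0.01, 0.25, 0.19])
kept = evict(mass, budget=4, whitelist={3})   # position 3 is safety-critical
```

Without the whitelist, position 3 (lowest attention mass) would be the first evicted, illustrating how purely attention-based policies can silently drop safety-critical directives.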
3.2 Block-wise, Episodic, and Context-Guided Compression
EpiCache applies block-wise prefill to cap peak cache usage and introduces episodic compression. Dialogue is segmented into semantically coherent episodes (via embedding clustering); per-episode patched prompts guide context-aware eviction. Layer-wise budget splitting is determined by each layer’s cache sensitivity. Under 4–6× compression, EpiCache maintains full-KV accuracy while providing 2.4–3.5× memory and latency reductions (Kim et al., 22 Sep 2025).
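The episode-segmentation step can be approximated with a greedy pass over per-turn embeddings: a turn joins the current episode while its embedding stays close to the running episode centroid, and otherwise starts a new one. This is a pure-NumPy sketch; EpiCache itself uses embedding clustering, and the similarity threshold here is arbitrary:

```python
import numpy as np

def segment_episodes(embs, threshold=0.7):
    """Greedy segmentation: turn i joins the current episode if its cosine
    similarity to the (normalized) episode centroid exceeds `threshold`."""
    episodes, centroid, members = [], None, []
    for i, e in enumerate(embs):
        e = e / np.linalg.norm(e)
        if centroid is not None and float(centroid @ e) >= threshold:
            members.append(i)
            centroid = centroid + (e - centroid) / len(members)  # running mean
            centroid /= np.linalg.norm(centroid)
        else:
            if members:
                episodes.append(members)
            members, centroid = [i], e
    episodes.append(members)
    return episodes

# Two clearly separated "topics" in a toy 2-D embedding space.
embs = np.array([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], dtype=float)
eps = segment_episodes(embs)
```

Each resulting episode would then receive its own patched prompt and KV budget, so eviction decisions are made against locally relevant context rather than the whole dialogue.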
Finch performs prompt-guided, chunk-wise KV selection during prefill, storing only the top-k relevance-scored KV pairs per chunk. This supports compression ratios up to 93× while retaining semantic integrity and high F1/EM on QA and summarization tasks (Corallo et al., 2024).
4. Prompt Cache in Specialized and Retrieval-Augmented Architectures
4.1 Retrieval-Augmented Generation and Chunk Caching
In Cache-Craft, input prompt chunks in RAG are matched to previously stored chunk-caches. KV states are reused if context alignment and cache-context impact (CCI) metrics are favorable. A minimal number of KV rows are recomputed (“fix-up”) to address chunk order shifts or contextual changes. Preloading and multi-tier storage enable efficient operation on production workloads, showing 1.6×–2× latency and throughput improvements vs prefix-caching (Agarwal et al., 5 Feb 2025).
FusionRAG further bridges the inefficiency of naive chunk-level cache reuse by offline similarity-guided fusion of likely co-retrieved chunks and online sparse recomputation for critical chunk tokens. This achieves quality within a few percent of full attention at <15% recomputation, with up to 9.4× TTFT reduction (Wang et al., 19 Jan 2026).
4.2 Diffusion LLMs and Prompt-Centric Caching
Diffusion LLMs (dLLMs) require KV caching for bidirectional prompt-token attention, whose footprint grows with prompt length. MaskKV uses a mask-based, prompt-guided attention mechanism and data-driven, hybrid layer/head budgeting, substantially compressing KV storage in practice while retaining >90% performance (Huang et al., 10 Oct 2025).
Hybrid-grained approaches such as HGC employ both block-level and prompt-level caches in controllable generation, freezing and reusing cross-attention maps to accelerate denoising while preserving semantic fidelity (Liu et al., 14 Nov 2025).
4.3 Infilling and Interactive Tasks
In code or text infilling, EFIM reorganizes the prompt format so that only the incremental, dynamic segments are appended at the end, resulting in near-full cache reuse for both prefix and suffix. Complemented by fragment tokenization during training to enable subtoken continuation, this yields 52% lower latency and nearly doubles throughput compared to standard fill-in-the-middle schemes (Guo et al., 28 May 2025).
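The reorganization can be shown with plain strings: a standard fill-in-the-middle layout interleaves edited and static text, whereas an EFIM-style layout keeps the stable prefix and suffix first and appends only the freshly edited fragment at the tail, so earlier tokens keep their cached KV entries across edits. The token markers here are illustrative, not the paper's exact format:

```python
def fim_prompt(prefix, suffix):
    # Standard FIM: an edit to `prefix` changes tokens *before* the
    # suffix, invalidating the suffix's cached KV entries.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

def efim_prompt(stable_prefix, suffix, fresh):
    # EFIM-style: stable prefix and suffix stay fixed up front; only the
    # incremental fragment `fresh` is appended at the end.
    return f"<PRE>{stable_prefix}<SUF>{suffix}<FRESH>{fresh}<MID>"

def shared_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

before = efim_prompt("def f(x):\n    ", "    return y\n", "y = x")
after  = efim_prompt("def f(x):\n    ", "    return y\n", "y = x + 1")
reuse  = shared_prefix_len(before, after)   # everything up to the edit point
```

In the EFIM layout the shared prefix across the two edits spans both the code prefix and the suffix, so both survive as cache hits; in the standard FIM layout only the code prefix would.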
5. Prompt Cache Scheduling, Security, and Auditing
5.1 Distributed Serving and Cache-Affinity Load Balancing
DualMap addresses the inherent conflict between cache affinity (routing requests sharing prompt prefixes to the same instance for KV reuse) and global load balancing in distributed LLM serving. By mapping each request to two candidate servers via dual hashes, DualMap leverages the power-of-two-choices phenomenon, achieving both high cache hit rates and strong load balance. SLO-aware routing and hotspot-aware rebalancing further optimize for TTFT and throughput, with empirically up to 2.25× higher effective capacity (Yuan et al., 6 Feb 2026).
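The dual-mapping routing rule can be sketched in a few lines: hash the request's prompt prefix to two candidate servers, then pick the less-loaded one, so identical prefixes still concentrate on at most two instances. This is a toy model; the real system adds SLO-aware routing and hotspot-aware rebalancing on top:

```python
import hashlib

def _h(s, salt):
    # Deterministic hash so the same prefix always maps to the same pair.
    return int(hashlib.sha256(f"{salt}:{s}".encode()).hexdigest(), 16)

def route(prefix, loads):
    """Power-of-two-choices with cache affinity: two deterministic
    candidate servers per prefix, preferring the less-loaded one."""
    n = len(loads)
    a, b = _h(prefix, 1) % n, _h(prefix, 2) % n
    return a if loads[a] <= loads[b] else b

loads = [0, 0, 0, 0]
for _ in range(10):                       # same prefix -> same two candidates
    s = route("SYSTEM: shared prefix", loads)
    loads[s] += 1
used = [i for i, l in enumerate(loads) if l > 0]
```

Pure affinity hashing (one candidate) maximizes hit rate but lets a hot prefix overload one server; two candidates preserve most of the hit rate while letting load spill over.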
5.2 Side-Channel and Privacy Implications
Prompt caching strategies can lead to measurable timing side channels, enabling attackers to infer prompt cache hits or potentially leak information about prompts across users. A taxonomy identifies three levels of cache scope (per-user, per-team, global) and two matching policies (exact and prefix-match, the latter generally safe only for decoder-only Transformers). Statistical audits (e.g., Kolmogorov–Smirnov test on TTFT distributions) can detect prompt caching and scope, and have even revealed proprietary model architectures (e.g., identifying OpenAI's embedding model as decoder-only). Per-user or per-org cache isolation and transparency of caching policy are recommended for privacy (Gu et al., 11 Feb 2025).
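Such an audit can be sketched without any statistics library: collect TTFT samples for likely-cached and fresh prompts and compute the two-sample Kolmogorov–Smirnov statistic between them. The timing values below are synthetic and purely illustrative; a real audit would compare the statistic against a critical value or a permutation-test p-value:

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov–Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    def ecdf(sorted_v, t):
        return bisect.bisect_right(sorted_v, t) / len(sorted_v)
    pts = sorted(set(xs + ys))
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in pts)

# Synthetic TTFT samples (seconds): cache hits cluster low, misses high.
hit_ttft  = [0.11, 0.12, 0.10, 0.13, 0.11, 0.12]
miss_ttft = [0.48, 0.51, 0.47, 0.53, 0.50, 0.49]
d_stat = ks_statistic(hit_ttft, miss_ttft)   # 1.0 here: fully separated
```

A large statistic on repeated-versus-fresh prompt pairs is evidence that a cache (and hence a timing side channel) exists; varying which user issues the repeat probes the cache's sharing scope.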
Semantic prompt caches (e.g., vCache) further return cached responses to semantically similar prompts, using embedding-based kNN search and online per-entry threshold selection to guarantee a user-defined error rate. This design yields significantly higher hit rates at identical error rates compared to static-threshold policies, with sub-millisecond overhead (Schroeder et al., 6 Feb 2025).
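The semantic lookup can be sketched as nearest-neighbor search over prompt embeddings with a per-entry similarity threshold. The embeddings and thresholds below are toys, and the online threshold learning that vCache performs per entry is omitted:

```python
import numpy as np

class SemanticCache:
    def __init__(self, default_threshold=0.9):
        self.entries = []                 # (embedding, response, threshold)
        self.default_threshold = default_threshold

    def put(self, emb, response, threshold=None):
        emb = emb / np.linalg.norm(emb)
        self.entries.append((emb, response,
                             threshold or self.default_threshold))

    def get(self, emb):
        """Return the nearest entry's cached response if the cosine
        similarity clears that entry's own threshold, else None."""
        emb = emb / np.linalg.norm(emb)
        best, best_sim = None, -1.0
        for e, resp, thr in self.entries:
            sim = float(e @ emb)
            if sim > best_sim:
                best, best_sim = (resp, thr), sim
        if best is not None and best_sim >= best[1]:
            return best[0]
        return None

cache = SemanticCache()
cache.put(np.array([1.0, 0.0]), "cached answer")
hit  = cache.get(np.array([0.99, 0.05]))   # near-duplicate prompt
miss = cache.get(np.array([0.0, 1.0]))     # unrelated prompt
```

Storing the threshold per entry is what allows an online policy to tighten it for entries that produce wrong reuses and loosen it for robust ones, rather than tuning one global knob.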
5.3 Security Vulnerabilities in Diffusion Model Caches
Approximate prompt caches in diffusion models open new remote attack vectors—covert channels, prompt stealing (recover cached prompt embeddings/inputs), and poisoning (injecting logo or watermark overlays into outputs for future hits). Effective mitigations combine randomized cache lookup, content filtering, rate-limiting, per-tenant isolation, and noise injection (Sun et al., 28 Aug 2025).
6. Emerging Trends and Open Issues
- Throughput–Accuracy Trade-Offs: Aggressive cache compression or reuse must be balanced against potential losses in instruction following, system prompt leakage, and robustness to multi-turn or multi-instruction prompts (Chen et al., 30 Sep 2025, Kim et al., 22 Sep 2025).
- Dynamic, Semantic, and Modular Schemes: Increasing interest in modular prompt schemas, semantic/embedding cache keys, and fusion of symbolic (schema) and learned (embedding) cache management (Gim et al., 2023, Schroeder et al., 6 Feb 2025).
- Distributed and Heterogeneous Serving: Solutions span in-GPU/HBM multitenant caching, DRAM-SSD tiering, and routing algorithms robust to load skew and real-world traffic (Yuan et al., 6 Feb 2026, Agarwal et al., 5 Feb 2025).
- LLM+RAG and Specialized Tasks: Specialized prompt cache paradigms are needed for RAG, diffusion models, controllable generation, reinforcement learning, and infilling workloads (Agarwal et al., 5 Feb 2025, Liu et al., 14 Nov 2025, Chang et al., 14 Jan 2026, Guo et al., 28 May 2025).
- Security and Privacy: Side-channel prevention and privacy guarantees remain critical open problems for both language and vision pipelines, especially under cross-tenant and approximate cache sharing (Gu et al., 11 Feb 2025, Sun et al., 28 Aug 2025).
7. Practical Considerations and Best Practices
- Always segment prompt input to maximize the stable (cacheable) prefix and keep dynamic or session-specific content at the end.
- Use explicit UUIDs or cache-breakers to delimit segments when using provider-managed APIs (Lumer et al., 9 Jan 2026).
- Monitor cache usage and hit rates, employ fair eviction and whitelisting for critical instructions, and validate under representative workloads, not just synthetic benchmarks (Chen et al., 30 Sep 2025).
- For RAG, multi-document, and conversational workloads, combine episodic compression, blockwise prefill, and per-chunk or per-episode KV budgeting (Kim et al., 22 Sep 2025, Agarwal et al., 5 Feb 2025, Wang et al., 19 Jan 2026).
- In distributed serving, leverage dual-mapping for combined cache-affinity and load-balancing, and tune scheduling under system SLOs (Yuan et al., 6 Feb 2026).
- Audit and document caching policies, implement privacy boundary controls, and consider the security implications of approximate or semantic cache schemes (Gu et al., 11 Feb 2025, Sun et al., 28 Aug 2025, Schroeder et al., 6 Feb 2025).
Prompt cache, originally arising as a simple acceleration tool, has rapidly evolved into a central construct for scaling, efficiency, retrieval, security, and accuracy in large model inference, enabling modern LLM and diffusion model applications to operate at real-world scale and cost.