Cross-Request KV Caching Systems

Updated 25 February 2026
  • Cross-request KV caching is a technique that reuses previously computed or retrieved values across different requests to enhance efficiency in multi-tenant systems.
  • It employs advanced matching strategies—including exact-prefix, semantic similarity, and delta editing—to enable partial reuse and reduce redundant computations.
  • Adaptive cache management policies, such as workload-aware eviction and model-aware compression, significantly reduce latency and resource consumption in distributed environments.

Cross-request key/value (KV) caching comprises a class of techniques for reusing previously computed or retrieved values across distinct client or user requests to accelerate inference, reduce redundancy, and improve resource efficiency in distributed storage, LLM serving systems, networking, and tool-calling applications. The approach extends traditional exploitation of temporal locality by enabling cache hits when requests are semantically or structurally similar rather than identical. Its development has been driven by the computational and latency bottlenecks observed in multi-tenant LLM serving and cloud-based storage, which necessitate workload-aware, scalable, and adaptive cache design.

1. Architectural Principles and Taxonomy

Cross-request KV caching architectures are characterized by three key system-level decisions: (a) cache insertion and key derivation strategies, (b) cross-request matching and reuse mechanisms, and (c) cache management policies, including eviction and consistency enforcement.

  • Insertion and Key Derivation: Systems must determine what constitutes an eligible cacheable entry, typically through extracting structured semantic features (e.g., tool-call type, LLM prompt hash, transaction identifier) and constructing robust, collision-resistant cache keys (e.g., SHA-256 over order-invariant serializations) (Zhai et al., 20 Jan 2026).
  • Cross-request Matching: Exact-prefix matching yields high-fidelity reuse but is limited to identical requests. Advanced systems employ semantic similarity via embedding databases, edit-distance-aware delta structures, or flexible admission controllers to support partial and inexact reuse (Yang et al., 17 Mar 2025, Pandey, 4 Dec 2025).
  • Management and Eviction: Resource constraints and skewed workloads necessitate sophisticated eviction policies that blend recency/frequency (LRU, LFU), cost/benefit analysis, reuse-probability models, or randomized online algorithms, often augmented by adaptive admission control (Wang et al., 3 Jun 2025, Wu et al., 26 Jan 2026, Zhai et al., 20 Jan 2026).
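The key-derivation step above can be sketched in a few lines. The tool-call fields and helper names below are illustrative, not taken from any cited system; the point is that an order-invariant serialization plus SHA-256 yields a collision-resistant, canonical cache key:

```python
import hashlib
import json

def cache_key(tool_name, arguments):
    """Derive a collision-resistant cache key from a tool call.

    Serializing with sorted keys makes the hash order-invariant:
    semantically identical calls map to the same key regardless of
    argument order. (Illustrative sketch, not a cited system's code.)
    """
    canonical = json.dumps(
        {"tool": tool_name, "args": arguments},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Argument order does not change the key:
k1 = cache_key("weather.lookup", {"city": "Oslo", "units": "metric"})
k2 = cache_key("weather.lookup", {"units": "metric", "city": "Oslo"})
assert k1 == k2
```

Fixing the separators as well as the key order matters: any whitespace or ordering variation in the serialization would silently fragment the cache into duplicate entries.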

Table 1 summarizes major system archetypes:

| System Type | Key Matching | Eviction Policy |
| --- | --- | --- |
| LLM prefix cache | Exact-prefix | LRU / custom |
| Semantic LLM cache | Embedding + DELTA Tree | LRU / workload-aware |
| Distributed storage cache | Key + lease/TTL | LRU / lease-based |
| In-network data-plane cache | Hash + metadata | Popularity / control-plane |
| Tool-calling cache | Feature hash + semantic | Bandit admission + value-aware LRU |

2. Algorithms for Cross-Request Reuse and Partial Sharing

State-of-the-art systems targeting LLM inference, such as KVShare (Yang et al., 17 Mar 2025), generalize beyond prefix caching by employing mechanisms to enable efficient cross-user cache hits:

  • DELTA Tree and Semantic Indexing: Requests are embedded (e.g., through sentence transformers) and indexed. On each new request, the nearest neighbor prior prompt is retrieved, and an edit script (DELTA Tree) is computed to guide minimal KV edit operations—deletion, insertion of placeholders, and optional replacement—so that only modified prefix regions are recomputed.
  • PartialAttention: Transformer attention computation is restricted to "placeholder" tokens identified via DELTA Tree, leveraging the linearity of self-attention for correctness. This local recomputation ensures that cross-request reuse does not introduce semantic drift.
  • Token Recycling and Partial KV Use: For low-parameter LLMs, cached past_key_values are selectively loaded when an exact prefix is detected. Latency and compute savings are realized by skipping encoding and attention passes up to the longest-matching prefix (Pandey, 4 Dec 2025).

Such mechanisms are domain-agnostic and applicable in tool-calling (Zhai et al., 20 Jan 2026) and cross-transaction storage (Misra et al., 2020), where the cacheability of requests is inferred from request type, user role, or TTL.
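The exact-prefix reuse path described above can be illustrated with a toy sketch. The helper names are hypothetical, and real systems slice per-layer past_key_values tensors rather than plain token lists, but the decision logic is the same: find the longest shared token prefix, reuse cached KV entries up to it, and recompute only the tail:

```python
def longest_common_prefix(cached_tokens, new_tokens):
    """Length of the shared token prefix between a cached request and a new one."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def plan_reuse(cached_tokens, new_tokens):
    """Decide how many cached KV entries can be reused and which tokens
    still need a prefill pass (hypothetical helper; a real system would
    slice per-layer KV tensors at this boundary)."""
    k = longest_common_prefix(cached_tokens, new_tokens)
    return {"reuse_kv_upto": k, "recompute": new_tokens[k:]}

plan = plan_reuse([1, 2, 3, 4], [1, 2, 3, 9, 10])
# KV for the first 3 tokens is reused; only [9, 10] need a fresh prefill pass.
```

Latency savings scale with the prefix length: for long shared system prompts, nearly the entire prefill cost is skipped.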

3. Cache Management Policies and Workload Adaptivity

Emerging large-scale LLM and storage workloads demonstrate skewed and ephemeral reuse patterns, necessitating adaptive eviction and admission modeling (Wang et al., 3 Jun 2025, Zhu et al., 28 May 2025). Notable policy features include:

  • Workload-Aware Tuple-Priority Eviction: Each KV block’s eviction priority is represented as a tuple (negative estimated future reuse probability, token offset). This exploits per-category exponential fits of reuse-time distributions and the empirical dominance of head-prefixed blocks for optimal caching effectiveness (Wang et al., 3 Jun 2025).
  • Value-driven and Bandit-based Policies: In tool-calling settings, cache admission and eviction depend on a multi-factor "caching-value score" combining latency, cost, hit frequency, and staleness risk. VAAC employs hierarchical grouping and UCB-based arm selection for adaptive admission, and value-aware LRU evicts among the least recently used with the lowest composite value (Zhai et al., 20 Jan 2026).
  • Randomized Online Algorithms: Randomized Log-Time eviction (RLT) guarantees O(log n) competitive ratio for adversarial query streams and, when combined with learning-based query routing, yields substantial improvements in cache hit rates and throughput over LRU or simple cost models (Wu et al., 26 Jan 2026).

The use of dynamically fitted model parameters (e.g., λ for exponential reuse time) and category-specific queues enables continuous adaptation as the workload shifts across single-turn, multi-turn, API, and chat modes.
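A minimal sketch of tuple-priority eviction follows. The exp(−λ·age) decay is an assumed stand-in for the per-category exponential fits described above, and all names are illustrative; the key idea is that the victim is the block with the lowest estimated reuse probability, with ties broken by evicting deeper (tail) blocks before head-prefix blocks:

```python
import math

def eviction_priority(lam, age_seconds, token_offset):
    """Tuple priority: (negative estimated reuse probability, token offset).
    The block with the *largest* tuple is evicted: least likely to be
    reused, ties broken against tail blocks. The exp(-lam * age) decay is
    an illustrative assumption; the cited system fits lam per workload
    category online."""
    p_reuse = math.exp(-lam * age_seconds)
    return (-p_reuse, token_offset)

def pick_victim(blocks, lam):
    """blocks: list of (block_id, age_seconds, token_offset) tuples."""
    return max(blocks, key=lambda b: eviction_priority(lam, b[1], b[2]))[0]

# A long-idle block is evicted before recently used ones, and among
# equally recent blocks the tail (higher offset) goes first:
blocks = [("head", 5.0, 0), ("tail", 5.0, 512), ("stale", 600.0, 0)]
assert pick_victim(blocks, lam=0.01) == "stale"
```

The offset tiebreak encodes the empirical observation that head-prefixed blocks dominate reuse, so tail blocks are the cheaper sacrifice.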

4. System Implementations: Distributed, In-Network, and LLM-Serving

Cross-request KV caching is realized across varied platforms:

  • LLM Serving Systems: KVShare integrates at the transformer KV attention boundary, requires vector-database lookups, and orchestrates fine-grained delta edits via the DELTA Tree before partial KV recomputation (Yang et al., 17 Mar 2025). Canonical pipeline stages: request routing → semantic similarity search → DELTA Tree edit → PartialAttention compute.
  • Distributed Storage (Kairos): Inter-transaction caching leverages soft leases whose duration d_ideal is derived analytically, balancing cache-hit benefit and freshness against write arrival rates. PTP-synchronized clocks allow server-state-free leases with fallback optimistic concurrency control; scalability is enhanced by sharded client-side validators and watermark-based GC (Misra et al., 2020).
  • In-Network Caching: OrbitCache/StarCache circumvent ASIC memory constraints by representing cached items as recirculating packets in the switch data plane. Admission and eviction are orchestrated by a controller informed by both switch-side counters and server-side frequency sketches, with congested items managed by dynamic resizing and periodic popularity aggregation (Kim, 2024).
  • VNF-based Deployments: VNF-Cache employs an in-network function to intercept and serve KV requests (e.g., MongoDB) directly at edge routers or PoPs. Consistency is enforced via write-invalidate, and cache federation is enabled with coordination flows for geo-distributed coherence (Farias et al., 23 Dec 2025).
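The write-invalidate consistency rule can be illustrated with a minimal in-memory sketch. This is illustrative only; a real VNF deployment serves reads at the edge and would additionally propagate invalidations across federated caches:

```python
class WriteInvalidateCache:
    """Minimal write-invalidate KV cache: reads are served from the cache,
    writes go to the backing store and drop the cached copy so the next
    read re-fetches fresh data. (Illustrative sketch, not the cited
    system's implementation.)"""

    def __init__(self, store):
        self.store = store   # authoritative backing store (e.g. a database)
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]   # miss: fetch and cache
        return self.cache[key]

    def write(self, key, value):
        self.store[key] = value
        self.cache.pop(key, None)               # invalidate the stale copy

store = {"user:1": "v1"}
c = WriteInvalidateCache(store)
assert c.read("user:1") == "v1"
c.write("user:1", "v2")
assert c.read("user:1") == "v2"  # re-fetched after invalidation
```

Invalidate-on-write trades a cold read after each update for the guarantee that the cache never serves a value older than the last write it observed.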

5. Performance and Workload Characterization

Empirical studies at cloud provider scale reveal fundamental patterns governing cross-request KV cache efficiency (Wang et al., 3 Jun 2025):

  • Ideal cache hit rates on production LLM workloads range from 54–62%, substantially lower than synthetic benchmarks, owing to the multiplicity of ephemeral, per-user, and single-turn interactions.
  • Reuse is highly skewed: 10% of blocks account for >75% of reuses, and the vast majority (>90%) of reuse is intra-user, not cross-user.
  • Temporal locality is concentrated: 80% of reuses arise within 10 minutes (chat workloads) or 10 seconds (API workloads).
  • The working set for optimal caching is moderate (cache size ≈4x HBM in GQA models, <1x in API workloads), and memory scaling for multi-head attention (MHA) models is considerable (10x+ HBM for Llama3-70B).
  • Workload-aware eviction consistently improves hit rate (3–9 percentage points over LRU) and reduces response time/TTFT by up to 40%.

Table 2 summarizes empirical hit ratios for key workloads and models.

| Model | Policy Change | Hit-Ratio Gain | TTFT Reduction |
| --- | --- | --- | --- |
| Qwen2-7B (GQA) | LRU → workload-aware | +10.9% | −41.9% |
| Llama3-70B | LRU → workload-aware | +43.3% | −32.0% |

6. Compression, Scalability, and Future Directions

KV cache growth and context length scaling have catalyzed research into model-aware compression and learning-based allocation:

  • Compression via Nexus Tokens (SONIC): Multi-turn context is summarized using learnable "Nexus" embeddings that replace raw KV segments. Hierarchical attention masking and loss-regularized distillation ensure critical memory is retained while compressing historical segments by 50–80%. Empirical results show up to 67.3% memory savings and 50.1% reduction in inference time at negligible quality cost on public benchmarks (Chen et al., 29 Jan 2026).
  • Scaling and Routing: Multi-LLM serving necessitates coordinated cache-eviction and query-routing policies that balance worker load without sacrificing locality. Joint optimization with learning-based routing and randomized eviction (LBGR+RLT) yields up to 6.92× higher hit rates and 14.06× better TTFT than baseline methods, confirmed in adversarial and realistic traffic studies (Wu et al., 26 Jan 2026).
  • Integration with System Pipelines: Advanced frameworks either slot in at the attention layer (requiring insertion points for placeholder/edited KV) or operate orthogonally over intercepted API/tool-calls (bandit-admission, dynamic grouping). SONIC's compression layer adapts on-the-fly to variable memory budgets via dynamic budget training.

7. Practical Considerations and Best Practices

Emerging best practices for deploying cross-request KV caching include:

  • Pre-characterize request and block reuse patterns using real workload traces rather than synthetic benchmarks (Wang et al., 3 Jun 2025).
  • Exploit both single-turn and multi-turn reuse; identical system prompts in API workloads can drive substantial cross-request hit rates.
  • In highly skewed or ephemeral settings, frequency-based policies (LFU) are suboptimal; favor time-bounded reuse probability and tuple-priority or bandit-based admission.
  • Integrate cache-layer metrics (reuse time, probability, lifespan) with system-level adaptation (eviction, resource allocation, admission control).
  • For in-network, data-plane caches, further gains are possible by federating multiple edge caches with coordinated invalidation.
  • In LLM serving, use semantic matching and partial recomputation judiciously; edit distance cutoffs and DELTA Tree pruning balance efficiency and recomputation overhead (Yang et al., 17 Mar 2025).
  • For compression-based approaches (e.g., SONIC), always preserve prompt and current turn in full, and tune attention regularization weights to ensure compressed “memory” is utilized.
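As a rough illustration of the caching-value idea from Section 3, one might combine the four factors linearly. The linear form and weights below are assumptions for illustration, not the cited system's actual scoring function:

```python
def caching_value(latency_saved_ms, cost_saved, hit_freq, staleness_risk,
                  weights=(1.0, 1.0, 1.0, 1.0)):
    """Composite caching-value score: higher means more worth caching.
    Combines latency saved, monetary cost saved, observed hit frequency,
    and staleness risk. The linear combination and unit weights are
    illustrative assumptions."""
    w_lat, w_cost, w_hit, w_stale = weights
    return (w_lat * latency_saved_ms
            + w_cost * cost_saved
            + w_hit * hit_freq
            - w_stale * staleness_risk)

# A slow, frequently repeated, stable tool call scores higher than a
# fast, rarely repeated, volatile one:
stable = caching_value(800, 0.02, 12, 0.1)
volatile = caching_value(50, 0.001, 2, 0.9)
assert stable > volatile
```

In practice the weights would be tuned per deployment (e.g. weighting staleness heavily for tools whose answers change by the second), which is precisely what bandit-based admission automates.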

In summary, cross-request key/value caching encompasses a spectrum of mechanisms that exploit temporal and semantic locality across diverse high-throughput and multi-tenant tasks. Recent research demonstrates that advanced cache matching, adaptive and probabilistic eviction, and model-aware context compression are necessary for optimal performance in LLM serving, cloud storage, networking, and API tool-chaining deployments (Yang et al., 17 Mar 2025, Zhu et al., 28 May 2025, Kim, 2024, Pandey, 4 Dec 2025, Misra et al., 2020, Wang et al., 3 Jun 2025, Farias et al., 23 Dec 2025, Wu et al., 26 Jan 2026, Chen et al., 29 Jan 2026, Zhai et al., 20 Jan 2026).
