Pull-Based KV Cache Transfer Strategy
- Pull-based KV cache transfer is an on-demand method that retrieves only essential cache blocks during LLM inference to minimize time-to-first-token.
- It employs adaptive, bidirectional scheduling with semantic triggers and parallel compute/I/O threads to optimize cache prefill and resource usage.
- Empirical results show significant latency reductions and throughput gains across systems like Cake, LMCache, and Titanus, enhancing overall energy efficiency.
A pull-based KV cache transfer strategy orchestrates on-demand, request-driven loading of Key-Value cache blocks in LLM inference, typically from persistent storage, off-chip memory, or remote engines into compute devices. In contrast to push-based schemes—which proactively broadcast or stage full KV caches—a pull-based approach only transfers the necessary cache blocks, often in response to real-time scheduling or semantic boundaries. Pull-based strategies aim to minimize time-to-first-token (TTFT), reduce I/O and compute bottlenecks, and adapt flexibly to variable resource availability and storage bandwidth. Modern systems such as Cake, LMCache, FreeKV, LouisKV, Titanus, and FlowKV deploy variants of pull-based logic to optimize LLM inference across long contexts, cross-query reuse, cross-layer sharing, and disaggregated compute scenarios.
1. Formal Problem Statement and Design Objectives
The fundamental problem addressed by pull-based KV-cache strategies is minimizing TTFT during the prefill stage for long-context LLM inference. For a prompt split into $N$ fixed-size chunks, the system must generate each chunk’s KV cache either through local GPU computation or by I/O transfer (from disk or cache), selecting the optimal split point $k^*$:

$$
k^* = \arg\min_{0 \le k \le N} \; \max\!\left( \sum_{i=1}^{k} T_{\text{comp}}(i), \; \sum_{i=k+1}^{N} T_{\text{io}}(i) \right)
$$

Here, $T_{\text{comp}}(i)$ reflects GPU computation time for chunk $i$ (empirically linear in $i$, leading to a quadratic cumulative cost), and $T_{\text{io}}(i)$ is the chunk fetch time (typically constant). The strategy seeks the point where the computed prefix and fetched suffix rendezvous, minimizing TTFT. This objective generalizes to heterogeneous contexts—cross-query, cross-layer, and disaggregated nodes—where the goal is always to minimize critical-path latency and resource usage.
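The split-point objective can be made concrete with a small numerical sketch. The code below is an offline illustration of the formula above, not any system's scheduler; the per-chunk timings are hypothetical, and adaptive systems such as Cake reach the same rendezvous implicitly at runtime rather than solving the minimization explicitly.

```python
import numpy as np

def optimal_split(t_comp_per_chunk, t_io_per_chunk) -> int:
    """Return k* minimizing TTFT(k) = max(compute time of chunks [0, k),
    fetch time of chunks [k, N)). Inputs are per-chunk timings in seconds."""
    # Cumulative compute time of the prefix [0, k) for k = 0..n.
    comp_prefix = np.concatenate(([0.0], np.cumsum(t_comp_per_chunk)))
    # Cumulative fetch time of the suffix [k, n) for k = 0..n.
    io_suffix = np.concatenate((np.cumsum(t_io_per_chunk[::-1])[::-1], [0.0]))
    ttft = np.maximum(comp_prefix, io_suffix)
    return int(np.argmin(ttft))

# Hypothetical timings: compute cost grows linearly with chunk index
# (quadratic cumulative cost), fetch cost is roughly constant per chunk.
n = 16
t_comp = [0.01 * (i + 1) for i in range(n)]
t_io = [0.05] * n
k_star = optimal_split(t_comp, t_io)
print(f"compute chunks [0, {k_star}), fetch chunks [{k_star}, {n})")
```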
2. System Architectures and Pull Protocols
Pull-based KV-transfer systems are realized in software frameworks (Cake, LMCache, FreeKV, LouisKV, Titanus, FlowKV) and sometimes at the hardware level (Titanus). They feature:
- Parallel Compute and I/O Threads: One thread (usually on GPU) pulls and computes chunks from the input’s front; another (CPU/storage) pulls and loads from the back. No global split parameter is needed.
- Residency Tracking: A map tracks which cache chunks have been pulled and are resident, allowing termination upon rendezvous.
- Modular Connectors and APIs: LMCache’s data-plane workers extract, pull, and push KV blocks between compute and storage/network tiers, controlled by APIs (lookup, move, pin, clear, compress); an illustrative connector and residency-map sketch follows this list.
- Efficient Data Planes: Block-chunk batched I/O, zero-copy transfers, and connector modularity tune chunk size and schedule transfer pipelines for maximal throughput.
- Disaggregated Node Coordination: FlowKV decouples prefill (P) and decode (D) nodes, initiating pulls only when D requires data. Memory is allocated in large contiguous segments to reduce kernel launches (O(1) vs. O(N)).
- Speculative and Semantic-Aware Triggers: FreeKV, LouisKV, and cross-layer schemes trigger pulls either speculatively (based on query similarity) or only at semantic boundaries (via clustering and cosine metrics).
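To make these components concrete, the following is a minimal, hypothetical sketch of two recurring software pieces named above: a connector exposing lookup/pull/push operations between a compute engine and a cache tier, and a residency map used to detect the rendezvous point. The class names and signatures are illustrative and do not reproduce LMCache's (or any other system's) actual interfaces.

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class KVConnector(ABC):
    """Hypothetical connector between a compute engine and a KV-cache tier."""

    @abstractmethod
    def lookup(self, chunk_hash: str) -> bool:
        """Return True if the KV chunk is already resident in this tier."""

    @abstractmethod
    def pull(self, chunk_hash: str, device: torch.device) -> Optional[torch.Tensor]:
        """Fetch a KV chunk into device memory on demand (the 'pull')."""

    @abstractmethod
    def push(self, chunk_hash: str, kv_block: torch.Tensor) -> None:
        """Offload a computed KV chunk to this tier for later reuse."""


class ResidencyMap:
    """Tracks which chunk indices are resident so the compute and I/O workers
    can stop once their frontiers meet."""

    def __init__(self, num_chunks: int):
        self.resident = [False] * num_chunks

    def mark(self, idx: int) -> None:
        self.resident[idx] = True

    def first_missing(self) -> int:
        for i, present in enumerate(self.resident):
            if not present:
                return i
        return len(self.resident)
```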
The following table summarizes selected system components and pull logic:
| System | Pull Trigger | Residency/Tracking |
|---|---|---|
| Cake | Bidirectional pointer sweep | CPU memory residency map |
| LMCache | Token processor events | Connector + controller |
| FlowKV | Decode node requests | Segment-based allocator |
| LouisKV | Semantic boundary detection | Centroid cache units |
3. Bidirectional, Adaptive, and Semantic Scheduling
Classic bidirectional scheduling (Cake) initializes two pointers: one for compute (front) and one for I/O (back), progressing until their indices cross. The adaptive mechanism is entirely implicit—if the per-chunk fetch time $T_{\text{io}}$ increases, the compute thread advances further before rendezvous (the split point $k^*$ rises); if $T_{\text{comp}}$ rises, the I/O thread fetches more ($k^*$ falls). No tuning is required; actual durations drive the split.
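The sweep can be expressed compactly as two workers sharing a pair of indices. The sketch below is an illustrative rendering of this bidirectional pattern under a simple lock, not Cake's implementation; `compute_chunk` and `fetch_chunk` are placeholder callables.

```python
import threading

def bidirectional_prefill(num_chunks, compute_chunk, fetch_chunk):
    """Compute KV chunks from the front while fetching cached chunks from the
    back; both workers stop when their indices cross, so the effective split
    point emerges from actual durations rather than a tuned parameter."""
    front, back = 0, num_chunks - 1
    lock = threading.Lock()

    def compute_worker():
        nonlocal front
        while True:
            with lock:
                if front > back:
                    return                   # rendezvous reached
                idx, front = front, front + 1
            compute_chunk(idx)               # prefill this chunk on the GPU

    def io_worker():
        nonlocal back
        while True:
            with lock:
                if back < front:
                    return                   # rendezvous reached
                idx, back = back, back - 1
            fetch_chunk(idx)                 # load this chunk's KV from storage

    workers = [threading.Thread(target=compute_worker),
               threading.Thread(target=io_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```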
Semantic-aware approaches (LouisKV) only trigger a pull at detected semantic boundaries, computed by the cosine similarity of consecutive query vectors. Fine-grained management splits cache units into clusters (input) and temporal segments (decode), searching centroids for maximal relevance and pulling only critical entries.
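A minimal sketch of this trigger-and-select logic is shown below; it re-implements the general idea of cosine-based boundary detection and centroid ranking, with a hypothetical threshold and top-k value rather than LouisKV's actual settings.

```python
import torch
import torch.nn.functional as F

def is_semantic_boundary(prev_query: torch.Tensor,
                         curr_query: torch.Tensor,
                         threshold: float = 0.85) -> bool:
    """Trigger a pull only when consecutive query vectors diverge enough."""
    sim = F.cosine_similarity(prev_query.flatten(), curr_query.flatten(), dim=0)
    return sim.item() < threshold

def select_critical_units(query: torch.Tensor,
                          centroids: torch.Tensor,
                          top_k: int = 4) -> torch.Tensor:
    """Rank cache units (clusters for the prompt, temporal segments for
    decode) by centroid relevance and return the indices worth pulling."""
    scores = F.cosine_similarity(centroids, query.unsqueeze(0), dim=-1)
    return torch.topk(scores, k=min(top_k, centroids.shape[0])).indices
```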
Cross-layer schemes redefine cache sharing via pull, letting non-KV layers fetch from KV layers at decode time—potentially reducing total cache size in proportion to the number of layers sharing each cache, and requiring iterative prefill adjustments for top/middle-pull setups.
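As an illustration of the bookkeeping involved, the sketch below maps every layer to a producer layer and lets non-producer layers pull the producer's KV at decode time; the grouping policy and class names are hypothetical and not tied to a specific published scheme.

```python
from typing import Dict
import torch

def build_sharing_map(num_layers: int, group_size: int = 2) -> Dict[int, int]:
    """Assign every group of `group_size` consecutive layers to the first
    layer of the group, so only num_layers / group_size KV caches exist."""
    return {layer: (layer // group_size) * group_size for layer in range(num_layers)}

class SharedKVStore:
    """Decode-time cross-layer pull: producer layers store, others pull."""

    def __init__(self, sharing: Dict[int, int]):
        self.sharing = sharing
        self.cache: Dict[int, torch.Tensor] = {}   # producer layer -> KV tensor

    def put(self, layer: int, kv: torch.Tensor) -> None:
        if self.sharing[layer] == layer:           # only producer layers store
            self.cache[layer] = kv

    def pull(self, layer: int) -> torch.Tensor:
        return self.cache[self.sharing[layer]]     # others pull from their producer
```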
4. Performance Optimizations and Quantitative Results
Modern pull-based KV strategies deploy several optimizations:
- Chunked/Batched I/O: Chunk sizes are tuned to saturate PCIe/NVLink (e.g., LMCache’s recommended chunk sizes are on the MB scale).
- Zero-Copy Data Movement: Buffers are handed off by reference rather than duplicated across tiers.
- Layer-wise Compute/I/O Overlap: Separate CUDA streams for compute and transfer; pipelined layer computation and KV loading (see the double-buffering sketch after this list).
- Prefetch Heuristics: Pending queries may prefetch their prefixes into CPU memory so that pulls from slower tiers do not stall the GPU.
- Lossy Compression: Acceptable in chat/instruction-tuned setups, further reducing pull volume (Cake with quantization or CacheGen).
- Dense-to-Sparse/Clustered Transfer: Titanus only pulls non-zero KV entries, leveraging CPQ and HQE for compression and hierarchical quantization.
- Double-Buffered Recall: FreeKV uses alternating GPU buffers for fully overlapped DMA and transpose, hiding >90% of pull latency.
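The overlap and double-buffering items above share a common pattern: stage the next chunk on a copy stream while computing on the current one. The sketch below shows that generic pattern in PyTorch with pinned host memory and two alternating device buffers; it is an assumption-laden illustration (requiring a CUDA device), not FreeKV's or LMCache's actual data path.

```python
import torch

def pull_with_overlap(cpu_chunks, process_chunk):
    """Double-buffered, stream-overlapped pulling of KV chunks.

    While the default stream consumes one buffer, a copy stream asynchronously
    stages the next chunk from pinned CPU memory into the other buffer."""
    copy_stream = torch.cuda.Stream()
    device = torch.device("cuda")

    buffers = [cpu_chunks[0].pin_memory().to(device, non_blocking=True), None]
    torch.cuda.synchronize()                      # first chunk fully staged

    for i in range(len(cpu_chunks)):
        current = buffers[i % 2]
        if i + 1 < len(cpu_chunks):
            with torch.cuda.stream(copy_stream):  # prefetch the next chunk
                buffers[(i + 1) % 2] = cpu_chunks[i + 1].pin_memory().to(
                    device, non_blocking=True)
        process_chunk(current)                    # compute on the default stream
        torch.cuda.current_stream().wait_stream(copy_stream)  # fence before reuse
```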
Empirical results across benchmarks and hardware (A100/H100, RTX 3090):
- Cake: TTFT reduced by 76.9–93.5% (vs. I/O-only) and up to 31.5% (vs. compute-only) (Jin et al., 4 Oct 2024).
- LMCache: Throughput up to 15× higher than native vLLM; remote pulls at 1.3–3× gains; 2× TTFT reduction in tail (Cheng et al., 8 Oct 2025).
- Titanus: Off-chip access reduced by 107.1×; energy efficiency 159.9× and throughput 49.6× vs. A100 (Chen et al., 23 May 2025).
- FreeKV: 13.7× decode speedup over ArkVale, 8.4× over ShadowKV; <0.6% loss in accuracy (Liu et al., 19 May 2025).
- LouisKV: 1.9–4.7× speedup over ArkVale; ≤0.2% loss in retrieval accuracy (Wu et al., 13 Oct 2025).
- FlowKV: KV transfer latency reduced by 96–98%; end-to-end throughput improved by 15.2–48.9% over baseline (Li et al., 3 Apr 2025).
5. Trade-Offs, Limitations, and Practical Considerations
Trade-offs are highly workload-dependent:
- Duplication Window (LMCache): Tighter duplication means less RAM usage but higher risk of GPU stalls; wider windows trade higher memory for fewer stalls.
- Remote Pull Latency: Remote storage pulls may lag, so overlapping with compute and effective prefetching are essential.
- Compression Effects: Aggressive pruning and quantization (Titanus) risk accuracy loss when exceeding layer-sensitivity bounds; below a certain bit-width, quantization error becomes non-negligible.
- Chunk Sizing: Adaptive chunking avoids both underutilized bandwidth (chunks too small) and transfer-induced stalls (chunks too large); a simple sizing heuristic is sketched after this list.
- Role Flexibility: FlowKV’s ability for nodes to swap P/D roles mitigates load imbalance, but role-switching decisions must rely on sliding-window metrics to prevent oscillation.
- Implementation Overheads: Modular connectors (LMCache), pinned CPU memory (FreeKV), and kernel selection optimize integration and deployment.
- Prefill Iterative Encoding: Cross-layer pulls from upper layers require special iterative training and causal masking, incurring extra prefill latency but retaining performance.
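For the chunk-sizing trade-off, a simple budget-based heuristic illustrates the idea: pick a chunk large enough to amortize per-transfer overhead but small enough that one transfer fits a latency budget. Every constant below is hypothetical and should be measured per deployment.

```python
def choose_chunk_bytes(link_gbps: float,
                       target_transfer_ms: float = 2.0,
                       min_bytes: int = 256 * 1024,
                       max_bytes: int = 64 * 1024 * 1024) -> int:
    """Clamp the bytes movable within the latency budget to a [min, max] range."""
    bytes_per_ms = link_gbps * 1e9 / 8 / 1000     # link throughput in bytes/ms
    target = int(bytes_per_ms * target_transfer_ms)
    return max(min_bytes, min(max_bytes, target))

# Example: a link sustaining ~16 GB/s with a 2 ms per-chunk budget -> ~32 MB chunks.
print(choose_chunk_bytes(link_gbps=128.0))
```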
A plausible implication is that flexible, adaptive pull strategies will outperform static push-based or pure-compute/pure-I/O schemes when context lengths, hardware availability, and storage topology are highly variable. Most strategies can fall back gracefully to full prefill if pulls fail or latency is prohibitive, as in the sketch below.
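A graceful-fallback path can be as simple as bounding the pull and recomputing on failure. The sketch below is a generic illustration with hypothetical names and a made-up latency budget, not a specific system's recovery logic.

```python
import time

def load_kv_or_recompute(request, connector, compute_prefill,
                         latency_budget_s: float = 0.5):
    """Try to pull cached KV within a latency budget; otherwise run full prefill."""
    try:
        start = time.monotonic()
        kv = connector.pull(request.prompt_hash, device="cuda")
        if kv is None or time.monotonic() - start > latency_budget_s:
            raise TimeoutError("pull missed the latency budget")
        return kv
    except Exception:
        # Fall back to computing the KV cache from scratch on the GPU.
        return compute_prefill(request.prompt)
```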
6. Broader Impacts and Scenarios of Application
The adoption of pull-based KV transfer strategies directly benefits several application scenarios:
- Long-Context and Multi-Turn Reasoning: Reducing TTFT and efficiently managing cache throughput are essential for chatbots and agents with context windows exceeding 16K tokens.
- Enterprise-Scale Inference: LMCache is extensively integrated—its lessons on chunking, connector APIs, and offloading inform future cache orchestration (Cheng et al., 8 Oct 2025).
- Disaggregated Architectures: FlowKV nearly eliminates cache-transfer bottlenecks, enabling flexible resource assignment and maximizing overall throughput on heterogeneous infrastructures.
- Efficient Cross-Layer and Cross-Query Sharing: Layer-pull and cross-query strategies cut memory by up to 2×–4×, which is vital for high-throughput, large-batch deployments.
- Energy and Hardware Efficiency: Titanus leverages on-the-fly pruning/quantization and selective pulls for extreme energy savings and throughput on custom accelerators (Chen et al., 23 May 2025).
Misconceptions include assuming that all pull-based protocols inevitably increase complexity or degrade accuracy; empirical studies consistently show that, when implemented with adaptive scheduling and fine-grained selection, accuracy remains near-lossless and integration overhead is negligible.
7. Comparison to Push-Based and Legacy Methods
Push-based cache strategies, as exemplified by ArkVale or Quest, rely on broadcasting all KV blocks at fixed token intervals, resulting in linear per-token transfer cost and significant GPU stalls. Pull-based strategies (LouisKV, FreeKV, Cake, etc.) retrieve only at semantic or scheduling boundaries, transfer critical or compressed entries, and overlap transfer with compute, dramatically reducing overall latency and bandwidth use.
Push strategies manage the cache in rigid fixed-size pages; pull strategies cluster or segment cache units according to actual attention patterns, tightly matching model usage and minimizing unnecessary movement. In very short-output scenarios their performance converges, but with longer outputs or sparse attention, pull-based designs consistently demonstrate clear throughput and resource advantages.
Taken together, pull-based KV cache transfer approaches define the modern frontier of efficient LLM inference deployment, enabling scalable, adaptive, and resource-optimal services in both research and industrial settings.