RAGCache: Caching for RAG Systems

Updated 12 December 2025
  • RAGCache is a family of caching techniques that accelerates retrieval-augmented generation by reusing neural intermediate states and document retrievals.
  • It leverages approximate embedding matching, prefix-aware key–value caching, and multi-level strategies to reduce computational redundancy and memory usage.
  • Empirical studies report up to 80% latency reduction and improved throughput, making RAGCache vital for efficient LLM performance under high workloads.

Retrieval-augmented generation (RAG) cache systems (“RAGCache”) constitute a family of algorithmic and systems techniques that accelerate, simplify, or enhance the knowledge integration process in retrieval-augmented LLM architectures. By reusing retrievals, neural intermediate states, or model-prefilled representations across RAG queries exhibiting statistical or semantic locality, RAGCache schemes reduce redundant computation and memory throughput, offering significant gains in time-to-first-token (TTFT), query throughput, and end-to-end system efficiency. Approaches span approximate embedding-keyed document caches, task-compressed key-value (KV) caches, prefix- and chunk-aware caching trees, multi-layered enterprise caches, exact and probabilistic vector or graph caches, disk/shared memory variants, and context order–robust cache fusion schemes.

1. Approximate Query and Embedding-Level Caching

Approximate RAGCaches use vector similarity between incoming and historical queries to reuse retrievals for semantically similar requests, reducing frequent and costly nearest-neighbor search over large vector databases. In Proximity, the incoming query $q \in \mathbb{R}^n$ is matched against a bounded in-memory cache $\mathcal{C} = \{(q_i, V_i)\}$ using a distance function $s$ (e.g., Euclidean or $1 - \cos$ similarity) and a threshold $\tau$. On a cache hit ($s(q, q') \leq \tau$ for some $q' \in \mathcal{C}$), the previously retrieved document indices $V_{q'}$ are reused; otherwise, $q$ is forwarded to the actual vector database and the resulting indices are cached, potentially evicting old entries under FIFO or other policies. Experimental evaluation on MMLU and MedRAG benchmarks shows that a well-chosen $\tau$ yields 59–71% lower retrieval latency with accuracy loss under 1%. The expected time per query is driven by the cache hit rate $h$ as $E[T] \approx (1-h)\,T_{DB}$, so the speedup diminishes as $h$ decreases (Bergman et al., 7 Mar 2025).
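
A minimal sketch of this threshold-based approximate cache follows, using $1-\cos$ distance and FIFO eviction; the `vector_db.search` call, capacity, and default threshold are placeholders for illustration, not parameters from the cited system.

```python
from collections import deque
import numpy as np

class ApproximateRetrievalCache:
    """FIFO-bounded cache that reuses retrievals for nearby query embeddings."""

    def __init__(self, capacity: int = 1024, tau: float = 0.15):
        self.capacity = capacity      # max number of cached (query, doc_ids) pairs
        self.tau = tau                # cosine-distance threshold for a "hit"
        self.entries = deque()        # (embedding, doc_ids) in insertion order

    def _distance(self, a: np.ndarray, b: np.ndarray) -> float:
        # 1 - cosine similarity, matching the s(q, q') formulation above
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def lookup(self, q: np.ndarray):
        """Return cached doc ids if some cached query is within tau, else None."""
        for q_cached, doc_ids in self.entries:
            if self._distance(q, q_cached) <= self.tau:
                return doc_ids
        return None

    def insert(self, q: np.ndarray, doc_ids):
        if len(self.entries) >= self.capacity:
            self.entries.popleft()    # FIFO eviction of the oldest entry
        self.entries.append((q, doc_ids))

def retrieve(q_emb: np.ndarray, cache: ApproximateRetrievalCache, vector_db):
    hit = cache.lookup(q_emb)
    if hit is not None:
        return hit                    # cache hit: skip the vector database entirely
    doc_ids = vector_db.search(q_emb) # cache miss: fall back to ANN search (placeholder API)
    cache.insert(q_emb, doc_ids)
    return doc_ids
```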

ARC generalizes this principle by dynamically crafting each agent's cache based on historical query distributions, document geometry in embedding space, and explicit cache prioritization metrics. ARC employs rank–distance weighted scoring, a geometric “hubness” measure for coverage, and a controlled insertion/eviction protocol to maintain a small, high-utility cache. ARC reportedly achieves up to an 80% reduction in average retrieval latency and up to a 79.8% has-answer rate while caching just 0.015% of the total corpus on SQuAD, MMLU, and AdversarialQA (Lin et al., 4 Nov 2025).

2. Intermediate, Prefix, and Multilevel Caching Schemes

Modern RAG pipelines often suffer from computation and memory cost inflation due to repeated computation of long context prefixes (retrieved document concatenations) in the attention prefill phase. RAGCache systems such as the one in (Jin et al., 18 Apr 2024) directly cache key–value (KV) tensors for sub-prompts (prefixes) as internal model states rather than just document contents. The cache is organized as a prefix-sensitive knowledge tree: each node encodes the KV state for a unique path (ordered document prefix) within the RAG retrieval hierarchy. Multi-level caching over GPU HBM and host RAM allows rapid reuse of frequently accessed prefixes, with a replacement policy (Prefix-aware Greedy-Dual-Size-Frequency, PGDSF) that exploits retrieval frequency, recomputation cost, and size.
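
A schematic sketch of a prefix-keyed cache tree with a Greedy-Dual-Size-Frequency-style priority is shown below; the node fields and the priority formula (frequency × recomputation cost / size) are a generic GDSF variant standing in for the paper's PGDSF policy, and the KV tensors are treated as opaque objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class PrefixNode:
    """One node per ordered document prefix; holds the KV state for that prefix."""
    doc_id: Optional[str] = None
    kv_state: object = None                    # cached key/value tensors (opaque here)
    children: Dict[str, "PrefixNode"] = field(default_factory=dict)
    hits: int = 0                              # retrieval frequency of this prefix
    recompute_cost: float = 0.0                # estimated cost to re-prefill this prefix
    size_bytes: int = 0                        # memory footprint of the KV state

    def priority(self) -> float:
        # GDSF-style score: favor frequent, expensive-to-recompute, small entries
        return self.hits * self.recompute_cost / max(self.size_bytes, 1)

class PrefixKVCache:
    def __init__(self):
        self.root = PrefixNode()

    def lookup(self, doc_sequence):
        """Walk the tree along the ordered document prefix; return the deepest cached KV."""
        node, best = self.root, None
        for doc_id in doc_sequence:
            node = node.children.get(doc_id)
            if node is None:
                break
            node.hits += 1
            if node.kv_state is not None:
                best = node                    # longest cached prefix seen so far
        return best

    def insert(self, doc_sequence, kv_state, cost, size):
        node = self.root
        for doc_id in doc_sequence:
            node = node.children.setdefault(doc_id, PrefixNode(doc_id=doc_id))
        node.kv_state, node.recompute_cost, node.size_bytes = kv_state, cost, size

    def evict_lowest_priority(self):
        """Drop the cached KV state with the lowest priority score."""
        stack, cached = [self.root], []
        while stack:
            n = stack.pop()
            stack.extend(n.children.values())
            if n.kv_state is not None:
                cached.append(n)
        if cached:
            min(cached, key=lambda n: n.priority()).kv_state = None
```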

Overlapping the retrieval stage with speculative LLM prefill for early-stabilized candidate sets further trims end-to-end latency. Empirically, this design yields up to 4× reduced TTFT and 2.1× higher throughput compared to baseline vLLM+Faiss setups. The cache-aware tree structure particularly boosts efficiency under skewed workloads where a small fraction of document prefixes account for most retrievals (Jin et al., 18 Apr 2024).

Chunk-aware approaches such as Cache-Craft store and reuse per-chunk KV-caches, conditionally fixing or recomputing only a select subset of contextualized tokens when chunk order or prefix composition varies between requests. Analytical attention-weight–based metrics (adjusted prefix overlap, cache context impact, cache fix-overhead) determine when partial recomputation suffices. Deep GPU integration and hierarchical storage over HBM, DRAM, and SSD balance overheads, with effective deployment delivering 1.6× higher throughput and 2× lower latency on LLaMA-3-based production workloads while maintaining ≥90% output quality (Agarwal et al., 5 Feb 2025).
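
The per-chunk reuse decision can be sketched as below; the ordered-prefix overlap metric and the 0.8 threshold are illustrative stand-ins for Cache-Craft's attention-weight-based metrics, which the paper defines more precisely.

```python
def decide_chunk_reuse(request_prefix_chunks, cached_entry, overlap_threshold=0.8):
    """Decide whether to reuse a chunk's cached KV as-is, partially fix it, or recompute.

    cached_entry is assumed to carry the chunk's KV tensors plus the ordered chunk ids
    it was originally contextualized against (hypothetical structure).
    """
    cached_prefix = cached_entry["prefix_chunk_ids"]
    current_prefix = list(request_prefix_chunks)

    # Fraction of the cached chunk's original prefix that is still present, in order.
    shared = 0
    for a, b in zip(cached_prefix, current_prefix):
        if a != b:
            break
        shared += 1
    overlap = shared / max(len(cached_prefix), 1)

    if overlap == 1.0:
        return "reuse"        # identical prefix: cached KV is exact
    if overlap >= overlap_threshold:
        return "partial_fix"  # recompute only the most context-sensitive tokens
    return "recompute"        # context changed too much: re-prefill this chunk
```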

3. Task-Aware and Global KV Cache Compression

Task-aware RAGCache methods, as developed by (Corallo et al., 6 Mar 2025), target scenarios where downstream reasoning over broad, distributed evidence is paramount (e.g., multi-hop or “join-like” queries). Here, the entire knowledge base is pre-tokenized and key–value tensors are computed offline, then compressed via task-driven mechanisms such as attention- or gradient-based importance scoring w.r.t. the anticipated task prompt(s). The resulting compressed cache is loaded before inference, eliminating runtime retrievals entirely. Quantitatively, such systems report up to 7 pp accuracy gain over vanilla RAG and up to 30× reduction in cache size, with a 2.7× speedup on LongBench v2 (Corallo et al., 6 Mar 2025).
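
A minimal PyTorch sketch of compressing a precomputed KV cache by attention-based importance with respect to an anticipated task prompt is given below; the scoring rule (mean attention mass received from task-prompt queries) and the keep ratio are assumptions for illustration, not the paper's exact criterion.

```python
import torch

def compress_kv_by_attention(keys, values, task_query_states, keep_ratio=0.05):
    """Keep only the KV positions that receive the most attention from task queries.

    keys, values: (num_tokens, head_dim) tensors for one attention head.
    task_query_states: (num_task_tokens, head_dim) query vectors from the task prompt.
    """
    d = keys.shape[-1]
    # Attention weights of task-prompt queries over all cached knowledge tokens.
    scores = torch.softmax(task_query_states @ keys.T / d**0.5, dim=-1)  # (Tq, Tkv)
    importance = scores.mean(dim=0)                                      # (Tkv,)

    k = max(1, int(keep_ratio * keys.shape[0]))
    keep = torch.topk(importance, k).indices.sort().values  # preserve original order

    return keys[keep], values[keep], keep

# Illustrative usage with random tensors standing in for one head's offline KV cache.
K = torch.randn(10_000, 64)
V = torch.randn(10_000, 64)
task_q = torch.randn(32, 64)
K_small, V_small, kept_idx = compress_kv_by_attention(K, V, task_q, keep_ratio=0.03)
```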

For “closed world” tasks (where all relevant knowledge fits in the context window), pure cache-augmented generation (CAG) bypasses retrieval altogether, preloading the document corpus and model KV states in advance and appending only the user query at decode time. CAG achieves comparable or superior accuracy and up to 40× lower end-to-end latency for SQuAD and HotpotQA in scenarios where the sum of all knowledge base tokens fits the available model context ($\sum_{i=1}^n |d_i|_{\text{tokens}} \leq W$) (Chan et al., 20 Dec 2024).
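
A minimal sketch of this pattern with Hugging Face transformers: prefill the whole (small) corpus once, keep its `past_key_values`, and decode each query against a copy of that cache. The model name, the toy corpus string, and the greedy decoding loop are placeholders, not the paper's setup.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # small stand-in; the paper uses larger instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Offline: prefill the entire knowledge base once and keep its KV states.
corpus = "France is a country in Europe. Its capital is Paris."  # placeholder corpus text
corpus_ids = tok(corpus, return_tensors="pt").input_ids
with torch.no_grad():
    corpus_kv = model(corpus_ids, use_cache=True).past_key_values

# Online: append only the question; no retrieval and no corpus re-prefill.
def answer(question: str, max_new_tokens: int = 16) -> str:
    past = copy.deepcopy(corpus_kv)          # keep the shared corpus cache untouched
    input_ids = tok("\nQuestion: " + question + "\nAnswer:", return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_id)
            input_ids = next_id              # feed only the new token; the KV cache carries the rest
    return tok.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)

print(answer("What is the capital of France?"))
```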

4. Specialized Caching: Disk, Multi-Instance, and Graph-Structured Cache Systems

RAGCache solutions also address system-level bottlenecks in production deployment. For high-throughput and multi-instance scenarios, disk-based persistent KV cache managers such as Shared RAG-DCache centralize cache state across multiple LLM inference instances, leveraging both RAM and NVMe tiers, prefetching KV for documents predicted to be needed (using queue waiting times as signals). Replacement and eviction are managed using LRU and TTL for RAM and disk layers, respectively, with concurrent access guarantees and careful resource configuration (CPU/GPU partitioning, batch sizing). This architecture reliably delivers 15–71% higher throughput and up to 65% shorter TTFT under real server loads, with empirical cache sizes/eviction patterns scaling appropriately with model and workload (Lee et al., 16 Apr 2025).
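
A simplified single-process sketch of the RAM-LRU / disk-TTL tiering described above is shown below; the real system adds shared access across inference instances, prefetching driven by queue wait times, and NVMe-aware resource configuration, none of which appear here, and the capacity, TTL, and directory values are arbitrary.

```python
import os, pickle, time
from collections import OrderedDict

class TwoTierKVCache:
    """RAM tier with LRU eviction, disk tier with TTL expiry, keyed by document id."""

    def __init__(self, ram_capacity=64, disk_dir="/tmp/ragdcache", disk_ttl_s=3600):
        self.ram = OrderedDict()              # doc_id -> KV blob, most recent last
        self.ram_capacity = ram_capacity
        self.disk_dir = disk_dir
        self.disk_ttl_s = disk_ttl_s
        os.makedirs(disk_dir, exist_ok=True)

    def _disk_path(self, doc_id):
        return os.path.join(self.disk_dir, f"{doc_id}.kv")

    def get(self, doc_id):
        if doc_id in self.ram:                        # RAM hit
            self.ram.move_to_end(doc_id)
            return self.ram[doc_id]
        path = self._disk_path(doc_id)
        if os.path.exists(path):                      # disk hit, unless the entry expired
            if time.time() - os.path.getmtime(path) > self.disk_ttl_s:
                os.remove(path)
                return None
            with open(path, "rb") as f:
                kv = pickle.load(f)
            self.put(doc_id, kv)                      # promote to RAM
            return kv
        return None                                   # miss: caller must prefill the document

    def put(self, doc_id, kv_blob):
        self.ram[doc_id] = kv_blob
        self.ram.move_to_end(doc_id)
        if len(self.ram) > self.ram_capacity:
            evicted_id, evicted_kv = self.ram.popitem(last=False)   # LRU eviction
            with open(self._disk_path(evicted_id), "wb") as f:      # demote to the disk tier
                pickle.dump(evicted_kv, f)
```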

In disk-based vector search, CaGR-RAG groups batches of queries by the IVF clusters they are likely to access (based on Jaccard similarity of their nprobe cluster IDs), maximizing cache locality. The cache also opportunistically prefetches the clusters needed by the next query group. This approach halves 99th-percentile tail latency and more than doubles cache hits, with recommended parameters (batch size, cache capacity, group thresholds) and integration into any ANN backend such as Faiss (Jeong et al., 2 May 2025).
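
The grouping step can be sketched as a greedy clustering of queries by Jaccard similarity of their cluster-ID sets; the 0.5 threshold and the greedy assignment are illustrative choices, not parameters taken from the paper.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def group_queries_by_clusters(query_nprobe_sets, threshold=0.5):
    """Greedily group queries whose IVF nprobe cluster-ID sets overlap strongly.

    query_nprobe_sets: list of (query_id, set_of_cluster_ids) pairs, e.g. produced by
    an IVF index's coarse quantizer before the posting lists are actually scanned.
    Queries in one group share most of their clusters, so serving a group together
    keeps those clusters resident in the cache.
    """
    groups = []          # each group: {"queries": [...], "clusters": set(...)}
    for qid, clusters in query_nprobe_sets:
        best, best_sim = None, 0.0
        for g in groups:
            sim = jaccard(clusters, g["clusters"])
            if sim > best_sim:
                best, best_sim = g, sim
        if best is not None and best_sim >= threshold:
            best["queries"].append(qid)
            best["clusters"] |= clusters
        else:
            groups.append({"queries": [qid], "clusters": set(clusters)})
    return groups

# The clusters needed by group i+1 can then be prefetched while group i is being served.
```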

For graph-based RAG, SubGCache clusters queries by subgraph embedding and precomputes KV-caches for “representative” union subgraphs. These representatives are shared by all similar queries in the cluster, reducing per-query prefill cost from $O(pLH^2)$ to amortized $O(qLH^2)$, where $p$ ($\gg q$) is the typical prefix length. SubGCache achieves up to 6.68× TTFT reduction and even improved accuracy in multi-hop graph QA tasks, with scalable clustering and minimal per-batch overhead (2505.10951).
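
A sketch of the clustering-and-representative step using k-means over subgraph embeddings follows; the embedding inputs, the choice of k, the `prefill_fn` callable, and the union-based representative construction are simplified assumptions about the scheme described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_subgraph_kv_cache(queries, subgraph_embeddings, subgraph_node_sets,
                            prefill_fn, k=8):
    """Cluster queries by their retrieved-subgraph embeddings and prefill one
    representative (union) subgraph per cluster.

    subgraph_embeddings: (num_queries, dim) array, one embedding per query's subgraph.
    subgraph_node_sets:  list of node-id sets, one per query.
    prefill_fn:          callable that serializes a node set and returns its KV cache.
    """
    k = min(k, len(queries))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        np.asarray(subgraph_embeddings))

    rep_kv = {}
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        union_nodes = set().union(*(subgraph_node_sets[i] for i in members)) if members else set()
        rep_kv[c] = prefill_fn(union_nodes)   # prefilled once, reused by the whole cluster

    # Each query reuses its cluster's representative KV cache instead of prefilling its own.
    return {queries[i]: rep_kv[labels[i]] for i in range(len(queries))}
```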

5. Robustness, Accuracy Preservation and Adaptive/Hybrid Designs

Contextual, accuracy-preserving cache fusion and reuse are critical for correctness in real-world deployments. KV-Fusion (Oh et al., 13 Jan 2025) fuses independently prefetched per-passage KV-caches using uniform local positional embeddings, feeding the decoder a position-invariant, order-agnostic context. This design eliminates “Lost in the Middle” bias: answer accuracy remains invariant as the gold passage's position is permuted, outperforming naive concatenation and robustly handling top-K settings (e.g., NQ shuffled accuracy drops from 42% to 20% for Llama3, but remains ≈42% with KV-Fusion).
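
At the tensor level, the fusion step amounts to concatenating per-passage key/value states that were each prefilled with positions starting from zero, as sketched below with random tensors; the handling of position ids, attention masks, and the trained components of KV-Fusion is omitted, so this only illustrates why the fused cache is insensitive to passage order.

```python
import torch

def fuse_passage_kv(per_passage_kv):
    """Concatenate independently prefilled per-passage KV states along the token axis.

    per_passage_kv: list over passages; each item is a list over layers of (K, V)
    tensors shaped (batch, heads, tokens_i, head_dim). Because every passage was
    prefilled with local positions 0..tokens_i-1, the fused cache does not depend
    on the order in which passages are concatenated.
    """
    num_layers = len(per_passage_kv[0])
    fused = []
    for layer in range(num_layers):
        ks = torch.cat([p[layer][0] for p in per_passage_kv], dim=2)
        vs = torch.cat([p[layer][1] for p in per_passage_kv], dim=2)
        fused.append((ks, vs))
    return fused

# Toy example: 3 passages, 2 layers, 4 heads, head_dim 64, varying passage lengths.
passages = [
    [(torch.randn(1, 4, t, 64), torch.randn(1, 4, t, 64)) for _ in range(2)]
    for t in (120, 95, 210)
]
fused_kv = fuse_passage_kv(passages)
print(fused_kv[0][0].shape)   # torch.Size([1, 4, 425, 64])
```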

RAGBoost (Jiang et al., 5 Nov 2025) and related works formalize cache reuse via context-index trees, efficient context reordering, de-duplication (across session and turn boundaries), and lightweight context hints to preserve task-specific reasoning fidelity. When integrated into common LLM engines, such systems achieve 1.5–3× speedups, 20–45% cache hit rates, and no accuracy degradation or even improvements in multi-turn and agentic settings.
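
A sketch of the reordering and de-duplication idea: chunks already present in earlier turns are placed first, so the longest possible prefix matches the existing KV cache, and exact duplicates are dropped. The short hint string is a placeholder for RAGBoost's context-hint mechanism, not its actual format.

```python
def reorder_and_dedup(session_cached_chunks, retrieved_chunks):
    """Order retrieved chunks so cached ones form a stable prefix, dropping duplicates.

    session_cached_chunks: ordered list of chunk ids already prefilled in this session.
    retrieved_chunks:      chunk ids retrieved for the current turn (may overlap).
    Returns (ordered_chunks, hint), where hint tells the model which chunks are reused.
    """
    retrieved = set(retrieved_chunks)
    reused = [c for c in session_cached_chunks if c in retrieved]  # keep cached order
    seen = set(reused)
    fresh = []
    for c in retrieved_chunks:                                     # new chunks, deduplicated
        if c not in seen:
            fresh.append(c)
            seen.add(c)
    ordered = reused + fresh

    hint = f"[reused passages: {len(reused)}, new passages: {len(fresh)}]"
    return ordered, hint

# Example: two chunks from turn 1 are retrieved again in turn 2 and keep their cached order.
ordered, hint = reorder_and_dedup(["c7", "c2", "c9"], ["c2", "c11", "c7", "c11"])
print(ordered, hint)   # ['c7', 'c2', 'c11'] [reused passages: 2, new passages: 1]
```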

Hybrid approaches blend task- or session-level caches, approximate indices, and multi-modal context features. Adaptive thresholding, advanced eviction (LRU/LFU/Reinforcement Learning), distributed and multi-level caches, and integration with speculative prefill pipelines further improve system resilience under dynamic workloads (Bergman et al., 7 Mar 2025, Lin et al., 4 Nov 2025, Syarubany et al., 18 Jun 2025).

6. Empirical Trade-Offs, Limitations, and Best Practices

The design space involves trade-offs between cache granularity, reuse robustness, memory/storage overhead, and accuracy:

  • Increasing the approximate cache tolerance ($\tau$) raises speed but risks recall loss; the optimal $\tau$ is chosen with respect to a tolerable accuracy drop, as in the sweep sketched after this list (Bergman et al., 7 Mar 2025).
  • Task-aware/global caches provide coverage at the expense of static scope and inefficient adaptation to newly emerging queries or facts (Corallo et al., 6 Mar 2025).
  • Chunk-level or subgraph-level cache fusion must balance excessive cache size (fewer clusters, more irrelevant content) against missed reuse opportunities (more clusters, less overlap) (2505.10951).
  • Disk/shared memory caches must be tuned (RAM:disk ratio, prefetch policy, pruning) for workload locality and resource constraints, with diminishing returns under low locality (Lee et al., 16 Apr 2025).
  • RAGCache effectiveness depends on overlapping retrieval distributions, stable document semantics, and model compatibility with external KV injection.
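
As referenced in the first bullet, choosing $\tau$ typically reduces to a sweep over candidate thresholds on a held-out query set, keeping the largest $\tau$ whose accuracy drop stays within a budget. The `evaluate_pipeline` callable below is a placeholder for the caller's end-to-end accuracy/latency measurement.

```python
def tune_tau(candidate_taus, evaluate_pipeline, max_accuracy_drop=0.01):
    """Pick the largest cache tolerance whose accuracy cost stays within budget.

    evaluate_pipeline(tau) -> (accuracy, mean_latency) measured on held-out queries;
    tau=0.0 disables approximate reuse and serves as the accuracy baseline.
    """
    baseline_acc, _ = evaluate_pipeline(0.0)
    best = (0.0, baseline_acc, None)
    for tau in sorted(candidate_taus):
        acc, latency = evaluate_pipeline(tau)
        if baseline_acc - acc <= max_accuracy_drop:
            best = (tau, acc, latency)       # larger tau => higher hit rate, lower latency
        else:
            break                            # accuracy budget exceeded; stop widening
    return best
```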

Best practices include system-specific parameter tuning (cache size, prefetch thresholds), workload profiling, hybrid pipeline construction (e.g., global static caches plus RAG fallback), and integration at the LLM inference system’s cache or prefill interface.


Key References

| Approach | Main Idea | Peak Speedup / Hit Rate |
| --- | --- | --- |
| Proximity (Bergman et al., 7 Mar 2025) | Embedding-similarity cache | 59–71% latency reduction, ∼1% accuracy drop |
| ARC (Lin et al., 4 Nov 2025) | Geometry-/demand-aware agent caches | 80% latency saved, 79.8% has-answer rate |
| RAGCache (Jin et al., 18 Apr 2024) | Prefix-tree intermediate KV caching | 4× lower TTFT, 2.1× throughput |
| Cache-Craft (Agarwal et al., 5 Feb 2025) | Partial chunk-KV reuse/fixup | 1.6× throughput, 2× lower latency |
| Task-aware compression (Corallo et al., 6 Mar 2025) | Task-driven global KV | 30× compression, +7 pp accuracy |
| CAG (Chan et al., 20 Dec 2024) | Full corpus preloaded in context | 5–40× faster, no retrieval latency |
| KV-Fusion (Oh et al., 13 Jan 2025) | Position-invariant passage fusion | Stable accuracy across orderings |
| SubGCache (2505.10951) | Subgraph clustering (graph RAG) | Up to 6.68× lower TTFT |
| Shared RAG-DCache (Lee et al., 16 Apr 2025) | Disk/RAM multi-instance | 65% lower latency, 71% higher throughput |

RAGCache represents an active and rapidly diversifying area in retrieval-augmented generation research, providing foundational mechanisms for scaling knowledge-infusion architectures to high-throughput, low-latency, and resource-constrained environments.
