EpiCache: Episodic KV Cache for LongConvQA
- EpiCache is a training-free episodic key-value cache management framework that segments long conversational history into coherent episodes for LLMs.
- It employs block-wise prefill with immediate eviction, using K-means clustering to form coherent episodes and attention-guided scoring to retain the most relevant context within a fixed budget.
- Experimental results show up to 40% accuracy improvement, 4–6x cache compression, 3.5x memory reduction, and 2.4x latency decrease in long dialogue tasks.
EpiCache is a training-free episodic Key-Value (KV) cache management framework developed for long conversational question answering (LongConvQA) with LLMs under fixed memory budgets. Addressing the challenge of linear cache growth and resource constraints, EpiCache introduces a block-wise prefill and eviction strategy combined with episodic clustering and adaptive layer-wise KV budget allocation, significantly improving cache efficiency and maintaining multi-turn context coherence. EpiCache demonstrates up to 40% accuracy improvement over prior compression baselines while achieving 4–6x KV cache compression, 3.5x memory reduction, and 2.4x latency decrease in long dialogue tasks (Kim et al., 22 Sep 2025).
1. Episodic Clustering of Conversation Context
EpiCache segments the conversational history into semantically coherent episodes by embedding utterance blocks into a shared semantic vector space. The history is divided into blocks of $w_{\text{embed}}$ consecutive utterances; a sentence encoder maps each block to an embedding, and K-means clustering (with k-means++ initialization) partitions these blocks into episodes.
Each episode is characterized by:
- A centroid: the mean vector of embeddings within the episode.
- A medoid: the segment closest to the centroid under cosine similarity; this serves as a representative, patched prompt for compression and query matching.
This episodic partitioning enables the cache to retain topic-relevant context and supports efficient retrieval for subsequent queries.
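Below is a minimal sketch of this clustering step; the sentence-encoder checkpoint, block size `w_embed`, and episode count `k` are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_episodes(utterances, w_embed=4, k=4):
    # 1. Group consecutive utterances into blocks of w_embed.
    blocks = [" ".join(utterances[i:i + w_embed])
              for i in range(0, len(utterances), w_embed)]

    # 2. Embed each block with a sentence encoder (checkpoint choice is illustrative).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode(blocks, normalize_embeddings=True)   # (n_blocks, d)

    # 3. Partition blocks into k episodes with k-means++ initialization.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(emb)

    episodes = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        centroid = emb[idx].mean(axis=0)
        # 4. Medoid: member block closest to the centroid under cosine similarity
        #    (rows of emb are unit-normalized, so a dot product suffices).
        sims = emb[idx] @ centroid / (np.linalg.norm(centroid) + 1e-8)
        medoid_block = blocks[idx[int(sims.argmax())]]
        episodes.append({"centroid": centroid, "medoid": medoid_block, "members": idx})
    return episodes
```

The medoid text is what later serves as the patched prompt during compression and query matching.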
2. Block-wise Prefill and Episodic KV Compression
EpiCache enforces a fixed memory budget by using block-wise prefill and immediate eviction. Rather than accumulating the KV cache over the entire dialogue, which would result in unbounded peak memory, EpiCache processes the input in fixed-size blocks of tokens:
- After each block, tokens are evaluated for retention based on attention-guided scoring (Equation 3 in the paper).
- The patched prompt (from the episode medoid) is appended to enhance semantic coherence.
- Only the most relevant tokens (per attention scoring) are retained, while remaining entries are evicted immediately.
This approach bounds peak memory to the fixed budget plus a single block, and keeps the episodic KV cache compact and focused on conversation topics.
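The following sketch illustrates the block-wise prefill loop under simplifying assumptions: `score_fn` stands in for the paper's attention-guided scoring (Equation 3) with the medoid-patched prompt, and tokens are treated as a flat sequence rather than per-layer, per-head KV entries.

```python
import numpy as np

def blockwise_prefill(token_blocks, score_fn, budget):
    """Sketch of block-wise prefill with immediate eviction.

    token_blocks: list of arrays of token ids (one array per block)
    score_fn:     assumed callback returning one relevance score per candidate token,
                  standing in for attention-guided scoring with the patched prompt
    budget:       maximum number of KV entries retained at any time
    """
    kept_tokens = np.empty(0, dtype=np.int64)
    kept_scores = np.empty(0, dtype=np.float32)

    for block in token_blocks:
        # Prefill the current block together with the already-retained entries.
        candidates = np.concatenate([kept_tokens, block])
        scores = score_fn(candidates)
        # Immediately evict everything outside the top-`budget` entries,
        # so peak memory never exceeds the budget plus one block.
        keep = np.argsort(scores)[::-1][:budget]
        keep.sort()                                  # preserve original token order
        kept_tokens, kept_scores = candidates[keep], scores[keep]

    return kept_tokens, kept_scores
```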
3. Query-to-Episodic Matching and Attention Scoring
Upon receiving a new query, EpiCache embeds the query in the same vector space as existing conversational episodes. The system selects the episodic KV cache whose centroid yields the highest cosine similarity to the query embedding, maximizing topical relevance.
The selected episodic cache is supplied to the LLM for decoding. This ensures that the response leverages the most pertinent historical context without the need for full-cache recomputation.
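A minimal sketch of query-to-episode matching, reusing the `episodes` structure from the clustering sketch above (the `encoder` argument is an assumed sentence-encoder handle):

```python
import numpy as np

def select_episode(query, episodes, encoder):
    """Pick the episodic KV cache whose centroid is most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    sims = [float(q @ ep["centroid"] / (np.linalg.norm(ep["centroid"]) + 1e-8))
            for ep in episodes]
    best = int(np.argmax(sims))
    return best, sims[best]   # index of the episode whose cache is loaded for decoding
```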
The token retention process leverages attention scores derived from the medoid-patched prompt:

$$s(x_i) = \sum_{q \in P} \mathrm{Attn}(q, x_i),$$

where $x_i$ is a token from the block and $P$ is the patched prompt; tokens with the highest scores are favored for retention.
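One plausible instantiation of this scoring, assuming access to the attention weights of the prefill pass and averaging over heads and patched-prompt positions (the paper's exact aggregation may differ):

```python
import numpy as np

def patched_prompt_scores(attn, n_prompt):
    """Token scores from attention paid by the medoid-patched prompt.

    attn:      attention weights of shape (heads, query_len, key_len),
               where the last `n_prompt` query positions are the patched prompt
    n_prompt:  number of patched-prompt tokens appended to the block
    Returns one score per key-position token: the attention it receives from
    the patched prompt, averaged over heads and prompt positions.
    """
    prompt_rows = attn[:, -n_prompt:, :]        # (heads, n_prompt, key_len)
    return prompt_rows.mean(axis=(0, 1))        # (key_len,)
```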
4. Adaptive Layer-wise KV Budget Allocation
Transformer layers exhibit heterogeneous sensitivity to KV cache eviction. EpiCache introduces an adaptive allocation strategy:
- For each layer $\ell$, compute sensitivity as
$$\delta_\ell = \frac{1}{H\,T} \sum_{h=1}^{H} \sum_{t=1}^{T} \left\lVert K^{\text{full}}_{\ell,h,t} - K^{\text{block}}_{\ell,h,t} \right\rVert_2,$$
where $K^{\text{full}}_\ell$ and $K^{\text{block}}_\ell$ are key states under full causal and block prefill masks, respectively, $H$ is the number of heads, and $T$ is the number of tokens.
- The global memory budget $B$ is distributed as
$$B_\ell = B \cdot \frac{\exp(\delta_\ell / \tau)}{\sum_{\ell'} \exp(\delta_{\ell'} / \tau)},$$
with hyperparameter $\tau$ adjusting allocation sharpness. Layers with higher sensitivity receive proportionally more KV cache entries, mitigating the impact of compression on semantically important representations (see the sketch below).
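A minimal sketch of this allocation under the temperature-softmax form written above (the softmax form and the integer rounding are assumptions of this sketch):

```python
import numpy as np

def allocate_budgets(k_full, k_block, total_budget, tau=1.0):
    """Layer-wise budget allocation from key-state deviation (sketch).

    k_full, k_block: key states under full causal vs. block prefill masks,
                     each of shape (layers, heads, tokens, head_dim)
    tau:             sharpness hyperparameter (softmax temperature, assumed form)
    """
    # Sensitivity: per-layer key-state deviation, averaged over heads and tokens.
    delta = np.linalg.norm(k_full - k_block, axis=-1).mean(axis=(1, 2))   # (layers,)
    # Distribute the global budget via a temperature-scaled softmax over layers.
    weights = np.exp(delta / tau)
    weights /= weights.sum()
    # Rounding may leave a small remainder of the budget unassigned in this sketch.
    return np.round(weights * total_budget).astype(int)
```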
5. Performance Metrics and Experimental Results
Across three LongConvQA benchmarks (Realtalk, LoCoMo, LongMemEval), EpiCache achieves:
- Up to 40% higher conversational accuracy compared to recent KV compression baselines.
- Near-full accuracy under 4–6x cache compression.
- Peak GPU memory reductions of up to 3.5x.
- Latency improvements up to 2.4x, attributable to smaller cache size during decoding.
These results illustrate that EpiCache preserves multi-turn coherence and personalization in dialogue agents even as cache storage and computation are constrained.
6. Use Cases, Limitations, and Future Directions
EpiCache is suited for:
- Long conversational agents maintaining histories across hundreds of turns.
- Resource-constrained deployments on edge devices or GPUs with strict memory limits.
- Real-world applications such as customer support and multi-session personal assistants reliant on sustained context.
The current version:
- Fixes the number of episodes per conversation in advance, suggesting future research on adaptive determination of the episode count.
- Relies on embedding-based clustering (K-means); more advanced or domain-specific identification of conversation topics is a plausible direction.
- Focuses on eviction-based compression; integrating quantization techniques may yield further memory savings.
- Further refinement of the sensitivity metric and the allocation-sharpness hyperparameter is anticipated to better fit heterogeneous models and tasks.
EpiCache’s combination of conversational clustering, episodic KV compression, and adaptive per-layer budgeting addresses the memory and latency bottlenecks inherent in long-context LLM dialogue systems, enabling efficient multi-turn question answering under practical hardware constraints (Kim et al., 22 Sep 2025).