K-LRU Eviction Policy Fundamentals
- K-LRU is a cache policy that restricts eviction to the K least recently used items, ensuring rapid and bounded candidate selection.
- Implementations like cachebpf demonstrate that K-LRU can enhance throughput by up to 70% and lower tail latency with minimal system overhead.
- Adaptive extensions such as Cold-RL and LazyEviction integrate multi-feature learning and recurrence tracking to overcome static recency limitations.
K-LRU Eviction Policy is a class of cache management algorithms that extends the classic Least Recently Used (LRU) framework, offering enhanced flexibility and performance across storage systems, operating systems, and generative model inference. The core mechanism involves selecting victims for eviction from the K oldest or "coldest" entries in the cache, rather than evaluating the entire cache state. This K-tail sampling approach enables fast, bounded-complexity decision making and underpins a wide range of adaptive, multi-level, and workload-informed policies.
1. Definition and General Principles
K-LRU refers to policies that restrict the candidate source for eviction to the K least-recently-used objects in the cache—often the tail of a doubly-linked LRU list or its abstract equivalent. At each eviction trigger, the policy selects a subset of candidates (typically K=8–32 objects) and executes a decision routine to determine which object(s) to evict.
The standard K-LRU algorithm is deterministic and recency-centric: always evict the oldest among these K candidates. However, K-LRU serves as a foundational template for more advanced mechanisms in both classical and contemporary cache architectures.
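As a concrete reference point, this deterministic baseline can be sketched in a few lines. The Python fragment below is a minimal illustration assuming an `OrderedDict`-backed recency order; the class name `KLRUCache` and the default K=16 are illustrative choices, not taken from any cited implementation.

```python
from collections import OrderedDict

class KLRUCache:
    """Minimal K-LRU sketch: evict from among the K least-recently-used entries."""

    def __init__(self, capacity, k=16):
        self.capacity = capacity
        self.k = k
        self.entries = OrderedDict()  # oldest (LRU) first, newest (MRU) last

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)      # touch: promote to MRU position
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            self._evict()
        self.entries[key] = value

    def _evict(self):
        # Candidate pool: the K items at the LRU tail of the list.
        candidates = list(self.entries.keys())[: self.k]
        # Classic deterministic K-LRU rule: evict the oldest candidate.
        # Adaptive variants keep this candidate pool and replace only the decision rule.
        victim = candidates[0]
        del self.entries[victim]
```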
2. System Implementations: Operating System and Application Caches
Recent research demonstrates practical, programmable support for K-LRU and its multi-level generalizations within system-scale caches. Notably, the cachebpf framework for Linux page caching (Zussman et al., 4 Feb 2025) provides explicit API primitives for constructing eviction policies with multiple, variable-sized eviction lists, enabling the direct implementation of K-LRU and variants such as segmented LRU and ARC.
cachebpf: K-LRU Expressivity and Deployment
- Multiple eviction lists model K-LRU recency windows, with folios (pages) promoted between lists upon repeated accesses.
- Per-folio metadata tracked via eBPF maps allows the policy to record the last K access timestamps or counts needed for K-LRU logic.
- Eviction requests prioritize the lowest-level list (least recently used objects), with batch scoring and selection routines facilitating efficient candidate identification.
- Kernel integration ensures low overhead (<2% CPU, ~1% memory), rapid policy prototyping (~100–400 eBPF lines), and per-cgroup isolation for workload-specific tuning.
Policies matching K-LRU structure achieve up to 70% higher throughput and 58% lower tail latency when tailored to application access patterns compared to default global policies.
| Feature | cachebpf Support |
|---|---|
| Multiple eviction lists | Yes (API: list_create, move) |
| Metadata/tracking | Yes (eBPF maps: timestamps) |
| cgroup isolation | Yes |
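The multi-list structure summarized above can be approximated in user space for intuition. The following Python sketch is a conceptual analogue of a two-list K-LRU policy with promotion on repeated access; it is not eBPF code and does not use the cachebpf API, and names such as `MultiListCache` are assumptions made purely for illustration.

```python
from collections import OrderedDict

class MultiListCache:
    """Two eviction lists: 'probation' (cold) and 'protected' (hot).

    Folios enter probation; a repeat access promotes them to protected,
    mirroring the multi-list promotion pattern described above.
    """

    def __init__(self, capacity, k=16):
        self.capacity = capacity
        self.k = k
        self.probation = OrderedDict()   # LRU order, oldest first
        self.protected = OrderedDict()

    def access(self, folio):
        if folio in self.protected:
            self.protected.move_to_end(folio)
        elif folio in self.probation:
            del self.probation[folio]
            self.protected[folio] = True   # promotion on repeated access
        else:
            if len(self.probation) + len(self.protected) >= self.capacity:
                self.evict()
            self.probation[folio] = True

    def evict(self):
        # Eviction always targets the lowest-level (coldest) list;
        # score the K tail candidates and drop the oldest one.
        source = self.probation if self.probation else self.protected
        candidates = list(source.keys())[: self.k]
        del source[candidates[0]]
```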
3. Learned and Adaptive Extensions: Reinforcement Learning and Feature-Enriched K-LRU
K-LRU provides a deterministic baseline. Recent advances introduce adaptivity and workload-awareness by augmenting K-tail selection with statistical or learned policies. Cold-RL (Gupta et al., 17 Aug 2025) integrates K-tail sampling with offline-trained reinforcement learning in NGINX's cache, operationalizing a hybrid policy that combines K-LRU's fast candidate filtering with a dueling deep Q-network (DQN) for decision making.
Cold-RL: K-Tail RL Policy for NGINX
- At eviction, Cold-RL samples the K least-recently-used objects.
- Six features per candidate: age, size, hit count, inter-arrival time, TTL remaining, last origin RTT.
- Features are input to a DQN running as an ONNX sidecar (strict 500μs SLO).
- DQN outputs Q-values, selecting one or more objects to evict based on estimated future hits.
- Immediate LRU fallback ensures bounded latency.
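The decision path listed above can be sketched as follows. This is a schematic of the K-tail sampling plus learned-scoring pattern under a latency budget, assuming hypothetical helpers (`extract_features`, and a `dqn_score` callable standing in for the ONNX model); it is not Cold-RL's actual code.

```python
import time

K = 16
SLO_US = 500  # per-eviction inference budget in microseconds (from the SLO above)

def extract_features(obj):
    # Six features per candidate, as listed above (hypothetical field names).
    return [obj.age, obj.size, obj.hit_count,
            obj.inter_arrival, obj.ttl_remaining, obj.last_origin_rtt]

def choose_victim(lru_tail, dqn_score):
    """lru_tail: objects ordered oldest-first; dqn_score: features -> Q-value."""
    candidates = lru_tail[:K]
    start = time.perf_counter()
    scores = []
    for obj in candidates:
        scores.append(dqn_score(extract_features(obj)))
        if (time.perf_counter() - start) * 1e6 > SLO_US:
            # Budget exceeded: immediate LRU fallback keeps latency bounded.
            return candidates[0]
    # Evict the candidate with the lowest estimated future-hit value.
    return min(zip(scores, candidates), key=lambda pair: pair[0])[1]
```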
Cold-RL consistently outperforms classic policies under adversarial and mixed-size workloads at small and medium cache scales. The empirical hit ratio improvement reaches 146% over ARC (from 0.1436 to 0.3538 at 25 MB), and Cold-RL maintains parity with classic policies under low-pressure scenarios.
| Aspect | K-LRU | Cold-RL (K-tail) |
|---|---|---|
| Candidate pool | K oldest | K oldest |
| Decision rule | Evict oldest | DQN selects (Q-value, features) |
| Feature input | Recency | Recency + five others |
| Training | None | Offline RL |
| Adaptivity | None | High |
| Latency | O(K), very fast | O(K), <500μs |
This demonstrates that learned, feature-enriched K-LRU adaptations can systematically outperform static policies under dynamic and non-stationary access patterns while retaining K-tail sampling's efficiency.
4. K-LRU Variants in Generative Model Key-Value (KV) Cache Management
Within autoregressive LLM inference, K-LRU-like policies—often termed window-based recency eviction—are prevalent due to GPU memory bottlenecks. These methods restrict eviction candidates to tokens beyond a local window (size K), assuming recency implies importance (Ren et al., 9 Feb 2024).
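A minimal sketch of this window-based recency rule is given below, assuming the KV cache is indexed by ascending token positions; the function name and the budget handling are illustrative assumptions rather than a specific system's implementation.

```python
def recency_evict(kv_positions, window_k, budget):
    """Window-based recency eviction: protect the last `window_k` tokens,
    then fill any remaining `budget` with the newest older tokens.

    kv_positions: ascending token positions currently held in the KV cache.
    Returns the positions retained after eviction.
    """
    if len(kv_positions) <= budget:
        return kv_positions
    win = min(window_k, budget)
    protected = kv_positions[-win:]              # local recency window
    older = kv_positions[:-win]                  # eviction candidates
    spare = budget - win
    # Recency heuristic: of the older tokens, keep only the most recent ones.
    kept_older = older[len(older) - spare:] if spare > 0 else []
    return kept_older + protected
```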
Limitations Identified
- Recency-based K-LRU heuristics exhibit strong bias, underestimating importance for tokens with delayed or distributed future relevance.
- Sensitive to window size (K): if the window is too small, tokens that will be needed again are evicted prematurely; if it is too large, eviction frees too little memory.
- Poor adaptation to tasks with long-range dependencies (e.g., summarization, reasoning).
Experimental results in generative summarization and instruction-following tasks highlight the inadequacy of these heuristics, with attention-driven policies (e.g., RoCo) substantially outperforming recency policies in BLEU and ROUGE scores, especially under constrained cache budgets.
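For contrast with the recency rule above, a generic attention-score-based retention policy can be sketched as follows; this illustrates the family of policies referenced here (accumulating attention mass per token and keeping the highest-scoring older tokens), not RoCo's specific algorithm.

```python
import numpy as np

def attention_based_keep(attn_weights, window_k, budget):
    """attn_weights: (num_queries, num_tokens) attention matrix from recent steps.
    Keep the last `window_k` tokens plus the highest-cumulative-attention
    older tokens, up to `budget` positions in total.
    """
    num_tokens = attn_weights.shape[1]
    scores = attn_weights.sum(axis=0)                 # cumulative attention per token
    protected = list(range(max(0, num_tokens - window_k), num_tokens))
    older = list(range(0, max(0, num_tokens - window_k)))
    spare = max(0, budget - len(protected))
    # Rank older tokens by accumulated attention instead of recency.
    ranked = sorted(older, key=lambda t: scores[t], reverse=True)[:spare]
    return sorted(ranked) + protected
```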
5. Temporal and Recurrence-Aware Refinements
Recent frameworks recognize the deficiency of pure recency/K-LRU policies in contexts exhibiting "importance recurrence", notably long chain-of-thought reasoning in LLMs (Zhang et al., 19 Jun 2025). LazyEviction incorporates lagged, windowed eviction and per-token recurrence interval tracking, preserving latent recurring tokens that static K-LRU or aggregate attention strategies misclassify.
LazyEviction Mechanism
- Only evicts tokens at periodic intervals (window W), not every generation step.
- Preserves the most recent window of tokens, while evaluating long-term recurrence patterns for the rest.
- Tracks maximum recurrence interval per token; computes importance scores penalizing tokens that have not been attended within their typical recurrence window.
- Demonstrates markedly higher accuracy retention at aggressive KV cache compression rates (e.g., 72.70% vs. 62.24% for H2O at 50% cache) in math reasoning benchmarks.
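The recurrence bookkeeping described above can be sketched compactly. In the Python fragment below, the structures `last_attended` and `max_interval` and the idle-versus-interval penalty are illustrative assumptions that mirror the mechanism as summarized here, not LazyEviction's published equations.

```python
def update_recurrence(step, attended_tokens, last_attended, max_interval):
    """Track each token's maximum recurrence interval (MRI) in place."""
    for t in attended_tokens:
        if t in last_attended:
            gap = step - last_attended[t]
            max_interval[t] = max(max_interval.get(t, 0), gap)
        last_attended[t] = step

def lazy_evict(step, window_w, tokens, last_attended, max_interval, budget):
    """Lagged, windowed eviction: only runs every `window_w` generation steps."""
    if step % window_w != 0 or len(tokens) <= budget:
        return tokens
    recent = tokens[-window_w:]                  # most recent window, always kept

    def penalty(t):
        # Tokens idle for longer than their typical recurrence interval
        # receive a higher penalty and are evicted first.
        idle = step - last_attended.get(t, 0)
        return idle - max_interval.get(t, 0)

    keep = max(0, budget - len(recent))
    older = sorted(tokens[:-window_w], key=penalty)   # low penalty = important
    return older[:keep] + recent
```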
| Feature | LRU/K-LRU | Attention-based | LazyEviction |
|---|---|---|---|
| Recency tracking | Yes | No | Yes |
| Recurrence aware | No | No | Yes (MRI tracking) |
| Windowed eviction | No | No | Yes |
| Accuracy drop | High | Moderate | Low |
A plausible implication is that K-LRU policies may benefit from temporal extensions such as recurrence interval tracking and lagged windowed execution, particularly in workloads with highly non-uniform token importance distributions.
6. Common Applications and Design Trade-Offs
K-LRU and its derivatives are now established in:
- System page caches, via programmable eBPF frameworks for multi-tenancy, workload specialization, and scan-resistance.
- Application-level object caches (NGINX, key-value stores), where fast K-tail sampling enables strict latency SLOs.
- Generative model KV caches, where window-based recency heuristics compete with attention- and recurrence-aware algorithms for optimal memory/quality trade-off.
Effective deployment requires balancing:
- Sampling complexity (O(K) for tail candidate selection vs. O(N) for a full cache scan),
- Adaptivity (static rules vs. workload-trained/learned policies),
- Feature richness (recency only vs. multi-feature decision vectors),
- Robustness to workload, scaling, and adversarial access patterns.
7. Research Directions and Limitations
K-LRU policies—while fast and simple—are limited by their ignorance of semantic importance, long-term recurrence, and distributed access patterns. Recent reinforcement learning, attention-driven, and windowed-recurrence designs retain K-LRU's K-tail structure but generalize it to incorporate predictive retention, workload responsiveness, and reduced hyperparameter sensitivity.
Empirical and conceptual advances suggest future K-LRU policies should:
- Integrate temporal recurrence and predictive scoring algorithms.
- Employ hybrid multi-feature candidate evaluation for unstructured/object caches.
- Offer programmable interfaces and per-workload tuning in OS/page cache environments.
In all settings, the adoption of K-LRU-derived policies must address overhead (metadata, state tracking, decision latency) and expressivity constraints (data structure support, batch processing), as highlighted in systems-scale implementations (Zussman et al., 4 Feb 2025) and learned policies (Gupta et al., 17 Aug 2025). These considerations are central to achieving both efficiency and quality preservation under real-world operating constraints.