Cold-RL (K-tail) for NGINX Cache Eviction
- Cold-RL (K-tail) is an offline reinforcement learning method that enhances traditional LRU by selecting eviction candidates from the K least recently used objects.
- It frames cache eviction as a finite Markov decision process and employs a dueling deep Q-network to make real-time decisions under microsecond deadlines.
- Evaluations on adversarial and log-replay CDN workloads show hit-ratio gains of up to 146% over the best classical baseline under high cache pressure, and a production deployment reports 23% fewer origin fetches.
Cold-RL (K-tail) is an offline reinforcement learning–based eviction policy designed for NGINX HTTP caches. Rather than always evicting the single least-recently-used object, as classical LRU does, it uses a learned policy to choose victims among the coldest (least recently used) objects in the cache. Gupta et al. present Cold-RL as the first such policy integrated into production NGINX under strict microsecond service-level objectives (SLOs), and report substantial improvements in hit ratio and resource efficiency across adversarial and real-world CDN workloads (Gupta et al., 17 Aug 2025).
1. Markov Decision Process Formulation and K-Tail Sampling
Cold-RL frames forced cache eviction as a finite Markov decision process (MDP), where each state consists of a batch of features corresponding to the least-recently-used cache entries (the "K-tail"). The features extracted per object are:
- age (time since last origin fetch)
- size (object size in bytes)
- hit_count (number of hits since insertion)
- inter_arrival_time (interval between the last two requests)
- ttl_remaining (HTTP cache TTL left)
- last_origin_rtt (last measured RTT to the origin)
For each eviction event, the action is a $K$-bit mask indicating which tail objects to evict. The reward for each retained object is 1 if it is subsequently re-referenced before TTL expiry; evicted objects receive 0. State transitions are determined by simulating future requests and TTL expirations, with the next state comprising the subsequent K-tail.
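The state and reward definitions above can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `TailFeatures`, the function `step_rewards`, the units, and the `next_request_in` argument (time until each object's next request, as a simulator might supply it) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TailFeatures:
    """One row of the K-tail state; field names follow the feature list, units are assumed."""
    age: float                 # seconds since last origin fetch
    size: int                  # object size in bytes
    hit_count: int             # hits since insertion
    inter_arrival_time: float  # seconds between the last two requests
    ttl_remaining: float       # seconds of HTTP cache TTL left
    last_origin_rtt: float     # seconds, last measured RTT to the origin

    def as_vector(self) -> List[float]:
        """Flatten to the six-element feature vector for this object."""
        return [self.age, float(self.size), float(self.hit_count),
                self.inter_arrival_time, self.ttl_remaining, self.last_origin_rtt]


def step_rewards(evict_mask: List[int],
                 next_request_in: List[Optional[float]],
                 ttl_remaining: List[float]) -> List[int]:
    """Reward 1 for a retained object re-referenced before its TTL expires, else 0.

    next_request_in[i] is None when object i is never requested again
    (hypothetical simulator output).
    """
    rewards = []
    for evicted, nxt, ttl in zip(evict_mask, next_request_in, ttl_remaining):
        retained_hit = (not evicted) and nxt is not None and nxt < ttl
        rewards.append(1 if retained_hit else 0)
    return rewards
```

For example, with mask `[1, 0, 0, 0]`, only a retained object whose next request arrives inside its remaining TTL earns a reward of 1.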
K-tail sampling traverses the tail of the NGINX LRU list to gather up to $K$ victim candidates. The computational cost is $O(K)$ pointer operations, with $K$ kept small, and is strictly bounded to fit within a 500 μs total eviction deadline (Gupta et al., 17 Aug 2025).
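A toy model of the tail walk may help make the cost concrete. The sketch below uses a hand-rolled doubly-linked list in Python rather than NGINX's actual queue structures; `Node`, `LruList`, and `sample_k_tail` are hypothetical names chosen for illustration.

```python
from typing import List, Optional


class Node:
    def __init__(self, key: str) -> None:
        self.key = key
        self.prev: Optional["Node"] = None  # toward the head (more recently used)
        self.next: Optional["Node"] = None  # toward the tail (less recently used)


class LruList:
    """Minimal LRU list: new or touched entries go to the head, victims come from the tail."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None
        self.tail: Optional[Node] = None

    def push_front(self, key: str) -> Node:
        node = Node(key)
        node.next = self.head
        if self.head is not None:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node
        return node


def sample_k_tail(lru: LruList, k: int) -> List[str]:
    """Collect up to k keys, coldest first, by following prev pointers from the tail."""
    candidates: List[str] = []
    node = lru.tail
    while node is not None and len(candidates) < k:
        candidates.append(node.key)
        node = node.prev
    return candidates
```

The loop follows at most `k` pointers, independent of the total number of cached objects, which is what keeps the sampling step bounded.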
2. Dueling Deep Q-Network Design and Action Selection
The policy network is a compact dueling Deep Q-Network (DQN) trained to estimate the Q-value $Q(s, a_i)$, the expected future sum of rewards, for each candidate object $i$. The input to the network is the feature matrix $X \in \mathbb{R}^{K \times 6}$ (one row of six features per tail object). The architecture consists of two fully-connected layers (128 units and 64 units, both ReLU), followed by a value head (scalar $V(s)$) and an advantage head (vector $A(s, a_1), \dots, A(s, a_K)$ of length $K$). The combinator function is:
$$Q(s, a_i) = V(s) + A(s, a_i) - \frac{1}{K} \sum_{j=1}^{K} A(s, a_j)$$
where $Q(s, a_i)$ is the value for retaining object $i$. During inference, Cold-RL generates a binary mask $m \in \{0, 1\}^K$ by setting $m_i = 1$ for the objects with the lowest $Q(s, a_i)$, selecting them for eviction and minimizing future cache misses.
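As a rough sketch of the head combination and victim selection, assuming the standard mean-subtracted dueling aggregation (the value plus each advantage minus the mean advantage), the two steps could look like this. The function names and the `n_evict` parameter are illustrative assumptions, not part of the described system.

```python
from typing import List


def dueling_q(value: float, advantages: List[float]) -> List[float]:
    """Combine a scalar value and K advantages into K Q-values (mean-subtracted form)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def eviction_mask(q_values: List[float], n_evict: int = 1) -> List[int]:
    """Set mask bit 1 for the n_evict candidates with the lowest retain-value Q."""
    order = sorted(range(len(q_values)), key=lambda i: q_values[i])
    chosen = set(order[:n_evict])
    return [1 if i in chosen else 0 for i in range(len(q_values))]
```

With `value=1.0` and advantages `[0.5, -0.5, 0.0, 2.0]`, the Q-values are `[1.0, 0.0, 0.5, 2.5]`, so a single eviction picks the second candidate.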
The network comprises approximately 10K parameters. Quantized to int8 and exported to ONNX, it fits within the L2 cache for low-latency execution. The action space and combinator are optimized for efficient evictions under microsecond SLOs (Gupta et al., 17 Aug 2025).
3. Offline Training Methodology
Cold-RL policies are trained offline using historical NGINX access logs replayed through a high-fidelity cache simulator that exactly models the admission, expiration, and forced-expire logic. For each eviction step, the log-driven simulator extracts the K-tail state $s_t$, samples an action $a_t$ using $\epsilon$-greedy exploration, transitions to $s_{t+1}$, and collects reward $r_t$. Experiences are stored in a buffer and used to minimize the squared DQN Bellman error:
- Bellman target: $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$
- Loss: $L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right]$
where $\gamma$ is the discount factor, $\theta^-$ denotes the target-network parameters, the batch size is 128, and the target network is updated every 1000 steps.
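The target and loss can be worked through numerically with a small sketch. It assumes the standard DQN form (reward plus discounted maximum target-network Q-value of the next state, squared error against the online Q-value). The `Transition` layout and the function names are hypothetical, and the discount factor is left as a parameter because its value is not stated here.

```python
from typing import List, Tuple

# (q_taken, next_q_target, reward, done): online Q of the action taken,
# target-network Q-values for the next state, observed reward, terminal flag.
Transition = Tuple[float, List[float], float, bool]


def bellman_target(reward: float, next_q_target: List[float],
                   gamma: float, done: bool) -> float:
    """y = r for terminal steps, else r + gamma * max over next-state target Q-values."""
    if done or not next_q_target:
        return reward
    return reward + gamma * max(next_q_target)


def dqn_loss(batch: List[Transition], gamma: float) -> float:
    """Mean squared Bellman error over a minibatch."""
    total = 0.0
    for q_taken, next_q_target, reward, done in batch:
        y = bellman_target(reward, next_q_target, gamma, done)
        total += (y - q_taken) ** 2
    return total / len(batch)
```

In a real training loop the batch would hold 128 transitions drawn from the replay buffer, and the target Q-values would come from a network copy refreshed every 1000 steps.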
The approach enables training strong policies from logs without any online exploration or production risk. Features are normalized during training to improve transferability (Gupta et al., 17 Aug 2025).
4. Integration in NGINX and Inference Runtime Design
The Cold-RL production implementation consists of an NGINX module and an ONNX-runtime sidecar. The module hooks into ngx_http_file_cache_forced_expire, performs K-tail sampling and feature extraction (no heap allocations), serializes features over a Unix Domain Socket, and synchronously requests a decision from the ONNX sidecar, with a strict 500 μs timeout for reply. Upon timeout or error, eviction instantly falls back to native LRU.
The ONNX sidecar—implemented in C++—loads the quantized Q-network, maintains a lock-free ring buffer for feature packets, and runs inference entirely within the L2 cache, avoiding kernel transitions and dynamic memory allocation. Inference latencies are measured as p50=127 μs, p95=342 μs, p99=487 μs, and the overall eviction (including NGINX-side processing) remains within the SLO for 95% of events, with higher percentiles defaulting to LRU (Gupta et al., 17 Aug 2025).
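The timeout-guarded request and LRU fallback can be illustrated with a short sketch. It shows the control flow only: Python cannot realistically meet a microsecond budget. The wire format (little-endian floats out, one 32-bit index back) and the name `decide_victim` are assumptions made for this example, not the module's actual protocol.

```python
import socket
import struct
from typing import List

TIMEOUT_S = 500e-6  # 500 microsecond reply deadline from the text


def decide_victim(sock: socket.socket, features: List[float], k: int) -> int:
    """Return the index of the tail candidate to evict.

    Index 0 is the coldest candidate, so returning 0 on any timeout,
    short read, socket error, or out-of-range reply reproduces plain LRU.
    """
    lru_fallback = 0
    try:
        sock.settimeout(TIMEOUT_S)
        # Hypothetical request format: the flattened feature matrix as float32 values.
        sock.sendall(struct.pack(f"<{len(features)}f", *features))
        reply = sock.recv(4)
        if len(reply) != 4:
            return lru_fallback
        (idx,) = struct.unpack("<i", reply)
        return idx if 0 <= idx < k else lru_fallback
    except OSError:  # includes socket.timeout / TimeoutError
        return lru_fallback
```

The design point is that every failure path converges on the same cheap default, so the learned policy can only change which tail object is evicted, never whether eviction completes.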
Safety features include:
- A circuit breaker that disables the RL path after repeated failures
- Shadow mode for A/B deployments
- Immediate, stateless kill switch and rollout guardrails
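One way such guards might be combined is sketched below. The failure threshold, the class name `CircuitBreaker`, and the choice to count consecutive failures are assumptions for illustration; the conditions under which the real breaker trips or resets are not specified here.

```python
class CircuitBreaker:
    """Disable the RL path after repeated failures; a kill switch overrides everything."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures  # hypothetical threshold
        self.consecutive_failures = 0
        self.open = False         # open means: RL path disabled, plain LRU only
        self.kill_switch = False  # operator override, checked on every eviction

    def allow_rl(self) -> bool:
        """True when the eviction may consult the learned policy."""
        return not (self.open or self.kill_switch)

    def record(self, success: bool) -> None:
        """Track sidecar outcomes; trip the breaker after max_failures in a row."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.open = True
```

Because `allow_rl` reads only two flags, flipping the kill switch takes effect on the next eviction without any state to migrate.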
5. Empirical Evaluation and Performance Analysis
The evaluation covers a synthetic adversarial (“trap”) workload and large-scale log-replay benchmarks (10M requests, NASA and commercial CDN traces) at three cache sizes. The table reports hit ratios per policy:
| Policy | 25 MB (High Pressure) | 100 MB (Medium) | 400 MB (Low) | Trap |
|---|---|---|---|---|
| LRU | 0.089 | 0.623 | 0.916 | 0.056 |
| LFU | 0.112 | 0.689 | 0.909 | 0.078 |
| Size | 0.073 | 0.512 | 0.823 | 0.089 |
| ARC | 0.144 | 0.753 | 0.919 | 0.134 |
| Hybrid | 0.123 | 0.723 | 0.912 | 0.112 |
| Cold-RL | 0.354 | 0.868 | 0.918 | 0.421 |
Hit ratio improvements over the best classical baseline (ARC in every column) are +146% at 25 MB and +15% at 100 MB, with parity at 400 MB. On the adversarial trap workload, the improvement is +214%. CPU overhead remains under 2% at 50,000 requests/second, and the fallback rate is negligible. Feature ablations indicate that size, inter_arrival_time, and ttl_remaining are critical to performance; for example, removing size degrades hit ratio by 31%. The choice of $K$ trades inference time against hit ratio, with a single setting performing best in most scenarios (Gupta et al., 17 Aug 2025).
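The relative improvements quoted above follow directly from the table. The short check below recomputes them from the tabulated hit ratios; the dictionary layout and the function name `relative_gain` are illustrative choices.

```python
# Hit ratios copied from the table above.
HIT_RATIO = {
    "LRU":     {"25MB": 0.089, "100MB": 0.623, "400MB": 0.916, "Trap": 0.056},
    "LFU":     {"25MB": 0.112, "100MB": 0.689, "400MB": 0.909, "Trap": 0.078},
    "Size":    {"25MB": 0.073, "100MB": 0.512, "400MB": 0.823, "Trap": 0.089},
    "ARC":     {"25MB": 0.144, "100MB": 0.753, "400MB": 0.919, "Trap": 0.134},
    "Hybrid":  {"25MB": 0.123, "100MB": 0.723, "400MB": 0.912, "Trap": 0.112},
    "Cold-RL": {"25MB": 0.354, "100MB": 0.868, "400MB": 0.918, "Trap": 0.421},
}
BASELINES = ["LRU", "LFU", "Size", "ARC", "Hybrid"]


def relative_gain(column: str) -> float:
    """Percent change of Cold-RL over the best baseline hit ratio in one column."""
    best = max(HIT_RATIO[p][column] for p in BASELINES)
    return (HIT_RATIO["Cold-RL"][column] / best - 1.0) * 100.0
```

Evaluating `relative_gain` gives roughly 145.8 for 25 MB, 15.3 for 100 MB, -0.1 for 400 MB, and 214.2 for the trap workload, matching the rounded figures in the text.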
Production deployment across 100M requests/day resulted in a 23% reduction in origin fetches, a validated cost saving of approximately 2.2–3.3 million USD/year per 50 edge nodes. No service interruptions or latency increases were observed relative to standard LRU eviction.
6. Significance, Limitations, and Directions
The K-tail method demonstrates that, by restricting the candidate set to the LRU tail and employing a lightweight per-object feature set, a compact dueling DQN trained offline can meet strict real-time system requirements. The policy degrades gracefully to LRU on fallback, introduces negligible operational risk via SLO-guarded integration, and enables safe, incremental production rollout.
A plausible implication is that this approach generalizes to other forced-expire workloads with compact state, low-latency inference constraints, and abundant historical logs. However, the benefit of K-tail diminishes as cache size increases: once the working set fits comfortably, all policies converge to similar hit rates. The method is specifically tailored to NGINX and its LRU-list data structure; adaptation to fundamentally different cache architectures would require additional investigation (Gupta et al., 17 Aug 2025).