Cold-RL (K-tail) for NGINX Cache Eviction

Updated 4 March 2026

Cold-RL (K-tail) is an offline reinforcement learning method that enhances traditional LRU by selecting eviction candidates from the K least recently used objects.
It frames cache eviction as a finite Markov decision process and employs a dueling deep Q-network to make real-time decisions under microsecond deadlines.
Evaluations on adversarial and real-world CDN workloads show hit-ratio gains of up to 146% over the best classical baseline under high cache pressure, and a production deployment reports 23% fewer origin fetches.

Cold-RL (K-tail) is an offline reinforcement learning–based eviction policy designed for NGINX HTTP caches. It replaces classical least-recently-used (LRU) eviction with a deep RL policy that chooses victims from the $K$ coldest (least recently used) objects in the cache. Cold-RL is presented as the first such policy integrated into production NGINX under strict microsecond service-level objectives (SLOs), with substantial improvements in hit ratio and resource efficiency reported across adversarial and real-world CDN workloads (Gupta et al., 17 Aug 2025).

1. Markov Decision Process Formulation and K-Tail Sampling

Cold-RL frames forced cache eviction as a finite Markov decision process (MDP), where each state $s$ consists of a batch of features corresponding to the $K$ least-recently-used cache entries (the "K-tail"). The features extracted per object are:

age (time since last origin fetch)
size (object size in bytes)
hit_count (number of hits since insertion)
inter_arrival_time (interval between the last two requests)
ttl_remaining (HTTP cache TTL left)
last_origin_rtt (last measured RTT to the origin)
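As an illustration of how the six per-object features could be packed into a K-tail state, here is a minimal Python sketch. The names `TailFeatures` and `build_state` are hypothetical, and the units (seconds, bytes) are assumptions; the source only lists the feature names.

```python
from dataclasses import dataclass, astuple
from typing import List, Sequence


@dataclass(frozen=True)
class TailFeatures:
    """One row of the K-tail state; units are assumed, not from the source."""
    age: float                 # time since last origin fetch
    size: float                # object size in bytes
    hit_count: float           # hits since insertion
    inter_arrival_time: float  # interval between the last two requests
    ttl_remaining: float       # HTTP cache TTL left
    last_origin_rtt: float     # last measured RTT to the origin


def build_state(tail: Sequence[TailFeatures]) -> List[List[float]]:
    """Stack the per-object features into a K x 6 matrix (list of rows)."""
    return [list(astuple(f)) for f in tail]
```

Each row keeps the feature order of the list above, so a state with K candidates has K rows and six columns.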

For each eviction event, the action $a$ is a $K$-bit mask $M \in \{0,1\}^K$ indicating which tail objects to evict. The reward $r(s, a)$ is assigned per object: a retained object earns 1 if it is re-referenced before its TTL expires, and an evicted object earns 0. State transitions are determined by simulating future requests and TTL expirations, with the next state comprising the subsequent K-tail.
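The reward rule can be sketched in a few lines of Python. This assumes the convention that a mask bit of 1 means "evict"; the function name `per_object_rewards` and the `rereferenced` flags (which a simulator would derive from the replayed log) are hypothetical.

```python
from typing import List, Sequence


def per_object_rewards(mask: Sequence[int], rereferenced: Sequence[bool]) -> List[int]:
    """Per-object reward for one eviction event.

    mask[i] == 1 means tail object i is evicted (assumed convention).
    rereferenced[i] is True if object i is requested again before TTL expiry.
    """
    if len(mask) != len(rereferenced):
        raise ValueError("mask and rereferenced must have the same length")
    # Retained and later re-referenced -> 1; every other case -> 0.
    return [1 if (m == 0 and hit) else 0 for m, hit in zip(mask, rereferenced)]
```

For example, with mask `[1, 0, 0, 1]` and re-reference flags `[True, True, False, False]`, only the second object earns a reward: it was kept and requested again.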

K-tail sampling traverses the tail of the NGINX LRU list to gather up to $K$ victim candidates. The cost is $O(K)$ pointer operations, and $K$ is kept small enough that sampling fits within the 500 μs total eviction deadline (Gupta et al., 17 Aug 2025).
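A minimal sketch of K-tail sampling, using Python's `collections.OrderedDict` as a stand-in for the NGINX LRU list (an assumption made for illustration; the real module walks NGINX's own queue). The helper names `touch` and `sample_k_tail` are hypothetical.

```python
from collections import OrderedDict
from itertools import islice
from typing import Hashable, List


def touch(lru: OrderedDict, key: Hashable) -> None:
    """Mark key as most recently used by moving it to the end."""
    lru.move_to_end(key)


def sample_k_tail(lru: OrderedDict, k: int) -> List[Hashable]:
    """Return up to k keys, coldest first.

    Assumed convention: the first entry of the OrderedDict is the least
    recently used one. Only the first k entries are visited, so the cost
    is O(k) regardless of cache size.
    """
    if k <= 0:
        return []
    return list(islice(lru, k))
```

Because iteration stops after `k` entries, the cost does not grow with the number of cached objects, which is the property the deadline relies on.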

2. Dueling Deep Q-Network Design and Action Selection

The policy network is a compact dueling Deep Q-Network (DQN) trained to estimate the Q-value $Q(s, a_i)$, the expected future sum of rewards, for each candidate object. The input to the network is the $K \times 6$ feature matrix (one row of six features per tail object). The architecture consists of two fully connected layers (128 units and 64 units, both ReLU), followed by value $V(s)$ (scalar) and advantage $A(s, a)$ (vector of $K$) heads. The combinator function is:

$Q(s, a_i) = V(s) + A(s, a_i) - \frac{1}{K} \sum_{j=1}^{K} A(s, a_j)$

where $Q(s, a_i)$ is the value for retaining object $i$. During inference, Cold-RL generates a binary mask $M$ by setting $M_i = 1$ for the lowest $Q(s, a_i)$ to select the object(s) for eviction, minimizing future cache misses.
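The combination and selection steps can be sketched as follows. The sketch assumes the standard dueling form, Q_i = V + A_i - mean(A), and that a mask bit of 1 marks an object for eviction; the function names and the `n_evict` parameter are hypothetical.

```python
from typing import List, Sequence


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine the scalar value head and the K advantages into K Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def eviction_mask(q_values: Sequence[float], n_evict: int = 1) -> List[int]:
    """Set mask bit 1 for the n_evict objects with the lowest retain-value."""
    order = sorted(range(len(q_values)), key=lambda i: q_values[i])
    chosen = set(order[:n_evict])
    return [1 if i in chosen else 0 for i in range(len(q_values))]
```

Subtracting the mean advantage keeps the value and advantage heads identifiable: adding a constant to every advantage leaves the Q-values unchanged.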

The network comprises approximately 10K parameters. Quantized to int8 and exported to ONNX, it fits within the L2 cache for low-latency execution. The action space and combinator are optimized for efficient evictions under microsecond SLOs (Gupta et al., 17 Aug 2025).

3. Offline Training Methodology

Cold-RL policies are trained offline using historical NGINX access logs replayed through a high-fidelity cache simulator that exactly models the admission, expiration, and forced-expire logic. For each eviction step, the log-driven simulator extracts the K-tail state $s_t$, samples an action $a_t$ using $\epsilon$-greedy exploration, transitions to $s_{t+1}$, and collects reward $r_t$. Experiences $(s_t, a_t, r_t, s_{t+1})$ are stored in a buffer and used to minimize the squared DQN Bellman error:

Bellman target: $y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a')$
Loss: $L(\theta) = \mathbb{E}\left[ \left( y_t - Q_\theta(s_t, a_t) \right)^2 \right]$

where $\gamma$ is the discount factor, batch size is 128, and the target network $Q_{\theta^-}$ is updated every 1000 steps.
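To make the training objective concrete, the following sketch computes a standard DQN Bellman target and the mean squared error over a mini-batch. It assumes the usual form y = r + gamma * max Q_target(s', a'); the function names are hypothetical, and the discount used in any call is the caller's choice, not a value taken from the source.

```python
from typing import Sequence


def bellman_target(reward: float, next_q_target: Sequence[float],
                   gamma: float, terminal: bool = False) -> float:
    """y = r + gamma * max_a' Q_target(s', a'); just r at episode end."""
    if terminal or not next_q_target:
        return reward
    return reward + gamma * max(next_q_target)


def squared_bellman_loss(q_pred: Sequence[float], targets: Sequence[float]) -> float:
    """Mean of (y - Q(s, a))^2 over a mini-batch."""
    if len(q_pred) != len(targets) or not q_pred:
        raise ValueError("q_pred and targets must be non-empty and equal length")
    return sum((y - q) ** 2 for q, y in zip(q_pred, targets)) / len(q_pred)
```

In a full training loop, `next_q_target` would come from the periodically synchronized target network and `q_pred` from the online network, with gradients flowing only through `q_pred`.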

The approach enables training strong policies from logs without any online exploration or production risk. Features are normalized to a fixed range during training to improve transferability (Gupta et al., 17 Aug 2025).

4. Integration in NGINX and Inference Runtime Design

The Cold-RL production implementation consists of an NGINX module and an ONNX-runtime sidecar. The module hooks into ngx_http_file_cache_forced_expire, performs K-tail sampling and feature extraction (no heap allocations), serializes features over a Unix Domain Socket, and synchronously requests a decision from the ONNX sidecar, with a strict 500 μs timeout for reply. Upon timeout or error, eviction instantly falls back to native LRU.
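The request, timeout, and fallback flow can be illustrated with a short Python sketch. It shows only the control flow: the wire format (a 16-bit count followed by float32 features, answered by one byte per candidate), the function names, and the rule that the fallback evicts only the coldest candidate at index 0 are all assumptions, and Python itself would not meet a microsecond budget in practice.

```python
import socket
import struct
from typing import List, Sequence

TIMEOUT_S = 500e-6  # the 500 microsecond reply deadline from the text


def lru_fallback_mask(k: int) -> List[int]:
    """Assumed LRU behaviour: evict only the coldest candidate (index 0)."""
    return [1] + [0] * (k - 1) if k > 0 else []


def request_mask(sock: socket.socket, features: Sequence[Sequence[float]]) -> List[int]:
    """Send features to the sidecar; fall back to LRU on timeout or error."""
    k = len(features)
    if k == 0:
        return []
    flat = [x for row in features for x in row]
    # Hypothetical wire format: little-endian uint16 count, then float32 values.
    payload = struct.pack(f"<H{len(flat)}f", k, *flat)
    try:
        sock.settimeout(TIMEOUT_S)
        sock.sendall(payload)
        reply = sock.recv(k)  # expected: one byte (0 or 1) per candidate
        if len(reply) != k or any(b not in (0, 1) for b in reply):
            raise ValueError("malformed reply")
        return list(reply)
    except (OSError, ValueError):  # socket timeouts are OSError subclasses
        return lru_fallback_mask(k)
```

A production implementation would also need to discard replies that arrive after the deadline, so that a late answer is never matched to the next request.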

The ONNX sidecar—implemented in C++—loads the quantized Q-network, maintains a lock-free ring buffer for feature packets, and runs inference entirely within the L2 cache, avoiding kernel transitions and dynamic memory allocation. Inference latencies are measured as p50=127 μs, p95=342 μs, p99=487 μs, and the overall eviction (including NGINX-side processing) remains within the SLO for 95% of events, with higher percentiles defaulting to LRU (Gupta et al., 17 Aug 2025).

Safety features include:

Circuit breaker that disables the policy after repeated failures
Shadow mode for A/B deployments
Immediate, stateless kill switch and rollout guardrails
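A circuit breaker and kill switch of this kind could look like the sketch below. The class name, the threshold of three, and the choice to count consecutive (rather than cumulative) failures are assumptions; the source does not specify how "repeated failures" are counted.

```python
class CircuitBreaker:
    """Disable the learned policy after repeated failures (hypothetical sketch)."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures  # assumed threshold
        self.consecutive_failures = 0
        self.open = False     # open breaker => use LRU only
        self.killed = False   # operator kill switch

    def record(self, ok: bool) -> None:
        """Record the outcome of one inference request."""
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.open = True

    def kill(self) -> None:
        """Turn the policy off immediately."""
        self.killed = True

    def use_model(self) -> bool:
        """True if the next eviction may consult the model."""
        return not (self.open or self.killed)
```

The eviction path would check `use_model()` first and go straight to native LRU whenever it returns `False`.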

5. Empirical Evaluation and Performance Analysis

The evaluation comprises a synthetic adversarial (“trap”) workload and large-scale log-replay benchmarks (10M requests, NASA and commercial CDN traces) at three cache sizes. Hit ratios per policy:

Policy	25 MB (High Pressure)	100 MB (Medium)	400 MB (Low)	Trap
LRU	0.089	0.623	0.916	0.056
LFU	0.112	0.689	0.909	0.078
Size	0.073	0.512	0.823	0.089
ARC	0.144	0.753	0.919	0.134
Hybrid	0.123	0.723	0.912	0.112
Cold-RL	0.354	0.868	0.918	0.421

Hit ratio improvements over the best classical baseline (ARC) are +146% at 25 MB and +15% at 100 MB, with parity at 400 MB. On the adversarial trap workload, the improvement is +214%. CPU overhead remains under 2% at 50,000 requests/second, and the fallback rate is negligible. Feature ablations indicate that size, inter_arrival_time, and ttl_remaining are critical to performance; for example, removing size degrades hit ratio by 31%. The choice of $K$ trades inference time against hit ratio (Gupta et al., 17 Aug 2025).
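The quoted percentages follow directly from the table. The short Python check below recomputes the relative gain of Cold-RL over the best non-Cold-RL policy in each column; the dictionary simply transcribes the table, and the function name `relative_gain` is hypothetical.

```python
HIT_RATIO = {
    "LRU":     {"25 MB": 0.089, "100 MB": 0.623, "400 MB": 0.916, "Trap": 0.056},
    "LFU":     {"25 MB": 0.112, "100 MB": 0.689, "400 MB": 0.909, "Trap": 0.078},
    "Size":    {"25 MB": 0.073, "100 MB": 0.512, "400 MB": 0.823, "Trap": 0.089},
    "ARC":     {"25 MB": 0.144, "100 MB": 0.753, "400 MB": 0.919, "Trap": 0.134},
    "Hybrid":  {"25 MB": 0.123, "100 MB": 0.723, "400 MB": 0.912, "Trap": 0.112},
    "Cold-RL": {"25 MB": 0.354, "100 MB": 0.868, "400 MB": 0.918, "Trap": 0.421},
}


def relative_gain(column: str, policy: str = "Cold-RL") -> float:
    """(policy - best baseline) / best baseline for one table column."""
    best = max(row[column] for name, row in HIT_RATIO.items() if name != policy)
    return (HIT_RATIO[policy][column] - best) / best


for col in ("25 MB", "100 MB", "400 MB", "Trap"):
    print(f"{col}: {100 * relative_gain(col):+.0f}%")
```

ARC is the strongest baseline in every column, so each gain is measured against ARC; at 400 MB the difference is about a tenth of a percent, which is why the result is described as parity.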

Production deployment across 100M requests/day resulted in a 23% reduction in origin fetches, a validated cost saving of approximately 2.2–3.3 million USD/year per 50 edge nodes. No service interruptions or latency increases were observed relative to standard LRU eviction.

6. Significance, Limitations, and Directions

The K-tail method demonstrates that by restricting the candidate set to the LRU tail and using a lightweight per-object feature set, a compact dueling DQN trained offline can meet strict real-time system requirements. The policy is robust to fallback, introduces negligible operational risk via SLO-guarded integration, and enables safe, incremental production rollout.

A plausible implication is that this approach generalizes to other forced-expire workloads with compact state, low-latency inference constraints, and abundant historical logs. However, the benefit of K-tail diminishes as cache size increases: once the working set fits comfortably, all policies converge to similar hit ratios. The method is specifically tailored to NGINX and its LRU-list data structure; adaptation to fundamentally different cache architectures would require additional investigation (Gupta et al., 17 Aug 2025).

References (1)

Cold-RL: Learning Cache Eviction with Offline Reinforcement Learning for NGINX (2025)
