Fine-Grained Cache Management
- Fine-Grained Cache Management is a set of techniques that manage cache at sub-block levels, using selective data movement and fine granularity to optimize performance and resource use.
- It improves performance isolation and load balancing in multi-tenant and dynamic workloads by dynamically tracking cache content and adapting eviction policies through workload awareness.
- It integrates hardware-software co-design across memory systems, OS, and middleware to achieve reduced latency, energy savings, and higher throughput in advanced computing environments.
Fine-grained cache management refers to policies, data structures, and mechanisms that dynamically track, allocate, and evict cached data at sub-block, sub-object, or otherwise non-coarse levels of granularity. This approach is fundamental for modern memory systems, database engines, cloud resource managers, accelerators, and large-model serving, as it enhances resource utilization, improves performance isolation, and adapts to heterogeneous and highly dynamic access patterns beyond what traditional block- or page-level policies achieve. Techniques span hardware, firmware, OS, middleware, and application domains; foundational principles include selective data movement, metadata granularity alignment, utility-based replacement, and workload-aware cache budgeting.
1. Architectural Foundations and Substrate Mechanisms
Fine-grained cache management relies on system substrates that allow monitoring, relocation, and protection of cache contents at granularities substantially below traditional hardware or software blocks.
In-DRAM Cache Substrates: DRAM banks are subdivided into multiple subarrays, each with a local row buffer (LRB), and all subarrays in a bank are connected by a narrow global row buffer (GRB). The FIGARO substrate introduces a RELOC command that utilizes the GRB to copy data at the 64 B (cache block) granularity between subarrays, with distance-independent latency of ~1 ns, unlike previous row-sized movement mechanisms. This enables intra-bank, fine-grained selective relocation of hot data segments (Wang et al., 2020).
Software Cache Organization: Modern software caches implement limited associativity by dividing total capacity into N fixed-size sets with k ways (k typically 8): hash(key) mod N determines set placement, and per-set metadata tracks policy-specific data with O(1) or O(k) operations (Adas et al., 2021). This mapping provides independent fine-grained domains for highly concurrent cache operations.
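The set mapping described above can be sketched in a few lines. This is a minimal illustration, not the implementation from the cited work: the class name `SetAssociativeCache`, the use of LRU inside each set, and the default of 8 ways are assumptions chosen for clarity.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Software cache with N sets of k ways; LRU is applied inside each set.

    Hypothetical sketch: the per-set policy (LRU here) is an assumption.
    """

    def __init__(self, num_sets: int, ways: int = 8):
        self.num_sets = num_sets
        self.ways = ways
        # One small ordered map per set; insertion order encodes recency
        # (least recently used entry first).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, key):
        # hash(key) mod N selects the set; only that set is ever touched.
        return self.sets[hash(key) % self.num_sets]

    def get(self, key, default=None):
        s = self._set_for(key)
        if key not in s:
            return default
        s.move_to_end(key)  # mark as most recently used
        return s[key]

    def put(self, key, value):
        s = self._set_for(key)
        if key in s:
            s.move_to_end(key)
        elif len(s) >= self.ways:
            s.popitem(last=False)  # evict the LRU entry of this set only
        s[key] = value
```

Because every operation touches a single set of at most k entries, a concurrent version could guard each set with its own lock, so operations on different sets would not contend.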
Memory Pooling and Segmentization: Cluster-level LLM systems such as TokenLake break prefixes into memory segments (e.g. 568-token units) and use content-addressable identifiers to support deduplication and dynamic placement. Segments are the atomic unit of movement, access, and load balancing across instances (Wu et al., 24 Aug 2025).
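To illustrate how content-addressable identifiers enable deduplication, the sketch below derives each segment's ID from its tokens and the ID of the preceding segment, so two prompts that share a prefix produce identical IDs for the shared segments. The chained SHA-256 scheme, the tiny segment length, and the names `segment_ids` and `SegmentPool` are assumptions for illustration; the source does not describe TokenLake's actual identifier format.

```python
import hashlib


def segment_ids(tokens, segment_len=4):
    """Return one content-derived ID per full segment of a token prefix.

    Each ID covers the segment's tokens plus the previous segment's ID,
    so an ID identifies the whole prefix up to that segment (assumed scheme).
    A trailing partial segment is ignored.
    """
    ids, parent = [], b""
    for start in range(0, len(tokens) - segment_len + 1, segment_len):
        chunk = tokens[start:start + segment_len]
        payload = parent + b"|" + ",".join(map(str, chunk)).encode()
        parent = hashlib.sha256(payload).digest()
        ids.append(parent.hex()[:16])
    return ids


class SegmentPool:
    """Toy pool that stores each distinct segment once."""

    def __init__(self):
        self.store = {}

    def insert(self, tokens, segment_len=4):
        """Insert a prefix; return how many segments were newly stored."""
        new = 0
        for sid in segment_ids(tokens, segment_len):
            if sid not in self.store:
                self.store[sid] = True  # placeholder for the cached payload
                new += 1
        return new


pool = SegmentPool()
first = pool.insert([1, 2, 3, 4, 5, 6, 7, 8])   # stores 2 segments
second = pool.insert([1, 2, 3, 4, 9, 9, 9, 9])  # shared first segment is reused
```

In this usage, `first` is 2 and `second` is 1, because the first segment of the second prompt is already in the pool.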
2. Granularity-Aligned Data Management and Adaptive Relocation
Moving data at fine granularity cuts both the volume of data transferred and cache pollution, so limited fast memory or storage is used more fully.
Contiguous Chunking in I/O-bound Systems: ContiguousKV aligns semantic cache pruning (e.g. attention-guided selection) with the system I/O block size via ContiguousChunks (e.g. 16 tokens per chunk). This eliminates read amplification entirely when the chunk size matches cache selection granularity, i.e., all loaded data is used with no excess I/O (Zou et al., 20 Jan 2026).
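The read-amplification argument can be checked with simple arithmetic. The helper below is a hypothetical sketch: it assumes storage is read in fixed blocks of `block_tokens` tokens and defines amplification as tokens read divided by tokens actually selected.

```python
def read_amplification(selected_tokens, block_tokens=16):
    """Tokens read from storage divided by tokens actually needed.

    Assumes every block containing at least one selected token is read whole.
    """
    selected = set(selected_tokens)
    if not selected:
        raise ValueError("selection must not be empty")
    blocks_read = {t // block_tokens for t in selected}
    return len(blocks_read) * block_tokens / len(selected)


# Token-level selection: 32 tokens, each in a different 16-token block.
scattered = range(0, 512, 16)
# Chunk-aligned selection: 32 tokens forming two whole 16-token chunks.
aligned = list(range(0, 16)) + list(range(64, 80))

print(read_amplification(scattered))  # 16.0
print(read_amplification(aligned))    # 1.0
```

When the selection unit equals the I/O block, every block read is fully used and the ratio is exactly 1, which is the sense in which alignment removes read amplification.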
In-DRAM Cache Segmentation: FIGCache divides DRAM rows into multiple K×64 B (default K=16, so 1 KiB) segments, caching only those segments with high benefit scores. Eviction and replacement also occur at the segment, rather than row, level. This allows co-locating hot segments, promoting sequential and temporal locality in memory-intensive workloads (Wang et al., 2020).
CacheX Eviction-set Probing: CacheX constructs minimal eviction sets for each cache set/slice within a VM, enabling page coloring and memory placement policies at a level finer than the guest physical page. Colors are dynamically remapped as contention shifts (Tofigh et al., 13 Nov 2025).
3. Utility-Guided Replacement, Budgeting, and Scheduling
Fine-grained policies maintain per-block or per-object utility metadata, allowing evictions and admissions to be driven by predicted benefit, cost, or effective reuse.
RowBenefit and Selective Admission: FIGCache’s RowBenefit replacement policy ranks each cache row by the sum of the benefit counters of its segments, with the goal of maximizing row hits in the cache. Admission is greedy (“insert-any-miss”), and a benefit counter is tracked per segment. Only segments in rows with the lowest total benefit are candidates for eviction, and the policy outperforms segment-wise LRU (Wang et al., 2020).
Segment/Chunk Importance Ranking: ContiguousKV maintains a dynamic score per chunk that combines cumulative attention-based importance with access frequency. Ranking chunks by this score to decide retention across memory tiers outperforms LRU and LFU, especially under memory pressure (Zou et al., 20 Jan 2026).
Head-/Layer-aware Budgeting: MaskKV, targeting diffusion LLMs, computes per-head and per-layer token retention budgets from attention-driven scores (mask-query attention), using a normalized prompt-preference score (Huang et al., 10 Oct 2025).
Admission/Eviction via RL and Learning: RLCache partitions cache management into three agents (admit, TTL, eviction), each driven by a reward signal, and uses delayed experience methods to credit outcomes accurately. This can outperform heuristic policies (LRU, LFU) on hit rate and storage, both at steady state and after a workload shift (Alabed, 2019).
ILP-driven KV Placement: ORBITFLOW formulates placement as an ILP: for requests r and layers ℓ, it decides x_{r,ℓ} ∈ {0,1} (on or off the GPU) to minimize decode latency, subject to memory and SLO constraints. Runtime feedback re-invokes the solver on batch or churn events (Ma et al., 5 Jan 2026).
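A row-level benefit policy of this kind can be sketched in software. The code below is an illustrative model only: the class names, the counter update rule (increment on hit, start at zero), and the choice to evict the lowest-benefit segment inside the chosen row are assumptions, not details taken from the FIGCache design.

```python
from dataclasses import dataclass, field


@dataclass
class CacheRow:
    # Maps a segment tag to its benefit counter.
    segments: dict = field(default_factory=dict)

    def total_benefit(self):
        return sum(self.segments.values())


class RowBenefitCache:
    """Toy segment cache that picks victims from the lowest-benefit row."""

    def __init__(self, num_rows, segments_per_row):
        self.rows = [CacheRow() for _ in range(num_rows)]
        self.segments_per_row = segments_per_row

    def find(self, tag):
        for row in self.rows:
            if tag in row.segments:
                return row
        return None

    def access(self, tag):
        """Return True on a hit; on a miss, insert the segment (insert-any-miss)."""
        row = self.find(tag)
        if row is not None:
            row.segments[tag] += 1  # assumed update rule: +1 per hit
            return True
        # Prefer a row that still has a free segment slot.
        target = next(
            (r for r in self.rows if len(r.segments) < self.segments_per_row),
            None,
        )
        if target is None:
            # Victim row: lowest summed benefit across its segments.
            target = min(self.rows, key=CacheRow.total_benefit)
            # Within that row, drop the least beneficial segment (assumption).
            victim = min(target.segments, key=target.segments.get)
            del target.segments[victim]
        target.segments[tag] = 0
        return False
```

Choosing the victim row by summed benefit keeps rows that are hot as a whole intact, even if one of their segments is individually colder than segments elsewhere.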
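A toy version of such a 0/1 placement problem can be solved by exhaustive search, which makes the structure of the decision variables and constraints concrete. Everything in the cost model below is an assumption for illustration (a fixed base latency per request plus a fetch penalty proportional to the offloaded footprint); it is not ORBITFLOW's objective, and a real system would hand the problem to an ILP solver instead of enumerating assignments.

```python
from itertools import product


def place_layers(sizes, num_layers, gpu_capacity, base_ms, fetch_ms_per_unit, slo_ms):
    """Exhaustively solve a toy 0/1 layer placement problem.

    sizes[r] is the per-layer KV footprint of request r (arbitrary units).
    Returns (total_latency_ms, placement), where placement[r][l] == 1 keeps
    layer l of request r on the GPU, or None if no placement is feasible.
    """
    n = len(sizes)
    best = None
    for bits in product((0, 1), repeat=n * num_layers):
        x = [bits[r * num_layers:(r + 1) * num_layers] for r in range(n)]
        # Memory constraint: on-GPU layers must fit in GPU capacity.
        mem = sum(sizes[r] * sum(x[r]) for r in range(n))
        if mem > gpu_capacity:
            continue
        # Assumed latency model: base cost plus a penalty per offloaded unit.
        lat = [
            base_ms + fetch_ms_per_unit * sizes[r] * (num_layers - sum(x[r]))
            for r in range(n)
        ]
        # SLO constraint: every request must meet its own latency target.
        if any(l > s for l, s in zip(lat, slo_ms)):
            continue
        total = sum(lat)
        if best is None or total < best[0]:
            best = (total, x)
    return best


result = place_layers(
    sizes=[1, 2], num_layers=2, gpu_capacity=4,
    base_ms=10, fetch_ms_per_unit=5, slo_ms=[30, 30],
)
```

Here the full working set needs 6 units but only 4 fit, so at least 2 units must be offloaded, and the best total latency under this model is 30 ms. Tightening `slo_ms` to `[10, 10]` makes the problem infeasible and the function returns `None`.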
4. System-Level Implications: Load Balance, Elasticity, and Isolation
Fine-grained management decouples cache placement from compute scheduling, which is critical for modern multi-tenant, multi-accelerator, and large-scale AI deployments.
Segment-Level Global Pools: TokenLake enables a cluster-wide segment-level prefix cache pool by using a declarative API and heavy-hitter-aware load balancing. Heavy segments are replicated to minimize the maximum per-instance load, while normal segments use hash placement. This mechanism, with deduplication, achieves up to 2.6× improvement in P90 goodput and 2.1× higher hit rate than cache-aware routing, and dramatically lowers load imbalance (CV ≈ 11%) (Wu et al., 24 Aug 2025).
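The split between hash placement for normal segments and replication for heavy ones can be sketched as follows. The threshold rule, the replica count formula, and the function names are assumptions made for this example; the source does not specify how TokenLake detects heavy hitters or chooses replica targets.

```python
import hashlib
import math


def stable_hash(segment_id: str) -> int:
    """Hash that is stable across processes (built-in hash() of str is not)."""
    return int.from_bytes(hashlib.sha256(segment_id.encode()).digest()[:8], "big")


def place_segments(segment_load, num_instances, heavy_threshold):
    """Return (placement, per_instance_load) for a map of segment -> load.

    Normal segments get one home instance chosen by hashing the segment ID.
    Heavy segments (load >= heavy_threshold, an assumed rule) are replicated
    onto the least-loaded instances and their load is split evenly.
    """
    placement = {}
    load = [0.0] * num_instances
    heavy = []
    for seg, seg_load in segment_load.items():
        if seg_load >= heavy_threshold:
            heavy.append((seg, seg_load))
            continue
        home = stable_hash(seg) % num_instances
        placement[seg] = [home]
        load[home] += seg_load
    # Place the heaviest segments first so they see the emptiest instances.
    for seg, seg_load in sorted(heavy, key=lambda p: p[1], reverse=True):
        replicas = min(num_instances, math.ceil(seg_load / heavy_threshold))
        targets = sorted(range(num_instances), key=lambda i: load[i])[:replicas]
        for i in targets:
            load[i] += seg_load / replicas
        placement[seg] = targets
    return placement, load


segments = {"hot": 100.0, "a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
placement, load = place_segments(segments, num_instances=4, heavy_threshold=25.0)
```

With these inputs the hot segment is spread over all four instances, so no instance carries more than 29 units, compared with at least 100 units on one instance if the hot segment had a single home.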
Layer-wise Offloading and SLO Attainment: LayerKV maintains layer-granular block tables and offloads non-critical layers to host memory, preserving only x layers in GPU for each request as per SLO-aware scheduling. Dynamic monitoring and per-layer block allocation reduce Time-to-First-Token (TTFT) by up to 69× and cut SLO violations by 28.7% (Xiong et al., 2024).
Per-Tenant Partitioning: In private cloud caches, hybrid models maintain per-tenant dedicated slots (guaranteeing hard SLAs) and a shared pool (allocating by max-min gap via eviction policies). This minimizes overall slot usage without violating per-tenant guarantees (Kim et al., 2019).
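One way to read the hybrid model is as a two-stage allocation: dedicated slots are granted unconditionally, then shared slots go one at a time to whichever tenant has the largest unmet demand. The sketch below follows that reading. The greedy largest-gap rule, the function name `allocate_slots`, and the use of slot demand as input are assumptions; the source does not spell out how the max-min gap is computed.

```python
def allocate_slots(dedicated, demand, shared_slots):
    """Allocate cache slots per tenant.

    dedicated[t]: slots reserved for tenant t regardless of demand.
    demand[t]:    slots tenant t currently wants.
    shared_slots: size of the shared pool.
    Shared slots are handed out greedily to the tenant whose gap
    (demand minus current allocation) is largest; unused slots stay free.
    """
    alloc = dict(dedicated)  # hard guarantees are never reduced
    for _ in range(shared_slots):
        gaps = {t: demand[t] - alloc[t] for t in alloc}
        tenant = max(gaps, key=gaps.get)
        if gaps[tenant] <= 0:
            break  # every tenant is satisfied; leave the rest of the pool idle
        alloc[tenant] += 1
    return alloc


allocation = allocate_slots(
    dedicated={"a": 2, "b": 2, "c": 2},
    demand={"a": 2, "b": 5, "c": 3},
    shared_slots=2,
)
print(allocation)  # {'a': 2, 'b': 4, 'c': 2}
```

Tenant `b` starts with the largest gap (3 slots) and still has the largest gap after the first shared slot, so it receives both; no tenant ever drops below its dedicated allocation.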
Compiler-Guided Phase Partitioning: Com-CAS extracts per-phase static footprints and reuse info via compile-time analysis, leveraging Intel CAT for per-phase fractional allocation and dynamic partition rebalancing at user-defined granularity, achieving both throughput and isolation (Chatterjee et al., 2021).
5. Quantitative Benefits and Cost Models
Fine-grained cache management produces substantial quantitative improvements in utilization, latency, throughput, and energy efficiency.
| System/Policy | Workload/Domain | Key Gains (over baseline) |
|---|---|---|
| FIGCache (Wang et al., 2020) | 8-core, DDR4 | +16.3% perf, –7.8% DRAM energy, ~2–5% from ideal |
| ContiguousKV (Zou et al., 20 Jan 2026) | Qwen2.5 LLM, SSD offload | ~3.85–4.0× TTFT ↓, ~16× token I/O ↓ |
| LayerKV (Xiong et al., 2024) | Llama2/3, 7B-70B | TTFT up to 69× ↓, SLO violations –28.7% |
| TokenLake (Wu et al., 24 Aug 2025) | LLM multi-turn, prefix pool | Goodput ×2.6, hit-rate ×2.1, load CV < 15% |
| MaskKV (Huang et al., 10 Oct 2025) | dLLM, LongBench | 94% quality w/5% prompt-keep, ×31 speed, –65% mem |
| RLCache (Alabed, 2019) | DB, Zipf YCSB | Hit-rate: 0.91 (1GB cache) v. 0.82 LRU, –27% size |
| Sieve (Shakya et al., 18 Nov 2025) | DBMS, FGAC | Query latency: 6–22% ↓, 2–10× policy eval speedup |
| Choreographer (Nguyen et al., 30 Oct 2025) | HW/Sim. fine-grain tasks | Speedup: 1.08–1.88× (prefetch), >2× (quicksort) |
Energy savings derive from fewer costly ACTIVATE/PRECHARGE cycles and higher row-buffer utilization in in-DRAM caches (Wang et al., 2020). Latency benefits often come from pipelining, granularity alignment, and avoiding queue buildup due to resource contention (Xiong et al., 2024, Wu et al., 24 Aug 2025). Empirically, segment-granular systems approach the theoretical minimum required for cold start and hot reuse in real LLM serving (Zou et al., 20 Jan 2026). Reinforcement learning and supervised learning can reduce the required cache size for the same hit-rate by >20% versus LRU (Alabed, 2019, Choi et al., 2019).
6. Open Challenges, Trade-Offs, and Future Directions
Key trade-offs and ongoing research in fine-grained cache management focus on:
- Overhead vs. Reward: Fine-grained policies require more per-object metadata and potentially more frequent updates or movement; however, properly tuned, the overhead is amortized or hidden (solver time <1 ms in ORBITFLOW) (Ma et al., 5 Jan 2026).
- Granularity Tuning: Choosing the optimal block or segment size is non-trivial; too fine leads to excessive metadata or relocation, too coarse loses locality and admits interference (Wang et al., 2020, Zou et al., 20 Jan 2026).
- Adaptivity: Many systems now combine runtime feedback (SLO attainment, observed latency) with profile-guided models and machine learning, triggering policy updates only as needed (Ma et al., 5 Jan 2026, Xiong et al., 2024).
- Generality and Extensibility: Techniques such as reinforcement learning (RLCache) are domain-agnostic and can be reused with only reward function/scaling changes; attention-guided policies are specific to transformer models and not portable across architectures (Alabed, 2019, Zou et al., 20 Jan 2026).
- Hardware Integration: Emerging work targets end-to-end hardware/software co-design, enabling programmable fine-grained cache logic at the accelerator or memory controller, with modular APIs for rapid cache policy exploration (Nguyen et al., 30 Oct 2025).
- Elasticity and Isolation: The move to segment-level global pooling, with load and memory fragmentation minimized by deduplication, is essential in multi-tenant and distributed LLM workloads (Wu et al., 24 Aug 2025).
A plausible implication is that future large-memory, multi-tenant systems and large-model serving pipelines will depend on hierarchical, multi-granular cache management that combines fine-grained hardware substrate support, workload-aware learning, and cross-tier orchestration to approach theoretical bounds on performance per watt, fairness, and cost.