Dynamic Cache Resizing Techniques
- Dynamic cache resizing is an adaptive cache allocation technique that adjusts capacity and structure based on workload variations to optimize resource utilization and meet QoS objectives.
- It employs methods such as regression-based predictors, feedback counters, and hardware hash-based remapping to scale cache resources at runtime.
- Reported results include up to 29% higher hit ratios than FIFO and up to 8.6× lower DRAM cache remapping energy, with applications in cloud multi-tenancy, hardware caches, and LLM serving.
Dynamic cache resizing refers to adaptive mechanisms that alter the capacity, allocation granularity, or structural layout of a cache subsystem in response to changes in workload characteristics, system state, or policy objectives. This capability is critical across a range of computing contexts, including cloud multi-tenancy, LLM/KV cache management, hardware-level DRAM and L1D caches, and real-time or mixed-criticality systems. Dynamic resizing can be realized in hardware, software, or hybrid forms and is evaluated against metrics such as performance, energy consumption, security, and quality of service (QoS) compliance.
1. Core Principles of Dynamic Cache Resizing
Dynamic cache resizing adjusts cache dimensions at runtime by reallocating space among clients or tenants, by changing the number of physical sets, ways, or banks, or by compressing and decompressing the state held in cache entries. The principal goals are:
- Resource Utilization: Allocating "just enough" cache to maximize aggregate system efficiency (e.g., minimizing total allocated memory or DRAM energy while meeting per-tenant service guarantees) (Choi et al., 2019).
- QoS/SLO Adherence: Guaranteeing workload-specific service-level objectives—such as hit rate or low tail latency—by scaling cache resources as required (Su et al., 24 May 2025, Choi et al., 2019).
- Performance Adaptation: Rapidly responding to phase or workload changes, including bursty access patterns, fluctuating working set sizes, and system-level events (mode switches, reconfigurations).
- Security Hardening: Obfuscating cache eviction behavior to thwart side-channel or timing attacks via randomized and unpredictable resizing (Wang et al., 2023).
Underlying these principles are mechanisms for usage/pressure sensing, resizing/partitioning logic, state migration or remapping, and, when needed, preservation of cache coherence or hit/miss behavior.
2. Algorithms and Mechanisms for Dynamic Cache Resizing
2.1 Regression- and Learning-Based Policies
Learning-based dynamic cache managers infer the access pattern by running Kolmogorov–Smirnov tests against candidate parametric distributions (Uniform, Zipf, Exponential, Gaussian) and use regression models (SVR, GPR, FCN) to predict the hit rate as a function of cache allocation and data profile (Choi et al., 2019). These predictions feed directly into closed-loop resizing decisions: if the estimated hit rate falls below the prescribed QoS margin, the allocation is increased; if it rises above the margin, the allocation is decreased. The optimal increment is identified via binary search on the predictor.
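The closed-loop step above can be sketched as a binary search over a hit-rate predictor. This is a minimal sketch rather than the implementation of Choi et al.: it assumes the predictor is non-decreasing in allocation size, and `min_allocation_for_target`, `toy_predictor`, and the megabyte units are hypothetical names and choices made for illustration.

```python
from typing import Callable, Optional


def min_allocation_for_target(
    predict_hit_rate: Callable[[int], float],
    target: float,
    lo: int,
    hi: int,
) -> Optional[int]:
    """Smallest allocation in [lo, hi] whose predicted hit rate meets `target`.

    Assumes `predict_hit_rate` is non-decreasing in the allocation size.
    Returns None when even the largest allocation misses the target.
    """
    if predict_hit_rate(hi) < target:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if predict_hit_rate(mid) >= target:
            hi = mid          # mid is large enough; try smaller allocations
        else:
            lo = mid + 1      # mid is too small
    return lo


def toy_predictor(size_mb: int) -> float:
    # Hypothetical stand-in for a trained regressor (SVR, GPR, or FCN):
    # a saturating curve that rises with the allocation.
    return size_mb / (size_mb + 256)


if __name__ == "__main__":
    print(min_allocation_for_target(toy_predictor, 0.75, 1, 4096))  # 768
    print(min_allocation_for_target(toy_predictor, 0.99, 1, 4096))  # None
```

With the toy curve, a 0.75 target resolves to 768 MB. A target that the maximum allocation cannot reach returns `None`, which a controller could treat as an unmeetable QoS request.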
2.2 Counter-Based Feedback Policies
DynamicAdaptiveClimb (Berend et al., 26 Nov 2025) embeds simple counters (jump and jump′) that respectively track cache misses/hits and the fraction of "top-half" cache hits. The resizing rules are threshold-based:
- If jump_t ≥ 2K_t: double the cache size;
- If jump_t ≤ −K_t/2 and jump′_t ≤ −εK_t/2: halve the cache size;
- Otherwise: do not resize.
Counters are reset after resizing to prevent oscillation.
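The threshold rule above can be written as a single decision function. This is a sketch under stated assumptions: K_t is taken to be the current cache size in entries, the way the counters are updated on each hit or miss is left out, and `next_cache_size`, `min_size`, and `max_size` are hypothetical names that do not come from the cited work.

```python
from typing import Optional, Tuple


def next_cache_size(
    size: int,
    jump: float,
    jump_prime: float,
    eps: float,
    min_size: int = 1,
    max_size: Optional[int] = None,
) -> Tuple[int, bool]:
    """Apply the threshold rule once.

    Returns (new_size, resized). The caller is expected to reset both
    counters whenever `resized` is True.
    """
    k = size  # assumption: K_t is the current cache size
    if jump >= 2 * k:
        new_size = size * 2
    elif jump <= -k / 2 and jump_prime <= -eps * k / 2:
        new_size = max(min_size, size // 2)
    else:
        return size, False
    if max_size is not None:
        new_size = min(new_size, max_size)
    return new_size, new_size != size


if __name__ == "__main__":
    print(next_cache_size(100, jump=200, jump_prime=0, eps=0.5))    # (200, True)
    print(next_cache_size(100, jump=-50, jump_prime=-25, eps=0.5))  # (50, True)
    print(next_cache_size(100, jump=-50, jump_prime=-10, eps=0.5))  # (100, False)
```

The third call shows why the shrink condition needs both counters: the first counter alone crossing its threshold does not halve the cache.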
2.3 Hardware Hash-Based Set/Bank Remapping
CRUNCH (Chang et al., 2016) introduces hardware-level resizing for large DRAM caches using consistent hashing over banks and regions. When banks are powered down, only the sets in those banks are remapped, in a load-balanced and low-latency way. The Region Remapping Table (RRT) holds the mapping between logical addresses and active banks and is updated during resize events. Hierarchical Dirty Bits (HIER) structures enable efficient handling of dirty data migration during shrink and grow transitions.
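The key property of consistent hashing, that powering down a bank moves only the regions that lived in it, can be illustrated in software. This is a generic hash-ring sketch, not CRUNCH's hardware design: the `BankRing` class, the use of SHA-256, and the virtual-node count are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # 64-bit position on the ring derived from SHA-256.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class BankRing:
    """Toy consistent-hash ring mapping region ids to active banks."""

    def __init__(self, banks, vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._points = []  # sorted list of (ring position, bank id)
        for bank in banks:
            self.power_up(bank)

    def power_up(self, bank: int) -> None:
        for v in range(self._vnodes):
            bisect.insort(self._points, (_hash(f"bank{bank}#{v}"), bank))

    def power_down(self, bank: int) -> None:
        self._points = [p for p in self._points if p[1] != bank]

    def bank_for(self, region: int) -> int:
        # Assumes at least one bank is active.
        h = _hash(f"region{region}")
        i = bisect.bisect_right(self._points, (h, float("inf")))
        return self._points[i % len(self._points)][1]


if __name__ == "__main__":
    ring = BankRing(range(8))
    before = {r: ring.bank_for(r) for r in range(1000)}
    ring.power_down(3)
    after = {r: ring.bank_for(r) for r in range(1000)}
    moved = [r for r in before if before[r] != after[r]]
    # Only regions that lived in bank 3 change banks.
    print(all(before[r] == 3 for r in moved))
```

Virtual nodes spread the remapped regions across the remaining banks rather than onto a single neighbor, which is the load-balancing aspect the text refers to.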
2.4 Layer- and Attention-Aware Methods for LLMs
PyramidKV (Cai et al., 2024) leverages a structural property of Transformer-based LLMs, the "pyramidal" attention distribution, by allocating more KV cache to lower layers and less to higher layers. At each step, each layer's KV buffer is pruned by top-k selection on attention scores from a recent window, with per-layer budgets set by a pyramid slope hyperparameter.
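As an illustration only, the layer-wise budgeting can be sketched with a linear schedule. The exact schedule used by PyramidKV may differ: the linear form, the `slope` parameterization in [0, 1), and the function names `pyramid_budgets` and `select_top_k` are assumptions, and the attention scores are taken as given rather than computed from a recent window.

```python
from typing import List, Sequence


def pyramid_budgets(num_layers: int, total_budget: int, slope: float) -> List[int]:
    """Linearly decreasing per-layer KV budgets (hypothetical schedule).

    slope = 0 gives a uniform split; larger values shift budget toward the
    lower layers. Budgets are rounded, so they sum only approximately to
    total_budget.
    """
    mean = total_budget / num_layers
    if num_layers == 1:
        return [max(1, round(mean))]
    budgets = []
    for layer in range(num_layers):
        # pos runs from -1 (bottom layer) to +1 (top layer).
        pos = 2 * layer / (num_layers - 1) - 1
        budgets.append(max(1, round(mean * (1 - slope * pos))))
    return budgets


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring cached tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    print(pyramid_budgets(num_layers=4, total_budget=400, slope=0.5))
    # [150, 117, 83, 50]
    print(select_top_k([0.1, 0.9, 0.3, 0.7], k=2))
    # [1, 3]
```

Each layer would call `select_top_k` with its own budget, so lower layers retain more tokens than higher ones.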
PipeLive (Bai et al., 14 Apr 2026) and MorphServe (Su et al., 24 May 2025) provide token-level or in-place cache resizing for distributed KV (Key/Value) caches in LLM serving, utilizing block-level allocators, asynchronous CUDA streams, and page table updates to maintain correctness, minimize service downtime, and enable concurrent inference and block (de-)allocation.
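The idea of resizing a block-level KV cache in place, without disturbing blocks that live sequences already hold, can be sketched with a toy allocator and page table. This models only the bookkeeping; it is not the PipeLive or MorphServe implementation, it omits GPU memory and asynchronous CUDA streams entirely, and the `BlockPool` class and its method names are hypothetical.

```python
class BlockPool:
    """Toy block-level KV allocator with a per-sequence page table."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))  # ids of unallocated blocks
        self.page_table = {}                 # seq_id -> list of block ids
        self._next_id = num_blocks           # id for the next block added by grow()

    def append_block(self, seq_id: str) -> int:
        """Give the sequence one more block; raise if the pool is exhausted."""
        if not self.free:
            raise MemoryError("KV block pool exhausted")
        block = self.free.pop()
        self.page_table.setdefault(seq_id, []).append(block)
        return block

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.page_table.pop(seq_id, []))

    def grow(self, extra: int) -> None:
        """Add new blocks; existing page-table entries are untouched."""
        self.free.extend(range(self._next_id, self._next_id + extra))
        self._next_id += extra

    def shrink(self, count: int) -> list:
        """Retire up to `count` free blocks; blocks in use are never taken."""
        n = min(count, len(self.free))
        return [self.free.pop() for _ in range(n)]


if __name__ == "__main__":
    pool = BlockPool(2)
    pool.append_block("s0")
    pool.append_block("s0")
    pool.grow(2)               # expand during a burst
    pool.append_block("s0")    # succeeds without remapping earlier blocks
    print(pool.page_table["s0"], pool.shrink(5))
```

Because sequences reach their blocks through the page table, growing or shrinking the pool changes only the free list, which is what allows resizing to proceed alongside inference in this simplified model.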
2.5 Mode- and Criticality-Based Partition Redistribution
In mixed-criticality scheduling, as in Awan et al. (2017), cache partitions are reassigned at runtime when the system switches mode (e.g., L→H). When a high-criticality job overruns its WCET, the cache dedicated to low-criticality tasks is reclaimed and redistributed to high-criticality tasks. The allocation is subject to explicit per-task constraints in both modes, and Integer Linear Programming (ILP)-based heuristics are used to maximize schedulability.
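The runtime side of this redistribution can be sketched as follows. The sketch assumes that the per-task share of reclaimed partitions (`extra_share`) was computed offline, for instance by an ILP-based heuristic, and that low-criticality tasks keep no partitions in high mode; both are simplifying assumptions, and the function and task names are hypothetical.

```python
from typing import Dict


def switch_to_high_mode(
    lo_alloc: Dict[str, int],
    is_high: Dict[str, bool],
    extra_share: Dict[str, int],
) -> Dict[str, int]:
    """Reclaim partitions from low-criticality tasks on an L->H switch.

    lo_alloc:    partitions each task holds in low mode
    is_high:     criticality flag per task
    extra_share: reclaimed partitions granted to each high-criticality task
                 (assumed to come from an offline analysis)
    """
    reclaimed = sum(n for task, n in lo_alloc.items() if not is_high[task])
    if any(not is_high[task] for task in extra_share):
        raise ValueError("only high-criticality tasks may receive partitions")
    if sum(extra_share.values()) > reclaimed:
        raise ValueError("extra_share exceeds the reclaimed partitions")
    return {
        task: (n + extra_share.get(task, 0)) if is_high[task] else 0
        for task, n in lo_alloc.items()
    }


if __name__ == "__main__":
    lo_alloc = {"ctrl": 4, "nav": 2, "log": 3, "ui": 3}
    is_high = {"ctrl": True, "nav": True, "log": False, "ui": False}
    print(switch_to_high_mode(lo_alloc, is_high, {"ctrl": 4, "nav": 2}))
    # {'ctrl': 8, 'nav': 4, 'log': 0, 'ui': 0}
```

The checks enforce that high-criticality tasks never lose partitions at the switch and that the total allocation never exceeds what was available in low mode.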
2.6 Security-Driven Randomization Policies
For side-channel resilience, BackCache (Wang et al., 2023) makes the backup cache size a random variable in [Bₘᵢₙ, Bₘₐₓ], with periodic random resizes driven by a memory-access counter. Correctness and defense strength derive from (a) unpredictability of the backup size and (b) unlinkability of miss events to true L1D evictions, quantified by P_avg approaching 0.5 as (Bₘₐₓ–Bₘᵢₙ + 1) increases.
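A software model of counter-driven random resizing might look like the sketch below. It is not BackCache's hardware logic: the fixed resize interval is a simplification (the interval itself could also be randomized), Python's `secrets` module stands in for a hardware random source, and the `RandomizedBackupSize` class is a hypothetical name.

```python
import secrets


class RandomizedBackupSize:
    """Redraw the backup-cache size uniformly from [b_min, b_max]
    every `interval` memory accesses."""

    def __init__(self, b_min: int, b_max: int, interval: int) -> None:
        if not (0 <= b_min <= b_max) or interval < 1:
            raise ValueError("need 0 <= b_min <= b_max and interval >= 1")
        self.b_min = b_min
        self.b_max = b_max
        self.interval = interval
        self.accesses = 0
        self.size = self._draw()

    def _draw(self) -> int:
        # randbelow(n) returns an int in [0, n), so the result is uniform
        # over the (b_max - b_min + 1) admissible sizes.
        return self.b_min + secrets.randbelow(self.b_max - self.b_min + 1)

    def on_access(self) -> bool:
        """Count one memory access; return True if a resize was triggered."""
        self.accesses += 1
        if self.accesses >= self.interval:
            self.accesses = 0
            self.size = self._draw()
            return True
        return False


if __name__ == "__main__":
    ctl = RandomizedBackupSize(b_min=4, b_max=16, interval=100)
    resizes = sum(ctl.on_access() for _ in range(1000))
    print(resizes, ctl.size)
```

Widening the range [b_min, b_max] increases the number of equally likely sizes, which corresponds to the (Bₘₐₓ–Bₘᵢₙ + 1) term in the text.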
3. Implementation Architectures Across Domains
| System Context | Resizing Granularity | Core Mechanism(s) |
|---|---|---|
| Cloud multi-tenancy (Choi et al., 2019) | Per-tenant allocation | Learning-based hit-rate prediction, sample+predictor |
| DRAM caches (Chang et al., 2016) | Banks/sets | Hardware RRT, consistent hashing, HIER |
| LLM KV caches (Cai et al., 2024, Bai et al., 14 Apr 2026, Su et al., 24 May 2025) | Layers/blocks/pages | Attention-based pruning, block allocators, async GPU |
| Adaptive policies (Berend et al., 26 Nov 2025) | Cache capacity (entries) | Feedback counters, threshold rules |
| Mixed-criticality (Awan et al., 2017) | Partitioned pages | ILP heuristic, mode-triggered redistribution |
| Security (L1D) (Wang et al., 2023) | Backup line count | Random intervals, unpredictable sizing, RURP |
In LLM serving, dynamic cache resizing must be tightly coupled with the design of memory allocators and attention kernels. Modern approaches such as block-level allocators support non-disruptive in-place resizing (PipeLive), while pointer-based indirection in paged attention and coalesced kernel launches enable resizing concurrent with token generation (Bai et al., 14 Apr 2026, Su et al., 24 May 2025).
Hardware-based methods (e.g., CRUNCH) require customized address remapping logic, compact indirection tables (RRT), and efficient dirty-line enumeration. Security-focused designs tune both the randomness and frequency of resizing to minimize side-channel visibility, balancing overheads against threat models (Wang et al., 2023).
4. Evaluation Metrics and Quantitative Results
Dynamic cache resizing mechanisms are assessed on a portfolio of domain-relevant metrics:
- Hit/Miss Ratio and Latency: DynamicAdaptiveClimb achieves up to 29% higher hit-ratio than FIFO and 10–15% over the next-best adaptive methods, while reducing miss penalties (Berend et al., 26 Nov 2025). Learning-based resizing in clouds yielded 50%+ latency reduction (from >4 ms to ~2 ms), with hit-rate adherence to within 5% of targets, even under workload shifts (Choi et al., 2019).
- Throughput and Scalability: Both AdaptiveClimb and DynamicAdaptiveClimb scale linearly up to 16 cores, reaching up to 35 Mops and outperforming LRU/B-LRU owing to minimal locking and metadata cost (Berend et al., 26 Nov 2025).
- Memory and Energy Efficiency: CRUNCH reduced DRAM cache remapping energy by up to 8.6× relative to naive MRI and matched MRI's load balancing at sub-10 ms transition latency (Chang et al., 2016). MorphServe expands the KV cache by 32.97% during runtime bursts, nearly eliminating SLO violations while keeping tail latency within 2.2×–3.9× of the static baseline (Su et al., 24 May 2025).
- Memory Footprint Reduction for LLMs: PyramidKV provides up to 18× memory reduction (down to 0.8% of baseline) with negligible model accuracy loss, outperforming fixed-budget and sliding window baselines in long-context reasoning tasks (Cai et al., 2024).
- Security Margin: BackCache's P_avg approaches 0.5 under wide resizing ranges, at a performance cost as low as 1–3% for context switches and ~7–8% for throughput, suggesting that randomization can nearly eliminate attacker distinguishability at modest overhead (Wang et al., 2023).
- Schedulability Improvements: Dynamic redistribution in mixed-criticality scheduling improved schedulability by up to ~3.6 percentage points (absolute) and up to ~30% (relative) when caches are scarce or Cᵢᴴ≫Cᵢᴸ (Awan et al., 2017).
5. Trade-Offs, Tuning, and System Considerations
Key configuration choices include:
- Granularity: Finer-grained allocation (e.g., layers or pages) increases flexibility but may induce management overhead (fragmentation, pointer tracking) (Bai et al., 14 Apr 2026). Block-level or region-based strategies are favored for low-latency, concurrent reconfigurations (Su et al., 24 May 2025).
- Thresholds and Aggressiveness: Thresholds (on utilization, miss rates, latency, or feedback counters) determine sensitivity and adaptation speed. In security contexts, the width of the range (Bₘₐₓ–Bₘᵢₙ) is set to maximize unpredictability (Wang et al., 2023).
- State Preservation and Migration: Methods must minimize or mask the latency of cache evictions, migration, or shrink steps. In DRAM caches, HIER decouples migration from full bank scans; in LLMs, asynchronous state updates or incremental patching obscure resizing from the main compute path (Bai et al., 14 Apr 2026).
- Co-design with Scheduling and Attenuation: In mixed-criticality and cloud contexts, resizing must be cognizant of schedulability analysis (compositional or ILP-based) and leverage system mode transitions to reallocate cache for real-time acceptance (Awan et al., 2017, Choi et al., 2019).
- Security Implications: Predictable resizing intervals or sizes may be exploitable. Cryptographically strong randomness, together with integrating the resize logic into the cache pipeline clocking, is recommended where side-channel resistance is crucial (Wang et al., 2023).
6. Limitations, Open Questions, and Extensions
Many approaches intentionally forgo closed-form optimality for online simplicity, relying on lightweight heuristics, counters, or learning predictors. Theoretical convergence/stability guarantees are generally limited; empirical evidence suggests oscillation is bounded by feedback thresholds (Berend et al., 26 Nov 2025). Open areas include:
- Predictive Scaling: Integration with future-demand predictors to preemptively resize.
- Cross-Layer Coordination: Simultaneous adaptation of multiple resources (e.g., memory, cache, compute) for holistic optimization (Su et al., 24 May 2025).
- Extensibility: Handling of heterogeneous or asymmetric partitions, and integration with additional hardware-level power-saving mechanisms (e.g., line decay) (Chang et al., 2016).
- Differentiability and Fine-Tuning: Dynamic adjustment of compression slope or budget via trainable signals in LLM cache management is suggested but not implemented in PyramidKV (Cai et al., 2024).
- Formal Security Proofs: While the P_avg analysis quantifies the attacker's guessing margin, formal proofs under strong attacker models remain future work for secure hardware cache resizing (Wang et al., 2023).
7. Cross-Domain Applications and Impact
Dynamic cache resizing is applied across several domains:
- Multi-tenant database and cloud systems, for real-time, SLA-driven resource multiplexing.
- Serving stacks for large-scale transformer models, accommodating bursty, multi-tenant query streams and supporting both latency-sensitive and memory-constrained deployment (Su et al., 24 May 2025, Cai et al., 2024, Bai et al., 14 Apr 2026).
- Advanced hardware cache and memory management, reducing DRAM refresh and leakage energy, and facilitating rapid adaptation to phase shifts in workload (Chang et al., 2016).
- Security-critical contexts, where cache behavior must be obfuscated without prohibitive overhead.
- Mixed-criticality cyber-physical systems, where runtime cache partition redistribution guarantees critical task schedulability without gross overprovisioning (Awan et al., 2017).
Dynamic resizing, when tightly integrated with system monitoring, learning, and scheduling logic, improves performance and efficiency and helps guarantee or harden system-level properties under dynamic and adversarial conditions.