
Memory Caching: Techniques and Trade-offs

Updated 3 March 2026
  • Memory caching is a technique that uses fast-access storage to temporarily hold data, optimizing retrieval and lowering latency in diverse applications.
  • It employs coded delivery alongside local and global caching gains to achieve efficient rate-memory trade-offs and reduced network delivery rates.
  • Advanced caching integrates adaptive policies and predictive algorithms to dynamically optimize resource allocation and improve overall system throughput.

Memory Caching (MC) encompasses a wide spectrum of techniques and theoretical frameworks in which intermediate storage, or “caches”, serves to accelerate access to data, minimize latency, and reduce bandwidth consumption across hardware, networking, distributed-systems, and machine learning contexts. MC leverages locality (both temporal and spatial), redundancy elimination (coded schemes), and predictive algorithms to optimize hit rate, energy, or delivery rate subject to physical or economic constraints. This article surveys the major models, algorithmic innovations, rate–memory trade-offs, and system design principles that define the state of the art across key research fronts.

1. Core Principles and Models of Memory Caching

Memory Caching refers to strategies where a fast-access memory stage (the cache) temporarily retains data predicted to be useful for future requests. The classical MC system model encompasses:

  • A data repository (e.g., a library of N files or memory objects).
  • K clients or users, each with a local cache of size M (files or blocks).
  • A two-step workflow: placement phase (populate caches, typically off-peak) and delivery phase (serve future requests efficiently).

The fundamental trade-off is succinctly expressed as R^*(M) := \min\{R : \text{a scheme with cache size } M \text{ achieves delivery rate } R\}. This MC model has been formalized both in information-theoretic settings for content distribution (Maddah-Ali et al., 2012, Namboodiri et al., 2022) and in hardware memory hierarchies (Ruan et al., 2016, Bakhshalipour et al., 2018).

A key insight is the distinction between local caching gain, where users serve their requests from their own cache, and global caching gain, where content placement and coded delivery are co-designed to exploit overlaps in user demands, enabling multicast opportunities and rate reductions (Maddah-Ali et al., 2012).

2. Algorithmic and Theoretical Rate–Memory Trade-offs

Early MC schemes used uncoded, popularity-based replication of popular files, yielding the following rate under the uniform-demand model: R_{\mathrm{uncoded}}(M) = K\left(1 - \frac{M}{N}\right). However, coded caching, pioneered by Maddah-Ali and Niesen, demonstrated that splitting files into subfiles and employing XOR-multicast transmissions can dramatically reduce the network delivery rate: R(M) = \frac{K(1 - M/N)}{1 + KM/N}. This scheme achieves a multiplicative rate reduction, especially when the aggregate cache capacity KM exceeds the library size N (Maddah-Ali et al., 2012).
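The global gain is easiest to see in the construction itself. The following Python sketch (tracking subfile indices only, with no actual file payloads) implements the standard centralized placement and XOR-multicast delivery at an integer corner point t = KM/N and reports the resulting rate; the parameter values and demand vector in the example are illustrative.

```python
from itertools import combinations
from math import comb

def man_coded_caching(N, K, M, demands):
    """Index-level sketch of centralized coded caching at a corner point t = K*M/N.

    Each file is split into C(K, t) subfiles, one per t-subset of users, and
    every multicast message is the XOR of t+1 such subfiles, following the
    Maddah-Ali--Niesen construction described in the text.
    """
    t = K * M // N
    assert t * N == K * M, "sketch assumes t = K*M/N is an integer"
    users = range(K)

    # Placement phase: user k caches subfile (n, S) of every file n exactly
    # when k belongs to the t-subset S (a fraction M/N of each file).
    placement = {k: {(n, S) for n in range(N)
                     for S in combinations(users, t) if k in S}
                 for k in users}

    # Delivery phase: for every (t+1)-subset S of users, XOR together, for
    # each k in S, the subfile of k's demanded file indexed by S \ {k};
    # every user in S can cancel the other t terms from its own cache.
    messages = []
    for S in combinations(users, t + 1):
        messages.append([(demands[k], tuple(u for u in S if u != k)) for k in S])

    rate = comb(K, t + 1) / comb(K, t)   # = K(1 - M/N) / (1 + K*M/N)
    return placement, messages, rate

# Example: N = K = 4 and M = 2 gives a coded rate of 2/3 of a file,
# versus the uncoded baseline K(1 - M/N) = 2 files.
_, msgs, rate = man_coded_caching(N=4, K=4, M=2, demands=[0, 1, 2, 3])
```

Each multicast message is useful to t + 1 users simultaneously, which is exactly the global caching gain described above.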

Subsequent work characterized MC for nonuniform file popularity and cache capacities by optimizing modified coded caching schemes (MCCS), yielding structured and often optimal placement/delivery for two-user and distinct-demand regimes (Deng et al., 2021). Strategies for heterogeneous (video layer) needs with loss-tolerant users yield MC formulations minimizing average distortion subject to shared-link or per-user constraints (Hassanzadeh et al., 2019).

In combinatorial and hierarchical multi-access caching, MC techniques further generalize to multi-cache and multi-layer network architectures. Coded placement with layered, MDS-coded data enables optimal or near-optimal rate–memory trade-offs, both in single-layer (Namboodiri et al., 2022) and hierarchical/multi-tiered (Pandey et al., 20 Jan 2025) designs.

Model/Setting | Achievable R(M) Formulation | Optimality Regime
Uncoded caching (uniform) | K(1 - M/N) | Baseline
Centralized coded caching | \frac{K(1 - M/N)}{1 + KM/N} | Within a factor of 12 of optimal (Maddah-Ali et al., 2012)
Multi-access coded placement (Scheme 1) | \frac{\binom{C}{r}}{\binom{t+r}{r}}\left(1 - \frac{rM}{N}\right) | Optimal for M/N > (\binom{C}{r} - 1)/r\binom{C}{r} (Namboodiri et al., 2022)
Hierarchical (composite rate) | \overline{R} = R_1 + K_1 R_2 (formulas in (Pandey et al., 20 Jan 2025)) | Improves over prior schemes for small K_2

3. Advanced Replacement and Policy Mechanisms in Hardware MC

In memory systems, MC is instantiated as cache replacement algorithms, bridging recency, frequency, and dirty-data heuristics for bandwidth, energy, and endurance optimization. The multilevel replacement policy MAC integrates freshness levels (N_1, for LRU-like recency) and dirty-degree levels (N_2, for write-back urgency), partitioning blocks into N_1 × N_2 protection-level tiers (Ruan et al., 2016). A simplified sketch of this two-level idea follows the list below.

  • Insertion, promotion, and demotion policies keep dirty and clean lines in separate protection tiers, ensuring that dirty data is not evicted prematurely or too frequently; in simulation this reduces write-back traffic by 25% on average over standard LRU, with under 1% hardware overhead.
  • Replacement policies can be further tuned by integrating set-dueling for dynamic sensitivity or sub-block dirty-byte counting (as in LDF) for finer granularity (Ruan et al., 2016).
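As a concrete illustration of the two-dimensional protection idea (not the exact MAC algorithm of Ruan et al., 2016), the Python sketch below keeps a freshness level and a dirty level per block and evicts the clean-and-cold block first; the capacity, level counts, and access trace are arbitrary.

```python
class TwoLevelCache:
    """Toy write-aware replacement combining recency and dirtiness levels.

    Each resident block carries a freshness level in [0, n1) and a dirty
    level in [0, n2); the victim is the block with the lowest (dirty,
    freshness) pair, so clean-and-cold blocks leave first and write-backs
    are deferred. Illustrative only.
    """

    def __init__(self, capacity, n1=4, n2=4):
        self.capacity, self.n1, self.n2 = capacity, n1, n2
        self.blocks = {}          # addr -> [freshness, dirty_level]
        self.writebacks = 0

    def access(self, addr, is_write=False):
        hit = addr in self.blocks
        if not hit and len(self.blocks) >= self.capacity:
            self._evict()
        fresh, dirty = self.blocks.get(addr, [self.n1 // 2, 0])  # insertion level
        if hit:
            fresh = min(self.n1 - 1, fresh + 1)   # promotion on re-reference
        if is_write:
            dirty = min(self.n2 - 1, dirty + 1)   # higher write-back urgency
        for other, levels in self.blocks.items():
            if other != addr:
                levels[0] = max(0, levels[0] - 1)  # demotion (aging) of the rest
        self.blocks[addr] = [fresh, dirty]
        return hit

    def _evict(self):
        # Lowest dirty level first, freshness breaks ties.
        victim = min(self.blocks, key=lambda a: tuple(reversed(self.blocks[a])))
        fresh, dirty = self.blocks.pop(victim)
        if dirty:
            self.writebacks += 1                   # dirty victim needs a write-back

cache = TwoLevelCache(capacity=4)
for addr, w in [(1, False), (2, True), (3, False), (2, True), (4, False), (5, False)]:
    cache.access(addr, is_write=w)
# Block 2 has been written twice and is the most protected; when the cache
# overflows, a clean and cold block (here block 1) is evicted first.
```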

4. Memory Caching in Modern Hardware Architectures and Disaggregated Systems

MC principles are foundational in modern hardware stacks leveraging multi-level memory technologies (DRAM, PCM, NVM, SSD, etc.). Notable architectures and schemes include:

  • Die-Stacked DRAM Hybrid MemCache: Partitioning on-chip memory into a memory slice for hot-page allocation (tag-free, no associative lookup) and a cache slice for dynamic behavior, with partitioning guided by static profiling for maximal “hits per frame.” This design achieves 21–28% higher performance than full-cache approaches at reduced tag overhead (Bakhshalipour et al., 2018).
  • Disaggregated Memory Elastic Caching (Ditto): RDMA-enabled, client-centric MC adapts not only the caching policy (via a multi-armed bandit among LRU, LFU, etc.) but also resource allocation (CPU, memory), achieving up to 9× throughput improvements and p99 tail latencies of 14–21 μs under dynamic reconfiguration (Shen et al., 2023). A minimal policy-selection sketch follows this list.
  • FPGA-Accelerated MC via GMM (ICGMM): Utilizing hardware-efficient 2D Gaussian mixture models (page index, time window) to score and select SSD pages for device-side DRAM caching. Achieves up to 39.14% lower access latency and over 10,000× faster on-chip policy inference than LSTM-driven alternatives (Chen et al., 2024).
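To make the adaptive-policy idea concrete, the sketch below treats the choice between LRU and LFU victim selection as an epsilon-greedy multi-armed bandit, rewarding whichever policy currently yields more hits. It is a minimal illustration of the general technique, not Ditto's actual mechanism; the class, function names, and reward scheme are assumptions.

```python
import random

class PolicyBandit:
    """Epsilon-greedy selection among candidate eviction policies.

    Minimal illustration of adaptive policy switching; the cited system
    (Ditto) uses a more elaborate client-centric mechanism and also adapts
    CPU/memory allocation.
    """

    def __init__(self, policies, epsilon=0.1):
        self.policies = policies                   # name -> victim-selection fn
        self.epsilon = epsilon
        self.hits = {p: 0 for p in policies}
        self.trials = {p: 1 for p in policies}     # start at 1 to avoid /0

    def choose(self):
        if random.random() < self.epsilon:         # explore a random policy
            return random.choice(list(self.policies))
        # Exploit the policy with the best observed hit ratio so far.
        return max(self.policies, key=lambda p: self.hits[p] / self.trials[p])

    def record(self, policy, was_hit):
        self.trials[policy] += 1
        self.hits[policy] += int(was_hit)


def lru_victim(last_use, freq):
    return min(last_use, key=last_use.get)         # least recently used key

def lfu_victim(last_use, freq):
    return min(freq, key=freq.get)                 # least frequently used key


bandit = PolicyBandit({"lru": lru_victim, "lfu": lfu_victim})
# On each capacity miss: policy = bandit.choose();
# victim = bandit.policies[policy](last_use, freq); evict it, and later call
# bandit.record(policy, was_hit) for requests served while that policy is active.
```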

5. Multi-Tenant, Distributed, and Graph-Centric MC

MC also encompasses shared and distributed models:

  • Multi-Tenant Proxied MC with Shared Caches: Multiple proxies maintain independent LRU lists in a shared cache; object length is split (inflated/deflated) proportionally among the referring proxies. A “working-set” approximation yields hit-probability estimates, enabling admission control and soft/hard cap batching strategies that minimize eviction overhead and ripple effects (Kesidis et al., 2019). A generic working-set estimate of this kind is sketched after this list.
  • RMA Caching for Distributed Graph Algorithms: Transparent software caching (CLaMPI) with application-aware scoring (degree-bias for high-degree vertices in CSR) in asynchronous, one-sided MPI graph analytics yields 62–73% communication time reductions and head-to-head speedups up to 100× over synchronous, two-sided baselines (Strausz et al., 2022).
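For intuition on how working-set-style hit-probability estimates can drive admission control, the sketch below uses a generic Che-style characteristic-time approximation for an LRU cache under an independent-reference (Poisson) demand model; this is a textbook approximation, not the specific multi-tenant estimator of Kesidis et al. (2019), and the popularity profile in the example is illustrative.

```python
import math

def che_hit_probabilities(rates, cache_size, iters=60):
    """Che-style working-set approximation for LRU hit probabilities.

    rates[i] is the request rate of object i under the independent-reference
    model. Solves sum_i (1 - exp(-rate_i * T)) = cache_size for the
    characteristic time T by bisection, then returns per-object hit
    probabilities 1 - exp(-rate_i * T).
    """
    assert 0 < cache_size < len(rates)
    occupied = lambda t: sum(1 - math.exp(-r * t) for r in rates)
    lo, hi = 0.0, 1.0
    while occupied(hi) < cache_size:        # bracket the root
        hi *= 2
    for _ in range(iters):                  # bisection on T
        mid = (lo + hi) / 2
        if occupied(mid) < cache_size:
            lo = mid
        else:
            hi = mid
    T = (lo + hi) / 2
    return [1 - math.exp(-r * T) for r in rates]

# Example: Zipf-like popularities, cache holding a tenth of the objects.
rates = [1 / (i + 1) for i in range(100)]
hit_p = che_hit_probabilities(rates, cache_size=10)
overall_hit_rate = sum(r * p for r, p in zip(rates, hit_p)) / sum(rates)
```

An admission controller can compare such per-object estimates against an object's size or eviction cost before letting it into the shared cache.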

6. Emerging Directions: MC for Learning Systems, Hierarchical Design, and Joint Optimization

Memory caching in machine learning models has recently emerged as a core enabler of long-context sequence modeling:

  • Memory Caching for RNNs: The Memory Caching (MC) framework splits the input into segments, caches hidden states, and supports direct-sum (residual), data-dependent gating, and “memory soup” retrieval strategies. MC interpolates between RNN (\mathcal{O}(L)) and Transformer (\mathcal{O}(L^2)) memory, closing much of the recall gap at subquadratic cost and efficiently supporting contexts up to 32K tokens (Behrouz et al., 27 Feb 2026). A toy sketch of segment-level state caching appears at the end of this list.
  • Flexible Subpacketization and Explicit R–F Trade-offs: Constructions of placement-delivery arrays (PDAs) allow direct control of the caching rate R and the subpacketization F, enabling practical MC deployments with polynomial or subexponential F at the cost of marginal increases in R, a key enabler for scalable coded caching in real-world environments (Cheng et al., 2017).
  • Multiple-Choice Knapsack-Based Memory Hierarchy Design: MC is formally cast as an MCKP, where each data item's placement is chosen to maximize service time reduction under budget constraints. Greedy LP relaxations efficiently configure multi-stash hierarchies (e.g., DRAM+NVM+SSD), with tiered allocation and selective replication optimizing performance, reliability, and cost (Ghandeharizadeh et al., 5 Jun 2025).
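The MCKP view admits a simple greedy heuristic in the spirit of the LP relaxation: start every item in its cheapest tier and repeatedly buy the upgrade with the best incremental benefit per unit cost. The Python sketch below illustrates this under the assumption of concave per-item benefits; the tier names, costs, and benefits are hypothetical, and this is not the cited paper's algorithm.

```python
import heapq

def greedy_mckp_placement(items, budget):
    """Greedy, LP-relaxation-style sketch of multiple-choice knapsack placement.

    items maps each object to a list of (cost, benefit) options, one per tier
    (e.g. SSD, NVM, DRAM), sorted by strictly increasing cost. Every item
    starts in its cheapest tier; upgrades are applied in order of incremental
    benefit-per-cost until the budget runs out. Assumes per-item benefits are
    concave in cost (dominated options already removed).
    """
    choice = {name: 0 for name in items}              # chosen tier index
    spend = sum(opts[0][0] for opts in items.values())
    assert spend <= budget, "cheapest placement already exceeds the budget"

    def push_next_upgrade(heap, name):
        i, opts = choice[name], items[name]
        if i + 1 < len(opts):
            dc = opts[i + 1][0] - opts[i][0]
            db = opts[i + 1][1] - opts[i][1]
            heapq.heappush(heap, (-db / dc, dc, name))  # best density first

    heap = []
    for name in items:
        push_next_upgrade(heap, name)
    while heap:
        _, dc, name = heapq.heappop(heap)
        if spend + dc > budget:
            continue                                   # unaffordable; skip it
        choice[name] += 1
        spend += dc
        push_next_upgrade(heap, name)                  # queue the next tier up
    return choice, spend

# Hypothetical (cost, benefit) numbers for three objects across three tiers.
tiers = {
    "index":   [(1, 0), (3, 5), (8, 9)],
    "logs":    [(1, 0), (2, 1), (6, 2)],
    "hot_set": [(1, 0), (4, 7), (7, 12)],
}
placement, cost = greedy_mckp_placement(tiers, budget=12)
# -> {"index": 1, "logs": 1, "hot_set": 2}, using the full budget of 12.
```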
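Returning to the first bullet of this list, the following toy sketch shows the basic mechanics of segment-level hidden-state caching for a recurrent model: the final hidden state of each segment is cached, and each new segment starts from an average ("memory soup") of the cached states. This is only an illustration of the idea under simple assumptions (a plain tanh RNN cell, random weights, mean-pooled retrieval), not the architecture of the cited work.

```python
import numpy as np

def segmented_rnn_with_memory_cache(x, seg_len, d_hidden, seed=0):
    """Toy segment-level hidden-state caching for a plain RNN (illustrative)."""
    rng = np.random.default_rng(seed)
    d_in = x.shape[1]
    W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
    W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
    cache, outputs = [], []                        # cached segment summaries
    for start in range(0, len(x), seg_len):
        seg = x[start:start + seg_len]
        # Retrieval: start from the mean of cached states ("memory soup").
        h = np.mean(cache, axis=0) if cache else np.zeros(d_hidden)
        for x_t in seg:
            h = np.tanh(W_h @ h + W_x @ x_t)       # plain RNN cell
            outputs.append(h)
        cache.append(h)                            # cache this segment's state
    return np.stack(outputs), cache

out, mem = segmented_rnn_with_memory_cache(np.random.randn(64, 8), seg_len=16, d_hidden=32)
```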

7. Implications, Best Practices, and Future Directions

Research across MC domains converges on several key guidelines:

  • Coded Placement and Delivery: Engineering memory (or cache) content to create coded-multicast opportunities multiplies global gains, especially when aggregate capacity is large compared to the data universe.
  • Adaptive and Application-Aware Policies: Integrating learning (GMMs, bandits), application-specific scores, or system/OS-level heuristics (hot-page identification, migration hysteresis) is crucial for dynamically shifting workloads and heterogeneous user demands.
  • Hierarchical and Cooperative Caching: Joint optimization (admission control, resource partition, cross-layer MC) consistently outperforms localized schemes, justifying the added complexity in large-scale or multi-tier settings.

Several open research avenues persist, such as online adaptation in hardware MC, MC for dynamic and multi-modal ML workloads, efficient MC across emerging CXL-enabled and disaggregated architectures, and the design of MC for privacy/security-oriented environments (Namboodiri et al., 2022, Shen et al., 2023, Chen et al., 2024).
