Block-Wise Caching (BWCache)

Updated 24 September 2025
  • Block-Wise Caching (BWCache) is a caching strategy that manages and evicts data in blocks to exploit spatial locality and reduce overhead.
  • BWCache is implemented in coded caching frameworks, adaptive cloud systems, and deep neural inference to enhance throughput and minimize memory cost.
  • BWCache employs specialized cost models and eviction policies that leverage block redundancy, improving latency and scalability in heterogeneous environments.

Block-Wise Caching (BWCache) refers to a family of algorithms and architectural principles in which data is cached, managed, and evicted at the level of blocks (or subfiles, segments, or chunks) rather than at the object or fine-grained item level. BWCache spans multiple research domains: coded caching in networked systems, block-aware replacement and fetching algorithms, granularity change in memory caching, and efficient block-wise KV-pruning in large-scale inference engines and generative models. Conceptually, BWCache exploits the cost structure and spatial locality introduced by block granularity, enabling reductions in computational, communication, and storage overheads compared to itemwise or tokenwise methods.

1. Block-Wise Caching Principles and Variants

At its core, BWCache rests on the following properties:

  • Block-based granularity: The unit of caching, eviction, and sometimes data transfer is a block (which could be a fixed-size chunk in storage, a group of tokens in an LLM attention cache, or a stage output in a DNN); a minimal sketch of block-granular eviction follows this list.
  • Cost structure adaptation: Fetching or evicting a set of items from a given block incurs the same cost as operating on any single item in that block (Coester et al., 2022). This property drives the optimization in both storage and network environments.
  • Redundancy identification: BWCache often exploits inherent redundancy across blocks (e.g., temporal redundancy in diffusion models, or spatial redundancy in adjacent storage sectors).
  • Structure-aligned eviction: Many block-wise schemes tailor eviction to block, page, or group boundaries to reduce fragmentation and maintain efficient memory utilization (Chitty-Venkata et al., 4 Sep 2025).
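
The block-based granularity and structure-aligned eviction above can be made concrete with a minimal sketch. The Python class below is purely illustrative (the class name, block mapping, and capacity are hypothetical, not drawn from any cited system): items map to fixed-size blocks, a hit refreshes the whole owning block, and eviction always removes an entire least-recently-used block rather than a single item.

```python
from collections import OrderedDict

class BlockLRUCache:
    """Illustrative cache that admits and evicts whole blocks.

    Items are grouped into fixed-size blocks (item_id // block_size);
    a hit on any item refreshes its whole block, and eviction always
    removes the least-recently-used block, never a single item.
    """

    def __init__(self, capacity_blocks: int, block_size: int):
        self.capacity_blocks = capacity_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # block_id -> {item_id: value}

    def _block_id(self, item_id: int) -> int:
        return item_id // self.block_size

    def get(self, item_id: int):
        bid = self._block_id(item_id)
        block = self.blocks.get(bid)
        if block is None or item_id not in block:
            return None  # miss
        self.blocks.move_to_end(bid)  # refresh recency of the whole block
        return block[item_id]

    def put(self, item_id: int, value) -> None:
        bid = self._block_id(item_id)
        if bid not in self.blocks and len(self.blocks) >= self.capacity_blocks:
            self.blocks.popitem(last=False)  # evict the coldest block in full
        self.blocks.setdefault(bid, {})[item_id] = value
        self.blocks.move_to_end(bid)

cache = BlockLRUCache(capacity_blocks=2, block_size=4)
for i in range(10):
    cache.put(i, f"payload-{i}")
print(cache.get(9))   # hit: block 2 (items 8-9) is resident
print(cache.get(0))   # miss: block 0 was evicted as a unit
```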

The sections below survey the principal BWCache instantiations, spanning block-aware cost models, coded caching in networked systems, large-scale parallel infrastructure, and modern generative and inference workloads.

2. Block-Aware Cost Models and Algorithmic Implications

Block-aware caching generalizes classic single-item caching to settings where cost is incurred per block rather than per item. Key parameters include cache size $k$, block size $\beta$, and block cost $c_B$. This leads to two main cost models:

  • Eviction Cost Model: Cost is incurred when flushing any subset of a block; fetching is free. There exist $O(\log k)$-approximate offline algorithms, $k$-competitive deterministic online algorithms, and $O(\log^2 k)$-competitive randomized online algorithms for this model (Coester et al., 2022). The LP relaxation uses submodular covering constraints.
  • Fetching Cost Model: Cost is paid on block fetch; eviction is free. The natural LP relaxation has an $\Omega(\beta)$ integrality gap, separating the tractability of this model from the eviction cost model and classic paging. Online algorithms face fundamental $\Omega(\beta + \log k)$ lower bounds. A simple cost-accounting sketch contrasting the two models follows this list.
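
For intuition, the difference between the two models can be seen by charging each step of a policy by the number of distinct blocks it touches, per the cost-structure property noted earlier. The sketch below is a hedged accounting illustration under that assumption (the trace, block map, and unit costs are hypothetical); it is not the offline or online algorithms of the cited work.

```python
def block_of(item: int, beta: int) -> int:
    """Map an item to its block under a fixed block size beta."""
    return item // beta

def fetch_cost(fetched_sets, beta: int, c_block: float = 1.0) -> float:
    """Fetching cost model: each step pays c_block per distinct block
    it fetches from, no matter how many items of that block it pulls."""
    return sum(c_block * len({block_of(i, beta) for i in step})
               for step in fetched_sets)

def evict_cost(evicted_sets, beta: int, c_block: float = 1.0) -> float:
    """Eviction cost model: each step pays c_block per distinct block
    it flushes items from; fetches are free in this model."""
    return sum(c_block * len({block_of(i, beta) for i in step})
               for step in evicted_sets)

def itemwise_cost(sets, c_item: float = 1.0) -> float:
    """Classic per-item accounting, for comparison."""
    return sum(c_item * len(step) for step in sets)

# Hypothetical trace: the items a policy fetched at each step.
trace = [{0, 1, 2, 3}, {4, 5}, {3, 8}]
print(fetch_cost(trace, beta=4))   # 1 + 1 + 2 = 4 block-level charges
print(itemwise_cost(trace))        # 4 + 2 + 2 = 8 item-level charges
```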

BWCache also appears in granularity-change caching (Beckmann et al., 2022), with complexity increases (NP-completeness in the offline setting) and deteriorating competitive ratios proportional to $B$ (block size). Layered policies such as Item-Block Layered Partitioning (IBLP) can partially recover optimality. These theoretical results underscore both the power and subtleties of block-aware design.

3. BWCache in Coded Caching and Network Systems

In heterogeneous wireless architectures and content networks, BWCache has been operationalized through coded caching, where files are split into multiple subfiles or blocks and stored across distributed caches. The Maddah-Ali–Niesen framework (Hachem et al., 2014) achieves order-optimal base station (BS) transmission rates by combining randomized block placement and multicast delivery using blockwise linear combinations. For multi-level content popularity, optimal memory sharing divides the cache among levels and files, allocating full caching to the most popular blocks and partial (possibly zero) cache space to less popular levels. The broadcast rates are expressed as:

$$R^{\text{MU}}(M) \approx \sum_{h \in H} K U_h + \frac{\left(\sum_{i \in I} \sqrt{N_i U_i}\right)^2}{M - \sum_{j \in J} N_j / d_j} - \sum_{i \in I} d_i U_i,$$

with memory allocations $\alpha_i M \approx (\sqrt{N_i U_i}/S_i) \cdot (M - T_J + V_I)$. Lower bounds are derived using sliding-window entropy inequalities, and dichotomous strategies emerge for single-user vs. multi-user settings, distinguishing between memory-sharing and clustering approaches.
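
To make the rate expression concrete, the snippet below simply transcribes the formula and evaluates it numerically; the split of popularity levels into the sets $H$, $I$, $J$ and all parameter values are hypothetical, chosen only for illustration rather than taken from the cited paper.

```python
from math import sqrt

def rate_mu(M, K, H, I, J, N, U, d):
    """Transcription of the multi-user rate approximation
    R^MU(M) ~ sum_{h in H} K*U_h
            + (sum_{i in I} sqrt(N_i*U_i))^2 / (M - sum_{j in J} N_j/d_j)
            - sum_{i in I} d_i*U_i,
    where N, U, d are dicts keyed by popularity level."""
    uncached = sum(K * U[h] for h in H)
    shared_num = sum(sqrt(N[i] * U[i]) for i in I) ** 2
    shared_den = M - sum(N[j] / d[j] for j in J)
    clustered = sum(d[i] * U[i] for i in I)
    return uncached + shared_num / shared_den - clustered

# Hypothetical three-level example: level 1 fully cached (in J),
# level 2 memory-shared (in I), level 3 uncached (in H).
N = {1: 100, 2: 400, 3: 1000}   # files per level
U = {1: 5, 2: 2, 3: 1}          # per-user requests per level
d = {1: 4, 2: 4, 3: 4}          # users per cache
print(rate_mu(M=200, K=40, H=[3], I=[2], J=[1], N=N, U=U, d=d))
```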

Subpacketization bottlenecks—each file split into exponentially many subfiles—are alleviated by resolvable design–based BWCache schemes, using generator matrices of linear block codes that satisfy the consecutive column property (Tang et al., 2017). Such schemes yield subpacketization levels of $F_s = q^k z$ and allow practitioners to choose points along the rate–subpacketization trade-off curve.
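
For a rough sense of scale, the snippet below contrasts the binomial subpacketization of the original coded-caching scheme, which grows exponentially in the number of users, with a value of the $F_s = q^k z$ form quoted above; all parameter values are purely illustrative and not taken from the cited work.

```python
from math import comb

# Maddah-Ali–Niesen-style subpacketization: each file is split into
# C(K, t) subfiles with t = K*M/N, which grows exponentially in K.
K, t = 60, 20
man_subpacketization = comb(K, t)

# Resolvable-design-based form F_s = q^k * z with hypothetical q, k, z.
q, k, z = 4, 5, 3
fs = q**k * z

print(f"binomial C({K},{t}) = {man_subpacketization:.3e}")
print(f"F_s = {q}^{k} * {z} = {fs}")
```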

4. BWCache in Large-Scale and Parallel Systems

Blockwise principles are fundamental in disk caching, data-parallel compute frameworks, and persistent memory devices:

  • Disaggregated and adaptive cloud caching: In AdaCache, caches are organized as groups containing variable-size blocks matched to I/O request sizes (Yang et al., 2023). Adaptive allocation is performed by scanning for missing cache intervals and greedily allocating the largest permissible block. Fragmentation is minimized by storing same-size blocks in contiguous groups, and a two-level LRU replacement policy bridges block eviction and group-level memory management; a simplified allocation sketch follows the list.
  • I/O transit caching in persistent memory: The Caiti algorithm treats the DRAM cache as a transit buffer, eagerly evicting to persistent memory via background threads and bypassing full caches with direct PMem writes (Xu et al., 10 Mar 2024). This multi-core–driven blockwise transit design enables substantial I/O latency reductions (up to $3.6\times$ over baseline); a toy transit-buffer sketch is also shown after the list.
  • Data-parallel compute tasks: In Spark and related frameworks, the Least Effective Reference Count (LERC) policy preserves entire peer-groups of blocks needed together for a computation, thus optimizing the effective cache hit ratio and minimizing redundant in-memory block retention (Yu et al., 2017). The effective cache hit ratio quantifies the fraction of block accesses actually enabling task speedup.
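
The greedy, size-matched allocation described for AdaCache can be sketched roughly as follows; the block-size menu, interval representation, and function name are hypothetical simplifications rather than the system's actual interface.

```python
def allocate_blocks(missing_intervals, block_sizes):
    """Greedily cover each missing byte interval [start, end) with the
    largest permissible block size that still fits, falling back to the
    smallest size for any remainder (illustrative policy only)."""
    sizes = sorted(block_sizes, reverse=True)
    allocations = []  # (offset, block_size) pairs
    for start, end in missing_intervals:
        offset = start
        while offset < end:
            remaining = end - offset
            # Largest permissible block that fits; otherwise the smallest block.
            size = next((s for s in sizes if s <= remaining), sizes[-1])
            allocations.append((offset, size))
            offset += size
    return allocations

# A request found bytes [0, 4096) and [16384, 16896) missing from the cache.
print(allocate_blocks([(0, 4096), (16384, 16896)],
                      block_sizes=[4096, 1024, 512]))
```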
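
The transit-buffer idea behind Caiti can likewise be caricatured as a bounded staging queue drained by a background thread, with writes bypassing the buffer when it is full. Every name and capacity below, and the dict standing in for PMem, is a hypothetical toy rather than the actual Caiti implementation.

```python
import queue
import threading

class TransitCache:
    """Toy transit buffer: writes land in a bounded DRAM queue that a
    background thread eagerly drains to a 'persistent' store; when the
    queue is full, the write bypasses DRAM and goes straight to PMem."""

    def __init__(self, capacity: int = 1024):
        self.dram = queue.Queue(maxsize=capacity)
        self.pmem = {}                      # stand-in for persistent memory
        self._drain = threading.Thread(target=self._drain_loop, daemon=True)
        self._drain.start()

    def _drain_loop(self):
        while True:
            key, value = self.dram.get()    # eager background eviction
            self.pmem[key] = value
            self.dram.task_done()

    def write(self, key, value):
        try:
            self.dram.put_nowait((key, value))
        except queue.Full:
            self.pmem[key] = value          # bypass a full transit buffer

    def flush(self):
        self.dram.join()                    # wait for in-flight evictions

cache = TransitCache(capacity=8)
for i in range(100):
    cache.write(f"block-{i}", b"x" * 4096)
cache.flush()
print(len(cache.pmem))                      # all 100 blocks persisted
```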

5. BWCache in Modern Generative and Inference Workloads

Block-wise caching is increasingly integral to the efficient deployment of large-scale generative and inference models:

  • Diffusion and Vision Models: In image and video diffusion transformers, blockwise feature caching is enabled by the empirical observation that block outputs (e.g., transformer layer or DiT-block activations) vary minimally for long stretches of the denoising process (Wimbauer et al., 2023, Cui et al., 17 Sep 2025). A relative change metric, such as

$$\text{L1}_{\text{rel}}(h_i, t) = \frac{\|h_{t, i} - h_{t+1, i}\|_1}{\|h_{t+1, i}\|_1},$$

is used to decide when to reuse cached outputs. For video diffusion, the U-shaped variation in block features justifies blockwise feature cache/reuse with a dynamic similarity threshold $\delta$. Periodic recomputation mitigates potential latent drift. Experiments demonstrate $1.61\times$–$2.24\times$ speedups with minimal visual fidelity loss (Cui et al., 17 Sep 2025); a schematic reuse sketch follows the list.

  • Diffusion Policy Acceleration: In visuomotor control, Block-wise Adaptive Caching (BAC) leverages per-block adaptive update scheduling (via a DP-maximized similarity score) and a Bubbling Union Algorithm to prevent inter-block error surges in feed-forward networks, achieving up to $3\times$ inference speedup with no accuracy loss (Ji et al., 16 Jun 2025).
  • LLM KV Cache Pruning: For transformer-based LLMs using a KV cache, PagedEviction provides a blockwise eviction strategy naturally aligned with vLLM's PagedAttention paged memory layout (Chitty-Venkata et al., 4 Sep 2025). Here, block (or page) importance is scored using the ratio of the $L_2$ norms of the Value and Key tensors, and blocks are evicted in their entirety, thereby reducing memory fragmentation, update overhead, and latency. Extensive LongBench evaluations support superior throughput and memory efficiency relative to token-level and unstructured pruning baselines; a toy blockwise eviction sketch is also shown after the list.
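
The reuse decision behind the diffusion-side caches can be illustrated with the relative-change metric above. The sketch below is schematic (the threshold, refresh interval, and tensor shapes are placeholders, and the stability test is a simplification of the cited methods): a block is skipped while its output barely changed between the two most recently computed steps, with periodic forced recomputation to limit drift.

```python
import numpy as np

def l1_rel(new, old, eps=1e-8):
    """Relative L1 change between a block's outputs at consecutive steps."""
    return np.abs(new - old).sum() / (np.abs(old).sum() + eps)

class BlockFeatureCache:
    """Schematic blockwise feature cache for an iterative denoiser: a block
    is skipped while its output changed by less than delta between the two
    most recently computed steps, with a forced refresh every few steps."""

    def __init__(self, delta=0.05, refresh_every=5):
        self.delta = delta
        self.refresh_every = refresh_every
        self.prev = {}       # block_id -> last computed output
        self.skipping = {}   # block_id -> currently reusing cached output?
        self.age = {}        # block_id -> steps since last recompute

    def forward_block(self, block_id, compute_fn):
        age = self.age.get(block_id, 0)
        if self.skipping.get(block_id) and age < self.refresh_every:
            self.age[block_id] = age + 1
            return self.prev[block_id]                 # cache hit: reuse features
        out = compute_fn()                             # recompute the block
        if block_id in self.prev:
            self.skipping[block_id] = l1_rel(out, self.prev[block_id]) < self.delta
        self.prev[block_id] = out
        self.age[block_id] = 0
        return out

# Toy usage: a "block" whose output drifts very slowly across denoising steps.
rng = np.random.default_rng(0)
base = rng.standard_normal(512)
recomputed = 0
cache = BlockFeatureCache(delta=0.05, refresh_every=5)
for t in range(50):
    def compute():
        global recomputed
        recomputed += 1
        return base + 1e-4 * t
    cache.forward_block("dit_block_7", compute)
print(f"recomputed {recomputed} of 50 steps")   # most steps reuse the cache
```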
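
The paged KV eviction can likewise be sketched at block granularity: score each block of cached keys and values by the $\|V\|_2 / \|K\|_2$ ratio described above and drop the lowest-scoring block in full. Block size, tensor shapes, and the eviction trigger below are hypothetical.

```python
import numpy as np

def block_scores(keys, values, block_size):
    """Score each KV block by ||V_block||_2 / ||K_block||_2, following the
    blockwise importance heuristic described above (shapes are illustrative:
    [seq_len, head_dim] for a single layer/head)."""
    n_blocks = keys.shape[0] // block_size
    scores = []
    for b in range(n_blocks):
        sl = slice(b * block_size, (b + 1) * block_size)
        k_norm = np.linalg.norm(keys[sl]) + 1e-8
        v_norm = np.linalg.norm(values[sl])
        scores.append(v_norm / k_norm)
    return np.array(scores)

def evict_one_block(keys, values, block_size):
    """Drop the lowest-scoring block in its entirety and return the
    compacted KV tensors (a toy stand-in for freeing one cache page)."""
    scores = block_scores(keys, values, block_size)
    worst = int(np.argmin(scores))
    keep = np.ones(keys.shape[0], dtype=bool)
    keep[worst * block_size:(worst + 1) * block_size] = False
    return keys[keep], values[keep], worst

rng = np.random.default_rng(1)
K = rng.standard_normal((64, 128))   # 64 cached tokens, head_dim 128
V = rng.standard_normal((64, 128))
K2, V2, evicted = evict_one_block(K, V, block_size=16)
print(f"evicted block {evicted}; cache shrank {K.shape[0]} -> {K2.shape[0]} tokens")
```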

6. Comparative Analysis and Systemic Implications

Across algorithmic and system domains, BWCache outperforms itemwise or classic caching strategies by:

  • Reducing systemwide data movement: Coded blockwise caching in networks achieves 6×–14.5× reductions in transmission rates compared to LFU (Hachem et al., 2014).
  • Lowering memory and metadata overhead: By matching cache allocation granularity to I/O size and using group-based allocation, AdaCache attains up to 74% I/O reduction to backend storage and 41% metadata overhead savings (Yang et al., 2023).
  • Improving throughput and scaling: Blockwise cache pruning in LLM inference improves throughput by up to 37% and reduces per-token latency by 10–12% (Chitty-Venkata et al., 4 Sep 2025).
  • Handling spatial and temporal locality: Optimal and competitive-ratio–bounded schemes (e.g., IBLP, submodular LP relaxation) close the gap between worst-case and optimal performance, at the expense of increased complexity or parameter sensitivity (Beckmann et al., 2022, Coester et al., 2022).
  • Preserving practical deployment advantages: Most modern BWCache algorithms require no kernel modification or model retraining, enabling plug-and-play deployment in cloud, data-parallel, and generative settings (Yang et al., 2023, Chitty-Venkata et al., 4 Sep 2025, Wimbauer et al., 2023).

7. Open Questions and Theoretical Challenges

BWCache raises several open directions:

  • Complexity and optimality: The block-aware caching problem is NP-complete in the offline case, and worst-case competitive ratios suffer from block size dependencies (Coester et al., 2022, Beckmann et al., 2022).
  • Dynamic policy adaptation: Performance can depend sensitively on the choice of thresholding, update schedule, or block size. Two-level or adaptive policies seek to balance overhead with cache efficiency.
  • Extended locality models: Recent models employ dual-parametric locality functions $f(n)$ (items) and $g(n)$ (blocks) to more accurately predict upper/lower bounds and guide design choices (Beckmann et al., 2022).
  • Resource heterogeneity: Joint caching and forwarding frameworks in heterogeneous memory hierarchies (DRAM, PMem, Flash) require cost-aware block orchestration, solved via drift-plus-penalty optimization and virtual control plane–data plane splits (Mutlu et al., 2023).

BWCache thus represents a convergence of algorithmic, architectural, and application-driven advances in block granularity management. Its future evolution will likely continue to be driven by resource diversity, emergent workload patterns, and the pressing demands for efficiency in AI and edge systems.
