Coarse-Grained Cache Overview
- Coarse-grained cache is a caching architecture that works at block or module level, balancing spatial and temporal locality to reduce redundancy.
- In generative models, it reuses stabilized block outputs to achieve up to a 63% MAC reduction; in hardware, set-granular partitioning improves security through domain isolation.
- This approach streamlines metadata management and scalability, though it may trade off precision for significant gains in efficiency and resistance to side-channel attacks.
A coarse-grained cache, in the context of both hardware and algorithmic systems, refers to a caching architecture or policy operating at a block, module, or large-structure granularity, rather than at the level of individual elements or fine-grained data objects. Across domains, coarse-grained caches balance spatial and temporal locality to reduce redundancy, enhance throughput, or limit side-channel leakage, frequently trading some precision or flexibility for significant efficiency and scalability gains.
1. Definitions and Core Principles
Coarse-grained caches, sometimes termed block-level caches, operate by storing, tracking, and reusing entire data blocks, sets, or computational module outputs as atomic cache entries. In hardware systems, this may correspond to partitioning a cache into exclusive "chunks" or sets, granting entire regions to different protection domains. In generative machine learning, coarse-grained caching refers to bypassing recomputation of whole network blocks (e.g., encoder–decoder modules) when outputs stabilize, thus reducing unnecessary compute in iterative inference processes. The fundamental mechanism is the recognition and exploitation of redundancy at a level above the minimum data or task granularity (Liu et al., 14 Nov 2025; Dessouky et al., 2021; Beckmann et al., 2022).
2. Mathematical and Algorithmic Formulation
Coarse-grained caches are characterized by policies and mechanisms defined over sets, blocks, or module outputs:
- Block-level feature reuse (deep generative models): let $t = 1, \dots, T$ index the iterative inference steps, with control-module outputs $c_t$ and generation-module states $g_t$, respectively. A coarse-grained cache identifies the step $t^*$ after which control features stabilize (e.g., pairwise cosine similarity $\mathrm{sim}(c_{t-1}, c_t) \geq \tau$ for a threshold $\tau$) and reuses $c_{t^*}$ in subsequent steps: $c_t \leftarrow c_{t^*}$ for $t > t^*$.
- Block-aware caching (classical caching): the classic granularity-change caching model defines a universe $U$ partitioned into disjoint blocks of size $B$. When a cache miss occurs for an item $x$ in block $b$, loading any subset $S \subseteq b$ with $x \in S$ costs 1, regardless of $|S|$. Optimal strategies therefore bring in all anticipated requests from $b$ at once (Beckmann et al., 2022).
- Hardware set/chunk partitioning: in set-associative caches, coarse-grained policies (e.g., Chunked-Cache) assign disjoint sets to entire protection domains. A domain's chunk comprises $S_d$ sets, yielding a private capacity of $S_d \cdot W \cdot L$ bytes for associativity $W$ and line size $L$ (Dessouky et al., 2021).
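The stabilize-then-reuse policy described above can be sketched in a few lines. This is a minimal illustration, not the cited implementation; the class name `BlockOutputCache`, the `tau` threshold, and the use of plain Python lists as block outputs are all assumptions for the sketch:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two output vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class BlockOutputCache:
    """Coarse-grained cache: once a module's output stabilizes
    (cosine similarity to the previous step exceeds tau), freeze it
    and skip recomputation on all later steps."""
    def __init__(self, tau=0.999):
        self.tau = tau
        self.prev = None     # last computed output
        self.frozen = None   # cached output reused after stabilization

    def step(self, compute):
        # `compute` recomputes the block; it is skipped once frozen.
        if self.frozen is not None:
            return self.frozen, True      # cache hit: reuse stabilized output
        out = compute()
        if self.prev is not None and cosine_sim(self.prev, out) >= self.tau:
            self.frozen = out             # stabilized: freeze for reuse
        self.prev = out
        return out, False
```

Each iterative step calls `step` with a closure that runs the expensive module; after the similarity check fires, every subsequent call is a hit and the module is never re-executed.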
3. Applications and Impact
3.1. Generative Models and Vision
The Hybrid-Grained Cache (HGC) deploys block-level coarse caching in diffusion models to detect convergence in encoder–decoder outputs, then reuses these block outputs, drastically reducing MACs. On the COCO-Stuff benchmark with ControlNet, a coarse-grained cache achieves a 63% reduction in compute (from 18.22T to 6.70T MACs) and a 2× speedup, with only minor quality degradation (FID +0.3, CLIP Score −0.85), staying within a 1.5% semantic-fidelity drop (Liu et al., 14 Nov 2025):
| Method | FID ↓ | CLIP Score ↑ | MACs ↓ | Speed ↑ | Semantic-Loss ↓ |
|---|---|---|---|---|---|
| NoCache | 19.99 | 32.37 | 18.22T | 1.00× | — |
| HGC (coarse) | 20.29 | 31.52 | 6.70T | 2.02× | ~1.5 % |
Similar efficiency-quality trade-offs hold across semantic segmentation, edge, and depth conditioning tasks.
3.2. Hardware Security and Isolation
The Chunked-Cache architecture in multi-core processors partitions set-associative LLCs into exclusive chunks at set granularity. Each trusted domain receives a contiguous, private chunk of sets, blocking Prime+Probe and occupancy-based cache attacks across domains. This mechanism, compared to way-partitioning, supports more domains and improves L3 miss rates by up to 25%, with a 43% improvement in IPC and sub-5% performance overhead for domains of ≥512 sets (Dessouky et al., 2021).
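The set-granular partitioning idea can be modeled in a short sketch. This is a simplified software analogy under stated assumptions, not Chunked-Cache's actual hardware interface; the class name, method names, and default associativity/line-size values are illustrative:

```python
class ChunkedCacheAllocator:
    """Simplified model of set-granular cache partitioning: each
    protection domain owns an exclusive, contiguous range of sets,
    so no two domains ever contend for the same set."""
    def __init__(self, total_sets):
        self.total_sets = total_sets
        self.next_free = 0
        self.chunks = {}   # domain -> (first_set, num_sets)

    def allocate(self, domain, num_sets):
        """Carve out a private chunk of `num_sets` sets for a domain."""
        if self.next_free + num_sets > self.total_sets:
            raise MemoryError("no free sets left")
        self.chunks[domain] = (self.next_free, num_sets)
        self.next_free += num_sets
        return self.chunks[domain]

    def set_index(self, domain, addr, line_size=64):
        """Index only within the domain's private chunk."""
        first, n = self.chunks[domain]
        return first + (addr // line_size) % n

    def capacity_bytes(self, domain, ways=16, line_size=64):
        """Private capacity S_d * W * L for associativity W, line size L."""
        _, n = self.chunks[domain]
        return n * ways * line_size
```

Because `set_index` for one domain can never fall inside another domain's range, cross-domain Prime+Probe and occupancy measurements have nothing to observe.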
3.3. Side-Channel Attack and Defense
In Apple M1 and related SoCs, the SLC (System-Level Cache) is shared across CPU clusters and GPUs. A coarse-grained occupancy channel leverages large buffer sweeps (at 8 KiB stride) to sample aggregate cache usage, enabling cross-domain website fingerprinting (SLC: 90% top-1 accuracy), pixel stealing, and low-granularity (57-row) screen capture attacks. Countermeasures include L2 and SLC masking, which are most effective when the masking buffer's accesses bypass the L2 and directly evict SLC lines (Xu et al., 18 Apr 2025).
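The coarse-grained occupancy probe amounts to timing a strided sweep over a large buffer: the longer the sweep, the more of the attacker's buffer the victim has evicted. The sketch below is an assumption-laden illustration (a `bytearray` stands in for the eviction buffer, and Python-level timing is far noisier than the native attack):

```python
import time

def sweep_probe(buffer, stride=8 * 1024):
    """Occupancy probe sketch: touch the buffer once per `stride`
    bytes (one access per cache line / page region) and time the
    sweep. Slower sweeps imply more of the buffer was evicted by
    concurrent cache activity in another domain."""
    t0 = time.perf_counter_ns()
    touches = 0
    checksum = 0
    for off in range(0, len(buffer), stride):
        checksum += buffer[off]   # one memory access per stride
        touches += 1
    elapsed_ns = time.perf_counter_ns() - t0
    return elapsed_ns, touches
```

Repeating the probe at a fixed rate yields a time series of aggregate cache occupancy, which is the signal the fingerprinting classifier consumes.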
4. Computational and Theoretical Analysis
4.1. Performance and Complexity
In diffusion and controllable generation tasks, block-level cache policies yield substantial reductions in operations:
- Without cache: total compute is $C_{\text{base}} = \sum_{t=1}^{T} (C_c + C_g)$ MACs, where $C_c$ and $C_g$ are the per-step costs of the control and generation modules.
- With coarse cache: steps after stabilization skip the frozen modules, so total savings combine control-module and generative-module reductions, yielding an empirical acceleration ratio of roughly 2× (Liu et al., 14 Nov 2025).
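The compute-side numbers quoted in Section 3.1 can be checked directly; note that the MAC-level acceleration exceeds the 2.02× wall-clock speedup, since uncached modules and memory traffic still run at full cost:

```python
# Worked check of the COCO-Stuff/ControlNet figures quoted above.
macs_base = 18.22e12     # MACs without caching (NoCache row)
macs_cached = 6.70e12    # MACs with the coarse-grained cache (HGC row)

reduction = 1 - macs_cached / macs_base   # fraction of MACs eliminated (~63%)
mac_ratio = macs_base / macs_cached       # compute-side acceleration (~2.7x)
```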
4.2. Optimality and Competitive Ratios
In block-aware online caching, deterministic strategies are fundamentally limited by the block size $B$: both pure item caches and pure block caches suffer competitive-ratio blowups on adversarial request sequences, with lower bounds that grow with $B$. Hybrid policies such as the Item-Block Layered Partitioning (IBLP) algorithm, which maintains both item and block LRU lists over a split capacity, close the gap nearly optimally, adapting to locality structure and observed working-set growth (Beckmann et al., 2022). The dependence on the offline cache size $h$ differs from conventional paging, with the competitive ratios of two algorithms possibly inverting as $h$ varies.
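A hybrid item/block policy in the spirit of IBLP can be sketched with two LRU lists. This is a hedged simplification under assumed parameters (the promotion rule "fetch the whole block on its second recent miss" is an illustrative stand-in for the paper's actual layering policy):

```python
from collections import OrderedDict

class ItemBlockCache:
    """Sketch of a layered item/block policy: capacity is split between
    an item-granularity LRU and a block-granularity LRU, where a block
    entry covers all B items of its group (fetched at unit cost)."""
    def __init__(self, item_cap, block_cap, B):
        self.B = B
        self.item_cap, self.block_cap = item_cap, block_cap
        self.items = OrderedDict()    # cached single items
        self.blocks = OrderedDict()   # cached whole blocks
        self.miss_count = {}          # recent misses per block

    def access(self, x):
        b = x // self.B
        if b in self.blocks:              # hit in the block layer
            self.blocks.move_to_end(b)
            return True
        if x in self.items:               # hit in the item layer
            self.items.move_to_end(x)
            return True
        # Miss: unit cost regardless of how much of block b we fetch.
        self.miss_count[b] = self.miss_count.get(b, 0) + 1
        if self.miss_count[b] >= 2:
            # Block shows spatial locality: fetch it whole.
            self.blocks[b] = None
            if len(self.blocks) > self.block_cap:
                self.blocks.popitem(last=False)
        else:
            self.items[x] = None
            if len(self.items) > self.item_cap:
                self.items.popitem(last=False)
        return False
```

Sparse, scattered accesses stay in the cheap item layer, while blocks that keep missing get promoted wholesale, which is exactly the adaptivity the hybrid bound exploits.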
5. Security, Privacy, and Isolation Implications
Coarse-grained caches are central to both side-channel resilience and side-channel attack vectors. In Chunked-Cache, carving out sets strictly isolates protection domains, nullifying both Prime+Probe and occupancy channels without requiring randomized set indexing or re-keying. Conversely, in M1 SLC, the absence of set-level partitioning exposes a high-bandwidth aggregate channel that attackers can leverage for coarse-grained fingerprinting and screen content extraction (Xu et al., 18 Apr 2025, Dessouky et al., 2021).
Countermeasures exploit the same coarseness by introducing masking buffers that actively and repeatedly fill the SLC or L2, rendering the occupancy channel noisy or unusable at low single-core performance overheads (below 5–7%).
6. Practical Trade-Offs, Scalability, and Tuning
Coarse-grained caches streamline metadata management and replacement tracking: maintaining per-block or per-set state is less costly than fine-grained element-level mechanisms. In the extended locality model, miss rate bounds can be tuned with locality functions $f_{\text{itm}}(w)$ (distinct items in any window of length $w$) and $f_{\text{blk}}(w)$ (distinct blocks in any window), guiding split decisions between the item and block caches (Beckmann et al., 2022).
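These locality functions can be estimated empirically from an access trace. The helper below is a sketch with assumed naming (`window_distinct` is not from the cited work); passing a block size `B` measures block-level locality, omitting it measures item-level locality:

```python
def window_distinct(trace, w, B=None):
    """Empirical locality profile: the maximum number of distinct
    items (or blocks, if block size B is given) observed in any
    sliding window of length w over the trace."""
    key = (lambda x: x // B) if B else (lambda x: x)
    best = 0
    for i in range(len(trace) - w + 1):
        best = max(best, len({key(x) for x in trace[i:i + w]}))
    return best
```

A trace whose block-level profile grows much more slowly than its item-level profile has strong spatial locality, favoring a larger share of capacity for the block cache.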
Chunked-Cache scales to practical multi-core and TEE systems, supporting many concurrent protection domains, with empirical demonstrations of at least 32 isolated domains at moderate cache sizes. The efficacy of hybrid or dynamically tuned coarse-grained caches increases with mixed spatial and temporal locality, and adaptive schemes that track locality characteristics in real workloads approach optimality up to constant factors.
References:
- (Liu et al., 14 Nov 2025): Accelerating Controllable Generation via Hybrid-grained Cache
- (Xu et al., 18 Apr 2025): EXAM: Exploiting Exclusive System-Level Cache in Apple M-Series SoCs for Enhanced Cache Occupancy Attacks
- (Beckmann et al., 2022): Spatial Locality and Granularity Change in Caching
- (Dessouky et al., 2021): Chunked-Cache: On-Demand and Scalable Cache Isolation for Security Architectures