Hybrid-Grained Cache (HGC)
- Hybrid-Grained Cache (HGC) is an architecture combining multiple cache granularities to optimize efficiency, performance isolation, and energy usage.
- It employs hybrid partitioning by splitting cache into dedicated and shared segments, providing strict guarantees alongside opportunistic resource sharing.
- HGC implementations demonstrate substantial capacity reductions, inference speedups, and side-channel resistance, making them well suited to cloud infrastructures and neural network inference.
A hybrid-grained cache (HGC) refers to any cache architecture or mechanism that deliberately combines different allocation, sharing, or protection granularities within a single cache hierarchy to achieve competing objectives—such as efficiency, performance isolation, computational savings, or security—that pure coarse- or fine-grained approaches cannot deliver. In distributed systems, cloud infrastructure, trusted execution, and neural network inference, HGC approaches leverage hybrid partitioning, multi-stage caching, or layered-granularity reuse to optimize specific trade-offs: per-user or per-task Quality of Service (QoS) guarantees, computational throughput, resistance to side-channel adversaries, or energy consumption, among others.
1. Foundational Principles of Hybrid-Grained Caching
Hybrid-grained caching emerged to address trade-offs inherent in homogeneous cache policies. Pure global sharing maximizes effective utilization but exposes tenants, security domains, or computational graph stages to potential interference, starvation, or leakage. Conversely, strict isolation (e.g., fixed slabs per tenant or domain) guarantees resource protection but causes chronic over- or under-provisioning due to workload nonuniformity.
HGC architectures hybridize these extremes, most often by statically or dynamically partitioning cache resources into segments—dedicated partitions with strict guarantees and shared pools managed by fair or opportunistic allocation (Kim et al., 2019, Dessouky et al., 2019). In modern ML, HGC can refer to the strategic co-deployment of block-level and module-level cache strategies along a computational graph, or to the hierarchical partitioning of features and intermediate representations with workload- and stage-sensitive cache replacement (Liu et al., 14 Nov 2025, Sung et al., 31 Oct 2025). The mechanism, objective, and interface of the HGC are determined by the target system's need for flexibility, determinism, and resource efficiency.
2. Hybrid-Grained Cache Architectures in Cloud and Trusted Environments
In multi-tenant, memory-bound private clouds, HGC architectures carve total cache capacity into (i) exclusive Dedicated Caches (DC₁, …, DC_N), each sized precisely to meet a tenant's strict minimum hit-rate (hard guarantee); and (ii) a global Shared Cache (SC) that allows all tenants to opportunistically borrow excess capacity in pursuit of aspirational (soft) hit-rate targets. This dual-grained structure enables efficient realization of Service Level Agreements (SLAs) at both minimum and aspirational levels, while minimizing cache overprovisioning (Kim et al., 2019).
Key components are:
- Performance Model: Each tenant has a hard (minimum) hit-rate target and a soft (aspirational) target. Hit rates are smoothed by an EWMA, and a gap metric quantifies each tenant's surplus or deficit relative to its soft target. Resource allocation first satisfies every hard target, then maximizes attainment of the soft targets.
- Analytical Partitioning: The dedicated cache size for each tenant is chosen so that, under the worst-case (minimum-skew) access distribution, the hard hit-rate target is still met. The shared region receives the remaining capacity and is subject to a max-min allocation, solved online by fair eviction algorithms.
- Hybrid Insert/Eviction Policy: Incoming requests may update DC, SC, or both, with promotion/demotion based on recency and on where the hit or miss occurred. Greedy policies ensure no tenant falls below its hard target, while SC capacity is allocated by max-min fairness or by "selfish sharing" (victims are chosen only among tenants already above their soft targets).
Empirical simulation with multi-tenant Zipf workloads demonstrated that HGC architectures can reduce cache capacity needed to guarantee target hit-rates by up to 49% vs. static partitioning and 36% vs. global sharing, without SLA violations (Kim et al., 2019).
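The dual-grained insert/eviction behavior described above can be sketched as follows. This is a minimal toy model, not the paper's implementation: the class and method names, the LRU structures, and the spill-from-DC-to-SC rule are assumptions for illustration.

```python
from collections import OrderedDict

class HybridGrainedCache:
    """Toy DC+SC cache: per-tenant dedicated LRU partitions plus a shared
    pool whose victims are picked by a max-min fairness rule over each
    tenant's surplus above its soft hit-rate target."""

    def __init__(self, dc_sizes, sc_size):
        self.dc = {t: OrderedDict() for t in dc_sizes}  # tenant -> LRU dict
        self.dc_sizes = dc_sizes
        self.sc = OrderedDict()                         # (tenant, key) -> LRU
        self.sc_size = sc_size
        self.hits = {t: 0 for t in dc_sizes}
        self.reqs = {t: 0 for t in dc_sizes}

    def gap(self, tenant, soft_target):
        """Surplus (+) or deficit (-) of observed hit rate vs. soft target."""
        r = self.reqs[tenant]
        return (self.hits[tenant] / r if r else 0.0) - soft_target

    def access(self, tenant, key, soft_targets):
        """Look up `key` for `tenant`; returns True on a hit."""
        self.reqs[tenant] += 1
        if key in self.dc[tenant]:                      # dedicated-cache hit
            self.dc[tenant].move_to_end(key)
            self.hits[tenant] += 1
            return True
        if (tenant, key) in self.sc:                    # shared-cache hit
            self.sc.move_to_end((tenant, key))
            self.hits[tenant] += 1
            return True
        # Miss: insert into DC; LRU overflow spills into the shared pool.
        self.dc[tenant][key] = True
        if len(self.dc[tenant]) > self.dc_sizes[tenant]:
            spilled, _ = self.dc[tenant].popitem(last=False)
            self._insert_sc(tenant, spilled, soft_targets)
        return False

    def _insert_sc(self, tenant, key, soft_targets):
        self.sc[(tenant, key)] = True
        if len(self.sc) > self.sc_size:
            # Max-min fair eviction: victimize the tenant with the largest
            # surplus above its soft target, dropping its oldest SC entry.
            tenants = {t for (t, _) in self.sc}
            victim = max(tenants, key=lambda t: self.gap(t, soft_targets[t]))
            oldest = next(tk for tk in self.sc if tk[0] == victim)
            del self.sc[oldest]
```

The key design point this sketch reflects is that hard guarantees live entirely in the per-tenant DC sizing, while fairness logic is confined to SC eviction, so no tenant's guaranteed capacity is ever victimized.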
In security-centric settings, such as trusted execution environments (TEEs), hybrid-grained cache designs like HybCache allocate special subcaches to isolated domains. Each cache set reserves ways as a fully-associative subcache with random replacement, ensuring domain isolation, while all other ways remain shared among non-isolated domains according to standard set-associative policies, providing full retention of performance for legacy or low-security software (Dessouky et al., 2019). The use of isolation domain tagging and soft partitioning enables low hardware cost (sub-0.1% area) and modest performance overhead (3.5–5%) only for isolated execution regions.
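A HybCache-style set can be sketched as below. This is illustrative only: the real design pools the reserved ways of all sets into one fully-associative subcache, which this sketch collapses into a single set, and the names (`HybSet`, `iso_ways`) are assumptions.

```python
import random

class HybSet:
    """One cache set in a HybCache-style design: the first `iso_ways` ways
    serve isolated domains with domain tags and random replacement; the
    remaining ways serve non-isolated code with conventional LRU."""

    def __init__(self, ways, iso_ways):
        self.iso = [None] * iso_ways   # entries: (domain, tag) or None
        self.normal = []               # LRU list of tags, most recent last
        self.normal_ways = ways - iso_ways

    def access_isolated(self, domain, tag):
        """Hit only on a matching (domain, tag) pair; random replacement on a
        miss removes the deterministic eviction order that timing attacks probe."""
        if (domain, tag) in self.iso:
            return True
        if None in self.iso:           # fill an empty way first
            self.iso[self.iso.index(None)] = (domain, tag)
        else:                          # otherwise evict uniformly at random
            self.iso[random.randrange(len(self.iso))] = (domain, tag)
        return False

    def access_normal(self, tag):
        """Non-isolated domains keep standard set-associative LRU behavior."""
        if tag in self.normal:
            self.normal.remove(tag)
            self.normal.append(tag)
            return True
        if len(self.normal) >= self.normal_ways:
            self.normal.pop(0)
        self.normal.append(tag)
        return False
```

Note how the domain tag in `access_isolated` means two isolated domains never hit on each other's lines even for the same address tag, which is the leakage-prevention property the hardware tags provide.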
3. Hybrid-Grained Caching in Accelerating Generative and Diffusion Models
Recent diffusion-based controllable generative models present unique opportunities for HGC by exploiting computational redundancy at multiple levels. In "Accelerating Controllable Generation via Hybrid-grained Cache," two orthogonal cache granularities are integrated (Liu et al., 14 Nov 2025):
- Coarse-Grained (Block-Level) Cache: Block-level caching targets entire encoder/decoder blocks. If cached block features in the control and generative modules remain highly similar to previous time steps (measured by cosine similarity), computations are skipped and features are reused throughout corresponding intervals.
- Fine-Grained (Prompt-Level) Cache: Prompt-level caching targets intra-block modules—specifically, cross-attention map computation, which is observed to converge rapidly during inference. Attention features are cached at early steps and reused at later steps, either by direct assignment or after fusion of batch-branch results.
HGC enables seamless bypassing and reuse of large portions of the generation pipeline. On COCO-Stuff segmentation, HGC achieves 63% MAC reduction (18.22T to 6.70T) and a 2× latency speedup, with semantic/perceptual metric degradation (CLIP Score) kept within 1.5% of baseline. Block-level mechanisms yield the bulk of savings, while prompt-level caching yields additional but lower gains (Liu et al., 14 Nov 2025). This approach generalizes across tasks, e.g., segmentation, edge/depth maps, and video generation.
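The block-level gate can be sketched as a cosine-similarity check against cached features. This is a minimal illustration under stated assumptions: the 0.99 threshold, the flat 1-D feature vectors, and the use of a cheap probe vector for the comparison are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class BlockCache:
    """Coarse-grained (block-level) reuse gate: a block's computation is
    skipped whenever its probe features stay close to the cached copy."""

    def __init__(self, threshold=0.99):
        self.threshold = threshold
        self.cached = {}        # block_id -> cached output features

    def step(self, block_id, compute_fn, probe_features):
        """Return (features, reused). `compute_fn` runs only on a cache miss."""
        prev = self.cached.get(block_id)
        if prev is not None and cosine(prev, probe_features) >= self.threshold:
            return prev, True          # reuse cached block output, skip compute
        out = compute_fn()             # recompute and refresh the cache
        self.cached[block_id] = out
        return out, False
```

Prompt-level (fine-grained) caching follows the same gate shape, applied to cross-attention maps inside a block rather than to whole block outputs.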
A closely related structure, the H2-Cache, employs a hierarchical dual-stage cache in diffusion models, arranging the computational graph into structure-defining and detail-refining stages, each with independently tunable cache thresholds (Sung et al., 31 Oct 2025). A pooled feature summarization mechanism reduces similarity check overhead, enabling up to 5.08× speedup at negligible quality loss (e.g., −0.07% CLIP-IQA). Dual thresholds balance speed-quality trade-offs, and ablations demonstrate that aggressive structure-stage caching with moderating detail-stage thresholds yield the best quality-efficiency Pareto.
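The pooled-summary check can be sketched as follows. This is a hedged illustration in the spirit of the dual-threshold design: the pool size, the L2 distance, and the two threshold values are assumptions, not H2-Cache's actual parameters.

```python
import math

def pooled_summary(features, pool=4):
    """Average-pool a flat feature vector into a short summary so the
    similarity check touches far fewer elements than a full-tensor norm."""
    return [sum(features[i:i + pool]) / pool
            for i in range(0, len(features), pool)]

def should_reuse(prev_summary, new_summary, threshold):
    """Reuse the cached stage output when the L2 distance between pooled
    summaries stays below that stage's threshold."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_summary, new_summary)))
    return dist < threshold

# Two independently tunable thresholds, one per stage (the dual-threshold idea):
STRUCTURE_THRESHOLD = 0.5   # aggressive caching for the structure-defining stage
DETAIL_THRESHOLD = 0.1      # conservative caching for the detail-refining stage
```

Keeping the structure-stage threshold looser than the detail-stage one mirrors the ablation finding that aggressive structure caching with moderate detail caching sits on the best quality-efficiency Pareto front.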
4. Insertion, Replacement, and Similarity Policies
HGC solutions rely on hybrid insertion and replacement mechanisms that operate differently within each cache region or computational stage. In tenant-aware cloud caching, the insertion policy performs LRU-like updates in the DC and max-min fair sharing in the SC. Eviction from the SC uses the gap metric to select victims, either unconditionally (fair sharing) or only from tenants with a positive gap (selfish sharing), thereby maximizing the minimum surplus above the soft targets.
For block-level reuse in diffusion models, cacheability is controlled by dynamic similarity thresholds (block-feature or cross-attention cosine similarity). In H2-Cache, thresholded L2 norms (or their pooled approximations) serve to bypass recomputation at both the structure and detail stages. The overall policy is to trigger cache reuse whenever the feature change falls below hand-tuned or analytically chosen cutoffs (Sung et al., 31 Oct 2025, Liu et al., 14 Nov 2025). Faster, quantized similarity estimation (e.g., pooled feature summarization) preserves the computational savings by avoiding full-tensor norms.
In hardware-centric, trusted isolation, replacement within subcaches is performed uniformly at random among available ways, defeating deterministic side-channel probing, while hits require domain-specific tags to prevent leakage (Dessouky et al., 2019).
5. Performance, Security, and Quality Trade-Offs
The hybrid-grained approach achieves explicit Pareto improvements over single-granularity schemes, as summarized below:
| Scenario | Capacity or Overhead Reduction | Guaranteed Property | Fidelity / Security |
|---|---|---|---|
| Multi-tenant cloud caching (Kim et al., 2019) | 36–49% less DRAM | Hard hit-rate targets met at all times | Bounded by design (QoS, SLA) |
| Controllable diffusion generation (Liu et al., 14 Nov 2025) | 63% MACs, 2× latency | FID/CLIP within 2% and 1% of baseline, respectively | Fine structure preserved |
| Diffusion model acceleration (H2-Cache) (Sung et al., 31 Oct 2025) | 5× speedup, ~15% faster similarity checks | CLIP-IQA −0.07%; PSNR, SSIM, FID nearly unchanged | Avoids detail blurring |
| Hardware side-channel isolation (Dessouky et al., 2019) | 0% overhead for non-isolated domains | Isolation-domain separation robust against cache-timing attacks | 3.5–5% IPC overhead when isolated |
The practicality of HGC depends critically on module and domain sensitivities: aggressive caching yields diminishing returns or visible artifacts in dynamic or structurally unstable workloads, while overly conservative partitioning reverts to the performance or silicon costs of the naïve schemes. In the security context, the occupancy channel remains open unless strict partitioning is enforced.
6. Limitations, Extensions, and Future Directions
HGC effectiveness is contingent on workload stationarity and on early stabilization of features or attention in generative models. Reported limitations include degraded fidelity under rapid geometric or dynamic changes, and in scenarios where fixed cache settings (static intervals and thresholds) fail to adapt. Thresholds and cache intervals are typically hand-tuned; methods such as reinforcement learning or variance-aware adaptation have been proposed to automate tuning.
In trust-domain caches, the main limitation is the persistence of the aggregate occupancy channel, requiring further techniques for complete leakage closure. Hardware costs and latency are well bounded for modern cache sizes but must be tracked for scaling to future architectures.
Planned extensions include adaptive cache interval scheduling, attention-guided invalidation, direct extension to multimodal or autoregressive models, per-thread isolation domains, and extension to non-cache residency structures (e.g., TLBs or page-walk caches) (Liu et al., 14 Nov 2025, Sung et al., 31 Oct 2025, Dessouky et al., 2019).
7. Application and Integration Guidelines
For practitioners, HGC schemes require careful analytical calibration:
- In private-cloud HGC, dedicated cache sizes should be computed under the worst-case access pattern (least Zipf skew), with the SC sized from the remaining capacity. EWMA hit-rate smoothing with a coefficient on the order of 0.2 is recommended. Max-min fair sharing is achieved by greedy SC eviction or selfish sharing, with parameter choices based on performance/target trade-offs (Kim et al., 2019).
- In generative models, interval and threshold selection is the major lever. On segmentation and edge/depth-guided generation, carefully chosen similarity thresholds and moderate interval scaling balance efficiency and quality. Prompt-level (fine) caching is most beneficial when cross-attention convergence is empirically observed (Liu et al., 14 Nov 2025).
- In hardware HGC, the subcache way count should balance security margin (attack cost) against area and latency overhead. Kernel and firmware support for isolation-domain tagging is essential for deployment (Dessouky et al., 2019).
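The EWMA smoothing and gap computation from the private-cloud guideline can be sketched as below (function names and the exact coefficient are illustrative choices, not the paper's notation):

```python
def ewma(prev, sample, alpha=0.2):
    """Exponentially weighted moving average of the observed hit rate.
    A coefficient on the order of 0.2 follows the guideline above; the
    exact value is a tuning choice."""
    return alpha * sample + (1 - alpha) * prev

def gap(smoothed_hit_rate, soft_target):
    """Positive gap = surplus above the aspirational (soft) target;
    fair and selfish-sharing eviction both prefer victims with the
    largest gap."""
    return smoothed_hit_rate - soft_target
```

A smaller alpha makes allocations more stable but slower to react to workload shifts, which is exactly the trade-off the smoothing coefficient controls.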
Hybrid-grained cache designs represent an emergent paradigm in data and instruction caching, fusing coarse and fine granularities to support system- and application-level QoS, computational cost minimization, and isolation, while avoiding the inherent drawbacks of pure partitioning or pure sharing.