Shared System-Level Cache
- Shared system-level cache is a centralized cache memory resource accessible by multiple cores that improves performance and enables QoS enforcement.
- It is a key architectural component in multi/many-core and heterogeneous systems, using methods like partitioning and coherence protocols to optimize efficiency.
- Design strategies include adaptive replacement policies, hardware/software scheduling, and security mechanisms to mitigate side-channel attacks.
A shared system-level cache (SLC), frequently synonymous with shared last-level cache (LLC), is a cache memory resource located at the highest level of the processor-internal memory hierarchy that is accessible by multiple cores, clusters, or heterogeneous processing engines. This architectural element serves as a critical arena for performance optimization, temporal isolation, coherence management, QoS enforcement, and security assurance in both general-purpose and domain-specific multicore systems.
1. Fundamental Architectural Principles
System-level caches are physically centralized resources (typically L2 or L3) in multi/many-core SoCs and heterogeneous systems. They may be organized as inclusive, exclusive, or non-inclusive with respect to lower-level caches. The physical organization can be set-associative and banked, with replacement policies up to and including pseudo-random, as in the exclusive 8 MiB, 16-way, 128 B-line SLC of Apple M-series SoCs (indexed on PA[25:14]) (Xu et al., 18 Apr 2025). SLCs may be banked to provide increased memory-level parallelism, where independent banks/slices permit high-throughput concurrent access at the cost of bank-level contention (Sullivan et al., 17 Oct 2024).
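As a concrete illustration, the sketch below derives the set geometry implied by the parameters reported for the Apple M-series SLC (Xu et al., 18 Apr 2025); the helper names are illustrative, and the index extraction simply follows the reported PA[25:14] field.

```python
# Set geometry sketch for an SLC with the parameters reported for Apple
# M-series SoCs (Xu et al., 18 Apr 2025): 8 MiB capacity, 16 ways, 128 B
# lines, set index drawn from PA[25:14]. Names are illustrative.

LINE_BYTES = 128                               # 2^7-byte lines
WAYS       = 16
CAPACITY   = 8 * 1024 * 1024
NUM_SETS   = CAPACITY // (WAYS * LINE_BYTES)   # 4096 sets -> 12 index bits

def slc_set_index(pa: int) -> int:
    """Extract the 12-bit set index from PA[25:14]."""
    return (pa >> 14) & (NUM_SETS - 1)

# Two addresses that differ only below bit 14 map to the same set, which
# is what eviction-set construction for occupancy attacks exploits.
assert slc_set_index(0x1234_8000) == slc_set_index(0x1234_9F80)
```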
Sharing characteristics depend on workload and platform. CPUs, GPUs, and other agents (e.g., in heterogeneous SoCs) may share the SLC with distinct inclusion/exclusion properties (exclusive for CPU, inclusive for GPU in Apple M-series (Xu et al., 18 Apr 2025)). Thus, the SLC acts both as a performance-critical resource and a locus of interference and security vulnerability.
2. Data Coherence, Isolation, and Partitioning
Coherence of shared system-level caches is traditionally maintained using directory-based invalidation protocols such as MESI/MOESIF, as implemented in programmable or fixed-function engines (e.g., the BlackParrot-BedRock MOESIF directory (Wyse, 2 May 2025)). Directoryless approaches such as DLS (Liu et al., 2012) leverage weak memory consistency and speculation to eliminate directories, dramatically cutting area, network traffic, and energy at the cost of modest additional complexity.
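To make the directory's bookkeeping concrete, here is a toy MESI-style invalidation sketch (a minimal model, not the BedRock MOESIF engine of (Wyse, 2 May 2025); the `Directory` and `DirEntry` names are assumptions):

```python
# Toy directory-based invalidation sketch (MESI-style, heavily simplified).
# The directory tracks, per line, which cores hold a copy and whether one
# of them holds it dirty.

from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass
class DirEntry:
    sharers: Set[int] = field(default_factory=set)  # cores holding the line
    owner: Optional[int] = None                     # core with the dirty copy

class Directory:
    def __init__(self) -> None:
        self.entries: Dict[int, DirEntry] = {}

    def read(self, core: int, line: int) -> None:
        e = self.entries.setdefault(line, DirEntry())
        if e.owner is not None and e.owner != core:
            # Downgrade the owner: force writeback, keep it as a sharer.
            e.sharers.add(e.owner)
            e.owner = None
        e.sharers.add(core)

    def write(self, core: int, line: int) -> None:
        e = self.entries.setdefault(line, DirEntry())
        # Invalidate every other copy before granting exclusive ownership.
        e.sharers = {core}
        e.owner = core
```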
Partitioning enables spatial isolation of SLC lines. Hardware like Intel CAT exposes “way masks” for each core or group (CLOS), and the OS or VMM can assign these statically or dynamically to applications for soft or hard QoS goals (Chatterjee et al., 2021, Saez et al., 12 Feb 2024, Sprabery et al., 2017). Partitioning aids temporal isolation (satisfying per-tenant SLAs (Kim et al., 2019)) and is a foundation for security (side-channel elimination via cache partitioning (Sprabery et al., 2017)) and mixed-criticality scheduling (dynamic redistribution at mode change (Awan et al., 2017)).
Partitioning and clustering strategies may assign partitions to cores singly or in groups, or dynamically move ways between domains guided by fairness, slowdown minimization, or cache sensitivity classifications (e.g., LFOC+’s classification and Pair-Clustering heuristic (Saez et al., 12 Feb 2024)).
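On Linux, way-mask partitioning of this kind is typically driven through the resctrl filesystem. The following is a minimal sketch assuming a mounted /sys/fs/resctrl and sufficient privileges; the group name, way masks, and `create_clos` helper are illustrative:

```python
# Sketch of way-based LLC partitioning via Linux resctrl (Intel CAT).
# Assumes resctrl is mounted at /sys/fs/resctrl and the caller has the
# privileges to write there; values below are illustrative.

import os

RESCTRL = "/sys/fs/resctrl"

def create_clos(name: str, l3_mask: int, pids: list, domain: int = 0) -> None:
    """Create a CLOS group, assign it an L3 way mask, and attach tasks."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # The way mask must be a contiguous run of set bits on most CAT hardware.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:{domain}={l3_mask:x}\n")
    tasks_path = os.path.join(group, "tasks")
    for pid in pids:
        # resctrl expects one PID per write() call.
        with open(tasks_path, "w") as f:
            f.write(str(pid))

# e.g. give a latency-sensitive tenant 4 dedicated ways:
# create_clos("tenant_a", 0xF00, [1234])
```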
3. Sharing Control, Scheduling, and Predictability
Enabling multiple agents or cores to share partitions increases cache space utilization and hardware efficiency, but risks unbounded contention and WCET (worst-case execution time) inflation. Safety-critical and real-time systems use hardware/software contracts to control contention.
Mechanisms include:
- Static and dynamic LLC partition assignment, possibly grouped by application criticality (Wu et al., 2022, Awan et al., 2017).
- Arbitration via time-division multiplexing (TDM) buses and set sequencers for predictable queueing and bounded worst-case latency (WCL) (Wu et al., 2022); a minimal TDM sketch appears after this list.
- Compiler- and ML-guided allocation frameworks (e.g., Com-CAS), where phase-aware application behavior is predicted and CAT allocations are adjusted "just-in-time" (Chatterjee et al., 2021).
- Per-bank bandwidth regulation, which multiplexes access to cache banks in order to block denial-of-service at the true contention point (bank, not cache-wide) (Sullivan et al., 17 Oct 2024).
By judiciously grouping tasks by working set, criticality, and sharing requirements, these methods balance utilization, fairness, throughput, and predictability.
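To make the latency bound concrete, here is a minimal TDM arbitration sketch (helper names and parameters are illustrative, not the set-sequencer design of (Wu et al., 2022)). With C requesters and S-cycle slots, a requester that just missed its slot waits at most (C-1)*S cycles before its slot begins, so the bound holds by construction.

```python
# Minimal TDM arbitration sketch for a shared-cache bus.

def tdm_owner(cycle: int, num_cores: int, slot_cycles: int) -> int:
    """Core that owns the shared-cache bus at a given cycle."""
    return (cycle // slot_cycles) % num_cores

def worst_case_wait(num_cores: int, slot_cycles: int) -> int:
    """Upper bound on cycles a just-missed requester waits for its slot."""
    return (num_cores - 1) * slot_cycles

assert tdm_owner(cycle=0, num_cores=4, slot_cycles=10) == 0
assert tdm_owner(cycle=25, num_cores=4, slot_cycles=10) == 2
assert worst_case_wait(4, 10) == 30
```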
4. Replacement, Management, and Data Sharing Policies
Shared SLCs must balance eviction and protection of lines according to locality, hotness, and sharing patterns. Approaches include:
- Reuse- and sharing-aware replacement (e.g., SRCP), which equips each line with per-core and global counters to prioritize high-reuse, highly shared lines and prevent unnecessary replication across partitions (Ghosh et al., 2022); a victim-selection sketch follows this list.
- Pairwise instruction-data management (e.g., Garibaldi), where instruction lines with high miss cost (i.e., whose misses gate access to hot data) are selectively protected from eviction, and the data lines paired with those instruction misses may be prefetched (Kwon et al., 24 May 2025).
- Hybrid dedicated/shared cache regions for private-cloud workloads: a portion is reserved for each tenant to guarantee minimum hit-rate (“hard” SLA), while the rest is globally pooled for opportunistic performance (“soft” SLA). Victim-selection is guided by per-tenant gap to target (Kim et al., 2019).
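As an illustration of counter-guided victim selection in the spirit of SRCP (Ghosh et al., 2022), the sketch below attaches per-core reuse counters to each line and evicts the lowest-scoring line; the scoring weights and class names are assumptions, not the paper's exact policy:

```python
# Reuse- and sharing-aware victim selection (illustrative scoring).

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Line:
    tag: int
    reuse: Dict[int, int] = field(default_factory=dict)  # per-core reuse counts

    def touch(self, core: int) -> None:
        self.reuse[core] = self.reuse.get(core, 0) + 1

    def score(self) -> int:
        # High total reuse and many distinct sharers both protect a line.
        return sum(self.reuse.values()) + 2 * len(self.reuse)

def pick_victim(cache_set: List[Line]) -> Line:
    """Evict the line with the least reuse and the fewest sharers."""
    return min(cache_set, key=lambda line: line.score())
```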
Profiling and tuning of these policies may use trace-driven reuse-distance analysis, yielding aggregated reuse-distance histograms that predict the miss-rate impact of contention for rapid design-space exploration at the cache configuration phase (Ho et al., 2021).
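A minimal sketch of this style of analysis, assuming the standard fully associative LRU approximation (function names are illustrative): references whose reuse distance meets or exceeds the cache capacity miss, as do cold references.

```python
# Trace-driven reuse-distance analysis sketch (fully associative LRU model).

from collections import OrderedDict
from math import inf

def reuse_distances(trace):
    """Yield one reuse distance per reference (inf for cold references)."""
    stack = OrderedDict()                      # LRU stack, MRU at the end
    for line in trace:
        if line in stack:
            keys = list(stack)
            dist = len(keys) - 1 - keys.index(line)  # distinct lines since last use
            stack.move_to_end(line)
            yield dist
        else:
            stack[line] = None
            yield inf

def predicted_miss_rate(trace, cache_lines: int) -> float:
    """References with distance >= capacity (or cold) are predicted misses."""
    dists = list(reuse_distances(trace))
    return sum(1 for d in dists if d >= cache_lines) / len(dists)

# Trace A B A B C A -> distances inf, inf, 1, 1, inf, 2; a 2-line cache
# misses the three cold refs plus the distance-2 reuse of A.
assert abs(predicted_miss_rate("ABABCA", 2) - 4 / 6) < 1e-9
```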
5. Security and Side-Channel Mitigation
SLCs are prime targets for cross-core, cross-domain side-channel attacks due to shared state visibility:
- Timing-based attacks, such as Flush+Reload and occupancy-based probes, can leak cryptographic keys or infer cross-component behavior (e.g., CPU-GPU occupancy leakage in the Apple M SLC (Xu et al., 18 Apr 2025)).
- Defenses include partitioning via CAT, domain-aware co-scheduling, and time-based cache-flushing (“state-cleansing”) on context switch (Sprabery et al., 2017).
- Hardware extensions (e.g., TimeCache) enforce "first-access misses," leveraging per-line, per-context s-bits and load-time timestamps to ensure that any process's first access to a line loaded by another always incurs a miss, blocking classical cache-reuse side channels (Ojha et al., 2020); a simplified model follows this list.
- Application-visible cacheability controls (the INC-OC memory type) permit selected data to be cached only in the shared SLC level, bypassing private caches entirely and eliminating coherence overheads and the associated unpredictability (Bansal et al., 2019).
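A simplified model of the first-access rule, omitting TimeCache's timestamp machinery (class and method names are assumptions, not the paper's interface):

```python
# Simplified model of TimeCache-style "first-access miss" enforcement
# (Ojha et al., 2020): each cached line keeps a per-context seen-set, and
# a context's first access to a line loaded by another context is served
# as a miss, so the line's mere presence leaks nothing.

class TimeCacheLine:
    def __init__(self, tag: int, loader_ctx: int) -> None:
        self.tag = tag
        self.seen = {loader_ctx}   # contexts that have already missed on it

    def access(self, ctx: int) -> str:
        if ctx in self.seen:
            return "hit"
        # First access by this context: pay miss latency, then mark seen.
        self.seen.add(ctx)
        return "miss (first access enforced)"

line = TimeCacheLine(tag=0x40, loader_ctx=0)
assert line.access(1).startswith("miss")   # attacker's probe sees a miss
assert line.access(1) == "hit"             # later accesses behave normally
```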
Methodologies for partitioning, cleansing, and first-access enforcement are evaluated on isolation strength, microbenchmark security, and system-level overhead, with state-of-the-art designs sustaining <2% runtime overhead and near-complete elimination of the targeted side channels (Ojha et al., 2020, Sprabery et al., 2017).
6. Heterogeneous and Multi-Domain Use Cases
System-level caches are critical for efficiently supporting contemporary heterogeneous platforms. Apple M-series SoCs exemplify SLC designs shared across high-performance, efficiency, and GPU clusters, with asymmetric inclusion policies (CPU exclusive, GPU inclusive) and pseudo-random replacement (Xu et al., 18 Apr 2025). This sharing enables resource multiplexing and performance scaling across diverse agents, but also exposes powerful occupancy-based side channels.
Emerging ML/AI inference services such as RAG-powered LLMs require shared, persistent KV caches across many instances, with additional layers of management (in-RAM LRU, disk-backed blobs, prefetching on queue wait, etc.) for throughput and latency optimization (Lee et al., 16 Apr 2025). Here, the shared-cache concept extends into the storage subsystem, but similar principles of prefetching, sharing, and capacity management apply.
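A two-tier sketch in that spirit: an in-RAM LRU front demotes evicted entries to on-disk blobs shared across instances. The `TieredKVCache` class, paths, and sizes are illustrative, and the paper's prefetch-on-queue-wait is omitted.

```python
# Two-tier KV-cache sketch: bounded in-RAM LRU backed by on-disk blobs.

import os
import pickle
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, ram_entries: int, disk_dir: str = "/tmp/kv_blobs"):
        self.ram = OrderedDict()            # in-RAM LRU tier, MRU at the end
        self.ram_entries = ram_entries
        self.disk_dir = disk_dir
        os.makedirs(disk_dir, exist_ok=True)

    def _blob_path(self, key: str) -> str:
        return os.path.join(self.disk_dir, f"{key}.pkl")

    def put(self, key: str, value) -> None:
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_entries:
            old_key, old_val = self.ram.popitem(last=False)  # evict LRU entry
            with open(self._blob_path(old_key), "wb") as f:
                pickle.dump(old_val, f)                      # demote to disk

    def get(self, key: str):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        path = self._blob_path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.put(key, value)     # promote back into RAM
            return value
        return None                  # miss: caller recomputes the KV entries
```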
7. Summary Table: Shared SLC Design Dimensions
| Dimension | Example Mechanisms / Features |
|---|---|
| Partitioning | Way-based (Intel CAT), per-bank, hybrid dedicated/shared (Saez et al., 12 Feb 2024, Kim et al., 2019, Sullivan et al., 17 Oct 2024) |
| Coherence/Consistency | Directory/BedRock MOESIF, directoryless/DLS, programmable engine (Wyse, 2 May 2025, Liu et al., 2012) |
| Arbitration | TDM arbitration, set/queue sequencers, adaptive reallocation (Wu et al., 2022, Awan et al., 2017) |
| Replacement/Eviction | Reuse-aware, sharing-aware, instruction-data pair management (Ghosh et al., 2022, Kwon et al., 24 May 2025) |
| Security/Isolation | CAT clustering/co-scheduling, state-cleansing, TimeCache (Sprabery et al., 2017, Ojha et al., 2020) |
| Bank/Locality Handling | Per-bank regulation, aggregated reuse-distance histograms (Sullivan et al., 17 Oct 2024, Ho et al., 2021) |
| Heterogeneous Sharing | Exclusive/inclusive hybrid, multi-cluster, GPU-aware (Xu et al., 18 Apr 2025) |
| Service/Cloud-Aware Mgmt | Multi-tenant hybrid SLAs, disk-based KV cache sharing (Kim et al., 2019, Lee et al., 16 Apr 2025) |
References
- "EXAM: Exploiting Exclusive System-Level Cache in Apple M-Series SoCs for Enhanced Cache Occupancy Attacks" (Xu et al., 18 Apr 2025)
- "Predictable Sharing of Last-level Cache Partitions for Multi-core Safety-critical Systems" (Wu et al., 2022)
- "LFOC+: A Fair OS-level Cache-Clustering Policy for Commodity Multicore Systems" (Saez et al., 12 Feb 2024)
- "Reuse-Aware Cache Partitioning Framework for Data-Sharing Multicore Systems" (Ghosh et al., 2022)
- "Timing Cache Accesses to Eliminate Side Channels in Shared Software" (Ojha et al., 2020)
- "A Hybrid Cache Architecture for Meeting Per-Tenant Performance Goals in a Private Cloud" (Kim et al., 2019)
- "Cache Where you Want! Reconciling Predictability and Coherent Caching" (Bansal et al., 2019)
- "Effective Cache Apportioning for Performance Isolation Under Compiler Guidance" (Chatterjee et al., 2021)
- "Per-Bank Bandwidth Regulation of Shared Last-Level Cache for Real-Time Systems" (Sullivan et al., 17 Oct 2024)
- "An Effective Early Multi-core System Shared Cache Design Method Based on Reuse-distance Analysis" (Ho et al., 2021)
- "The Open-Source BlackParrot-BedRock Cache Coherence System" (Wyse, 2 May 2025)
- "Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads" (Kwon et al., 24 May 2025)
- "Mixed-criticality Scheduling with Dynamic Redistribution of Shared Cache" (Awan et al., 2017)
- "A Novel Scheduling Framework Leveraging Hardware Cache Partitioning for Cache-Side-Channel Elimination in Clouds" (Sprabery et al., 2017)
- "Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs" (Lee et al., 16 Apr 2025)
- "DLS: Directoryless Shared Last-level Cache" (Liu et al., 2012)