Near-Cache-Slice Computing

Updated 11 April 2026

Near-cache-slice computing is a paradigm that integrates programmable logic within LLC slices to exploit high internal bandwidth and spatial locality.
By aligning data placement with thread scheduling, it reduces cache coherence and interconnect overhead, with reported system speedups of up to 25% on parallelized, LLC-bound workloads.
Custom accelerators such as CNC, ARCANE, and Casper report throughput gains of up to 1.8× and energy-efficiency gains of up to 30× over conventional architectures.

Near-cache-slice computing is a class of architectural techniques and allocation strategies that place programmable computation or cache-control logic close to, or directly within, the physical slices of a modern multi-slice last-level cache (LLC). By aligning data placement, thread scheduling, and domain-specific hardware units with cache slice organization, this paradigm exploits the high internal bandwidth and spatial/temporal locality within each slice to overcome data movement bottlenecks, minimize coherence and ring/interconnect overhead, extend fine-grained resource partitioning, and enable in-situ or near-memory acceleration for diverse workloads including cryptography, machine learning, networking, and scientific computing.

1. Foundational Principles and Architectural Basis

Near-cache-slice computing relies fundamentally on the physical subdivision of the LLC into discrete, independently addressable “slices,” each typically associated topologically with a subset of CPU cores or interconnect nodes. On processors such as Intel Sandy Bridge, each LLC slice is mapped via a two-stage, partially secret hash function that assigns every physical cache line to exactly one slice and set (Wei et al., 2015). Address mapping may involve a fast mixing function (e.g., XOR/XNOR of physical address bits) for slice selection, while set mapping is managed by directly indexed lower-order address bits.
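
The two-stage mapping described above can be sketched in a few lines. This is a minimal illustration and not the actual Sandy Bridge function: the bit positions in `HYPOTHETICAL_MASKS`, the 64-byte line size, and the 11-bit set index are assumptions chosen for readability. It shows only the general shape, where each slice-index bit is the XOR (parity) of a selected group of physical address bits and the set index is a plain bit field.

```python
LINE_OFFSET_BITS = 6   # assumption: 64-byte cache lines
SET_INDEX_BITS = 11    # assumption: 2048 sets per slice


def mask_from_bits(bits):
    """Build an integer mask with the given bit positions set."""
    mask = 0
    for b in bits:
        mask |= 1 << b
    return mask


# Hypothetical masks, one per slice-index bit. Real masks are
# processor-specific and are not given in this article.
HYPOTHETICAL_MASKS = [
    mask_from_bits([17, 19, 21, 23]),
    mask_from_bits([18, 20, 22, 24]),
]


def parity(x):
    """XOR of all bits of x (0 or 1)."""
    return bin(x).count("1") & 1


def slice_index(paddr, masks=HYPOTHETICAL_MASKS):
    """Slice selection: each output bit is the parity of masked address bits."""
    idx = 0
    for i, mask in enumerate(masks):
        idx |= parity(paddr & mask) << i
    return idx


def set_index(paddr):
    """Set selection: directly indexed lower-order address bits."""
    return (paddr >> LINE_OFFSET_BITS) & ((1 << SET_INDEX_BITS) - 1)
```

With two masks the sketch distinguishes four slices; a design with more slices would need more masks, or a different final stage when the slice count is not a power of two.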

This architectural feature enables two critical forms of control:

Data placement control: Fine-grained page coloring or direct address manipulation allows the pinning of particular data structures to desired sets and slices, supporting prioritization or exclusivity for specific working sets.
Thread and engine co-location: CPU threads (or hardware engines) may be scheduled such that the working set for a given computation resides primarily on the LLC slice nearest the responsible core, minimizing internal ring or crossbar traversal.

By linking these mechanisms with custom logic, microengines, or compute arrays per slice, as well as specialized mapping policies, near-cache-slice computing forms a foundation for both improved general-purpose performance and highly efficient hardware offload.

2. Techniques for Data, Thread, and Engine Co-Location

Detailed reverse-engineering of slice and set hash functions, as demonstrated for Sandy Bridge, reveals that although slice selection is non-trivial, set indexing remains a fixed substring of the physical address, re-enabling classic and extended page coloring even in the presence of hashing (Wei et al., 2015). Tasks may therefore carve the LLC into partitions not only by set but also by slice, constructing composite page color keys as follows:

$\mathit{page\_color} = \mathit{color}_{\rm set} + \left(\mathit{color}_{\rm slice} \ll s\right)$

where $\mathit{color}_{\rm set} = (P \gg c) \mathbin{\&} ((1\ll s)-1)$ and $\mathit{color}_{\rm slice}$ is a computed function of higher address bits.
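
As a worked illustration of the composite key, the sketch below evaluates the two formulas directly. The concrete values (`c = 12`, `s = 5`) and the one-bit `example_slice_color` function are assumptions made for the example; the real slice color depends on the processor's slice hash.

```python
def set_color(paddr, c, s):
    """Set color: s bits of the address starting at bit c."""
    return (paddr >> c) & ((1 << s) - 1)


def page_color(paddr, c, s, slice_color_fn):
    """Composite color: set color in the low s bits, slice color above them."""
    return set_color(paddr, c, s) + (slice_color_fn(paddr) << s)


def example_slice_color(paddr):
    """Hypothetical slice color: a single higher address bit (bit 17)."""
    return (paddr >> 17) & 1


# Example with assumed parameters c = 12 (4 KiB pages) and s = 5.
# For paddr = 0x23000: set color is 3, slice color is 1, so the
# composite color is 3 + (1 << 5) = 35.
example = page_color(0x23000, c=12, s=5, slice_color_fn=example_slice_color)
```

Because the slice color occupies the bits above the set color, two pages share a composite color only if they agree on both set and slice.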

By allocating pages to guarantee that threads’ working sets map predominantly to the slices and sets topologically nearest their execution context, one minimizes average cache-access round-trip time and ring contention. For example, local LLC slice hits may be serviced in ~12 cycles, compared to 18–20 cycles (plus any ring contention delay) for remote slices. Empirical data show up to 25% system speedup in parallelized, LLC-bound workloads when employing strict slice-aware coloring and thread pinning (Wei et al., 2015). Further, dynamic migration of hot pages and contingent intra-slice set partitioning mitigate conflicts and adapt to shifting thread or data locality.
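
The effect of slice-local placement on hit latency can be estimated with a simple weighted average. The sketch below uses the ~12-cycle local figure from the text and takes 19 cycles as the midpoint of the quoted 18–20 cycle remote range. It ignores ring contention, and the assumption that an unmanaged hash spreads accesses uniformly over `n_slices` is ours.

```python
def mean_llc_hit_latency(local_fraction, local_cycles=12.0, remote_cycles=19.0):
    """Average LLC hit latency for a given fraction of slice-local hits."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    return local_fraction * local_cycles + (1.0 - local_fraction) * remote_cycles


def uniform_local_fraction(n_slices):
    """Fraction of hits that are local if lines are spread evenly over slices."""
    return 1.0 / n_slices


# Assumed 8-slice part: uniform hashing versus 90% slice-local placement.
unmanaged = mean_llc_hit_latency(uniform_local_fraction(8))  # 18.125 cycles
colored = mean_llc_hit_latency(0.9)                          # about 12.7 cycles
```

Under these assumptions the average hit latency falls by roughly 30%. That is a per-hit figure and not a whole-program speedup, which also depends on how much of the runtime is spent waiting on the LLC.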

3. Hardware Architectures for Compute Acceleration Near LLC Slices

Recent architectural proposals extend near-cache-slice principles by physically integrating domain-specific compute logic adjacent to each LLC slice:

Crypto-Near-Cache (CNC): Each slice is augmented with an SRAM compute array, supporting bitline-level logic (AND/OR/XOR/shift) and a microcoded command memory. Compute operations execute at high internal bandwidth (e.g., 512 bits over 2 cycles per slice), parallelizing core cryptographic kernels. With aggregate slice bandwidth vastly exceeding core-to-cache NoC links (n slices × 256 bits/cycle versus a 64-bit channel), bottlenecks in PQC and lattice cryptography are eliminated, yielding up to 1.8× throughput and 30× energy efficiency improvements over CPU baselines (Zhang et al., 27 Sep 2025).
ARCANE: Each LLC slice embeds a vector processing unit (VPU) and a minimal RISC-V core (eCPU) as controller, with custom “matrix reserve” and “matrix kernel” instructions to orchestrate in-cache vector convolutions and reductions. This design achieves 30–84× performance improvement on 8-bit ML kernels at 41% area overhead over conventional cache (Petrolo et al., 3 Apr 2025).
Casper: Stencil Processing Units (SPUs) attached to each slice execute streaming, low-intensity stencil operations about an order of magnitude more efficiently than a general-purpose core, with 1.65× performance and 35% energy savings, and 37× the performance-per-area of state-of-the-art GPU approaches (Denzler et al., 2021).
Arcalis: Lightweight microengines for RPC parsing/serialization are tightly coupled to LLC slices, orchestrating cache-line–granular network operations with full cache coherence, yielding 1.79–4.16× speedup on microservice workloads over CPU-only execution and up to 88% reduction in microarchitectural overhead (Umeike et al., 13 Feb 2026).
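
The bandwidth argument made for CNC (n slices × 256 bits/cycle against a 64-bit channel) reduces to a short calculation. The sketch below works it through for a slice count of 16, which is an assumed value and not a figure from the cited paper.

```python
def aggregate_slice_bandwidth(n_slices, bits_per_cycle_per_slice=256):
    """Total in-slice bandwidth in bits per cycle across all slices."""
    return n_slices * bits_per_cycle_per_slice


def bandwidth_advantage(n_slices, bits_per_cycle_per_slice=256, link_bits_per_cycle=64):
    """Ratio of aggregate in-slice bandwidth to a single core-to-cache link."""
    total = aggregate_slice_bandwidth(n_slices, bits_per_cycle_per_slice)
    return total / link_bits_per_cycle


# 512 bits over 2 cycles per slice is 256 bits/cycle per slice.
# With an assumed 16 slices: 16 * 256 = 4096 bits/cycle, 64x a 64-bit link.
ratio_16 = bandwidth_advantage(16)
```

The ratio grows linearly with the slice count, which is why in-slice execution helps most for kernels that would otherwise be limited by the core-to-cache link.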

4. Virtualization, Mapping, and Programmability

Advanced near-cache-slice systems maintain full virtual address transparency, TLB, and OS-level features. CNC and ARCANE expose custom ISA extensions, supporting direct virtual-address operations and decoupling kernel programming from cache/slice microarchitecture (Zhang et al., 27 Sep 2025, Petrolo et al., 3 Apr 2025). Allocators select physical frames to match required set/slice colors, supporting per-thread or per-application resource partitioning and deterministic mapping.
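
One way to picture a color-matching allocator is as a set of per-color free lists. The sketch below is an illustration under stated assumptions and not the allocator of any cited system: `ColoredFrameAllocator`, its methods, and the `color_of` callback are hypothetical names, and frames are modeled as plain integers.

```python
from collections import defaultdict, deque


class ColoredFrameAllocator:
    """Toy physical-frame allocator with one free list per page color."""

    def __init__(self, free_frames, color_of):
        self._color_of = color_of          # maps frame number to its color
        self._free = defaultdict(deque)    # color -> queue of free frames
        for frame in free_frames:
            self._free[color_of(frame)].append(frame)

    def alloc(self, allowed_colors):
        """Return a free frame of the first allowed color that has one, else None."""
        for color in allowed_colors:
            queue = self._free.get(color)
            if queue:
                return queue.popleft()
        return None

    def release(self, frame):
        """Return a frame to the free list of its color."""
        self._free[self._color_of(frame)].append(frame)


# Usage with a made-up coloring: 16 frames, color = frame number mod 4.
allocator = ColoredFrameAllocator(range(16), lambda pfn: pfn % 4)
first = allocator.alloc([2])       # frame 2, the lowest free frame of color 2
second = allocator.alloc([2, 3])   # frame 6, color 2 still has free frames
```

Listing colors in preference order lets a thread ask for its nearest slice first and fall back to others only when that partition is exhausted.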

Programmability is maintained through restricted offload primitives and minimal API extensions. In Arcalis, a small number of UC (uncacheable) store/load primitives demarcate offload boundaries, while RPC handler logic is synthesized offline for the reconfigurable logic region. In ARCANE, the software-defined matrix kernel can be reprogrammed to extend the available computational operations without hardware respin. SliceMoE establishes cache hierarchy control at the cache bank (slice) level and leverages mixed-precision dynamic bit-sliced caching, driven by runtime token gating (Choi et al., 15 Dec 2025).

5. Resource Allocation, Optimization, and Performance Models

Resource allocation in near-cache-slice computing is driven both by analytic modeling and real-time optimization. In network and edge systems, a multi-domain orchestrator provisions cache and compute slices as virtual network function chains, with resource placement formulated as MILP to minimize end-to-end latency under capacity and QoS constraints (Benkacem et al., 2022). In compute-centric systems, slice-level locality, bandwidth models, and latency penalties are computed directly from address hashing and cache geometry.
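
To make the shape of such a placement problem concrete, the sketch below minimizes total latency under per-node capacity limits. It uses exhaustive search in place of a MILP solver, so it is practical only for tiny instances. It also omits QoS constraints, and the data layout (`demands`, `capacities`, `latency`) is an assumption for illustration and not the formulation of the cited work.

```python
from itertools import product


def place_functions(demands, capacities, latency):
    """Assign each function to one node, minimizing summed latency.

    demands[f]    -- resource units needed by function f
    capacities[n] -- resource units available on node n
    latency[f][n] -- latency contribution if function f runs on node n
    Returns (total_latency, assignment) or None if no assignment fits.
    """
    best = None
    nodes = range(len(capacities))
    for assignment in product(nodes, repeat=len(demands)):
        used = [0] * len(capacities)
        for f, n in enumerate(assignment):
            used[n] += demands[f]
        if any(u > cap for u, cap in zip(used, capacities)):
            continue  # violates a capacity constraint
        total = sum(latency[f][n] for f, n in enumerate(assignment))
        if best is None or total < best[0]:
            best = (total, assignment)
    return best


# Two functions, two nodes: node 0 is faster but fits only one function.
result = place_functions(
    demands=[2, 2],
    capacities=[2, 4],
    latency=[[1, 5], [1, 3]],
)
# result is (4, (0, 1)): function 0 on node 0, function 1 on node 1.
```

A MILP formulation expresses the same objective and constraints with binary assignment variables, which lets a solver handle instances far beyond the reach of enumeration.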

Key performance models include:

Roofline: Attainable performance is bounded by $P = \min(P_{\text{peak}}, B_{\text{LLC}}\cdot\text{AI}, B_{\text{mem}}\cdot\text{AI})$; for $\text{AI} \ll P_{\text{peak}}/B_{\text{LLC}}$, workloads such as stencils are LLC-bandwidth-bound (Denzler et al., 2021).
RPC latency decomposition: $T_{\text{RPC}} = T_{\text{net}} + T_{\text{LLC}} + T_{\text{proc}}$, with $T_{\text{LLC}} \approx N \cdot (\ell_{\text{load}} + \ell_{\text{store}})$ per cache-line (Umeike et al., 13 Feb 2026).
Slice-level capacity/miss optimization: SliceMoE solves for a DRAM-bounded allocation of high/low-bit (MSB/LSB) slices per expert to match target miss rates, maximizing effective parameter caching (with explicit objective and constraint formulation) (Choi et al., 15 Dec 2025).
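
The first two models above are simple enough to evaluate directly. The sketch below implements them as written. The numeric inputs are made-up values for illustration, and reading $N$ as the number of cache lines touched by one RPC is our interpretation.

```python
def roofline(p_peak, b_llc, b_mem, ai):
    """Attainable performance: min of peak compute and bandwidth-limited rates.

    p_peak -- peak compute throughput (e.g., GFLOP/s)
    b_llc  -- LLC bandwidth (e.g., GB/s)
    b_mem  -- main-memory bandwidth (e.g., GB/s)
    ai     -- arithmetic intensity (operations per byte)
    """
    return min(p_peak, b_llc * ai, b_mem * ai)


def rpc_latency(t_net, n_lines, l_load, l_store, t_proc):
    """T_RPC = T_net + T_LLC + T_proc, with T_LLC = N * (l_load + l_store)."""
    t_llc = n_lines * (l_load + l_store)
    return t_net + t_llc + t_proc


# Illustrative numbers only (not taken from the cited papers).
low_ai = roofline(p_peak=1000.0, b_llc=400.0, b_mem=100.0, ai=0.5)   # 50.0
high_ai = roofline(p_peak=1000.0, b_llc=400.0, b_mem=100.0, ai=20.0) # 1000.0
rpc = rpc_latency(t_net=500, n_lines=4, l_load=40, l_store=40, t_proc=200)  # 1020
```

In the low-intensity case the bandwidth terms set the bound, so raising peak compute throughput would not help. That is the regime in which placing compute next to the cache slices pays off.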

Empirically, slice-aware allocation and computation consistently reduce energy and latency by >2× in cryptographic (Zhang et al., 27 Sep 2025) and ML workloads (Choi et al., 15 Dec 2025), improve performance by up to 4.16× in microservices (Umeike et al., 13 Feb 2026), and achieve significant area-normalized gains (Denzler et al., 2021).

6. Generalization, Applicability, and Limitations

Near-cache-slice computing is applicable whenever working sets and access patterns can be matched to LLC slice physicality, or when low-arithmetic intensity, streaming, or bandwidth-bound computation predominates. Key domains include cryptography, signal processing, MoE inference (mixed-precision, miss-constrained caching), graph streaming, and edge content delivery (Zhang et al., 27 Sep 2025, Choi et al., 15 Dec 2025, Benkacem et al., 2022, Denzler et al., 2021).

Limitations include:

DRAM-bound phases: When working sets far exceed LLC, performance reverts to DRAM bandwidth bounds.
Specialization tradeoff: Highly tuned near-cache engines require domain specificity (e.g., stencil pipelines, RPC microengines) and may be less flexible for complex, irregular workloads.
Integration complexity: Modifications to NoC routing, TLBs, or slice controller FSMs may introduce verification overhead (Zhang et al., 27 Sep 2025). Coherence needs for tight CPU-cache-engine coupling further increase design intrusion.
Area and energy: Overheads are typically modest (<1–2% of CPU die area for 16–32 slices, <5% static leakage), but may be higher in small, resource-constrained MCUs (Petrolo et al., 3 Apr 2025).

A key advantage is that the paradigm naturally extends to composable, virtualized contexts (NFV/MEC), supporting the joint instantiation and real-time scaling of cache/compute slices for dynamic workloads (Benkacem et al., 2022).

7. Performance, Energy, and Scalability Outcomes

Evaluations consistently demonstrate major gains in throughput, energy efficiency, and area-normalized performance:

Domain	Speedup / Efficiency Gain	Key Paper
PQ cryptography (CNC)	30× energy, 1.8× throughput	(Zhang et al., 27 Sep 2025)
CNN/ML (ARCANE)	30–84× perf, 2–3× energy	(Petrolo et al., 3 Apr 2025)
Stencil (Casper)	1.65× perf, 35% energy savings, 37× perf/area vs. GPU	(Denzler et al., 2021)
RPC (Arcalis)	1.79–4.16× speedup, up to 88% lower μarch overhead	(Umeike et al., 13 Feb 2026)
MoE Inference (SliceMoE)	2.37–3× energy, 1.8× latency	(Choi et al., 15 Dec 2025)
CDN/ICN edge content	80% latency, >90% offload	(Benkacem et al., 2022)

All results are measured against strong baseline implementations (optimized CPUs, state-of-the-art PIM, off-chip FPGA engines, or prior near-memory accelerators). Architectural scalability is facilitated by the modular, slice-local design—enabling thousands of concurrent bit-parallel vectors, or fine-grained task-to-slice affinity at scale.

Near-cache-slice computing reconceptualizes the memory-compute boundary by delegating key allocation, scheduling, and computation decisions to the physical geometry of the LLC. Through slice-aware allocation, programmable in-slice engines, and tightly coupled task/data placement, this approach delivers empirically validated advantages across a wide range of domains. The paradigm is poised for continued extension into next-generation CPUs, memory hierarchies, and large-scale distributed edge architectures.