
Near-Cache-Slice Computing Paradigm

Updated 4 October 2025
  • Near-Cache-Slice Computing Paradigm is an architectural approach that partitions computation and memory to exploit cache locality, reducing data movement and enhancing energy efficiency.
  • It employs cache-aware partitioning, programmable data mapping, and slice-aware scheduling to optimize resource allocation and minimize redundant memory access.
  • The paradigm drives diverse applications—from high-performance scientific computing to edge inference—demonstrating significant speedups and scalable, modular design.

The Near-Cache-Slice Computing Paradigm is an architectural and systems approach that integrates computation tightly with cache and memory subsystems—partitioning or “slicing” resources in a manner aware of the cache and local memory hierarchy. This paradigm aims to maximize application performance, energy efficiency, and scalability by exploiting locality, reducing unnecessary data movement, and optimizing task scheduling, resource partitioning, and data placement for both hardware and distributed environments. Its core principles apply across domains: high-performance scientific computing, in-network caching, edge inference, cryptography, and beyond. The approach encompasses hardware-level near-cache accelerators, run-time systems orchestrating cache-fitting data decomposition, dynamic resource allocation strategies, and collaborative or federated cache structures.

1. Conceptual Foundations and System Architecture

At the heart of the near-cache-slice paradigm is the systematic division and/or co-design of compute, storage, and communication resources to align with the hierarchical cache structure or distributed content caches. Architectures typically exhibit the following characteristics:

  • Cache-Aware Partitioning: Workloads or data domains are divided into slices such that each partition fits, or nearly fits, into a specific cache level known as the “Target Cache Level (TCL)” (Paulino et al., 2015). For distributed systems, physical caches are further subdivided into isolated slices, each dedicated to one user or tenant (Chu et al., 2017).
  • Integrated or Tightly-Coupled Compute Units: Enhanced cache slices may be augmented with domain-specific compute engines, e.g., vector processing units, bitline computing SRAM subarrays, or lightweight stencil/pattern processors (Denzler et al., 2021, Petrolo et al., 3 Apr 2025, Zhang et al., 27 Sep 2025).
  • Programmable Data Mapping: Hardware and run-time systems use programmable interfaces to map logical data indices to physical cache or memory slices, with the mapping optimized for regularity, contiguity, or access frequency (Asgari et al., 2018).
  • Slice-Aware Communication/Scheduling: In both in-network caching and data-parallel computing, traffic, requests, and computation are routed and scheduled with respect to the physical or logical slice configuration (Chu et al., 2017, Paulino et al., 2015).
  • Collaborative and Adaptive Cache Management: In distributed and federated settings, mechanisms such as multi-client collaborative caches, dynamic global allocation policies, and adaptive tile morphing (ATM) algorithms manage slice allocation under workload or data-distribution changes (Liang et al., 28 Nov 2024, Yoo et al., 2023).

The architectural theme is locality maximization: bring computation to where the data are and size/assign data to match the locality scope, whether that is a CPU cache, SRAM array, network cache, or distributed edge store.
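
To make these characteristics concrete, the following minimal Python sketch models a slice as a small descriptor combining capacity, ownership, and an optional near-cache compute engine. The class and field names (CacheSlice, owner, compute_engine) are purely illustrative assumptions rather than an interface from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical slice descriptor: one isolated portion of a cache level, optionally
# owned by a tenant and paired with a near-cache compute engine. All field names
# are illustrative and do not correspond to any of the cited systems.
@dataclass
class CacheSlice:
    slice_id: int
    capacity_bytes: int                   # share of the cache level reserved for this slice
    cache_level: str = "LLC"              # e.g. "L2", "LLC", or a network/edge cache tier
    owner: Optional[str] = None           # tenant or content provider in multi-tenant settings
    compute_engine: Optional[str] = None  # e.g. "vector", "bitline", "stencil", or None
    used_bytes: int = 0

    def fits(self, nbytes: int) -> bool:
        """True if a working set of nbytes fits in the remaining slice capacity."""
        return self.used_bytes + nbytes <= self.capacity_bytes

    def reserve(self, nbytes: int) -> None:
        """Account for data placed in this slice (raises if the slice would overflow)."""
        if not self.fits(nbytes):
            raise ValueError("working set exceeds slice capacity")
        self.used_bytes += nbytes

# Example: a 2 MiB LLC slice dedicated to one tenant, with a vector engine attached.
slice0 = CacheSlice(slice_id=0, capacity_bytes=2 << 20, owner="tenant-A",
                    compute_engine="vector")
slice0.reserve(512 << 10)   # place a 512 KiB tile near its compute engine
```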

2. Cache-Conscious Data Decomposition and Scheduling

A fundamental method in near-cache-slice computing is cache-conscious domain decomposition (Paulino et al., 2015):

  • The run-time system determines the number of partitions $n_p$ such that, for a domain $D$ comprising $d$ sub-domains, each partition $p$ satisfies:

$$\text{size}(p) = \sum_{i=0}^{d-1} \frac{\text{size}(D_i)}{n_p} \leq \text{size}(\text{TCL})$$

  • Partition size estimation is handled via specialized φ functions, ranging from a simple estimator (element size times partition count) to a conservative, cache-line-aware estimator that adjusts for geometry lengths.
  • Validation of candidate partition counts is executed iteratively (typically via binary search), ensuring efficient cache utilization.
  • Once domains are partitioned, tasks are clustered for scheduling using strategies like Contiguous Clustering (static contiguous block assignment) or Sibling Round-Robin Clustering (static distribution to cores sharing an LLC, round-robin within groups), combined with explicit worker-to-core affinity.
  • Contrast with cache-neglectful decompositions: horizontal splits that ignore cache boundaries often produce working sets that exceed the target cache capacity, incurring cache misses and degrading performance.
  • Quantitatively, this approach has yielded speedups of 6–7× on stencil, matrix multiplication/transpose, and similar workloads, and supports both single-node and distributed cluster deployments (Paulino et al., 2015).

This method demonstrates that automatic, architecture-aware decomposition and scheduling, when integrated at run time or in hardware, are crucial for scalable, efficient data-parallel job execution.
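
A minimal sketch of the cache-conscious partition sizing described above, under simplifying assumptions: the estimator rounds each sub-domain's share up to whole cache lines, and a binary search finds the smallest partition count whose working set fits the TCL. The helper names (estimate_partition_size, find_partition_count) and the example sizes are hypothetical, not the exact run-time machinery of Paulino et al. (2015).

```python
import math

def estimate_partition_size(subdomain_sizes, n_p, line_bytes=64):
    """Conservative phi-style estimator: each sub-domain contributes its share of
    one partition, rounded up to whole cache lines (cache-line-aware)."""
    total = 0
    for size in subdomain_sizes:
        share = math.ceil(size / n_p)
        total += math.ceil(share / line_bytes) * line_bytes
    return total

def find_partition_count(subdomain_sizes, tcl_bytes, max_parts=1 << 20):
    """Smallest n_p such that one partition fits the target cache level (TCL).
    The estimated partition size is non-increasing in n_p, so binary search applies."""
    lo, hi = 1, max_parts
    while lo < hi:
        mid = (lo + hi) // 2
        if estimate_partition_size(subdomain_sizes, mid) <= tcl_bytes:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Example: a domain with two sub-domains (e.g. an input and an output grid) of
# 8 MiB each, targeting a 1 MiB cache slice as the TCL.
if __name__ == "__main__":
    n_p = find_partition_count([8 << 20, 8 << 20], tcl_bytes=1 << 20)
    print("partitions:", n_p)  # each partition's working set now fits the TCL
```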

3. Resource Slicing, Partitioning, and Optimization

In distributed and in-network settings, the paradigm extends to resource partitioning at the system or network level:

  • Cache Resource Partitioning (Cache Slices): Each cache is divided into per-user or per-tenant slices (e.g., slice $C_{km}$ for content provider $k$ in cache $m$). Slice resources are statically or dynamically assigned, respecting per-cache capacity constraints:

$$\sum_{k=1}^{K} C_{km} \leq C_m$$

The optimal allocation maximizes a global utility function—typically concave in hit rate—under these constraints (Chu et al., 2017).

  • Slice-Oblivious but Non-Splitting Request Routing: Optimal routing is achieved when each content provider directs all requests to a single cache slice (the “no splitting” property), avoiding redundant multi-cache storage and preserving content-obliviousness.
  • Distributed Optimization Methods: Decentralized resource allocation and price-based negotiation converge to the optimal assignment, balancing hit rates, delay, or bandwidth under capacity constraints, and generalize to bandwidth and delay-limited environments (Chu et al., 2017, Rashid et al., 7 Nov 2024).
  • Adaptive and Collaborative Cache Allocation: For collaborative/decentralized edge inference, caches are allocated adaptively based on recency, frequency, and semantic similarity; a global cache assists in overcoming data non-IIDness and long-tail class distributions (Liang et al., 28 Nov 2024).

These techniques form the mathematical and algorithmic basis for scalable, fair, and efficient slice assignment across multi-tenant, multi-device, and federated systems.
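
To illustrate the allocation problem, the hedged sketch below maximizes a sum of concave per-provider utilities of the assumed form U_k(C_k) = w_k log(1 + C_k) under a single per-cache capacity constraint, using Lagrangian bisection (water-filling). The utility shape, the weights, and the function name allocate_slices are assumptions for illustration and do not reproduce the exact formulation of Chu et al. (2017).

```python
def allocate_slices(weights, cache_capacity, tol=1e-9):
    """Water-filling allocation for assumed utilities U_k(C_k) = w_k * log(1 + C_k)
    subject to sum_k C_k <= cache_capacity, C_k >= 0 (illustrative formulation).
    KKT conditions give C_k = max(0, w_k / lam - 1); bisect on the multiplier lam."""
    def total(lam):
        return sum(max(0.0, w / lam - 1.0) for w in weights)

    lo, hi = 1e-12, max(weights)       # total(lo) is huge, total(hi) == 0
    for _ in range(200):                # bisect until the allocation matches capacity
        lam = 0.5 * (lo + hi)
        if total(lam) > cache_capacity:
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    return [max(0.0, w / lam - 1.0) for w in weights]

# Example: three content providers sharing one 100-unit cache; heavier-weighted
# (hotter) providers receive larger slices.
if __name__ == "__main__":
    slices = allocate_slices(weights=[5.0, 2.0, 1.0], cache_capacity=100.0)
    print([round(c, 2) for c in slices], "sum =", round(sum(slices), 2))
```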

4. Near-Cache and In-/Near-Memory Hardware Integration

The paradigm increasingly manifests in hardware and system architectures:

  • Compute-Integrated Cache Slices: Novel architectures such as ARCANE (Petrolo et al., 3 Apr 2025) and Crypto-Near-Cache (CNC) (Zhang et al., 27 Sep 2025) embed processing units (e.g., vector processing, bitline computation) within or beside each cache slice or LLC. These units support direct in-place operations, custom ISA extensions, and, in the case of CNC, virtual address support for direct core-initiated invocation.
    • Bitline computing in CNC enables logic operations (XOR, AND/OR, shift) by simultaneous wordline activation, achieving high internal bandwidth and reducing energy per operation.
    • ARCANE employs a RISC-V controller with software-managed compute offload and explicit support for custom matrix/vector kernels, automating synchronization and data layout within the cache hierarchy.
  • Application-Specific Near-Cache Accelerators: Domain-specialized accelerators (e.g., stencil units in Casper (Denzler et al., 2021)) integrate tightly with LLC slices, performing stream-based, vectorized operations on data mapped for spatial locality. Such units demonstrate significant area- and energy-efficiency gains, up to 37× in performance-per-area relative to standard GPUs (in the stencil kernel domain), and support regular “hash-based” or round-robin data mapping.
  • Hardware-Managed Cache Data Placement: Hardware logic, such as programmable or hashed slice selectors, ensures spatially correlated data is loaded into the same slice to maximize the benefit of near-cache computation and support unaligned or strided data access needs.

The convergence of compute and cache slices is especially impactful for bandwidth-bound and memory-intensive workloads, including cryptography, deep learning, and scientific simulations, enabling dramatic gains in speed and energy efficiency.
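
The sketch below contrasts two illustrative slice-selection policies for the hardware-managed placement just described: a hash-style selector that spreads consecutive cache lines across slices, and a block-contiguous selector that keeps neighboring lines in one slice so a near-slice engine sees its whole tile. The constants and the hash function are assumptions, not any real processor's slice hash.

```python
LINE_BYTES = 64          # cache-line size (typical, assumed)
NUM_SLICES = 8           # number of LLC slices (assumed)
BLOCK_LINES = 512        # lines kept together per slice in the contiguous policy (assumed)

def hash_slice(addr: int) -> int:
    """Hash-style selector: XOR-fold the line address so consecutive lines are
    spread across slices (good for bandwidth, poor for near-slice locality)."""
    line = addr // LINE_BYTES
    folded = line ^ (line >> 7) ^ (line >> 13)
    return folded % NUM_SLICES

def contiguous_slice(addr: int) -> int:
    """Block-contiguous selector: groups of BLOCK_LINES consecutive lines map to the
    same slice, so a stencil/vector unit attached to that slice sees its whole tile."""
    line = addr // LINE_BYTES
    return (line // BLOCK_LINES) % NUM_SLICES

if __name__ == "__main__":
    addrs = [i * LINE_BYTES for i in range(0, 2048, 256)]
    print("hashed:    ", [hash_slice(a) for a in addrs])
    print("contiguous:", [contiguous_slice(a) for a in addrs])
```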

5. Applications and Performance Implications

The near-cache-slice paradigm manifests in diverse contexts:

  • Scientific and Large-Scale Data-Parallel Computing: Enhanced stencil, matrix, and dense linear algebra workloads—critical in weather modeling, PDE solvers, and neural network training—show strong gains via cache-conscious decomposition and cache-integrated accelerators (Paulino et al., 2015, Denzler et al., 2021, Asgari et al., 2018).
  • Cloud, Edge, and In-Network Caching: Elastic, slice-based caching with serverless function pools (e.g., InfiniCache) enables pay-per-use, large-object caching with fault tolerance via erasure coding and dynamic backup (Wang et al., 2020). Joint traffic-resource allocation and adaptive request routing maximize resource utilization, reduce latency, and adapt to workload variability (Chu et al., 2017, Rashid et al., 7 Nov 2024).
  • Multi-Tenant Content Delivery: Sliced and virtualized caches paired with ICN/CDN integration enable region-optimized content delivery, reducing core network traffic and serving most requests within the edge domain (Benkacem et al., 2022).
  • Edge Inference and Federated/Collaborative Caching: Client-adaptive caching with cross-client global updates and per-class frequency/recency scoring reduces inference latency by 23–45% while maintaining accuracy within 3% (Liang et al., 28 Nov 2024).
  • Graph and Irregular Data Processing: Slicing feature matrices in GCN accelerators improves working set locality, lowering miss rates and yielding 1.46–1.73× higher throughput versus previous approaches—without hand-tuned tile parameterization (Yoo et al., 2023).
  • Post-Quantum and Classical Cryptography: In situ cryptographic operations over large PQC key structures, enabled via bitline computing and direct-virtual addressing, mitigate bandwidth limitations and offload computation from general-purpose cores (Zhang et al., 27 Sep 2025).

Across these applications, the performance improvements stem from reduced data movement, better exploitation of cache and memory locality, and more balanced hardware utilization. Depending on the domain and hardware configuration, reported gains range from 1.5× (scientific/GCN workloads) to over 80× (8-bit CNN kernels in ARCANE), with power efficiency reaching 747 GFLOPs/J for LSTM training in memory slices (Asgari et al., 2018).
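
One way to picture the per-class frequency/recency scoring mentioned above for collaborative edge caching is the sketch below, which ranks cached entries by a weighted mix of decayed recency, log-damped hit frequency, and cosine similarity to the current query embedding. The weighting scheme, field names, and threshold are assumptions for illustration, not the scoring rule of Liang et al. (28 Nov 2024).

```python
import math
import time

def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def score_entry(entry, query_emb, now=None,
                w_rec=0.3, w_freq=0.3, w_sim=0.4, half_life=300.0):
    """Illustrative cache score: exponentially decayed recency, log-damped hit
    frequency, and similarity between the cached embedding and the query."""
    now = now if now is not None else time.time()
    recency = math.exp(-(now - entry["last_hit"]) / half_life)
    frequency = math.log1p(entry["hits"])
    similarity = cosine(entry["embedding"], query_emb)
    return w_rec * recency + w_freq * frequency + w_sim * similarity

def best_match(cache, query_emb, threshold=0.8):
    """Return the highest-scoring cached entry if it is similar enough to reuse,
    otherwise None (cache miss, fall back to full model inference)."""
    if not cache:
        return None
    best = max(cache, key=lambda e: score_entry(e, query_emb))
    return best if cosine(best["embedding"], query_emb) >= threshold else None
```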

6. Discussion: Advantages, Challenges, and Outlook

The near-cache-slice computing paradigm offers several technical advantages:

  • Energy Efficiency: By bringing computation near data, it reduces the dynamic energy cost of data movement—especially critical in edge, mobile, and power-limited systems (Asgari et al., 2018, Zhang et al., 27 Sep 2025).
  • Scalability and Modularity: The modular slice design allows scalability via addition or reallocation of slices in both compute and storage dimensions. Performance frequently exhibits superlinear scaling due to diminished reloading and contention overheads (Asgari et al., 2018).
  • Abstraction and Usability: Automated, run-time slice management (partition sizing, task scheduling, resource allocation) abstracts complex cache/memory optimization away from the user, improving programming productivity (Paulino et al., 2015, Petrolo et al., 3 Apr 2025).
  • System and ISA Integration: Co-designed ISA and hardware changes (e.g., virtual address support in CNC, custom kernel instructions in ARCANE) facilitate seamless system integration with minimal software overhead or compromise to address translation mechanisms (Petrolo et al., 3 Apr 2025, Zhang et al., 27 Sep 2025).
  • Flexibility for Heterogeneous Workloads: Dynamic routing and allocation mechanisms (e.g., adaptive cache allocation, collaborative cache updates, and ATM) allow the paradigm to adapt to workload non-uniformity, temporally varying demand, and non-IID data (Liang et al., 28 Nov 2024, Yoo et al., 2023).

Challenges include:

  • Synchronization and Coordination: Collaborative and federated caching structures require robust mechanisms to avoid inconsistencies and maintain staleness bounds across slices in diverse topologies (Liang et al., 28 Nov 2024).
  • Area and Hardware Overhead: Hardware integration of compute logic into cache slices incurs area overheads (e.g., 41.3% in ARCANE with 8-lane VPUs (Petrolo et al., 3 Apr 2025)); such costs must be balanced against throughput/power gains.
  • Dynamic Resource Management Complexity: Efficient decentralized or distributed optimization under real-world workload variability and cross-slice contention remains a complex problem (Chu et al., 2017, Rashid et al., 7 Nov 2024).
  • Workload Characterization and Mapping: Selecting or tuning slice dimensions, mapping functions, and tile sizes (especially for irregular or sparse data) can be nontrivial—though dynamic and feedback-driven adaptation is increasingly effective (Yoo et al., 2023).

These challenges suggest that future research will likely focus on more dynamic, feedback-driven hardware and run-time mechanisms, integration with AI-driven resource schedulers, investigation of fine-grained multi-slice synchronization protocols, and further co-design of custom ISA extensions tailored to different workload classes on emerging computing platforms.

7. Summary Table: Representative Implementations and Domains

| Context | Slice Mechanism | Domain/Application |
|---|---|---|
| Multi-core run-time | Cache-partitioned tasks | Matrix, stencil, image kernels (Paulino et al., 2015) |
| In-network caching | Per-CP slice, no-split | Multi-tenant CDN/CP (Chu et al., 2017) |
| Modular memory slices | Systolic, 2D slices | Deep learning, RNNs, CNNs (Asgari et al., 2018) |
| Serverless caching | Function-pool slices | Object caches, cloud storage (Wang et al., 2020) |
| Edge inference | Per-client adaptive | Image/audio models (Liang et al., 28 Nov 2024) |
| Near-cache accelerator | LLC/SPU slice coupling | Stencil, scientific computing (Denzler et al., 2021) |
| Secure in-cache compute | Bitline SRAM arrays | Post-quantum cryptography (Zhang et al., 27 Sep 2025) |
| IoT embedded compute | LLC with vector VPUs | TinyNN, signal processing (Petrolo et al., 3 Apr 2025) |

This table highlights the breadth of implementation strategies, from system-level run-time to deeply embedded hardware, across a wide spectrum of data-intensive domains.


The near-cache-slice computing paradigm, therefore, embodies a fundamental shift toward architecture-, workload-, and resource-aware computation and storage orchestration, subsuming traditional notions of memory hierarchy and workload scheduling under a unified, slice-driven framework.
