PIM-CACHE: In-Memory Caching for Near-Data Compute
- PIM-CACHE is an architectural approach that integrates caching and data mapping near PIM hardware to reduce redundant data movement and bandwidth bottlenecks.
- It employs techniques like host-side deduplication, hierarchical KV caching, and in-memory product quantization to enhance compute efficiency and energy savings.
- Design trade-offs include balancing cache coherence, mapping strategies, and hardware constraints while relying on workload redundancy for performance gains.
Processing-in-memory cache systems (“PIM-CACHE”) integrate caching, data mapping, and/or content reduction schemes directly into or in close proximity to PIM hardware. Their purpose is to mitigate bandwidth, capacity, and data movement bottlenecks that conventionally undermine or limit the throughput and efficiency gains achievable with near-data compute. Solutions under this umbrella span from software-only host-side data deduplication and pre-processing layers, through architectural cache and replacement policies within the PIM fabric itself, to hardware-software co-designs that leverage compression, quantization, and cache coherence mechanisms. This article systematically examines core PIM-CACHE mechanisms, including host–PIM copy deduplication, hierarchical and heterogeneous memory organization for key-value (KV) caches in LLM inference, analog and digital cache augmentation within embedded memory arrays, and PIM-specific data placement strategies. It also discusses key trade-offs, performance metrics, and open design challenges.
1. PIM-CACHE Fundamentals and Motivations
PIM-CACHE systems are motivated by two main observations: (1) while PIM architectures move computation near data to exploit high local bandwidth and parallelism, conventional memory hierarchy bottlenecks, especially data transfer and redundant fetches, can dominate end-to-end performance, and (2) large-scale workloads (e.g., LLM inference, DNN acceleration, graph analytics) rely on cacheable working sets whose spatial, temporal, and semantic locality patterns remain unexploited by naive PIM designs.
The paradigmatic example is the UPMEM DPU ecosystem, where the overhead of host-to-DPU DMA, especially with highly redundant or correlated data, can eliminate PIM’s performance advantage (Yuhala et al., 24 Mar 2026). Similarly, in LLM/KV-cache dominated pipelines, simple append-only mapping to PIM banks leads to unnecessary fetches (especially under sparse attention), and a mismatch between PIM compute locality and workload "hotness" can waste both bandwidth and die area (Li et al., 7 May 2026, Fan et al., 9 May 2025, Matsushima et al., 20 Apr 2026).
PIM-CACHE thus aims to (a) minimize or eliminate redundant data movement to PIM, (b) adapt cache policies and hardware to PIM-specific semantics, and (c) re-map data layouts to align with access patterns and bank organization.
2. Host-Side Content-Aware Copy and Cache Reduction
PIM-CACHE as implemented for UPMEM-style PIM consists of a lightweight, host-resident data reduction module (DRM) that transparently intercepts and refines "copy_to" operations to DPUs (Yuhala et al., 24 Mar 2026). The core operation is block-wise content-aware deduplication, in which fixed-size blocks (e.g., 1 KiB) are fingerprinted (XXHash64) and deduplicated with a per-DPU hash table.
Workflow:
- Incoming buffers are split into blocks. For each block, the DRM checks the per-DPU hash table for a prior occurrence; if a match is found, only an offset is recorded; otherwise, the block is scheduled for transfer and indexed.
- Only distinct (unique) blocks are transferred to the DPU, together with an ordered list of offsets.
- On the DPU, block-reconstruction is index-driven: only unique blocks are loaded from retention buffers as needed for kernel execution.
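The workflow above can be sketched in a few lines of host-side Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `dedup_copy` and `reconstruct` are hypothetical names, BLAKE2b truncated to 64 bits stands in for XXHash64 (which is not in the standard library), fingerprint collisions are ignored, and the per-DPU table is assumed never to overflow.

```python
import hashlib

BLOCK_SIZE = 1024  # 1 KiB blocks, the example size given in the text


def fingerprint(block: bytes) -> bytes:
    # Assumption: 64-bit BLAKE2b digest as a stand-in for XXHash64.
    return hashlib.blake2b(block, digest_size=8).digest()


def dedup_copy(buffer: bytes, table: dict, block_size: int = BLOCK_SIZE):
    """Split `buffer` into blocks and return (unique_blocks, offsets).

    `table` is the per-DPU map from fingerprint to slot index in the
    DPU-side retention buffer; it persists across calls.
    """
    unique_blocks = []  # blocks that must actually be transferred
    offsets = []        # one slot index per input block, in order
    for start in range(0, len(buffer), block_size):
        block = buffer[start:start + block_size]
        key = fingerprint(block)
        slot = table.get(key)
        if slot is None:            # first occurrence: transfer and index it
            slot = len(table)
            table[key] = slot
            unique_blocks.append(block)
        offsets.append(slot)        # duplicates only cost an offset
    return unique_blocks, offsets


def reconstruct(retained: list, offsets: list) -> bytes:
    # DPU-side view: rebuild the logical buffer from retained unique blocks.
    return b"".join(retained[slot] for slot in offsets)
```

For a buffer of four blocks in which three are identical, only two blocks are transferred, and the offset list `[0, 1, 0, 0]` is enough to rebuild the original layout.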
Optional compression (e.g., VByte) can be applied to further reduce transfer volume. The entire pipeline is software-transparent, plugging in above the vendor PIM SDK.
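As an illustration of the optional compression step, the sketch below applies VByte to a list of block offsets. The source does not say which data the compression targets, so the choice of the offset list is an assumption, as is the byte convention used here (seven payload bits per byte, low-order group first, high bit set when more bytes follow).

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit set: more bytes follow
            v >>= 7
        out.append(v)                      # final byte has the high bit clear
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values
```

Small offsets, which dominate when the same few blocks recur, fit in a single byte each instead of a fixed-width integer.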
Bandwidth savings are proportional to the deduplication hit rate: for highly redundant data (e.g., all blocks duplicated), the transfer reduction is largest; when redundancy is absent, a slowdown occurs due to the linear host-side DRM overhead (Yuhala et al., 24 Mar 2026). In realistic genomics and temporal-reuse tasks, an overall speedup is seen, with end-to-end acceleration dominated by the savings in data staging.
Limitations:
- Efficacy relies on the workload achieving a sufficiently high deduplication hit rate; below the break-even threshold, overhead outweighs gain.
- Block size trades off dedup granularity with metadata overhead; duplications smaller than the chosen block size are missed.
- Overflow is handled by global invalidation rather than by an LRU or FIFO replacement policy.
Potential enhancements include adaptive bypass (if the hit rate is detected to be low), extension to DPU-to-host and inter-DPU transfers, and possible hardware offload in future PIM controllers.
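One way adaptive bypass could work is to estimate the hit rate on a small prefix of the buffer and skip deduplication when the estimate is low. The sketch below is hypothetical throughout: the function names, the sampling window `max_blocks`, and the threshold `min_hit_rate` are illustrative choices, and BLAKE2b again stands in for XXHash64.

```python
import hashlib


def sampled_hit_rate(buffer: bytes, block_size: int = 1024, max_blocks: int = 64) -> float:
    """Fraction of sampled blocks that repeat an earlier sampled block."""
    seen = set()
    hits = total = 0
    limit = min(len(buffer), block_size * max_blocks)
    for start in range(0, limit, block_size):
        fp = hashlib.blake2b(buffer[start:start + block_size], digest_size=8).digest()
        if fp in seen:
            hits += 1
        else:
            seen.add(fp)
        total += 1
    return hits / total if total else 0.0


def should_bypass(buffer: bytes, min_hit_rate: float = 0.5, **sample_args) -> bool:
    # Hypothetical policy: fall back to a plain copy when the sampled hit
    # rate is below the (assumed) break-even threshold.
    return sampled_hit_rate(buffer, **sample_args) < min_hit_rate
```

A real break-even threshold would have to be measured on the target system, since it depends on the ratio of host hashing cost to transfer cost.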
3. PIM-CACHE for Hierarchical and Compressed KV Caches
LLM inference, especially during decoding, produces KV caches that grow linearly with context length and batch size and can saturate both off-chip and internal PIM bandwidth and memory capacity. Recent research attacks this bottleneck by fusing cache hierarchy and management directly into HBM-PIM stacks, using both architectural heterogeneity and quantization (Li et al., 7 May 2026, Matsushima et al., 20 Apr 2026).
TokenStack (Li et al., 7 May 2026) partitions each HBM4 stack into:
- Compute layers: PIM-enabled layers containing lightweight MACs, designated for “hot” KV cache.
- Capacity layers: Dense DRAM-only dies for “cold” KV and weights, maximizing bit storage.
A "logic base die" orchestrates DMA, multi-layer address translation, attention coordination, and quantization in flight, including a streaming INT8/INT4 compression engine for keys and values yielding a capacity boost.
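To make the in-flight quantization step concrete, the following sketch shows symmetric per-vector INT8 quantization. The source does not specify the quantization scheme used by the logic base die, so the scale selection and the function names here are assumptions chosen for simplicity.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization to integers in [-127, 127]."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_int8(q, scale):
    """Approximate reconstruction of the original values."""
    return [v * scale for v in q]
```

Each value then occupies one byte plus a shared scale per vector, and the reconstruction error per element is bounded by half the scale.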
Key PIM-CACHE runtime policies:
- Topology-aware KV placement: New requests are scheduled to stacks that already hold their prefix, minimizing cross-stack migration.
- Workload-aware cache eviction: Every KV block is scored for demotion from compute to capacity based on reuse probability and access recency, with quantized migration to maximize capacity.
- Bounded replication: Controlled prefix replication across cards only when fan-out frequently amortizes cross-stack transfer cost.
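The eviction policy in the list above can be illustrated with a simple scoring function. This is a sketch only: the linear blend of reuse probability and recency, the weight `alpha`, and the recency `horizon` are hypothetical, since the source states which signals are used but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class KVBlock:
    block_id: str
    reuse_prob: float  # estimated probability of reuse, in [0, 1]
    last_access: int   # logical timestamp of the most recent access


def retention_score(block: KVBlock, now: int, alpha: float = 0.5, horizon: int = 100) -> float:
    """Higher score means the block should stay in the compute layers."""
    age = max(0, now - block.last_access)
    recency = max(0.0, 1.0 - age / horizon)  # 1.0 if just used, 0.0 if stale
    return alpha * block.reuse_prob + (1.0 - alpha) * recency


def pick_demotions(blocks, now: int, n: int, **score_args):
    """Return the n lowest-scoring blocks as candidates for demotion."""
    return sorted(blocks, key=lambda b: retention_score(b, now, **score_args))[:n]
```

Blocks returned by `pick_demotions` would be the ones moved from compute layers to capacity layers, optionally quantized on the way.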
Empirically, this yields a geometric-mean throughput improvement and higher SLO-compliant capacity over AttAcc (a PIM-only flat attention baseline), together with lower per-token energy.
AQPIM (Matsushima et al., 20 Apr 2026) employs in-memory product quantization (PQ) of activations and the KV cache, using hardware-accelerated clustering and codebook search directly in PIM logic. The KV cache is compressed substantially, and both the attention kernel and overall decode are accelerated, at a small cost in accuracy. Geometric and memory locality are preserved by codebook-aware data placement and hardware gather schemes within DRAM rows, enabling nearly constant per-token decode latency for very long contexts.
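Product quantization itself can be sketched independently of the hardware. The example below assumes the codebooks are already available (in AQPIM they come from clustering in PIM logic; here they are fixed inputs), and the function names are hypothetical. A vector is split into equal subvectors, each replaced by the index of its nearest centroid, and dot products are then evaluated against the codes.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """codebooks[m] is the list of centroids for subspace m."""
    m = len(codebooks)
    d_sub = len(vec) // m
    if d_sub * m != len(vec):
        raise ValueError("vector length must be divisible by number of subspaces")
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * d_sub:(i + 1) * d_sub]
        # Codebook search: index of the nearest centroid in this subspace.
        codes.append(min(range(len(book)), key=lambda c: sq_dist(sub, book[c])))
    return codes


def pq_decode(codes, codebooks):
    """Concatenate the selected centroids to approximate the vector."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


def pq_dot(query, codes, codebooks):
    """Dot product of an uncompressed query with a PQ-encoded vector."""
    d_sub = len(query) // len(codebooks)
    total = 0.0
    for i, (code, book) in enumerate(zip(codes, codebooks)):
        q_sub = query[i * d_sub:(i + 1) * d_sub]
        total += sum(q * c for q, c in zip(q_sub, book[code]))
    return total
```

With two subspaces of two centroids each, a four-element key is stored as two small indices, and the attention score against a query is computed from centroid values alone.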
4. In-Memory Caching and Data Placement for Digital and Analog PIM
PIM-CACHE includes hardware-level modifications in both SRAM and DRAM-based arrays designed to maximize utilization, parallelism, and buffer reuse.
NVM-in-Cache (Chakraborty et al., 15 Sep 2025) overlays RRAM devices onto a standard 6T SRAM, creating a compact 6T-2R bit-cell. In PIM mode, analog multiply-accumulate (MAC) is performed on the power rails using weighted configuration circuits and sampled ADCs, while normal SRAM operation is retained. The hybrid design effectively doubles usable cache capacity; throughput (TOPS), energy efficiency (TOPS/W), and per-bit area are reported for a GF22 FDSOI implementation. The approach is drop-in compatible, coexists with the cache hierarchy in existing systems, and lowers DRAM traffic for allocation metadata in dynamic workloads.
CD-PIM (Lin et al., 18 Jan 2026), targeting LPDDR5-based in-situ compute, reallocates bank and pseudo-bank segments for the distinct "decode" (GEMV) and "prefill" (GEMM) phases. Under high-bandwidth compute mode (HBCEM), four pseudo-banks per bank are exploited, raising per-bank bandwidth and boosting overall decoding throughput over both a GPU-only baseline and prior AttAcc. The PIM-cache mapping schemes partition the K-cache and V-cache matrices across all CUs for 100% utilization (vs. 1.6% with naive mapping), reducing traffic and raising the buffer hit rate for multi-use weights.
5. Data Layout and Mapping: Sparse and Semantically-Aligned PIM Cache Management
Bandwidth waste in PIM-based LLM attention can be dominated by cache mapping strategies, especially under sparse attention.
STARC (Fan et al., 9 May 2025) proposes semantic clustering of key-value pairs such that tokens with high mutual attention are physically co-located in PIM DRAM rows and banks. K-means clustering on key vectors, with the number of clusters chosen as a function of the total token count, is followed by a round-robin mapping of clusters to banks and contiguous row allocation per cluster. As a result, during attention only the clusters (rows) with the highest relevance (by centroid-query dot-product score) are fetched, which minimizes row activations and keeps the pipeline well utilized.
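The placement and retrieval steps can be sketched as follows. This illustration takes the cluster assignment as given rather than running K-means, and all function names, the bank count, and the layout representation are hypothetical; it shows only the round-robin cluster-to-bank mapping and the centroid-based selection.

```python
def centroids(keys, labels, k):
    """Mean key vector per cluster."""
    dim = len(keys[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for key, lab in zip(keys, labels):
        counts[lab] += 1
        for j, x in enumerate(key):
            sums[lab][j] += x
    return [[s / counts[c] for s in sums[c]] if counts[c] else sums[c]
            for c in range(k)]


def place_clusters(labels, k, num_banks):
    """Map cluster -> (bank, token indices stored contiguously in that bank)."""
    layout = {c: (c % num_banks, []) for c in range(k)}  # round-robin over banks
    for tok, lab in enumerate(labels):
        layout[lab][1].append(tok)
    return layout


def select_clusters(query, cents, top):
    """Indices of the `top` clusters with the highest centroid-query dot product."""
    def score(c):
        return sum(q * x for q, x in zip(query, cents[c]))
    return sorted(range(len(cents)), key=score, reverse=True)[:top]
```

Only the rows belonging to the selected clusters need to be activated; tokens in the remaining clusters are never fetched for that query.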
STARC provides:
- Latency and energy reductions versus token-wise sparsity;
- Latency and energy reductions compared to dense full KV retrieval;
- Negligible (<2%) downstream accuracy loss relative to state-of-the-art sparse attention.
6. PIM-Cache Coherence, Dynamic Allocation, and Software-Hardware Synergy
PIM-CACHE also covers inter-core consistency and dynamic allocator metadata acceleration.
LazyPIM (Boroumand et al., 2017) introduces a speculative, signature-compressed coherence protocol for PIM systems that share data between host and PIM logic. Instead of consulting a traditional directory or MESI protocol on every access, PIM execution proceeds speculatively, and conflicts are detected and resolved afterwards using Bloom-filter signatures at commit boundaries, reducing off-chip traffic and energy while improving performance. Further reductions are possible by adapting signature sizing and chunk boundaries dynamically.
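A commit-time conflict check based on a Bloom-filter signature can be sketched as below. This is a simplified illustration, not the LazyPIM protocol: the signature size, the number of hash functions, the salted BLAKE2b hashing, and the choice to test individual CPU-written addresses against a PIM read-set signature are all assumptions.

```python
import hashlib


class Signature:
    """Fixed-size Bloom filter over cache-line addresses."""

    def __init__(self, bits: int = 256, hashes: int = 2):
        self.bits = bits
        self.hashes = hashes
        self.word = 0  # the bit vector, held in a Python int

    def _positions(self, addr: int):
        for i in range(self.hashes):
            h = hashlib.blake2b(addr.to_bytes(8, "little"),
                                digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(h, "little") % self.bits

    def add(self, addr: int) -> None:
        for p in self._positions(addr):
            self.word |= 1 << p

    def may_contain(self, addr: int) -> bool:
        # False positives are possible, false negatives are not.
        return all((self.word >> p) & 1 for p in self._positions(addr))


def commit_ok(pim_read_sig: Signature, cpu_written_lines) -> bool:
    """True if no CPU write can have touched a line the PIM kernel read."""
    return not any(pim_read_sig.may_contain(a) for a in cpu_written_lines)
```

A failed check forces a rollback and re-execution of the PIM kernel; because false positives also trigger rollbacks, the signature size directly trades traffic against wasted work.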
PIM-malloc (Lee et al., 19 May 2025) uses a 16-line, fully associative, per-core buddy cache for the dynamic allocation trees held in PIM DRAM. Caching only the 4 B metadata word of each tree node reduces allocation latency and avoids bandwidth-heavy DRAM reads, at an added per-core area and power cost.
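The effect of such a buddy cache can be modelled in software. In this sketch the 16-line capacity and full associativity come from the text, while the LRU replacement, the write-through behaviour, the class name, and the use of a dictionary to stand in for PIM DRAM are assumptions made for illustration.

```python
from collections import OrderedDict


class BuddyCache:
    """Fully associative cache of buddy-tree metadata words."""

    def __init__(self, backing: dict, lines: int = 16):
        self.backing = backing        # node index -> metadata word (models DRAM)
        self.lines = lines
        self.cache = OrderedDict()    # insertion order doubles as LRU order
        self.dram_reads = 0

    def read(self, node: int) -> int:
        if node in self.cache:
            self.cache.move_to_end(node)       # hit: refresh LRU position
            return self.cache[node]
        self.dram_reads += 1                   # miss: fetch from DRAM
        value = self.backing[node]
        self.cache[node] = value
        if len(self.cache) > self.lines:
            self.cache.popitem(last=False)     # evict least recently used
        return value

    def write(self, node: int, value: int) -> None:
        self.backing[node] = value             # assumed write-through
        if node in self.cache:
            self.cache[node] = value
            self.cache.move_to_end(node)
```

Because successive allocations tend to walk the same upper levels of the tree, repeated root-to-leaf traversals hit in the cache after the first pass, which is where the DRAM read savings come from.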
PIM-SHERPA (Lee et al., 10 Mar 2026) targets the mismatch between cacheability and hardware triggering for in-place GEMM/GEMV in LLMs: prefill requires a host-side, cache-resident duplicate of the weights, while decode triggers PIM only if the memory is non-cacheable. Double buffering and on-demand swizzled memory copy sidestep this, achieving memory capacity savings for Llama 3.2 models with negligible performance penalty at realistic sequence lengths.
7. Design Trade-offs, Scalability, and Limitations
- Efficacy of deduplication/content-aware PIM-CACHE schemes is workload dependent; highly redundant (or clustered) input is optimal.
- Quantization-based PIM-CACHE (AQPIM, TokenStack) introduces accuracy/throughput trade-offs, with competitive losses (<2 points) at the reported compression ratios.
- Hardware-based PIM-caches (NVM-in-Cache, CD-PIM) must balance added area/ADC bottlenecks against throughput and use-case generality; benefits diminish for random-access or low-locality workloads.
- Cache coherence schemes hinge on reasonable false positive rates/commit intervals; excessive rollback or large working sets can degrade gains.
- Re-mapping/KV clustering-based PIM-CACHE (STARC) is effective where attention is semantically or temporally skewed, but less so under uniformly random or single-use contexts.
- Future directions include hardware support for adaptive content-aware copy, extension to inter-DPU/in-PIM deduplication, online learning for cache replacement, and integration with transactional memory for robust rollback.
PIM-CACHE, as an architectural and algorithmic principle, is critical to unlocking near-memory compute at exascale, ensuring that performance, energy, and capacity benefits scale with logic/memory density and application requirements (Yuhala et al., 24 Mar 2026, Li et al., 7 May 2026, Matsushima et al., 20 Apr 2026, Chakraborty et al., 15 Sep 2025, Lin et al., 18 Jan 2026, Boroumand et al., 2017, Lee et al., 10 Mar 2026, Lee et al., 19 May 2025, Fan et al., 9 May 2025).