PIM-CACHE: In-Memory Caching for Near-Data Compute

Updated 3 July 2026
  • PIM-CACHE is an architectural approach that integrates caching and data mapping near PIM hardware to reduce redundant data movement and bandwidth bottlenecks.
  • It employs techniques like host-side deduplication, hierarchical KV caching, and in-memory product quantization to enhance compute efficiency and energy savings.
  • Design trade-offs include balancing cache coherence, mapping strategies, and hardware constraints while relying on workload redundancy for performance gains.

Processing-in-memory cache systems (“PIM-CACHE”) integrate caching, data mapping, and/or content reduction schemes directly into or in close proximity to PIM hardware. Their purpose is to mitigate bandwidth, capacity, and data movement bottlenecks that conventionally undermine or limit the throughput and efficiency gains achievable with near-data compute. Solutions under this umbrella span from software-only host-side data deduplication and pre-processing layers, through architectural cache and replacement policies within the PIM fabric itself, to hardware-software co-designs that leverage compression, quantization, and cache coherence mechanisms. This article systematically examines core PIM-CACHE mechanisms, including host–PIM copy deduplication, hierarchical and heterogeneous memory organization for key-value (KV) caches in LLM inference, analog and digital cache augmentation within embedded memory arrays, and PIM-specific data placement strategies. It also discusses key trade-offs, performance metrics, and open design challenges.

1. PIM-CACHE Fundamentals and Motivations

PIM-CACHE systems are motivated by two main observations: (1) while PIM architectures move computation near data to exploit high local bandwidth and parallelism, conventional memory hierarchy bottlenecks, especially data transfer and redundant fetches, can dominate end-to-end performance, and (2) large-scale workloads (e.g., LLM inference, DNN acceleration, graph analytics) rely on cacheable working sets whose spatial, temporal, and semantic locality patterns remain unexploited by naive PIM designs.

The paradigmatic example is the UPMEM DPU ecosystem, where the overhead of host-to-DPU DMA, especially with highly redundant or correlated data, can eliminate PIM’s performance advantage (Yuhala et al., 24 Mar 2026). Similarly, in LLM/KV-cache dominated pipelines, simple append-only mapping to PIM banks leads to unnecessary fetches (especially under sparse attention), and a mismatch between PIM compute locality and workload "hotness" can waste both bandwidth and die area (Li et al., 7 May 2026, Fan et al., 9 May 2025, Matsushima et al., 20 Apr 2026).

PIM-CACHE thus aims to (a) minimize or eliminate redundant data movement to PIM, (b) adapt cache policies and hardware to PIM-specific semantics, and (c) re-map data layouts to align with access patterns and bank organization.

2. Host-Side Content-Aware Copy and Cache Reduction

PIM-CACHE as implemented for UPMEM-style PIM consists of a lightweight, host-resident data reduction module (DRM) that transparently intercepts and refines "copy_to" operations to DPUs (Yuhala et al., 24 Mar 2026). The core operation is block-wise content-aware deduplication, in which fixed-size blocks (e.g., 1 KiB) are fingerprinted (XXHash64) and deduplicated with a per-DPU hash table.

Workflow:

  • Incoming buffers are split into blocks. For each block, the DRM checks the per-DPU hash table for a prior occurrence; if a match is found, only an offset is recorded; otherwise, the block is scheduled for transfer and indexed.
  • Only distinct (unique) blocks are transferred to the DPU, together with an ordered list of offsets.
  • On the DPU, block-reconstruction is index-driven: only unique blocks are loaded from retention buffers as needed for kernel execution.
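The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`dedup_copy`, `reconstruct`) are hypothetical, `hashlib.blake2b` with an 8-byte digest stands in for XXHash64 (which is not in the standard library), and hash collisions are ignored for brevity.

```python
import hashlib

BLOCK_SIZE = 1024  # 1 KiB blocks, matching the example block size in the text


def fingerprint(block: bytes) -> bytes:
    # Stand-in 64-bit fingerprint; the described system uses XXHash64.
    return hashlib.blake2b(block, digest_size=8).digest()


def dedup_copy(buffer: bytes, table: dict, block_size: int = BLOCK_SIZE):
    """Split `buffer` into blocks and return (new_blocks, slots).

    `table` is the per-DPU map fingerprint -> slot in the DPU retention
    buffer. It persists across calls, so reuse over time is also caught.
    A production version would compare block contents on a fingerprint
    match to rule out collisions.
    """
    new_blocks = []  # unique blocks that actually need to be transferred
    slots = []       # one retention-buffer slot per input block, in order
    for start in range(0, len(buffer), block_size):
        block = buffer[start:start + block_size]
        key = fingerprint(block)
        slot = table.get(key)
        if slot is None:          # first occurrence: schedule and index it
            slot = len(table)
            table[key] = slot
            new_blocks.append(block)
        slots.append(slot)        # duplicates only cost one offset entry
    return new_blocks, slots


def reconstruct(retention: list, slots: list) -> bytes:
    # DPU-side view: rebuild the logical buffer from stored unique blocks.
    return b"".join(retention[s] for s in slots)
```

In use, the host keeps one `table` per DPU and appends each call's `new_blocks` to that DPU's retention buffer in order, so slot numbers stay aligned between host and device.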

Optional compression (e.g., VByte) can be applied to further reduce transfer volume. The entire pipeline is software-transparent, plugging in above the vendor PIM SDK.

Bandwidth savings are proportional to the deduplication hit rate H: for highly redundant data (e.g., R = 1, all blocks duplicated), up to 14× transfer reduction is observed; when redundancy is absent, a 1.3× slowdown occurs due to linear host-side DRM overhead (Yuhala et al., 24 Mar 2026). In realistic genomics and temporal-reuse tasks, up to 1.5× overall speedup is seen, with end-to-end compute acceleration dominated by the savings in data staging.

Limitations:

  • Efficacy relies on the workload achieving a deduplication hit rate H ≫ 0.4%; below this threshold, overhead outweighs gain.
  • Block size trades off dedup granularity with metadata overhead; duplications smaller than the chosen block size are missed.
  • Overflow handling is global invalidation, not LRU/FIFO.

Potential enhancements include adaptive bypass (if H is detected to be low), extension to DPU-to-host and inter-DPU transfer, and possible hardware offload in future PIM controllers.
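The adaptive-bypass idea could look roughly like the following sketch. Everything here is an assumption for illustration: the class name, the sliding-window estimator, the default threshold, and the periodic probe (which lets the system notice when redundancy returns) are not described in the source.

```python
from collections import deque


class AdaptiveBypass:
    """Tracks a windowed dedup hit rate and decides whether to run the DRM."""

    def __init__(self, threshold=0.05, window=1024, probe_every=16):
        self.threshold = threshold            # hypothetical cut-off, tune per platform
        self.history = deque(maxlen=window)   # 1 = duplicate block, 0 = unique block
        self.probe_every = probe_every        # force a dedup pass every N copies
        self.calls = 0

    def record(self, hits: int, total: int) -> None:
        # Called after a dedup pass with the number of duplicate blocks seen.
        self.history.extend([1] * hits + [0] * (total - hits))

    def hit_rate(self) -> float:
        # Optimistic default of 1.0 until the first measurement arrives.
        return sum(self.history) / len(self.history) if self.history else 1.0

    def use_dedup(self) -> bool:
        self.calls += 1
        if not self.history or self.calls % self.probe_every == 0:
            return True                       # no data yet, or periodic probe
        return self.hit_rate() >= self.threshold
```

A copy wrapper would call `use_dedup()` before each transfer, fall back to the plain vendor copy when it returns `False`, and call `record()` after every pass that did run deduplication.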

3. PIM-CACHE for Hierarchical and Compressed KV Caches

LLM inference contexts, especially during decoding, generate KV caches that grow linearly with context length and batch size and saturate both off-chip and internal PIM bandwidth and memory capacity. Recent research attacks this bottleneck by fusing cache hierarchy and management directly into HBM-PIM stacks, using both architectural heterogeneity and quantization (Li et al., 7 May 2026, Matsushima et al., 20 Apr 2026).

TokenStack (Li et al., 7 May 2026) partitions each HBM4 stack into:

  • Compute layers: PIM-enabled layers containing lightweight MACs, designated for “hot” KV cache.
  • Capacity layers: Dense DRAM-only dies for “cold” KV and weights, maximizing bit storage.

A "logic base die" orchestrates DMA, multi-layer address translation, attention coordination, and quantization in flight, including a streaming INT8/INT4 compression engine for keys and values yielding a 2.667× capacity boost.
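As a sanity check on the capacity figure, one bit-width assignment that reproduces it is an FP16 baseline with keys stored as INT8 and values as INT4. This assignment is an assumption made here for illustration; the text does not state which tensor gets which precision, and the calculation ignores scale and zero-point metadata.

```latex
\[
\frac{b_K^{\mathrm{FP16}} + b_V^{\mathrm{FP16}}}{b_K^{\mathrm{INT8}} + b_V^{\mathrm{INT4}}}
= \frac{16 + 16}{8 + 4}
= \frac{32}{12}
\approx 2.667
\]
```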

Key PIM-CACHE runtime policies:

  • Topology-aware KV placement: New requests are scheduled to stacks that already hold their prefix, minimizing cross-stack migration.
  • Workload-aware cache eviction: Every KV block is scored for demotion from compute to capacity based on reuse probability and access recency, with quantized migration to maximize capacity.
  • Bounded replication: Controlled prefix replication across cards only when fan-out frequently amortizes cross-stack transfer cost.
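Two of these policies, prefix-aware placement and score-based demotion, can be sketched as follows. The scoring rule (reuse probability combined with an exponentially decayed recency term), the `half_life` parameter, and all function names are hypothetical; the source says only that reuse probability and access recency feed the score.

```python
import math


def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def choose_stack(request_tokens, stack_prefixes):
    """Pick the stack whose cached prefixes overlap most with the request.

    stack_prefixes: dict stack_id -> list of cached token sequences.
    Ties go to the lowest stack id.
    """
    def best_overlap(stack_id):
        return max(
            (shared_prefix_len(request_tokens, p) for p in stack_prefixes[stack_id]),
            default=0,
        )
    return max(sorted(stack_prefixes), key=best_overlap)


def demotion_score(reuse_prob, last_access, now, half_life=100.0):
    """Higher score = better candidate to move from compute to capacity layers."""
    recency = math.exp(-(now - last_access) * math.log(2) / half_life)  # 1.0 = just used
    return (1.0 - reuse_prob) * (1.0 - recency)


def pick_demotions(blocks, now, count):
    """blocks: dict block_id -> (reuse_prob, last_access). Returns ids to demote."""
    ranked = sorted(blocks, key=lambda b: demotion_score(*blocks[b], now), reverse=True)
    return ranked[:count]
```

In the described design, demoted blocks would additionally be quantized during migration; that step is omitted here.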

Empirically, this yields a 1.62× geometric-mean throughput boost and 1.70× SLO-compliant capacity over AttAcc (PIM-only flat attention baseline), combined with lower per-token energy.

AQPIM (Matsushima et al., 20 Apr 2026) employs in-memory product quantization (PQ) of activations/KV cache, using hardware-accelerated clustering and codebook search directly in PIM logic. It reports KV-cache compression together with attention-kernel acceleration and an overall decode speedup, at a small accuracy cost. Geometric and memory locality is preserved by codebook-aware data placement and hardware gather schemes within DRAM rows, enabling nearly constant per-token decode latency for very long contexts.
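Product quantization itself can be illustrated in plain Python. This is a generic PQ sketch, not AQPIM's hardware pipeline: the codebooks are taken as given (the described system learns them with in-memory clustering), and the function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid.

    codebooks[m] is the list of centroids for sub-vector m; the vector
    length is assumed to be divisible by len(codebooks).
    """
    sub = len(vec) // len(codebooks)
    codes = []
    for m, book in enumerate(codebooks):
        piece = vec[m * sub:(m + 1) * sub]
        codes.append(min(range(len(book)), key=lambda j: sq_dist(piece, book[j])))
    return codes


def pq_decode(codes, codebooks):
    """Reconstruct the approximate vector from its centroid indices."""
    out = []
    for m, j in enumerate(codes):
        out.extend(codebooks[m][j])
    return out


def build_tables(query, codebooks):
    """Per-subspace lookup tables: dot product of the query piece with every centroid."""
    sub = len(query) // len(codebooks)
    return [
        [sum(q * c for q, c in zip(query[m * sub:(m + 1) * sub], centroid)) for centroid in book]
        for m, book in enumerate(codebooks)
    ]


def pq_score(codes, tables):
    """Approximate query-key dot product using only table lookups and adds."""
    return sum(tables[m][j] for m, j in enumerate(codes))
```

The tables are built once per query and reused for every cached key, so scoring a key costs one lookup per sub-vector instead of a full dot product.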

4. In-Memory Caching and Data Placement for Digital and Analog PIM

PIM-CACHE includes hardware-level modifications in both SRAM and DRAM-based arrays designed to maximize utilization, parallelism, and buffer reuse.

NVM-in-Cache (Chakraborty et al., 15 Sep 2025) overlays RRAM devices onto a standard 6T SRAM, creating a compact 6T-2R bit-cell. In PIM mode, analog multiply-accumulate (MAC) is performed on the power rails using weighted configuration circuits and sampled ADCs, while normal SRAM operation is retained. The hybrid design effectively doubles usable cache capacity and is evaluated for throughput (TOPS), energy efficiency (TOPS/W), and per-bit area in GF22 FDSOI. The approach is drop-in compatible, coexists with the cache hierarchy of existing systems, and lowers DRAM traffic for allocation metadata in dynamic workloads.

CD-PIM (Lin et al., 18 Jan 2026) for LPDDR5-based in-situ compute reallocates bank and pseudo-bank segments for distinct "decode" (GEMV) and "prefill" (GEMM) phases. Under high-bandwidth compute mode (HBCEM), four pseudo-banks per bank are exploited, raising per-bank bandwidth and boosting overall decoding throughput over both a GPU-only baseline and prior AttAcc. The PIM-cache mapping schemes partition K-cache and V-cache matrices across all CUs for 100% utilization (vs. 1.6% naive), reducing traffic and improving the buffer hit rate for multi-use weights.
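The utilization gap between naive and partitioned mapping can be illustrated with a toy model. This sketch assumes 64 compute units (CUs), a round-robin row partition, and lock-step execution; none of these are stated in the text. The CU count was chosen only because 1/64 ≈ 1.6% is consistent with the naive figure quoted above.

```python
def partition_rows(num_rows, num_cus):
    """Assign matrix rows to compute units round-robin; one row list per CU."""
    assignment = [[] for _ in range(num_cus)]
    for r in range(num_rows):
        assignment[r % num_cus].append(r)
    return assignment


def utilization(assignment):
    """Busy CU-steps divided by total CU-steps when all CUs run in lock-step.

    The number of steps is set by the most heavily loaded CU.
    """
    longest = max(len(rows) for rows in assignment)
    if longest == 0:
        return 0.0
    busy = sum(len(rows) for rows in assignment)
    return busy / (len(assignment) * longest)


# Naive mapping (assumed): the whole matrix lands in a single CU.
naive = [list(range(128))] + [[] for _ in range(63)]
balanced = partition_rows(128, 64)
```

Under these assumptions `utilization(naive)` is 1/64 and `utilization(balanced)` is 1.0.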

5. Data Layout and Mapping: Sparse and Semantically-Aligned PIM Cache Management

Bandwidth waste in PIM-based LLM attention can be dominated by cache mapping strategies, especially under sparse attention.

STARC (Fan et al., 9 May 2025) proposes semantic clustering of key-value pairs such that tokens with high mutual attention are physically co-located in PIM DRAM rows and banks. K-means clustering on key vectors, with the cluster count set relative to the total number of tokens, is followed by a round-robin mapping to banks and contiguous row allocation per cluster. The result is that during attention, only the clusters (rows) with maximal relevance (by centroid-query dot-product score) are fetched, realizing minimal row activation and maximal pipeline utilization.
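The three steps described above (cluster keys, map clusters to banks, fetch only the best-scoring clusters) can be sketched as follows. This is a plain Lloyd's k-means in pure Python with a deterministic initialization, chosen for brevity; the function names, the initialization, and the layout dictionary are assumptions, not STARC's actual implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans(keys, k, iters=10):
    """Lloyd's k-means on key vectors, initialized from the first k keys."""
    centroids = [list(v) for v in keys[:k]]
    assign = [0] * len(keys)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(keys):
            assign[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [keys[i] for i in range(len(keys)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def place_clusters(assign, k, num_banks):
    """Round-robin clusters over banks; a cluster's tokens stay contiguous."""
    return {
        c: {"bank": c % num_banks, "tokens": [i for i, a in enumerate(assign) if a == c]}
        for c in range(k)
    }


def select_clusters(query, centroids, top):
    """Rank clusters by centroid-query dot product and keep the best `top`."""
    ranked = sorted(range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True)
    return ranked[:top]
```

At attention time only the token lists of the selected clusters would be read, which is what keeps row activations low when relevance is concentrated in a few clusters.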

Quantitatively, STARC provides:

  • Latency and energy reductions versus token-wise sparsity;
  • Latency and energy reductions compared to dense full KV retrieval;
  • Negligible (<2%) downstream accuracy loss relative to state-of-the-art sparse attention.

6. PIM-Cache Coherence, Dynamic Allocation, and Software-Hardware Synergy

PIM-CACHE also covers inter-core consistency and dynamic allocator metadata acceleration.

LazyPIM (Boroumand et al., 2017) introduces a speculative, signature-compressed coherence protocol for PIM systems sharing data between host and PIM logic. Instead of using traditional directory/MESI at each access, PIM execution is speculative and conflicts are detected and resolved after the fact using Bloom-filter signatures at commit boundaries, reducing off-chip traffic and energy and improving performance. Further reductions are possible by adapting signature sizing and chunk boundaries dynamically.
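A signature-based commit check can be sketched with a small Bloom filter. This is a simplified illustration under stated assumptions: the signature size, the hash construction (`hashlib.blake2b`), and the rule "roll back if any host-written line may be in the PIM read set" are choices made here, not details taken from the paper.

```python
import hashlib


class Signature:
    """Fixed-size Bloom filter over cache-line addresses."""

    def __init__(self, bits=256, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.mask = 0  # the filter, held as a Python integer bitmask

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, addr):
        for p in self._positions(addr):
            self.mask |= 1 << p

    def may_contain(self, addr):
        # False means definitely absent; True may be a false positive.
        return all((self.mask >> p) & 1 for p in self._positions(addr))


def can_commit(pim_reads: Signature, host_written_lines) -> bool:
    """True if the speculative PIM chunk may commit, False if it must roll back."""
    return not any(pim_reads.may_contain(a) for a in host_written_lines)
```

Because a Bloom filter has no false negatives, a real conflict is never missed; false positives only cause unnecessary rollbacks, which is why signature sizing matters for performance.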

PIM-malloc (Lee et al., 19 May 2025) leverages a 16-line fully associative, per-core buddy cache for dynamic allocation trees in PIM DRAM. Caching only the 4 B metadata word of each tree node reduces allocation latency and avoids bandwidth-heavy DRAM reads, at an additional per-core area and power cost.
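The effect of a small metadata cache on buddy-tree walks can be seen in a toy model. This sketch assumes LRU replacement, write-through updates, and an implicit binary tree rooted at index 1; the source specifies only the 16-line, fully associative organization, so the rest is illustrative.

```python
from collections import OrderedDict


class BuddyCache:
    """Fully associative cache of buddy-tree node metadata (one word per node)."""

    def __init__(self, backing, lines=16):
        self.backing = backing        # list standing in for metadata held in PIM DRAM
        self.lines = lines
        self.cache = OrderedDict()    # node index -> metadata word, oldest first
        self.dram_reads = 0

    def read(self, node):
        if node in self.cache:
            self.cache.move_to_end(node)       # LRU is an assumed policy
            return self.cache[node]
        self.dram_reads += 1                   # miss: fetch the word from DRAM
        word = self.backing[node]
        self.cache[node] = word
        if len(self.cache) > self.lines:
            self.cache.popitem(last=False)     # evict the least recently used line
        return word

    def write(self, node, word):
        self.backing[node] = word              # write-through keeps the sketch simple
        if node in self.cache:
            self.cache[node] = word
            self.cache.move_to_end(node)


def path_to(node):
    """Node indices from the root (1) down to `node` in an implicit binary tree."""
    path = []
    while node >= 1:
        path.append(node)
        node //= 2
    return path[::-1]
```

Consecutive allocations tend to walk through the same upper tree levels, so after the first walk most node reads hit in the cache and only the lowest levels go to DRAM.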

PIM-SHERPA (Lee et al., 10 Mar 2026) targets the cacheability/hardware-trigger mismatch for in-place GEMM/GEMV in LLMs: prefill requires a host-side, cache-resident duplicate of the weights, while decode triggers PIM only if memory is non-cacheable. Double buffering and on-demand swizzled memory copy sidestep this, achieving memory capacity savings for Llama 3.2 models, with negligible performance penalty at realistic sequence lengths.

7. Design Trade-offs, Scalability, and Limitations

  • Efficacy of deduplication/content-aware PIM-CACHE schemes is workload dependent; highly redundant (or clustered) input is optimal.
  • Quantization-based PIM-CACHE (AQPIM, TokenStack) introduces accuracy/throughput trade-offs, with competitive losses (<2 points) at the reported compression ratios.
  • Hardware-based PIM-caches (NVM-in-Cache, CD-PIM) must balance added area/ADC bottlenecks against throughput and use-case generality; benefits diminish for random-access or low-locality workloads.
  • Cache coherence schemes hinge on reasonable false positive rates/commit intervals; excessive rollback or large working sets can degrade gains.
  • Re-mapping/KV clustering-based PIM-CACHE (STARC) is effective where attention is semantically or temporally skewed, but less so under uniformly random or single-use contexts.
  • Future directions include hardware support for adaptive content-aware copy, extension to inter-DPU/in-PIM deduplication, online learning for cache replacement, and integration with transactional memory for robust rollback.

PIM-CACHE, as an architectural and algorithmic principle, is critical to unlocking near-memory compute at exascale, ensuring that performance, energy, and capacity benefits scale with logic/memory density and application requirements (Yuhala et al., 24 Mar 2026, Li et al., 7 May 2026, Matsushima et al., 20 Apr 2026, Chakraborty et al., 15 Sep 2025, Lin et al., 18 Jan 2026, Boroumand et al., 2017, Lee et al., 10 Mar 2026, Lee et al., 19 May 2025, Fan et al., 9 May 2025).
