Shared-PIM Architecture Overview
- Shared-PIM is a family of in-memory processing architectures that allow multiple computational agents to access and share memory concurrently.
- It employs techniques like bank-wide data buses, fine-grained locking, and adaptive scheduling to reduce latency and energy consumption.
- Demonstrated across DRAM, ReRAM, and unified NPU-PIM systems, Shared-PIM achieves significant speedups, including up to 60× lower latency for multi-tenant DNN inference.
Shared-PIM refers to a family of Processing-in-Memory (PIM) architectures and methodologies that enable concurrent or coordinated access to PIM resources among multiple computational agents, software tenants, or functional units. These architectures aim to efficiently utilize memory bandwidth, reduce data-movement latency, and address contention, consistency, or resource-sharing across CPUs, NPUs, and PIM accelerators in various system contexts—including DRAM-based in-memory processing, edge-AI, deep neural network inference, and database acceleration.
1. Architectural Foundations of Shared-PIM
Shared-PIM encompasses hardware and system-level innovations emerging in both commodity DRAM- and custom PIM-class memories. Key architectural constructs include:
- Bank-wide Data Buses and Shared Rows in DRAM: In Shared-PIM DRAM designs, each memory bank is augmented with a bank-wide data bus (BK-bus) and dedicated shared rows, enabling the concurrent activation of computational (PIM) and data transfer operations. This circumvents the typical serialization found in conventional in-DRAM PIM, where row movement and computation are mutually exclusive due to shared peripheral structures such as sense amplifiers and wordlines (Mamdouh et al., 2024).
- Resource Partitioning and Scheduling: For multi-tenant or multi-agent scenarios (e.g., multi-tenant DNNs, CPUs/NPUs sharing PIM), Shared-PIM leverages spatial hardware partitioning (e.g., crossbar region slicing, DRAM rank/row assignment), fine-grained operator scheduling, or adaptive mapping—allocating disjoint or dynamically balanced fractions of PIM fabric among participants to minimize contention (Li et al., 2024, Seo et al., 2024).
- Unified Main Memory Systems: In NPU–PIM unified architectures, all DRAM address space is accessible both as standard main memory (for DMA/NPU loads/stores) and as the operand/weight store for in-memory PIM operations. Scheduling ensures mutual exclusion between direct memory accesses and in-DRAM computation, maximizing utilization and eliminating parameter duplication (Seo et al., 2024).
- Fine-Grained Locking and Launch/Poll Mechanisms: Memory controllers are enhanced to decode and arbitrate "launch" (PIM offload) and "poll" (completion query) commands, enabling CPU/PIM interleaving at the operator level. Bank- or region-level locks ensure atomicity, and polling modules aggregate PIM task status efficiently (Zhao et al., 4 Aug 2025).
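The resource-partitioning idea above can be illustrated with a small sketch that splits a fixed pool of PIM units among tenants so that their estimated latencies end up roughly balanced. This is only an illustration under stated assumptions: the function names, the "latency = work / units" model, and the workload numbers are hypothetical and are not taken from the cited designs.

```python
def partition_units(work, total_units):
    """Greedily split `total_units` PIM units among tenants.

    work: dict mapping tenant name -> abstract work amount (hypothetical).
    Assumed latency model: latency = work / allocated units.
    Each tenant starts with one unit; every remaining unit goes to the
    tenant that currently has the highest estimated latency.
    """
    if total_units < len(work):
        raise ValueError("need at least one unit per tenant")
    alloc = {tenant: 1 for tenant in work}
    for _ in range(total_units - len(work)):
        slowest = max(work, key=lambda t: work[t] / alloc[t])
        alloc[slowest] += 1
    return alloc


def estimated_latency(work, alloc):
    """Per-tenant latency under the assumed work / units model."""
    return {tenant: work[tenant] / alloc[tenant] for tenant in work}


if __name__ == "__main__":
    demo_work = {"A": 600, "B": 300, "C": 100}   # made-up workloads
    demo_alloc = partition_units(demo_work, total_units=10)
    print(demo_alloc)
    print(estimated_latency(demo_work, demo_alloc))
```

A real allocator would replace the linear latency model with measured or modeled operator latencies, but the balancing loop conveys why disjoint, workload-proportional slices reduce the slowest tenant's latency.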
2. Enabling Concurrency: Mechanisms and Protocols
A central goal of Shared-PIM is to break the conventional "compute-then-move" or "move-then-compute" serialization bottleneck. Key concurrency-enabling mechanisms include:
- Overlapping Compute and Data Transfer in DRAM Banks: Shared-PIM architectures integrate lightweight, bank-wide buses and dedicated shared rows controlled via secondary gates (GWL). A typical sequence involves copying data from a source to a shared row, freeing the local sense amplifier for computation, then using the BK-bus to transfer data to a destination row in a single operation, regardless of subarray distance. Subarray sense amplifiers are thus available for parallel PIM computation during transfers (Mamdouh et al., 2024).
- Tenant-Aware Partitioning and Pipeline Rebalancing: In ReRAM-based Shared-PIM, hardware partitioning at the tile/crossbar level is coupled with operator-level splitting and duplication, yielding a fine-grained pipeline where resource allocation continuously adapts to tenant workload demands. Tenants synchronize only at layer boundaries, and optimized partitioning equilibrates latencies across all participants (Li et al., 2024).
- Unified Memory Access Scheduling: Architectures such as IANUS introduce adaptive mapping and scheduling to handle exclusive access periods for either NPU DMA or PIM computation on the same memory fabric. When a macro-PIM operation is initiated, all DMA is blocked; otherwise, normal NPU load/store operations proceed. Pipelined attention- or fully-connected layers are mapped to exploit parallelism across both domains (Seo et al., 2024).
- Interleaved CPU–PIM Execution in Databases: PUSHtap introduces a two-phase protocol for analytical queries: a load phase (banks locked for PIM execution, bulk data transfer to PIM WRAM), followed by a compute phase (bank-unlocked, CPU continues OLTP, PIM operates in-WRAM), coordinated via launch/poll commands and fine-grained locking within the memory controller (Zhao et al., 4 Aug 2025).
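The two-phase launch/poll protocol described for PUSHtap can be sketched as a toy controller model. The class and method names below are hypothetical and the model ignores timing entirely; it only shows the locking discipline: banks are locked during the load phase, released once data sits in PIM-local memory, and task status is queried through a poll call.

```python
class ToyController:
    """Toy memory-controller model with bank locks and launch/poll."""

    def __init__(self, num_banks):
        self.locked = [False] * num_banks
        self.tasks = {}      # task_id -> {"banks": [...], "phase": str}
        self.next_id = 0

    def launch(self, banks):
        """Start a PIM task: lock its banks for the load phase."""
        if any(self.locked[b] for b in banks):
            raise RuntimeError("bank already locked by another PIM task")
        for b in banks:
            self.locked[b] = True
        task_id = self.next_id
        self.next_id += 1
        self.tasks[task_id] = {"banks": list(banks), "phase": "load"}
        return task_id

    def finish_load(self, task_id):
        """Load phase done: unlock the banks and enter the compute phase."""
        task = self.tasks[task_id]
        for b in task["banks"]:
            self.locked[b] = False
        task["phase"] = "compute"

    def finish_compute(self, task_id):
        """Mark the PIM task as complete."""
        self.tasks[task_id]["phase"] = "done"

    def poll(self, task_id):
        """Completion query: returns 'load', 'compute', or 'done'."""
        return self.tasks[task_id]["phase"]

    def cpu_access_allowed(self, bank):
        """CPU requests are served only on unlocked banks."""
        return not self.locked[bank]


if __name__ == "__main__":
    ctrl = ToyController(num_banks=4)
    tid = ctrl.launch([0, 1])
    print(ctrl.poll(tid), ctrl.cpu_access_allowed(0), ctrl.cpu_access_allowed(2))
    ctrl.finish_load(tid)
    print(ctrl.poll(tid), ctrl.cpu_access_allowed(0))
```

In this model the CPU is blocked from a bank only while that bank is in the load phase, which mirrors the stated goal of letting OLTP continue during PIM computation.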
3. Consistency, Coherence, and Concurrency Control
Shared-PIM must address correctness and consistency challenges:
- Cache Coherence for Shared CPU–PIM Execution: In 3D-stacked memory systems (e.g., HMC/HBM), approaches such as LazyPIM leverage speculative execution and Bloom filter–style compressed signatures to track read/write sets per PIM kernel chunk. On kernel commit, signatures are reconciled to detect conflicts (e.g., CPU writes to PIM-read lines). Successful commits apply changes atomically; conflicts trigger rollbacks. This approach keeps the familiar MESI protocol while using less bandwidth and energy than naive per-access coherence (Boroumand et al., 2017).
- Multi-Version Concurrency Control (MVCC): In database applications, Shared-PIM architectures like PUSHtap integrate MVCC by separating data into compact-aligned regions ("data" and "delta"), and managing version chains and visibility bitmaps for efficient snapshot isolation. Metadata and defragmentation protocols enable OLAP queries to access the freshest data with minimal OLTP disruption, and concurrency is orchestrated directly in the memory controller with minimal area overhead (Zhao et al., 4 Aug 2025).
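The signature-based conflict check used in speculative CPU–PIM coherence can be sketched with a small Bloom-filter-style set summary. This is a simplified illustration, not LazyPIM's actual hardware format: the class name, the signature size, the hash construction, and the rule of testing CPU-written addresses directly against the PIM read signature are all assumptions.

```python
import hashlib


class Signature:
    """Fixed-size Bloom-filter-style summary of cache-line addresses."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit vector

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address (assumed scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr):
        for p in self._positions(addr):
            self.bits |= 1 << p

    def may_contain(self, addr):
        # No false negatives; false positives are possible.
        return all((self.bits >> p) & 1 for p in self._positions(addr))


def can_commit(pim_read_sig, cpu_written_lines):
    """Return True to commit the PIM chunk, False to roll it back.

    A conflict is flagged when any CPU-written line may be in the PIM
    read set. False positives only cause unnecessary rollbacks.
    """
    return not any(pim_read_sig.may_contain(a) for a in cpu_written_lines)


if __name__ == "__main__":
    reads = Signature()
    for line in (0x1000, 0x1040):
        reads.add(line)
    print(can_commit(reads, [0x1040]))  # conflicting write -> False
    print(can_commit(reads, []))        # no CPU writes -> True
```

Because a Bloom filter never reports a member as absent, a true conflict is always detected; the cost of compression is an occasional spurious rollback.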
4. Performance, Energy, and Area Characteristics
Quantitative evaluation across recent Shared-PIM works demonstrates substantial improvements over baseline PIM and DRAM architectures:
| Architecture | Speedup / Latency Reduction | Energy Savings | Area Overhead | Notes / Benchmarks |
|---|---|---|---|---|
| Shared-PIM DRAM (Mamdouh et al., 2024) | 5× (copy) | 1.2× | 7.16% | 44% faster on MM/PMM/NTT; 29% on BFS/DFS (vs. pLUTo+LISA) |
| Shared-PIM ReRAM (Li et al., 2024) | up to 60× (DNNs) | up to 1.89× | n/a | 3–4 tenants; baseline is vanilla ISAAC |
| IANUS (NPU–PIM) (Seo et al., 2024) | 6.2× (GPT-2) | 4.4× | n/a | 3.2× over DFX (FPGA); unified memory for end-to-end LLM |
| PUSHtap (Zhao et al., 4 Aug 2025) | 3.4× (OLTP) | n/a | 0.115 mm² | 97% PIM BW utilization; 4.4× OLAP over multi-instance PIM |
| LazyPIM (Boroumand et al., 2017) | 19.6% (HTAP/app) | 18.0% | ~1% L1 | 30.9% traffic reduction vs. best prior PIM coherence |
Shared-PIM approaches report both component-level and workload-level advances, such as:
- Copy latency per 8 KB row: 52.75 ns (vs. 260.5 ns for LISA and 1366 ns for memcpy).
- Up to 44% faster execution on numerical kernels (matrix multiply, NTT, polynomial multiply).
- Multi-tenant PIM DNNs: up to 60.43× inference speedup and 1.89× lower energy.
- Database OLTP/OLAP: minimal transaction overhead (<3.5% over row-store), stable OLAP throughput under defragmentation, with <2% storage/area overhead.
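As a quick arithmetic check on the copy-latency figures listed above, the following snippet computes how many times slower each baseline is than the 52.75 ns Shared-PIM copy. The latency values come from the text; the function name and dictionary layout are illustrative only.

```python
# Reported copy latencies per 8 KB row, in nanoseconds (from the text above).
COPY_LATENCY_NS = {"Shared-PIM": 52.75, "LISA": 260.5, "memcpy": 1366.0}


def speedup_over(latencies, reference):
    """Ratio of each entry's latency to the reference entry's latency."""
    ref = latencies[reference]
    return {name: ns / ref for name, ns in latencies.items()}


if __name__ == "__main__":
    for name, ratio in speedup_over(COPY_LATENCY_NS, "Shared-PIM").items():
        print(f"{name}: {ratio:.2f}x the Shared-PIM copy latency")
```

The LISA ratio works out to about 4.94, which is consistent with the roughly 5× copy figure in the table, and the memcpy ratio is about 25.9.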
5. Trade-offs, Limitations, and Applicability
The adoption of Shared-PIM introduces a set of trade-offs:
- Area and Power Overheads: Support for bank-wide buses, additional sense amplifiers, and control logic incurs modest die area (e.g., ~7% in DRAM, 0.115 mm² in the controller for PUSHtap) and transient increases in active power during shared operations (Mamdouh et al., 2024, Zhao et al., 4 Aug 2025).
- Scheduler and Compiler Complexity: Effective resource allocation, partitioning, adaptive mapping, and scheduling require accurate latency and bandwidth models. Deviations (e.g., in NPU–PIM unified memory) can erode potential speedups; robust autotuning frameworks are an open requirement (Seo et al., 2024, Li et al., 2024).
- Workload Suitability: Gains are highest for workloads with high data dependency, frequent inter-unit transfers, or tight performance/freshness coupling (e.g., DNNs, graph algorithms, HTAP databases). In workloads that rarely move data or do not require low-latency sharing, overheads are minimal but benefits are muted (Mamdouh et al., 2024, Zhao et al., 4 Aug 2025).
- Coherence Scalability: In cache-coherent Shared-PIM (e.g., LazyPIM), scaling to large numbers of PIM engines may increase commit/rollback traffic or coherence directory contention, requiring hierarchy or distributed control layers (Boroumand et al., 2017).
6. Representative Implementations and Results
Several recent representative implementations illustrate technical diversity within the Shared-PIM paradigm:
- Shared-PIM DRAM (Mamdouh et al., 2024): Each DRAM bank reserves two shared rows per subarray, adds a bank-wide data bus segmented to control wire capacitance, and extends bank sense amplifiers. Latency for 8KB inter-subarray copies drops to 52.75 ns, and overall area overhead is 7.16% over baseline pLUTo.
- ReRAM Shared-PIM for DNNs (Li et al., 2024): Partitioning and operator-splitting yield up to 60× speedup (MT5, chip1 config), with operator-only or tenant-only optimizations contributing up to ~21× and ~1.26×, respectively; energy savings reach 1.89× for contention-heavy mixes.
- IANUS NPU–PIM System (Seo et al., 2024): Implements a unified memory pool, adaptive FC-mapping, and macro-PIM command scheduling across DRAM/GDDR6 with in-PU computation. Utilization is maximized by pipelining Q/K/V/FC workloads, yielding an average 6.2× GPT-2 speedup over an NVIDIA A100.
- PUSHtap for HTAP Databases (Zhao et al., 4 Aug 2025): Achieves 97.4% PIM and 59.8% CPU bandwidth utilization through two-dimensional data alignment, employs block-circulant partitioning for PIM parallelism, and integrates MVCC with negligible controller footprint. Peak OLTP and OLAP throughput improve by 3.4× and 4.4×, respectively, over multi-instance PIM.
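The snapshot-isolation idea behind the MVCC integration mentioned for PUSHtap can be sketched with a generic base-plus-delta row. This is textbook MVCC under simplifying assumptions, not PUSHtap's actual memory layout: the class name, the timestamp scheme, and the defragmentation rule are hypothetical, and deltas are assumed to be appended in increasing commit order.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRow:
    base_value: int                              # value held in the "data" region
    deltas: list = field(default_factory=list)   # (commit_ts, value) pairs

    def write(self, commit_ts, value):
        """OLTP update: append a new version to the delta chain."""
        self.deltas.append((commit_ts, value))

    def read(self, snapshot_ts):
        """OLAP read: newest version committed at or before snapshot_ts."""
        value = self.base_value
        for ts, v in self.deltas:
            if ts <= snapshot_ts:
                value = v
        return value

    def defragment(self, oldest_snapshot_ts):
        """Fold versions that every active snapshot already sees into the base."""
        self.base_value = self.read(oldest_snapshot_ts)
        self.deltas = [(ts, v) for ts, v in self.deltas
                       if ts > oldest_snapshot_ts]


if __name__ == "__main__":
    row = VersionedRow(base_value=10)
    row.write(commit_ts=5, value=20)
    row.write(commit_ts=9, value=30)
    print(row.read(4), row.read(7), row.read(9))
    row.defragment(oldest_snapshot_ts=7)
    print(row.base_value, row.deltas)
```

Readers at a given snapshot never observe later writes, and defragmentation shortens the delta chain without changing what any active snapshot reads.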
7. Open Research Challenges and Future Directions
Open topics for further Shared-PIM advancement include:
- Dynamic, autonomous resource managers capable of workload-aware partitioning, scheduling, and fabric reconfiguration.
- Efficient cross-device data coherence, especially in multi-chiplet or distributed memory environments.
- Extended consistency and concurrency mechanisms for transactional, graph, and streaming analytics beyond the current MVCC and lazy coherence protocols.
- Hardware–software co-design for the integration of Shared-PIM with novel AI accelerators, data-centric operating systems, and storage hierarchies.
As memory-bound applications proliferate, Shared-PIM is a critical class of architectures for unlocking scalable, low-latency data-centric computation and sharing in heterogeneous systems across edge, datacenter, and embedded domains (Mamdouh et al., 2024, Li et al., 2024, Boroumand et al., 2017, Seo et al., 2024, Zhao et al., 4 Aug 2025).