Processing-in-Memory Architectures
- Processing-in-Memory architectures are designs that integrate computational logic within or near memory arrays to overcome data movement bottlenecks.
- They leverage in-DRAM logic, specialized accelerators, and heterogeneous integration to achieve significant latency reduction and energy efficiency.
- Cutting-edge PIM approaches include PuM, PnM, and hybrid models, supported by advanced compilers, error detection, and simulation frameworks.
Processing-in-Memory (PIM) architectures re-integrate computational logic with memory arrays or their peripheries to directly address the data movement bottleneck inherent in the von Neumann model. By executing computation near or within memory, PIM substantially reduces the time and energy costs associated with transferring operands and results between separated compute and storage resources, especially for data-intensive applications such as machine learning, graph analytics, and database workloads. The recent evolution of PIM has been driven by both device-level innovations (e.g., in-DRAM logic, non-volatile crossbars) and system-level advances (heterogeneous integration, microarchitectural support, and software toolchains), aiming to deliver latency reduction, bandwidth scaling, and energy optimization.
1. Architectural Principles and Taxonomy
PIM architectures span a broad spectrum defined by their coupling between memory and compute, granularity of computational primitives, and integration technology:
- Processing-Using-Memory (PuM or in-situ PIM): Computation physically occurs in memory cell arrays (e.g., DRAM subarrays, memristor crossbars). Classic examples include RowClone for fast DRAM copies and triple-row activation (as in Ambit, SIMDRAM) for bitwise logic. This approach leverages existing memory peripherals, typically yielding minimal area overhead (≈0.01% of DRAM) but is limited to operations that can be mapped to in-array analog or digital behavior (Mutlu et al., 2020).
- Processing-Near-Memory (PnM): Dedicated compute logic (e.g., ALUs, SIMD units, programmable cores) is placed adjacent to memory: in the logic layer of 3D-stacked DRAM (e.g., HMC, HBM), next to the banks of DRAM chips, as in UPMEM’s DPUs (Gómez-Luna et al., 2021, Hyun et al., 2023) and GDDR6-based AiM, or as accelerators in the DIMM buffer logic (e.g., AxDIMM). This model provides much richer general-purpose compute capability at higher area, design, and thermal cost.
- Hybrid and Specialized Architectures: SRAM-PIM (for high-speed, low-latency logic-in-cache (Duan et al., 25 May 2025)), resistive-PIM (ReRAM/PCM for analog multiply-accumulate (Sharma et al., 2024)), and multi-level DRAM PIM (bank-group/rank-level parallelism (Kim et al., 2023)) represent tailored designs for specific workload bottlenecks.
The main architectural innovations emphasize maximizing internal bandwidth by exploiting DRAM bank/subarray-level parallelism, providing low-overhead compute primitives, and carefully managing the coexistence of compute and memory operations.
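The triple-row activation mentioned above can be illustrated with a small functional model. The sketch below assumes that activating three rows at once leaves each bitline at the majority value of the three cells, and that fixing one row to all zeros or all ones turns that majority into AND or OR. The function names and the 8-bit row width are hypothetical, and the model ignores timing, row copies, and analog effects.

```python
WIDTH = 8                      # illustrative row width in bits (assumption)
MASK = (1 << WIDTH) - 1        # control row of all ones


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, a functional stand-in for charge sharing."""
    return (a & b) | (b & c) | (a & c)


def pum_and(a: int, b: int) -> int:
    """AND via majority with an all-zeros control row."""
    return majority(a, b, 0)


def pum_or(a: int, b: int) -> int:
    """OR via majority with an all-ones control row."""
    return majority(a, b, MASK)


row_a = 0b11001010
row_b = 0b10100110
assert pum_and(row_a, row_b) == row_a & row_b
assert pum_or(row_a, row_b) == row_a | row_b
```

Because every bit position is evaluated independently, one such activation operates on a whole row at a time, which is where the internal-bandwidth advantage of PuM comes from.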
2. Microarchitectural Mechanisms and Circuit-Level Innovations
Modern PIM designs extend classic DRAM/memory arrays with new microarchitectural elements:
- Row Partitioning and Bank-Wide Buses: In designs such as Shared-PIM (Mamdouh et al., 2024), a subset of DRAM rows per subarray is reserved as “shared rows” for staging data movement, while the remaining rows serve as “compute rows.” A bank-level bus (BK-bus), implemented using global bitlines and augmented sense amplifiers, allows concurrent, low-latency inter-subarray transfers, decoupling local compute from bulk copy. The critical innovation is concurrent compute and copy within the same bank: row transfers are pipelined with local logic, yielding up to 5× lower copy latency and 1.2× lower energy than previous methods (e.g., LISA).
- Sense/Drive Logic for Data Movement: Shared rows are connected to bank-wide buses via a second access transistor gated by a dedicated signal, enabling fast, segmented row activation and copying. This permits overlapped computation and bulk copy, removing idle compute periods previously imposed by long data movement operations.
- Programmable SIMD and Reduction Engines: For higher-level operators, PIM arrays may contain specialized SIMD datapaths (e.g., 8-lane vector ALUs in Darwin (Kim et al., 2023)) and bank-/bank-group-level controllers for massive fine-grained parallelism. Compilation techniques (see Section 5) ensure computational mapping aligns with the physical data layout.
- Lightweight Error Detection and Reliability: FAT-PIM (Zubair et al., 2022) introduces sum-based homomorphic checksum columns to detect analog and logic errors with 3.9% storage and <5% performance overhead, supporting reliability in ReRAM/memristor crossbars.
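The idea behind a sum-based checksum column can be sketched in a few lines. The example below is a generic illustration, not the FAT-PIM design itself: it assumes a crossbar that computes one dot product per column, appends a column whose weights are the row sums, and checks that the checksum output equals the sum of the data outputs. All function names are hypothetical, and integers are used so the comparison is exact; an analog implementation would need a tolerance.

```python
def add_checksum_column(weights):
    """Append one column holding each row's sum (the checksum weights)."""
    return [list(row) + [sum(row)] for row in weights]


def crossbar_mvm(x, weights, fault=None):
    """Column outputs y[j] = sum_i x[i] * weights[i][j].

    `fault=(column, delta)` optionally perturbs one output to model an error.
    """
    n_cols = len(weights[0])
    y = [sum(xi * row[j] for xi, row in zip(x, weights)) for j in range(n_cols)]
    if fault is not None:
        col, delta = fault
        y[col] += delta
    return y


def checksum_ok(y):
    """True when the data outputs add up to the checksum output."""
    *data, checksum = y
    return sum(data) == checksum


weights = [[1, 2, 0], [3, 1, 4]]           # illustrative 2x3 weight matrix
protected = add_checksum_column(weights)    # becomes 2x4
x = [2, 5]

clean = crossbar_mvm(x, protected)
faulty = crossbar_mvm(x, protected, fault=(1, 3))
assert checksum_ok(clean)
assert not checksum_ok(faulty)
```

The check works because the dot product is linear, so the checksum column's output is the sum of the other columns' outputs for any input. Errors that happen to cancel, such as equal perturbations in a data column and the checksum column, would escape this simple test.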
3. Data Movement, Locality, and Interconnects
Improving the proximity of data to compute is central in PIM:
- Locality-Aware Data Placement: The spatial mapping of inputs/outputs is fine-grained, with address mapping functions or explicit controllers (e.g., PIM-MMU (Lee et al., 2024)) to maximize bank/chip parallelism and minimize cross-bank accesses.
- Dynamic Data-Relocation: DL-PIM (Tian et al., 9 Oct 2025) employs a distributed, hardware-accelerated address-indirection layer to migrate hot data blocks into local memory regions of demanding PIM units, reducing average memory latency per access by up to 54% in HMC and 50% in HBM. The system dynamically enables or disables such indirection based on observed locality and data reuse, adapting the policy to avoid performance penalties on low-reuse workloads.
- Content-Aware Copy and Deduplication: PIM-CACHE (Yuhala et al., 24 Mar 2026) leverages spatial and temporal similarity in buffer contents, introducing a software-managed content addressable staging buffer and block-wise fingerprinting. This mechanism can reduce coarse-grained host-to-DPU transfers by up to 98% for high-redundancy workloads, boosting end-to-end kernel throughput nearly 10×.
- Hierarchical On-Chip Interconnects: For manycore and chiplet-based PIM designs targeting ML inference or HPC, lightweight, dataflow-aware interconnects such as the Floret SFC NoI (Sharma et al., 2024) provide low-latency intra- and inter-chiplet communication, reducing NoI latency and energy by up to 2.8× relative to mesh or torus networks.
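Block-wise fingerprinting for content-aware copies can be sketched as follows. This is a minimal host-side model under stated assumptions, not the PIM-CACHE implementation: blocks are fingerprinted with SHA-256, a dictionary stands in for the content-addressable staging buffer, and only blocks with unseen fingerprints are scheduled for transfer. The names `plan_transfer`, `reconstruct`, and `staged` are hypothetical.

```python
import hashlib


def plan_transfer(buffer: bytes, block_size: int, staged: dict):
    """Split `buffer` into blocks and decide which ones must be sent.

    `staged` maps fingerprint -> slot index of a block already staged.
    Returns (blocks_to_send, layout), where layout lists the slot index
    that backs each block of the original buffer.
    """
    to_send, layout = [], []
    for off in range(0, len(buffer), block_size):
        block = buffer[off:off + block_size]
        fp = hashlib.sha256(block).digest()
        if fp not in staged:
            staged[fp] = len(staged)    # allocate the next free slot
            to_send.append(block)
        layout.append(staged[fp])
    return to_send, layout


def reconstruct(layout, slots):
    """Rebuild the original buffer from staged slots on the receiving side."""
    return b"".join(slots[i] for i in layout)


# Illustrative buffer: four 64-byte blocks, three of them identical.
data = b"A" * 64 + b"B" * 64 + b"A" * 64 + b"A" * 64
sent, layout = plan_transfer(data, 64, staged={})
saved = 1 - len(sent) / len(layout)     # fraction of block transfers avoided
assert reconstruct(layout, sent) == data
```

In this toy input half of the block transfers are avoided; the savings on a real workload depend entirely on how much redundancy its buffers contain.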
4. Performance, Energy, and Area Efficiency
Quantitative improvements in recent PIM research are significant and benchmark-driven:
- Latency and Throughput: Shared-PIM (Mamdouh et al., 2024) achieves 5× lower copy latency (T_copy ≈ 53 ns vs. 260 ns for LISA) and 1.4× faster pipelined arithmetic primitives in in-DRAM LUT architectures, yielding 29–44% speedups in matrix multiplication, polynomial multiplication, NTTs, and graph BFS/DFS.
- Bandwidth Utilization: Designs such as Darwin (Kim et al., 2023) exploit multi-level DRAM parallelism, achieving 4.0–43.9× throughput increase over CPU-only systems and up to 7.5× faster basic query operators than prior leading PIM.
- Energy Efficiency: Across major DRAM and NVM-based PIMs, energy savings stem predominantly from minimized data movement (10×–100× lower) and overlapped, locality-optimized compute. For instance, Shared-PIM reduces effective copy energy to ≈0.14 μJ (vs. 0.17 μJ for LISA), and bank-group-optimized architectures report up to 85.7% energy reduction over CPU baselines.
- Area Overhead: High-performance PIM designs keep area overhead below 10% (e.g., 7.16% for Shared-PIM, 5.6% for Darwin's rank/bank-group extensions), even while integrating substantial peripheral and bus logic.
- Workload Coverage: PIM accelerates a wide range of computation, including memory-bound ML inference, database operators, graph traversal, and reduction/aggregation primitives. Bit-serial SRAM PIM with sparsity exploitation achieves up to 8.0× MAC throughput improvement and 85% energy savings in digital neural inference (Duan et al., 25 May 2025).
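The copy figures quoted above can be checked with simple arithmetic, and a two-stage pipeline model shows why overlapping copy with compute helps. In the sketch below only the 53 ns, 260 ns, 0.14 μJ, and 0.17 μJ values come from the text; the 100 ns compute time, the step count, and the function names are illustrative assumptions.

```python
# Values quoted in the text.
t_copy_lisa_ns = 260.0
t_copy_shared_ns = 53.0
e_copy_lisa_uj = 0.17
e_copy_shared_uj = 0.14

latency_ratio = t_copy_lisa_ns / t_copy_shared_ns    # about 4.9, i.e. "5x"
energy_ratio = e_copy_lisa_uj / e_copy_shared_uj     # about 1.21, i.e. "1.2x"


def serialized_time(n_steps, t_compute, t_copy):
    """Each step computes, then copies, with no overlap."""
    return n_steps * (t_compute + t_copy)


def pipelined_time(n_steps, t_compute, t_copy):
    """Two-stage pipeline: after the first step fills the pipeline,
    each further step costs only the slower of the two stages."""
    return t_compute + t_copy + (n_steps - 1) * max(t_compute, t_copy)


# Hypothetical workload: 10 steps with 100 ns of local compute each.
n, t_compute = 10, 100.0
no_overlap = serialized_time(n, t_compute, t_copy_shared_ns)
overlap = pipelined_time(n, t_compute, t_copy_shared_ns)
print(f"latency ratio {latency_ratio:.2f}, energy ratio {energy_ratio:.2f}")
print(f"serialized {no_overlap:.0f} ns, pipelined {overlap:.0f} ns")
```

Under this model the copy cost disappears from the steady state whenever the copy is shorter than the compute step, which is the regime a faster bank-level bus makes more common.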
5. Programming Models, Compilers, and System Software
Unlocking PIM benefit requires corresponding software and system support:
- Data-Centric Compilation and Co-Optimization: DCC, a tensor compiler for PIM (Yang et al., 19 Nov 2025), demonstrates up to 13.2× kernel speedup by co-optimizing data rearrangement and compute code, guided by a hierarchical PIM abstraction and discriminative cost model. Compilation strategies must co-tune compute blocking, data layout, and movement, as naive compute-centric tiling may nearly double data movement overhead.
- Pattern-Based and API-Level Support: Higher-level abstractions (e.g., map/reduce/scan in DaPPA (Oliveira, 27 Aug 2025)), explicit PIM offload API extensions, and hardware-aware code transformations enable application developers to write PIM-amenable code without low-level resource management.
- Memory and Virtualization Management: Heterogeneity-aware MMUs (PIM-MMU (Lee et al., 2024)) accelerate and schedule bulk DRAM↔PIM transfers via dedicated hardware engines, sophisticated schedulers, and customized address mappers. Systems support (OS + runtime) is necessary for correct partitioning, mapping, and transfer orchestration.
- Error Handling and Security: Lightweight in-situ error detection (FAT-PIM (Zubair et al., 2022)) and careful management of Bloom-filter-based coherence (LazyPIM (Boroumand et al., 2017)) help maintain reliability and correctness at marginal cost in throughput or storage.
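A pattern-based interface of the kind described above can be sketched as a host-side simulation. The code below is not the DaPPA API or any vendor SDK; `partition` and `pim_map_reduce` are hypothetical names, and each loop iteration merely stands in for one PIM unit working on the data resident in its own memory region.

```python
from functools import reduce


def partition(data, n_units):
    """Split `data` into `n_units` contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), n_units)
    chunks, start = [], 0
    for unit in range(n_units):
        size = base + (1 if unit < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def pim_map_reduce(data, map_fn, reduce_fn, identity, n_units=4):
    """Map and locally reduce on each simulated unit, then combine on the host."""
    partials = []
    for chunk in partition(data, n_units):      # one iteration models one PIM unit
        mapped = (map_fn(x) for x in chunk)
        partials.append(reduce(reduce_fn, mapped, identity))
    return reduce(reduce_fn, partials, identity)  # host-side final combine


# Sum of squares over 0..9, spread across four simulated units.
total = pim_map_reduce(list(range(10)), lambda x: x * x, lambda a, b: a + b, 0)
assert total == sum(x * x for x in range(10))
```

The point of the pattern is that only one partial result per unit crosses the memory interface, while the programmer never names banks, units, or transfer sizes. The reduction function must be associative for the split to be valid.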
6. Benchmarking, Simulation, and Evaluation Methodologies
Rigorous comparative evaluation and design exploration require layered, open, and accurate simulation ecosystems:
- Simulation Stacks: Tools range from device-level (NVSim, SPICE), through microarchitectural (PIMSimulator, Ramulator-PIM, UPMEM uPIMulator), to full-system and event-driven frameworks (PIMSim, LLMServingSim) (Aghaei et al., 26 Nov 2025). Application-level functional simulators (MemTorch, MNSIM, AiM Simulator) enable co-evaluation of hardware and ML software stacks.
- Benchmark Suites: Standardized PIM benchmark suites—PrIM (Gómez-Luna et al., 2021), PIMbench, DAMOV—span dense/sparse LA, graph kernels, reduction/primitives, and ML tasks, supporting fair comparison between architectures.
- Evaluation Metrics: Core metrics include effective execution time, energy per operation, DRAM chip/bank utilization, throughput (GOPS/TOPS), area overhead, and, for analog designs, end-to-end DNN inference accuracy and variability (Aghaei et al., 26 Nov 2025).
- Cross-Level Design Space Exploration: Hybrid and application-driven simulation and co-design frameworks integrate device, circuit, and system-level abstractions, supporting both early architectural ideation and post-silicon validation of assumptions.
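The core metrics listed above reduce to a few ratios once a run's operation count, execution time, and energy are known. The helper below is a minimal sketch with hypothetical names (`RunResult`, `speedup`, `energy_reduction_pct`); the numbers in the usage lines are made up for illustration and are not measurements from any cited system.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    ops: int          # operations executed
    time_s: float     # effective execution time in seconds
    energy_j: float   # total energy in joules

    @property
    def throughput_gops(self) -> float:
        return self.ops / self.time_s / 1e9

    @property
    def energy_per_op_pj(self) -> float:
        return self.energy_j / self.ops * 1e12


def speedup(baseline: RunResult, candidate: RunResult) -> float:
    """How many times faster the candidate finishes the same work."""
    return baseline.time_s / candidate.time_s


def energy_reduction_pct(baseline: RunResult, candidate: RunResult) -> float:
    """Percentage of baseline energy saved by the candidate."""
    return 100.0 * (1.0 - candidate.energy_j / baseline.energy_j)


# Illustrative values only.
cpu = RunResult(ops=2_000_000_000, time_s=2.0, energy_j=10.0)
pim = RunResult(ops=2_000_000_000, time_s=0.5, energy_j=2.0)
print(f"speedup {speedup(cpu, pim):.1f}x, "
      f"energy reduction {energy_reduction_pct(cpu, pim):.1f}%, "
      f"{pim.throughput_gops:.1f} GOPS, {pim.energy_per_op_pj:.0f} pJ/op")
```

Comparisons are only meaningful when both runs execute the same workload and count operations the same way, which is one reason standardized benchmark suites matter.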
7. Future Directions and Open Challenges
While PIM has crossed the threshold to commercial deployment, several open research problems persist:
- System Integration and Adoption: Widespread adoption calls for robust software stacks, standard APIs, PIM-aware OS/hypervisor virtualization, data placement/migration support, and tool-chains that ease integration into heterogeneous datacenters (Oliveira et al., 2022, Oliveira, 27 Aug 2025).
- Generalization and Scalability: Extending PIM concepts to future memory technologies (PCM, MRAM, ReRAM), aggressive 3D vertical stacking, and exascale integration of hundreds or thousands of PIM units require scalable interconnects, conflict-avoidance schemes, and precise thermal/reliability modeling (Sharma et al., 2024).
- Programmability and Toolchains: Continued maturation of data-centric/tensor compilers, high-level synthesis for PIM, and auto-tuning for memory, arithmetic, and locality trade-offs remain central (Yang et al., 19 Nov 2025, Oliveira, 27 Aug 2025, Eliahu et al., 2022).
- Coherence and Consistency: Efficient, low-traffic mechanisms for coherence between host and PIM continue to be developed, balancing minimal communication with strong correctness guarantees (e.g., LazyPIM (Boroumand et al., 2017)).
- Security and Fault Tolerance: In-situ detection, robust error management, and security mechanisms covering PIM logic, memory cell arrays, and host↔PIM access protocols are required for mission-critical applications (Zubair et al., 2022).
The continued evolution of PIM is marked by convergent advances at the device, architecture, compiler, and system levels, with performance and energy gains across memory-bound domains now routinely exceeding an order of magnitude over processor-centric baselines. As methodologies and infrastructure mature, PIM is poised for significant roles across ML, analytics, and emerging HPC workloads (Aghaei et al., 26 Nov 2025, Mutlu et al., 2020, Oliveira, 27 Aug 2025, Yang et al., 19 Nov 2025, Kim et al., 2023, Mamdouh et al., 2024).