Processing-in-Memory (PIM) Logic
- Processing-in-Memory (PIM) logic embeds computation within or near memory arrays to minimize data movement and boost throughput.
- It leverages two main paradigms—processing-using-memory and processing-near-memory—using technologies like DRAM charge-sharing and memristive crossbars for in-situ Boolean and arithmetic operations.
- PIM logic achieves significant energy, latency, and throughput improvements while posing challenges in device reliability, integration, and development of robust programming models.
Processing-in-Memory (PIM) logic refers to architectural, circuit, and device-level techniques that enable direct computation within or adjacent to memory arrays, minimizing data movement between memory and conventional processing units. PIM logic exploits the intrinsic properties of memory devices—whether volatile (e.g., DRAM, SRAM, eDRAM) or nonvolatile (e.g., memristors, STT-MRAM, DWM, QCA)—to embed the ability to perform Boolean and arithmetic computation either inside the memory cells ("processing-using-memory," or PUM) or in specialized logic integrated close to the memory array ("processing-near-memory," or PNM). Adoption of PIM logic has demonstrated order-of-magnitude throughput and energy improvements for data-movement-bound workloads, driving active research across the memory hierarchy, memory device physics, architecture, algorithms, and programming models (Mutlu et al., 2020, Mutlu et al., 2019, Hoffer et al., 29 Jun 2025, Leitersdorf et al., 2022).
1. Core Principles and Device Foundations
PIM logic is implemented via two fundamentally distinct approaches: processing-using-memory (PUM) and processing-near-memory (PNM).
Processing-Using-Memory (PUM):
PUM exploits analog or digital in-situ properties of the memory array to carry out Boolean logic or fixed arithmetic in-place:
- DRAM Charge-Sharing: Simultaneous activation of multiple rows in DRAM causes capacitive charge sharing on the bitline; the sense amplifier resolves a majority/AND/OR function. For example, activating three rows A, B, C yields MAJ(A, B, C) = AB + BC + CA on each bitline.
Majority behavior is harnessed for AND/OR operations based on controlled initialization of source rows: MAJ(A, B, 0) = A AND B, and MAJ(A, B, 1) = A OR B (Mutlu et al., 2020, Mutlu et al., 2019).
- Memristive/Resistive Crossbars: Each cell at position (i, j) has conductance G_ij; applying wordline voltages V_i yields a bitline current I_j = Σ_i G_ij·V_i. Stateful logic (e.g., MAGIC, IMPLY, FELIX) reuses the device for both storage and gate operations; row- or column-level partitioning accelerates parallel operations such as multiplication, popcount, or convolution (Leitersdorf et al., 2021, Leitersdorf et al., 2022).
- Gain-Cell eDRAM (GC-eDRAM): Dual-port 3T architectures enable stateful logic by using a nondestructive read and a LOGIC pulse; in-array NOR/NOT is performed via out-of-band write–read sequences, preserving information and enabling full bit-plane operations (Hoffer et al., 29 Jun 2025).
- Quantum-dot Cellular Automata (QCA), DWM: In QCA, the cell's polarization acts as data and logic. Pipelinable Akers arrays supporting Boolean primitives (F(X,Y,Z)=X¬Z+YZ) yield ultralow area/power in the sub-Kelvin regime (Chougule et al., 2016). Racetrack memories (DWM) use transverse resistance measurements to sense the aggregate bitwise state in a segment, enabling simultaneous multi-bit logic (Ollivier et al., 2021).
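The charge-sharing mechanism above can be sketched functionally. This is a minimal behavioral model, not a circuit simulation: `tra`, `bitwise_and`, and `bitwise_or` are illustrative names for the triple-row-activation primitive and its AND/OR specializations via control-row initialization.

```python
# Behavioral sketch of DRAM triple-row activation (TRA), Ambit-style:
# the sense amplifier resolves each bitline to the majority of the
# three simultaneously activated rows.

def tra(row_a, row_b, row_c):
    """Charge sharing on each bitline resolves to the bitwise
    majority of the three activated rows."""
    return [int(a + b + c >= 2) for a, b, c in zip(row_a, row_b, row_c)]

def bitwise_and(row_a, row_b, zeros):
    # MAJ(A, B, 0) = A AND B: control row pre-initialized to all 0s.
    return tra(row_a, row_b, zeros)

def bitwise_or(row_a, row_b, ones):
    # MAJ(A, B, 1) = A OR B: control row pre-initialized to all 1s.
    return tra(row_a, row_b, ones)

A = [1, 0, 1, 1]
B = [0, 0, 1, 0]
print(bitwise_and(A, B, [0] * 4))  # [0, 0, 1, 0]
print(bitwise_or(A, B, [1] * 4))   # [1, 0, 1, 1]
```

Note that every bit position computes concurrently, which is what gives TRA its row-wide bit-level parallelism.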
Processing-Near-Memory (PNM):
PNM integrates programmable or fixed-function logic within the logic layer of 3D-stacked DRAM (e.g., HMC, HBM), on-die eDRAM, or as embedded processors (e.g., UPMEM DPUs), directly accessing the high-bandwidth, low-latency internal memory channels. This model supports full RISC ALUs, SIMD units, and programmable PIM kernels (Mutlu et al., 2020, Gómez-Luna et al., 2021).
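The offload pattern behind PNM can be illustrated with a minimal host-side sketch; `offload` and the bank partitioning below are hypothetical stand-ins for the structure of DPU-style execution, not the actual UPMEM SDK API.

```python
# Conceptual sketch of the PNM offload model: the host scatters data
# across memory banks, each near-memory core runs a kernel strictly on
# its local bank, and the host reduces the per-bank partial results.

def offload(banks, kernel):
    """Each near-memory core computes on its local bank only;
    in hardware these kernel invocations run in parallel."""
    return [kernel(bank) for bank in banks]

# Example: a reduction over data partitioned across 4 banks.
data = list(range(32))
banks = [data[i::4] for i in range(4)]   # host scatters data
partials = offload(banks, sum)           # per-bank local compute
total = sum(partials)                    # host-side final reduction
print(total)  # 496
```

The key property modeled here is that each kernel touches only bank-local data, so aggregate bandwidth scales with the number of banks rather than with a single shared memory channel.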
2. Logic Primitives, Partitioning, and Parallelism
PIM logic exposes a set of Boolean gates and arithmetic operators, either as fundamental in-array operations or via composition:
| Device/Technique | Native Gates | Parallelism | Comments |
|---|---|---|---|
| DRAM TRA (Ambit) | MAJ, AND, OR, NOT | Full row width per bank | Requires triple-row activation, sense amp |
| Memristor Crossbar | NOR, IMPLY, Min₃ | Row width × partitions | MAGIC/FELIX primitives, dynamic partitioning |
| SRAM/GC-eDRAM | NOR, NOT | 64–4096 bits/subarray | GC-eDRAM uses LOGIC pulse, 3 ns latency |
| QCA (Akers Array) | XNOR, MUX, FA | Data-parallel in cell grid | 0.23 μm² XOR, ultralow energy |
| Racetrack (DWM) | Multi-input logic | Segment width per TR op | Transverse read (TRD), polynomial add/mult in-memory |
| 3D-stacked logic (PNM) | ALU, SIMD, custom | 8–2048 units per channel | In-order cores co-located with DRAM banks |
Key accelerative mechanisms include:
- Row/column partitioning: Divides the array into independently controlled sections for semi-parallel execution; e.g., 32-way partitioned memristive arrays cut multiplication and sorting latency roughly in proportion to the partition count (Leitersdorf et al., 2022, Leitersdorf et al., 2021).
- Broadcast and reduction trees: Log-depth fan-in/fan-out tree patterns for propagating values, popcount, or arithmetic combining (Leitersdorf et al., 2022).
- In-row arithmetic: Carry-save add–shift, tree popcount, and bit-parallel prefix networks for high-throughput multiply/divide/add, supporting up to 39× acceleration over serial implementations (Leitersdorf et al., 2021, Leitersdorf et al., 2022).
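The log-depth reduction idea above can be sketched in a few lines; `tree_popcount` is a hypothetical helper that models each level of the tree as a single concurrent in-array step (and assumes a power-of-two row length).

```python
# Sketch of a log-depth in-row reduction (popcount): at each step,
# disjoint pairs combine, and all pairwise adds in one step would
# execute concurrently in-array. An n-bit popcount thus takes
# O(log n) steps instead of O(n) serial additions.

def tree_popcount(bits):
    vals = list(bits)  # each cell starts holding its own bit
    steps = 0
    while len(vals) > 1:
        # one "step": all adjacent pairs combine in parallel
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps

count, steps = tree_popcount([1, 0, 1, 1, 0, 1, 0, 1])
print(count, steps)  # 5 3
```

Eight bits reduce in three steps; a 1024-bit row would need only ten, which is the source of the tree-reduction speedups cited above.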
3. System Architectures and Programming Abstractions
PIM logic is integrated at multiple system levels:
- Monolithic in-DRAM logic: Minor peripheral circuit modifications (e.g., new ACTIVATE sequences, dual-contact rows for NOT) enable PUM without area penalty, as in RowClone/Ambit (Mutlu et al., 2019, Mutlu et al., 2020).
- 3D-stacked PIM: Logic die beneath multiple DRAM dies hosts small PIM cores, often with local scratchpads, running code offloaded from the host. UPMEM's DPUs offer an in-order pipeline, 64KB WRAM per DPU, and direct access to local DRAM banks (Gómez-Luna et al., 2021).
- Memristive/Nonvolatile PIM units: Crossbars and GC-eDRAM arrays are integrated as standalone accelerators or as part of a heterogeneous memory system (Hoffer et al., 29 Jun 2025, Leitersdorf et al., 2022).
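Under the idealized linear model given earlier (I_j = Σ_i G_ij·V_i), the crossbar's analog multiply-accumulate can be sketched as a plain matrix-vector product; the conductance and voltage values below are arbitrary illustrations.

```python
# Idealized model of analog matrix-vector multiply in a memristive
# crossbar: cell (i, j) stores conductance G[i][j]; driving wordline i
# with voltage V[i] produces bitline current I[j] = sum_i G[i][j]*V[i]
# (Ohm's law plus Kirchhoff's current law on each bitline).

def crossbar_mvm(G, V):
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * V[i] for i in range(rows)) for j in range(cols)]

G = [[1.0, 0.5],
     [0.2, 0.8]]   # cell conductances (illustrative values)
V = [0.3, 0.6]     # wordline voltages
print(crossbar_mvm(G, V))
```

Every bitline current forms in a single analog step, which is why crossbars excel at the matrix-vector products that dominate neural-network inference; real devices add nonidealities (wire resistance, device variation) that this sketch omits.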
Programming models range from explicit offload APIs and microcode flows to high-level compiler toolchains:
- SIMDRAM: Compiles high-level operations down to MAJ/NOT primitives and sequences ACTIVATE/PRECHARGE commands (Oliveira et al., 2022).
- abstractPIM: Introduces IR/microcode separation for cross-technology portability; the IR specifies technology-agnostic gate sequences, while microcode backends specialize for in-array logic (e.g., MAGIC, FELIX, IMPLY) (Eliahu et al., 2022).
- Hardware–software co-design: Dyadic block PIM (DB-PIM) marries block/bit-level pruning algorithms with specialized digital SRAM-PIM macros, maximizing parallel utilization and skipping zero blocks/columns (Duan et al., 25 May 2025).
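The SIMDRAM-style lowering can be made concrete with a small sketch, assuming only the in-DRAM MAJ and NOT primitives; `maj` and `lower_xor` are illustrative names, not SIMDRAM's actual micro-op format.

```python
# Hedged sketch of lowering a Boolean operation to in-DRAM primitives:
# any Boolean function can be composed from MAJ (majority-of-3) and NOT.

def maj(a, b, c):
    return int(a + b + c >= 2)

def lower_xor(a, b):
    """XOR composed only from MAJ and NOT:
       XOR(a, b) = AND(NOT(AND(a, b)), OR(a, b))."""
    and_ab = maj(a, b, 0)          # AND via control row of 0s
    or_ab = maj(a, b, 1)           # OR via control row of 1s
    not_and = 1 - and_ab           # NOT via dual-contact row
    return maj(not_and, or_ab, 0)  # final AND gives XOR

for a in (0, 1):
    for b in (0, 1):
        assert lower_xor(a, b) == (a ^ b)
print("XOR lowered to MAJ/NOT verified")
```

A compiler in this style emits such gate sequences as ACTIVATE/PRECHARGE command schedules, with each MAJ mapping to one triple-row activation.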
4. Quantitative Results and Case Studies
Adoption of PIM logic yields dramatic performance and efficiency improvements across a spectrum of data-centric workloads:
- Bulk copy/bitwise logic (DRAM): RowClone achieves substantially lower latency and energy for 4KB copy; Ambit delivers markedly higher throughput and lower DRAM energy on AND/OR/NOT, translating into notable application speedups in database queries (Mutlu et al., 2020, Mutlu et al., 2019).
- Memristive crossbars: MatPIM accelerates binary matrix–vector products and convolution using log-depth tree reductions (Leitersdorf et al., 2022). MultPIM reduces fixed-point multiplication latency from O(N²) to O(N log N) in the bit width, a significant speedup over RIME-style partitioned designs (Leitersdorf et al., 2021).
- GC-eDRAM: 99.5% logic-gate success at 5 μs retention, 13.5 fJ per NOR at 4.6 Mb/mm² density, outperforming 6T SRAM and 1T-1C eDRAM in both energy and throughput (Hoffer et al., 29 Jun 2025).
- Real PIM hardware (UPMEM): The PIM-tree index structure yields substantially greater throughput than prior DRAM-only skip lists, maintaining low communication and load balance under skewed queries (Kang et al., 2022). DPU-based architectures outperform Xeon CPUs and Titan V GPUs on memory-bound graph and CRUD workloads (Gómez-Luna et al., 2021).
- Arithmetic acceleration: Bit-parallel fixed- and floating-point arithmetic in memristive arrays achieves roughly an order of magnitude higher throughput and up to two orders of magnitude better energy efficiency than modern GPUs (Leitersdorf et al., 2022).
5. Challenges, Trade-offs, and Adoption
Despite the clear gains, PIM logic introduces adoption and integration challenges at multiple levels (Mutlu et al., 2020, Ghose et al., 2018, Eliahu et al., 2022, Oliveira et al., 2022):
- Device-level: Variability, endurance, and retention for NVMs (memristors, DWM, QCA, GC-eDRAM); RowHammer and ECC in DRAM; clocking, process variation in QCA.
- Control and periphery: Area-efficient partition and section decoding (partitionPIM half-gates), compressed command/control messaging, retention versus array size trade-off, and managing refresh overhead in volatile devices (Leitersdorf et al., 2022, Hoffer et al., 29 Jun 2025).
- System-level: Virtual memory support (in-memory pointer chasing, region-based page tables), cache coherence (speculative signature-based protocols, LazyPIM), and minimal OS/ISA extensions (Boroumand et al., 2017, Ghose et al., 2018).
- Programming models: Abstractions for partitioned, parallel primitives; cross-stack co-design for bit/weight sparsity; backward-compatible IR/microcode separation to ease hardware evolution (Eliahu et al., 2022, Duan et al., 25 May 2025).
- Economic practicality: Achieving high parallelism with minimal logic/peripheral additions (e.g., 1–2% array area for DRAM-based PUM, sub-10% for DWM/GC-eDRAM, small die fraction in 3D-stacked PNM).
A major bottleneck remains the transition of the programming and runtime stack from processor-centric to data-centric, requiring system and compiler support, robust toolchains (DAMOV, SIMDRAM), and suitable benchmarks for real evaluation (Oliveira et al., 2022).
6. Emerging Directions and Outlook
Recent research extends PIM logic into increasingly complex domains, including:
- General in-memory arithmetic: Carry-lookahead and carry-save schemes for fixed-point and IEEE-754 floating-point addition, multiplication, and division in memory (Leitersdorf et al., 2022, Leitersdorf et al., 2021).
- AI/ML acceleration: Sparse digital SRAM PIM for compressed CNN inference (DB-PIM), ultrafast binarized/ternary neural nets in crossbar and gain-cell PIM (Duan et al., 25 May 2025, Hoffer et al., 29 Jun 2025).
- Systematic compiler flows: Technology-independent IR to microcode translation, potentially with auto-tuning for device capabilities and dynamic reconfiguration (Eliahu et al., 2022).
- PIM-specific database/data-structure kernels: Parallel index/search, graph traversal, sorting, data compression, and bitwise analytic filtering (Kang et al., 2022, Mutlu et al., 2020).
- Hardware trust and security: Integration of PIM with ECC, RowHammer mitigation, and digital true random number generators (Mutlu et al., 2020).
These advances, coupled with continued demonstration of order-of-magnitude efficiency and throughput gains, position PIM logic as a central driver of future high-density, energy-efficient, and data-centric architectures. They also demand rigorous cross-layer research—from emerging device physics to software abstractions—to realize the full potential of programmable and scalable processing-in-memory systems.