Processing Using Memory (PUM)
- Processing Using Memory (PUM) is a memory-centric paradigm that performs computational operations directly within memory arrays, eliminating the need for extensive data shuttling.
- It employs diverse techniques—from bulk bitwise logic in DRAM to analog matrix–vector multiplication in resistive crossbars—to achieve massive parallelism and energy efficiency.
- PUM architectures integrate with tailored software and system designs to deliver significant speedups in applications like machine learning, cryptography, and big data analytics.
Processing Using Memory (PUM) is a paradigm in which computational operations are performed directly within memory arrays, exploiting the inherent physical mechanisms of memory cells themselves. Unlike processor-centric models—where data shuttles between memory and a compute unit—PUM architectures enable data transformation in situ, eliminating off-chip transfers, drastically reducing energy and latency, and leveraging the massive parallelism of dense memory substrates. PUM subsumes a range of design points, from bulk bitwise operations in commodity DRAM to multiply–accumulate in resistive crossbars and Boolean logic via stateful elements, and can be instantiated in DRAM, SRAM, various non-volatile memories, and even emerging quantum-dot cellular automata. This article surveys the principles, execution models, hardware and software stacks, practical realizations, application domains, and current challenges of the PUM paradigm.
1. PUM Core Principles and Execution Models
Processing Using Memory is defined by the collapse of the classic compute–memory dichotomy, allowing memory cells and their local periphery (e.g., sense amplifiers, wordline circuits) to natively execute computational operations. A standard taxonomy partitions PUM along two axes: location of compute (in-array, i.e., within bitcells/sense-amps, vs. near-array/logic-layer), and memory technology (charge-based DRAM, SRAM, resistive NVMs such as ReRAM/PCM/MRAM, Flash, QCA) (Mutlu et al., 2020, Oliveira et al., 2022).
Execution Models:
- Bulk Bitwise: Primitive logic such as AND, OR, NOT, and majority computed via multi-row activation and charge sharing in DRAM or by stateful logic (e.g., MAGIC, IMPLY) in resistive arrays (Seshadri et al., 2016, Eliahu et al., 2022).
- Analog MVM: Ohm's law and Kirchhoff's current law are exploited in crossbar NVMs (e.g., ReRAM) to perform parallel matrix–vector multiplication (MVM), with results digitized via per-column ADCs (Wong et al., 17 Feb 2026).
- Crossbar Digital Logic: Boolean gates realized by voltage pulses on resistive or magnetic cells (e.g., NOR/NAND/IMP) (Wong et al., 17 Feb 2026, Eliahu et al., 2022).
- Row-Clone and Initialization: Full-row bulk copy or zeroing via back-to-back DRAM ACTIVATE sequences (Seshadri et al., 2016).
- LUT-based Operations: Complex, nonlinear, or multi-operand functions implemented by large on-array lookup tables with fast parallel queries (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
These execution models situate PUM within the broader Processing-in-Memory (PIM) space; what distinguishes PUM is that it restricts compute logic to array-local structures in order to optimize area, energy, and throughput.
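The crossbar digital logic entry above lists NOR among the gates realized in resistive arrays. Because NOR is functionally complete, other Boolean operations can be composed from it. The following is a minimal sketch of that composition at the logical level only: each Python integer stands in for one row of bits, and names such as `WIDTH` and `nor` are illustrative assumptions. It does not model voltage pulses, cell states, or any specific stateful-logic family.

```python
# Logical-level sketch: compose NOT/OR/AND/XOR from a single bulk NOR primitive.
# Each integer models one WIDTH-bit row; WIDTH is an assumed, illustrative value.
WIDTH = 8
MASK = (1 << WIDTH) - 1


def nor(a: int, b: int) -> int:
    """Bulk NOR across all columns of two rows."""
    return ~(a | b) & MASK


def not_(a: int) -> int:
    return nor(a, a)                      # 1 NOR


def or_(a: int, b: int) -> int:
    return not_(nor(a, b))                # 2 NORs


def and_(a: int, b: int) -> int:
    return nor(not_(a), not_(b))          # 3 NORs


def xor(a: int, b: int) -> int:
    n = nor(a, b)
    xnor = nor(nor(a, n), nor(b, n))      # 4 NORs give XNOR
    return not_(xnor)                     # 5th NOR inverts it


if __name__ == "__main__":
    a, b = 0b11001010, 0b10100110
    print(format(xor(a, b), "08b"))
```

The operation counts in the comments hint at why gate-level scheduling matters: every composed operator costs several sequential in-array steps, even though each step covers a full row at once.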
2. Circuit Techniques and Technology Substrates
DRAM:
- Triple-Row Activation: Simultaneously activating three rows releases charge onto a bitline; the sense amplifier resolves the result as a bitwise majority (Seshadri et al., 2016, Mutlu et al., 2020). AND, OR, XOR, and full adders can be composed by fixing certain rows to logic-0/logic-1 and sequencing TRA and sense-amp inversion.
- RowClone: Back-to-back ACTIVATE commands within a subarray enable entire-row copying or initialization. Variants adapt this to inter-bank cases with pipelined buffer moves (Seshadri et al., 2016, Olgun et al., 2021).
- LUT Accelerators: Techniques such as pLUTo and Lama repurpose wide DRAM subarrays as massive parallel LUT engines, implementing complex operators by rapid row sweeps with match-line gating and independent column selection (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
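The triple-row activation item above states that AND, OR, and full adders can be composed from a bitwise majority plus sense-amplifier inversion. A minimal functional sketch of that composition follows. It treats each Python integer as one DRAM row and models only the logical result, not charge sharing or command timing; `WIDTH`, `maj`, and the other names are illustrative assumptions rather than any published interface.

```python
# Functional model of majority-based bulk bitwise logic.
# Each integer models one WIDTH-bit row; WIDTH is an assumed, illustrative value.
WIDTH = 8
MASK = (1 << WIDTH) - 1


def maj(a: int, b: int, c: int) -> int:
    """Column-wise majority of three rows (the logical result of a TRA)."""
    return ((a & b) | (b & c) | (a & c)) & MASK


def bit_not(a: int) -> int:
    """Column-wise inversion, standing in for sense-amplifier inversion."""
    return ~a & MASK


def bulk_and(a: int, b: int) -> int:
    return maj(a, b, 0)        # control row fixed to all zeros


def bulk_or(a: int, b: int) -> int:
    return maj(a, b, MASK)     # control row fixed to all ones


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One full-adder step per column, using only MAJ and NOT."""
    carry = maj(a, b, cin)
    total = maj(bit_not(carry), cin, maj(a, b, bit_not(cin)))
    return total, carry


if __name__ == "__main__":
    s, c = full_adder(0b00001111, 0b00110011, 0b01010101)
    print(format(s, "08b"), format(c, "08b"))
```

Fixing the third operand to all zeros or all ones is what turns the majority into AND or OR. The full adder needs three majority steps and two inversions per bit position, all applied to every column of the row in parallel.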
Resistive NVMs (ReRAM/PCM/MRAM):
- Analog MVM: Words are encoded as input voltages; memory cell conductances represent weights. The summed current at each bitline yields a dot-product, which is then quantized for digital use. Peripheral logic applies ECC or corrects for device non-idealities (Wong et al., 17 Feb 2026, Fernandez et al., 2022).
- Stateful Logic: MAGIC, IMPLY, majority-in-gate, and derived libraries implement Boolean operations by orchestrated voltage applications (Eliahu et al., 2022).
- Software-hardware decoupling: Compilation flows (e.g., abstractPIM) separate technology-independent IR generation from technology-specific microcode, increasing portability across substrate logic families (Eliahu et al., 2022).
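The analog MVM item above can be illustrated with an idealized numerical model: each bitline current is the sum of input voltage times cell conductance over all wordlines, and an ADC then maps that current to a digital code. The sketch below assumes ideal devices (no noise, IR drop, or conductance variation) and a simple uniform ADC; the function names, `full_scale`, and `bits` are illustrative assumptions.

```python
# Idealized crossbar matrix-vector multiply followed by uniform quantization.
# No device non-idealities are modeled; all names and parameters are illustrative.

def crossbar_mvm(voltages: list[float], conductances: list[list[float]]) -> list[float]:
    """Return per-bitline currents I_j = sum_i V_i * G[i][j].

    conductances[i][j] is the cell on wordline i and bitline j.
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


def adc(currents: list[float], full_scale: float, bits: int) -> list[int]:
    """Clip each current to [0, full_scale] and map it to a `bits`-wide code."""
    levels = (1 << bits) - 1
    codes = []
    for current in currents:
        clipped = min(max(current, 0.0), full_scale)
        codes.append(round(clipped * levels / full_scale))
    return codes


if __name__ == "__main__":
    v = [1.0, 0.5]                       # input voltages, one per wordline
    g = [[1.0, 2.0], [2.0, 4.0]]         # cell conductances (weights)
    currents = crossbar_mvm(v, g)        # [2.0, 4.0]
    print(currents, adc(currents, full_scale=5.0, bits=4))
```

In a real array the quantization step is where much of the precision loss arises, which is one reason peripheral correction logic is paired with the crossbar.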
QCA and Novel Devices:
- Majority Logic and Memory Fusion: QCA arrays leverage electrostatic majority gates where each cell stores and computes simultaneously; primitive cells are arranged in arrays wired for pipelined logic-with-memory (Chougule et al., 2016).
3. Representative Architectures
Hybrid PUM (DARTH-PUM):
- Synthesizes analog MVM arrays and digital Boolean logic in a single chip, orchestrated by sophisticated peripheral units (shift, transpose, arbitration, parasitic compensation) for seamless analog-digital sequencing (Wong et al., 17 Feb 2026).
- Exposes a unified programming API, arbitrary-precision support, and automatic compiler scheduling.
DRAM Bitwise PUM (Ambit, RowClone, MIMDRAM):
- Ambit exploits charge-sharing and cross-subarray activation to implement scalable bitwise kernels with 44× CPU throughput (Mutlu et al., 2019, Mutlu et al., 2020).
- MIMDRAM introduces fine-grained mat-based partitioning and multiple-instruction multiple-data (MIMD) scheduling, alleviating SIMD underutilization and increasing utilization by 15.6× over classical approaches (Oliveira et al., 2024).
LUT-Driven PUM (pLUTo, Lama):
- Specialized architectures enable 8.5×–713× speedups for nonlinear and arithmetic-heavy kernels over CPU/GPU, reducing area and activation overhead compared to prior LUT-in-DRAM designs (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
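The LUT-driven approach can be pictured as a sweep over the rows that hold the table: at each step, every input element whose value equals the current row index captures that row's entry. The sketch below is a simplified functional model of this row-sweep idea under that assumption. It is not the circuit-level mechanism of pLUTo or Lama, and `lut_query` is a hypothetical name.

```python
# Simplified functional model of a row-sweep LUT query.
# The outer loop stands in for sequential row activations; the inner loop
# stands in for a comparison that hardware would perform on all columns at once.

def lut_query(inputs: list[int], lut: list[int]) -> list[int]:
    """Return lut[x] for every x in inputs, computed by sweeping the LUT rows."""
    if any(x < 0 or x >= len(lut) for x in inputs):
        raise ValueError("input value outside LUT range")
    outputs = [0] * len(inputs)
    for row_index, entry in enumerate(lut):        # one step per LUT row
        for col, x in enumerate(inputs):           # parallel match in hardware
            if x == row_index:
                outputs[col] = entry               # matched column captures entry
    return outputs


if __name__ == "__main__":
    square_lut = [0, 1, 4, 9]                      # 2-bit input -> x squared
    print(lut_query([3, 0, 2, 2], square_lut))     # [9, 0, 4, 4]
```

The number of sweep steps depends on the table size rather than on the number of inputs, so throughput grows with the width of the row being queried.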
Analog–Boolean Hybrid PUM:
- Combines crossbar analog MVMs for high-throughput, error-tolerant multiplication with periphery-driven Boolean operations for exact logic phases, orchestrated by shift/transposition and instruction injector units (Wong et al., 17 Feb 2026).
Off-the-Shelf DRAM PUM (PULSAR):
- Relies on carefully timed DRAM commands to enable up to 32-row simultaneous activation, performing MAJ-k, bulk initialization, and write primitives with a documented 24.2 percentage-point improvement in operation success rate and 2.21× higher performance than the prior FracDRAM approach (Yuksel et al., 2023).
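MAJ-k generalizes the three-input majority to k operand rows. As a purely logical illustration, the sketch below computes a column-wise majority over an odd number of rows. It assumes odd k so that ties cannot occur, and it says nothing about how PULSAR maps operands onto simultaneously activated rows, its command timing, or its success rates; `maj_k` is a hypothetical helper name.

```python
# Logical model of a k-input column-wise majority (k odd, by assumption).
# Each integer in `rows` models one `width`-bit row.

def maj_k(rows: list[int], width: int) -> int:
    k = len(rows)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd number of operand rows")
    result = 0
    for bit in range(width):
        ones = sum((row >> bit) & 1 for row in rows)
        if 2 * ones > k:                 # strictly more than half the rows are 1
            result |= 1 << bit
    return result


if __name__ == "__main__":
    rows = [0b1100, 0b1010, 0b1001, 0b0110, 0b0000]
    print(format(maj_k(rows, 4), "04b"))   # 1000
```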
4. Programmability and System Integration
Effective PUM system software combines backend memory mapping, specialized allocation routines, compiler transformations, runtime schedulers, and high-level libraries:
- Compiler Support: PUM-aware lowering passes map high-level code or domain-specific languages to sequences of memory-resident primitives (e.g., Ambit’s triple-row activation, SIMDRAM’s MAJ/NOT graph, LUT query scheduling) (Oliveira et al., 2022, Eliahu et al., 2022, Ferreira et al., 2021).
- OS and Allocators: Primitives such as RowClone/IDAO require strict placement/alignment—kernel modules like PUMA provide row/subarray-aware lazy allocation by splitting huge pages and mapping them according to DRAM address bits (Oliveira et al., 2024).
- Programming Models: Pattern libraries and APIs (e.g., DaPPA, SimplePIM) expose map/reduce/zip primitives, combining backend PUM scheduling and data management (Chen et al., 2023, Oliveira, 27 Aug 2025). Application-specific wrappers abstract platform details in hybrid analog–digital systems.
- Coherence and Consistency: Flushing CPU caches or marking PUM regions as non-cacheable ensures consistency; region-based and lazy hardware-based protocols are also being explored (Mutlu et al., 2019, Mutlu et al., 2020).
- Virtual Memory and Data Mapping: Proper data placement is handled by page allocators sensitive to DRAM topology (vault/subarray), often supported by firmware or device-tree metadata (Oliveira et al., 2022, Oliveira et al., 2024).
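The placement constraints described in this list can be made concrete with a small check that decides whether a row copy is eligible for an in-subarray primitive or must fall back to the CPU. The sketch below assumes a deliberately simple linear address layout. `ROW_SIZE`, `ROWS_PER_SUBARRAY`, and the function names are hypothetical; real systems interleave addresses across channels, ranks, and banks, so an actual allocator would use the platform's own mapping.

```python
# Hypothetical linear address layout: consecutive rows fill one subarray
# before moving to the next. Real DRAM mappings are interleaved and differ
# per platform; these constants are assumptions for illustration only.
ROW_SIZE = 8192            # bytes per DRAM row (assumed)
ROWS_PER_SUBARRAY = 512    # rows per subarray (assumed)


def decode(addr: int) -> dict[str, int]:
    """Split a physical address into subarray, row, and byte offset."""
    row = addr // ROW_SIZE
    return {
        "subarray": row // ROWS_PER_SUBARRAY,
        "row": row % ROWS_PER_SUBARRAY,
        "offset": addr % ROW_SIZE,
    }


def can_copy_in_subarray(src: int, dst: int) -> bool:
    """True if a one-row copy is row-aligned and stays within one subarray."""
    s, d = decode(src), decode(dst)
    aligned = s["offset"] == 0 and d["offset"] == 0
    return aligned and s["subarray"] == d["subarray"]


def dispatch_copy(src: int, dst: int) -> str:
    """Pick an execution path for a one-row copy."""
    return "in-dram" if can_copy_in_subarray(src, dst) else "cpu-fallback"


if __name__ == "__main__":
    print(dispatch_copy(0, ROW_SIZE))                        # in-dram
    print(dispatch_copy(0, ROW_SIZE * ROWS_PER_SUBARRAY))    # cpu-fallback
    print(dispatch_copy(100, ROW_SIZE))                      # cpu-fallback
```

A subarray-aware allocator aims to hand out source and destination buffers that pass this kind of check by construction, so the fallback path is taken rarely.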
5. Application Domains and Quantitative Impact
PUM has demonstrated large, domain-spanning impact, with empirical and modeled speedups in a variety of settings:
- Machine Learning: CNN inference (ResNet-20) and Transformer LLM encoder projections achieve 14.8×–40.8× speedup and 51–110× energy savings on hybrid analog–digital PUM over analog+CPU baselines (Wong et al., 17 Feb 2026).
- Cryptography: AES-128 mapped to DARTH-PUM delivers a 59.4× speedup and a 39.6× energy reduction per 16-byte block (Wong et al., 17 Feb 2026). LUT-based PUM accelerates S-boxes for symmetric cryptosystems (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
- Database and Analytics: Bitmap indexing, hash join, and aggregations see up to 12× latency reduction in DRAM bitwise PUM and 9×–40× end-to-end speedups in practical deployments (Vincon et al., 2019, Mutlu et al., 2019).
- Scientific Computing: MRAM-based PUM for dynamic time warping (sDTW) achieves 7× performance and 11× energy improvement over multicore CPUs (Fernandez et al., 2022).
- Bulk Data Movement: RowClone-based copy/initialize yields up to 318× end-to-end improvement for 8 KiB–8 MiB operations on real systems (Olgun et al., 2021, Seshadri et al., 2016).
The following table summarizes the performance and energy improvements reported for key systems:
| Architecture | Speedup (vs. CPU) | Energy Reduction | Operation | Reference |
|---|---|---|---|---|
| Ambit DRAM | 44× | 35× | bulk AND/OR | (Mutlu et al., 2019) |
| DARTH-PUM | 14.8–59.4× | 40–110× | AES, CNN, LLM workloads | (Wong et al., 17 Feb 2026) |
| pLUTo (LUT) | 713× | 1,855× | multi-operand LUT queries | (Ferreira et al., 2021) |
| Lama (8b mult.) | 8.5× | 6.9× | SIMD 8b multiplication | (Khabbazan et al., 4 Feb 2025) |
| MATSA (MRAM) | 7.35× | 11.29× | sDTW | (Fernandez et al., 2022) |
| MIMDRAM | 34× | 14.3× | general map/reduce | (Oliveira et al., 2024) |
| PULSAR | 2.21× | — | Bulk bitwise MAJ, Multi-Init | (Yuksel et al., 2023) |
All quantitative results are reported in the original works and are conditioned on workloads amenable to high arithmetic intensity or regular, massive data-parallelism.
6. Limitations and Technical Challenges
Primary limitations include:
- Primitive Set Coverage: Most DRAM PUM primitives are limited to bitwise logic and shift-based arithmetic; nontrivial floating-point, reductions, and irregular access patterns remain challenging or require expensive controller sequences (Oliveira et al., 2022, Mutlu et al., 2020).
- Precision and Nonideality: Analog PUM suffers from device and process variation, noise, IR drop, and nonideal conductance quantization, necessitating error compensation or fallback to digital PUM phases for critical computation (Wong et al., 17 Feb 2026).
- Granularity and Utilization: SIMD-only bitwise engines are inefficient for applications with less data-level parallelism; MIMDRAM and mat segmentation partially address underutilization (Oliveira et al., 2024, Oliveira, 27 Aug 2025).
- Alignment and Placement Constraints: Subarray- and row-alignment requirements must be respected by OS allocators, as misalignment forces fallback to CPU computation (Oliveira et al., 2024).
- Programming and Tooling: Lack of standardized toolchains, opaque hardware semantics, and the necessity of device-aware compilation and runtime support remain significant obstacles for broad adoption (Oliveira et al., 2022, Mutlu et al., 2020).
- Device Support: Not all commodity DRAM modules support required electrical behavior (e.g., multi-row activation); process-dependent limitations are evidenced in PULSAR's findings (Yuksel et al., 2023).
7. Prospects and Future Directions
Prospective work in PUM spans:
- Unified Programming Models: The development of portable pattern libraries, abstract compilation flows, and pattern-based OS support (as in DaPPA, abstractPIM, and SimplePIM) that enable general-purpose workloads over mixed PUM substrates (Chen et al., 2023, Oliveira, 27 Aug 2025).
- Robustness and Generality: Improving cell and sense-amp reliability for more complex and noisy analog operations; integrating error-tolerant codes and adaptive timing (Wong et al., 17 Feb 2026, Oliveira et al., 2022).
- Extensibility: Composability of in-array primitives with near-array logic to support hybrid, flexible, floating-point, and data-dependent operations; exploring dynamic bit-precision adaptation, carry-lookahead, and redundant binary logic for reduced operational latency (Oliveira, 27 Aug 2025).
- Hardware–Software Co-Design: OS-level subarray mapping, cache/CPU-PUM coherence, page allocation, and region-based protection; integrating environmental and profiling feedback to guide adaptive PUM dispatch (Oliveira et al., 2024, Olgun et al., 2021).
- Technology Migration: Extending core techniques to ReRAM, PCM, MRAM, QCA, and flash-based arrays, leveraging the modularity of PUM design for technology scaling and new physical phenomena (Fernandez et al., 2022, Chougule et al., 2016).
PUM’s demonstrated performance and energy efficiency across a spectrum of data-intensive workloads position it as an enabling technology for future memory-centric architectures. Adoption hinges on resolving open tooling challenges, standardizing execution and allocation models, and maturing system software stacks to mask low-level physical requirements while exposing unified, high-level programming abstractions.