Processing-Using-Memory (PUM) Architecture

Updated 8 December 2025
  • Processing-Using-Memory (PUM) is a computing paradigm that performs operations directly within memory arrays by harnessing device-level physics for massively parallel execution.
  • PUM techniques reduce off-chip data movement, significantly enhancing bandwidth and energy efficiency for workloads like bitwise operations and neural network inference.
  • Implementations such as pLUTo, Ambit, and SIMDRAM demonstrate distinct trade-offs in performance, area overhead, and energy savings through analog and digital in-memory computations.

Processing-Using-Memory (PUM) denotes a class of architectures wherein computation is performed directly within the memory substrate by exploiting the intrinsic device physics and native structures of memory arrays themselves. In contrast to near-memory processing, which integrates general logic adjacent to memory, PUM leverages analog and digital properties of standard or minimally modified DRAM, non-volatile memory (NVM), and related technologies to effect massively parallel, in-place operations without off-chip data movement. This paradigm promises substantial gains in bandwidth, energy efficiency, and computational throughput for data-centric and memory-bound workloads.

1. Taxonomy and Theoretical Underpinnings

PUM occupies a distinct position in the processing-in-memory design space, separate from both traditional processor-centric computation and processing-near-memory (PNM) accelerators. PUM—sometimes described as “in-memory computing” or “computation-in-memory”—executes operations within memory arrays (e.g., DRAM, RRAM, MRAM, PCM) by activating wordlines and bitlines to physically map logic functions to charge transfer, current summation, or majority voting among cell states (Mutlu et al., 2020, Ferreira et al., 2021). Classical examples include:

  • Charge-sharing in DRAM (RowClone): bulk copy/initialization via back-to-back row activations transferring data within subarrays.
  • Triple-row activation (Ambit): bitwise logic (AND/OR/MAJ) by concurrently activating three rows and interpreting the resulting bitline voltages.
  • Analog vector-matrix multiplication in crossbar NVM: applying voltage patterns to wordlines and sensing current summations at bitlines, effecting high-throughput dot-product operations (Oliveira et al., 2022, Fernandez et al., 2022).

PUM thus exploits device-intrinsic parallelism, turning every memory column (or segment) into an ALU lane and using direct device physics for computation, thereby mitigating the von Neumann bottleneck for bulk, data-parallel operations.
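
As a concrete illustration of the triple-row-activation mechanism, the following Python sketch models the per-bitline majority function behaviorally and derives bulk AND/OR from it by pre-setting a control row, in the style of Ambit. It is a functional sketch only: destructive reads, row copies into designated compute rows, and DRAM timing are deliberately omitted.

```python
# Behavioral sketch of Ambit-style triple-row activation (TRA).
# Assumption: rows are modeled as equal-length 0/1 lists; real TRA operates on
# DRAM cell charge and is destructive, which this model ignores for clarity.

def triple_row_activate(row_a, row_b, row_c):
    """Per-bitline majority of three simultaneously activated rows."""
    return [1 if a + b + c >= 2 else 0 for a, b, c in zip(row_a, row_b, row_c)]

def bulk_and(row_a, row_b):
    # AND(a, b) = MAJ(a, b, 0): pre-initialize the control row to all zeros.
    return triple_row_activate(row_a, row_b, [0] * len(row_a))

def bulk_or(row_a, row_b):
    # OR(a, b) = MAJ(a, b, 1): pre-initialize the control row to all ones.
    return triple_row_activate(row_a, row_b, [1] * len(row_a))

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 0, 1, 0, 1, 0, 0]
print(bulk_and(a, b))  # [1, 0, 0, 1, 0, 0, 0, 0]
print(bulk_or(a, b))   # [1, 1, 1, 1, 0, 1, 1, 0]
```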

2. Representative Microarchitectures and Core Mechanisms

The implementation of PUM varies according to memory technology, desired operator set, and data path. Key realizations include:

  • DRAM-Array PUM (e.g., pLUTo, Ambit, RowClone, SIMDRAM):
    • pLUTo (“processing-using-memory with lookup-table operations”): Each DRAM row encodes a lookup table (LUT) entry. By activating rows (the “ROW_SWEEP” primitive) and matching input keys in a “source row,” sense amplifiers capture and write LUT outputs into destination rows based on per-bitline comparators. This supports arbitrary functions f: {0, …, 2^N − 1} → {0, …, 2^M − 1}, enabling in-situ evaluation of operations up to and including 16-bit multiplication and complex nonlinear mappings (Ferreira et al., 2021); a behavioral sketch of this row sweep follows this list.
    • Bitwise logic (Ambit/SIMDRAM): Triple-row activation enables MAJ/AND/OR; bitwise NOT and XOR via derived activation/sensing sequences and auxiliary cell structures (Mutlu et al., 2020, Oliveira et al., 2022).
    • In-memory SIMD (SIMDRAM/MIMDRAM): Vertical or fine-grained data layouts enable bank/subarray-level SIMD or even MIMD mapping, with bankwise or matwise parallel operation support (Oliveira et al., 29 Feb 2024).
  • LUT-optimized PUM (Lama): Further reduces energy by exploiting mat-level parallelism and open-page policy to batch multiple internal column accesses per ACTIVATE, minimizing the number of energy-intensive activations for bulk operations—especially effective for SIMD arithmetic with higher-precision operands (Khabbazan et al., 4 Feb 2025).
  • NVM-based PUM (e.g., MATSA): MRAM or RRAM crossbar arrays with reconfigurable sense amplifiers allow in-memory bit-serial arithmetic and multi-input logic, applied in dense DP kernels for workloads such as dynamic time warping (Fernandez et al., 2022).
  • Hybrid controller and API support: Memory controllers are extended with finite-state machines, custom microcode, and DRAM command extensions (e.g., pluto_op) to orchestrate PUM primitives, often tightly coupled with OS/allocator support for page placement and data mapping (Ferreira et al., 2021, Olgun et al., 2021).
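
The pLUTo-style row sweep referenced above can be summarized functionally as follows. This is a behavioral sketch under simplifying assumptions (one input element per column, a Python loop standing in for row activations and per-bitline comparators), not the published command sequence or circuit.

```python
# Behavioral sketch of a pLUTo-style row sweep. The function name "row_sweep"
# is illustrative; the real design issues DRAM commands and uses per-bitline
# match/latch logic rather than Python-level comparisons.

def row_sweep(lut, source_row):
    """Evaluate f(x) = lut[x] for every element of source_row in one sweep.

    The sweep iterates over LUT entries (rows); all elements whose key matches
    the swept index latch that entry's output. Every element is resolved after
    at most len(lut) row activations, independent of how many elements the
    source row holds.
    """
    dest_row = [None] * len(source_row)
    for index, value in enumerate(lut):       # one ACTIVATE per LUT row
        for col, key in enumerate(source_row):
            if key == index:                  # per-column comparator match
                dest_row[col] = value
    return dest_row

# Example: 4-bit squaring LUT applied element-wise to a "source row".
lut = [x * x for x in range(16)]
source = [3, 7, 0, 12, 15, 1]
print(row_sweep(lut, source))  # [9, 49, 0, 144, 225, 1]
```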

3. Application Mapping and Representative Workloads

PUM is uniquely suited to workloads where data movement dominates cost and where core computations map efficiently to native in-memory primitives:

  • Bitwise operations: Bitmap-index manipulation and predicate filtering on database columns, and scan/filter kernels in analytics workloads (Ferreira et al., 2021, Oliveira et al., 2022).
  • Bulk arithmetic and nonlinear transforms: Arbitrary deterministic functions (multiplication, CRC, substitution, nonlinear scaling), via LUTs or bit-serial mappings (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
  • Neural network inference and quantized ML: Especially inference in quantized neural architectures (e.g., LeNet-5, attention models with quantized dot-product), leveraging fast LUT queries or quantized sum/count transformations (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
  • Dynamic programming kernels: In-memory parallel DP for genomics, time-series analysis, and sequence alignment, using bit-serial additions, reductions, and vector copying implemented by in-memory crossbar logic (Fernandez et al., 2022, Diab et al., 2022); a sketch of such bit-serial addition appears after this list.
  • Database and analytics operators: Selection, aggregation, ordering, and join implemented in PUM-enabled platforms, reducing memory bus traffic and accelerating large-scale queries (Frouzakis et al., 2 Apr 2025).
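
The bit-serial arithmetic underpinning several of these workloads can be sketched behaviorally as below: operands are stored in a vertical (bit-transposed) layout so that each loop iteration corresponds to a handful of row-wide majority/XOR operations applied to every column in parallel. This is an illustrative model in the spirit of SIMDRAM/MATSA-style designs, not their exact primitive sequences.

```python
# Behavioral sketch of bit-serial addition over a vertical (bit-transposed)
# layout. Assumption: each operand bit position occupies one row, so one loop
# iteration models a few row-level majority/XOR operations over all columns.

def maj(a, b, c):
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def bit_serial_add(a_rows, b_rows):
    """Add two vertically laid-out unsigned operands, LSB row first."""
    width = len(a_rows[0])
    carry = [0] * width
    sum_rows = []
    for a_bits, b_bits in zip(a_rows, b_rows):   # one pass per bit position
        sum_rows.append(xor(xor(a_bits, b_bits), carry))  # sum = a ^ b ^ c
        carry = maj(a_bits, b_bits, carry)                # carry = MAJ(a, b, c)
    sum_rows.append(carry)                        # final carry-out row
    return sum_rows

def transpose(values, n_bits):
    """Pack a list of integers into the vertical layout (row i = bit i)."""
    return [[(v >> i) & 1 for v in values] for i in range(n_bits)]

a, b = [3, 10, 7, 200], [5, 6, 9, 55]
rows = bit_serial_add(transpose(a, 8), transpose(b, 8))
print([sum(row[c] << i for i, row in enumerate(rows)) for c in range(len(a))])
# [8, 16, 16, 255]
```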

4. Performance, Area, and Energy: Quantitative Evaluation

Performance and energy benefits of PUM result from eliminating off-chip transfers and explicitly leveraging massively parallel in-array operations. Expressed succinctly, key relationships and outcomes reported include:

| Architecture | Area Overhead | Perf. Gain (vs. CPU) | Energy Gain (vs. CPU) | Notes |
|---|---|---|---|---|
| pLUTo-BSA | 16.7% | 713× | 1855× | LUT, DDR4-based, broad ops (Ferreira et al., 2021) |
| pLUTo-GSA | 10.2% | – | – | LUT, lowest area, reloads LUT each query |
| pLUTo-GMC | 23.1% | ↑ perf. | ↑ energy | High perf./energy, cell modification needed |
| SIMDRAM | <0.2% (CPU) | 21× (app.), 88× (op) | 257× | Bit-serial ops, 16 banks (Oliveira et al., 2022) |
| Lama | 2.47% (HBM2) | 3.8×–8.5× | 6.9×–8× | Efficient column access, 8b precision |
| MATSA (MRAM) | <1% (cell) | 7.35× | 11.3× | Bitwise DP/sDTW, crossbar logic |
| UPMEM (DPU, PnM) | – | 10–23× | 1.6–5.2× | Broad memory-bound kernels |

Speedup is generally computed as S = T_baseline / T_PUM and energy reduction as E_red = E_baseline / E_PUM, with operation throughput and area normalized per design (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025, Oliveira et al., 2022).
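
For concreteness, these definitions are simple ratios; the helpers below plug in placeholder numbers only, not measurements from the cited papers.

```python
# Minimal helpers matching the speedup and energy-reduction definitions above.
# Inputs are placeholder values, not results from any cited evaluation.

def speedup(t_baseline, t_pum):
    """S = T_baseline / T_PUM."""
    return t_baseline / t_pum

def energy_reduction(e_baseline, e_pum):
    """E_red = E_baseline / E_PUM."""
    return e_baseline / e_pum

print(speedup(t_baseline=10.0, t_pum=0.5))           # 20.0  (i.e., 20x)
print(energy_reduction(e_baseline=8.0, e_pum=0.25))  # 32.0  (i.e., 32x)
```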

Limitations arise when operand precision grows (LUT storage grows exponentially with operand width N), when the area/power penalties of added per-cell logic (as in GMC) threaten yield, and in workloads that do not decompose into parallel map/reduce steps or that require complex control flow.
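
The LUT-size pressure can be made concrete with a quick calculation: the number of LUT rows grows as 2^N with operand width N, so wide operands rapidly exceed the capacity of a DRAM subarray (typically on the order of a few hundred to a thousand rows; the exact figure is device-specific and assumed here only for illustration).

```python
# Why LUT-based PUM favors narrow operands: LUT rows scale as 2**N.
for n_bits in (4, 8, 12, 16):
    print(f"{n_bits}-bit function -> {2 ** n_bits} LUT rows")
# 4-bit function -> 16 LUT rows
# 8-bit function -> 256 LUT rows
# 12-bit function -> 4096 LUT rows
# 16-bit function -> 65536 LUT rows
```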

5. Design Trade-Offs, ISA, and System Integration

PUM design entails multiple architectural and practical trade-offs:

  • Operation Coverage vs. LUT/Primitive Size: Large N-bit functions require exponentially many LUT rows; hence, efficiency is best at N ≤ 8, or when functions can be hierarchically decomposed (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
  • Area/Fabrication Overhead: Variants like pLUTo-GMC add per-cell transistors (higher performance and energy efficiency, at up to 23.1% area overhead), while GSA keeps area minimal. All major designs maintain compatibility with standard DRAM process and timing constraints (Ferreira et al., 2021).
  • Programming Interface and Control: Requires ISA extensions (e.g., pluto_op, MIMDRAM bbops), controller modifications, and careful OS/allocator support for mapping, alignment, and coherence (Ferreira et al., 2021, Oliveira et al., 29 Feb 2024).
  • Coherence and Consistency: In-situ updates can break CPU-visible cache coherence; solutions span from explicit flushes (CLFLUSH) and library-level barriers to speculative schemes (LazyPIM, CoNDA) (Mutlu et al., 2020, Ghose et al., 2018); a host-side sketch of the flush/issue/invalidate flow appears after this list.
  • Integration Model: Both end-to-end FPGA platforms (e.g., PiDRAM (Olgun et al., 2021)) and new system software stacks have been demonstrated, including open-source toolchains for rapid prototyping.
  • Hybridization: Future PUM systems are projected to integrate in-memory logic (bitwise/LUT/majority) with near-memory accelerators for reduction or irregular control, optimizing for workload-specific functional split (Ferreira et al., 2021, Mutlu et al., 2020).
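
A hypothetical host-side flow tying the ISA, coherence, and integration points together is sketched below. All lower-level calls are placeholders standing in for platform-specific mechanisms (CLFLUSH loops, an extended memory-controller command in the spirit of pluto_op, cache-line invalidation); none of them is a real driver API.

```python
# Hypothetical host-side orchestration of one PUM operation, illustrating the
# coherence steps discussed above. The three lower-level functions are stubs,
# not a real driver interface.

def flush_cache_range(row):       # stand-in for CLFLUSH over the row's cache lines
    print(f"flush      row {row:#x}")

def issue_pum_op(opcode, src_rows, dst_row):   # stand-in for an extended DRAM command
    print(f"pum_op     {opcode} {src_rows} -> {dst_row:#x}")

def invalidate_cache_range(row):  # stand-in for invalidating stale cached copies
    print(f"invalidate row {row:#x}")

def run_pum_op(opcode, src_rows, dst_row):
    for row in src_rows:
        flush_cache_range(row)                 # 1. make operands visible in DRAM
    issue_pum_op(opcode, src_rows, dst_row)    # 2. trigger the in-memory primitive
    invalidate_cache_range(dst_row)            # 3. ensure later CPU loads see the result

run_pum_op("AND", src_rows=[0x1000, 0x1001], dst_row=0x1002)
```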

6. Comparative Analysis and Future Outlook

PUM architectures, particularly DRAM-based solutions like pLUTo, MIMDRAM, and modern LUT-based proposals (Lama), routinely exceed previous bit-serial or near-memory designs by over an order of magnitude in both performance and energy (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025, Oliveira et al., 29 Feb 2024). For 8-bit LUT-based bulk operations, pLUTo yields 713× speedup and 1855× energy reduction over optimized CPUs, with a 16.7% area cost, while Lama leverages column-level parallelism for further energy reduction at just 2.47% area overhead (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025). Crossbar NVM-based PUM solutions (MATSA) yield similar gains in streaming- and DP-bound workloads (Fernandez et al., 2022).

A design's position in this space is dictated by area overhead vs. supported operation set, flexibility vs. specialization, and the granularity of row/column access. Emerging research is investigating dynamic LUT/write-backed functions, hybrid decomposition, compiler-automated function mapping, and transparent integration with near-data accelerators.

Adoption is gated by system integration challenges (ISA, virtualization, coherence), the need for cross-layer software–hardware co-design, and extension to full data-analytics workload classes. Industry prototypes and open-source frameworks now enable end-to-end, scalable experimentation and deployment.

7. Conclusion

Processing-Using-Memory transforms passive memory arrays into high-parallelism, energy-proportional computational substrates by leveraging device-intrinsic capabilities. It delivers dramatic reductions in data movement costs and is especially effective for bandwidth-bound, irregular, and highly parallel tasks. Modern DRAM-based PUM realizations (pLUTo, Lama, MIMDRAM), as well as crossbar-NVM PUM for streaming analytics, achieve performance and energy improvements exceeding 10×–1000× over traditional architectures for target workloads, with modest area cost and increasing flexibility. Continued advances in controller logic, compiler/OS support, and analytic workload mapping are broadening the applicability and impact of PUM architectures across scientific computing, AI, databases, and emerging application domains (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025, Oliveira et al., 29 Feb 2024, Fernandez et al., 2022, Mutlu et al., 2020).
