Processing-Using-Memory (PuM)
- Processing-Using-Memory (PuM) is a computing paradigm that exploits the intrinsic properties of memory cells to execute bulk operations in situ, reducing costly data transfers.
- PuM employs techniques like simultaneous multi-row activation and LUT-based operations to achieve significant speedup and energy efficiency in diverse applications.
- Integration of custom memory allocators, compiler mapping, and OS support ensures accurate operand alignment and efficient resource allocation for in-memory processing.
Processing-Using-Memory (PuM) is a computing paradigm in which primary computation is performed using the intrinsic physical, electrical, or analog properties of the memory cells themselves, avoiding most of the data movement required in conventional processor-centric systems. Unlike Processing-Near-Memory (PnM), which embeds logic close to memory, PuM directly leverages the organization and operational characteristics of memory arrays (primarily DRAM, but also emerging memories such as MRAM) to carry out bulk operations, particularly data-intensive primitives, within the memory fabric.
1. Fundamental Principles and Operational Models
The central goal of PuM is to minimize data movement between computation engines and memory, since this movement dominates both energy consumption and execution latency in data-intensive applications. PuM achieves this by exploiting memory cell physics to implement operations in situ, such as:
- Bulk data copy and initialization via back-to-back activations within a DRAM subarray (RowClone; (Mutlu et al., 2020))
- Bulk bitwise operations (AND, OR, NOT, MAJ) via simultaneous multi-row activations that induce charge-sharing on the bitlines and use sense amplifiers to resolve the majority value (Ambit, SIMDRAM, PULSAR; (Mutlu et al., 2020, Oliveira et al., 2022, Yuksel et al., 2023))
- Table-lookup (LUT)-based transformations, mapping computationally complex functions to precomputed results stored in DRAM, which are accessed in a massively parallel and energy-efficient fashion (pLUTo, Lama; (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025))
Mathematically, bulk AND and OR are special cases of the bitwise majority-of-three operation:
$$\mathrm{MAJ}(A, B, C) = AB + BC + CA$$
where the initialization of the third row C selects the function: C set to all zeros yields A AND B, and C set to all ones yields A OR B.
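
The selection of AND or OR through the third operand can be checked with a small functional model. This is a minimal sketch that treats Python integers as bit-vectors standing in for DRAM rows; the names `maj3`, `bulk_and`, `bulk_or`, and the 8-bit `WIDTH` are illustrative assumptions, and the code models only the logic, not charge sharing or sense-amplifier behavior.

```python
WIDTH = 8                      # assumed row width in bits (real DRAM rows are far wider)
ALL_ZEROS = 0
ALL_ONES = (1 << WIDTH) - 1


def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width bit-vectors held as ints."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    # Control row initialized to all zeros: the majority is 1 only if a and b are both 1.
    return maj3(a, b, ALL_ZEROS)


def bulk_or(a: int, b: int) -> int:
    # Control row initialized to all ones: the majority is 1 if either a or b is 1.
    return maj3(a, b, ALL_ONES)


a = 0b11001010
b = 0b10100110
print(f"AND: {bulk_and(a, b):0{WIDTH}b}")
print(f"OR:  {bulk_or(a, b):0{WIDTH}b}")
```

Every bit position is computed independently, which is what lets a single multi-row activation apply the operation across an entire row at once.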
While initial PuM techniques were limited to SIMD-style bitwise operations, recent advances extend the model to MIMD-style execution by enabling selective activation of finer-grained memory segments (“mats”) and by supporting flexible interconnect and resource management (Oliveira et al., 29 Feb 2024, Oliveira, 27 Aug 2025).
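
To see why mat-level activation matters, the following sketch compares lane utilization when a short vector occupies a full subarray row versus only the mats it needs. The geometry (`MATS_PER_SUBARRAY`, `COLUMNS_PER_MAT`) and the one-element-per-column layout are assumptions chosen for illustration, not parameters taken from any cited design.

```python
import math

MATS_PER_SUBARRAY = 128   # assumed, for illustration only
COLUMNS_PER_MAT = 512     # assumed; each column acts as one bit-serial lane


def mats_needed(num_elements: int) -> int:
    """Smallest number of mats whose columns can hold one element per lane."""
    if not 0 < num_elements <= MATS_PER_SUBARRAY * COLUMNS_PER_MAT:
        raise ValueError("element count must fit in one subarray row")
    return math.ceil(num_elements / COLUMNS_PER_MAT)


def lane_utilization(num_elements: int, active_mats: int) -> float:
    """Fraction of activated lanes that carry useful data."""
    return num_elements / (active_mats * COLUMNS_PER_MAT)


n = 1000
full_row = lane_utilization(n, MATS_PER_SUBARRAY)
mat_level = lane_utilization(n, mats_needed(n))
print(f"full-row activation:  {full_row:.1%} of lanes used")
print(f"mat-level activation: {mat_level:.1%} of lanes used ({mats_needed(n)} mats)")
```

Under these assumed dimensions, a 1000-element vector uses about 1.5% of the lanes when the whole row is activated and about 98% when only two mats are activated, leaving the remaining mats free for other work.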
2. Core Architectural and Algorithmic Innovations
Major architectural enhancements for PuM include:
- Simultaneous Multi-Row Activation: Techniques such as PULSAR use carefully crafted DRAM command sequences to simultaneously activate up to 32 rows, replicating operand data to increase the bitline voltage deviation and thereby make the analog computation more reliable. This enables not only MAJ3 but also MAJ5, MAJ7, Multi-RowInit, and Bulk-Write primitives, supporting a broader set of in-memory functions (Yuksel et al., 2023).
- Fine-Grained Resource Allocation: MIMDRAM partitions subarrays into mats and introduces mat isolation transistors, row decoder latches, and selectors. This enables MIMD execution within DRAM: only the minimum required physical resources are allocated and activated per computational primitive, reducing underutilization and energy waste (Oliveira et al., 29 Feb 2024, Oliveira, 27 Aug 2025).
- LUT-Based Operations: pLUTo and Lama architectures embed lookup tables in DRAM subarrays/mats, supporting highly parallel and low-latency table-lookup via independent column accesses in each mat. This enables efficient execution of complex operations (e.g., multiply, divide, trigonometric functions) that are not tractable via direct analog logic (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
- Compiler and OS Integration: Advanced systems, notably MIMDRAM and PUMA, include compiler passes for automatic mat mapping and vectorization as well as custom memory allocators that ensure operand alignment and co-location to match DRAM topology, crucial for correct operation, coherence, and performance (Oliveira et al., 29 Feb 2024, Oliveira et al., 7 Mar 2024).
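
The idea behind LUT-based operations can be illustrated with a purely functional model: a function is precomputed over all operand pairs, and a bulk operation becomes one table lookup per element. This sketch assumes a 4-bit operand width and uses hypothetical helper names (`build_mul_lut`, `lut_index`, `bulk_lut_multiply`); it does not model how pLUTo or Lama place tables in subarrays or issue DRAM commands.

```python
from __future__ import annotations

BITS = 4  # assumed operand width, kept small so the table stays tiny


def build_mul_lut(bits: int) -> list[int]:
    """Precompute x * y for every pair of `bits`-wide unsigned operands."""
    size = 1 << bits
    return [x * y for x in range(size) for y in range(size)]


def lut_index(x: int, y: int, bits: int) -> int:
    """Concatenate the operands to form the table index."""
    return (x << bits) | y


def bulk_lut_multiply(xs: list[int], ys: list[int], lut: list[int], bits: int) -> list[int]:
    """Replace each multiplication with a lookup; in hardware the lookups run in parallel."""
    limit = 1 << bits
    if any(not 0 <= v < limit for v in (*xs, *ys)):
        raise ValueError("operand out of range for the table")
    return [lut[lut_index(x, y, bits)] for x, y in zip(xs, ys)]


MUL_LUT = build_mul_lut(BITS)
products = bulk_lut_multiply([3, 7, 15], [5, 2, 15], MUL_LUT, BITS)
print(products)
```

The table has 2^(2·BITS) entries, so its size grows exponentially with operand width; this is one reason such designs pair lookups with low-precision or quantized operands.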
3. System Integration and Programming Support
Effective deployment of PuM requires cross-layer support across the hardware, OS, and software stack:
- Custom Memory Allocation: Operations like RowClone and Ambit require operands to reside in the same subarray and to be aligned to DRAM row boundaries. Standard allocators (e.g., malloc, posix_memalign) do not guarantee these constraints. PUMA introduces a kernel-level allocator that splits huge pages into DRAM-aware allocation units and exposes APIs for aligned and co-located object creation. This ensures alignment to the DRAM row size R and subarray membership; for example, a huge page of size H is split into N allocation units of size R, so that H = N × R (Oliveira et al., 7 Mar 2024).
- Automatic Code Mapping: Frameworks such as DaPPA and the MIMDRAM compiler pass identify data-parallel programming patterns (map, reduce, zip), analyze vectorization factors, and annotate memory objects with mat labels for allocation and scheduling within DRAM (Oliveira, 27 Aug 2025, Oliveira et al., 29 Feb 2024).
- Abstraction of In-Memory Primitives: By providing high-level APIs for operations such as bulk AND/OR, RowClone, or table-lookup, recent frameworks abstract away timing and command-level details. The memory controller and OS manage cache coherence (e.g., ensuring data consistency between caches and DRAM before an in-place operation) and allocate the necessary compute segments.
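
The alignment and co-location constraints can be made concrete with a little address arithmetic. This is a hedged sketch, not PUMA's interface: the sizes (`HUGE_PAGE_BYTES`, `ROW_BYTES`, `ROWS_PER_SUBARRAY`), the linear address-to-subarray mapping, and all function names are assumptions for illustration, since real DRAM address mappings are device-specific and generally more complex.

```python
from __future__ import annotations

HUGE_PAGE_BYTES = 2 * 1024 * 1024   # H: assumed 2 MiB huge page
ROW_BYTES = 8 * 1024                # R: assumed DRAM row size
ROWS_PER_SUBARRAY = 512             # assumed subarray height


def allocation_units(huge_page_base: int) -> list[int]:
    """Split one huge page into N = H / R row-sized, row-aligned units."""
    if huge_page_base % HUGE_PAGE_BYTES != 0:
        raise ValueError("base address is not huge-page aligned")
    n = HUGE_PAGE_BYTES // ROW_BYTES
    return [huge_page_base + i * ROW_BYTES for i in range(n)]


def subarray_id(addr: int) -> int:
    """Subarray index under the assumed linear row-to-subarray mapping."""
    return (addr // ROW_BYTES) // ROWS_PER_SUBARRAY


def can_operate_in_dram(src: int, dst: int) -> bool:
    """Both operands must be row-aligned and located in the same subarray."""
    aligned = src % ROW_BYTES == 0 and dst % ROW_BYTES == 0
    return aligned and subarray_id(src) == subarray_id(dst)


units = allocation_units(0)
print(len(units))                                   # N allocation units per huge page
print(can_operate_in_dram(units[0], units[1]))      # aligned, same subarray
print(can_operate_in_dram(units[0], units[1] + 64)) # misaligned destination
```

An allocator built on this idea would hand out units from the same subarray for operands that will be combined, and fall back to the CPU path when `can_operate_in_dram` is false.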
4. Performance and Energy Efficiency
PuM architectures offer substantial improvements across major performance metrics, as demonstrated in recent evaluation results:
| Architecture | Perf. Speedup (vs. CPU unless noted) | Energy Eff. Gain (vs. CPU unless noted) | Area Overhead |
|---|---|---|---|
| pLUTo (Ferreira et al., 2021) | up to 12.7× | up to 1855× | 10–23% DRAM |
| Lama (Khabbazan et al., 4 Feb 2025) | 8.5× (8b mult) | 6.9–8× | 2.47% DRAM |
| MIMDRAM (Oliveira et al., 29 Feb 2024) | 34× (vs. PuD) | 14.3× (vs. PuD), 30.6× (vs. CPU) | 1.1% DRAM, 0.6% CPU |
| PULSAR (Yuksel et al., 2023) | 2.21× (vs. FracDRAM) | N/A (reliability focus) | None |
These gains derive from drastic reductions in expensive memory activation commands, increased resource utilization (by matching compute allocation to parallelism), and the elimination of most off-chip data movement.
Lama and LamaAccel specifically demonstrate that combining mat-level parallelism and open-page policies can reduce the number of activate (ACT) commands by 19.4×, and that HBM-based deployments with exponential quantization can achieve >9× reductions in energy for deep learning inference (Khabbazan et al., 4 Feb 2025). PULSAR’s input replication improves the success rate of majority operations by more than 24% over prior techniques (Yuksel et al., 2023).
5. Applications and Workload Classes
PuM architectures have demonstrated significant impact in domains where data movement dominates energy and performance:
- Deep Learning/Inference: Bulk multiplication and attention operations are efficiently mapped to LUT-based PuM, with weight/activation quantization supporting further acceleration (Khabbazan et al., 4 Feb 2025).
- Databases and Analytics: In-DRAM sorting, hashing, and DB primitives (SELECT, AGGREGATE, JOIN, ORDER) are accelerated, with real systems (PIMDAL on UPMEM) attaining up to 3.9× TPC-H query speedup over CPUs (Frouzakis et al., 2 Apr 2025).
- Time Series and Signal Processing: MATSA demonstrates >7× improvement in sDTW computation over CPUs and GPUs using MRAM crossbar-based PuM (Fernandez et al., 2022).
- Bioinformatics: Sequence alignment workloads are efficiently implemented on PuM systems, with massive parallelism across DPUs (UPMEM in AIM) yielding substantial speedups (Diab et al., 2022).
- General Data-Parallel Processing: By supporting map/reduce/zip and automatic allocation of compute mats, MIMDRAM and DaPPA enable broader classes of workloads to be mapped to PuM substrates (Oliveira, 27 Aug 2025).
6. Limitations, Challenges, and Future Directions
Despite demonstrated efficiency, PuM systems face notable challenges:
- Operand Alignment and Mapping: Strict requirements for operand co-location and alignment necessitate integration with the OS and additional page mapping complexity (Oliveira et al., 7 Mar 2024, Olgun et al., 2021).
- Limited Compute Primitives: Commodity DRAM designs originally support only a fixed set of analog operations; extending to complex functions often requires additional hardware (e.g., pLUTo/Lama LUT logic, PULSAR row replication circuitry) and may induce area overhead.
- Reliability and Process Variation: Analog operations (especially multi-row activation) can be susceptible to process/temperature-induced failures; advances such as PULSAR address this by operand replication and aggressive sensing, but process-dependent profiling remains necessary (Yuksel et al., 2023).
- Programming and Toolchain Support: The requirement for DRAM-aware memory allocators, specialized compiler passes, and runtime systems creates a barrier to adoption. Recent frameworks (e.g., DaPPA, PUMA) are making advances to automate these aspects, but mainstream compilers and OS kernels are not yet PuM-ready (Oliveira, 27 Aug 2025, Oliveira et al., 7 Mar 2024).
- Adoption in Commodity Systems: While the modifications proposed by architectures such as Lama and MIMDRAM incur modest area overheads, standard commodity DRAM chips (apart from experimental devices) do not support PuM, so deployment may remain slow without further standardization and industry support.
Future directions involve continued refinement of reliability techniques (fine-grained error correction or self-tuning), transparent integration with OS/hypervisor-level memory management, broader support for complex operations via new analog/digital hybrid logic, and open-source release of programming frameworks and toolchains.
7. Summary and Impact
Processing-Using-Memory represents a paradigm shift away from processor-centric computing toward a memory-centric model in which computation is pushed into the memory arrays themselves. By aligning computational resource allocation with data-level parallelism, leveraging the analog behaviors of modern (and emerging) memory technologies, and tightly orchestrating the interaction of allocators, compilers, and operating systems, PuM architectures mitigate the memory wall, especially for workloads dominated by data movement. Open-source libraries (pLUTo, PIMDAL), realistic toolkits (PiDRAM), and extensive benchmarking methodologies (DAMOV) have matured the field to a stage where end-to-end system-level demonstrations with substantial performance and energy gains are possible (Ferreira et al., 2021, Frouzakis et al., 2 Apr 2025, Olgun et al., 2021, Oliveira, 27 Aug 2025). As research continues to address reliability, programmability, and integration challenges, the prospects for widespread adoption of PuM in high-performance and data-intensive systems continue to improve.