PIM Execution Primitives (PEPs)
- PIM Execution Primitives (PEPs) are the smallest indivisible computation and memory actions that enable in-memory acceleration in digital processing systems.
- They encapsulate a range of operations—from bitwise logic to matrix micro-kernels—supporting applications in scientific computing, machine learning, and graph analytics.
- Their systematic formulation and technological mapping facilitate co-design optimizations that improve performance, energy efficiency, and portability across diverse memory technologies.
PIM Execution Primitives (PEPs) are the fundamental hardware- and microcode-level actions exposing computational capabilities within modern digital processing-in-memory (PIM) architectures. They form the lowest-level compute and data movement units directly executable inside memory arrays such as DRAM or memristive crossbars, providing the enabling substrate for hardware acceleration of data-intensive workloads across scientific computing, machine learning, tensor algebra, and graph analytics. Recent architectural and compiler frameworks—such as inclusive-PIM designs for DRAM-based systems, abstractPIM’s technology-independent ISA for crossbar logic, AME-PIM’s matrix micro-kernels, and PyPIM’s programmatic Python integration—systematically formalize the PEP notion, codifying them as a compositional ISA layer with well-defined performance, parallelism, and technological mapping (Alsop et al., 2023, Eliahu et al., 2022, Venieri et al., 30 Apr 2026, Leitersdorf et al., 2022, Leitersdorf et al., 2023).
1. Formal Definition and Taxonomy of PEPs
PEPs are defined as the smallest indivisible computation or memory actions supported by the PIM hardware, exposed as instruction mnemonics (e.g., ADD, NOR₂, FILL), microprogrammed micro-ops, or fixed-function micro-kernels. PEP sets vary with the PIM substrate (HBM, DRAM, memristive crossbars) and the system design.
Taxonomy of PEPs:
| PEP Category | Representative PEPs | Typical Platforms |
|---|---|---|
| Bitwise digital logic | NOT, NOR₂, AND₂ | Crossbar PIM: abstractPIM, PyPIM |
| Element-wise arithmetic | ADD-PEP, MUL-PEP | HBM-PIM: AME-PIM |
| Data movement/activation | FILL, MOV, MASK | HBM-PIM, PyPIM |
| Aggregate/outer-product kernel | MAC-PEP, GEMM micro-kernel | AME-PIM, Inclusive-PIM |
| High-level tensor operations (R-type ISA) | ADD, SUB, CMP, MOVE | PyPIM |
- In abstractPIM, the canonical ISA of PEPs includes all 1–4 input Boolean functions, MUX, half-adders, and their symmetric extensions, with a microcode mapping for each PIM technology (Eliahu et al., 2022).
- PyPIM exposes both micro-ops (MASK, READ, WRITE, LOGIC_H, LOGIC_V, MOVE) and a macro R-type ISA for standard arithmetic, comparison, and logical operations (Leitersdorf et al., 2023).
- DRAM/HBM-PIM architectures (AME-PIM) implement PEPs as fixed-function micro-kernels matched to matrix, vector, and tile arithmetic, e.g., MAC-PEP for outer-product accumulation and SIMD ADD/MUL microkernels (Venieri et al., 30 Apr 2026).
- AritPIM details bit-serial and bit-parallel arithmetic PEPs, including full carry-lookahead and carry-save primitives for addition, multiplication, and division in both fixed- and floating-point (Leitersdorf et al., 2022).
- Inclusive-PIM analyzes five "acceleratable" PEPs: vector-sum, wavesim-volume, wavesim-flux, ss-GEMM, and push-based graph kernels, grounding the definition in mathematical, memory, and hardware constraints (Alsop et al., 2023).
2. Instruction Set Design and Encoding
The implementation of PEPs at the ISA and microarchitectural layer is critical for portability, backward compatibility, and hardware abstraction.
- In abstractPIM, each PEP is associated with a fixed-width instruction; for example, a 16-bit encoding with distinct fields for OPCODE, destination row, and source rows. Complex multi-output primitives (e.g., half-adder) extend fields as needed (Eliahu et al., 2022).
- PyPIM’s 64-bit micro-ops encode the operation type, data width, partition or crossbar selectors, gate types, and operand selectors. The R-type ISA overlays the micro-ops as 32-bit macro-instructions, akin to RISC integer ALUs (Leitersdorf et al., 2023).
- DRAM/HBM-PIM’s PEPs are microprogrammed into per-channel command registers, triggered via standard DRAM commands (column, row) that synchronize wide SIMD datapaths across banks (Venieri et al., 30 Apr 2026).
This hierarchical encoding enables decoupling high-level compute graph scheduling from hardware-specific (e.g., MAGIC, IMPLY) instruction synthesis, permitting code generation, reuse, and microcode library replacement as technologies evolve (Eliahu et al., 2022).
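A fixed-width encoding of this kind can be sketched as below. This is a minimal illustration only: the 4-bit field widths, the opcode numbering, and the names `PepInstr`, `encode`, and `decode` are assumptions made for the example and are not taken from abstractPIM or PyPIM.

```python
from typing import NamedTuple

# Hypothetical 16-bit layout (assumed for illustration, not a published ISA):
# [15:12] opcode | [11:8] destination row | [7:4] source row A | [3:0] source row B
OPCODES = {"NOT": 0x0, "NOR2": 0x1, "AND2": 0x2, "MUX": 0x3}
MNEMONICS = {code: name for name, code in OPCODES.items()}


class PepInstr(NamedTuple):
    op: str
    dst: int
    src_a: int
    src_b: int


def encode(instr: PepInstr) -> int:
    """Pack one PEP instruction into a 16-bit word."""
    for field in (instr.dst, instr.src_a, instr.src_b):
        if not 0 <= field < 16:
            raise ValueError("row index must fit in 4 bits")
    return (OPCODES[instr.op] << 12) | (instr.dst << 8) | (instr.src_a << 4) | instr.src_b


def decode(word: int) -> PepInstr:
    """Unpack a 16-bit word back into its fields."""
    return PepInstr(
        MNEMONICS[(word >> 12) & 0xF],
        (word >> 8) & 0xF,
        (word >> 4) & 0xF,
        word & 0xF,
    )


word = encode(PepInstr("NOR2", dst=5, src_a=2, src_b=3))
print(f"{word:#06x}")  # prints 0x1523
print(decode(word))
```

Because the word carries only an opcode and row indices, nothing in it depends on how a given logic family realizes the gate, which is the property the decoupling above relies on.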
3. Algorithmic Realization and Technological Mapping
Each PEP must be realized efficiently for the native in-memory logic or DRAM compute unit.
- Memristive crossbar PEPs (abstractPIM, PyPIM) map to sequences of (possibly parallel) stateful logic cycles: single-row NOR, AND, NOT using row/column biasing and partition isolation. Primitives such as addition, multiplication, and division employ canonical arithmetic algorithms (bit-serial ripple-carry, parallel-prefix, carry-save, Karatsuba, Brent–Kung, non-restoring division) decomposed into sequences of base logic PEPs (Eliahu et al., 2022, Leitersdorf et al., 2022).
- HBM-PIM (AME-PIM) designs map AME RISC-V tile instructions to combinations of SIMD PEPs such as ADD-PEP, MUL-PEP, SUB-PEP, and MAC-PEP. MAC-PEP implements an outer-product update with fully in-memory accumulation to obviate host-PIM data shuttling and maximize bank-locality (Venieri et al., 30 Apr 2026).
- Data-movement PEPs (FILL, MOV in AME-PIM; MASK, MOVE in PyPIM) are designed to leverage DRAM/getter bus widths or H-tree on-chip interconnects, supporting register refill, accumulator flush, and cross-array communication in a bank/partition-parallel manner (Leitersdorf et al., 2023, Venieri et al., 30 Apr 2026).
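The decomposition of arithmetic into base logic PEPs can be illustrated with a toy bit-serial ripple-carry adder built only from NOR. The sketch assumes a simplified crossbar model in which each column is a Python integer whose bit r holds the cell of row r, so one NOR call acts on every row at once. The class `NorCrossbar` and the helper names are hypothetical, and the gate decomposition is deliberately unoptimized compared with the algorithms in the cited work.

```python
class NorCrossbar:
    """Toy model: a column is an int whose bit r is the cell in row r."""

    def __init__(self, rows: int):
        self.mask = (1 << rows) - 1
        self.nor_count = 0  # number of NOR PEPs issued so far

    def nor(self, a: int, b: int) -> int:
        """One row-parallel NOR PEP over two columns."""
        self.nor_count += 1
        return ~(a | b) & self.mask

    def not_(self, a: int) -> int:
        return self.nor(a, a)

    def or_(self, a: int, b: int) -> int:
        return self.not_(self.nor(a, b))

    def and_(self, a: int, b: int) -> int:
        return self.nor(self.not_(a), self.not_(b))

    def xor(self, a: int, b: int) -> int:
        # XOR(a, b) = NOR(NOR(a, b), AND(a, b))
        return self.nor(self.nor(a, b), self.and_(a, b))


def ripple_add(xb: NorCrossbar, a_cols, b_cols):
    """Add two vectors stored LSB-first as lists of bit columns."""
    carry, out = 0, []
    for a, b in zip(a_cols, b_cols):
        t = xb.xor(a, b)
        out.append(xb.xor(t, carry))                       # sum bit
        carry = xb.or_(xb.and_(a, b), xb.and_(t, carry))   # carry out
    out.append(carry)
    return out


def to_columns(values, width):
    """Transpose per-row integers into per-bit columns (LSB first)."""
    return [sum(((v >> bit) & 1) << row for row, v in enumerate(values))
            for bit in range(width)]


def from_columns(cols, rows):
    """Transpose per-bit columns back into per-row integers."""
    return [sum(((col >> row) & 1) << bit for bit, col in enumerate(cols))
            for row in range(rows)]


xb = NorCrossbar(rows=4)
a, b = [3, 5, 7, 15], [1, 10, 9, 15]
result = from_columns(ripple_add(xb, to_columns(a, 4), to_columns(b, 4)), rows=4)
print(result)        # prints [4, 15, 16, 30]
print(xb.nor_count)  # NOR PEPs issued; independent of the number of rows
```

In this model the NOR count grows with the operand width but not with the number of rows, which is the row-parallelism that crossbar mappings exploit.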
The crossbar mapping for logic PEPs emphasizes parallel gate execution per partition; DRAM-PIM focuses on leveraging bank-parallelism and minimizing row activation penalties via bank-group pipelining (Alsop et al., 2023, Venieri et al., 30 Apr 2026).
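The outer-product formulation behind a MAC-style primitive can be sketched in plain Python. This is a functional illustration under stated assumptions, not the AME-PIM micro-kernel: `mac_pep` and `gemm_outer_product` are hypothetical names, and the nested list `acc` merely stands in for an accumulator that stays resident in memory between updates.

```python
def mac_pep(acc, a_col, b_row):
    """One outer-product update in place: acc[i][j] += a_col[i] * b_row[j]."""
    for i, a in enumerate(a_col):
        for j, b in enumerate(b_row):
            acc[i][j] += a * b


def gemm_outer_product(A, B):
    """Compute A @ B as a sum of k outer products, one MAC update per step."""
    m, k, n = len(A), len(B), len(B[0])
    acc = [[0] * n for _ in range(m)]  # accumulator tile, never leaves "memory"
    for p in range(k):
        a_col = [A[i][p] for i in range(m)]  # column p of A
        mac_pep(acc, a_col, B[p])            # row p of B
    return acc


C = gemm_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)  # prints [[19, 22], [43, 50]]
```

Each step touches the whole accumulator tile but needs only one column of A and one row of B, so partial sums never have to be shipped back to a host for reduction.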
4. Performance Modeling and Amenability
PEPs must align with the memory and compute bandwidth regimes of the PIM platform to realize speedup versus traditional compute.
- Inclusive-PIM introduces a PIM-amenability test based on four quantitative criteria: memory-bandwidth limit (operational intensity against PIM BW), memory residency/on-chip reuse, operand locality (bank alignment), and aligned data parallelism (SIMD group/grid compliance) (Alsop et al., 2023).
- For crossbar PIM, aggregate performance is analyzed as a function of micro-op latency and macro-op sequencing; bit-parallel primitives reduce arithmetic latency relative to their bit-serial counterparts, improving both throughput (ops/sec) and energy efficiency (ops/W) (Leitersdorf et al., 2022, Leitersdorf et al., 2023).
- DRAM/HBM-PIM can deliver up to 14.9 GFLOP/s (59.4 FLOP/cycle) per pseudo-channel for matrix multiplication using MAC-PEPs in outer-product mode, populating ridgeline points on roofline models limited by row activation, register bandwidth, and command issue rates (Venieri et al., 30 Apr 2026, Alsop et al., 2023).
PEP set choice and algorithmic mapping are thus directly informed by memory and computation bottlenecks; co-design of data placement, PEP scheduling, and microkernel design is required to achieve effective BW utilization (Alsop et al., 2023, Venieri et al., 30 Apr 2026).
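The bandwidth criterion can be made concrete with the standard roofline bound, in which attainable throughput is the smaller of the compute peak and operational intensity times bandwidth. The sketch below uses made-up, unitless numbers chosen for readability; they are not measurements from any of the cited systems, and the function names are hypothetical.

```python
def attainable_flops(operational_intensity, peak_flops, bandwidth):
    """Roofline bound: min(compute peak, operational intensity * memory bandwidth)."""
    return min(peak_flops, operational_intensity * bandwidth)


def ridge_point(peak_flops, bandwidth):
    """Operational intensity at which a kernel stops being bandwidth-bound."""
    return peak_flops / bandwidth


def is_bandwidth_bound(operational_intensity, peak_flops, bandwidth):
    """True when the kernel sits on the sloped (memory-limited) part of the roofline."""
    return operational_intensity < ridge_point(peak_flops, bandwidth)


# Hypothetical platform: peak of 100 FLOP per time unit, 50 bytes per time unit.
PEAK, BW = 100.0, 50.0

for name, oi in [("streaming kernel", 0.5), ("dense kernel", 4.0)]:
    print(name, attainable_flops(oi, PEAK, BW), is_bandwidth_bound(oi, PEAK, BW))
```

A kernel that falls below the ridge point on the host but sees a much higher effective bandwidth inside memory is the kind of candidate the amenability criteria are meant to identify; the remaining criteria (residency, operand locality, aligned parallelism) are not modeled here.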
5. Co-Design Optimizations, Trade-Offs, and ISA Portability
Co-design of hardware PEP sets and software/hardware interfaces is a central theme in state-of-the-art PIM platforms.
- Inclusive-PIM demonstrates that extending PEP execution by overlapping row activations, skipping PEPs for dynamic zeros (sparse matrix multiply), or partially offloading updates to on-chip caches (for graph algorithms) increases the realized speedup relative to high-end GPU baselines (Alsop et al., 2023).
- abstractPIM emphasizes backward compatibility: by separating a target-independent PEP ISA stream from family-specific microcode, the same PEP program can execute on different memristive logic families (MAGIC, IMPLY, etc.)—with device-specific microcode libraries installed in the PIM controller. This approach reduces code size, supports technology upgrades, and encapsulates device/sequence-level scheduling detail (Eliahu et al., 2022).
- PyPIM demonstrates programmatic mapping from high-level tensor operations into R-type ISA instructions, each expanded into optimized micro-op sequences (LOGIC_H/V, MOVE, etc.). The abstraction enables full utilization of crossbar parallelism, minimizes control bottlenecks, and facilitates flexible data movement across horizontal, vertical, and inter-array axes (Leitersdorf et al., 2023).
- In AME-PIM, mapping AME instructions to PEP microkernels and eliminating host-side reduction is key for achieving near-peak sustained throughput and minimal off-chip transfers, though inner-loop limitations (e.g., counter range, accumulator width) still bound some use cases (Venieri et al., 30 Apr 2026).
PEP granularity, ISA richness, and technology-specific trade-offs—latency, area, energy, code length, flexibility—are thus central architectural decisions, with richer ISA sets reducing code size but increasing per-PEP control complexity (Eliahu et al., 2022, Leitersdorf et al., 2022).
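The separation between a target-independent PEP stream and family-specific microcode can be sketched as follows. The two classes are simplified stand-ins, assumed for illustration: one exposes only a NOR gate (in the spirit of MAGIC) and the other only material implication with a constant 0 (in the spirit of IMPLY). The cycle counts are a toy metric that ignores initialization and reset steps, and none of the names come from abstractPIM itself.

```python
class NorFamily:
    """Hypothetical microcode library for a NOR-only logic family."""

    def __init__(self):
        self.cycles = 0

    def _nor(self, a, b):
        self.cycles += 1
        return 1 - (a | b)

    def NOT(self, a):
        return self._nor(a, a)

    def OR2(self, a, b):
        return self.NOT(self._nor(a, b))

    def AND2(self, a, b):
        return self._nor(self.NOT(a), self.NOT(b))

    def XOR2(self, a, b):
        return self._nor(self._nor(a, b), self.AND2(a, b))


class ImplyFamily:
    """Hypothetical microcode library built on implication (a -> b = NOT a OR b)."""

    def __init__(self):
        self.cycles = 0

    def _imp(self, a, b):
        self.cycles += 1
        return (1 - a) | b

    def NOT(self, a):
        return self._imp(a, 0)

    def OR2(self, a, b):
        return self._imp(self.NOT(a), b)

    def AND2(self, a, b):
        return self.NOT(self._imp(a, self.NOT(b)))

    def XOR2(self, a, b):
        return self.OR2(self.AND2(a, self.NOT(b)), self.AND2(self.NOT(a), b))


# One target-independent PEP program: a half-adder as (opcode, destination, sources).
HALF_ADDER = [("XOR2", "sum", ("a", "b")), ("AND2", "carry", ("a", "b"))]


def run(program, family, inputs):
    """Expand each PEP through the family's microcode and return the final registers."""
    regs = dict(inputs)
    for op, dst, srcs in program:
        regs[dst] = getattr(family, op)(*(regs[s] for s in srcs))
    return regs


for family in (NorFamily(), ImplyFamily()):
    out = run(HALF_ADDER, family, {"a": 1, "b": 1})
    print(type(family).__name__, out["sum"], out["carry"], family.cycles)
```

The program text is identical for both families; only the installed library, and therefore the primitive-cycle cost, changes. This is also where the trade-off noted above appears: a richer PEP set shortens the program but pushes more expansion work into each library.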
6. Application Domains and Representative Use Cases
PEP-defined architectures and their associated ISAs enable a broad spectrum of data-centric applications.
- Scientific computing: PEPs form the substrate for high-order PDE solvers (wavesim-volume/flux), sparse linear algebra (ss-GEMM), and block-structured matrix multiply/accumulate microkernels (MAC-PEP, GEMM) (Alsop et al., 2023, Venieri et al., 30 Apr 2026).
- Machine learning: Elementwise primitives (ADD, MUL), outer-product accumulations, and tensor-format data movement are fundamental to DNN forward and backward computation; PEPs permit tiling, pipelined accumulation, and all-in-memory execution (Alsop et al., 2023, Venieri et al., 30 Apr 2026, Leitersdorf et al., 2023).
- Graph workloads: PEPs implement push-based update kernels, exploiting locality and selective host/offload decisions depending on cache-hit rates (Alsop et al., 2023).
- Arithmetic logic: Crossbar platforms realize full IEEE 754 floating-point add, multiply, divide, modulo, and sign/compare operations, along with their vectorized variants, via cascades of basic logic PEPs using serial or parallel-pipeline techniques (Leitersdorf et al., 2022, Leitersdorf et al., 2023).
- Tensor libraries: PyPIM translates high-level Python/NumPy operations (.add, .sum, .matmul) directly into PEP-based instruction streams executable in massively parallel fabric (Leitersdorf et al., 2023).
Reported results show up to two orders of magnitude improvement in throughput and energy efficiency relative to high-end GPUs for core arithmetic, as well as 2–4× end-to-end acceleration, close to peak, for full matrix/tensor workloads (Leitersdorf et al., 2022, Alsop et al., 2023, Venieri et al., 30 Apr 2026, Leitersdorf et al., 2023).
7. Limitations and Outlook
Although PEP-based architectures demonstrate significant performance and energy scaling, several constraints are inherent to current realizations.
- Some DRAM/HBM-PIMs lack native cross-bank reduction or non-arithmetic instructions, requiring emulation via data movement and longer PEP sequences; e.g., SUB-PEP is implemented as a multiplication by −1 followed by an ADD, and reductions often need nontrivial data relayout (Venieri et al., 30 Apr 2026).
- Micro-op/PEP loop depth, row/column activation overhead, register file width, and fixed tile geometry can restrict attainable parallelism or resource utilization for irregular workloads (Alsop et al., 2023, Venieri et al., 30 Apr 2026).
- Floating-point compliance (full IEEE 754, including NaN and subnormals), rare control flow, and support for large-scale dynamic data structures require additional control FSM complexity or microcode extensions (Leitersdorf et al., 2022, Leitersdorf et al., 2023).
- Device technology variations (memristor drift, DRAM retention, process variation) can impact latency, error rate, or per-PEP sense margin; design must trade off gate count, periphery overhead, and microcontroller complexity for flexibility and sustained throughput (Eliahu et al., 2022, Leitersdorf et al., 2022).
Nevertheless, PIM Execution Primitives provide a formalized, extensible foundation for the architectural and algorithmic co-design of modern in-memory accelerators, decoupling instruction-level parallelism from technology-specific implementation and enabling both portability and performance across disparate compute-in-memory fabrics. The systematic formulation of PEPs—via ISAs, micro-op templates, and kernel fusion—will likely continue to shape future generations of application-specific memory-centric computing (Eliahu et al., 2022, Leitersdorf et al., 2023, Venieri et al., 30 Apr 2026, Alsop et al., 2023, Leitersdorf et al., 2022).