
DIMM-PIM: In-Memory Processing Architecture

Updated 14 March 2026
  • DIMM-PIM is an innovative architecture that embeds compute engines directly into standard DRAM modules, closing the processor–memory performance gap.
  • It utilizes both analog in-array computing and digital programmable cores to enable fine-grained, high-bandwidth, in situ and near-memory processing.
  • Co-design strategies like PIM-MMU and optimized collectives deliver significant throughput and energy improvements in memory-intensive applications.

Dual In-line Memory Module-Based Processing-In-Memory (DIMM-PIM) refers to architectures and systems in which programmable or fixed-function compute resources are tightly integrated within standard memory modules, typically at the granularity of the DRAM bank or rank, and directly interface with the host memory bus (e.g., DDR4/5 channels). This approach aims to close the processor–memory performance gap by eliminating bulk data movement over off-chip memory buses, exploiting the intrinsic parallelism and bandwidth of DRAM, and exposing direct compute capability at memory access points. DIMM-PIM encompasses both minimal in-array analog compute for bulk operations and digital programmable engines on the DIMM PCB, enabling a spectrum of in situ and near-memory processing with minimal changes to form factor and system integration (Mutlu et al., 2019, Mutlu et al., 2020, Lee et al., 2024).

1. Architectures and Integration Models

Two principal dimensions characterize DIMM-PIM architectures: (a) the location and nature of the compute elements and (b) their interface to the host processor.

1.1 DRAM-bus-attached DIMM-PIM: Processing engines (e.g., UPMEM DPUs) are physically incorporated into DRAM chips on standard DDR4/5 DIMMs and directly plug into the host memory bus. Each DRAM bank typically integrates a lightweight RISC core or similar engine tightly coupled to local DRAM (Lee et al., 2024, Chen et al., 2023, Gómez-Luna et al., 2021). This integration enables fine-grained SPMD parallelism across hundreds to thousands of DPUs, with aggregate in-module bandwidth exceeding 1 TB/s per host.

1.2 Register/Buffers with Integrated Logic: Digital logic blocks, including SIMD or fixed-function units, can be embedded in the register or buffer chip on the DIMM, allowing interception of memory commands, high-level offload, and high-bandwidth internal communication among DRAM chips (Mutlu et al., 2019). Examples include bank-level SIMD ALUs as in Samsung’s HBM-PIM and SK Hynix's GDDR-PIM (Alsop et al., 2023).

1.3 Approaches to Processing: Two foundational approaches are (A) processing-using-memory (PUM), where DRAM arrays are slightly modified (e.g., multi-row activation, analog charge-sharing) to directly compute bitwise or bulk functions, and (B) processing-near-memory (PNM), where programmable digital engines (cores/ALUs/accelerators) are co-located either in 3D-stacked logic or on the DIMM buffer/controller (Mutlu et al., 2019, Mutlu et al., 2020, Angizi et al., 2019).
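A bit-level sketch of the PUM idea, assuming an Ambit-style triple-row activation: simultaneously activating three DRAM rows charge-shares their cells, and each bitline settles to the majority of the three stored bits, so one activation computes a bulk bitwise operation across an entire row. This is an illustrative model, not a description of any specific product.

```python
# Model of processing-using-memory (PUM) via triple-row activation:
# charge sharing makes each bitline resolve to the majority of three
# bits, and a control row of all-0s or all-1s selects AND or OR.

def maj3(a: int, b: int, c: int) -> int:
    """Majority of three bits, the primitive one triple-row ACT computes."""
    return 1 if (a + b + c) >= 2 else 0

def bulk_op(row_a, row_b, control_bit):
    """Bulk bitwise op over whole rows in a single activation:
    control row of all-0s yields AND, all-1s yields OR."""
    return [maj3(a, b, control_bit) for a, b in zip(row_a, row_b)]

row_a = [1, 0, 1, 1, 0, 0, 1, 0]
row_b = [1, 1, 0, 1, 0, 1, 1, 0]
assert bulk_op(row_a, row_b, 0) == [a & b for a, b in zip(row_a, row_b)]  # AND
assert bulk_op(row_a, row_b, 1) == [a | b for a, b in zip(row_a, row_b)]  # OR
```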

1.4 Memory-Bus vs. I/O-Bus Integration: Memory-bus PIM (e.g., UPMEM) plugs directly into the CPU’s DRAM channels; I/O-bus PIM (e.g., CXL-PNM, AiMX) sits behind cache-coherent I/O links, decoupling data and control paths (Lee et al., 2024).

2. System-Level Memory Mapping and Data Transfer

To mediate access and prevent bus arbitration hazards, DIMM-PIM systems employ explicit address space partitioning. On UPMEM-class systems, the BIOS partitions the host’s physical address space into “DRAM space” (host-accessible) and “PIM space” (PIM-local), disabling fine-grain address interleaving. Consequently, host CPUs must explicitly copy data between DRAM and PIM regions, typically using runtime APIs (e.g., dpu_push_xfer) to ensure coherence and timing correctness (Lee et al., 2024, Gómez-Luna et al., 2021).
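A minimal model of this partitioned address space: the host sees a DRAM region and a PIM region with no interleaving between them, so data must be staged into each PIM core's local bank by an explicit scatter copy (the role `dpu_push_xfer` plays in the UPMEM SDK). The class and method names here are illustrative stand-ins, not the real runtime API.

```python
# Sketch of host-mediated DRAM->PIM staging under a partitioned,
# non-interleaved address space. Each "DPU" owns a private local bank.

class PimSystem:
    def __init__(self, n_dpus):
        self.dram_space = {}                               # host-accessible region
        self.pim_space = [dict() for _ in range(n_dpus)]   # per-DPU local memory

    def push_xfer(self, symbol, host_buffer):
        """Scatter equal shards of host_buffer into every DPU's local bank.
        The real runtime serializes these copies over the shared DDR channel,
        which is exactly the bottleneck discussed below."""
        n = len(self.pim_space)
        shard = len(host_buffer) // n
        for i, local in enumerate(self.pim_space):
            local[symbol] = host_buffer[i * shard:(i + 1) * shard]

system = PimSystem(n_dpus=4)
system.push_xfer("input", list(range(16)))
assert system.pim_space[0]["input"] == [0, 1, 2, 3]
assert system.pim_space[3]["input"] == [12, 13, 14, 15]
```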

Transfer Bottlenecks: Empirical characterization shows that bulk DRAM↔PIM transfers saturate CPU resources (≈100% utilization, ≈70 W system power) yet achieve only a fraction of maximum channel bandwidth (≈11.6–15.5% of peak), due to software overheads, coarse-grained thread scheduling, and the non-interleaved address mapping (Lee et al., 2024).

Address Mapping: The need to avoid concurrent access to DRAM banks by host and PIM leads to partitioned, non-interleaved mappings per bank/DIMM. While this eliminates timing hazards, it reduces achievable memory-level parallelism (MLP) on both host and PIM sides if not corrected (Lee et al., 2024).
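The MLP trade-off can be seen in a toy pair of mapping functions, with bit positions and sizes chosen for illustration only: an interleaved mapping rotates consecutive cache lines across banks, while the partitioned PIM mapping keeps a contiguous region inside one bank so host and PIM never contend on it.

```python
# Two address-mapping regimes: interleaving maximizes memory-level
# parallelism (MLP); partitioning avoids host/PIM bank conflicts.

LINE = 64  # bytes per cache line

def interleaved(addr, n_banks=16):
    """Consecutive lines rotate across banks (high MLP)."""
    line = addr // LINE
    return line % n_banks, line // n_banks       # (bank, intra-bank line)

def partitioned(addr, bank_size=1 << 26):
    """Consecutive lines stay inside one bank (no host/PIM races)."""
    return addr // bank_size, (addr % bank_size) // LINE

# Adjacent cache lines hit different banks under interleaving...
assert interleaved(0)[0] != interleaved(64)[0]
# ...but the same bank under the partitioned PIM mapping,
# serializing what could have been parallel accesses.
assert partitioned(0)[0] == partitioned(64)[0]
```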

3. Hardware/Software Co-Design: The PIM-MMU

To address limitations in throughput, energy, and CPU utilization when shuttling data between host and PIM spaces, hardware/software co-designs such as PIM-MMU have emerged (Lee et al., 2024).

3.1 Data Copy Engine (DCE): Offloads all copy/transpose logic to a dedicated on-chip unit. DCE sequentially issues read and write commands by maintaining descriptor buffers, performs in-flight transposes, and maximizes utilization of both DRAM and PIM sides (Lee et al., 2024).

3.2 PIM-aware Memory Scheduler (PIM-MS): Globally schedules DRAM and PIM accesses in a round-robin, fine-grained, bank-disjoint manner, maximizing available parallelism while conforming to DRAM timing constraints such as tCCD, tRRD, and tFAW (Lee et al., 2024).
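A toy scheduler in the spirit of PIM-MS illustrates the idea: pending DRAM-space and PIM-space requests are interleaved round-robin, and an activate is withheld whenever it would violate a tFAW-style "at most four ACTs per rolling window" constraint. The timing values and queue structure are illustrative assumptions, not the hardware design.

```python
# Round-robin scheduling of DRAM-space and PIM-space requests under a
# simplified tFAW constraint (max 4 row activations per rolling window).

from collections import deque

T_FAW, MAX_ACTS = 32, 4   # illustrative: 4 ACTs per 32-cycle window

def schedule(dram_reqs, pim_reqs):
    queues = [deque(dram_reqs), deque(pim_reqs)]
    act_times, cycle, issued, turn = [], 0, [], 0
    while any(queues):
        # retire activations that have aged out of the rolling window
        act_times = [t for t in act_times if cycle - t < T_FAW]
        if len(act_times) < MAX_ACTS:          # tFAW headroom available?
            for k in range(2):                 # round-robin across spaces
                q = queues[(turn + k) % 2]
                if q:
                    issued.append(q.popleft())
                    act_times.append(cycle)
                    turn = (turn + k + 1) % 2  # other space goes next
                    break
        cycle += 1                             # otherwise stall this cycle
    return issued

out = schedule(["D0", "D1", "D2"], ["P0", "P1", "P2"])
assert out == ["D0", "P0", "D1", "P1", "D2", "P2"]  # fair interleaving
```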

3.3 Heterogeneity-aware Memory Mapping (HetMap): Dynamically applies either an MLP-maximizing or a locality-centric address mapping function, depending on whether the request targets DRAM or PIM space, restoring optimal parallelism on both sides.

3.4 Software Stack: Front-end APIs aggregate copy descriptors and interface with the DCE via MMIO, enabling batch, interrupt-driven offload rather than CPU-intensive software copy.
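The batch-offload path can be sketched as follows, with all names and the descriptor layout being illustrative assumptions: the runtime accumulates copy descriptors (source, destination, length) and hands the whole batch to the copy engine with a single doorbell write, instead of the CPU looping over the payload itself.

```python
# Toy model of descriptor-based batch offload: one submission drives
# many copies, mimicking MMIO doorbell + descriptor-ring designs.

class CopyEngine:
    """Stand-in for a hardware data copy engine consuming descriptors."""
    def __init__(self, memory):
        self.memory = memory

    def doorbell(self, descriptors):
        # a single "doorbell" submission executes the whole batch
        for src, dst, length in descriptors:
            self.memory[dst:dst + length] = self.memory[src:src + length]

memory = list(range(32)) + [0] * 32
batch = [(0, 32, 8), (8, 48, 8)]      # (src, dst, length) copy descriptors
CopyEngine(memory).doorbell(batch)
assert memory[32:40] == list(range(8))
assert memory[48:56] == list(range(8, 16))
```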

Quantitative Impact: PIM-MMU yields a 4.1× improvement in DRAM↔PIM transfer throughput (from ≈9 GB/s to ≈37 GB/s), a 4.1× improvement in copy energy efficiency, and a 2.2× end-to-end speedup on real-world memory-intensive workloads, with the largest gains for copy-bound primitives, as Amdahl's Law predicts (Lee et al., 2024).
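A quick Amdahl's-law sanity check ties these numbers together: accelerating only the copy fraction f of runtime by s bounds the end-to-end speedup at 1/((1 − f) + f/s). With the 4.1× copy speedup, an end-to-end 2.2× implies copies dominated roughly 70% of the original runtime; the fractions below are illustrative, not measured.

```python
# Amdahl's law: end-to-end speedup when only a fraction f of runtime
# (here, the DRAM<->PIM copies) is accelerated by factor s.

def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

assert round(amdahl(0.60, 4.1), 2) == 1.83  # copy = 60% of runtime
assert round(amdahl(0.72, 4.1), 2) == 2.19  # ~72% is consistent with 2.2x
```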

4. Programming Model, Software Frameworks, and Collective Primitives

DIMM-PIM exposes new requirements for programming models and runtime support.

4.1 Programming Abstractions: SPMD models map each in-bank core (DPU or PE) to a data shard, managed and orchestrated by the host runtime. Modern frameworks (e.g., SimplePIM) abstract PIM arrays, collectives (scatter, gather, allreduce), and iterators (map, reduce, zip), greatly reducing code complexity and providing efficient host-mediated collective communication (Chen et al., 2023).
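A host-side sketch of these iterator abstractions, assuming nothing about any specific framework's API (function names are illustrative): the runtime shards an array across PEs, each PE runs the same kernel on its shard (map), and the host combines per-PE partial results (reduce).

```python
# SPMD map/reduce over a PIM array: shard data, run one kernel per PE,
# combine per-PE partials on the host.

def pim_map(data, n_pes, kernel):
    shard = (len(data) + n_pes - 1) // n_pes
    shards = [data[i * shard:(i + 1) * shard] for i in range(n_pes)]
    return [[kernel(x) for x in s] for s in shards]   # same kernel per PE

def pim_reduce(data, n_pes, op, init):
    shard = (len(data) + n_pes - 1) // n_pes
    partials = []
    for i in range(n_pes):            # each PE reduces its local shard...
        acc = init
        for x in data[i * shard:(i + 1) * shard]:
            acc = op(acc, x)
        partials.append(acc)
    acc = init                        # ...and the host combines partials
    for p in partials:
        acc = op(acc, p)
    return acc

assert pim_map([1, 2, 3, 4], 2, lambda x: x * x) == [[1, 4], [9, 16]]
assert pim_reduce(list(range(100)), 8, lambda a, b: a + b, 0) == 4950
```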

4.2 Host/PIM Communication: Traditional models rely on CPU-moderated data transfers, which can form performance bottlenecks, especially for collective operations. Optimized libraries such as PID-Comm express inter-PE collectives in a d-dimensional hypercube space, leveraging PE-assisted (WRAM-local) reordering, in-register domain modulation, and cross-domain fusion to minimize host and bandwidth overhead (Noh et al., 2024).

4.3 Collective Operations: PID-Comm defines primitives (alltoall, reduce_scatter, allgather, allreduce, scatter, gather, broadcast, reduce) specifically optimized for the DRAM/channel/PE hierarchy, reducing communication time per operation by up to 5.19×. These primitives exploit bank-level parallelism and host–PE data transfer management to scale to thousands of PEs per socket (Noh et al., 2024).
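The hypercube organization these collectives rely on can be illustrated with a textbook recursive-doubling allreduce: in round k, PE i exchanges partials with PE i XOR 2^k, so the collective finishes in log2(P) rounds rather than P − 1 ring steps. This is a generic sketch of the communication pattern, not PID-Comm's implementation.

```python
# Recursive-doubling allreduce over a d-dimensional hypercube of PEs.

def hypercube_allreduce(values):
    p = len(values)                  # number of PEs; must be a power of two
    assert p & (p - 1) == 0
    vals = list(values)
    k = 1
    while k < p:                     # log2(p) exchange rounds
        # in round k, PE i pairs with its neighbor along dimension k
        vals = [vals[i] + vals[i ^ k] for i in range(p)]
        k <<= 1
    return vals                      # every PE ends with the global sum

result = hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
assert result == [36] * 8            # 3 rounds for 8 PEs, not 7 ring steps
```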

4.4 Productivity and Efficiency: Software frameworks reduce lines of code by 66–83%, provide scaling close to theoretical maxima, and even outperform hand-tuned kernel code in memory/bandwidth-bound regimes (Chen et al., 2023).

5. Performance Models, Bottlenecks, and Optimization Strategies

5.1 Bottleneck Characterization: Theoretical and empirical models confirm that successful DIMM-PIM acceleration requires:

  • Workloads that are memory-/bandwidth-bound with low arithmetic intensity.
  • Data placement schemes that maximize operand locality and parallel alignment across banks/chips for effective SIMD utilization.
  • Explicit consideration of DRAM timing parameters and collective bandwidth limits (Alsop et al., 2023, Gómez-Luna et al., 2021).

5.2 Amenability Tests and Data Placement: The “PIM-amenability test” comprises four axes—bandwidth limitation (arithmetic intensity ω), memory-residency ratio (R_m), operand locality, and aligned data parallelism (Alsop et al., 2023). Only workloads passing these criteria (e.g., streaming, graph analytics, large GEMV/MVM) are expected to see substantial acceleration.
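A compact way to express the four-axis test is a predicate over the axes; the thresholds below are illustrative assumptions, since the actual cutoffs depend on the specific PIM device and host.

```python
# Sketch of the four-axis PIM-amenability test: low arithmetic
# intensity, memory-resident working set, bank-local operands, and
# aligned data parallelism. Thresholds are illustrative only.

def pim_amenable(arith_intensity, mem_resident_ratio,
                 operand_locality, aligned_parallelism,
                 omega_max=1.0, rm_min=0.9):
    return (arith_intensity <= omega_max      # omega: few ops per byte
            and mem_resident_ratio >= rm_min  # R_m: data lives in memory
            and operand_locality              # operands co-located per bank
            and aligned_parallelism)          # lanes line up for SIMD

# Streaming GEMV-style kernel: low omega, memory resident -> amenable.
assert pim_amenable(0.25, 0.99, True, True)
# Compute-bound GEMM: high arithmetic intensity -> not amenable.
assert not pim_amenable(60.0, 0.99, True, True)
```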

5.3 Optimizations: Hardware-software co-optimizations further increase speedup and application reach:

  • Architecture-aware row activation overlaps row precharge with compute, improving utilization.
  • Sparsity-aware orchestration skips superfluous operations, exploiting runtime data lane inactivity.
  • Cache-aware selective offload allows the system to predict and execute portions on the CPU/GPU when cache residency is likely (Alsop et al., 2023).

These mechanisms yield average speedups of 2.49× over tuned GPU baselines, with peaks of up to 6.1× on critical LLM workloads using integrated DIMM-PIM+GPU scheduler stacks (e.g., the L3 system for multi-head attention) (Liu et al., 2025).

6. Application Domains, Attained Gains, and System-Level Integration

6.1 Domains: DIMM-PIM acceleration is best demonstrated in high-throughput, memory-intensive kernels with regular data access, such as linear algebra, graph traversal, bulk analytics, and deep learning attention layers (Lee et al., 2024, Kang et al., 2022, Liu et al., 2025).

6.2 Empirical Gains: For copy-bound primitives, UPMEM-class DIMM-PIM with PIM-MMU delivers up to 2.2× end-to-end benchmark speedup, with copying constituting over 60% of total execution time before optimization (Lee et al., 2024). On skewed index workloads, the PIM-tree index exceeds state-of-the-art CPU-only and prior PIM approaches by up to 69.7× in throughput (Kang et al., 2022). For communication-heavy workloads, optimized collectives (PID-Comm) outperform naive baselines by up to 4.2×, with a geomean of ≈2× over host-mediated communication (Noh et al., 2024).

6.3 Integration: DIMM-PIM modules are designed as drop-in replacements for standard DDR4/5 DIMMs, requiring minor changes in BIOS and OS-level memory mapping. Key system tasks involve address-space partitioning, driver/runtime installation, and optional cache coherence management. Most designs fit within a typical 8–10 W per DIMM thermal envelope and add less than 1% die area overhead for logic (Mutlu et al., 2019, Lee et al., 2024).

7. Challenges, Limitations, and Research Trajectories

7.1 Hardware Constraints: Processing-using-memory adds minimal area but is limited to supported analog operations; digital PIM cores impose thermal and bandwidth constraints and present engineering trade-offs regarding instruction sets, register files, and local memory size (Mutlu et al., 2020, Mutlu et al., 2019). Row-hammer effects, ECC/integrity, and manufacturing yield require protective features and careful controller design.

7.2 Software Limitations: Lack of direct inter-PIM-core communication necessitates host mediation, which can severely limit performance on workloads requiring frequent data movement across PIM nodes. Providing a fast software stack and efficient collective operations is essential for general-purpose efficiency (Noh et al., 2024, Chen et al., 2023). Workloads with high arithmetic intensity, irregular memory access, or pointer chasing that crosses banks/chips generally do not benefit (Gómez-Luna et al., 2021).

7.3 Open Research Areas: Advancements are ongoing in programmable on-DIMM collectives, LLM-specialized scheduler/hardware pipelines (e.g., L3’s cross-rank coordination for KV-cache), dynamic thermal controls, and compiler/runtime systems for automated data placement and instruction scheduling (Liu et al., 2025, Alsop et al., 2023, Lee et al., 2024).

7.4 Standardization and Ecosystem: A fully mature ecosystem will require open APIs, OS/hardware extensions for PIM region identification, cache coherence protocols, and ongoing development of device-level standards (JEDEC, OCP) for PIM command extensions and module identification (Mutlu et al., 2019, Mutlu et al., 2020).

References

  • "PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems" (Lee et al., 2024)
  • "Enabling Practical Processing in and near Memory for Data-Intensive Computing" (Mutlu et al., 2019)
  • "Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures" (Alsop et al., 2023)
  • "A Modern Primer on Processing in Memory" (Mutlu et al., 2020)
  • "SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory" (Chen et al., 2023)
  • "PIM-tree: A Skew-resistant Index for Processing-in-Memory" (Kang et al., 2022)
  • "PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices" (Noh et al., 2024)
  • "Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware" (Gómez-Luna et al., 2021)
  • "L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference" (Liu et al., 24 Apr 2025)
  • "Accelerating Bulk Bit-Wise X(N)OR Operation in Processing-in-DRAM Platform" (Angizi et al., 2019)
