Hybrid DRAM-PIM and SRAM-PIM Architectures

Updated 24 September 2025
  • Hybrid DRAM-PIM and SRAM-PIM are memory-centric systems that integrate processing within or near memory modules using approaches like 3D integration and in-situ operations.
  • They employ dynamic data placement, decoupled address-access schemes, and speculative coherence protocols to optimize performance and minimize energy consumption.
  • Empirical evaluations show these architectures achieve significant speedups and energy savings, addressing the memory wall challenges of conventional processor-centric systems.

Hybrid DRAM-PIM and SRAM-PIM architectures constitute a class of memory-centric computing systems that integrate processing capabilities within or near memory modules—principally DRAM and SRAM—leveraging either 3D integration, analog memory cell behaviors, or logic augmentation to reduce data movement, enhance throughput, and improve energy efficiency in data-intensive workloads. This paradigm is developed in response to the scaling bottlenecks and memory wall challenges endemic to conventional processor-centric architectures (Ghose et al., 2018), and is propelled by advances such as near-data logic layers, row-level in-situ operations, and system-level support for virtualization, data allocation, and coherence mechanisms.

1. Architectural Foundations and Taxonomy

Processing-in-memory (PIM) broadly spans two approaches: Processing-Near-Memory (PnM) and Processing-Using-Memory (PuM) (Oliveira et al., 2022). PnM places dedicated logic—such as accelerator cores or custom functional units—in the logic layer of 3D-stacked memory (DRAM or SRAM), enabling high internal bandwidth via through-silicon vias (TSVs). PuM directly exploits the intrinsic electrical properties of memory cells to implement logic primitives, as exemplified by triple-row activation in DRAM (the Ambit design) to enable bulk bitwise operations (AND, OR, NOT) without full logic circuits (Mutlu et al., 2019).
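
The processing-using-memory idea can be made concrete with a small functional model. The sketch below, a Python analogy rather than the Ambit circuit itself, shows how triple-row activation yields the bitwise majority of three rows, from which bulk AND and OR follow by fixing the third row to all zeros or all ones; the bit-list representation and function names are illustrative assumptions.

```python
# Functional model of Ambit-style triple-row activation (TRA).
# Each "row" is a list of bits; real DRAM computes this via analog
# charge sharing across three simultaneously activated rows.

def tra_majority(row_a, row_b, row_c):
    """Bitwise majority of three rows: MAJ(a, b, c) = ab + bc + ca."""
    return [(a & b) | (b & c) | (c & a) for a, b, c in zip(row_a, row_b, row_c)]

def bulk_and(row_a, row_b):
    # AND is majority with a control row of all zeros.
    return tra_majority(row_a, row_b, [0] * len(row_a))

def bulk_or(row_a, row_b):
    # OR is majority with a control row of all ones.
    return tra_majority(row_a, row_b, [1] * len(row_a))

if __name__ == "__main__":
    a = [1, 0, 1, 1, 0, 0, 1, 0]
    b = [1, 1, 0, 1, 0, 1, 0, 0]
    print(bulk_and(a, b))  # [1, 0, 0, 1, 0, 0, 0, 0]
    print(bulk_or(a, b))   # [1, 1, 1, 1, 0, 1, 1, 0]
```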

Hybrid DRAM-PIM systems typically utilize one or both methods:

  • DRAM-PIM (PnM): Incorporates general-purpose compute logic within the logic layer beneath stacked DRAM. For instance, IMPICA and LazyPIM architectures enable pointer-chasing accelerators and speculative cache coherence protocols, respectively (Ghose et al., 2018).
  • DRAM-PIM (PuM): Modifies DRAM cell arrays or command scheduling to implement in-place parallel bitwise or arithmetic operations, as in Ambit or RowClone (Mutlu et al., 2019, Olgun et al., 2022).
  • SRAM-PIM: Integrates processing clusters within on-chip SRAM, such as in hierarchical SRAM arrays with one-bit processing elements (CRAMs) or SRAM-based caches with logic extensions (Arora et al., 2023, Jeon et al., 2 Apr 2025).

Hybrid systems may, in addition, combine both DRAM-PIM and SRAM-PIM in a unified platform, co-designing data placement, interconnects, and task partitioning to exploit complementary performance and latency characteristics (Li et al., 17 Sep 2025, Jeon et al., 2 Apr 2025).

2. Mechanism Design and System Integration

A salient aspect of hybrid PIM systems is their mechanism design, which addresses both the hardware architecture and the integration with broader system stacks. Key mechanisms include:

  • Address–Access Decoupling (IMPICA): In the context of pointer chasing, IMPICA separates address generation from memory access, allowing concurrent resolution of multiple independent pointer chains. The resulting speedup is quantified by $\text{Speedup} = \frac{T_{\text{baseline}}}{T_{\text{IMPICA}}}$, reaching up to 92% on linked lists (Ghose et al., 2018).
  • Region-Based Page Tables: To retain virtual memory abstractions, DRAM-PIM modules use flattened, region-centric hierarchical page tables to map contiguous PIM data regions, minimizing page translation stalls (Ghose et al., 2018).
  • Speculative Coherence (LazyPIM): PIM cores aggregate read/write signatures via Bloom filters and verify coherence with the CPU in a batched fashion, reducing off-chip coherence traffic by 58.8% and enabling standard shared-memory programming semantics (Ghose et al., 2018); a minimal signature sketch follows this list.
  • Majority Logic & Bitwise Operations (Ambit, SIMDRAM): Bulk parallel operations are achieved via analog charge-sharing or optimized majority logic, which are abstracted by frameworks such as SIMDRAM for flexible in-DRAM computation (Mutlu et al., 2019, Oliveira et al., 2022).
  • Open-Source Evaluation Frameworks: PiDRAM provides a modular RISC-V/FPGA platform for integrating custom PIM techniques and memory controllers supporting in-DRAM command sequences (RowClone, D-RaNGe) (Olgun et al., 2022).
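
To make the speculative-coherence mechanism concrete, the following is a minimal sketch of signature-based conflict detection in the spirit of LazyPIM; the Bloom-filter sizes, hash construction, and commit/rollback policy are illustrative assumptions rather than the published protocol.

```python
import hashlib

class BloomSignature:
    """Compact read/write-set signature (illustrative parameters)."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.field |= 1 << pos

    def might_contain(self, addr):
        return all((self.field >> pos) & 1 for pos in self._positions(addr))

# The PIM kernel executes speculatively, recording accesses into signatures.
pim_reads, pim_writes = BloomSignature(), BloomSignature()
for addr in (0x1000, 0x1040, 0x2080):
    pim_reads.add(addr)
pim_writes.add(0x2080)

# At kernel completion, the CPU's dirty lines are checked against the PIM
# read set in one batched step: a hit forces rollback, a miss commits.
cpu_dirty_lines = [0x3000, 0x30C0]
conflict = any(pim_reads.might_contain(a) for a in cpu_dirty_lines)
print("rollback required" if conflict else "commit PIM results")
```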

In SRAM-PIM designs, hierarchical control (H-tree local networks, spatially aware shuffle logic, a packet-switched inter-tile mesh) coexists with precision-adaptive, bit-serial compute units, offering low-latency and scalable performance (Arora et al., 2023).
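
As an illustration of the bit-serial computation style used by such processing elements, the sketch below adds two vectors one bit position per step, the way a column of one-bit PEs would operate on transposed data; the word width and data layout are assumptions made for illustration.

```python
# Bit-serial vector addition: operands are conceptually stored "transposed"
# (bit i of every element in the same row), and all lanes process one bit
# position per cycle.  A Python int per element stands in for a lane here.

def bit_serial_add(xs, ys, width=8):
    result = [0] * len(xs)
    carry = [0] * len(xs)
    for i in range(width):                  # one bit position (row) per cycle
        for lane in range(len(xs)):         # all lanes operate in lockstep
            a = (xs[lane] >> i) & 1
            b = (ys[lane] >> i) & 1
            s = a ^ b ^ carry[lane]
            carry[lane] = (a & b) | (a & carry[lane]) | (b & carry[lane])
            result[lane] |= s << i
    return result

print(bit_serial_add([3, 100, 250], [4, 27, 5]))  # [7, 127, 255]
```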

3. Data Placement, Mapping, and Optimization

Hybrid architectures are characterized by dynamic and context-aware data allocation strategies—critical for balancing energy efficiency, latency, and parallelism:

  • Dynamic Data Placement Optimization (HH-PIM): Instance-specific placement of data across the MRAM and SRAM modules of the HP-PIM and LP-PIM clusters is formulated as a knapsack problem and solved via dynamic programming:

$$\min E_{\text{task}} = \sum_{i=1}^{n} e_i x_i \quad \text{s.t.} \quad \sum_{i=1}^{n} t_i x_i \leq t_{\text{constraint}}, \quad \sum_{i=1}^{n} x_i = k, \quad x_i \in \mathbb{Z}_{+}$$

with a per-cluster two-stage DP and a lookup table (LUT) for rapid contextual adjustment (Jeon et al., 2 Apr 2025); a simplified placement DP is sketched after this list.

  • Memory Mapping and Scheduling (PIM-MMU): Heterogeneity-aware mapping units (HetMap) manage the distinct address spaces of DRAM and PIM, using MLP-centric allocation for DRAM and locality-centric allocation for PIM. The data copy engine and memory scheduler offload transfers and maximize parallelism on real systems, improving data-transfer throughput and energy efficiency by up to 4.1× (Lee et al., 10 Sep 2024).
  • Software-Level Dataflow Co-Design: Coherent mapping and partitioning of model parameters across DRAM and PIM banks is realized in LLM accelerators (LP-Spec, PIM-GPT, CompAir), where parallel GEMM/GEMV operations are optimized for bank-level reuse and speculative execution (Wu et al., 2023, He et al., 10 Aug 2025, Li et al., 17 Sep 2025).
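
The placement formulation above can be illustrated with a small dynamic-programming sketch. It simplifies the problem to a binary choice per data block between a high-performance (HP) and a low-power (LP) module under a quantized latency budget; the cost numbers are invented and the single-dimension DP is an assumption for illustration, not HH-PIM's exact two-stage DP/LUT scheme.

```python
import math

# Toy data-placement DP: for each block, choose the HP or LP PIM module so
# that total energy is minimized under a latency budget (integer time units).
blocks = [  # (hp_latency, hp_energy, lp_latency, lp_energy), invented figures
    (2, 9, 5, 3),
    (1, 6, 4, 2),
    (3, 12, 7, 4),
    (2, 8, 6, 3),
]
latency_budget = 14

def place(blocks, budget):
    # dp[t] = minimum energy over the blocks processed so far with total
    # latency exactly t; infeasible entries stay at +inf.
    dp = [math.inf] * (budget + 1)
    dp[0] = 0.0
    for hp_t, hp_e, lp_t, lp_e in blocks:
        nxt = [math.inf] * (budget + 1)
        for t, e in enumerate(dp):
            if e == math.inf:
                continue
            for dt, de in ((hp_t, hp_e), (lp_t, lp_e)):
                if t + dt <= budget:
                    nxt[t + dt] = min(nxt[t + dt], e + de)
        dp = nxt
    return min(dp)  # minimum energy that still meets the latency budget

print(place(blocks, latency_budget))  # 25 with the figures above
```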

4. Communication, Coherence, and Scaling

Efficient communication is fundamental for hybrid architectures, ensuring low-overhead data exchange and synchronization:

  • Hierarchical Spatial Networks (PIMSAB): CRAM tiles are interconnected via static H-trees for collective compute and broadcast and via a dynamic packet-switched mesh NoC for flexible, scalable inter-tile communication. Shuffle logic accelerates layout-dependent data movement (Arora et al., 2023).
  • Speculative/Batched Coherence Protocols: LazyPIM compacts coherence information and keeps rollbacks rare (fewer than one per kernel), achieving up to 49.1% performance improvement in shared-memory PIM scenarios (Ghose et al., 2018).
  • In-Transit NoC Computation (CompAir-NoC): Embedded Curry ALUs in NoC routers perform non-linear operations “in-network” by currying multi-operand functions, thereby reducing communication and area overhead (Li et al., 17 Sep 2025); the sketch after this list illustrates the currying idea.
  • Distributed Scheduler and Copy Engines (PIM-MMU, PiDRAM): Hardware scheduling prioritizes memory-level parallelism and minimizes CPU involvement, demonstrating scalable transfer and initialization with speedups up to 14.6× (Olgun et al., 2022, Lee et al., 10 Sep 2024).
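
The “currying” notion behind in-transit computation can be illustrated with ordinary functional currying: a multi-operand non-linear function is applied in stages, each stage binding the operand available at the current hop, so only a partially applied result travels onward. The sketch below is a software analogy, not the CompAir-NoC router hardware, and the scaled-softplus function is an arbitrary example.

```python
import math

# Software analogy of curried, in-transit evaluation: each hop binds one
# operand and forwards a partially applied function, so the destination
# receives a result rather than all raw operands.

def curried_scaled_softplus(scale):
    """First hop: bind the scale factor."""
    def with_bias(bias):
        """Second hop: bind the bias."""
        def apply(x):
            """Destination: evaluate the non-linearity."""
            return scale * math.log1p(math.exp(x + bias))
        return apply
    return with_bias

partial_at_hop1 = curried_scaled_softplus(0.5)   # scale joins at hop 1
partial_at_hop2 = partial_at_hop1(-1.0)          # bias joins at hop 2
print(partial_at_hop2(2.0))                      # ~0.657 at the destination
```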

5. Performance, Energy, and Scalability

Hybrid PIM approaches yield substantial performance and energy benefits, established through architectural modeling and empirical analysis (a worked energy-delay-product example follows this list):

  • DRAM-PIM Accelerator (PIM-DRAM): Embedded MAC computation and intra-bank adder trees provide up to 19.5× speedup over Titan Xp GPUs in DNN inference at <1% area overhead, demonstrating the effectiveness of bank-parallel dataflow (Roy et al., 2021).
  • LPDDR5-PIM Integration (LP-Spec): GEMM-enhanced LPDDR5-PIM with bank-level SIMD MPUs and dynamic token pruning achieves up to 13.21× performance and 7.56× energy savings on mobile LLM inference relative to NPUs, with 99.87× gains in EDP versus GEMV-PIMs (He et al., 10 Aug 2025).
  • Hybrid Bonding (CompAir): Vertically stacking DRAM-PIM and SRAM-PIM dies via hybrid-bonding interconnects at densities of 10–100K bonds/mm² (0.05–0.88 pJ/bit transfer energy) delivers prefill/decode speedups of up to 7.98×/6.28× and reduces energy by 3.52× compared to GPU+HBM-PIM (Li et al., 17 Sep 2025).
  • Heterogeneous MRAM-SRAM PIM (HH-PIM): Dynamic allocation among HP/LP MRAM-SRAM modules yields up to 60.43% average energy savings, with measured reductions of 43.17–86.23% across TinyML benchmarks on FPGA platforms (Jeon et al., 2 Apr 2025).
  • Bit-Serial SRAM-PIM (PIMSAB): Hierarchical CRAM arrays demonstrate 3.0× speedup and 4.2× energy reduction versus A100 Tensor Core GPU, outperforming both Duality Cache and SIMDRAM PIM by 3.7–3.88× (Arora et al., 2023).
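
To show how the speedup and energy figures above combine into energy-delay-product (EDP) gains, the short example below multiplies a speedup factor by an energy-reduction factor; the numbers are placeholders rather than values from the cited papers.

```python
# Energy-delay product: EDP = energy * execution_time.  The EDP gain of a
# design over a baseline is the product of its speedup and its
# energy-reduction factor.  Placeholder figures only.

baseline_time, baseline_energy = 1.0, 1.0   # normalized baseline
speedup, energy_reduction = 4.0, 3.0        # hypothetical improvement factors

design_time = baseline_time / speedup
design_energy = baseline_energy / energy_reduction

edp_gain = (baseline_time * baseline_energy) / (design_time * design_energy)
print(f"EDP improvement: {edp_gain:.2f}x")  # 12.00x = 4.0 * 3.0
```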

6. Methodological, Tool, and System Support

Adoption of hybrid DRAM-PIM and SRAM-PIM architectures is facilitated by:

  • Frameworks: Open-source modeling/evaluation platforms (PiDRAM) and workloads/benchmarks (DAMOV) support rapid prototyping and characterization of real DRAM-PIM behaviors and data movement bottlenecks (Olgun et al., 2022, Oliveira et al., 2022).
  • Compiler/ISA Extensions: Frameworks such as SIMDRAM generalize majority-logic mapping for in-DRAM compute, with compiler–ISA support for PIM offloading (Oliveira et al., 2022). CompAir’s hierarchical ISA decouples SIMD (row-level for DRAM-PIM) and MIMD (packet-level for SRAM-PIM and NoC) semantics (Li et al., 17 Sep 2025).
  • Operating System and Virtual Memory Support: Memory allocators and OS extensions ensure compatibility with page management, data mapping, and vertical/transposed layouts, crucial for seamless PIM integration; region-based page tables preserve virtual memory abstractions (Ghose et al., 2018, Oliveira et al., 2022).

7. Challenges, Limitations, and Future Prospects

Hybrid PIM architectures encounter specific challenges:

  • Design Complexity and Reliability: Analog in-memory computation is sensitive to process variations, requiring careful verification and error resilience (Mutlu et al., 2019).
  • Area and Integration Trade-offs: SRAM-PIM offers low latency at high area cost, whereas DRAM-PIM scales better at capacity; hybrid bonding must balance density, yield, and interconnect power (Li et al., 17 Sep 2025).
  • Coherence and Consistency: Ensuring efficient and correct memory semantics in shared, hybrid contexts remains nontrivial, typically resolved by speculative, batched, or software-assisted protocols (Ghose et al., 2018, Olgun et al., 2022).
  • Dynamic Workload and Mapping: Adaptive allocation algorithms (such as those in HH-PIM) are necessary to meet real-time constraints in changing environments (Jeon et al., 2 Apr 2025).

A plausible implication is that future PIM systems will increasingly rely on heterogeneity—combining multiple memory technologies (MRAM, DRAM, SRAM, ReRAM) and adaptive allocation/scheduling strategies—to balance energy, performance, and scalability across inference, database, and graph analytics workloads. System software, compiler, and toolchain support will be pivotal in driving deployment, flexibility, and application portability in both cloud and edge domains (Oliveira et al., 2022, Li et al., 17 Sep 2025).
