Hybrid DRAM-PIM & SRAM-PIM Architectures
- Hybrid DRAM-PIM and SRAM-PIM architectures are advanced memory systems that integrate high-density DRAM with low-latency SRAM to address data movement bottlenecks in intensive workloads.
- They employ specialized techniques for address translation, dynamic data migration, and optimized coherence protocols to balance storage capacity with rapid access.
- Demonstrated performance gains of 3×–8× speedups and significant energy savings validate the architectures’ effectiveness in applications like neural networks and graph analytics.
Hybrid DRAM-PIM and SRAM-PIM architectures represent an evolution in memory-centric computing, combining the storage density, bandwidth, and scalability advantages of DRAM-based processing-in-memory (PIM) with the low-latency, high-speed characteristics of SRAM-PIM subsystems. These hybrid systems aim to mitigate the traditional bottlenecks of data movement, optimize energy efficiency, and enable fine-grained acceleration across a wide range of data-intensive workloads, including graph analytics, neural networks, and pointer chasing. The following sections synthesize core concepts, cross-cutting system mechanisms, comparative properties, algorithm-architecture co-designs, and research directions central to hybrid DRAM-PIM and SRAM-PIM design and adoption.
1. Architectural Principles and Integration Strategies
Hybrid DRAM-PIM/SRAM-PIM architectures consist of tightly coupled or coexisting memory modules, typically incorporating:
- High-density DRAM regions, equipped with either in-array compute primitives (processing-using-memory, PUM) or logic-layer accelerators (processing-near-memory, PNM) in 2.5D/3D-stacked memory structures (e.g., HMC, HBM, LPDDR-PIM).
- SRAM-based PIM regions, which may function as fast on-chip caches, scratchpads, or banked compute-enabled memory macros, often integrated on-die or on the logic layer, delivering sub-10 ns latency.
- System-level interconnects and controllers that manage heterogeneous data placement, coherence, and runtime scheduling across DRAM and SRAM subdomains.
Architectural integration leverages hybrid bonding (Li et al., 17 Sep 2025), explicit column and row decoder partitioning (Li et al., 17 Sep 2025), and near-data memory controllers (NMCs) (He et al., 10 Aug 2025) to balance high capacity, bandwidth, and low latency. For instance, CompAir couples each DRAM-PIM bank with multiple SRAM-PIM macros, using hybrid bonding to provision high-throughput data paths and decoupled column decoders to address bandwidth mismatches (Li et al., 17 Sep 2025). In mobile-tier systems, LP-Spec integrates LPDDR5 PIM with both DRAM and compute-enhanced ranks, managed by a near-data controller to orchestrate joint operation and rapid data reallocation (He et al., 10 Aug 2025). HH-PIM architectures on edge AI devices combine MRAM, SRAM, and heterogeneous PIM controllers to facilitate dynamic trade-offs between latency and energy efficiency (Jeon et al., 2 Apr 2025).
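To make these trade-offs concrete, the following minimal Python sketch models the two tiers and a toy placement rule; all parameter values (capacities, latencies, bandwidths, energies) are illustrative placeholders, not figures from the cited designs:

```python
from dataclasses import dataclass

@dataclass
class MemoryTier:
    """Coarse model of one PIM tier; values are illustrative, not measured."""
    name: str
    capacity_bytes: int
    latency_ns: float        # typical access latency seen by PIM logic
    bandwidth_gbps: float    # internal bandwidth available to PIM logic
    energy_pj_per_byte: float

# Illustrative tiers: dense DRAM-PIM banks vs. small, fast SRAM-PIM macros.
DRAM_PIM = MemoryTier("dram_pim", 8 << 30, 50.0, 1024.0, 4.0)
SRAM_PIM = MemoryTier("sram_pim", 16 << 20, 5.0, 2048.0, 0.5)

def place_buffer(size_bytes: int, accesses_per_byte: float) -> MemoryTier:
    """Toy placement rule: small, hot buffers go to SRAM-PIM;
    bulk or cold data stays in DRAM-PIM."""
    if size_bytes <= SRAM_PIM.capacity_bytes and accesses_per_byte > 8:
        return SRAM_PIM
    return DRAM_PIM
```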
2. System Mechanisms: Address Translation, Data Mapping, and Coherence
Hybrid architectures require robust support for:
- Address translation and page table management. Mechanisms such as IMPICA (Ghose et al., 2018) employ address–access decoupling, translating virtual addresses close to the memory arrays via region-based page tables that distinguish between DRAM and SRAM (or non-volatile) regions. Unified or hierarchical page tables are suggested as future research to further streamline hybrid memory access (Ghose et al., 2018).
- Dynamic data placement and migration. Runtime monitors track access patterns and dynamically migrate hot data (pointer-chasing lists, frequently updated NN weights) into SRAM-PIM for low-latency access, relegating cold or bulk data to DRAM-PIM for storage density (Ghose et al., 2018, Jeon et al., 2 Apr 2025, Wang et al., 2023). Optimization algorithms, such as the DP-based knapsack formulation in HH-PIM, allocate data into HP/LP-MRAM and HP/LP-SRAM to minimize energy under application latency constraints (Jeon et al., 2 Apr 2025).
- Cache coherence protocols. Schemes like LazyPIM (Ghose et al., 2018) and CoNDA (Ghose et al., 2019) use speculative recording of PIM access sets (via Bloom filters or compressed signatures) and batch validation against CPU coherence directories. These approaches adjust the granularity (e.g., signature size, commit interval) and may be tuned per memory type, favoring more aggressive speculation in SRAM domains (Ghose et al., 2018).
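As an illustration of the signature-based coherence idea, the sketch below records a PIM kernel's speculative read set in a Bloom filter and batch-validates it against concurrent CPU writes, in the spirit of LazyPIM/CoNDA; the filter size, hash construction, and commit interface are assumptions for illustration, not the papers' exact mechanisms:

```python
import hashlib

class BloomSignature:
    """Compressed access-set signature (Bloom filter); size and hash
    count are arbitrary choices for this sketch."""
    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.bitset = 0

    def _positions(self, addr: int):
        # Derive `hashes` pseudo-independent bit positions from the address.
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{addr}:{i}".encode(), digest_size=4).digest()
            yield int.from_bytes(h, "little") % self.bits

    def add(self, addr: int) -> None:
        for p in self._positions(addr):
            self.bitset |= 1 << p

    def may_contain(self, addr: int) -> bool:
        return all((self.bitset >> p) & 1 for p in self._positions(addr))

def commit_pim_kernel(read_sig: BloomSignature, cpu_writes) -> bool:
    """Batch validation at commit: if any concurrent CPU write may alias
    the kernel's speculative read set, report a conflict (caller rolls
    back and re-executes); otherwise the kernel's writes become visible."""
    return not any(read_sig.may_contain(a) for a in cpu_writes)
```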
3. Performance and Energy Efficiency Characteristics
The hybrid approach exploits the complementary strengths of DRAM and SRAM:
- Bandwidth and capacity. DRAM-PIM regions handle massive working sets (e.g., LLM weights, large datasets), leveraging high internal bandwidth via stacked interfaces and TSVs (Mutlu et al., 2019, Wang et al., 2023).
- Low latency and fine-grained acceleration. SRAM-PIM modules deliver single-digit-nanosecond access for latency-critical, data-dependent kernels (pointer chasing, reduction operations, hot neural network weights), albeit at reduced storage density and typically within an on-chip or logic-layer footprint (Ghose et al., 2018, Li et al., 17 Sep 2025).
- Energy savings. By adapting the location of data and compute to workload phase (e.g., moving high-activity layers to SRAM during inference surges, or shifting to MRAM/DRAM to save power during idle/low-load scenarios), designs such as HH-PIM achieve up to 60.43% energy savings over conventional PIMs (Jeon et al., 2 Apr 2025). Analytical models capture this behavior:
$$\text{Speedup}_{\text{hybrid}} = \frac{1}{(1-f) + f/S_{\text{eff}}},$$
where $f$ is the time fraction spent in latency-sensitive regions and $S_{\text{eff}}$ is an effective speedup incorporating both SRAM- and DRAM-PIM acceleration (Ghose et al., 2018).
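A direct transcription of this model, with a hypothetical worked example:

```python
def hybrid_speedup(f: float, s_eff: float) -> float:
    """Amdahl-style model from the equation above: f is the time fraction
    spent in latency-sensitive regions, s_eff the effective speedup of
    those regions under SRAM/DRAM-PIM acceleration."""
    return 1.0 / ((1.0 - f) + f / s_eff)

# Example: if 60% of execution time is latency-sensitive and that portion
# runs 10x faster, the overall speedup is 1 / (0.4 + 0.06) ~= 2.17x.
print(hybrid_speedup(0.6, 10.0))
```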
Hybrid designs consistently demonstrate multi-fold geometric-mean speedups (in the 3×–8× range noted above) and large energy-delay-product reductions over prior state-of-the-art designs in DNN, LLM, and analytics workloads (Wang et al., 2023, Wu et al., 2023, He et al., 10 Aug 2025, Shekar et al., 8 Apr 2025). For example, CompAir achieves substantial energy reduction versus an A100 GPU paired with HBM-PIM, and PIM-GPT delivers speedups of $41\times$ and above on GPT models (Li et al., 17 Sep 2025, Wu et al., 2023).
4. Algorithm–Architecture Co-Design and Mapping
Algorithm–architecture co-design is essential:
- Sparsity exploitation. SRAM-PIM frameworks such as DB-PIM employ hybrid-grained (block-wise and bit-level) sparsity management (Duan et al., 25 May 2025), combining fixed-threshold approximation and CSD-based encoding to maximize compute efficiency; a simplified software analogue of block-grained skipping is sketched after this list. These techniques can be extended for hybrid operation, skipping redundant computations in both SRAM and DRAM domains as regularity and structure allow.
- Hierarchical scheduling and dynamic workload mapping. Frameworks like NicePIM (Wang et al., 2023) and LP-Spec (He et al., 10 Aug 2025) use dynamic programming and ILP-based schedulers to partition DNN layers or tokens across memory domains, adjusting mappings in response to parallelism, DRAM/SRAM size constraints, and hardware utilization.
- In-network and in-transit computation. CompAir-NoC embeds arithmetic logic in network routers, performing non-linear and reduction operations as data is routed between memory banks, reducing off-chip and inter-bank transfer overhead (Li et al., 17 Sep 2025).
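The simplified software analogue promised above: a matrix-vector product that skips whole weight blocks below a magnitude threshold, loosely following DB-PIM's block-grained pruning (the block size and threshold are placeholders, and bit-level CSD skipping is not modeled):

```python
import numpy as np

def sparse_block_matvec(w: np.ndarray, x: np.ndarray,
                        block: int = 8, threshold: float = 0.0) -> np.ndarray:
    """Matrix-vector product that elides weight blocks whose maximum
    magnitude is <= `threshold`: a software stand-in for block-grained
    sparsity skipping in a PIM macro."""
    rows, cols = w.shape
    y = np.zeros(rows)
    for j in range(0, cols, block):
        blk = w[:, j:j + block]
        if np.abs(blk).max() <= threshold:
            continue                 # whole block pruned: no MACs issued
        y += blk @ x[j:j + block]
    return y
```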
Resource allocation algorithms, such as those in HH-PIM (Jeon et al., 2 Apr 2025), use DP formulations to minimize energy for a given latency budget by choosing the per-weight storage location. The LUT generated at runtime enables rapid, workload-adaptive data migration.
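A minimal sketch of such a formulation, cast as a multi-choice knapsack DP over integer latency costs (the tier costs and budget below are illustrative, not HH-PIM's actual parameters):

```python
def min_energy_placement(items, latency_budget: int) -> float:
    """Each item (e.g., a weight tile) must be placed in exactly one tier;
    items[i] is a list of (latency_cost, energy_cost) options per tier.
    Returns the minimum total energy whose total latency fits the budget,
    or infinity if no feasible placement exists."""
    INF = float("inf")
    dp = [0.0] + [INF] * latency_budget   # dp[t] = min energy at total latency t
    for choices in items:
        nxt = [INF] * (latency_budget + 1)
        for t in range(latency_budget + 1):
            if dp[t] == INF:
                continue
            for lat, energy in choices:
                if t + lat <= latency_budget:
                    nxt[t + lat] = min(nxt[t + lat], dp[t] + energy)
        dp = nxt
    return min(dp)

# Two tiles, each placeable in fast/high-energy SRAM (latency 1, energy 5)
# or slow/low-energy MRAM (latency 4, energy 1), under a latency budget of 5:
# one tile per tier is optimal, giving total energy 6.0.
print(min_energy_placement([[(1, 5.0), (4, 1.0)], [(1, 5.0), (4, 1.0)]], 5))
```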
5. Programming Models and System Software
The adoption of hybrid DRAM-PIM/SRAM-PIM requires:
- Compiler and library support that extends intermediate representations to handle memory-type-aware intrinsics (e.g., PIM-enabled instructions mapped to PCUs or MPUs (Ghose et al., 2019, He et al., 10 Aug 2025)) and aligns code regions with PIM-friendly data layouts (Oliveira et al., 2022). Compiler passes may also transform high data-movement sections into majority/NOT logic for in-memory execution (Oliveira et al., 2022).
- API and directive design to facilitate explicit or automatic code offload. Approaches range from fine-grained (single PIM-enabled instruction, preserving standard programming models (Ghose et al., 2019)) to function- or block-level marked regions with toolchain analysis of memory/compute intensity (Wang et al., 2023).
- OS and runtime integration for managing PIM-aware memory mapping, consistency, and scheduling. Systems must track both virtual and physical memory mappings for DRAM and SRAM instances and possibly maintain per-memory-type page tables and migration policies (Ghose et al., 2018, Oliveira et al., 2022).
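As a toy illustration of runtime-managed migration, the policy below promotes pages whose per-epoch access count crosses a threshold into a fixed-size SRAM-PIM pool and demotes the coldest resident page when the pool fills; the threshold, pool size, and epoch interface are hypothetical, not taken from the cited systems:

```python
from collections import Counter

class MigrationPolicy:
    """Hot-page promotion into a bounded SRAM-PIM pool (toy model)."""
    def __init__(self, sram_pages: int = 64, hot_threshold: int = 32):
        self.sram_pages = sram_pages
        self.hot_threshold = hot_threshold
        self.counts = Counter()   # per-epoch access counts, all pages
        self.resident = set()     # pages currently mapped to SRAM-PIM

    def on_access(self, page: int) -> None:
        self.counts[page] += 1
        if page not in self.resident and self.counts[page] >= self.hot_threshold:
            self._promote(page)

    def _promote(self, page: int) -> None:
        if len(self.resident) >= self.sram_pages:
            victim = min(self.resident, key=self.counts.__getitem__)
            self.resident.discard(victim)   # demote coldest page to DRAM-PIM
        self.resident.add(page)

    def end_epoch(self) -> None:
        self.counts.clear()   # age out stale history between epochs
```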
State-of-the-art approaches are supported by profiling tools (e.g., DAMOV (Oliveira et al., 2022)), energy/cycle-accurate simulators, and benchmarking suites covering both DRAM- and SRAM-PIM-accelerated workloads.
6. Case Studies and Comparative Data
| Architecture | Core Features | Performance/Energy Gains |
|---|---|---|
| CompAir (Li et al., 17 Sep 2025) | DRAM-PIM + SRAM-PIM (hybrid bonding, in-NoC compute, hierarchical ISA) | $\geq 1.83\times$ prefill speedup and energy reduction vs. A100+HBM-PIM |
| HH-PIM (Jeon et al., 2 Apr 2025) | HP/LP MRAM-SRAM clusters, DP data placement | Up to 60.43% average energy savings while meeting latency requirements |
| PIM-GPT (Wu et al., 2023) | DRAM-PIM w/ bank-level MACs, on-chip SRAM/ASIC | $\geq 41\times$ speedup and $\geq 123\times$ energy efficiency vs. GPU |
| DB-PIM (Duan et al., 25 May 2025) | Value- and bit-level sparsity in digital SRAM-PIM | Significant speedup and energy savings via sparsity skipping |
| LP-Spec (He et al., 10 Aug 2025) | Hybrid LPDDR5 PIM, near-data controller, dynamic pruning | Speedup and energy-efficiency gains vs. NPU |
This table, drawn from the cited papers, encapsulates how diverse hybrid architectures synergize memory-technology advantages with advanced scheduling and sparsity mechanisms for substantial efficiency gains.
7. Open Challenges and Research Directions
Ongoing and future work targets:
- Unified address translation and page mapping spanning DRAM, SRAM, and potentially non-volatile memories (Ghose et al., 2018, Mutlu et al., 2020).
- Adaptive and differentiated coherence protocols tailored to hybrid domains (e.g., speculation in SRAM-PIM, batched commits in DRAM-PIM) (Ghose et al., 2018).
- Automated data migration and partitioning algorithms sensitive to real-time access patterns, exploiting runtime hints, and application phase changes (Jeon et al., 2 Apr 2025, Alsop et al., 2023).
- Energy–performance co-modeling. System designers employ formulas relating access time and per-access energy across tiers (e.g., $t_{\text{SRAM}}$ vs. $t_{\text{DRAM}}$ and $E_{\text{SRAM}}$ vs. $E_{\text{DRAM}}$), combined with weighted speedup factors (Ghose et al., 2018); a toy co-model is sketched after this list.
- Standardized toolchains, OS-level support, and cross-architecture programming models to lower adoption hurdles for application writers and system integrators (Ghose et al., 2019, Oliveira et al., 2022).
- Hardware prototyping, evaluation platforms, and open-source simulation environment development to accelerate research beyond DRAM-PIM (e.g., UPMEM-style) to SRAM-, MRAM-, and LPDDR-based designs and other future memory types (Gómez-Luna et al., 2021, Oliveira et al., 2022).
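The toy co-model referenced above, with placeholder per-byte latency and energy values (not measurements from the cited work):

```python
def hybrid_time_energy(bytes_sram: float, bytes_dram: float,
                       t_sram_ns: float = 5.0, t_dram_ns: float = 50.0,
                       e_sram_pj: float = 0.5, e_dram_pj: float = 4.0):
    """Weighted access-time / per-access-energy co-model: splits traffic
    between tiers and returns (time_ns, energy_pj, energy_delay_product).
    Serializing all accesses is a deliberate simplification."""
    time = bytes_sram * t_sram_ns + bytes_dram * t_dram_ns
    energy = bytes_sram * e_sram_pj + bytes_dram * e_dram_pj
    return time, energy, time * energy

# Shifting 25% of traffic from DRAM-PIM to SRAM-PIM shrinks every term:
print(hybrid_time_energy(0.0, 1000.0))    # all-DRAM baseline
print(hybrid_time_energy(250.0, 750.0))   # hybrid placement
```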
Hybrid DRAM-PIM and SRAM-PIM architectures are thus converging on a model of dynamic, energy-optimized, and latency-aware near-data compute. They blend complementary memory technologies with specialized logic, cross-layer data mapping and scheduling, and programmable interfaces to address the demands of data movement–bound, memory-centric computation in contemporary and emerging workloads.