Hybrid DRAM-PIM & SRAM-PIM Integration

Updated 28 April 2026
  • Hybrid DRAM-PIM and SRAM-PIM integration is a novel paradigm combining DRAM's high capacity with SRAM's ultra-low latency to optimize memory-bound neural network operations.
  • It employs advanced chiplet-based and 3D hybrid bonding techniques to enable concurrent computation and efficient data transfers across heterogeneous memory layers.
  • Evaluations report up to 7.98× speedup in LLM prefill over a DRAM-PIM-only baseline (CompAir) and over 10× energy savings relative to an H100 GPU (Sangam).

Hybrid integration of DRAM-based Processing-In-Memory (DRAM-PIM) and SRAM-based Processing-In-Memory (SRAM-PIM) is an architectural approach to the growing compute and memory-bandwidth demands of large-scale deep learning workloads, notably LLM inference. By combining the high capacity and bandwidth density of DRAM-PIM with the low latency and local data reuse of SRAM-PIM, these hybrid systems improve throughput, energy efficiency, and operational flexibility for memory-bound neural network operators. The approach relies on chiplet-based heterogeneous integration, three-dimensional (3D) hybrid bonding, and in-network arithmetic processing, together with hierarchical instruction set architectures that expose programmability and scalability across diverse compute and dataflow patterns (Kiyawat et al., 15 Nov 2025, Li et al., 17 Sep 2025).

1. Architectural Foundations of Hybrid DRAM-PIM and SRAM-PIM

Hybrid DRAM-PIM and SRAM-PIM architectures are physically and functionally co-designed to work around the constraints of monolithic PIM approaches. DRAM-PIM supplies tens of megabytes per bank with high internal bandwidth and supports vector-parallel execution, making it suitable for bulk data movement and storage-centric tasks. However, DRAM process technology limits logic density and processing element (PE) sophistication, which restricts the practical integration of advanced computational arrays. SRAM-PIM, in contrast, offers sub-10 ns access latencies and supports highly parallel matrix operations via dedicated local buffers (typically 8–256 KB), but with far less storage than DRAM.

The CompAir architecture implements hybrid bonding, vertically aligning DRAM-PIM banks (bottom die) and logic dies (top) populated with four local SRAM-PIM macros per bank and a 2D NoC mesh. Fine-pitch hybrid bonds (256 per bank at 10–100 K bonds/mm²) achieve DRAM→SRAM transfer rates of up to 204.8 Gb/s per bank, supporting concurrent computation and data staging. Sangam, using a chiplet-based approach, decouples high-density DRAM memory modules from advanced logic tiles (containing SRAM scratchpads and PEs such as 8×8 FP16 systolic arrays), interfacing them on a 2.5D interposer and exposing full DRAM bank bandwidth to the accelerator fabric (Li et al., 17 Sep 2025, Kiyawat et al., 15 Nov 2025).
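The per-bank link figures quoted for CompAir can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the 204.8 Gb/s is shared evenly across the 256 bonds and that decimal prefixes are meant (1 Gb = 10^9 bits); neither assumption is stated in the source. The 32 KB buffer size is the per-bank SRAM capacity reported for CompAir.

```python
# Back-of-the-envelope check of the CompAir hybrid-bond link.
# Assumptions (not from the source): bandwidth is split evenly across bonds,
# and Gb/GB use decimal prefixes.

BONDS_PER_BANK = 256
LINK_GBIT_PER_S = 204.8            # DRAM->SRAM rate per bank, from the text
SRAM_BYTES_PER_BANK = 32 * 1024    # 4 x 8 KB macros per bank

per_bond_gbit = LINK_GBIT_PER_S / BONDS_PER_BANK
link_gbyte_per_s = LINK_GBIT_PER_S / 8
fill_time_us = SRAM_BYTES_PER_BANK / (link_gbyte_per_s * 1e9) * 1e6

print(f"{per_bond_gbit:.2f} Gb/s per bond")
print(f"{link_gbyte_per_s:.1f} GB/s per bank")
print(f"{fill_time_us:.2f} us to refill 32 KB of SRAM")
```

Under these assumptions each bond carries 0.8 Gb/s, the link delivers 25.6 GB/s per bank, and refilling the full 32 KB of bank-local SRAM takes about 1.28 µs.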

2. Component-Level Specification and Organization

Hybrid PIM systems organize DRAM and SRAM resources hierarchically:

  • DRAM-PIM: 32 MB per bank (CompAir), providing up to 32 GB/s per bank and enabling row-level SIMD for large-scale vectorized read/write operations. DRAM banks serve as primary storage for model weights, activations, and key-value caches.
  • SRAM-PIM: 32 KB per DRAM bank (CompAir, 4×8 KB macros), with matrix-multiply arrays capable of up to 4.096 TOPS per bank at 1 GHz. SRAM sits side-by-side with the DRAM sense amplifiers or on the adjacent logic chiplet/interposer (Sangam), providing rapid access to hot working sets and enabling low-latency tile-level MIMD execution.
  • Logic Layer: In Sangam, logic chiplets host both matrix-centric systolic arrays and programmable SIMD units tightly coupled with SRAM scratchpads, matched in timing to DRAM bank activation. CompAir logic dies also integrate a mesh-based NoC with embedded Curry ALUs for in-route transformation and reduction operations (Li et al., 17 Sep 2025, Kiyawat et al., 15 Nov 2025).
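The per-bank numbers above imply a balance point between DRAM bandwidth and SRAM-PIM compute. The following roofline-style sketch uses only the CompAir figures quoted in this section; treating MB and KB as binary units and treating the 32 GB/s DRAM figure as the only feed into the SRAM arrays are simplifying assumptions.

```python
# Roofline-style balance point for one CompAir bank (figures from the text).
# Assumptions: binary MB/KB for capacities, decimal GB/s for bandwidth,
# and DRAM bandwidth as the sole source of operands for SRAM-PIM.

PEAK_OPS_PER_S = 4.096e12     # SRAM-PIM peak per bank
CLOCK_HZ = 1e9                # 1 GHz
DRAM_BYTES_PER_S = 32e9       # DRAM-PIM bandwidth per bank
DRAM_BYTES = 32 * 2**20       # 32 MB per bank
SRAM_BYTES = 32 * 2**10       # 32 KB per bank

ops_per_cycle = PEAK_OPS_PER_S / CLOCK_HZ
ridge_ops_per_byte = PEAK_OPS_PER_S / DRAM_BYTES_PER_S
capacity_ratio = DRAM_BYTES // SRAM_BYTES

print(f"{ops_per_cycle:.0f} ops per cycle per bank")
print(f"{ridge_ops_per_byte:.0f} ops per DRAM byte to saturate SRAM-PIM")
print(f"DRAM holds {capacity_ratio}x the SRAM capacity of the same bank")
```

Read this way, each byte staged from DRAM would need to be reused for roughly 128 operations before the SRAM-PIM arrays become compute-bound, which is why low-intensity operators are a poor fit for the SRAM tier.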

3. Dataflow and Kernel Mapping in Hybrid Fabrics

Multi-granularity data pathways underpin the flexibility of hybrid DRAM/SRAM-PIM architectures. Three dataflows predominate:

  • Row-Level SIMD (DRAM-PIM): Used for GEMV and flat GEMM operators with low operational intensity, with row-buffer data fed directly into logic-layer SIMD or systolic arrays.
  • Tile-Level MIMD (SRAM-PIM): Hot kernel weights and activations are staged from DRAM into SRAM, supporting matrix-multiply acceleration with systolic or broadcast/stationary dataflow (e.g., input/weight stationary), minimizing expensive repeated DRAM accesses.
  • Flit-Level NoC Compute: CompAir introduces an in-network Curry ALU in each router, enabling scalar non-linear functions and collective operations (e.g., softmax exponentiation/reduction, tree-wise adder/broadcast routines) to be fused with data movement, thereby amortizing arithmetic and network costs.
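The split between the first two dataflows follows from operational intensity. The sketch below is an illustration only: `gemm_intensity`, `pick_dataflow`, and the threshold of 8 ops/byte are hypothetical and do not come from either paper, and flit-level NoC compute for non-linear operators is not modeled.

```python
# Illustrative dispatch between row-level SIMD and tile-level MIMD.
# The function names and the threshold are hypothetical, chosen for this sketch.

def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Ops per byte for an (m x k) @ (k x n) product, each operand touched once."""
    ops = 2 * m * k * n                                   # multiply + add
    traffic = bytes_per_elem * (m * k + k * n + m * n)    # inputs + output
    return ops / traffic

def pick_dataflow(m: int, k: int, n: int, threshold: float = 8.0) -> str:
    """Route low-intensity operators to DRAM-PIM and high-intensity ones to SRAM-PIM."""
    if gemm_intensity(m, k, n) >= threshold:
        return "tile-level MIMD (SRAM-PIM)"
    return "row-level SIMD (DRAM-PIM)"

# Decode-style GEMV: a single token against a 4096 x 4096 weight matrix.
print(pick_dataflow(1, 4096, 4096))
# Prefill-style GEMM: 512 tokens against the same weights.
print(pick_dataflow(512, 4096, 4096))
```

With FP16 operands the GEMV case lands just under 1 op/byte, while the 512-token GEMM reaches about 410 ops/byte, so the two cases fall on opposite sides of any reasonable threshold.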

Kernel scheduling is hierarchical: ranks/tiles partition dynamic (KV cache) and static (projection weight) memory, model layers or attention heads are sharded across chiplets/banks, and bank-local tiles are mapped to corresponding SRAM arrays for high-parallel processing with local reduction and gather operations (Kiyawat et al., 15 Nov 2025, Li et al., 17 Sep 2025).

4. Programming Models and Instruction Set Architectures

The distinct compute/data staging needs of DRAM-PIM and SRAM-PIM require multi-layer ISA support:

  • Row-Level ISA: Abstracts DRAM-PIM vector operations and coarse-grained collectives over DRAM banks or across chiplet boundaries. Typical instructions include NoC_Scalar, NoC_Reduce, NoC_BCast, SRAM_Write, and SRAM_Compute, enabling vectorized and staged execution.
  • Packet-Level ISA: Defines flit commands for specific compute/dataflow primitives on the NoC, encoding path, writeback, iteration, and opcode metadata. Curry ALU semantics (Out = ArgReg ∘ FlitData) support operator fusion and programmable in-transit processing.
  • Hierarchical Translation: Row-level collectives are compiled into bank- or array-local packet programs, enabling hardware-level parallel execution while presenting a unified programming interface (Li et al., 17 Sep 2025).
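The Curry ALU semantics, Out = ArgReg ∘ FlitData, can be mimicked in software to show how a collective such as softmax decomposes into in-transit steps. This is a toy model rather than the CompAir hardware or its ISA: the `CurryALU` class, the `accumulate` flag, and the choice of a fused subtract-and-exponentiate opcode are all assumptions made for illustration.

```python
import math
import operator
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CurryALU:
    """Toy router ALU: holds one operand and applies a binary op to each passing flit."""
    op: Callable[[float, float], float]
    arg_reg: float
    accumulate: bool = False   # if True, the result is written back (reduction)

    def process(self, flit_data: float) -> float:
        out = self.op(self.arg_reg, flit_data)   # Out = ArgReg ∘ FlitData
        if self.accumulate:
            self.arg_reg = out
        return out

def noc_softmax(scores: List[float]) -> List[float]:
    """Softmax expressed as four passes of flits through Curry-style ALUs."""
    # Pass 1: running-max reduction.
    max_alu = CurryALU(op=max, arg_reg=float("-inf"), accumulate=True)
    for x in scores:
        max_alu.process(x)
    # Pass 2: exponentiate relative to the max (hypothetical fused opcode).
    exp_alu = CurryALU(op=lambda m, x: math.exp(x - m), arg_reg=max_alu.arg_reg)
    exps = [exp_alu.process(x) for x in scores]
    # Pass 3: sum reduction.
    sum_alu = CurryALU(op=operator.add, arg_reg=0.0, accumulate=True)
    for e in exps:
        sum_alu.process(e)
    # Pass 4: scale by the broadcast reciprocal of the sum.
    scale_alu = CurryALU(op=operator.mul, arg_reg=1.0 / sum_alu.arg_reg)
    return [scale_alu.process(e) for e in exps]

probs = noc_softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])
```

Each pass corresponds to one row-level collective (a reduce, a scalar map, or a broadcast) that a compiler could lower into per-router packet programs, which is the hierarchical translation idea described above.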

Sangam’s approach is similar: it leverages hierarchical scheduling from rank down to array granularity, orchestrating both DRAM-centric and SRAM-accelerated operators using centralized host-side DMA management, programmable logic, and offloading of memory-bound GEMV/GEMM to in-memory systolic arrays (Kiyawat et al., 15 Nov 2025).

5. Performance and Energy Characteristics

Hybrid DRAM-PIM/SRAM-PIM designs deliver significant improvements in LLM inference relative to monolithic GPU or PIM-only baselines:

System                     Throughput   Latency   Energy/token
DRAM-PIM only (CENT)       1.00×        1.00×     1.00×
SRAM-PIM only (stacked)    1.20×        0.83×     1.30×
XPU+HBM-PIM (AttAcc)       1.05×        0.95×     3.50×
CompAir (hybrid)           2.10×        0.48×     0.29×

CompAir achieves 1.83–7.98× speedup in LLM prefill and 1.95–6.28× in decoding relative to DRAM-PIM only, and 3.52× lower energy than an A100+HBM-PIM system at similar throughput (GPT-3-175B, decode, 4 K context). Sangam demonstrates 3.93× lower end-to-end latency, 10.3× higher decode throughput, and >10× energy savings compared to an H100 GPU (LLaMA 2-7B, B=8, L_in=128, L_out=256), with energy per query dropping from ≈25 J (H100) to 2.2 J (Sangam) (Kiyawat et al., 15 Nov 2025, Li et al., 17 Sep 2025).
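The Sangam energy figures can be cross-checked against the ">10×" claim with simple arithmetic. The per-token value below additionally assumes that one query corresponds to a single sequence of L_out = 256 generated tokens and that prefill energy is folded into the per-query total; the source does not spell out either point.

```python
# Cross-check of the reported Sangam vs. H100 energy figures.
# Assumption: one "query" produces L_out = 256 tokens, prefill energy included.

H100_J_PER_QUERY = 25.0      # approximate value from the text
SANGAM_J_PER_QUERY = 2.2
OUTPUT_TOKENS = 256          # L_out

energy_ratio = H100_J_PER_QUERY / SANGAM_J_PER_QUERY
sangam_mj_per_token = SANGAM_J_PER_QUERY / OUTPUT_TOKENS * 1e3

print(f"H100 / Sangam energy per query: {energy_ratio:.1f}x")
print(f"Sangam energy per generated token: {sangam_mj_per_token:.2f} mJ")
```

The ratio comes out at roughly 11.4×, consistent with the stated saving of more than 10×.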

Internal bandwidth utilization in CompAir (η≈0.8) is maximized by pipelining DRAM→SRAM transfers with SRAM-PIM computation. Sangam’s architecture maintains >80% systolic array utilization, transforming memory-bound GEMV/GEMM into high-throughput, data-stationary operations.
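A simple double-buffering model shows how overlapping DRAM→SRAM transfers with SRAM-PIM computation raises link utilization. The model is a sketch under stated assumptions: the tile times are hypothetical, picked so that the steady-state utilization lands near 0.8, and the source does not say that η is defined this way.

```python
# Double-buffered pipeline model: the load of tile i+1 overlaps the compute of tile i.
# The tile times used below are hypothetical values chosen for illustration.

def serial_time(n_tiles: int, t_load: float, t_compute: float) -> float:
    """No overlap: every tile is loaded, then computed."""
    return n_tiles * (t_load + t_compute)

def pipelined_time(n_tiles: int, t_load: float, t_compute: float) -> float:
    """Two SRAM buffers: only the first load and the last compute are exposed."""
    if n_tiles == 0:
        return 0.0
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute

def link_utilization(n_tiles: int, t_load: float, t_compute: float) -> float:
    """Fraction of the pipelined runtime during which the DRAM->SRAM link is busy."""
    return n_tiles * t_load / pipelined_time(n_tiles, t_load, t_compute)

N_TILES = 1000
T_LOAD_US = 1.28      # hypothetical: one 32 KB tile at 25.6 GB/s
T_COMPUTE_US = 1.6    # hypothetical compute time per tile

print(f"serial:    {serial_time(N_TILES, T_LOAD_US, T_COMPUTE_US):.2f} us")
print(f"pipelined: {pipelined_time(N_TILES, T_LOAD_US, T_COMPUTE_US):.2f} us")
print(f"link utilization: {link_utilization(N_TILES, T_LOAD_US, T_COMPUTE_US):.3f}")
```

In this model the steady-state utilization approaches t_load / max(t_load, t_compute), so it reaches 1.0 only when the workload is transfer-bound.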

6. Principal Design Trade-Offs and Insights

Capacity, bandwidth, and energy trade-offs determine the optimal deployment of hybrid DRAM/SRAM-PIM:

  • Capacity vs. Latency: DRAM-PIM provides high capacity (32 MB/bank), but at higher access latency (~25 ns + T_comm), which suits bulk storage and streaming. SRAM-PIM gives rapid, repeated access to a 32 KB tile per bank (≈10 ns access), efficiently staging kernel- or layer-hot working sets. However, pure SRAM-PIM is impractical at LLM scale because of its limited capacity: full models require DRAM-resident storage.
  • Hybrid Bonding Granularity: Mapping four SRAM macros per DRAM bank optimizes area matching, minimizes wire/parasitic costs, and caps additional per-bank power overhead to ≈0.02 W.
  • Programming and Parallelism: The hierarchical ISA model reconciles the coarse-grain parallelism of DRAM-PIM with the fine-grain MIMD pattern of SRAM-PIM and in-network compute, hiding hardware heterogeneity from the programmer.
  • Energy Saving Mechanisms: Significant external DRAM traffic reduction is achieved by staging all PIM logic in logic dies/SRAM, eliminating large GPU HBM transactions, and offloading arithmetic (including non-linears) into memory-proximate logic. In Sangam, this yields >10× energy reduction over GPUs (Kiyawat et al., 15 Nov 2025), while CompAir achieves a 3.52× reduction over an A100+HBM-PIM system (Li et al., 17 Sep 2025).
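The capacity argument can be made concrete by counting how many CompAir-sized banks a large model would occupy. The sketch assumes a nominal 175 billion parameters stored in FP16 for GPT-3-175B, binary MB/KB units, and weights only; KV cache and activations are ignored, so the real footprint would be larger.

```python
import math

# Capacity check: banks needed to hold model weights, and the SRAM that comes with them.
# Assumptions: nominal 175e9 parameters, FP16 storage, binary MB/KB, weights only.

PARAMS = 175e9
BYTES_PER_PARAM = 2
DRAM_BANK_BYTES = 32 * 2**20    # 32 MB per bank
SRAM_BANK_BYTES = 32 * 2**10    # 32 KB per bank

weight_bytes = PARAMS * BYTES_PER_PARAM
banks_needed = math.ceil(weight_bytes / DRAM_BANK_BYTES)
sram_total_mib = banks_needed * SRAM_BANK_BYTES / 2**20
sram_fraction = SRAM_BANK_BYTES / DRAM_BANK_BYTES

print(f"DRAM banks needed for weights: {banks_needed}")
print(f"aggregate SRAM across those banks: {sram_total_mib:.1f} MiB")
print(f"share of weights resident in SRAM at once: {sram_fraction:.4%}")
```

Even across more than ten thousand banks, the aggregate SRAM holds only about 0.1% of the weights at a time, which illustrates why the SRAM tier serves as a staging buffer rather than as primary storage.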

7. Future Directions

Both CompAir and Sangam point to several avenues for further enhancement:

  • Incorporation of NVM-PIM: Non-volatile memory PIM for expanded model capacity.
  • NoC Path Optimization: Automated placement and routing of on-chip networks to maximize utilization for collectives and non-linear operators.
  • Dynamic Resource Management: Layer- and workload-aware dynamic resizing of SRAM caches, and fine-grained execution allocation between DRAM and SRAM PIM subsystems.
  • Approximative/Custom ALUs: Operator-approximate or lightweight Curry ALU variants to lower arithmetic energy and further fuse compute into interconnect paths.

Continued research into hybrid integration, dataflow-matched scheduling, and hardware/software co-design is expected to further elevate the performance, programmability, and efficiency of memory-centric inference accelerators for large-scale neural networks (Kiyawat et al., 15 Nov 2025, Li et al., 17 Sep 2025).
