
3D-Stacked DRAM with Logic Layer

Updated 15 April 2026
  • 3D-stacked DRAM with a logic layer is a vertically integrated memory system that combines multiple DRAM cell arrays with a compute-rich logic die to enable near-data processing.
  • This architecture leverages fine-grained vertical interconnects to deliver orders of magnitude higher bandwidth and supports advanced workloads such as dynamic programming and neural network inference.
  • Key innovations include tiered-latency management, hybrid bonding integration, and comprehensive software–hardware co-design that optimize performance and energy efficiency.

Three-dimensional (3D) stacked DRAM with a logic layer, also referred to as monolithic or hybrid-bonded 3D DRAM with Processing-in-Memory (PIM) or Near-Memory Processing (NMP), defines a class of memory systems in which multiple DRAM cell arrays are vertically integrated atop a specialized logic die. This architecture exploits fine-grained vertical interconnects to deliver orders of magnitude higher internal bandwidth, supports near-data compute logic, and exposes architectural opportunities for latency-aware data placement, system-level parallelism, and energy efficiency unattainable with planar or TSV-based stacked memory. Recent research demonstrates that such architectures fundamentally reshape system bottlenecks, unleash new software–hardware co-design opportunities, and enable new classes of accelerators for data-intensive workloads (Lu et al., 27 Feb 2026, Pan et al., 6 Oct 2025, Ghiasi et al., 2022, Lee et al., 12 Mar 2026).

1. Physical Organization and Integration Technologies

3D-stacked DRAM architectures consist of a stack of DRAM layers fabricated above a logic layer, connected by high-density vertical interconnects. Two leading integration methods dominate the landscape:

  • Monolithic 3D stacking constructs DRAM and logic layers sequentially on a single wafer, enabling sub-micron vertical I/O pitch. For example, GenDRAM employs a 1024-layer monolithic 3D DRAM ("M3D DRAM") stack coupled to a logic die via ~1 μm pitch Cu–Cu hybrid bonds, supporting ≈10× the vertical interconnect density of TSV-based designs as in HBM (Lu et al., 27 Feb 2026, Pan et al., 6 Oct 2025).
  • Hybrid bonding joins separately fabricated logic and DRAM dies face-to-face, also supporting ~1 μm interconnect pitch, limited primarily by BEOL alignment and routing constraints (Lee et al., 12 Mar 2026).

Table: Comparison of Monolithic and TSV-based 3D DRAM Integration

| Integration | Vertical Interconnect | Pitch | Typical Bandwidth (per stack) |
|---|---|---|---|
| Monolithic | Cu–Cu hybrid bond | ~1 μm | ~20–34 TB/s internal (Pan et al., 6 Oct 2025) |
| TSV-based | Through-Silicon Via | 5–15 μm | ≤1 TB/s (e.g., HBM3: 0.8 TB/s) |

In both cases, the logic die hosts near-memory processing units, memory controllers, local coherency and scheduling engines, and interfaces to external hosts. A typical monolithic stack reported in (Lu et al., 27 Feb 2026) comprises:

  • 1024-stacked DRAM layers partitioned into 8 latency "tiers"
  • 16 DRAM channels per chip × 16 banks/channel = 256 banks, grouped into 32 bank-groups (each mapped to a logic-layer Processing Unit)
  • 32 logic-die PUs organized as 8 Search PUs (memory-bound) + 24 Compute PUs (compute-bound), each with dense, wide I/O aligned to the DRAM bank-groups

This structural organization enables rapid vertical transfers of full DRAM row activations (~4 KB) and high sustained bandwidth between memory and logic.
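The bank/channel/bank-group organization above can be made concrete with a small address-decomposition sketch. The counts (16 channels × 16 banks = 256 banks, 32 bank-groups of 8 banks, 4 KB rows) come from the text; the field ordering and bit layout are illustrative assumptions, not the actual GenDRAM mapping.

```python
# Illustrative address decomposition for the stack organization above:
# 16 channels x 16 banks = 256 banks, in 32 bank-groups of 8 banks,
# with 4 KB rows. Field ordering is an assumption for illustration.

ROW_BYTES = 4096          # one 4 KB DRAM row
BANKS_PER_CHANNEL = 16
CHANNELS = 16
BANKS_PER_GROUP = 8       # 256 banks / 32 bank-groups

def decode(addr: int) -> dict:
    """Split a byte address into (channel, bank, bank_group, row, column)."""
    col = addr % ROW_BYTES
    addr //= ROW_BYTES
    bank = addr % BANKS_PER_CHANNEL
    addr //= BANKS_PER_CHANNEL
    chan = addr % CHANNELS
    row = addr // CHANNELS
    flat_bank = chan * BANKS_PER_CHANNEL + bank   # 0..255
    return {"channel": chan, "bank": bank,
            "bank_group": flat_bank // BANKS_PER_GROUP,
            "row": row, "column": col}

# Consecutive rows land in different banks/channels, spreading traffic
# across bank-groups (and hence across logic-layer PUs).
print(decode(0))
print(decode(ROW_BYTES))        # next row -> next bank
print(decode(ROW_BYTES * 16))   # wraps to the next channel
```

Under such an interleaving, streaming accesses naturally engage many banks (and the PUs mapped to their bank-groups) concurrently.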

2. Logic-Layer Architecture and Processing-In-Memory Capabilities

The bottom logic die in 3D-stacked DRAM systems is a compute-rich silicon layer optimized for in-situ data-parallel operations and local control. Design points include:

  • Specialized Processing Units: GenDRAM's logic layer incorporates dedicated Search PUs for pointer-table and candidate-location lookups, and Compute PUs with 16 sub-PEs per unit for dynamic programming; instruction sets are domain-tailored (e.g., ACTIVATE_ROW, MIN, MAX, WRITE_BACK, wavefront BRANCH) and implemented in multiplier-less datapaths for energy efficiency (Lu et al., 27 Feb 2026).
  • Tensor and Vector Engines: In Stratum, each logic PU integrates 16 bank-local PEs, each a 16×16 FP16 tensor core, with local SRAM/free register files; there are also large shared SRAMs and specialized SIMD units for functions such as Softmax or GeLU (for transformer inference) (Pan et al., 6 Oct 2025).
  • Near-Data Parallelism: Logic PUs are mapped directly to bank-groups or channels, exposing up to 384 Gops/s DP throughput (in GenDRAM) and saturating the available 3D-DRAM bandwidth with local computation (Lu et al., 27 Feb 2026, Pan et al., 6 Oct 2025).
  • Inter-PU Interconnection: Ring- or mesh-based NoCs provide high-bandwidth communication (128 GB/s per link in GenDRAM) for reduction, gather/scatter, and synchronization across logical tiles.

The logic die also hosts DRAM controllers, fine-grained refresh/protection mechanisms (e.g., adjacent-row refresh for RowHammer), and in higher-level PIM designs, coherence directory structures or region-based page tables (Ghose et al., 2018, Pan et al., 6 Oct 2025).
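To see why the wavefront-style ISA above (MIN/MAX reductions, wavefront BRANCH) maps well onto arrays of sub-PEs, consider anti-diagonal ("wavefront") dynamic programming. The sketch below uses edit distance as a stand-in kernel (an assumption; GenDRAM's actual DP kernels are domain-specific): all cells on one anti-diagonal are mutually independent and could execute in parallel across sub-PEs.

```python
# Minimal anti-diagonal ("wavefront") DP sketch: edit distance.
# Cells on one anti-diagonal depend only on earlier diagonals, so they
# can be computed in parallel; the kernel choice is illustrative.

def edit_distance_wavefront(a: str, b: str) -> int:
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Sweep anti-diagonals d = i + j; cells on one d are independent.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = 0 if a[i - 1] == b[j - 1] else 1
            # A MIN-style reduction, as in the domain-tailored ISA above.
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + sub)
    return D[n][m]

print(edit_distance_wavefront("kitten", "sitting"))  # classic example: 3
```

In a 3D-stacked design, each diagonal's cells can be striped across sub-PEs while the next row of operands streams vertically from the DRAM layers.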

3. Data Placement, Tiered-Latency Management, and Mapping Strategies

With vertically tiered DRAM layers, wire delay and sense-amp parasitics introduce distinct access time differences along the z-dimension. This motivates latency- and bandwidth-aware data placement:

  • Vertical Tier Mapping: DRAM layers are grouped into "tiers" with distinct t_RCD values (e.g., Tier0: 2.29 ns, Tier7: 22.88 ns in GenDRAM). Latency-sensitive or frequently accessed tables (e.g., PTR, CAL) are pinned to the fastest tier (Lu et al., 27 Feb 2026, Pan et al., 6 Oct 2025).
  • Horizontal Bank and PU Interleaving: Workload tiles are mapped to logical bank-groups and PUs such that neighbor/broadcasting tiles are on distinct banks and PUs, maximizing concurrency and saturating ~34 TB/s stack bandwidth (Lu et al., 27 Feb 2026).
  • Dynamic Data Mapping: In Stratum, topic-gated placement uses query-specific topic class information to predict "hot" experts in Mixture-of-Experts (MoE) layers; high-likelihood weights are assigned to fast access tiers, yielding up to 1.6× latency reduction for key layers (Pan et al., 6 Oct 2025).
  • ILP-Based Data Scheduling: For DNN workloads, integer-programmed Hamiltonian cycles schedule inter-node transfers, balancing local and network bandwidth to optimize latency and utilization (Wang et al., 2023).

Such strategies, coupled with 3D-aware data interleaving and bandwidth pooling, are critical to fully exploit the local memory bandwidth and manage skew/saturation effects in large-scale compute.
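A minimal sketch of tier-aware placement: pin the hottest structures to the fastest tiers, greedily, subject to capacity. Only the Tier0 (2.29 ns) and Tier7 (22.88 ns) t_RCD endpoints come from the text; the intermediate tier latencies are linearly interpolated here as an assumption, and the table names/sizes are hypothetical.

```python
# Greedy tier-aware placement: hottest tables go to the fastest tiers.
# Tier0/Tier7 t_RCD endpoints are from the text; intermediate values
# are linearly interpolated (an assumption), as are the example tables.

TIERS = 8
T_RCD = [2.29 + t * (22.88 - 2.29) / (TIERS - 1) for t in range(TIERS)]

def place(tables, tier_capacity_kb):
    """tables: list of (name, size_kb, accesses_per_sec). Returns name->tier."""
    placement, free = {}, [tier_capacity_kb] * TIERS
    for name, size, _ in sorted(tables, key=lambda t: -t[2]):  # hottest first
        tier = next(t for t in range(TIERS) if free[t] >= size)
        free[tier] -= size
        placement[name] = tier
    return placement

tables = [("PTR", 64, 9e6), ("CAL", 64, 8e6), ("scratch", 128, 1e4)]
p = place(tables, tier_capacity_kb=128)
print({name: (tier, round(T_RCD[tier], 2)) for name, tier in p.items()})
```

Real systems would weigh bank conflicts and bandwidth, not just t_RCD, but the same greedy skeleton underlies both static pinning (PTR/CAL) and the topic-gated dynamic placement described above.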

4. Performance, Energy, and Thermal Characteristics

3D-stacked DRAM systems with logic layers exhibit distinctive throughput, efficiency, and thermal profiles:

  • Compute and Bandwidth Efficiency:
    • Peak DP throughput in GenDRAM: 384 Gops/s; aggregate DRAM-logic bandwidth saturation at ~34 TB/s (Lu et al., 27 Feb 2026)
    • Mono3D DRAM: internal on-stack bandwidth of 19–30 TB/s per device, with energy reductions and throughput improvements of 7.66× and 8.29×, respectively, over HBM+GPU baselines (Pan et al., 6 Oct 2025)
  • Energy per Operation:
    • DRAM access: 0.429 pJ/bit (GenDRAM), SRAM: 7 pJ/access, ALU ops: 3 pJ/op (Lu et al., 27 Feb 2026)
    • Read/write energy in monolithic 3D DRAM: 1.35–6.26 fJ/cell vs. >15 fJ in 2D (Lee et al., 12 Mar 2026)
  • End-to-End Gains:
    • APSP (Shortest Path): 324× speedup (vs. A100 GPU); 3,442× energy efficiency
    • Genomics: up to 45× (short-reads), 23× (long-reads) speedup; 23,000× energy improvement (short-reads) (Lu et al., 27 Feb 2026)
  • Thermal Behavior:
    • High-density stacking yields power densities that push logic-die peak temperatures above 350 °C for large LLMs (e.g., GPT-3); optimized core placement (P- vs. E-cores) and bandwidth sharing reduce peak logic-die temperature by up to 9.37 °C, mitigate lateral gradients, and allow stable 1 GHz operation within reliability envelopes (He et al., 10 Aug 2025).
    • Thermal-aware stack configuration (layer count, logic cluster sizing) determines sustainable frequency and safe DRAM/logic operation (Mo et al., 6 Apr 2026, Hadidi et al., 2017).
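The per-operation figures above support a quick sanity check: moving one 4 KB row at 0.429 pJ/bit is compared against the 3 pJ/op ALU cost, showing why low-arithmetic-intensity kernels are memory-energy dominated unless fetched rows are reused.

```python
# Back-of-envelope energy check using the per-operation figures above.

PJ_PER_BIT_DRAM = 0.429       # GenDRAM DRAM access energy (from the text)
PJ_PER_ALU_OP = 3.0           # ALU op energy (from the text)
ROW_BITS = 4 * 1024 * 8       # one 4 KB row

row_energy_pj = ROW_BITS * PJ_PER_BIT_DRAM
print(f"one row activation+transfer: {row_energy_pj / 1000:.1f} nJ")

# ALU ops that cost the same energy as fetching one row:
ops_equiv = row_energy_pj / PJ_PER_ALU_OP
print(f"energy-equivalent ALU ops per row fetch: {ops_equiv:.0f}")

# A kernel doing only a few ops per fetched byte is memory-energy
# dominated; reusing each fetched row across many ops amortizes the cost.
```

One row fetch costs roughly as much as several thousand ALU operations, which is why near-data designs emphasize operating on full rows in place rather than shuttling them off-stack.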

5. System-Level Architecture, Parallelism, and Software Co-Design

The unique characteristics of 3D-stacked DRAM systems fundamentally alter system bottlenecks, parallelism strategies, and programming models:

  • Core–Memory Coupling: Monolithic 3D systems (e.g., RevaMp3D) shift bottlenecks from main memory into the core/cache microarchitecture; for instance, eliminating the shared last-level cache and shrinking L1 latency become justified once memory access time (t_mem) drops below L2 access time (t_L2) (Ghiasi et al., 2022).
  • Software–Hardware Co-Design: Practical deployment requires compiler/runtime support for offloading (message-passing APIs, Locality-Aware Execution), OS-level data placement for PIM regions, and in-memory address translation and coherence support (e.g., region-based page tables, LazyPIM speculative coherence) (Ghose et al., 2018).
  • Parallelism and Scheduling: Efficient mapping of DNN and LLM workloads to PIM resources (TP/PP/DP/EP/SP/CP/FSDP) is essential—DeepStack demonstrates up to 9.5× throughput gain through joint parallelism and 3D architecture DSE, with incomplete schedule search leading to permanently suboptimal silicon (Mo et al., 6 Apr 2026).

Table: Application-Domain Performance Improvements

| Workload | Speedup | Energy Efficiency Gain | Reference |
|---|---|---|---|
| APSP (graph DP) | up to 324× | 3,442× | (Lu et al., 27 Feb 2026) |
| Genomics pipeline | up to 45× | 23,000× (short-reads) | (Lu et al., 27 Feb 2026) |
| Transformer MoE | 8.29× | 7.66× | (Pan et al., 6 Oct 2025) |
| Large DNN inference | up to 1.6× | 28% avg. reduction | (Wang et al., 2023) |
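The joint parallelism/architecture design-space exploration described above can be sketched as enumerating factorizations of a stack count into (TP, PP, DP) degrees and scoring each. The cost model below is a toy with placeholder coefficients, not DeepStack's actual model; it only illustrates the shape of the search.

```python
# Enumerate (TP, PP, DP) factorizations of a stack count and score them
# with a toy cost model. Coefficients are placeholders, not DeepStack's.
from itertools import product

def configs(n_stacks):
    for tp, pp, dp in product(range(1, n_stacks + 1), repeat=3):
        if tp * pp * dp == n_stacks:
            yield tp, pp, dp

def toy_cost(tp, pp, dp, layers=96, comm=0.1):
    compute = layers / (tp * dp)          # work split across TP x DP
    bubble = (pp - 1) / pp                # pipeline-bubble fraction
    allreduce = comm * (tp - 1)           # TP collective overhead
    return compute * (1 + bubble) + allreduce

best = min(configs(16), key=lambda c: toy_cost(*c))
print("best (TP, PP, DP) of 16 stacks under toy model:", best)
```

Even this toy already shows the search's character: pipeline bubbles and collective overheads trade off against work splitting, and the optimum shifts with the coefficients, which is why an incomplete schedule search can bake suboptimal choices into silicon.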

6. Reliability, Scalability, and Implementation Challenges

Despite significant benefits, deploying 3D-stacked DRAM/PIM systems at scale requires addressing several cross-layer challenges:

  • Thermal Limits and Cooling: Sustained bandwidth and high compute density require aggressive thermal management; write-intensive or logic-bound workloads can trigger logic/DRAM over-temperature if cooling, throttling, or design-for-test are insufficient (Hadidi et al., 2017, He et al., 10 Aug 2025).
  • Reliability (e.g., RowHammer): Logic-layer-based probabilistic refresh (e.g., PARA) and periphery protection mitigate fault modes exacerbated by greater density and power density (Mutlu et al., 2019).
  • Manufacturability Constraints: Trade-offs between hybrid-bonding pitch, BL/WL routing, sense margin, and tier scaling dictate feasible bit densities—2.6 Gb/mm² at 137 layers (Si stack) or 87 layers (AOS) with row cycle times as low as 10.5 ns and >60% energy saving over 2D baselines (Lee et al., 12 Mar 2026).
  • Scalability (System Integration): Distributed inference over multiple 3D stacks (DeepStack) requires accurate modeling of per-layer bandwidth/latency, dual-stage network mapping, and area/thermal-constrained scheduling; unconstrained scaling can result in diminishing BW returns after 9 layers (due to Little's Law and buffer saturation) (Mo et al., 6 Apr 2026).
  • Programming Model/ISA: A general-purpose PIM ISA, support for virtual memory and protection, efficiency primitives for fine-grained sharing, and integration of non-volatile memory technologies remain key research directions (Mutlu et al., 2019, Ghose et al., 2018).
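The diminishing-bandwidth observation above follows directly from Little's Law: delivered bandwidth is capped by in-flight bytes divided by access latency, regardless of how much peak bandwidth additional layers add. The buffer depth, request size, and latency values below are illustrative assumptions.

```python
# Little's Law sketch: achieved BW = min(peak, in-flight bytes / latency).
# Buffer depth, request size, and latency are illustrative assumptions.

REQ_BYTES = 64        # bytes per in-flight request (assumed)
BUFFER_DEPTH = 256    # maximum outstanding requests (assumed)
LATENCY_NS = 20       # average round-trip access latency (assumed)

def achieved_gbps(peak_tbps_per_layer, layers):
    peak = peak_tbps_per_layer * layers * 1000           # GB/s
    littles_cap = BUFFER_DEPTH * REQ_BYTES / LATENCY_NS  # bytes/ns == GB/s
    return min(peak, littles_cap)

for layers in (1, 4, 9, 16):
    print(layers, "layers ->", achieved_gbps(0.1, layers), "GB/s")
# Once peak exceeds the Little's-Law cap, adding layers stops improving
# delivered bandwidth until buffers deepen or latency shrinks.
```

With these assumed parameters the cap binds around 8–9 layers, mirroring the saturation point reported for unconstrained scaling.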

7. Application Domains and Generalization Potential

Architectures combining 3D-stacked DRAM with a logic layer have demonstrated broad utility for data-intensive, fine-grained, bandwidth-bound workloads, including graph dynamic programming (APSP), genomics read mapping, and transformer/MoE and DNN inference.

Generalizing these findings, any application class dominated by high internal bandwidth demand, fine-grained synchronization, and low arithmetic intensity—such as near-data analytics, sparse linear algebra, storage offload, and ASIC acceleration—can benefit from monolithic or hybrid-bonded 3D DRAM with logic layer (Lu et al., 27 Feb 2026, Pan et al., 6 Oct 2025). The critical design levers remain tier-aware data placement, balanced logic–memory coupling, and comprehensive cross-domain software–hardware co-design.
