3D DRAM-Stacked AI Systems

Updated 1 June 2026

3D DRAM-stacked AI systems are architectures that vertically integrate multiple DRAM layers with logic dies using techniques like TSV, hybrid bonding, and monolithic stacking to enhance bandwidth and capacity.
They employ design paradigms such as processor-in-memory, near-memory processing, and heterogeneous core mapping to minimize data movement and reduce latency in deep neural networks and large language models.
Key innovations include DRAM-aware tiling, software-hardware co-design, and advanced thermal management, achieving up to an order-of-magnitude improvement in throughput and energy efficiency over conventional architectures.

3D DRAM-stacked AI systems integrate multiple layers of dynamic random-access memory (DRAM) vertically with one or more compute-centric logic dies, exploiting high-bandwidth in-stack communication, increased memory capacity, and minimized data movement to accelerate contemporary deep neural networks (DNNs) and large language models (LLMs). This architectural approach leverages through-silicon vias (TSVs), hybrid bonding, or monolithic 3D stacking to tightly couple massive DRAM arrays with specialized processors, enabling efficient execution of memory-bound AI workloads that conventional 2D and 2.5D architectures cannot sustain. Recent advances span Smart Memory Cubes with processor-in-memory (PIM), monolithic 3D DRAM with near-memory processing (NMP), distributed chiplets for LLMs, and full-stack co-design frameworks. The key drivers are bandwidth-centric dataflows, fine-grained memory mapping, aggressive co-design (hardware, compiler, mapping), and end-to-end validation, producing up to an order of magnitude higher throughput and energy efficiency relative to GPU or HBM-only baselines.

1. Physical Stacking Topologies and DRAM-integration Technologies

3D DRAM-stacked AI systems adopt a variety of stacking techniques, each with distinct trade-offs in interconnect density, thermal dissipation, and manufacturability.

Hybrid-bonding & TSV-based stacks: Many designs employ fine-pitch copper–copper hybrid bonding, with bump pitch down to 1 µm (or even sub-µm) and ~10⁶ vertical connections/mm² (Tam et al., 2020, Li et al., 9 Apr 2026, Cai et al., 13 Dec 2025). DRAM wafers (typically 4–8 layers) are directly bonded atop logic wafers, with through-silicon vias used for both intra-stack data and external IO. This enables internal bandwidth of multiple TB/s.
Monolithic 3D DRAM: Stratum's Mono3D DRAM builds up to 1,024 layers of 1T1C DRAM in a single sequential deposition process, with vertical bitlines and staircased wordlines, yielding density and vertical bandwidth unachievable with TSVs alone. Logic die and DRAM stack are bonded face-to-face, with hybrid-bonding interconnects at 1 µm pitch (Pan et al., 6 Oct 2025).
Wafer-level stacking (HITOC): Sunrise employs a face-to-face stack of a CMOS logic wafer and a DRAM wafer via 1 µm Cu–Cu hybrid microbumps plus backside TSVs for IO. Inter-wafer interface supports up to 1.8 TB/s per chip (Tam et al., 2020).
Chiplet-based ("3.5D") systems: LaMoSys3.5D composes multiple heterogeneous 3D-DRAM-on-logic chiplets side-by-side on a 2.5D silicon interposer, connecting each via high-bandwidth die-to-die links, combining “compute-rich” and “memory-rich” stacks in one package (Wang et al., 9 Dec 2025).
Traditional HBM/HMC derivatives: Architectures like SMC (Smart Memory Cube) leverage the HMC convention of a logic base die with multiple stacked DRAM dies via TSVs, each organized into vaults/banks for bank-level parallelism (Azarkhish et al., 2017, Oliveira et al., 2022).

The choice of stacking topology impacts not only the aggregate bandwidth (scaling linearly with number of usable TSV/hybrid-bond lanes) but also the energy per bit transferred (~0.66–0.88 pJ/bit in advanced 3D-DRAM, 5–10× lower than commodity DDR) (Cai et al., 13 Dec 2025, Li et al., 9 Apr 2026), thermal gradients (R_th scaling with layer count), and yield (N-stack yield drops exponentially with added layers) (Tam et al., 2020, Mo et al., 6 Apr 2026).
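The scaling relations in the paragraph above can be sketched as back-of-envelope arithmetic. This is a minimal sketch, not a model from any of the cited works: the function names, the square-grid assumption, the per-lane data rate, and the per-layer yield are all illustrative assumptions; only the 1 µm pitch and the 0.66 pJ/bit figure come from the text.

```python
def connections_per_mm2(pitch_um: float) -> float:
    """Vertical connections per mm^2, assuming a square grid at the given pitch."""
    per_mm = 1000.0 / pitch_um
    return per_mm * per_mm


def aggregate_bandwidth_tbps(lanes: int, gbit_per_lane: float) -> float:
    """Aggregate bandwidth in TB/s; scales linearly with the number of usable lanes."""
    return lanes * gbit_per_lane / 8.0 / 1000.0


def transfer_power_w(bandwidth_tbps: float, pj_per_bit: float) -> float:
    """Interface power in watts for a sustained bandwidth at a given energy per bit."""
    bits_per_second = bandwidth_tbps * 1e12 * 8.0
    return bits_per_second * pj_per_bit * 1e-12


def stack_yield(per_layer_yield: float, layers: int) -> float:
    """Compound yield of an N-layer stack, assuming independent per-layer yield."""
    return per_layer_yield ** layers


if __name__ == "__main__":
    print(connections_per_mm2(1.0))              # 1 um pitch from the text
    print(aggregate_bandwidth_tbps(8192, 2.0))   # lane count and rate are assumed
    print(transfer_power_w(2.0, 0.66))           # 0.66 pJ/bit from the text
    print(stack_yield(0.95, 8))                  # 95% per-layer yield is assumed
```

Under these assumptions, a 1 µm square grid reproduces the ~10⁶ connections/mm² quoted above, and the compound-yield function shows why yield falls off exponentially as layers are added.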

2. Architectural and Microarchitectural Design Principles

Memory-compute integration in 3D DRAM-stacked AI systems is realized via multiple design paradigms:

Processor-in-memory (PIM) blocks: NeuroCluster, for example, integrates 16 clusters (8 NeuroStream FP32 MAC coprocessors + 4 RISC-V PEs/cluster) into the HMC logic base, with scratchpad, DMA, and hardware tiling logic, achieving 240 GFLOPS within a 2.5 W logic power budget. The reported 22.5 GFLOPS/W, 3.5× that of contemporary GPUs, is consistent with the roughly 11 W drawn by the whole SMC device rather than with the logic power alone (Azarkhish et al., 2017).
Near-memory processing (NMP): In Stratum, bank-contiguous blocks of matrices are processed in-situ by PUs at the edge of DRAM banks, each with a MAC array, psum SRAM, and tier-aware controller. System software places “hot” MoE experts and KV-cache in the fastest DRAM tiers (Pan et al., 6 Oct 2025).
Distributed core/bank mapping: Voxel demonstrates end-to-end designs where compute-layer AI cores are tiled under a DRAM bank grid, connected via vertical TSVs for local, high-bandwidth access, and mesh/torus NoC for cross-core communication (Liu et al., 29 Apr 2026, Wang et al., 2023).
Heterogeneous logic: Tasa partitions logic into high-performance cores (systolic arrays for GEMM), high-efficiency cores (MAC-trees for attention), and employs dynamic bandwidth-sharing, with fine-grain floorplan and thermal optimization (He et al., 10 Aug 2025).
Chiplet specialization: In LaMoSys3.5D, some chiplets are compute-dense (prefill), others bandwidth/capacity-dense (decode), with software targeting tensor-parallel and pipeline-parallel mapping across the package (Wang et al., 9 Dec 2025).

Core architectural guidelines validated in the literature include matching PIM-node PE array size to local DDR/DRAM bank bandwidth, optimizing SRAM size per core (≥2 MB for memory-bound decode), quantizing NoC link width for comm/compute overlap, and selecting core-group configurations that maximize DRAM utilization while avoiding NoC/thermal bottlenecks (Wang et al., 2023, Li et al., 9 Apr 2026, Liu et al., 29 Apr 2026).
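The first guideline above, matching the PE array size to local bank bandwidth, can be illustrated with a roofline-style balance check. This is a hedged sketch under simple assumptions (each MAC performs 2 FLOPs per cycle, and the kernel has a fixed operational intensity in FLOP/byte); the function names and all numeric inputs are hypothetical and not taken from the cited papers.

```python
def attainable_gflops(macs: int, clock_ghz: float,
                      bank_bw_gbps: float, flops_per_byte: float) -> float:
    """Roofline bound: the lower of compute peak and bandwidth-fed throughput."""
    compute_peak = 2.0 * macs * clock_ghz           # 2 FLOPs per MAC per cycle
    memory_bound = bank_bw_gbps * flops_per_byte    # GB/s * FLOP/byte = GFLOP/s
    return min(compute_peak, memory_bound)


def balanced_macs(bank_bw_gbps: float, clock_ghz: float,
                  flops_per_byte: float) -> int:
    """Largest MAC count that the local DRAM bandwidth can keep busy."""
    sustainable_gflops = bank_bw_gbps * flops_per_byte
    return int(sustainable_gflops // (2.0 * clock_ghz))


if __name__ == "__main__":
    # Assumed example: 32 GB/s of local bank bandwidth, 1 GHz clock,
    # and a kernel with 2 FLOP/byte of operational intensity.
    n = balanced_macs(32.0, 1.0, 2.0)
    print(n, attainable_gflops(n, 1.0, 32.0, 2.0))
    # Doubling the array beyond the balance point adds no throughput.
    print(attainable_gflops(2 * n, 1.0, 32.0, 2.0))
```

PE arrays larger than the balance point sit idle on memory-bound kernels, which is the reasoning behind sizing them against bank bandwidth rather than against area alone.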

3. Dataflow, Mapping, and Software Co-Design

Efficient mapping of DNN/LLM computation onto 3D DRAM-stacked architectures is characterized by:

DRAM-aware tiling: 4D tiling in SMC/Neurostream enables tiles with row-major layouts, halo padding for single-DMA transfer, and partial-sum accumulation, maximizing row-buffer hits and OI (operational intensity) per layer (Azarkhish et al., 2017). Similarly, chiplets in LaMoSys3.5D employ the Direct-DRAM-Delivery (“D³”) dataflow, streaming tiles directly from DRAM or selectively staging in SRAM for reuse (Wang et al., 9 Dec 2025).
Dimension-order mapping: Voxel shows that placing tiles that share data on spatially nearby cores in a mesh reduces average NoC hop-count and communication latency; software-aware tensor-to-bank mapping avoids DRAM row-buffer conflicts and achieves >80% peak utilization (Liu et al., 29 Apr 2026).
Parallelism strategies: DeepStack models and co-searches tensor, data, pipeline, expert, and sharding parallelisms, showing that incomplete schedule search irreversibly skews Pareto efficiency (e.g., missing expert parallelism produces SM-poor, power-walled configurations) (Mo et al., 6 Apr 2026).
End-to-end programming abstractions: ATLAS exposes unified system-level programming primitives: allocation, data movement (alloc, copy), GEMM/reduction/softmax kernels, DRAM-aware tensor declarations, explicit split_gemm/split_attention, and point-to-point/collective communication, matching directly to DRAM placement and NoC scheduling (Li et al., 9 Apr 2026).
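The halo-padded tiling described under DRAM-aware tiling can be sketched in a simplified form. The version below covers only the two spatial dimensions of a stride-1 convolution with "same" padding, whereas the cited work tiles in 4D; the function names are hypothetical. Each returned window includes its halo, so a tile can be fetched as one contiguous region rather than stitched from neighbours.

```python
def halo_tiles(height: int, width: int, tile_h: int, tile_w: int, kernel: int):
    """Return input windows (row0, row1, col0, col1), halo included, one per output tile.

    Windows are half-open ranges clipped to the feature-map borders.
    """
    halo = kernel // 2
    tiles = []
    for r in range(0, height, tile_h):
        for c in range(0, width, tile_w):
            r0 = max(r - halo, 0)
            r1 = min(r + tile_h + halo, height)
            c0 = max(c - halo, 0)
            c1 = min(c + tile_w + halo, width)
            tiles.append((r0, r1, c0, c1))
    return tiles


def halo_overhead(height: int, width: int, tile_h: int, tile_w: int, kernel: int) -> float:
    """Ratio of elements fetched (with halos) to elements in the feature map."""
    fetched = sum((r1 - r0) * (c1 - c0)
                  for r0, r1, c0, c1 in halo_tiles(height, width, tile_h, tile_w, kernel))
    return fetched / (height * width)


if __name__ == "__main__":
    print(halo_tiles(8, 8, 4, 4, 3))
    print(halo_overhead(8, 8, 4, 4, 3))
```

The overhead ratio makes the trade-off visible: smaller tiles fit smaller scratchpads but re-fetch more halo data, so tile size is chosen jointly with SRAM capacity and row-buffer locality.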

Automated frameworks (NicePIM, Voxel, DeepStack) integrate DNN operator partitioning, mapping, and scheduling with design space exploration, providing validated reductions of latency (37%), energy (28%), and up to 25× speedup over GPU baselines for matched workloads (Wang et al., 2023, Liu et al., 29 Apr 2026, Mo et al., 6 Apr 2026).
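The mapping step these frameworks automate can be illustrated with the dimension-order placement idea from this section: tiles that share data should sit on nearby mesh cores. The sketch below is an assumption-laden toy, not the Voxel algorithm. It assumes XY (dimension-order) routing on a 2D mesh, so hop count equals Manhattan distance, and all names are hypothetical.

```python
def xy_hops(a: tuple, b: tuple) -> int:
    """Hop count between two mesh cores under XY routing (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def average_hops(placement: dict, sharing_pairs: list) -> float:
    """Mean hop count over all tile pairs that exchange data.

    placement maps tile id -> (x, y) core coordinate.
    """
    total = sum(xy_hops(placement[a], placement[b]) for a, b in sharing_pairs)
    return total / len(sharing_pairs)


if __name__ == "__main__":
    # Four tiles in a chain, each sharing data with its successor, on a 2x2 mesh.
    pairs = [(0, 1), (1, 2), (2, 3)]
    snake = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}      # neighbours adjacent
    scattered = {0: (0, 0), 1: (1, 1), 2: (0, 1), 3: (1, 0)}  # neighbours apart
    print(average_hops(snake, pairs), average_hops(scattered, pairs))
```

A mapper can use a cost of this kind as one term in its objective, alongside DRAM row-buffer conflicts and per-core load balance.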

4. Performance, Energy, and Scalability

Benchmarking and full-scale simulation reveal that 3D DRAM-stacked AI systems consistently surpass prior architectures in bandwidth, latency, throughput, and energy efficiency:

System	Throughput or speedup	Energy efficiency	Benchmarks	Thermal notes
Neurostream (SMC)	240 GFLOPS per SMC, 955 GFLOPS with 4 SMCs	22.5 GFLOPS/W (~3.5× GPU)	Full ConvNet inference	Negligible (11 W total)
Sunrise	25 TOPS, 1.8 TB/s, 4.5 GB on-chip	2.08–27.7 TOPS/W	CNN, NLP, vision, after node scaling	Modeled up to N=2 stacks, thermal-limited
Stratum (Mono3D)	8.29× GPU tokens/s (MoE LLMs)	7.66× GPU (tokens/J)	Mixtral, Qwen2.5, etc. LLM	Modeled 1.6× slowest-fastest tier
Tasa	2.85× A100, 1.33× Homo-3D	2.07× A100 (Joules/token)	LLaMA-65B, GPT-3 66B, batch traces	Up to –9.37 (vs. homo-3D)
ATLAS (Cloud scenario)	up to 3.64× H200 latency, 2.53× speed	up to 6.66× energy	LLM decode, prefill	≤85 °C thermal cap enforced
Voxel	1.84× SPMD (compute-shift)	up to 25% energy-to-token gain	LLM + DiT-XL	Throttle >0.7 W/mm²

Increasing stack height improves capacity/bandwidth linearly but is capped by Little’s law, TSV bandwidth, yield, and thermal budget. DeepStack finds that STPS (system tokens/sec) gain peaks at 8–9 layers; excess layers are beneficial mostly for energy-optimal points where added DRAM is sparsely accessed (Mo et al., 6 Apr 2026). Software–hardware co-search sweeps multi-dimensional trade-offs, balancing DRAM utilization, NoC BW/latency, per-core SRAM, and core NR/Nc for latency/efficiency Pareto fronts (Cai et al., 13 Dec 2025, Wang et al., 2023, Liu et al., 29 Apr 2026).
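The Little's law cap mentioned above can be made concrete with a short calculation: sustained bandwidth cannot exceed the number of in-flight requests times the request size divided by memory latency, however many layers are added. This is a minimal sketch with assumed numbers and hypothetical function names; it does not reproduce the DeepStack result of a peak at 8–9 layers.

```python
def littles_law_bw_gbps(outstanding: int, request_bytes: int, latency_ns: float) -> float:
    """Bandwidth ceiling from Little's law; bytes per nanosecond equals GB/s."""
    return outstanding * request_bytes / latency_ns


def sustained_bw_gbps(layers: int, per_layer_peak_gbps: float,
                      outstanding: int, request_bytes: int, latency_ns: float) -> float:
    """Peak bandwidth grows linearly with layers until the concurrency cap binds."""
    peak = layers * per_layer_peak_gbps
    return min(peak, littles_law_bw_gbps(outstanding, request_bytes, latency_ns))


if __name__ == "__main__":
    # Assumed: 32 GB/s per layer, 256 outstanding 64-byte requests, 100 ns latency.
    for n in (2, 4, 8, 16):
        print(n, sustained_bw_gbps(n, 32.0, 256, 64, 100.0))
```

With these assumed values the curve flattens once peak bandwidth passes the concurrency ceiling, which is why deeper stacks need more outstanding requests or lower latency to pay off in throughput.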

5. Thermal Management and System-level Co-Design

Thermal constraints are decisive in system scaling, achievable clock, and sustained bandwidth. Increased DRAM-stack height, central logic clustering, and large P-core counts steepen temperature gradients. Solutions include:

Heterogeneous core allocation: Tasa shows that embedding low-power E-cores in thermally stressed regions allows spatial temperature flattening, lowering peak by ≈9.4°C at iso-latency versus homogeneous layouts (He et al., 10 Aug 2025).
Cross-stack dynamic scheduling: Bandwidth-sharing mechanisms move KV-cache traffic between P- and E-cores in proportion to measured utilization, hiding data migration during normal operation (He et al., 10 Aug 2025).
Transient, stack-aware simulators: LaMoSys3.5D and DeepStack integrate event-driven thermal solvers. The temperature rise is computed as ΔT = P·R_th (dissipated power times thermal resistance), and DRAM-refresh and logic-leakage overheads are updated iteratively until the temperature converges (Wang et al., 9 Dec 2025, Mo et al., 6 Apr 2026).
Liquid cooling optimization: LaMoSys3.5D models active pump flow rate as a variable for further R_th reduction within package-level thermal budgets (Wang et al., 9 Dec 2025).

In all systems, the practical stack height (N_max) is ultimately set by the peak temperature the stack can tolerate for a given power density, stack geometry, and cooling technology (Tam et al., 2020, Cai et al., 13 Dec 2025).
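The iterative ΔT = P·R_th update and the resulting stack-height limit can be sketched as a fixed-point loop. This is a toy model under stated assumptions, not the solver used in LaMoSys3.5D or DeepStack: leakage is assumed to grow linearly with temperature, thermal resistance and DRAM power are assumed to grow linearly with layer count, and every default value and function name is hypothetical. The 85 °C cap echoes the thermal cap listed for ATLAS in the table above.

```python
def steady_state_temp(dynamic_w: float, r_th: float, t_ambient: float = 45.0,
                      leak_w_ref: float = 2.0, leak_per_degc: float = 0.01,
                      tol: float = 1e-6, max_iter: int = 1000) -> float:
    """Iterate T = T_amb + P(T) * R_th until the temperature stops changing."""
    t = t_ambient
    for _ in range(max_iter):
        # Assumed leakage model: linear growth above ambient temperature.
        leak_w = leak_w_ref * (1.0 + leak_per_degc * (t - t_ambient))
        t_new = t_ambient + (dynamic_w + leak_w) * r_th
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    raise RuntimeError("thermal iteration did not converge")


def max_stack_height(t_max: float, logic_w: float, dram_w_per_layer: float,
                     r_base: float, r_per_layer: float, n_limit: int = 64) -> int:
    """Largest layer count whose steady-state temperature stays within t_max."""
    best = 0
    for n in range(1, n_limit + 1):
        t = steady_state_temp(logic_w + n * dram_w_per_layer,
                              r_base + n * r_per_layer)
        if t > t_max:
            break
        best = n
    return best


if __name__ == "__main__":
    print(steady_state_temp(20.0, 1.0))
    print(max_stack_height(85.0, logic_w=20.0, dram_w_per_layer=1.0,
                           r_base=1.0, r_per_layer=0.1))
```

The loop converges only while the leakage feedback gain (reference leakage × temperature coefficient × R_th) stays below one; beyond that point the model describes thermal runaway, and the function raises instead of returning a temperature.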

6. DRAM Microarchitecture: Models and Design Exploration

Advances at the DRAM-array and bank level are leveraged for AI workloads:

Bank, subarray, and MAT-level customization: DreamRAM exposes a design space at MAT, subarray, and bank hierarchy, modeling wire pitch, capacitance, partial page activation, and subarray-level parallelism (SALP-all, SALP-groups) for bandwidth and energy optimization (Cai et al., 13 Dec 2025).
Routing scheme optimization: The Dataline-Over-MAT (DLOMAT) routing allows more main datalines over MAT with short CSLs, boosting peak MAT bandwidth by ~13% at marginal area and latency cost.
Partial page/activation: Enabling half/quarter-page activation reduces tRCD, bitline capacitance, and energy per access by up to 40–50% for sparsely accessed ML models (Cai et al., 13 Dec 2025).
Calibration against HBM3/2E: DreamRAM models closely match real HBM stacks to <16% error on bandwidth, latency, and energy.

Guidelines from simulation and analytic modeling converge on stack heights of ≤6 for low-latency (<60 ns) use, maximizing channels and subarray-level activation for high-concurrency ML/LLMs, and adopting routing schemes like DLOMAT on banks with highest access pressure (Cai et al., 13 Dec 2025, Li et al., 9 Apr 2026, Liu et al., 29 Apr 2026).
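The effect of partial page activation on access energy can be illustrated with a deliberately simple model. This is a sketch under assumptions, not the DreamRAM model: activation energy is taken to be proportional to the activated fraction of the page, column-access energy is taken to be fixed, and the picojoule values are placeholders chosen only to make the arithmetic visible.

```python
def access_energy_pj(act_energy_pj: float, col_energy_pj: float,
                     page_fraction: float, accesses_per_activation: int = 1) -> float:
    """Energy per column access when only part of the page is activated.

    Activation energy is assumed proportional to page_fraction and is
    amortised over the accesses served by one activation.
    """
    if not 0.0 < page_fraction <= 1.0:
        raise ValueError("page_fraction must be in (0, 1]")
    return act_energy_pj * page_fraction / accesses_per_activation + col_energy_pj


def energy_saving(act_energy_pj: float, col_energy_pj: float,
                  page_fraction: float, accesses_per_activation: int = 1) -> float:
    """Fractional saving relative to full-page activation at the same reuse."""
    full = access_energy_pj(act_energy_pj, col_energy_pj, 1.0, accesses_per_activation)
    part = access_energy_pj(act_energy_pj, col_energy_pj, page_fraction,
                            accesses_per_activation)
    return 1.0 - part / full


if __name__ == "__main__":
    # Placeholder energies: 800 pJ per full activation, 200 pJ per column access.
    print(energy_saving(800.0, 200.0, 0.5))                              # sparse access
    print(energy_saving(800.0, 200.0, 0.5, accesses_per_activation=8))   # high reuse
```

The second call shows why the benefit is tied to sparsely accessed models: when each activation already serves many column accesses, activation energy is amortised and the saving from a smaller page shrinks.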

7. Future Directions and Design Guidelines

Design evidence points to several robust, differentiated trends and best practices:

Co-design across all system levels (hardware, dataflow, mapping) is essential: Partial exploration of parallelism or memory parameters produces irreversibly suboptimal architectures (Mo et al., 6 Apr 2026, Wang et al., 9 Dec 2025).
Heterogeneity (across chiplets/cores) outperforms monolithic “one-size-fits-all” designs: Specialization for prefill/decode or GEMM/attention produces higher energy efficiency and robust utilization across LLM workloads (Wang et al., 9 Dec 2025, He et al., 10 Aug 2025).
Compiler- and software-aware mapping achieves major gains in latency and bandwidth utilization: Software-guided tensor-to-bank mapping and dimension-ordered tile placement minimize row-buffer conflicts and NoC hops, reaching more than 80% of peak utilization (Liu et al., 29 Apr 2026).
Thermal- and power-aware DSE cannot be overlooked: High DRAM stack heights and core counts undermine peak throughput and yield unless integrated thermal models drive early pruning (Wang et al., 9 Dec 2025, Mo et al., 6 Apr 2026).
Open, silicon-validated performance models (ATLAS, DeepStack, Voxel, NicePIM) are pivotal for enabling reproducible, community-wide exploration of co-designed 3D DRAM-accelerator architectures (Li et al., 9 Apr 2026, Mo et al., 6 Apr 2026, Liu et al., 29 Apr 2026, Wang et al., 2023).
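The early-pruning practice in the list above can be sketched as a two-step filter over candidate designs: discard points that violate the thermal cap, then keep only the latency/energy Pareto front. The candidate tuples, their values, and the function names below are hypothetical; the cited frameworks use far richer models for each metric.

```python
def pareto_front(points: list) -> list:
    """Keep points not dominated in (latency, energy); lower is better for both.

    Each point is a tuple (name, latency, energy, temp_c).
    """
    front = []
    for p in points:
        dominated = any(
            q[1] <= p[1] and q[2] <= p[2] and (q[1] < p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


def prune_and_rank(candidates: list, t_max_c: float) -> list:
    """Drop thermally infeasible designs first, then return the Pareto front."""
    feasible = [c for c in candidates if c[3] <= t_max_c]
    return pareto_front(feasible)


if __name__ == "__main__":
    candidates = [
        ("A", 10.0, 5.0, 80.0),
        ("B", 8.0, 7.0, 84.0),
        ("C", 7.0, 4.0, 95.0),   # best on paper, but over the thermal cap
        ("D", 12.0, 6.0, 70.0),  # feasible, but dominated by A
    ]
    print([c[0] for c in prune_and_rank(candidates, 85.0)])
```

Applying the thermal filter before the Pareto step matters: design C would otherwise dominate every other point and hide the feasible trade-off between A and B.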

The continued evolution of 3D DRAM-stacked AI systems is likely to deliver ever-higher vertically integrated bandwidth, agile on-stack/near-stack processing, and judicious software–hardware symbiosis, driving a new plateau of efficiency and scalability for DNN and LLM inference (Azarkhish et al., 2017, Pan et al., 6 Oct 2025, Wang et al., 9 Dec 2025, Li et al., 9 Apr 2026, Cai et al., 13 Dec 2025).