3D DRAM-Stacked AI Systems

Updated 1 June 2026
  • 3D DRAM-stacked AI systems are architectures that vertically integrate multiple DRAM layers with logic dies using techniques like TSV, hybrid bonding, and monolithic stacking to enhance bandwidth and capacity.
  • They employ design paradigms such as processor-in-memory, near-memory processing, and heterogeneous core mapping to minimize data movement and reduce latency in deep neural networks and large language models.
  • Key innovations include DRAM-aware tiling, software-hardware co-design, and advanced thermal management, achieving up to an order-of-magnitude improvement in throughput and energy efficiency over conventional architectures.

3D DRAM-stacked AI systems integrate multiple layers of dynamic random-access memory (DRAM) vertically with one or more compute-centric logic dies, exploiting high-bandwidth in-stack communication, increased memory capacity, and minimized data movement to accelerate contemporary deep neural networks (DNNs) and LLMs. This architectural approach leverages through-silicon vias (TSVs), hybrid bonding, or monolithic 3D stacking to tightly couple massive DRAM arrays with specialized processors, enabling efficient execution of memory-bound AI workloads that conventional 2D and 2.5D architectures cannot sustain. Recent advances span Smart Memory Cubes with processor-in-memory (PIM), monolithic 3D DRAM with near-memory processing (NMP), distributed chiplets for LLMs, and full-stack codesign frameworks. The key drivers are bandwidth-centric dataflows, fine-grained memory mapping, aggressive co-design (hardware, compiler, mapping), and end-to-end validation, producing up to an order of magnitude higher throughput and energy efficiency relative to GPU or HBM-only baselines.

1. Physical Stacking Topologies and DRAM-integration Technologies

3D DRAM-stacked AI systems adopt a variety of stacking techniques, each with distinct trade-offs in interconnect density, thermal dissipation, and manufacturability.

  • Hybrid-bonding & TSV-based stacks: Many designs employ fine-pitch copper–copper hybrid bonding, with bond pitches down to 1 µm (or even sub-µm) and ~10⁶ vertical connections/mm² (Tam et al., 2020, Li et al., 9 Apr 2026, Cai et al., 13 Dec 2025). DRAM wafers (typically 4–8 layers) are bonded directly atop logic wafers, with through-silicon vias carrying both intra-stack data and external IO. This enables internal bandwidth of multiple TB/s.
  • Monolithic 3D DRAM: Stratum's Mono3D DRAM builds up to 1,024 layers of 1T1C DRAM in a single sequential deposition process, with vertical bitlines and staircased wordlines, yielding density and vertical bandwidth unachievable with TSVs alone. Logic die and DRAM stack are bonded face-to-face, with hybrid-bonding interconnects at 1 µm pitch (Pan et al., 6 Oct 2025).
  • Wafer-level stacking (HITOC): Sunrise employs a face-to-face stack of a CMOS logic wafer and a DRAM wafer via 1 µm Cu–Cu hybrid microbumps plus backside TSVs for IO. Inter-wafer interface supports up to 1.8 TB/s per chip (Tam et al., 2020).
  • Chiplet-based ("3.5D") systems: LaMoSys3.5D composes multiple heterogeneous 3D-DRAM-on-logic chiplets side-by-side on a 2.5D silicon interposer, connecting each via high-bandwidth die-to-die links, combining “compute-rich” and “memory-rich” stacks in one package (Wang et al., 9 Dec 2025).
  • Traditional HBM/HMC derivatives: Architectures like SMC (Smart Memory Cube) leverage the HMC convention of a logic base die with multiple stacked DRAM dies via TSVs, each organized into vaults/banks for bank-level parallelism (Azarkhish et al., 2017, Oliveira et al., 2022).

The choice of stacking topology impacts not only the aggregate bandwidth (scaling linearly with number of usable TSV/hybrid-bond lanes) but also the energy per bit transferred (~0.66–0.88 pJ/bit in advanced 3D-DRAM, 5–10× lower than commodity DDR) (Cai et al., 13 Dec 2025, Li et al., 9 Apr 2026), thermal gradients (R_th scaling with layer count), and yield (N-stack yield drops exponentially with added layers) (Tam et al., 2020, Mo et al., 6 Apr 2026).
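The scaling relations in the paragraph above can be sketched with a first-order model. This is an illustrative sketch only: the function names, the 95% per-layer yield, and the per-lane data rate are assumptions, not values from the cited papers. The 1.8 TB/s and 0.88 pJ/bit figures in the usage example are taken from this section.

```python
def stack_yield(per_layer_yield: float, layers: int) -> float:
    """Compound yield if every stacked layer must be good (assumed independent)."""
    return per_layer_yield ** layers


def aggregate_bandwidth_bytes(lanes: int, gbit_per_lane: float) -> float:
    """Bandwidth in bytes/s; linear in the number of usable vertical lanes."""
    return lanes * gbit_per_lane * 1e9 / 8


def interface_power_watts(bandwidth_bytes: float, pj_per_bit: float) -> float:
    """Power spent moving data at a given bandwidth and energy per bit."""
    return bandwidth_bytes * 8 * pj_per_bit * 1e-12


if __name__ == "__main__":
    # Hypothetical 95% per-layer yield: the compound yield decays exponentially.
    for n in (1, 4, 8):
        print(f"{n} layers -> stack yield {stack_yield(0.95, n):.3f}")
    # 1.8 TB/s at 0.88 pJ/bit (both figures quoted in this section).
    print(f"{interface_power_watts(1.8e12, 0.88):.2f} W for data transfer")
```

Under these assumptions, moving 1.8 TB/s at 0.88 pJ/bit costs roughly 12.7 W, which shows why energy per bit matters as much as raw lane count.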

2. Architectural and Microarchitectural Design Principles

Memory-compute integration in 3D DRAM-stacked AI systems is realized via multiple design paradigms:

  • Processor-in-memory (PIM) blocks: NeuroCluster, for example, integrates 16 clusters (8 NeuroStream FP32 MAC coprocessors + 4 RISC-V PEs/cluster) into the HMC logic base, with scratchpad, DMA, and hardware tiling logic. It achieves 240 GFLOPS within a 2.5 W logic power budget; measured against the roughly 11 W consumed by the whole SMC device, the reported efficiency is 22.5 GFLOPS/W, about 3.5× that of contemporary GPUs (Azarkhish et al., 2017).
  • Near-memory processing (NMP): In Stratum, bank-contiguous blocks of matrices are processed in-situ by PUs at the edge of DRAM banks, each with a MAC array, psum SRAM, and tier-aware controller. System software places “hot” MoE experts and KV-cache in the fastest DRAM tiers (Pan et al., 6 Oct 2025).
  • Distributed core/bank mapping: Voxel demonstrates end-to-end designs where compute-layer AI cores are tiled under a DRAM bank grid, connected via vertical TSVs for local, high-bandwidth access, and mesh/torus NoC for cross-core communication (Liu et al., 29 Apr 2026, Wang et al., 2023).
  • Heterogeneous logic: Tasa partitions logic into high-performance cores (systolic arrays for GEMM), high-efficiency cores (MAC-trees for attention), and employs dynamic bandwidth-sharing, with fine-grain floorplan and thermal optimization (He et al., 10 Aug 2025).
  • Chiplet specialization: In LaMoSys3.5D, some chiplets are compute-dense (prefill), others bandwidth/capacity-dense (decode), with software targeting tensor-parallel and pipeline-parallel mapping across the package (Wang et al., 9 Dec 2025).

Core architectural guidelines validated in the literature include matching the PIM-node PE array size to local DRAM bank bandwidth, optimizing SRAM size per core (≥2 MB for memory-bound decode), sizing NoC link width so that communication overlaps with compute, and selecting core-group configurations that maximize DRAM utilization while avoiding NoC/thermal bottlenecks (Wang et al., 2023, Li et al., 9 Apr 2026, Liu et al., 29 Apr 2026).
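One way to reason about matching PE array size to local bank bandwidth is a roofline-style balance check. The sketch below is a generic roofline calculation, not a method taken from the cited papers; the function names and all numeric values (32 GB/s per bank, 2 FLOP/byte, 2 GFLOPS per PE) are hypothetical.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes: float,
                     intensity_flop_per_byte: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth x intensity."""
    return min(peak_flops, bandwidth_bytes * intensity_flop_per_byte)


def is_memory_bound(peak_flops: float, bandwidth_bytes: float,
                    intensity_flop_per_byte: float) -> bool:
    """True when the bandwidth roof lies below the compute roof."""
    return bandwidth_bytes * intensity_flop_per_byte < peak_flops


def balanced_pe_count(bank_bandwidth_bytes: float,
                      intensity_flop_per_byte: float,
                      flops_per_pe: float) -> int:
    """Largest PE count that the local bank bandwidth can keep busy."""
    sustained = bank_bandwidth_bytes * intensity_flop_per_byte
    return max(1, int(sustained // flops_per_pe))


if __name__ == "__main__":
    # Hypothetical node: 32 GB/s bank bandwidth, 2 FLOP/byte kernel, 2 GFLOPS per PE.
    print(balanced_pe_count(32e9, 2.0, 2e9), "PEs saturate the bank")
    print(is_memory_bound(1e12, 100e9, 1.0))
```

Adding PEs beyond the balanced count raises area and power without raising throughput for that kernel, which is the intuition behind specializing some logic for low-intensity decode and other logic for high-intensity prefill.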

3. Dataflow, Mapping, and Software Co-Design

Efficient mapping of DNN/LLM computation onto 3D DRAM-stacked architectures is characterized by:

  • DRAM-aware tiling: 4D tiling in SMC/NeuroStream enables tiles with row-major layouts, halo padding for single-DMA transfer, and partial-sum accumulation, maximizing row-buffer hits and operational intensity (OI) per layer (Azarkhish et al., 2017). Similarly, chiplets in LaMoSys3.5D employ the Direct-DRAM-Delivery (“D³”) dataflow, streaming tiles directly from DRAM or selectively staging them in SRAM for reuse (Wang et al., 9 Dec 2025).
  • Dimension-order mapping: Voxel shows that placing tiles that share data on spatially nearby cores in a mesh reduces average NoC hop-count and communication latency; software-aware tensor-to-bank mapping avoids DRAM row-buffer conflicts and achieves >80% peak utilization (Liu et al., 29 Apr 2026).
  • Parallelism strategies: DeepStack models and co-searches tensor, data, pipeline, expert, and sharding parallelisms, showing that incomplete schedule search irreversibly skews Pareto efficiency (e.g., missing expert parallelism produces SM-poor, power-walled configurations) (Mo et al., 6 Apr 2026).
  • End-to-end programming abstractions: ATLAS exposes unified system-level programming primitives: allocation, data movement (alloc, copy), GEMM/reduction/softmax kernels, DRAM-aware tensor declarations, explicit split_gemm/split_attention, and point-to-point/collective communication, matching directly to DRAM placement and NoC scheduling (Li et al., 9 Apr 2026).
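The locality argument behind dimension-order mapping can be illustrated with a toy placement experiment. This is a minimal sketch under stated assumptions, not Voxel's mapping algorithm: it assumes a 2D mesh with XY routing (hop count equals Manhattan distance) and a hypothetical workload in which each tile shares data with the next tile in sequence. All function names are illustrative.

```python
def xy_hops(a: tuple, b: tuple) -> int:
    """Hop count between two mesh coordinates under XY (dimension-order) routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def row_major_placement(tiles: list, cols: int) -> dict:
    """Place tiles left-to-right on every row."""
    return {t: divmod(i, cols) for i, t in enumerate(tiles)}


def snake_placement(tiles: list, cols: int) -> dict:
    """Reverse every other row so consecutive tiles stay mesh neighbours."""
    placement = {}
    for i, t in enumerate(tiles):
        r, c = divmod(i, cols)
        placement[t] = (r, c if r % 2 == 0 else cols - 1 - c)
    return placement


def average_sharing_hops(placement: dict, sharing_pairs: list) -> float:
    """Mean hop count over all pairs of tiles that exchange data."""
    total = sum(xy_hops(placement[u], placement[v]) for u, v in sharing_pairs)
    return total / len(sharing_pairs)


if __name__ == "__main__":
    tiles = list(range(16))                    # 16 tiles on a 4x4 mesh
    pairs = [(i, i + 1) for i in range(15)]    # assumed sharing pattern
    print(average_sharing_hops(row_major_placement(tiles, 4), pairs))
    print(average_sharing_hops(snake_placement(tiles, 4), pairs))
```

For this sharing pattern the snake placement keeps every communicating pair one hop apart, while row-major placement pays a four-hop penalty at each row wrap. Real mappers also have to account for bank conflicts and link contention, which this sketch ignores.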

Automated frameworks (NicePIM, Voxel, DeepStack) integrate DNN operator partitioning, mapping, and scheduling with design space exploration, reporting latency reductions of 37%, energy reductions of 28%, and speedups of up to 25× over GPU baselines for matched workloads (Wang et al., 2023, Liu et al., 29 Apr 2026, Mo et al., 6 Apr 2026).
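The effect of tile size on operational intensity, which DRAM-aware tiling and operator partitioning both try to exploit, can be estimated for a convolution tile as follows. This is a simplified sketch, not the tiling scheme of any cited system: it assumes a stride-1 K×K convolution, counts each input, weight, and output element as moved exactly once, and uses hypothetical function names and tile shapes.

```python
def halo_overhead(tile_h: int, tile_w: int, k: int) -> float:
    """Extra input elements fetched because of the halo, as a fraction of the tile."""
    return (tile_h + k - 1) * (tile_w + k - 1) / (tile_h * tile_w) - 1


def conv_tile_intensity(tile_h: int, tile_w: int, c_in: int, c_out: int,
                        k: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte moved for one output tile of a stride-1 KxK convolution."""
    in_elems = (tile_h + k - 1) * (tile_w + k - 1) * c_in   # input tile incl. halo
    w_elems = k * k * c_in * c_out                           # weights for the tile
    out_elems = tile_h * tile_w * c_out                      # output tile
    flops = 2 * tile_h * tile_w * c_in * c_out * k * k       # 1 MAC = 2 FLOPs
    return flops / ((in_elems + w_elems + out_elems) * bytes_per_elem)


if __name__ == "__main__":
    for side in (8, 32):
        print(side, halo_overhead(side, side, 3),
              round(conv_tile_intensity(side, side, 16, 16, 3), 2))
```

Larger tiles amortize both the halo and the weight traffic, so operational intensity rises with tile size until the tile no longer fits in the local scratchpad, a limit this sketch does not model.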

4. Performance, Energy, and Scalability

Benchmarking and full-scale simulation reveal that 3D DRAM-stacked AI systems consistently surpass prior architectures in bandwidth, latency, throughput, and energy efficiency:

| System | Throughput (tokens/s or TFLOPS) | Energy Efficiency | Benchmarks | Peak Temperature Δ (°C) |
|---|---|---|---|---|
| NeuroStream (SMC) | 240 GFLOPS/SMC, 955 GFLOPS/4x | 22.5 GFLOPS/W (~3.5× GPU) | Full ConvNet inference | Negligible (11 W total) |
| Sunrise | 25 TOPS, 1.8 TB/s, 4.5 GB on-chip | 2.08–27.7 TOPS/W | CNN, NLP, vision, after node scaling | Modeled up to N=2 stacks, thermal-limited |
| Stratum (Mono3D) | 8.29× GPU tokens/s (MoE LLMs) | 7.66× GPU (tokens/J) | Mixtral, Qwen2.5, etc. LLM | Modeled 1.6× slowest-fastest tier |
| Tasa | 2.85× A100, 1.33× Homo-3D | 2.07× A100 (Joules/token) | LLaMA-65B, GPT-3 66B, batch traces | Up to –9.37 (vs. homo-3D) |
| ATLAS (Cloud scenario) | up to 3.64× H200 latency, 2.53× speed | up to 6.66× energy | LLM decode, prefill | ≤85 °C thermal cap enforced |
| Voxel | 1.84× SPMD (compute-shift) | up to 25% energy-to-token gain | LLM + DiT-XL | Throttle >0.7 W/mm² |

Increasing stack height improves capacity/bandwidth linearly but is capped by Little’s law, TSV bandwidth, yield, and thermal budget. DeepStack finds that STPS (system tokens/sec) gain peaks at 8–9 layers; excess layers are beneficial mostly for energy-optimal points where added DRAM is sparsely accessed (Mo et al., 6 Apr 2026). Software–hardware co-search sweeps multi-dimensional trade-offs, balancing DRAM utilization, NoC BW/latency, per-core SRAM, and core NR/Nc for latency/efficiency Pareto fronts (Cai et al., 13 Dec 2025, Wang et al., 2023, Liu et al., 29 Apr 2026).
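The interaction between linear bandwidth scaling and the Little's law cap can be sketched as a simple minimum of two bounds. This is an illustrative model, not DeepStack's: the per-layer bandwidth, the number of outstanding requests, the request size, and the latency are all hypothetical, and the function names are assumptions.

```python
def littles_law_bandwidth(outstanding_requests: int, request_bytes: int,
                          latency_s: float) -> float:
    """Little's law: sustained throughput = bytes in flight / latency."""
    return outstanding_requests * request_bytes / latency_s


def sustained_bandwidth(layers: int, per_layer_bw_bytes: float,
                        outstanding_requests: int, request_bytes: int,
                        latency_s: float) -> float:
    """Peak bandwidth grows linearly with layers until concurrency caps it."""
    peak = layers * per_layer_bw_bytes
    cap = littles_law_bandwidth(outstanding_requests, request_bytes, latency_s)
    return min(peak, cap)


if __name__ == "__main__":
    # Hypothetical: 256 GB/s per layer, 1024 in-flight 64 B requests, 50 ns latency.
    for n in (2, 4, 6, 8, 12):
        bw = sustained_bandwidth(n, 256e9, 1024, 64, 50e-9)
        print(f"{n:2d} layers -> {bw / 1e12:.2f} TB/s")
```

With these made-up parameters the curve flattens once the logic die cannot keep enough requests in flight, so further layers add capacity but no usable bandwidth. The layer count at which this happens depends entirely on the assumed concurrency and latency.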

5. Thermal Management and System-level Co-Design

Thermal constraints are decisive in system scaling, achievable clock, and sustained bandwidth. Increased DRAM-stack height, central logic clustering, and large P-core counts steepen temperature gradients. Solutions include:

  • Heterogeneous core allocation: Tasa shows that embedding low-power E-cores in thermally stressed regions allows spatial temperature flattening, lowering peak by ≈9.4°C at iso-latency versus homogeneous layouts (He et al., 10 Aug 2025).
  • Cross-stack dynamic scheduling: Bandwidth-sharing mechanisms move KV-cache traffic between P- and E-cores in proportion to measured utilization, hiding data migration during normal operation (He et al., 10 Aug 2025).
  • Transient, stack-aware simulators: LaMoSys3.5D and DeepStack integrate event-driven thermal solvers. The temperature rise is computed as ΔT = P·R_th, and DRAM-refresh and logic-leakage overheads are updated iteratively until the thermal solution converges (Wang et al., 9 Dec 2025, Mo et al., 6 Apr 2026).
  • Liquid cooling optimization: LaMoSys3.5D models active pump flow rate as a variable for further R_th reduction within package-level thermal budgets (Wang et al., 9 Dec 2025).

In all systems, practical stack height (N_max) is ultimately constrained by the temperature for a given power density, stack geometry, and cooling technology (Tam et al., 2020, Cai et al., 13 Dec 2025).
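The iterative ΔT = P·R_th loop and the resulting stack-height limit can be sketched as a fixed-point iteration. This is a lumped, steady-state toy model rather than the solver used in LaMoSys3.5D or DeepStack. It assumes leakage doubles every fixed number of degrees and that both R_th and dynamic power grow linearly with layer count; every parameter value and function name is hypothetical. The 85 °C default cap mirrors the thermal cap quoted for ATLAS in the table above.

```python
def steady_state_temperature(p_dynamic_w: float, r_th_k_per_w: float,
                             t_ambient_c: float = 45.0, leak_ref_w: float = 2.0,
                             leak_doubling_c: float = 25.0, t_limit_c: float = 150.0,
                             tol_c: float = 1e-3, max_iter: int = 200) -> float:
    """Iterate T = T_amb + (P_dyn + P_leak(T)) * R_th until it converges."""
    t = t_ambient_c
    for _ in range(max_iter):
        # Assumed leakage model: doubles every `leak_doubling_c` degrees.
        p_leak = leak_ref_w * 2 ** ((t - t_ambient_c) / leak_doubling_c)
        t_new = t_ambient_c + (p_dynamic_w + p_leak) * r_th_k_per_w
        if t_new > t_limit_c:
            raise RuntimeError("thermal runaway: temperature exceeds model limit")
        if abs(t_new - t) < tol_c:
            return t_new
        t = t_new
    raise RuntimeError("thermal iteration did not converge")


def stack_temperature(layers: int, p_logic_w: float = 20.0, p_per_layer_w: float = 1.0,
                      r_base: float = 0.5, r_per_layer: float = 0.1) -> float:
    """Assumed linear growth of power and thermal resistance with layer count."""
    return steady_state_temperature(p_logic_w + p_per_layer_w * layers,
                                    r_base + r_per_layer * layers)


def max_stack_height(t_cap_c: float = 85.0, max_layers: int = 32) -> int:
    """Largest layer count whose converged temperature stays under the cap."""
    best = 0
    for n in range(1, max_layers + 1):
        try:
            if stack_temperature(n) > t_cap_c:
                break
        except RuntimeError:
            break
        best = n
    return best


if __name__ == "__main__":
    n_max = max_stack_height()
    print("N_max =", n_max, "at", round(stack_temperature(n_max), 1), "C")
```

Lowering R_th (for example through better cooling) or lowering power density shifts N_max upward in this model, which matches the qualitative dependence described above.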

6. DRAM Microarchitecture: Models and Design Exploration

Advances at the DRAM-array and bank level are leveraged for AI workloads:

  • Bank, subarray, and MAT-level customization: DreamRAM exposes a design space at MAT, subarray, and bank hierarchy, modeling wire pitch, capacitance, partial page activation, and subarray-level parallelism (SALP-all, SALP-groups) for bandwidth and energy optimization (Cai et al., 13 Dec 2025).
  • Routing scheme optimization: The Dataline-Over-MAT (DLOMAT) routing allows more main datalines over MAT with short CSLs, boosting peak MAT bandwidth by ~13% at marginal area and latency cost.
  • Partial page/activation: Enabling half/quarter-page activation reduces tRCD, bitline capacitance, and energy per access by up to 40–50% for sparsely accessed ML models (Cai et al., 13 Dec 2025).
  • Calibration against HBM3/2E: DreamRAM models closely match real HBM stacks to <16% error on bandwidth, latency, and energy.

Guidelines from simulation and analytic modeling converge on stack heights of ≤6 for low-latency (<60 ns) use, maximizing channels and subarray-level activation for high-concurrency ML/LLMs, and adopting routing schemes like DLOMAT on banks with highest access pressure (Cai et al., 13 Dec 2025, Li et al., 9 Apr 2026, Liu et al., 29 Apr 2026).

7. Future Directions and Design Guidelines

Design evidence points to several robust, differentiated trends and best practices:

  • Co-design across all system levels (hardware, dataflow, mapping) is essential: Partial exploration of parallelism or memory parameters produces irreversibly suboptimal architectures (Mo et al., 6 Apr 2026, Wang et al., 9 Dec 2025).
  • Heterogeneity (across chiplets/cores) outperforms monolithic “one-size-fits-all” designs: Specialization for prefill/decode or GEMM/attention produces higher energy efficiency and robust utilization across LLM workloads (Wang et al., 9 Dec 2025, He et al., 10 Aug 2025).
  • Compiler- and software-aware mapping achieves major gains in latency and bandwidth utilization: Software-guided tensor-to-bank mapping and dimension-ordered tile placement minimize row-buffer conflicts and NoC hops, reaching >80% of peak utilization (Liu et al., 29 Apr 2026).
  • Thermal- and power-aware DSE cannot be overlooked: High DRAM stack heights and core counts undermine peak throughput and yield unless integrated thermal models drive early pruning (Wang et al., 9 Dec 2025, Mo et al., 6 Apr 2026).
  • Open, silicon-validated performance models (ATLAS, DeepStack, Voxel, NicePIM) are pivotal for enabling reproducible, community-wide exploration of co-designed 3D DRAM-accelerator architectures (Li et al., 9 Apr 2026, Mo et al., 6 Apr 2026, Liu et al., 29 Apr 2026, Wang et al., 2023).

The continued evolution of 3D DRAM-stacked AI systems is likely to deliver ever-higher vertically integrated bandwidth, agile on-stack/near-stack processing, and judicious software–hardware symbiosis, driving a new plateau of efficiency and scalability for DNN and LLM inference (Azarkhish et al., 2017, Pan et al., 6 Oct 2025, Wang et al., 9 Dec 2025, Li et al., 9 Apr 2026, Cai et al., 13 Dec 2025).
