3D-DRAM-Stacked AI Systems

Updated 20 April 2026
  • 3D-DRAM-stacked AI systems are advanced architectures that vertically integrate DRAM and logic dies using TSVs and hybrid bonding to meet high bandwidth and energy efficiency demands.
  • They employ varied stacking techniques—TSV-based, hybrid bonding, and Mono3D—to optimize interconnect density, reduce latency, and scale capacity from tens of GB to multiple TB.
  • These systems support near-memory processing and system-level co-design, enabling optimized performance for deep neural networks, large language models, and MoE workloads.

Three-dimensional (3D) DRAM-stacked AI systems integrate memory (specifically DRAM) dies and logic dies in a vertical stack, interconnected via through-silicon vias (TSVs), micro-bumps, or hybrid bonding. This architectural paradigm is deployed to satisfy the increasing bandwidth, capacity, and energy-efficiency demands of AI workloads, spanning deep neural networks (DNNs), large language models (LLMs), and specialized dataflows such as Mixture-of-Experts (MoE) and biologically plausible neural circuits. Multiple research avenues now coalesce under this topic, ranging from device-level electrical/thermal characterization and DRAM modeling to system-level parallelization, distributed scheduling, heterogeneous integration, and near-memory/processing-in-memory (PIM) architectures.

1. Architectural Principles and Integration Mechanisms

3D DRAM-stacked AI systems exploit the vertical integration of logic and memory for tight proximity, massive interconnect density, and low-energy data movement. Multiple stack and bonding approaches are in use:

  • TSV-based stacking: Copper TSVs (5–10 µm pitch) and micro-bumps connect multiple DRAM dies atop logic, enabling vertical bandwidth scaling over sub-millimeter signal paths and cutting external DRAM delays by 5–10× versus PCB-traced DDR (Kurshan et al., 2024).
  • Hybrid bonding: Fine-pitch hybrid bonds (down to ~1 µm) surpass TSVs in I/O density (>110,000 pins/mm²), allowing for denser logic–DRAM integration and improved vertical connectivity (Li et al., 9 Apr 2026).
  • Monolithic 3D DRAM (Mono3D): Stacks hundreds to thousands of horizontal DRAM layers with vertically routed bitlines, integrating logic via face-to-face hybrid bonding and eliminating conventional TSVs’ area and process cost constraints (Pan et al., 6 Oct 2025).
  • 2.5D/3.5D IC: Heterogeneous chiplets (compute-rich, memory-rich) are embedded on a silicon interposer, with each logic die locally bonded to 3D DRAM (Wang et al., 9 Dec 2025). This topology supports mesh/NoC-based inter-chiplet communication and scaling.

Vertical stacking supports single-stack memory bandwidth in the range of 6–30 TB/s (He et al., 10 Aug 2025, Pan et al., 6 Oct 2025), aggregate capacities from tens of GB (edge/embedded) to multiple TB (cloud, multi-stack deployment), and on-die interconnects providing sub-30 ns DRAM access latencies (Azarkhish et al., 2017).
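
As a rough, back-of-envelope illustration of these scaling relationships, the Python sketch below estimates vertical I/O density from bond pitch and single-stack bandwidth from an assumed data-I/O count and per-link rate; all numeric parameters are hypothetical choices for illustration, not values taken from the cited papers.

```python
# Back-of-envelope estimates for vertical I/O density and single-stack
# bandwidth. All numeric parameters are illustrative assumptions.

def io_density_per_mm2(pitch_um: float) -> float:
    """Bond pads per mm^2 for a square grid at the given pitch (in µm)."""
    return (1000.0 / pitch_um) ** 2

def stack_bw_tb_s(n_data_ios: int, rate_gbps: float) -> float:
    """Aggregate bandwidth in TB/s: n links, each at rate Gb/s."""
    return n_data_ios * rate_gbps / 8 / 1000.0  # Gb/s -> GB/s -> TB/s

print(f"TSV @ 7.5 um pitch:    {io_density_per_mm2(7.5):>9,.0f} pads/mm^2")
print(f"Hybrid @ 3.0 um pitch: {io_density_per_mm2(3.0):>9,.0f} pads/mm^2")
# Hypothetical wide interface: 64K data I/Os at 4 Gb/s per link.
print(f"Single-stack bandwidth: {stack_bw_tb_s(65536, 4.0):.1f} TB/s")
```

With these assumed numbers, a 3 µm hybrid-bond pitch gives roughly 111,000 pads/mm² (consistent with the >110,000 pins/mm² figure above), and 64K data I/Os at 4 Gb/s land near the upper end of the quoted 6–30 TB/s per-stack range.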

2. Fine-Grained Hardware Characterization and Trade-Offs

High-performance 3D DRAM-stacked AI systems demand rigorous modeling of thermal, power, noise, and reliability phenomena as stacking increases physical interaction and integration density:

  • Thermal management and characterization: Heat propagation across layers is modeled using 3D steady-state conduction (Fourier’s law), with measured gradients showing +5–10 °C per active compute tile; stacked architectures are sensitive to placement, power density, and cooling strategy (liquid jets, RDL spreaders, micro-fluidic channels) (Kurshan et al., 2024, He et al., 10 Aug 2025).
  • Lifetime reliability: Electromigration and thermal cycling are modeled (e.g., $\mathrm{MTTF} \propto J^{-n} e^{E_a/(kT)}$), with dense TSV farms prone to delamination and ∼30% reduced stack lifetime under high-cycling, high-temperature conditions (Kurshan et al., 2024).
  • Inter-layer electromagnetic coupling: Coupling capacitance ($C_c \approx \varepsilon_r \varepsilon_0 A/d$) and induced voltages ($V_n(t) \approx Z_c\, i_n(t)$) are notable for sensitive DRAM/cache layers subject to aggressive logic-layer (GPU/CPU) switching (Kurshan et al., 2024).
  • Bandwidth–thermal–reliability trade-offs: Bandwidth increases with TSV density, but hotspots, mutual shielding, and local temperature rises necessitate isolation via low-k dielectrics, shielding rings for noise, and stack-height caps (e.g., ≤3–4 layers above the heat sink at 4 W/cm²) (Kurshan et al., 2024, Mo et al., 6 Apr 2026).

Optimizing stack height, die assignment (placing DRAM closest to heat sink), and TSV layout (solid Cu-TSVs for signal, thermal farm isolation) is critical to balancing throughput, power, and product reliability.
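
The lifetime model above follows a Black's-equation form; the minimal Python sketch below (with assumed activation energy $E_a$ and current-density exponent $n$, chosen for illustration only) shows how temperature rise and current density each trade against MTTF.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K

def relative_mttf(j: float, t_kelvin: float,
                  j_ref: float = 1.0, t_ref: float = 358.0,
                  n: float = 2.0, e_a: float = 0.7) -> float:
    """MTTF relative to a reference point, per MTTF ∝ J^-n · exp(Ea/kT).

    n = 2.0 and Ea = 0.7 eV are assumed illustrative values, not
    parameters from the cited characterization work.
    """
    j_term = (j / j_ref) ** (-n)
    t_term = math.exp(e_a / (K_B * t_kelvin) - e_a / (K_B * t_ref))
    return j_term * t_term

# A 10 K hotspot rise at constant current density roughly halves lifetime:
print(f"{relative_mttf(1.0, 368.0):.2f}x MTTF at +10 K")
# Doubling current density at constant temperature quarters it (J^-2):
print(f"{relative_mttf(2.0, 358.0):.2f}x MTTF at 2x J")
```

The exponential temperature term is why even modest per-tile gradients of +5–10 °C matter for stack-lifetime budgeting.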

3. Memory System Modeling, Bandwidth, and Customization

Custom DRAM modeling frameworks (e.g., DreamRAM (Cai et al., 13 Dec 2025)) expose a rich set of architectural knobs at each DRAM hierarchy:

  • MAT/Subarray/Bank/Channel-level tuning: Channel counts, MAT routing schemes (DLOMAT), sense-amplifier types, partial-page activation, subarray-level parallelism (SALP), and MDL/LDL counts are all parameterized to align DRAM behavior with AI workload requirements.
  • Analytical metrics (evaluated in the sketch below):
    • Energy/bit: $E_{\rm bit} = \frac{1}{2}\alpha (C' l)\,\Delta V_{\rm int} V_{\rm ext}$
    • Bank cycle/latency: $t_{\rm bank}$, $t_{\rm CCDL}$
    • Row-miss latency: $t_{\rm miss} = t_{\rm RP} + t_{\rm RCD} + t_{\rm CL} + \text{cross-die delays}$
    • Bandwidth: $BW = \frac{{\rm TSVs} \times \text{I/O rate}}{t_{\rm CCDS}}$
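
To make these formulas concrete, a minimal Python sketch evaluates them for one candidate configuration; every timing, capacitance, and TSV-count value is a hypothetical placeholder, not a DreamRAM output.

```python
# Evaluate the analytical DRAM metrics above for one hypothetical
# configuration. All parameter values are illustrative placeholders.

def energy_per_bit_pj(alpha: float, c_ff_per_um: float, wire_um: float,
                      dv_int: float, v_ext: float) -> float:
    """E_bit = 1/2 * alpha * (C' * l) * dV_int * V_ext, returned in pJ."""
    c_total_ff = c_ff_per_um * wire_um
    return 0.5 * alpha * c_total_ff * dv_int * v_ext * 1e-3  # fF·V² -> pJ

def row_miss_latency_ns(t_rp: float, t_rcd: float, t_cl: float,
                        t_cross_die: float) -> float:
    """t_miss = t_RP + t_RCD + t_CL + cross-die delays (all in ns)."""
    return t_rp + t_rcd + t_cl + t_cross_die

def bandwidth_gb_s(n_tsvs: int, io_rate_gbps: float) -> float:
    """BW = TSVs × I/O rate, with t_CCDS folded into the per-TSV rate."""
    return n_tsvs * io_rate_gbps / 8  # Gb/s -> GB/s

print(f"E/bit:  {energy_per_bit_pj(0.5, 0.2, 2000, 0.6, 1.1):.3f} pJ")
print(f"t_miss: {row_miss_latency_ns(14, 14, 14, 4):.0f} ns")
print(f"BW:     {bandwidth_gb_s(4096, 8.0):.0f} GB/s")
```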

Iso-bandwidth, iso-capacity, and iso-power optimization yields up to +66% bandwidth, +100% capacity, and −45% energy/bit versus commodity HBM, with area scaling captured via per-die wire-pitch and routing-area formulas (Cai et al., 13 Dec 2025). DLOMAT increases per-MAT bandwidth by up to 13% at iso-area by optimizing the placement of MDL data lines.

Row interleaving, bank-level concurrency, and large logical rows (≥32 KB) further maximize activation locality and effective link throughput (Li et al., 9 Apr 2026, Lee et al., 2015).

4. Near-Memory Processing and Accelerator Architectures

Processing-in-memory (PIM) and near-memory processing (NMP) architectures leverage the tight proximity of logic and DRAM in 3D stacks:

  • Node architecture: The logic die is tiled into PIM nodes, each bonded to DRAM banks and equipped with PE arrays (e.g., 32×32 MACs), local buffers, and a DRAM controller (Wang et al., 2023, Oliveira et al., 2022).
  • Dataflow optimization: Weight-stationary, direct-DRAM-delivery (D³), and partial sum accumulation models exploit on-stack bandwidth and reduce off-stack communication (Wang et al., 9 Dec 2025, Pan et al., 6 Oct 2025).
  • Mapping and scheduling: Multi-dimensional mapping (branch parallelism, layer partitioning, weight replication, data layout selection), deep kernel learning, and ILP-based data schedulers jointly tune hardware and mapping for high utilization and low energy-delay product (Wang et al., 2023); a simplified partitioning sketch follows this list.
  • Heterogeneous functional units: Matrix and vector units are provisioned in workload-tailored ratios (cloud: 32:1, edge: 8:1), with matrix units dominant for FC/GEMM and vector units for attention/softmax (Li et al., 9 Apr 2026). MoE/NMP architectures direct memory-bound compute to E-cores and high-intensity GEMM to P-cores (He et al., 10 Aug 2025, Pan et al., 6 Oct 2025).
  • Energy and performance: Reported gains include 37% lower latency and 28% less energy for DNN inference, with up to 25× speedup over a GPU at batch size 1 (Wang et al., 2023); 128 TFLOPS of NMP compute is reached at 1 GHz with Mono3D DRAM and in-situ, tier-aware expert placement (Pan et al., 6 Oct 2025).
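
As a simplified illustration of these mapping decisions, the sketch below (Python; node count, PE-array size, and layer shape are hypothetical) partitions one fully connected layer across PIM nodes in a weight-stationary fashion and reports per-node work:

```python
import math

# Weight-stationary partitioning of an (M x K) @ (K x N) layer across
# PIM nodes, each with a 32x32 MAC array. All sizes are hypothetical.

def map_fc_layer(m: int, k: int, n: int, n_nodes: int,
                 pe_rows: int = 32, pe_cols: int = 32):
    """Split output columns across nodes; weights stay pinned per node."""
    cols_per_node = math.ceil(n / n_nodes)
    # PE-array tiles a node must iterate over for its weight shard:
    tiles = (math.ceil(m / pe_rows) * math.ceil(k / pe_cols)
             * math.ceil(cols_per_node / pe_cols))
    macs_per_node = m * k * cols_per_node
    return cols_per_node, tiles, macs_per_node

cols, tiles, macs = map_fc_layer(m=512, k=4096, n=4096, n_nodes=16)
print(f"{cols} output cols/node, {tiles} PE tiles, {macs/1e6:.0f} M MACs/node")
```

Because each node's weight shard never leaves its bonded DRAM banks, only activations and partial sums cross the stack, which is the property the D³ and partial-sum-accumulation dataflows above exploit.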

5. System-Level Co-Design: Distributed Inference, Scaling, and Scheduling

State-of-the-art design space exploration and simulation frameworks guide system-level AI accelerator co-design:

  • Full-stack simulators: ATLAS (Li et al., 9 Apr 2026) and DeepStack (Mo et al., 6 Apr 2026) model device-level thermals, compute, NoC, and DRAM banks, with cycle-level accuracy and joint hardware–parallelism–schedule exploration spanning $2.5\times10^{14}$ design points.
  • Distributed inference strategies: DeepStack supports seven-dimensional parallelism (TP, EP, SP, etc.), with dual-stage network abstraction for logical-to-physical routing, tile-wise compute–communication overlap, and parametric thermal/power budgeting. Batch size is often a more critical architectural divider than prefill/decode separation (Mo et al., 6 Apr 2026); a heuristic sketch of batch-size-driven parallelism selection follows this list.
  • 3.5D-IC and chiplet-based serving: LaMoSys3.5D arranges compute-rich and bandwidth/capacity-rich chiplets per phase (prefill/decode), on a 2D interposer mesh, dynamically clustering PEs and partitioning dataflow via D³ for optimal utilization. End-to-end, system-level constrained Bayesian optimization yields up to 62% higher throughput-per-watt and 4.87× lower latency (Wang et al., 9 Dec 2025).
  • Thermal-aware scheduling and bandwidth sharing: Tasa introduces on-die P/E heterogeneity and bandwidth-sharing scheduling to concurrently level power density and map memory-bound work, showing up to 5.5–9.4 °C ΔT reduction and 2.85× throughput boost vs GPU/previous PIM (He et al., 10 Aug 2025).
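
The batch-size observation above can be made concrete with a toy selector (Python); the regime thresholds and device count below are hypothetical heuristics, not DeepStack's actual search procedure:

```python
# Toy illustration of batch-size-driven parallelism selection across
# stacked-memory devices: small batches favor TP, large batches favor
# DP/EP. Thresholds and device counts are hypothetical assumptions.

def pick_parallelism(batch: int, n_devices: int) -> dict:
    if batch < n_devices:          # latency-bound: shard each layer (TP)
        return {"TP": n_devices, "DP": 1, "EP": 1}
    if batch < 8 * n_devices:      # mixed regime: split TP and DP
        tp = max(1, n_devices // 4)
        return {"TP": tp, "DP": n_devices // tp, "EP": 1}
    # throughput-bound: replicate the model and route experts (DP + EP)
    return {"TP": 1, "DP": n_devices // 4, "EP": 4}

for batch in (1, 64, 1024):
    print(batch, pick_parallelism(batch, n_devices=16))
```

A real framework searches this space jointly with schedule and thermal constraints rather than applying fixed thresholds, but the regime structure is the same.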

6. Benchmark Achievements and Application Domains

3D DRAM-stacked AI systems have demonstrated leading performance in multiple domains:

| System / Study | Application | Peak Bandwidth / Perf. | Energy Efficiency | Latency / Throughput |
|---|---|---|---|---|
| Sunrise (Tam et al., 2020) | Training/inference | 1.8 TB/s @ 40 nm | 2.08 TOPS/W (measured) | 25 TOPS @ 12 W (40 nm); 50 TOPS/W (7 nm) |
| NicePIM (Wang et al., 2023) | DNN inference | -- | Energy ↓28% | Latency ↓37%; 25× lower batch-1 latency vs. V100 |
| Stratum (Pan et al., 6 Oct 2025) | LLM MoE | 19.0–30.3 TB/s (Mono3D) | Up to 7.66× vs. GPU | 4.48–8.29× decoding throughput vs. GPU |
| Tasa (He et al., 10 Aug 2025) | LLM decode | 6 TB/s per stack | 2.07× vs. GPU | ΔT ↓5.55–9.37 °C; 2.85× speedup |
| ATLAS (Li et al., 9 Apr 2026) | LLM inference | 1 TB/s per core (hybrid) | 0.75 tokens/s/W | ≤6.37% simulation error vs. silicon |
| eBrainII (Stathis et al., 2019) | Spiking RL | 200 TB/s (system) | 0.054 TFLOP/W | 162 TFLOPS @ 3 kW; 4× GPU BW/W |

Applications span transformer training, model inference (BERT, LLaMA, GPT-3), MoE models, and petascale real-time simulation for brain models.

7. Design Guidelines and Best Practices

Designers are guided by the following measurable recommendations (a few are encoded as a simple rule checker after this list):

  • Stack height: Limit to 3–4 active logic layers or interleave low-power DRAM layers to cap hotspot temperature (Kurshan et al., 2024, Mo et al., 6 Apr 2026).
  • Placement: DRAM close to the heat sink, logic/compute layers above, with thermal spreaders in between.
  • DRAM knob selection: Use DreamRAM or similar to select per-task configurations (full-page, DLOMAT, SALP-all, OCSA, channel count) (Cai et al., 13 Dec 2025).
  • Thermal co-design: Simulate and tune the floorplan, core microarchitecture, core ratio, and NoC/NoP width early, and define stack-height caps in system-level DSE (He et al., 10 Aug 2025, Wang et al., 9 Dec 2025, Mo et al., 6 Apr 2026).
  • Parallelism: Shift from TP at small batch sizes to DP/PP/EP at large batch sizes, and tune schedules to overlap compute and communication efficiently (Mo et al., 6 Apr 2026).
  • Run-time management: Apply per-tile DVFS, dynamic bandwidth sharing, and real-time LUT-driven placement for energy and temperature balance (He et al., 10 Aug 2025).
  • Noise/Signal integrity: Deploy TSV ground rings, differential signaling, and strategic shield insertion (Kurshan et al., 2024).
  • Reliability: Use sparse TSV configurations, moderate bank concurrency, and thermal-aware refresh scheduling to maximize MTTF (Pan et al., 6 Oct 2025).
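
The Python sketch below encodes a few of these guidelines as a simple rule checker; the configuration fields and their thresholds-as-code are assumptions layered on the recommendations above, not a published tool.

```python
# Minimal design-rule checker for a handful of the guidelines above.
# Thresholds follow the text (3-4 logic layers, 4 W/cm^2, DRAM nearest
# the heat sink); the config schema itself is a hypothetical example.

def check_stack_config(cfg: dict) -> list[str]:
    issues = []
    if cfg["logic_layers"] > 4:
        issues.append("more than 3-4 active logic layers: hotspot risk")
    if cfg["power_density_w_cm2"] > 4 and cfg["logic_layers"] > 3:
        issues.append(">4 W/cm^2 with tall stack: cap height or add spreaders")
    if cfg["dram_position"] != "heat_sink_side":
        issues.append("place DRAM dies closest to the heat sink")
    return issues

cfg = {"logic_layers": 5, "power_density_w_cm2": 5.0,
       "dram_position": "middle"}
for issue in check_stack_config(cfg):
    print("WARN:", issue)
```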

These best practices are corroborated by cross-stack performance models and empirical silicon validation, forming the substantive basis for next-generation 3D-DRAM-stacked AI system architectures.


References:

  • Azarkhish et al., 2017
  • Cai et al., 13 Dec 2025
  • He et al., 10 Aug 2025
  • Kurshan et al., 2024
  • Lee et al., 2015
  • Li et al., 9 Apr 2026
  • Mo et al., 6 Apr 2026
  • Oliveira et al., 2022
  • Pan et al., 6 Oct 2025
  • Stathis et al., 2019
  • Tam et al., 2020
  • Wang et al., 2023
  • Wang et al., 9 Dec 2025
