DreamRAM: Custom 3D Die-Stacked DRAM Design
- DreamRAM is a modeling tool for 3D die-stacked DRAM architectures that exposes granular design knobs from inter-bank to MAT levels.
- It employs analytical wire and routing models calibrated against HBM2E and HBM3, and identifies designs achieving +66% bandwidth, +100% capacity, or –45% energy per bit, each under iso-constraints on the other metrics.
- The tool facilitates systematic design space exploration and Pareto-optimal DRAM design by optimizing trade-offs in bandwidth, capacity, latency, and energy.
DreamRAM is a fine-grained, parameterized modeling tool for custom 3D die-stacked DRAM architectures, designed to enable comprehensive analytical exploration of bandwidth, capacity, latency, energy, and area trade-offs at multiple levels of the DRAM hierarchy. Developed in response to the inflexible design constraints of commodity DRAM solutions, DreamRAM exposes architectural and physical design knobs at the inter-bank, bank, subarray, and MAT levels. Its core features include analytical wire and routing modeling, calibration against published industry HBM2E and HBM3 chips, and a routing-aware Dataline-Over-MAT (DLOMAT) scheme to mitigate MAT-level congestion. Within the DreamRAM framework, designs can be identified that achieve +66% bandwidth, +100% capacity, or –45% energy per bit, each under iso-metric constraints on the remaining dimensions, substantiating the value of aggressive co-optimization across hierarchical tiers (Cai et al., 13 Dec 2025).
1. Motivation and Context
Emerging workloads in HPC, graphics, and machine learning place diverse and conflicting requirements on DRAM subsystems, ranging from ultra-high bandwidth to maximal density and stringent energy budgets. Commodity DRAM families (DDR, LPDDR, GDDR, HBM) encode fixed trade-offs, which are suboptimal for application-specific deployments. 3D die stacking—where multiple dies are vertically integrated using through-silicon vias (TSVs)—offers new degrees of freedom in silicon area and channel bandwidth. However, the absence of open, granular modeling frameworks precludes systematic exploration of these now-vast design spaces. DreamRAM was created to fill this gap, supporting robust evaluation and co-design of memory hierarchy, microarchitecture, and physical routing down to the MAT level (Cai et al., 13 Dec 2025).
2. Hierarchical Parameterization and Customization Knobs
DreamRAM organizes the DRAM hierarchy into four main levels, each with exposed configurable parameters:
- Inter-Bank Level: Stack height (number of dies), pseudo-channel count, bank-group tiling, global bus width, TSV allocations, and channel-DQ multiplexing.
- Bank Level: Number of subarrays, MATs per subarray, repair subarrays, selection of Subarray-Level Parallelism (SALP) modes (SALP-groups vs. SALP-all), offset-cancellation vs. baseline sense amplifiers.
- Subarray Level: Partial-page activation (two main modes: half-page and subchannel), multi-pump arrangements for assembling full data “atoms” over multiple cycles, and control over master dataline (MDL) allocation.
- MAT Level: Number of wordlines (WLs) and bitlines (BLs), local and master datalines (LDLs/MDLs) per MAT, architectural overheads from OD-ECC and sense-amp isolation, and the option of DLOMAT routing—routing a subset of MDLs over the vertical cell array to relieve lateral congestion.
Table 1 illustrates key customizable parameters:
| Hierarchy Level | Example Knobs | Purpose |
|---|---|---|
| Inter-Bank | Dies, channels, BGBUS width | Aggregate bandwidth, stack density |
| Bank | Subarrays, SALP mode | Parallelism, row conflict avoidance |
| Subarray | Partial-page, multi-pump | Energy/bandwidth/latency trade-offs |
| MAT | DLOMAT, ECC, MDLs/LDLs | Routing efficiency, area utilization |
DLOMAT (Dataline-Over-MAT) routing enables up to ~13% per-MAT bandwidth improvement, with modest (∼3%) area overhead, by re-allocating MDLs over the array vertically rather than squeezing them laterally beneath the cell array (Cai et al., 13 Dec 2025).
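As a concrete illustration of how such a knob hierarchy could be expressed, the sketch below encodes the four levels as a configuration object; all field names, defaults, and value ranges are hypothetical placeholders rather than DreamRAM's actual interface.

```python
from dataclasses import dataclass

@dataclass
class StackConfig:
    """Hypothetical inter-bank-level knobs (illustrative names only)."""
    num_dies: int = 8            # stack height
    pseudo_channels: int = 16    # pseudo-channel count
    bgbus_width_bits: int = 256  # global bank-group bus width
    tsvs_per_channel: int = 128  # TSV allocation

@dataclass
class BankConfig:
    """Hypothetical bank-level knobs."""
    subarrays_per_bank: int = 64
    mats_per_subarray: int = 32
    salp_mode: str = "groups"    # "none", "groups", or "all"
    offset_cancel_sa: bool = True

@dataclass
class SubarrayConfig:
    """Hypothetical subarray-level knobs."""
    partial_page: str = "half"   # "full", "half", or "subchannel"
    pumps_per_atom: int = 2      # multi-pump cycles to assemble one data atom

@dataclass
class MatConfig:
    """Hypothetical MAT-level knobs."""
    wordlines: int = 1024
    bitlines: int = 1024
    mdls_per_mat: int = 8
    dlomat: bool = False         # route a subset of MDLs over the cell array

@dataclass
class DramDesign:
    stack: StackConfig
    bank: BankConfig
    subarray: SubarrayConfig
    mat: MatConfig

# Example: a bandwidth-oriented design point
design = DramDesign(StackConfig(num_dies=12), BankConfig(salp_mode="all"),
                    SubarrayConfig(pumps_per_atom=4), MatConfig(dlomat=True))
print(design.stack.num_dies, design.mat.dlomat)
```

Grouping the knobs per hierarchy level mirrors the tiered sweeps discussed in Section 4, where each tier adds one level's parameters to the search.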
3. Analytical Models for Physical and Performance Metrics
DreamRAM analytically captures wire scaling and timing relationships with first-order RC models of the following form:
- Wire capacitance: $C_{\text{wire}} = c_{\text{len}} \cdot \ell$, where the per-unit-length capacitance $c_{\text{len}}$ aggregates parallel-plate and coupling terms set by pitch and spacing.
- Wire resistance: $R_{\text{wire}} = \rho\,\ell / (W \cdot T)$, with resistivity $\rho$, routed length $\ell$, wire width $W$, and thickness $T$.
- RC delay: $\tau_{\text{wire}} \approx \tfrac{1}{2}\, R_{\text{wire}} C_{\text{wire}}$ (first-order distributed Elmore delay).
These equations parameterize each metal segment according to process node, pitch, routed length, and topology.
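The following minimal sketch evaluates first-order segment models of this kind; the resistivity, per-length capacitance, and the 0.5 distributed-Elmore factor are illustrative assumptions, not DreamRAM calibration values.

```python
# First-order wire RC model for a routed metal segment.
# All constants are illustrative placeholders, not DreamRAM calibration data.

def wire_resistance(length_um: float, width_um: float, thickness_um: float,
                    resistivity_ohm_um: float = 0.022) -> float:
    """R = rho * L / (W * T), in ohms (effective Cu resistivity placeholder)."""
    return resistivity_ohm_um * length_um / (width_um * thickness_um)

def wire_capacitance(length_um: float, cap_per_um_fF: float = 0.2) -> float:
    """Lumped per-unit-length capacitance (plate + coupling), in fF."""
    return cap_per_um_fF * length_um

def wire_delay_ps(length_um: float, width_um: float, thickness_um: float) -> float:
    """Distributed Elmore delay ~ 0.5 * R * C for a uniform RC line."""
    r = wire_resistance(length_um, width_um, thickness_um)   # ohms
    c = wire_capacitance(length_um) * 1e-15                  # farads
    return 0.5 * r * c * 1e12                                # picoseconds

# Halving a routed segment length reduces the distributed delay roughly
# quadratically, which is why shortening high-traffic MDLs pays off.
print(wire_delay_ps(500, 0.05, 0.1), wire_delay_ps(250, 0.05, 0.1))
```

Because the distributed delay grows roughly quadratically with routed length, shortening a high-traffic segment (as DLOMAT does for a subset of MDLs) yields a disproportionate delay and energy benefit.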
DreamRAM models timing including row-miss latency, whose activation and precharge components are each split into signal-path and bitline portions, and whose data-transfer component is a function of TSV stack height and die-interior wire length. Bank cycle time is limited by CSL, LDL, MDL, precharge, and driver delays, each decomposed into wire RC and logic contributions. Routing congestion at the MAT level is addressed by the DLOMAT scheme, which decreases the routed segment length of high-traffic wires and increases wire pitch, thereby reducing capacitance.
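Under the same first-order assumptions, the latency decomposition described above could be assembled as in the following sketch; the component names and numeric values are hypothetical stand-ins for the signal-path, bitline, and interconnect terms DreamRAM derives from its wire models.

```python
# Illustrative assembly of row-miss latency and bank cycle time from
# signal-path, bitline, and interconnect components (all values hypothetical).

def row_miss_latency_ns(t_pre_signal, t_pre_bitline,
                        t_act_signal, t_act_bitline,
                        t_data_tsv, t_data_die) -> float:
    """Row miss = precharge + activate (each split into signal-path and
    bitline portions) + data transfer over the TSV stack and die interior."""
    t_precharge = t_pre_signal + t_pre_bitline
    t_activate = t_act_signal + t_act_bitline
    t_data = t_data_tsv + t_data_die
    return t_precharge + t_activate + t_data

def bank_cycle_time_ns(t_csl, t_ldl, t_mdl, t_precharge, t_driver) -> float:
    """Bank cycle time limited by CSL, LDL, MDL, precharge, and driver delays,
    summed here along an assumed critical path; each term would itself
    decompose into wire RC plus logic delay."""
    return t_csl + t_ldl + t_mdl + t_precharge + t_driver

print(row_miss_latency_ns(3.0, 7.0, 4.0, 8.0, 1.5, 2.5))
print(bank_cycle_time_ns(2.0, 3.0, 4.0, 10.0, 1.0))
```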
Calibration against HBM3 and HBM2E chips demonstrates sub-16% error in bandwidth and sub-9% error in die area. For instance, DreamRAM predicts HBM3 bandwidth and capacity to exactly match published figures (1024 GB/s, 16 GB) and die area within –8.3% (Cai et al., 13 Dec 2025).
4. Design-Space Exploration and Trade-Offs
A five-tier sweep (A: inter-bank only; B: + bank level; C: + subarray level; D: + MAT level; E: + DLOMAT) yields ∼2.8 million design points, with Tier E (full knob set) expanding the convex hull volume by orders of magnitude beyond Tier A. Table 2 from the source quantifies that inter-bank-only parameterization covers just 0.0001% of the feasible design space.
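A tiered sweep of this kind can be mimicked by taking the Cartesian product of each tier's knob ranges, as in the sketch below; the knob names and ranges are placeholders, so the point counts only illustrate how each added tier multiplies the space rather than reproducing the ∼2.8 million-point sweep.

```python
from itertools import product

# Placeholder knob ranges per tier; the real ranges are DreamRAM-specific.
TIER_KNOBS = {
    "A_inter_bank": {"dies": [4, 8, 12], "channels": [8, 16, 32]},
    "B_bank":       {"salp": ["none", "groups", "all"], "subarrays": [32, 64]},
    "C_subarray":   {"partial_page": ["full", "half", "sub"], "pumps": [1, 2, 4]},
    "D_mat":        {"mdls_per_mat": [4, 8], "od_ecc": [False, True]},
    "E_routing":    {"dlomat": [False, True]},
}

def enumerate_design_points(tiers):
    """Yield one dict per design point: the Cartesian product of all knob
    values across the selected tiers."""
    names, ranges = [], []
    for tier in tiers:
        for knob, values in TIER_KNOBS[tier].items():
            names.append(knob)
            ranges.append(values)
    for combo in product(*ranges):
        yield dict(zip(names, combo))

tier_a_only = list(enumerate_design_points(["A_inter_bank"]))
tier_e_full = list(enumerate_design_points(list(TIER_KNOBS)))
print(len(tier_a_only), len(tier_e_full))  # each added tier multiplies the space
```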
Pareto-optimal solutions for server GPU workloads demonstrate:
- +66% bandwidth (1710 GB/s) at iso-capacity/iso-power
- +100% capacity (32 GB) at iso-bandwidth/iso-power
- –45% energy/bit at iso-bandwidth/iso-capacity
Designs achieving high bandwidth typically select more channels/DQs, narrow bank structures, high core frequencies, DLOMAT routing, and aggressive multi-pump scheduling. High-capacity configurations favor more dies, increased subarrays/MATs, and, when energy constrained, partial-page activation. Low-energy-per-bit designs adopt smaller page sizes and shorter wire runs, while area-efficient implementations limit subarray count and MAT width and avoid DLOMAT unless it is essential for bandwidth.
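Pareto-optimal points over such a metric space can be selected with a straightforward dominance filter, sketched below; the metric names are hypothetical, the sample values merely echo the headline figures above, and the O(n²) scan is kept for clarity rather than efficiency.

```python
from typing import Dict, List

# Bandwidth and capacity are maximized, energy per bit is minimized
# (metric names are illustrative, not DreamRAM's output schema).
def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b in every metric and better in one."""
    at_least = (a["bw_gbps"] >= b["bw_gbps"] and
                a["cap_gb"] >= b["cap_gb"] and
                a["pj_per_bit"] <= b["pj_per_bit"])
    strictly = (a["bw_gbps"] > b["bw_gbps"] or
                a["cap_gb"] > b["cap_gb"] or
                a["pj_per_bit"] < b["pj_per_bit"])
    return at_least and strictly

def pareto_front(points: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep only points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

designs = [
    {"bw_gbps": 1710, "cap_gb": 16, "pj_per_bit": 3.5},
    {"bw_gbps": 1030, "cap_gb": 32, "pj_per_bit": 3.5},
    {"bw_gbps": 1030, "cap_gb": 16, "pj_per_bit": 1.9},
    {"bw_gbps":  900, "cap_gb": 16, "pj_per_bit": 3.9},   # dominated
]
print(pareto_front(designs))
```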
DreamRAM projects application-specific Pareto frontiers, shifting outwards as finer-grained parameters become accessible (Cai et al., 13 Dec 2025).
5. Practical Scenarios, Recommendations, and Limitations
For throughput-bound GPUs, Tier D/E strategies (maximizing channels, DLOMAT, high multi-pump) are optimal. Capacity-intensive servers benefit from increased dies, subarrays, and MATs (Tier C–D), with partial-page activation reserved for tight energy budgets. Edge devices with power/area constraints favor Bank/Subarray level customization (Tier B–C) and modest DLOMAT deployment. Latency-sensitive CPUs minimize bitline capacitance, stack height, and exploit SALP-all to hide page conflicts.
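These recommendations could be summarized as a simple lookup from workload class to tier and knob emphasis, as in the sketch below; the tier labels follow the sweep in Section 4, while the dictionary keys and phrasing are illustrative.

```python
# Illustrative mapping from workload class to suggested customization tier
# and knob emphasis, summarizing the recommendations above.
RECOMMENDATIONS = {
    "throughput_gpu":  {"tier": "D/E", "emphasis": ["max channels/DQs", "DLOMAT",
                                                    "aggressive multi-pump"]},
    "capacity_server": {"tier": "C-D", "emphasis": ["more dies", "more subarrays/MATs",
                                                    "partial-page if energy-bound"]},
    "edge_device":     {"tier": "B-C", "emphasis": ["bank/subarray knobs",
                                                    "modest DLOMAT"]},
    "latency_cpu":     {"tier": "B-C", "emphasis": ["short bitlines", "low stack height",
                                                    "SALP-all"]},
}

for workload, rec in RECOMMENDATIONS.items():
    print(f"{workload}: tier {rec['tier']}, focus on {', '.join(rec['emphasis'])}")
```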
Partial-page activation can reduce energy per bit by up to 60% at modest performance cost, and bank-level SALP alone expands the design convex hull by 60× relative to inter-bank-only parameterization.
MAT-level routing combined with DLOMAT delivers peak-bandwidth configurations, with an area overhead that is justified primarily in streaming and bandwidth-bound scenarios.
A plausible implication is that judicious co-optimization across all hierarchical levels allows memory architects to escape traditional commodity DRAM constraints, moving toward tailored, workload-specific DRAM architectures (Cai et al., 13 Dec 2025).
6. Calibration, Validation, and Key Insights
DreamRAM is validated against industry-cited HBM3 and HBM2E devices with reported errors below 16% (bandwidth) and 9% (area). Its wire, area, timing, and energy models are grounded in actual process and routing geometries. Calibration parameters include die count, stack height, MDL/LDL configuration, TSV geometry, and supply voltage, with outputs cross-checked against published manufacturer data (Cai et al., 13 Dec 2025).
Key insights:
- Fine-grained knobs advance the multi-dimensional Pareto frontier dramatically.
- MAT-level routing and DLOMAT unlock aggregate bandwidth that is otherwise unattainable within a given area budget.
- Partial-page activation is a robust method for significant energy reduction.
- Each parameter tier (inter-bank, bank, subarray, MAT, routing) independently affects the metric space; only holistic optimization achieves true Pareto-optimality.
DreamRAM delivers a unified, parametrically rich framework for systematic 3D-stacked DRAM design space navigation, constituting a foundational resource for architects targeting next-generation heterogeneous, application-specialized memory subsystems (Cai et al., 13 Dec 2025).