
DLOMAT: 3D DRAM MAT Routing Optimization

Updated 20 December 2025
  • DLOMAT is a custom routing scheme for 3D DRAM that repositions MDLs over the MAT to boost bandwidth without altering sense-amplifier topology.
  • It employs binary and one-hot CSL encoding along with repurposed routing tracks to enhance data access parallelism and minimize delays.
  • The design achieves up to 13% higher bandwidth in HPC and GPU scenarios while trading increased routing area and power for significant throughput gains.

Dataline-Over-MAT (DLOMAT) is a custom routing scheme developed for 3D die-stacked DRAM at the Macro-Array-Tile (MAT) granularity, as analytically modeled and evaluated within the DreamRAM framework. DLOMAT reorganizes the placement and connectivity of global data lines (MDLs) at the MAT level to improve bandwidth, offering a reallocation of routing resources that caters to application-specific memory demands in contexts such as high-performance computing (HPC) and server-class GPU workloads. The approach enables higher data-line fan-out, optimized physical layout, and access parallelism without affecting sense-amplifier and bit-cell circuitry.

1. Architectural Fundamentals

Within a MAT, conventional DRAM designs typically place the MDLs beside the sense amplifiers, with each of the 64 Column-Select-Lines (CSLs) multiplexing 8 local datalines (LDLs/MDLs) external to the cell array. DLOMAT re-routes the MDLs directly over the cell array in the CSL metal layer, increasing the number of MDLs per MAT without expanding the sense-amplifier array or raising the track count for global wordline/bitline routing. In one example, DLOMAT provides 32 MDLs with just 16 CSLs per MAT. The CSLs themselves are instead routed along the former LDL tracks, renamed Local-Select-Lines (LSLs), adjacent to each Bit-Line Sense Amplifier (BLSA).

The scheme also adopts binary encoding for most CSLs, which controls wire count and area, while deploying a single one-hot CSL per pump during multi-pumping cycles. The associated repeaters and decode logic relocate into the MDL-driver region freed up next to each BLSA. Physically, this topology changes the height budgeting: MDL drivers become part of the BLSA cell, and LSL tracks occupy the region formerly reserved for Local-Wordline-Drivers (LWDs). The net result is an expansion of attainable MAT-level bandwidth without major disruption to the cell structure (Cai et al., 13 Dec 2025).

2. Design Parameters and Customization

DreamRAM exposes extensive MAT-level configuration knobs for DLOMAT under its Tier E modeling. Key adjustable parameters include:

  • Number of MDLs ($n_{\mathrm{MDL}}$) per MAT: Increasing $n_{\mathrm{MDL}}$ scales raw bandwidth proportionally while impacting routing area and wire capacitance.
  • Number of CSLs per MAT ($n_{\mathrm{CSL}}$): Determined as $n_{\mathrm{CSL}} = \mathrm{BLs} / (\mathrm{MDLs} / \mathrm{pumps})$. Reducing $n_{\mathrm{CSL}}$ minimizes decode area but leads to longer multi-pumping cycles ($t_{\mathrm{CCDL}}$ increases).
  • CSL encoding (binary vs. one-hot): Binary encoding reduces area via lower wire count but demands more local decode logic, raising power. One-hot CSL per pump assures minimal CSL activation delay.
  • Wire pitch, width, and spacing ($p_{\mathrm{wire}}$, $w_{\mathrm{wire}}$, $s_{\mathrm{wire}}$): Technology-node-dependent. Increasing width and spacing lowers resistance and capacitance per unit length, improving latency and power while decreasing routing density.
  • Repeater spacing ($l_{\mathrm{rep}}$): Finer repeater placement cuts propagation delay but incurs area and leakage overhead.
  • Metal layer assignment: DLOMAT shifts MDLs to the CSL metal layer, freeing the narrow LWD layer for LSLs. Altering metal pitches (e.g., M2 vs. M3) directly affects resistance and capacitance values.

Such configurability enables designers to trade off area, power, bandwidth, and latency in pursuit of tailored memory subsystem performance (Cai et al., 13 Dec 2025).
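To make these knobs concrete, the following minimal sketch collects the Tier E parameters listed above into a single configuration object. The field names, default values, and the 512-bitline figure are illustrative assumptions and do not reflect DreamRAM's actual interface; only the $n_{\mathrm{CSL}} = \mathrm{BLs}/(\mathrm{MDLs}/\mathrm{pumps})$ relation is taken from the description above.

```python
from dataclasses import dataclass

@dataclass
class DlomatMatConfig:
    """Illustrative DLOMAT MAT-level knobs (names and defaults are assumptions)."""
    n_mdl: int = 32                     # MDLs routed over the MAT
    bitlines: int = 512                 # bitlines per MAT (assumed: 64 CSLs x 8 LDLs)
    pumps_per_atom: int = 1             # multi-pumping factor per access atom
    csl_encoding: str = "binary"        # "binary" or "one-hot" (one one-hot CSL per pump)
    wire_pitch_nm: float = 50.0         # p_wire, technology-node dependent (assumed)
    wire_width_nm: float = 25.0         # w_wire (assumed)
    wire_spacing_nm: float = 25.0       # s_wire (assumed)
    repeater_spacing_um: float = 100.0  # l_rep (assumed)
    mdl_metal_layer: str = "CSL"        # DLOMAT routes MDLs on the CSL metal layer

    @property
    def n_csl(self) -> int:
        # n_CSL = BLs / (MDLs / pumps), per the relation listed above
        return self.bitlines // (self.n_mdl // self.pumps_per_atom)

# With 512 bitlines, 32 MDLs, and single pumping, this reproduces the
# 16 CSLs per MAT quoted in Section 1.
print(DlomatMatConfig().n_csl)  # 16
```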

3. Analytical Models and Equations

DreamRAM implements the following analytical models to capture the timing, electrical, and energy behaviors of DLOMAT and its alternatives:

  • Wire Capacitance: $C_{\mathrm{wire}} = C_{\mathrm{per}\,\ell} \times \ell$
  • Wire Resistance: $R_{\mathrm{wire}} = R_{\mathrm{per}\,\ell} \times \ell$
  • Propagation Latency (repeater-balanced, lumped-RC): $t_{\mathrm{prop}} \approx \frac{1}{2} R_{\mathrm{wire}} C_{\mathrm{wire}} = \frac{1}{2} R_{\mathrm{per}\,\ell} C_{\mathrm{per}\,\ell} \ell^2$
  • Bank-Cycle Time: $t_{\mathrm{BCT}} = t_{\mathrm{CSL}} + t_{\mathrm{LSL}} + t_{\mathrm{MDL}} + t_{\mathrm{MDL,PRE}} + t_{\mathrm{DRV}}$, where the line delays scale with $C_{\mathrm{wire}}$ and $t_{\mathrm{DRV}}$ is a fixed driver delay.
  • Multi-pump CCD Latency: $t_{\mathrm{CCDL}} = t_{\mathrm{BCT}} \times \text{(pumps per atom)}$
  • Bandwidth per bank: $B = \dfrac{n_{\mathrm{MDL}}}{\text{pumps per atom}} \times f_{\mathrm{core}} \times \dfrac{\text{bits}}{\mathrm{MDL}} / 8$
  • Energy per bit: $\epsilon_{\mathrm{bit}} = \frac{1}{2} C_{\mathrm{wire}} V_{dd}^2$
  • Full-chip energy: $E = \sum_{\text{wires}} \frac{1}{2}\, \alpha\, n\, C_{\mathrm{per}\,\ell}\, \ell\, \Delta V_{\mathrm{int}}\, V_{\mathrm{ext}}$

Here, $\alpha$ is the activity factor, and $\Delta V_{\mathrm{int}}$ and $V_{\mathrm{ext}}$ denote the internal and external voltage domains, respectively. All wire parameters are derived from node-specific technology files (Cai et al., 13 Dec 2025).
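A minimal sketch of how these relations compose is shown below, assuming placeholder wire and timing parameters rather than DreamRAM's calibrated technology files; the function names and example values are illustrative only.

```python
def wire_delay_ps(length_mm: float, r_per_mm_ohm: float, c_per_mm_pF: float) -> float:
    """Repeater-balanced lumped-RC delay: 0.5 * R_per_l * C_per_l * l^2 (ohm * pF = ps)."""
    return 0.5 * r_per_mm_ohm * c_per_mm_pF * length_mm ** 2

def bank_cycle_time_ps(t_csl, t_lsl, t_mdl, t_mdl_pre, t_drv) -> float:
    """t_BCT = t_CSL + t_LSL + t_MDL + t_MDL,PRE + t_DRV (all in ps)."""
    return t_csl + t_lsl + t_mdl + t_mdl_pre + t_drv

def ccd_latency_ps(t_bct_ps: float, pumps_per_atom: int) -> float:
    """t_CCDL = t_BCT * (pumps per atom)."""
    return t_bct_ps * pumps_per_atom

def bandwidth_GBps(n_mdl: int, pumps_per_atom: int, f_core_GHz: float, bits_per_mdl: int = 1) -> float:
    """B = (n_MDL / pumps-per-atom) * f_core * bits-per-MDL / 8 (bytes per second)."""
    return n_mdl / pumps_per_atom * f_core_GHz * bits_per_mdl / 8

def energy_per_bit_pJ(c_wire_pF: float, vdd: float) -> float:
    """eps_bit = 0.5 * C_wire * Vdd^2 (pF * V^2 = pJ)."""
    return 0.5 * c_wire_pF * vdd ** 2

# Example with placeholder numbers (assumed, not taken from the paper):
t_mdl = wire_delay_ps(length_mm=2.0, r_per_mm_ohm=500.0, c_per_mm_pF=0.2)         # 200 ps
t_bct = bank_cycle_time_ps(t_csl=150, t_lsl=80, t_mdl=t_mdl, t_mdl_pre=120, t_drv=100)
print(ccd_latency_ps(t_bct, pumps_per_atom=2))                                     # multi-pump column cycle (ps)
print(bandwidth_GBps(n_mdl=32, pumps_per_atom=2, f_core_GHz=1.0))                  # GB/s per bank
print(energy_per_bit_pJ(c_wire_pF=0.4, vdd=1.1))                                   # ~0.24 pJ/bit
```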

4. Performance Trade-offs and Optimization

Exploration of the Tier E design space within DreamRAM indicates:

  • A standalone DLOMAT MAT achieves up to approximately 13% higher bandwidth than an equivalently sized conventional MAT design.
  • In case studies targeting server/GPU use-cases with iso-capacity, iso-bandwidth, and iso-power constraints, DLOMAT configurations result in:
    • 66% higher bandwidth at matched power/capacity,
    • 100% higher capacity at matched bandwidth/power,
    • 45% lower power and energy per bit at matched bandwidth/capacity.
  • Across five design axes, DLOMAT points occupy the high-bandwidth frontier, trading increased area and power (attributed to extra routing capacitance) for substantial throughput gains.
  • For designs prioritizing area or energy minimization, conventional MAT routing with lower $n_{\mathrm{MDL}}$ resides on the relevant Pareto frontier, while DLOMAT is favored where maximal bandwidth is the objective (Cai et al., 13 Dec 2025). A minimal dominance check illustrating this frontier notion is sketched below.
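The sketch below makes the frontier language precise via Pareto dominance over design points. The three axes shown (bandwidth to maximize, area and power to minimize) are a simplification of the five-axis space explored in the paper, and all point names and values are invented for illustration.

```python
from typing import NamedTuple

class DesignPoint(NamedTuple):
    name: str
    bandwidth: float  # higher is better
    area: float       # lower is better
    power: float      # lower is better

def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on at least one."""
    no_worse = a.bandwidth >= b.bandwidth and a.area <= b.area and a.power <= b.power
    strictly_better = a.bandwidth > b.bandwidth or a.area < b.area or a.power < b.power
    return no_worse and strictly_better

def pareto_front(points: list[DesignPoint]) -> list[DesignPoint]:
    """Points not dominated by any other point form the frontier."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Invented example: a high-bandwidth DLOMAT-style point and a lean conventional
# point can both sit on the frontier, each winning on a different objective.
points = [
    DesignPoint("dlomat_32mdl",   bandwidth=1.13, area=1.10, power=1.08),
    DesignPoint("conv_8mdl",      bandwidth=1.00, area=1.00, power=1.00),
    DesignPoint("conv_oversized", bandwidth=1.00, area=1.20, power=1.15),
]
print([p.name for p in pareto_front(points)])  # ['dlomat_32mdl', 'conv_8mdl']
```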

5. Model Calibration and Validation

DreamRAM’s wire and timing models underlying DLOMAT have been calibrated using published parameters from HBM3 and HBM2E DRAM:

| Device | Bandwidth (GB/s): reported / model | Capacity (GB): reported / model | Die Area (mm²): reported / model |
|--------|------------------------------------|---------------------------------|----------------------------------|
| HBM3   | 1024 / 1024                        | 16 / 16                         | 121 / 111                        |
| HBM2E  | 640 / 741                          | 16 / 16                         | 110 / 109.3                      |

Model error is within ±16% for bandwidth and below 10% for die area; predicted energy per bit (0.98–3.01 pJ/bit for HBM3) and closed-row latency (~64 ns) fall within expected industry ranges, although no public reference values are available for direct comparison. Wire models ($C_{\mathrm{per}\,\ell}$, $R_{\mathrm{per}\,\ell}$) are further matched against through-silicon-via (TSV) data, supporting the model's reliability (Cai et al., 13 Dec 2025).
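As a quick sanity check on the stated bounds, the relative errors implied by the reported/model pairs in the table can be recomputed directly; this is plain arithmetic on the values above, not an additional result from the paper.

```python
def rel_error(model: float, reported: float) -> float:
    """Signed relative error of the model value against the reported value."""
    return (model - reported) / reported

# HBM2E bandwidth: (741 - 640) / 640 ~= +15.8%, within the stated +/-16% bound.
print(f"HBM2E bandwidth error: {rel_error(741, 640):+.1%}")
# HBM3 die area: (111 - 121) / 121 ~= -8.3%, within the stated <10% bound.
print(f"HBM3 die area error:   {rel_error(111, 121):+.1%}")
```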

6. Contextual Significance and Key Findings

DLOMAT’s primary contribution is enabling up to a ~13% increase in MAT-level bandwidth without altering bit-cell dimensioning or sense-amplifier topology. The requisite extra MDLs and encoded CSLs do elevate wire capacitance and thus power, as well as routing area, but the model results demonstrate net throughput gains under conditions where bandwidth is the limiting resource—typified by GPU and HPC workloads.

Multi-objective optimization reveals that DLOMAT is the preferred approach when conventional MAT routing cannot meet required bandwidth targets, but it is less optimal for designs where cost, area, or power constraints dominate. The validity of the scheme is supported by DreamRAM’s close correspondence with experimental HBM3/HBM2E specifications. A plausible implication is that DLOMAT is especially suited for co-designed memory subsystems that can afford marginal increases in area/power for significant bandwidth uplift, while ultra-low-power DRAM designs are more efficiently served by traditional MAT routing architectures (Cai et al., 13 Dec 2025).
