Chiplet-based Memory Module Overview
- Chiplet-based memory modules are memory subsystems partitioned into multiple optimized chiplets interconnected via high-speed interfaces like UCIe and CXL.
- They integrate various memory types (DRAM, SRAM, RRAM) and processing-in-memory logic to support advanced AI/ML and HPC workloads.
- Key innovations include scalable inter-chiplet protocols, modular integration, and significant performance and cost improvements over monolithic designs.
A chiplet-based memory module is a memory subsystem constructed by partitioning memory and related logic into multiple optimized chiplets, typically connected via an advanced interposer or high-speed in-package interconnect. This architectural paradigm leverages advances in heterogeneous integration, network-on-package topologies, and protocol standardization (UCIe, CXL, CHI), and is designed to overcome scalability, area, yield, bandwidth, and energy constraints inherent to monolithic memory hierarchies. Chiplet-based memory modules implement memory technologies such as DRAM, SRAM, RRAM (ReRAM), or even processing-in-memory (PIM) logic as disaggregated, modular chiplets, which enables tailoring process nodes, integrating compute/logic functions, and providing scalable bandwidth for AI/ML and HPC workloads (Wang et al., 19 Nov 2025, Kiyawat et al., 15 Nov 2025, Sharma et al., 7 Oct 2025, Krishnan et al., 2021, Paulin et al., 21 Jun 2024, Scheffler et al., 13 Jan 2025, Peng et al., 2023).
1. Architectural Organization
Chiplet-based memory modules employ several integration approaches:
- 2.5D and MCoOI Integration: Multiple chiplets (memory, logic, compute) are mounted side-by-side on a silicon interposer, which facilitates high-density, low-latency die-to-die communication. For example, Hemlet integrates analog CIM (ACIM, RRAM) and digital CIM (DCIM, SRAM) chiplets, plus a digital intermediate-data-processing (IDP, SRAM-based) chiplet, via a 2D mesh NoP (Wang et al., 19 Nov 2025). Sangam physically separates DRAM "bank" chiplets and logic chiplets on a passive interposer, providing direct logic-bank point-to-point links (Kiyawat et al., 15 Nov 2025).
- On-Package Memory via UCIe: Multiple memory chiplets (DRAM, SRAM) or logic-die mediators are serially interfaced with compute chiplets using UCIe (Universal Chiplet Interconnect Express), enabling asymmetric, scalable, and protocol-flexible on-package memory modules (Sharma et al., 7 Oct 2025).
- Multi-Channel HBM/DRAM Assemblies: Accelerator systems (e.g., Occamy, SIAM) deploy multiple HBM or DRAM stacks as distinct chiplets, each with dedicated controllers on the compute chiplets, with traffic distributed via wide PHYs and hierarchical routers/NoCs across a silicon interposer (Paulin et al., 21 Jun 2024, Scheffler et al., 13 Jan 2025, Krishnan et al., 2021).
- On-Chip SRAM Memory Fabrics: Modular AI supercomputer designs (Chiplet Cloud) utilize distributed SRAM-dominated CC-MEM chiplets for in-package storage, directly interfaced to compute clusters (Peng et al., 2023).
Typical Memory Module Chiplet Subtypes
| Chiplet Type | Example Memory Tech | Function/Role |
|---|---|---|
| ACIM | RRAM | Static parameter VMM/MLP weights |
| DCIM | SRAM | Dynamic VMM (attention, dynamic weights) |
| DRAM banklet | DRAM | Capacity-optimized storage array |
| Logic chiplet | 7 nm logic/SRAM/SIMD | PIM compute, controller, reduction |
| IDP/Buffer | SRAM | Global scratchpad, SIMD processing |
These modules are further interconnected by package-level 2D mesh or bus networks, high-speed SerDes, or standards-based PHYs (CXL, UCIe) (Wang et al., 19 Nov 2025, Kiyawat et al., 15 Nov 2025, Sharma et al., 7 Oct 2025).
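To make the organization above concrete, the following is a minimal sketch of how such a module could be described as a configuration object. The chiplet subtypes follow the table above, while the class names, fields, and example values are illustrative assumptions, not any published API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List

class ChipletType(Enum):
    """Chiplet subtypes from the table above."""
    ACIM = auto()          # RRAM analog compute-in-memory (static weights)
    DCIM = auto()          # SRAM digital compute-in-memory (dynamic VMM)
    DRAM_BANKLET = auto()  # capacity-optimized DRAM bank array
    LOGIC = auto()         # PIM compute, controller, reduction logic
    IDP_BUFFER = auto()    # SRAM global scratchpad / SIMD processing

class Interconnect(Enum):
    NOP_MESH = auto()      # package-level 2D mesh network-on-package
    UCIE = auto()          # standards-based die-to-die PHY
    CXL = auto()           # CXL.Mem-attached module

@dataclass
class Chiplet:
    kind: ChipletType
    capacity_mb: float     # usable memory capacity on this chiplet
    peak_bw_gbs: float     # die-to-die bandwidth budget (GB/s)

@dataclass
class MemoryModule:
    chiplets: List[Chiplet] = field(default_factory=list)
    fabric: Interconnect = Interconnect.NOP_MESH

    def total_capacity_mb(self) -> float:
        return sum(c.capacity_mb for c in self.chiplets)

# Example: a toy 2.5D module with one RRAM ACIM chiplet and one SRAM IDP buffer.
module = MemoryModule(
    chiplets=[Chiplet(ChipletType.ACIM, capacity_mb=3.0, peak_bw_gbs=256.0),
              Chiplet(ChipletType.IDP_BUFFER, capacity_mb=8.0, peak_bw_gbs=256.0)],
    fabric=Interconnect.NOP_MESH,
)
print(f"Total on-package capacity: {module.total_capacity_mb()} MB")
```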
2. Memory Hierarchy and Subarray Structure
Chiplet-based modules expose explicit, multi-tiered memory organization:
- Subarray Granularity: Each memory chiplet (e.g., ACIM or DRAM banklet) is partitioned into arrays or subarrays (e.g., 256×256 RRAM or multiple DRAM banks), with shared local peripheral logic (e.g., MUX, ADC for ACIM) (Wang et al., 19 Nov 2025, Krishnan et al., 2021).
- Tile and PE Grouping: Subarrays aggregate into processing elements (PEs), which are grouped into tiles. For example, Hemlet’s ACIM chiplet organizes 16 PEs per tile, with 6 subarrays per PE, collectively forming a three-level buffer hierarchy (Wang et al., 19 Nov 2025); a worked capacity example follows this list.
- Bank/Channel Partitioning: HBM and DRAM chips expose extensive parallelism via multi-bank, multi-channel partitioning. For instance, Occamy utilizes 8 HBM2E channels per stack, with each channel subdivided into banks and pseudo-channels, offering >1 TB/s bandwidth per module (Paulin et al., 21 Jun 2024, Scheffler et al., 13 Jan 2025).
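As a worked example of the hierarchy above, the sketch below composes the 256×256 subarray, 6-subarrays-per-PE, and 16-PEs-per-tile figures into a per-chiplet cell count. The tile count per chiplet is an assumption, chosen so that the total reproduces the 24 Mb RRAM capacity listed for Hemlet in the Section 7 table; it is not a published parameter.

```python
# Hierarchy parameters quoted above for an RRAM ACIM chiplet (Hemlet);
# TILES_PER_CHIPLET is an assumption for this example.
SUBARRAY_ROWS, SUBARRAY_COLS = 256, 256
SUBARRAYS_PER_PE = 6
PES_PER_TILE = 16
TILES_PER_CHIPLET = 4  # assumed; 4 tiles happen to reproduce 24 Mib of cells

cells_per_subarray = SUBARRAY_ROWS * SUBARRAY_COLS               # 65,536
subarrays = SUBARRAYS_PER_PE * PES_PER_TILE * TILES_PER_CHIPLET  # 384
total_cells = cells_per_subarray * subarrays                     # 25,165,824

print(f"subarrays per chiplet : {subarrays}")
print(f"bit-cells per chiplet : {total_cells / 2**20:.0f} Mib "
      f"(matches the 24 Mb RRAM entry in the Section 7 table)")
```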
3. Inter-Chiplet Interconnects and Protocols
High-bandwidth, low-latency, and scalable chiplet interconnects are essential:
- Network-on-Package (NoP): For Hemlet, a 2D mesh NoP connects chiplets via 8 SerDes links/node, with up to 256 GB/s per chiplet (Wang et al., 19 Nov 2025). SIAM also employs a mesh of passive interposer links (32 lanes at 250 MHz) (Krishnan et al., 2021).
- Serial Protocols: UCIe links in modern designs employ flit-based serialization and asymmetrical data lane allocation, supporting JEDEC-compliant DRAM protocols (LPDDR6/HBM4), CXL.Mem, or CHI with configurable bandwidth, latency, and protocol adaptation (Sharma et al., 7 Oct 2025, Kiyawat et al., 15 Nov 2025).
- Bandwidth and Latency: CXL-attached modules (Sangam) operate at x8 PCIe 6.0/CXL Gen 4 (102.4 GB/s per module) (Kiyawat et al., 15 Nov 2025). UCIe can deliver per-lane 32 GT/s, achieving >256 GB/s per group with sub-10 ns end-to-end latency (Sharma et al., 7 Oct 2025). Hybrid chiplets exploit aggregate channelization, e.g., Occamy’s dual HBM2E modules deliver up to 3.3 TB/s (Paulin et al., 21 Jun 2024).
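The bandwidth figures in this list follow directly from lane counts and signaling rates; below is a minimal sketch of that arithmetic. The 64-lane group and the 90% flit-efficiency factor are assumptions chosen for illustration, not values from the cited papers.

```python
def link_bandwidth_gbs(lanes: int, gt_per_s: float, efficiency: float = 1.0) -> float:
    """Raw unidirectional bandwidth in GB/s for a group of serial lanes.

    lanes      -- number of data lanes in the link or lane group
    gt_per_s   -- per-lane signaling rate in GT/s (1 bit per transfer)
    efficiency -- protocol/flit efficiency factor (1.0 = raw)
    """
    return lanes * gt_per_s * efficiency / 8.0

# 64 UCIe lanes at 32 GT/s give 256 GB/s raw, matching the ">256 GB/s per
# group" figure above (the 64-lane count is an assumption, not a quoted spec).
print(link_bandwidth_gbs(lanes=64, gt_per_s=32.0))                  # 256.0

# With ~90% flit efficiency, the same lane group still clears ~230 GB/s payload.
print(link_bandwidth_gbs(lanes=64, gt_per_s=32.0, efficiency=0.9))  # 230.4
```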
4. Compute-in-Memory, PIM, and Buffering
Chiplet-based memory modules increasingly incorporate CIM/PIM for near-data compute:
- Analog and Digital CIM: Heterogeneous integration enables high-density RRAM-based analog MACs for static layers (ACIM) and flexible, SRAM-based digital MACs for dynamic layers (DCIM), as in Hemlet’s architecture (Wang et al., 19 Nov 2025). Analog CIM supports up to 7-bit precision at roughly 0.5 pJ per MAC; DCIM is area-flexible but incurs higher read/write energy (a worked energy example follows this list).
- Processing-in-Memory (PIM) Logic: Sangam separates dense DRAM banks (in DRAM chiplets) from PIM-capable logic (in center-stripe chiplet), supporting local systolic arrays, SIMD units, bank-local SRAM scratchpads, and reduction trees (Kiyawat et al., 15 Nov 2025).
- Intermediate Data Processing: The IDP chiplet, as in Hemlet, organizes SRAM banks for cross-chiplet SIMD, LayerNorm, GELU, and residual additions (Wang et al., 19 Nov 2025).
- SRAM-Based Fast Memory Modules: Chiplet Cloud’s CC-MEM implements high-bandwidth, low-latency SRAM arrays tightly co-located with compression/decompression hardware, matching SIMD cores’ access patterns (Peng et al., 2023).
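To put the quoted ~0.5 pJ/MAC analog figure in context, here is a back-of-the-envelope sketch of the energy for one subarray-sized 256×256 VMM. The digital-CIM value is taken from the upper end of the 0.5–1.5 pJ/MAC range in the Section 7 table and is only a stand-in, since the text states only that DCIM energy is higher.

```python
# Energy per MAC: ~0.5 pJ for analog CIM is quoted above; 1.5 pJ is the upper
# end of the 0.5-1.5 pJ/MAC range in the Section 7 table, used here as a
# stand-in for the costlier digital CIM path.
ACIM_PJ_PER_MAC = 0.5
DCIM_PJ_PER_MAC = 1.5
ROWS, COLS = 256, 256          # one subarray-sized vector-matrix multiply

macs = ROWS * COLS             # 65,536 multiply-accumulates
print(f"ACIM 256x256 VMM: {macs * ACIM_PJ_PER_MAC / 1e3:.1f} nJ")  # ~32.8 nJ
print(f"DCIM 256x256 VMM: {macs * DCIM_PJ_PER_MAC / 1e3:.1f} nJ")  # ~98.3 nJ
```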
5. Performance, Energy, and Scaling Models
Closed-form models and empirically validated results characterize fundamental performance and scaling traits:
- Throughput, Latency, and Energy:
  - Closed-form models for chiplet-level CIM throughput, memory-access latency, and energy per operation (ACIM/DCIM) are given in (Wang et al., 19 Nov 2025); detailed cost models separately account for DRAM, logic-die, and package/test contributions (Sharma et al., 7 Oct 2025).
  - HBM2E peak bandwidth is the product of channel count, per-channel width, and per-pin data rate (Paulin et al., 21 Jun 2024); see the bandwidth sketch after this list.
  - CC-MEM per-chiplet bandwidth: 1.5–4.2 TB/s, with ~1 ns SRAM access latency (Peng et al., 2023).
- Bandwidth Density and Latency Improvement: UCIe-based modules reach linear bandwidth densities of 658 GB/s/mm, with 7.5 ns end-to-end latency (versus 18–20 ns for HBM/LPDDR) and 0.25 pJ/bit energy (versus 0.9–2.8 pJ/bit for DRAM/HBM) (Sharma et al., 7 Oct 2025).
- Empirical Outcomes:
  - Hemlet: 3×–8× subarray speedup and 1.5×–2× overall chiplet speedup via group-level parallelism (GLP); 8.68 TOPS throughput; 3.86 TOPS/W; NoP bandwidth utilization up to 70% (vs. 40% baseline) (Wang et al., 19 Nov 2025).
  - Sangam: 2.8–4.2× speedup over an H100 GPU in LLM inference latency, 10× decode throughput, and 12× energy savings (Kiyawat et al., 15 Nov 2025).
  - SIAM: up to 130× energy-efficiency improvement over a V100 GPU on DNNs (Krishnan et al., 2021).
  - Occamy: up to 3.3 TB/s sustained bandwidth, area/power cost reductions, FPU utilization >80% for stencil/dense and >40% for sparse codes (Paulin et al., 21 Jun 2024, Scheffler et al., 13 Jan 2025).
  - Chiplet Cloud: 97× TCO/token improvement over a GPU cloud; 1.7× larger sparse models at the same memory capacity (Peng et al., 2023).
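A minimal sketch of the peak-bandwidth arithmetic referenced in the list above: peak HBM bandwidth is channel count times per-channel width times per-pin data rate. The per-stack parameters are JEDEC-typical HBM2E values, and the four-stack-per-module grouping is an assumption chosen only to reproduce the figures in the Section 7 table, not a configuration stated in the cited papers.

```python
def hbm_peak_bw_gbs(channels: int, bits_per_channel: int, gbps_per_pin: float) -> float:
    """Peak bandwidth of one HBM stack in GB/s: channels x width x pin rate / 8."""
    return channels * bits_per_channel * gbps_per_pin / 8.0

# One HBM2E stack: 8 channels x 128-bit at 3.2 Gb/s/pin (JEDEC-typical values).
per_stack = hbm_peak_bw_gbs(channels=8, bits_per_channel=128, gbps_per_pin=3.2)
print(f"per stack  : {per_stack:.1f} GB/s")             # 409.6 GB/s

# Aggregating four such stacks reproduces the 1638 GB/s per-module entry in
# the Section 7 table; the 4-stack grouping is an assumption here.
print(f"per module : {4 * per_stack:.1f} GB/s")          # 1638.4 GB/s
print(f"dual module: {8 * per_stack / 1e3:.2f} TB/s")    # ~3.28 TB/s ("up to 3.3 TB/s")
```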
6. Design Trade-Offs, Generalization, and Future Directions
Chiplet-based memory modules admit several key trade-offs and extensibility paths:
- Analog vs. Digital CIM: ACIM provides upwards of 5–10× higher area density at the cost of write endurance and high reprogramming energy; DCIM offers flexibility but has higher area/energy per operation (Wang et al., 19 Nov 2025). Sangam and SIAM demonstrate that decoupling dense logic from the DRAM array avoids severe density penalties and enables advanced PEs (Kiyawat et al., 15 Nov 2025, Krishnan et al., 2021).
- Cost/Area/Yield Optimization: Spreading memory and compute across multiple smaller chiplets raises manufacturing yield, since per-die yield falls steeply with die area, and delivers up to 60% cost reduction vs. monolithic dies for large AI models (Krishnan et al., 2021, Peng et al., 2023); see the yield-model sketch after this list.
- Protocol and Integration Standardization: Adoption of UCIe, CXL.Mem, and CHI enables vendor-agnostic, scalable interfaces across logic, DRAM, and NVM chiplets (Sharma et al., 7 Oct 2025).
- Applicability Beyond DNNs: Group-level parallelism and cross-chiplet mapping schemes generalize to CNNs, GNNs, graph and diffusion models (Wang et al., 19 Nov 2025).
- 3D/Heterogeneous Stacking: Emerging directions include 3D stacking for further area and wire-length compression, hybrid photonic/electrical links, and integration of nonvolatile memory (ReRAM, MRAM) with native chiplet interfaces (Sharma et al., 7 Oct 2025, Kiyawat et al., 15 Nov 2025).
- Programmable In-Package Memory Networks: Dynamic compression, decode, and all-reduce protocols (as in CC-MEM), and flexible mapping via analytic models, support robust scaling to exascale AI deployments (Peng et al., 2023).
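The yield argument in the list above rests on the fact that die yield falls steeply with die area. Below is a minimal sketch using the standard negative-binomial yield model; the defect density, clustering parameter, and die areas are illustrative assumptions, and the cited papers may use a different yield or cost model.

```python
def die_yield(area_mm2: float, d0_per_mm2: float = 0.002, alpha: float = 3.0) -> float:
    """Negative-binomial die-yield model: Y = (1 + A*D0/alpha)^(-alpha).

    A standard formula in chiplet cost studies; the D0 and alpha values here
    are illustrative assumptions, not figures from the cited papers.
    """
    return (1.0 + area_mm2 * d0_per_mm2 / alpha) ** (-alpha)

MONO_MM2 = 800.0                     # one monolithic die (illustrative)
CHIPLET_MM2, N_CHIPLETS = 100.0, 8   # same total silicon split into 8 chiplets

y_mono, y_chip = die_yield(MONO_MM2), die_yield(CHIPLET_MM2)
# Silicon cost per good unit scales as area / yield; defective chiplets are
# discarded individually, so only the small die's yield enters the cost.
cost_mono = MONO_MM2 / y_mono
cost_chip = N_CHIPLETS * CHIPLET_MM2 / y_chip
print(f"monolithic yield {y_mono:.1%}, per-chiplet yield {y_chip:.1%}")
print(f"relative silicon cost: {cost_mono / cost_chip:.1f}x higher for the "
      f"monolithic die (packaging/test overheads ignored)")
```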
7. Representative Comparative Metrics
| Module/Design | Peak BW (GB/s) | Latency (ns) | Energy/bit (pJ) | Capacity/Chiplet | Area Overhead | Notable Features |
|---|---|---|---|---|---|---|
| Hemlet (ACIM/DCIM) | 256 (NoP) | NoP: ≈10·H + 0.015·S (H hops, payload S) | 0.5–1.5/MAC | 24 Mb RRAM | – | 2.5D, RRAM/SRAM |
| Sangam (CXL-PIM) | 102.4 (CXL x8) | – | DRAM read power dominates (>80%) | 16 DDR5 chips (128–512 GB total) | – | 17 chiplets/module |
| UCIe-DRAM (CXL.Mem) | 256+ | 7.5 | 0.25 | Flexible | – | UCIe+logic-die/native |
| Occamy (HBM2E) | 1638 | 140 | ~16 (PHY+ctrl) | 16 GiB/stack | 19% die | 2.5D dual-chiplet |
| CC-MEM (SRAM) | 1500–4200 | ~1 | 0.03–0.1 (SRAM/xbar) | 80–230 MB | ~75%+ | On-chip crossbar+CDR |
Data reflect design-point specifics from (Wang et al., 19 Nov 2025, Kiyawat et al., 15 Nov 2025, Sharma et al., 7 Oct 2025, Krishnan et al., 2021, Paulin et al., 21 Jun 2024, Scheffler et al., 13 Jan 2025, Peng et al., 2023).
Chiplet-based memory modules therefore represent a convergence of advanced packaging, scalable memory-subsystem design, and in-/near-memory computing, enabling AI and HPC systems to transcend classical bottlenecks in capacity, bandwidth density, power efficiency, and cost. They provide a flexible substrate not only for DNN acceleration, but also for general-purpose memory-bound workloads. Future work will extend these modules via 3D stacking, NVM integration, photonic links, and open protocol ecosystems (Sharma et al., 7 Oct 2025, Kiyawat et al., 15 Nov 2025).