Multi-Chiplet GPU Architectures
- Multi-chiplet GPU architectures are designs that integrate multiple silicon chiplets with dedicated compute and memory resources to overcome the limitations of monolithic dies.
- They utilize NUMA-aware mapping and scheduling strategies to optimize local cache utilization and mitigate latency variances in cross-chiplet communications.
- Packaging innovations and optimized interconnects in these architectures enhance throughput, energy efficiency, and scalability for advanced AI accelerator workloads.
A multi-chiplet GPU architecture integrates multiple silicon dies (“chiplets” or XCDs), each comprising a subset of the GPU's compute and memory resources, on a shared package employing high-bandwidth, low-latency inter-chiplet interconnects. This design paradigm replaces the traditional monolithic die, driven by reticle-size constraints, yield, cost, and the need for domain-specific resource scaling. The resulting architectures—now deployed in leading-edge AI accelerators such as AMD MI300X—expose pronounced non-uniform memory access (NUMA), complex on-package communication challenges, and packaging-driven thermal and mechanical trade-offs. Effective exploitation and management of multi-chiplet GPU architectures require co-optimization of hardware topology, memory hierarchy, mapping of computations, and system-level scheduling to balance performance, efficiency, and reliability.
1. Chiplet Organization and Physical Microarchitecture
Multi-chiplet GPUs typically comprise compute chiplets (e.g., 38 Compute Units per XCD in MI300X), each with dedicated L1 and L2 caches, and dedicated high-bandwidth memory (e.g., HBM3 stacks). In AMD MI300X, eight XCDs are connected via Infinity Fabric in a logical mesh/ring topology, each paired with a local 24 GB HBM3 stack, yielding aggregate bandwidths of 5.3 TB/s and aggregate memory capacity of 192 GB. Each XCD features 16 KB L1 per CU and a private 4 MB L2 cache, totaling 32 MB GPU-wide. On-package NoC interconnects facilitate intra-GPU communication with defined hop-latency and bandwidth per link.
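For illustration, the MI300X-style figures above can be captured in a small descriptive model; the even per-XCD split of bandwidth and capacity is a simplifying assumption, and the class/field names are ours, not vendor terminology:

```python
# Back-of-envelope model of an MI300X-like package. The even bandwidth/capacity
# split across XCDs is an assumption, not a vendor specification.
from dataclasses import dataclass

@dataclass
class Chiplet:
    cu_count: int        # Compute Units per XCD
    l2_bytes: int        # private L2 per XCD
    hbm_bytes: int       # locally attached HBM3 capacity
    hbm_bw_gbps: float   # local HBM bandwidth (GB/s)

def build_package(n_xcd=8, total_bw_tbps=5.3, total_hbm_gb=192):
    """Split package-level figures evenly across XCDs (a simplification)."""
    per_bw = total_bw_tbps * 1000 / n_xcd   # GB/s per stack
    per_hbm = total_hbm_gb // n_xcd         # GB per stack
    return [Chiplet(cu_count=38,
                    l2_bytes=4 * 2**20,
                    hbm_bytes=per_hbm * 2**30,
                    hbm_bw_gbps=per_bw)
            for _ in range(n_xcd)]

xcds = build_package()
total_l2_mb = sum(c.l2_bytes for c in xcds) // 2**20   # 32 MB GPU-wide
```

Even this trivial model recovers the headline aggregates (192 GB HBM, 32 MB of fragmented L2) from the per-chiplet figures.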
Physical integration is realized via 2.5D (e.g., passive silicon or glass interposer) or 3D (vertical stacking) package schemes, with ultra-short-reach ICI links. Packaging choices—EMIB, passive/active interposers, and emerging glass interposers—critically impact achievable link density, thermal characteristics, yield, and total system cost (Sharma et al., 24 Jul 2025, Iff et al., 2023, Choudhary et al., 3 Nov 2025).
2. Memory Hierarchy and NUMA Effects
Each chiplet's tight coupling to a dedicated HBM stack introduces NUMA characteristics: memory access latency and bandwidth differ sharply between local (intra-chiplet) and remote (cross-chiplet) accesses. For instance, local L2 or HBM access yields low latency and high bandwidth, while remote accesses incur ≈2× higher latency and ≈½ the effective bandwidth due to additional fabric hops (Choudhary et al., 3 Nov 2025). These NUMA effects manifest sharply in workloads with poor data locality, undermining the uniform-access assumption built into existing GPU kernel schedulers.
Private L2 caches per chiplet cause cache fragmentation: data loaded into one XCD’s L2 is not visible to others, leading to redundant off-chip fetches and cache underutilization if workloads are not NUMA-locality-aware. This effect is quantitatively severe: naive scheduling of 128-head, 128 K-token attention drops L2 hit rates to 1–5%, whereas NUMA-conscious mapping restores them to 90–96%. The total memory access cost can be modeled as
C_mem = H_L2 · t_L2 + (1 − H_L2) · [ f_local · t_HBM,local + (1 − f_local) · t_HBM,remote ],
where H_L2 is the per-chiplet L2 hit rate, f_local is the fraction of L2 misses served by the locally attached HBM stack, and the t terms are the respective access latencies; maximizing H_L2 is critical for efficiency (Choudhary et al., 3 Nov 2025).
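This cost structure can be sketched numerically; the latency constants below are illustrative assumptions chosen only to match the qualitative ratios in the text (remote ≈2× local), not measured MI300X figures:

```python
# Expected cost per memory access: L2 hit, else local or remote HBM.
# Latency units are arbitrary; only ratios matter (remote ~2x local, assumed).
def mem_cost(h_l2, f_local, t_l2=1.0, t_local=10.0, t_remote=20.0):
    """h_l2: L2 hit rate; f_local: fraction of misses served locally."""
    miss = f_local * t_local + (1.0 - f_local) * t_remote
    return h_l2 * t_l2 + (1.0 - h_l2) * miss

# Naive scheduling: ~5% hit rate, ~1/8 of miss traffic lands on the local stack.
naive = mem_cost(h_l2=0.05, f_local=0.125)
# NUMA-aware mapping: ~95% hit rate, misses stay local.
numa_aware = mem_cost(h_l2=0.95, f_local=1.0)
```

Under these assumed constants the naive schedule pays more than an order of magnitude higher average access cost, consistent with the reported hit-rate collapse.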
3. Inter-Chiplet Interconnect and Communication Bottlenecks
Inter-chiplet communication employs network-on-package (NoP) or mesh/ring fabric topologies, supporting both unicast and multicast flows. Communication-intensive AI workloads, characteristic of large DNNs, intensify pressure on the NoP, particularly in multicast-heavy regimes (e.g., parameter broadcast in transformer attention). Analysis reveals that for moderate to large chiplet counts (N ≥ 9), communication dominates overall execution time, rising from ≈30% of total cycles at N=2 to >70% at N=18 (Musavi et al., 2024).
Key metrics include per-hop latency t_hop, per-link bandwidth b_link, and the mean hop-count, which grows as O(√N) in a planar mesh. For multicasts, aggregate traffic grows super-linearly, and hot spots form as multicast packets traverse longer paths (often >6 hops for N ≥ 9). In practice, designs such as modular hierarchical meshes with ring overlays and priority virtual channels are recommended, reducing NoP packet latency by 25–40% and improving bandwidth utilization from 60% to 85% of peak in the studied settings (Musavi et al., 2024).
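The hop-count scaling behind this trend can be sketched as follows; the workload constant c is an assumption chosen so the comm-to-compute ratio crosses 1 near N = 9, echoing the reported crossover, and is not a fitted value:

```python
import math

def mean_hops(n):
    """Average Manhattan distance between two uniformly random nodes on a
    k x k planar mesh (k = sqrt(n)): 2 * (k^2 - 1) / (3 * k)."""
    k = math.isqrt(n)
    return 2 * (k * k - 1) / (3 * k) if k > 1 else 0.0

def rho(n, c=0.6):
    """Illustrative rho(N) = T_comm / T_compute; c is an assumed workload
    factor standing in for traffic volume and link bandwidth."""
    return c * mean_hops(n)
```

With these assumptions rho stays below 1 for a 2×2 array but exceeds 1 at 3×3, matching the qualitative recommendation to cap mesh sizes absent better interconnects.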
Rapid early-stage exploration of interconnect performance, thermal, and cost trade-offs is enabled by proxy modeling tools such as RapidChiplet, which provides millisecond-scale predictions of ICI latency, throughput, area, and thermal peaks across trillions of design options, with <5% error for its latency proxy and 6–30% for its throughput proxies compared to cycle-accurate simulation (Iff et al., 2023).
4. Computation and Mapping Strategies
Optimal scheduling and mapping are essential for harnessing the potential of multi-chiplet GPUs. Mapping strategies that align the software execution with hardware NUMA domains are critical. For example, Swizzled Head-First Mapping (SHFM) realigns attention heads to specific GPU NUMA domains, keeping the head's tensors resident in a chiplet’s L2 and effectively exploiting cache-locality for high reuse (Choudhary et al., 3 Nov 2025). SHFM delivers up to 50% higher throughput and increases L2 hit rates to 80–97% compared to naive or non-NUMA-aware mappings.
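A minimal sketch of the idea behind head-first mapping, assuming a blocked head-to-XCD assignment (the exact SHFM swizzle is not reproduced here):

```python
# NUMA-aware head mapping in the spirit of SHFM: keep all tiles of one
# attention head on one XCD so its K/V working set stays in that chiplet's
# private L2. The blocked assignment is our illustrative assumption, not
# AMD's exact swizzle.
def shfm_map(head, tile, n_heads=128, n_xcd=8):
    """All tiles of a head land on the same XCD (tile is deliberately unused)."""
    return head * n_xcd // n_heads

def naive_map(workgroup_id, n_xcd=8):
    """Baseline round-robin dispatch: consecutive workgroups (here, tiles of
    the same head) scatter across all XCDs, fragmenting L2 reuse."""
    return workgroup_id % n_xcd

# Head 0's 16 tiles: one XCD under the SHFM-style mapping, all 8 under
# round-robin dispatch.
shfm_xcds = {shfm_map(0, t) for t in range(16)}
naive_xcds = {naive_map(t) for t in range(16)}
```

The contrast between the two sets is exactly the cache-fragmentation effect described above: the round-robin baseline touches every XCD's L2 with one head's data, while the blocked mapping confines it to one.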
Layer-pipelined spatial mappings, as codified in frameworks like Gemini, combine chiplet partitioning with mapping of DNN layer-groups via simulated annealing (SA) over tiling granularity, minimizing both intra- and inter-chiplet bandwidth consumption while balancing monetary, energy, and latency costs (Cai et al., 2023). On heterogeneous chiplet fabrics (mixing output-stationary and weight-stationary engines), advanced scheduling achieves 2.2× throughput and 1.9× energy-efficiency improvements over monolithic baselines (Odema et al., 2023). The mapping process trades off D2D bandwidth, core utilization, and interconnect contention, often favoring moderate chiplet granularities (2–4 per GPU die).
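An SA-based mapping search of this kind can be sketched with a generic annealing loop over a toy load-balancing objective; the move set and cost model are placeholders, not Gemini's actual multi-objective cost:

```python
# Generic simulated-annealing skeleton; here it balances 12 layer-groups
# across 4 chiplets by minimizing load variance (a stand-in objective).
import math
import random

def anneal(init, cost, neighbor, t0=1.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    state, best = init, init
    t = t0
    for _ in range(steps):
        cand = neighbor(state, rng)
        delta = cost(cand) - cost(state)
        # Accept improvements always, regressions with Boltzmann probability.
        if delta < 0 or rng.random() < math.exp(-delta / t):
            state = cand
        if cost(state) < cost(best):
            best = state
        t *= cooling
    return best

loads = [5, 3, 8, 2, 7, 1, 4, 6, 2, 3, 5, 4]   # relative layer-group costs

def cost(assign):
    per = [0.0] * 4
    for group, chiplet in enumerate(assign):
        per[chiplet] += loads[group]
    mean = sum(per) / 4
    return sum((p - mean) ** 2 for p in per)   # load-variance objective

def neighbor(assign, rng):
    a = list(assign)
    a[rng.randrange(len(a))] = rng.randrange(4)   # move one group
    return a

best = anneal([0] * len(loads), cost, neighbor)
```

The real framework replaces the variance objective with joint monetary/energy/latency costs and richer moves over tiling granularity, but the acceptance loop has the same shape.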
5. On-Package Memory and Domain Specialization
Domain-specialized chiplet organizations exploit composability to deploy compute and memory resources tailored to distinct workload requirements. COPA-GPU separates a compute chiplet (“GPM”) with its own SM array, L1, and L2 caches, from a memory-system module (“MSM”) providing large on-package L3 (eDRAM/SRAM), multi-stack HBM, and DRAM controllers. DL-specialized COPA-GPUs implement 16× larger caches and 1.6× DRAM bandwidth, achieving up to 31% higher DL training and 35% higher inference performance with a 50% reduction in required GPU instances for scale-out scenarios (e.g., MLPerf workloads) (Fu et al., 2021). The additional area (~4–6%) and power overheads are offset by DRAM access energy savings and an improved energy-delay product.
6. Packaging, Interposer Technology, and Scalability Limitations
Glass interposers provide lower RDL capacitance, up to 2× bus-width scaling, and lower energy per bit compared to silicon; this translates to 64.7% higher performance and 40% lower power in DNN workloads versus silicon-based 2.5D packages. However, glass, with much lower thermal conductivity (k_glass ≈ 1 W/mK vs. ≈130 W/mK for silicon), exacerbates thermal and warpage management challenges as system size scales. Warpage-induced stress, modeled as
σ_warp ∝ E_eff · Δα · ΔT · t_eff / (1 − ν),
can be mitigated by embedding 10–20% of chiplets into the interposer (lowering the effective stack thickness t_eff), and by underfill/heat-spreader selection with tuned CTE to limit deformation to <150 μm (Sharma et al., 24 Jul 2025). Effective co-optimization of architecture and package—considering computation mapping, placement, and physical structure—is necessary to guarantee both mechanical reliability and thermal feasibility under strict junction-temperature constraints.
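The mitigation arithmetic can be checked with a back-of-envelope proportionality; the material values below are generic assumptions, so only the ratio between configurations is meaningful:

```python
# Relative warpage stress, assuming sigma ~ E * d_alpha * dT * t_eff / (1 - nu).
# Material constants are generic placeholders, not glass-interposer data.
def warp_stress(e_gpa, d_alpha_ppm, d_t_k, t_eff_um, nu=0.3):
    """Returns a relative stress figure (arbitrary units)."""
    return e_gpa * d_alpha_ppm * d_t_k * t_eff_um / (1.0 - nu)

baseline = warp_stress(e_gpa=70, d_alpha_ppm=5.5, d_t_k=150, t_eff_um=100)
# Embedding ~15% of chiplets into the interposer thins the effective stack.
embedded = warp_stress(e_gpa=70, d_alpha_ppm=5.5, d_t_k=150, t_eff_um=85)
reduction = 1 - embedded / baseline   # -> 0.15 (linear in t_eff)
```

Because the assumed model is linear in t_eff, a 15% reduction in effective stack thickness translates one-to-one into a 15% stress reduction; the real benefit depends on the actual stress distribution in the package.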
Scalability is challenged not only by physical package limits but by communication-induced bottlenecks. Empirical analysis recommends capping GPU chiplet arrays at ≈3×3 unless advanced interconnects (hierarchical, wireless, photonic) are present, since NoP latency otherwise dominates and ρ(N) = T_comm(N) / T_compute exceeds 1 for N ≥ 9 (Musavi et al., 2024). For large-scale systems, non-grid chiplet arrangements such as HexaMesh reduce network diameter by up to 42% and improve bisection bandwidth by 130% compared to 2D grids, delivering 19% lower zero-load packet latency and 34% higher throughput (Iff et al., 2022).
7. Performance, Efficiency, and Design Guidelines
Sustained performance under realistic data-intensive workloads depends on exploiting NUMA locality, minimizing remote memory and cache accesses, and balancing on-package communication. In MI300X, mapping strategies that maximize local cache utilization and align computation to chiplet-resident memory yield up to 50% higher throughput and L2 hit rates of 80–97% in multi-head attention (Choudhary et al., 3 Nov 2025).
System-level simulation and proxy-based design space exploration show that moderate chiplet counts, aggressive workload-to-chiplet mapping (e.g., co-locating multicast-intensive layers), and careful buffer sizing are key to energy-delay optimization (Cai et al., 2023, Iff et al., 2023). On irregular and sparse workloads, architectures like the Occamy dual-chiplet RISC-V system sustain linear scaling (1.95× speedup from 1→2 chiplets) with only 5% D2D overhead and achieve high FPU utilization (83% on stencils) (Paulin et al., 2024).
Practically, best-in-class multi-chiplet GPU architectures apply:
- NUMA-aware software kernel mapping for locality,
- Moderate chiplet granularity (2–4 per die) for yield/bandwidth trade-off,
- Hierarchical/hybrid NoP topologies for multicast-dominated workloads,
- Thermal/mechanical co-design, especially with glass interposers,
- Proxy-driven fast design exploration to evaluate packaging, scalability, and system power.
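The last guideline can be illustrated with a toy proxy-style scorer that trades yield against mesh communication cost when ranking chiplet counts; the weights and formulas are illustrative assumptions, far simpler than RapidChiplet's actual proxies:

```python
# Toy design-space scorer: rank candidate chiplet counts with closed-form
# proxies instead of simulation. Weights and formulas are assumptions.
import math

def score(n_chiplets, w_yield=1.0, w_comm=1.0):
    """Lower is better: fewer, larger dies hurt yield; more chiplets
    raise mean mesh hop count and hence communication cost."""
    die_area = 1.0 / n_chiplets            # relative per-die area
    yield_cost = w_yield * die_area        # larger dies -> worse yield
    k = math.isqrt(n_chiplets)
    hops = 2 * (k * k - 1) / (3 * k) if k > 1 else 0.5   # 0.5: on-die floor
    comm_cost = w_comm * hops
    return yield_cost + comm_cost

candidates = [1, 4, 9, 16]
best_n = min(candidates, key=score)   # moderate counts win under these weights
```

Even this crude model reproduces the qualitative sweet spot the literature reports: a moderate chiplet count beats both the monolithic extreme and large meshes.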
Open challenges include runtime die-aware kernel scheduling, flexible D2D BW adaptation, and mixed-workload coscheduling (Choudhary et al., 3 Nov 2025, Cai et al., 2023).
References
- (Choudhary et al., 3 Nov 2025) Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
- (Musavi et al., 2024) Communication Characterization of AI Workloads for Large-scale Multi-chiplet Accelerators
- (Fu et al., 2021) GPU Domain Specialization via Composable On-Package Architecture
- (Odema et al., 2023) Inter-Layer Scheduling Space Exploration for Multi-model Inference on Heterogeneous Chiplets
- (Iff et al., 2023) RapidChiplet: A Toolchain for Rapid Design Space Exploration of Chiplet Architectures
- (Paulin et al., 2024) Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet Accelerator
- (Cai et al., 2023) Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
- (Iff et al., 2022) HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement
- (Sharma et al., 24 Jul 2025) Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer