Multi-Chiplet GPU Architecture

Updated 13 March 2026

Multi-chiplet GPU architecture is a design approach that decomposes a traditional GPU into multiple specialized chiplets on a single package to overcome reticle and yield constraints.
It employs heterogeneous chiplets with tailored functions—such as compute, memory, and interconnect roles—combined with advanced 2.5D/3D packaging for enhanced performance and energy efficiency.
Published evaluations report throughput gains of up to roughly 20× over specific baselines, along with energy-efficiency gains of up to about 5×, making it a viable approach for deep learning and high-performance scientific computing.

Multi-chiplet GPU architecture refers to the decomposition of a monolithic graphics processing unit into multiple physically separate silicon dies ("chiplets") that are integrated onto a single package or interposer. This approach provides a path to scaling aggregate compute and memory while circumventing reticle size constraints, yield barriers, and the diverging requirements of modern compute-intensive workloads, particularly in deep learning and scientific computing. Multi-chiplet architectures exploit a combination of high-bandwidth die-to-die interconnects, heterogeneity in chiplet microarchitecture, and advanced packaging (2.5D/3D, silicon or glass interposers) to achieve higher performance, energy efficiency, and design modularity compared to monolithic GPUs.

1. Fundamental Principles and Architectural Overview

Multi-chiplet GPU architectures disaggregate the logical and physical organization of traditional GPUs into clusters of specialized chiplets. Each chiplet may serve as a compute cluster, high-capacity memory controller, large L3 cache, interconnect switch, or domain-specific accelerator. Integration is typically realized through a 2D mesh or more advanced topologies (e.g., HexaMesh) across high-density package-level interconnects (Odema et al., 2023; Iff et al., 2022).

An example is a multi-chip module (MCM) containing four AI-accelerator chiplets in a 2×2 mesh, where each chiplet may use a different dataflow engine (output-stationary or weight-stationary), has a 10 MB global buffer, and operates at 500 MHz. The package provides 100 GB/s point-to-point links and direct DRAM channels on edge chiplets, forming a topology optimized for both bandwidth and energy efficiency (Odema et al., 2023). Another realization is the Occamy dual-chiplet system, in which each chiplet integrates 216 RISC-V cores and a high-bandwidth HBM2E stack, with the two chiplets connected by wide (144 GB/s) D2D links over a passive silicon interposer (Paulin et al., 2024).
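To make the 2×2 mesh example concrete, the following minimal sketch models hop counts and a first-order transfer time over the 100 GB/s links. The function names, the coordinate scheme, and the optional per-hop latency parameter are illustrative assumptions, not part of the cited design.

```python
from itertools import product

MESH_ROWS, MESH_COLS = 2, 2   # 2x2 mesh from the example above
LINK_GB_PER_S = 100.0         # point-to-point link bandwidth (from the text)


def hops(src, dst):
    """Manhattan distance between two (row, col) chiplet coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def transfer_time_us(num_bytes, src, dst, per_hop_latency_us=0.0):
    """First-order estimate: serialization over one link plus a fixed latency per hop.

    Assumes pipelined (wormhole-style) forwarding, so the payload is serialized once.
    per_hop_latency_us is a hypothetical parameter; the source gives no value for it.
    """
    serialization_us = num_bytes / (LINK_GB_PER_S * 1e9) * 1e6
    return serialization_us + hops(src, dst) * per_hop_latency_us


chiplets = list(product(range(MESH_ROWS), range(MESH_COLS)))
diameter = max(hops(a, b) for a in chiplets for b in chiplets)

# Moving one full 10 MB global buffer to a neighbouring chiplet.
one_buffer_us = transfer_time_us(10 * 10**6, (0, 0), (0, 1))
print(f"mesh diameter: {diameter} hops, 10 MB transfer: {one_buffer_us:.1f} us")
```

Under these assumptions, moving an entire global buffer costs on the order of 100 µs, which is why schedulers try to keep reused data resident on a chiplet rather than shipping it across the package.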

2. Heterogeneity and Specialization in Multi-Chiplet GPUs

Heterogeneous chiplet integration enables the assignment of specialized roles and microarchitectures to different chiplets within the package. Heterogeneity is expressed along several axes:

Dataflow specialization: Some chiplets realize output-stationary (OS) dataflow, suited to convolutions and output reuse, whereas others implement weight-stationary (WS) dataflow, optimizing weight reuse in large GEMMs and attention kernels. The scheduler can map each DNN layer to the chiplet whose dataflow minimizes its energy–delay product (EDP) (Odema et al., 2023).
Compute/memory roles: Chiplets can be specialized for streaming multiprocessor (SM) compute (e.g., for dynamic Transformer kernels), in-memory compute (e.g., ReRAM macros for static FF layers), or memory controller (MC) functions (Sharma et al., 2023).
Physical and thermal profile: High-power chiplets may be placed near the package center, with lower-power or embedded units positioned for mechanical and thermal stress relief, especially in glass interposer systems (Sharma et al., 2025).
Memory stacking: Individual chiplets may carry their own HBM or 3D-stacked DRAM modules, allowing linear bandwidth scaling with chiplet count (Paulin et al., 2024).

Heterogeneity enables system-level optimization, as tasks with distinct memory access, compute, or communication profiles are mapped to their most efficient hardware substrate.
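The dataflow-selection idea can be illustrated with a small sketch that picks, for each layer, the dataflow with the lowest energy–delay product. The layer names and cost numbers are hypothetical, and the per-layer greedy choice ignores inter-chiplet communication, which a real scheduler would have to account for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerCost:
    energy_mj: float  # energy of one layer on a given dataflow (hypothetical)
    delay_ms: float   # latency of the same layer (hypothetical)


def edp(cost):
    """Energy-delay product of one layer/dataflow pairing."""
    return cost.energy_mj * cost.delay_ms


def pick_dataflow(costs_by_dataflow):
    """Return the dataflow name with the lowest energy-delay product."""
    return min(costs_by_dataflow, key=lambda name: edp(costs_by_dataflow[name]))


# Hypothetical profile: {layer: {dataflow: LayerCost}}
profile = {
    "conv3x3": {"OS": LayerCost(2.0, 1.0), "WS": LayerCost(2.5, 1.4)},
    "attn_gemm": {"OS": LayerCost(4.0, 3.0), "WS": LayerCost(3.0, 2.0)},
}

assignment = {layer: pick_dataflow(costs) for layer, costs in profile.items()}
print(assignment)
```

With these made-up numbers the convolution lands on the output-stationary chiplet and the GEMM on the weight-stationary one, mirroring the qualitative preference described above.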

3. Inter-Chiplet Interconnect Topologies and Physical Integration

The performance and scalability of multi-chiplet GPUs depend heavily on the topology and physical properties of intra-package interconnects:

2D mesh and grid topologies provide direct, short-range links but may suffer from high diameters as chiplet counts scale.
HexaMesh topologies organize chiplets into concentric rings with each (non-peripheral) chiplet having six neighbors, reducing network diameter by up to 42% and increasing bisection bandwidth by up to 131% relative to the grid, thus improving average latency by 19% and peak throughput by 34% for N=10–100 chiplets (Iff et al., 2022).
Silicon and glass interposers: Silicon interposers offer high line density and low-loss routing for 2.5D integration. Glass interposers reduce capacitance, lower energy/bit for the redistribution layer (RDL), and enable higher bus widths, but require sophisticated co-design to avoid warpage and thermal hotspots (Sharma et al., 2025).
Advanced packaging: Innovations such as embedded chiplets within the interposer (connected by through-glass vias) are used to reduce mechanical stress, manage thermal gradients, and further reduce in-package signal lengths.
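The diameter advantage of six-neighbor arrangements over a grid can be checked with a short breadth-first search. This sketch uses an idealized hexagonal lattice in axial coordinates as a stand-in for concentric-ring placement; it is a simplified model, not the exact chiplet arrangement of Iff et al., and it does not attempt to reproduce their reported percentages.

```python
from collections import deque

GRID_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
# Six neighbours of a cell in axial hexagonal coordinates.
HEX_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def grid_nodes(n):
    """Coordinates of an n x n grid of chiplets."""
    return {(x, y) for x in range(n) for y in range(n)}


def hex_nodes(rings):
    """Axial coordinates of a centre chiplet plus `rings` concentric hexagonal rings."""
    return {(q, r)
            for q in range(-rings, rings + 1)
            for r in range(-rings, rings + 1)
            if abs(q + r) <= rings}


def diameter(nodes, dirs):
    """Longest shortest path in hops between any two nodes, via BFS from every node."""
    worst = 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for dq, dr in dirs:
                nxt = (cur[0] + dq, cur[1] + dr)
                if nxt in nodes and nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


hex19 = hex_nodes(2)   # 19 chiplets: centre plus two rings
grid16 = grid_nodes(4) # 16 chiplets
print(len(hex19), diameter(hex19, HEX_DIRS))
print(len(grid16), diameter(grid16, GRID_DIRS))
```

In this idealized model, 19 hexagonally arranged chiplets have a smaller diameter than a 16-chiplet grid, despite the higher node count.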

Design strategies include aligning memory interface chiplets at the package perimeter for high-bandwidth access, reserving central regions for compute-intensive chiplets, and co-optimizing topology and chiplet placement with network traffic profiles to minimize average hop count (Sharma et al., 2023).
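As a rough illustration of traffic-aware placement, the sketch below exhaustively searches placements of four logical chiplets on a 2×2 mesh and keeps the one with the lowest traffic-weighted hop count. The traffic matrix and chiplet names are hypothetical, and brute force is only practical for very small packages; it stands in here for the optimization methods used in the cited work.

```python
from itertools import permutations

SLOTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 mesh positions

# Hypothetical traffic volumes (GB per inference) between logical chiplets.
TRAFFIC = {
    ("c0", "c1"): 8.0,
    ("c1", "c2"): 8.0,
    ("c2", "c3"): 8.0,
    ("c0", "c3"): 1.0,
}


def hop_distance(a, b):
    """Manhattan distance between two mesh slots."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def weighted_hops(placement):
    """Sum of traffic volume times hop count for a {chiplet: slot} placement."""
    return sum(volume * hop_distance(placement[src], placement[dst])
               for (src, dst), volume in TRAFFIC.items())


def best_placement(chiplets):
    """Try every assignment of chiplets to slots and return the cheapest one."""
    candidates = (dict(zip(chiplets, perm)) for perm in permutations(SLOTS))
    return min(candidates, key=weighted_hops)


best = best_placement(["c0", "c1", "c2", "c3"])
average_hops = weighted_hops(best) / sum(TRAFFIC.values())
print(best, average_hops)
```

Dividing the weighted total by the total traffic gives the average hop count per transferred byte, which is the quantity the placement strategies above aim to minimize.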

4. Scheduling, Mapping, and Workload Partitioning

Efficiently exploiting chiplet resources requires advanced workload mapping and scheduling, accounting for pipeline balance, data reuse, on-chip buffer constraints, and D2D bandwidth limits.

Mathematical formulation: The mapping of DNN layers to chiplets is modeled as an assignment problem, with binary variables $x_{l,c}$ indicating whether layer $l$ is mapped to chiplet $c$, and objective functions that minimize pipeline period $T$ or overall energy. Constraints ensure single mapping per layer, precedence and communication delays, buffer occupancy, and D2D bandwidth (Odema et al., 2023).
Heuristic algorithms: Two-stage approaches first assign dataflow specialization (OS vs. WS per layer) and then construct an inter-layer pipeline to balance stage latency and minimize maximum period. RA-tree representations and candidate schedule pruning are used to explore the solution space rapidly (Odema et al., 2023).
Multi-objective optimization: Placement of chiplets and routers is optimized to minimize both the mean and variance of link utilization, subject to physical and bandwidth constraints. For example, a space-filling curve (SFC)-based chiplet layout for ReRAM macros minimizes communication hops during the feed-forward stages of transformer inference (Sharma et al., 2023).
Mapping engines: Systems such as Gemini use simulated annealing-based mapping for layer groups, tuning partitioning and dataflow to jointly minimize monetary cost, energy, and delay across various hardware partitionings and D2D topologies (Cai et al., 2023).
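One plausible way to write the period-minimizing assignment problem is shown below. This is a simplified sketch rather than the exact model of Odema et al.: the symbols $t_{l,c}$ (latency of layer $l$ on chiplet $c$), $b_l$ (buffer footprint of layer $l$), and $B_c$ (buffer capacity of chiplet $c$) are assumed here for illustration, and the precedence, communication-delay, and D2D bandwidth constraints are omitted.

$$
\begin{aligned}
\min_{x,\,T}\quad & T \\
\text{s.t.}\quad & \sum_{c} x_{l,c} = 1 && \forall l \\
& \sum_{l} t_{l,c}\, x_{l,c} \le T && \forall c \\
& \sum_{l} b_{l}\, x_{l,c} \le B_{c} && \forall c \\
& x_{l,c} \in \{0,1\} && \forall l, c
\end{aligned}
$$

The first constraint maps each layer to exactly one chiplet, the second makes $T$ an upper bound on the busiest chiplet's work per pipeline period, and the third keeps each chiplet within its buffer capacity.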

Optimal point selection often balances pipeline concurrency, minimizes the number of expensive D2D transfers, and leverages on-chip buffer and cache reuse to maximize throughput and energy efficiency.
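The pipeline-balancing step can be sketched as a classic contiguous-partition problem: split an ordered list of layer latencies into a fixed number of stages so that the slowest stage is as fast as possible. This dynamic program is a generic textbook technique used here for illustration; it is not the RA-tree search or the simulated-annealing engine of the cited papers, and it ignores D2D transfer cost. The latencies are hypothetical.

```python
from functools import lru_cache


def balance_pipeline(latencies, num_stages):
    """Split layers (in order) into contiguous stages, minimizing the slowest stage.

    Requires 1 <= num_stages <= len(latencies).
    Returns (period, stages), where stages is a list of lists of layer indices.
    """
    n = len(latencies)
    prefix = [0.0]
    for t in latencies:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def solve(start, stages_left):
        # Best (period, cut points) for layers[start:] using exactly stages_left stages.
        if stages_left == 1:
            return prefix[n] - prefix[start], ()
        best = (float("inf"), ())
        # Leave at least one layer for each remaining stage.
        for cut in range(start + 1, n - stages_left + 2):
            head = prefix[cut] - prefix[start]
            tail, cuts = solve(cut, stages_left - 1)
            period = max(head, tail)
            if period < best[0]:
                best = (period, (cut,) + cuts)
        return best

    period, cuts = solve(0, num_stages)
    bounds = (0,) + cuts + (n,)
    return period, [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]


# Six layers with hypothetical latencies (ms), mapped onto four chiplets.
period, stages = balance_pipeline([4, 2, 3, 1, 5, 2], num_stages=4)
print(period, stages)
```

The returned period is the steady-state time per inference once the pipeline is full, so lowering it directly raises throughput.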

5. Performance, Scalability, and Cost–Benefit Analysis

Reported results show that multi-chiplet designs can outperform monolithic or baseline designs in both throughput and energy efficiency, with the largest gains on large, bandwidth-hungry workloads:

Architecture/Paper	Throughput Gain	Energy Efficiency Gain	Notes
4-chiplet MCM (hetero)	2.2×	1.9×	vs. monolithic OS accelerator (Odema et al., 2023)
Hetero-transformer chiplet	11.8–22.8×	2.4–5.4×	vs. HAIMA baseline, scaling with model size (Sharma et al., 2023)
Occamy dual-chiplet	n/a	28.1 GFLOP/s/W	83% FPU utilization (stencil) (Paulin et al., 2024)
Glass interposer GPUs	1.3–1.41×	up to 40% reduction	vs. Si_2.5D, with lower cost (Sharma et al., 2025)
Gemini (DNN chiplets)	1.98×	1.41×	at 14% cost increase over baseline (Cai et al., 2023)
COPA-DL (2.5D/3D)	21–35%	29–35%	for training/inference, with 2× cost savings at scale (Fu et al., 2021)

Scalability is enabled by partitioning compute and memory bandwidth across many chiplets, with aggregate system bandwidth and capacity scaling linearly with the number of integrated HBM/DRAM chiplets. However, excessive chiplet granularity increases area and power overhead due to D2D PHYs, suggesting that moderate partition counts (2–4 chiplets at 128–512 TOPS) are optimal (Cai et al., 2023). Furthermore, mechanical, thermal, and warpage constraints limit interposer area, placement density, and maximum power dissipation, especially with emerging glass substrates (Sharma et al., 2025).
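The tension between yield and D2D overhead can be illustrated with a toy silicon-cost model. It uses the standard Poisson yield approximation, which is a textbook model and not taken from the cited papers. The die area, defect density, and per-chiplet PHY area are hypothetical, and packaging, interposer, and assembly-yield costs are left out, so the location of the optimum here should not be read as a prediction.

```python
import math


def die_yield(area_mm2, defects_per_mm2):
    """Poisson yield model: probability that a die of the given area has no defect."""
    return math.exp(-area_mm2 * defects_per_mm2)


def silicon_cost_per_good_system(total_logic_mm2, num_chiplets, defects_per_mm2,
                                 phy_mm2_per_chiplet=0.0, cost_per_mm2=1.0):
    """Relative silicon cost of one working system split into equal chiplets.

    Assumes known-good-die testing, so a defective chiplet is discarded on its own.
    Packaging, interposer, and assembly-yield costs are deliberately ignored.
    """
    # A monolithic die needs no die-to-die PHY area.
    phy = phy_mm2_per_chiplet if num_chiplets > 1 else 0.0
    die_area = total_logic_mm2 / num_chiplets + phy
    cost_per_good_die = die_area * cost_per_mm2 / die_yield(die_area, defects_per_mm2)
    return num_chiplets * cost_per_good_die


# Hypothetical inputs: 800 mm^2 of logic, 0.1 defects/cm^2, 10 mm^2 of PHY per chiplet.
for n in (1, 2, 4, 8, 16):
    cost = silicon_cost_per_good_system(800.0, n, 0.001, phy_mm2_per_chiplet=10.0)
    print(f"{n:2d} chiplets: relative silicon cost {cost:8.1f}")
```

With these inputs the cost falls quickly for the first few splits and then rises again once the fixed PHY area per chiplet outweighs the yield benefit, which matches the qualitative argument for moderate partition counts.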

6. Design Trade-offs and Engineering Recommendations

Critical trade-offs arise in multi-chiplet GPU design:

Computation vs. communication: Deep pipelines and data reuse keep activations on-package but increase NoP hops, requiring careful scheduling and placement (e.g., DRAM-adjacent for first/last stages) (Odema et al., 2023). Balancing locality of reference against bandwidth and latency penalties is essential.
Homogeneous vs. heterogeneous pipelining: Homogeneous pipelines (e.g., all OS) may maximize throughput, while heterogeneity sacrifices some throughput for significantly higher energy efficiency, accounting for the optimal hardware mapping of each layer (Odema et al., 2023).
Interconnect topology: Low-diameter topologies such as HexaMesh shorten paths relative to a 2D mesh, yielding lower latency and higher bisection bandwidth, but physical constraints on bump pitch and substrate wiring density govern which designs are feasible (Iff et al., 2022).
Thermal/mechanical co-design: The use of glass interposers and embedded chiplets allows for larger packages with wide buses and lower energy per bit, but only when accompanied by mechanical warpage control and thermal management schemes, e.g., embedding low-power units to increase stiffness and distribute heat (Sharma et al., 2025).
Domain specialization: Composable on-package architectures permit domain-optimized GPU variants—e.g., with massive L3 for deep learning or leaner memory systems for HPC workloads—maximizing configuration space efficiency and product-specific performance (Fu et al., 2021).
Programming model: The absence of hardware cache coherence and the requirement for explicit DMA traffic management can complicate software, necessitating more sophisticated compiler and runtime support (Paulin et al., 2024).

Engineering guidelines thus favor heterogeneous dataflow engines, mesh or ring-based interconnects, flexible chiplet floorplanning, fast cost-model driven scheduling, and staged mapping/optimization flows for maximizing total system performance under real-world constraints.

7. Future Directions and Open Challenges

Key research challenges remain, including:

Mapping/model co-design: Efficient workload partitioning for increasingly large, irregular transformer and vision models remains a computationally expensive task (Cai et al., 2023).
Interposer and package technology: Scaling bump density, minimizing crosstalk, and managing warpage and thermals under high aggregate power continue to require advances in both silicon and glass interposer process technology (Sharma et al., 2025).
Standardized D2D interfaces: The adoption of UCIe or other high-bandwidth die-to-die standards is necessary to ensure scalability and interoperability as commercial multi-chiplet GPUs mature.
Domain-specific hardware: The proliferation of on-package specialized memory (e.g., ReRAM, eDRAM) and domain-specific accelerators within chiplet-based GPUs suggests the continued evolution of highly composable, workload-tuned architectures.
Cost and yield analysis: Continued refinement of cost models, yield-aware partitioning, and NRE amortization through re-usable chiplet families is likely as mainstream semiconductor vendors shift to chiplet-based design (Cai et al., 2023).

In sum, multi-chiplet GPU architecture is enabling fundamental changes in the scale, specialization, and cost-effectiveness of platform design. Combining heterogeneity-aware scheduling, advanced interconnects and packaging, and domain-specific partitioning yields reported gains that range from tens of percent to roughly 20× in throughput and up to about 5× in energy efficiency for emerging workloads, depending on the workload and the baseline used for comparison (Odema et al., 2023; Sharma et al., 2023; Fu et al., 2021).