Composable On-Package Architecture for GPUs

Updated 21 March 2026

Composable On-Package Architecture (COPA-GPU) is a design that partitions a GPU into a reusable compute module (GPM) and a customizable memory system module (MSM) for targeted workloads.
It enables optimized deep learning and HPC performance by tailoring the memory system to the workload, achieving up to 40% higher throughput and a reported 3.4× memory-system energy saving.
The modular approach supports cost-effective productization by reusing a validated GPU core and pairing it, at packaging time, with a memory subsystem suited to the target application.

A Composable On-Package Architecture for GPUs (COPA-GPU) enables the creation of domain-specialized graphics processing units through modular on-package disaggregation. By partitioning a GPU into separable functional modules—specifically, a reusable GPU module (GPM) and a swappable memory system module (MSM)—COPA-GPU supports maximal design reuse, targeted memory system specialization, and scalable performance for diverging application needs such as deep learning (DL) and high-performance computing (HPC). This approach breaks the tradition of monolithic, converged GPU designs that force a compromise between competing workload demands, offering instead a packaging-time choice of optimal memory subsystems without altering the validated compute core or its software interface (Fu et al., 2021).

1. Motivation for Composable GPU Architectures

Conventional converged GPU designs use a single monolithic die to serve both HPC (FP64/FP32) and DL (FP16/INT8) workloads. Low-precision matrix throughput has grown far faster than memory bandwidth: FP16 throughput rose from 21 TFLOPS (P100) to 779 TFLOPS (GPU-N), a 37× increase, while off-chip DRAM bandwidth grew only from 732 GB/s to 2.7 TB/s (3.7×) over three generations. As a result, the DRAM_BW/FP16 ratio fell sharply, from 35× (P100) to 3.4× (GPU-N), whereas DL accelerators such as IPUv2 and Groq TSP maintain much higher ratios (720× and 320×) through extensive on-chip SRAM or by not relying on DRAM at all. HPC codes show the opposite picture: across more than 130 benchmarks, sensitivity to additional DRAM bandwidth is below 5%. The outcome is suboptimal resource utilization: DL is bottlenecked by memory bandwidth, while HPC is over-provisioned and wastes silicon area and power (Fu et al., 2021). This dichotomy motivates the COPA-GPU concept, which lets the memory system and off-die bandwidth be scaled separately for each domain.
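The growth figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the bandwidth-to-compute ratio is expressed as GB/s of DRAM bandwidth per FP16 TFLOPS, which reproduces the quoted value of about 35 for P100 and gives roughly 3.5 for GPU-N, close to the quoted 3.4. The dictionary layout and function name are hypothetical.

```python
# Figures quoted in the paragraph above (Fu et al., 2021).
GPUS = {
    "P100": {"fp16_tflops": 21.0, "dram_gbps": 732.0},
    "GPU-N": {"fp16_tflops": 779.0, "dram_gbps": 2700.0},
}


def bw_per_tflop(name):
    """DRAM bandwidth in GB/s per FP16 TFLOPS (assumed unit of the ratio)."""
    gpu = GPUS[name]
    return gpu["dram_gbps"] / gpu["fp16_tflops"]


# Generation-over-generation growth of compute and of DRAM bandwidth.
compute_growth = GPUS["GPU-N"]["fp16_tflops"] / GPUS["P100"]["fp16_tflops"]
bandwidth_growth = GPUS["GPU-N"]["dram_gbps"] / GPUS["P100"]["dram_gbps"]

print(f"FP16 throughput growth: {compute_growth:.1f}x")
print(f"DRAM bandwidth growth:  {bandwidth_growth:.1f}x")
for name in GPUS:
    print(f"{name}: {bw_per_tflop(name):.1f} GB/s per FP16 TFLOPS")
```

Compute grows about ten times faster than bandwidth, which is why the ratio drops by roughly an order of magnitude.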

2. Architectural Floorplan and Module Disaggregation

COPA-GPU leverages a multi-chip-module (MCM) disaggregation strategy. The GPU package is partitioned into:

GPU Module (GPM): Contains streaming multiprocessors (SMs), L1 caches, L2 cache slices, and the network-on-chip (NoC).
Memory System Module (MSM): Hosts a higher-level L3 cache (as appropriate), DRAM controllers, and off-die memory interfaces.

Three partitioning options were analyzed:

GPM with MSM (L2+MC): Not viable due to unacceptable TB/s on-package NoC requirements.
GPM with MSM (NoC+L2+MC): Similarly impractical.
GPM with MSM (L3+MC): Selected as feasible—post-L2 traffic is routed via an on-GPM switch either to an on-package L3 on the MSM or directly to the DRAM controller.

This arrangement allows the unmodified GPM to be paired, at packaging, with a choice of MSM specialized for HPC or DL workloads:

HPC-oriented MSM: Only memory controller and DRAM I/O, minimizing area, cost, and power.
DL-oriented MSM: Additional large L3 SRAM and more DRAM channels, maximizing capacity and bandwidth (Fu et al., 2021).
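The pairing of one GPM with alternative MSMs can be pictured as a small data model. This is a minimal sketch, not the authors' tooling: the class names, field names, and the `has_l3` helper are hypothetical. The DL figures are taken from the HBML+L3 configuration the article lists, and the HPC part is assumed here to keep the GPU-N baseline DRAM parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPM:
    """Reusable compute die: SMs, L1 caches, L2 slices, NoC (hypothetical model)."""
    name: str
    l2_mb: int


@dataclass(frozen=True)
class MSM:
    """Memory system module chosen at packaging time (hypothetical model)."""
    name: str
    l3_mb: int        # 0 means no L3, as in the HPC-oriented variant
    dram_tbps: float
    dram_gb: int


@dataclass(frozen=True)
class Package:
    """One product: the shared GPM plus exactly one MSM."""
    gpm: GPM
    msm: MSM

    @property
    def has_l3(self) -> bool:
        return self.msm.l3_mb > 0


# A single validated GPM design is reused by every product.
SHARED_GPM = GPM("gpm", l2_mb=60)

# Assumption: the HPC MSM keeps the GPU-N baseline DRAM parameters.
HPC_PART = Package(SHARED_GPM, MSM("hpc", l3_mb=0, dram_tbps=2.7, dram_gb=100))
DL_PART = Package(SHARED_GPM, MSM("dl", l3_mb=960, dram_tbps=4.5, dram_gb=167))

for part in (HPC_PART, DL_PART):
    print(part.msm.name, "L3 present:", part.has_l3)
```

Both products reference the same `SHARED_GPM` object, mirroring the idea that only the MSM differs between them.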

3. Memory System Specializations

The COPA-GPU approach supports significant specialization of the memory subsystem. Baseline parameters for a converged GPU (GPU-N): L2 cache capacity $C_{L2} =$ 60 MB, DRAM bandwidth (BW) $= 2.7$ TB/s, DRAM capacity $= 100$ GB.

DL-specialized COPA-GPU configurations (sharing GPM):

HBM+L3: $C_{L3} = 960$ MB, DRAM BW $= 2.7$ TB/s, DRAM $= 100$ GB
HBML+L3: $C_{L3} = 960$ MB, DRAM BW $= 4.5$ TB/s, DRAM $= 167$ GB
HBM+L3L: $C_{L3} = 1920$ MB, DRAM BW $= 2.7$ TB/s, DRAM $= 100$ GB
HBML+L3L: $C_{L3} = 1920$ MB, DRAM BW $= 4.5$ TB/s, DRAM $= 167$ GB
HBMLL+L3L: $C_{L3} = 1920$ MB, DRAM BW $= 6.3$ TB/s, DRAM $= 233$ GB
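The configurations above can be compared against the GPU-N baseline with simple ratios. The sketch below only restates the listed numbers; the table layout and the `ratios` helper are illustrative names, not part of the source.

```python
# GPU-N baseline and DL-specialized configurations as listed above.
BASELINE = {"l2_mb": 60, "dram_tbps": 2.7, "dram_gb": 100}

# name: (L3 capacity in MB, DRAM bandwidth in TB/s, DRAM capacity in GB)
CONFIGS = {
    "HBM+L3": (960, 2.7, 100),
    "HBML+L3": (960, 4.5, 167),
    "HBM+L3L": (1920, 2.7, 100),
    "HBML+L3L": (1920, 4.5, 167),
    "HBMLL+L3L": (1920, 6.3, 233),
}


def ratios(name):
    """Return (L3/L2 capacity, DRAM BW, DRAM capacity) ratios vs. the baseline."""
    l3_mb, dram_tbps, dram_gb = CONFIGS[name]
    return (
        l3_mb / BASELINE["l2_mb"],
        dram_tbps / BASELINE["dram_tbps"],
        dram_gb / BASELINE["dram_gb"],
    )


for name in CONFIGS:
    cache_ratio, bw_ratio, cap_ratio = ratios(name)
    print(f"{name:10s} cache {cache_ratio:4.0f}x  BW {bw_ratio:.2f}x  DRAM {cap_ratio:.2f}x")
```

The L3 options are 16 and 32 times the baseline L2 capacity, and the largest DRAM option provides about 2.3 times the baseline bandwidth.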

Key enhancement ratios relative to the baseline are a cache capacity ratio ( $C_{L3} / C_{L2,\mathrm{baseline}}$ ) of 16–32× and a DRAM BW ratio of up to 2.3×. GPM↔MSM bandwidth between the dies on the package is provided by 2.5D UHB links (up to 14.7 TB/s per interface at 0.3 pJ/bit) or 3D UHB links (greater than 14.7 TB/s at 0.05 pJ/bit). A 960 MB L3 can reduce DRAM traffic by up to 82% (roughly a 5× reduction); a 1920 MB L3 by approximately 94% (Fu et al., 2021).
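Two of the figures in this paragraph lend themselves to a quick back-of-the-envelope check: the power a die-to-die link draws when fully utilized, and the traffic reduction factor implied by a given percentage of filtered DRAM traffic. The sketch below assumes 1 TB/s = 10^12 bytes/s and evaluates the 3D link at 14.7 TB/s, the lower bound quoted for it; the function names are hypothetical.

```python
def link_power_watts(bandwidth_tbps, energy_pj_per_bit):
    """Power of a link running at full bandwidth (assumes 1 TB/s = 1e12 bytes/s)."""
    bits_per_second = bandwidth_tbps * 1e12 * 8
    return bits_per_second * energy_pj_per_bit * 1e-12


def traffic_reduction_factor(fraction_removed):
    """Convert 'DRAM traffic reduced by p' into an 'N times less traffic' factor."""
    return 1.0 / (1.0 - fraction_removed)


# 2.5D and 3D UHB link figures quoted above.
print(f"2.5D link at 14.7 TB/s: {link_power_watts(14.7, 0.3):.1f} W")
print(f"3D link at 14.7 TB/s:   {link_power_watts(14.7, 0.05):.1f} W")

# L3 filtering figures quoted above.
print(f"82% less traffic -> {traffic_reduction_factor(0.82):.1f}x")
print(f"94% less traffic -> {traffic_reduction_factor(0.94):.1f}x")
```

Under these assumptions the 2.5D link consumes about 35 W at peak and the 3D link about 6 W. An 82% reduction works out to about 5.6×, consistent with the rounded 5× figure.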

4. Composability and Dynamic Configuration

The COPA-GPU defines composability at the packaging level. The GPM die is validated and manufactured once, and its behavior is adapted at package assembly:

MSM Selection: At packaging, the MSM is chosen (HPC or DL variant).
Boot-Time Switch: The L2→L3 routing path inside the GPM is configured at boot according to the presence and type of the attached MSM.
ISA and Software Invariance: The SM datapath and instruction set are unchanged across variants, so the same binary/software stack operates on all configurations. Memory subsystem variations are transparent to the programmer.
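The boot-time routing decision can be sketched as a small state machine. This is a hypothetical illustration rather than the actual hardware logic: the `Route` enum, both function names, and the assumption that L3 misses continue to the DRAM controller on the MSM are introduced here for clarity.

```python
from enum import Enum


class Route(Enum):
    """Where the on-GPM switch sends post-L2 traffic (hypothetical names)."""
    L3_ON_MSM = "l3"
    DRAM_CONTROLLER = "mc"


def configure_switch(msm_l3_mb: int) -> Route:
    """Decide the route once at boot from the L3 capacity of the detected MSM."""
    return Route.L3_ON_MSM if msm_l3_mb > 0 else Route.DRAM_CONTROLLER


def service_l2_miss(route: Route, l3_hit: bool = False) -> list:
    """Return the components an L2 miss visits, in order."""
    path = ["switch"]
    if route is Route.L3_ON_MSM:
        path.append("L3")
        if l3_hit:
            return path  # served from the on-package L3
    # Assumption: L3 misses and L3-less parts both end at the DRAM controller.
    path.append("DRAM controller")
    return path


hpc_route = configure_switch(msm_l3_mb=0)
dl_route = configure_switch(msm_l3_mb=960)
print(service_l2_miss(hpc_route))
print(service_l2_miss(dl_route, l3_hit=True))
print(service_l2_miss(dl_route, l3_hit=False))
```

Software never calls anything like `configure_switch`; the point of the sketch is that the choice is made once per boot and is invisible to the SM datapath and the instruction set.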

This composability mechanism allows vendors to produce a family of accelerators by varying only the MSM, thus amortizing non-recurring engineering (NRE) of the core design and validation effort (Fu et al., 2021).

5. Performance Metrics and Evaluation

Performance was evaluated using trace-based simulation correlated to NVIDIA V100 hardware ( $r=0.986$ ). Benchmarks included 7 MLPerf Training and 5 MLPerf Inference tasks, in both large-batch (datacenter) and small-batch (edge) modes.

Key results:

Memory bandwidth scaling: +18% DL training throughput at 1.5× BW, +28% at 2× BW.
Cache scaling: +21% DL training at 960 MB L3, +27% at 1920 MB L3; theoretical unlimited L2 capacity yields up to 40% speedup.
COPA-GPU variants: Relative to GPU-N: HBM+L3 delivers +21% training, +29% inference; HBML+L3 achieves +31% training, +35% inference; HBML+L3L up to +33% training, +40% inference (diminishing BW returns observed).
Scale-out efficiency: A DL-optimized COPA-GPU with 4.5 TB/s DRAM + 960 MB L3 matches the throughput of two GPU-Ns in parallel, reducing required GPU instance count by approximately 50%.
Energy efficiency: The large on-package L3 yields 5× fewer DRAM accesses and a 3.4× energy saving for the memory system, as L3 SRAM access energy is 4× lower than HBM (Fu et al., 2021).
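The diminishing returns in the bandwidth and cache scaling results become clearer when each step is compared with the previous one instead of with the baseline. The sketch below uses only the DL training speedups quoted above; the helper name is illustrative.

```python
def marginal_gains(speedups):
    """Relative gain of each step over the step before it."""
    return [after / before - 1.0 for before, after in zip(speedups, speedups[1:])]


# DL training speedups vs. GPU-N, as quoted in the key results.
bw_speedups = [1.00, 1.18, 1.28]   # DRAM bandwidth at 1x, 1.5x, 2x
l3_speedups = [1.00, 1.21, 1.27]   # no L3, 960 MB L3, 1920 MB L3

for label, speedups in (("bandwidth", bw_speedups), ("L3 capacity", l3_speedups)):
    steps = ", ".join(f"{gain:+.1%}" for gain in marginal_gains(speedups))
    print(f"{label}: {steps}")
```

The second bandwidth step adds roughly 8.5% on top of the first, and doubling the L3 from 960 MB to 1920 MB adds roughly 5%, which matches the article's remark about diminishing returns.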

6. Design Implications and Productization

COPA-GPU enables GPU vendors to deliver domain-specialized products:

DL-optimized GPUs achieve 30–40% higher per-chip throughput and halve cluster GPU requirements.
HPC-optimized GPUs can omit unnecessary on-package SRAM, reducing manufacturing cost and energy consumption.

This architecture is well-suited for future reticle-limited scaling, as disaggregation permits independently sized MSM dies, and supports heterogeneous product lines with minimal silicon redesign. The area and energy overhead of making the GPM modular is estimated at about 5%. Because composition happens at packaging time and the GPM stays constant, existing libraries and frameworks see no functional interface change, which streamlines deployment.

7. Relationship to Other Composable System Architectures

The COPA-GPU model differs from PCIe-fabric–based composable GPU systems such as the GigaIO platform, which allows dynamic allocation of up to 32 GPUs via PCIe switches but does not implement on-package, die-to-die modularity or a novel shared L3/memory controller (Ihnotic, 2024). COPA-GPU’s composability occurs within the GPU package and is realized through electrical/mechanical stacking and high-bandwidth silicon interposers, rather than external fabrics and host mediation. In systems where composability is limited to PCIe-level device pooling, memory-side optimizations (large L3, tailored DRAM) and direct on-package partitioning are not available.

A plausible implication is that future research and industry efforts may integrate COPA-GPU’s on-package disaggregation principles with dynamic fabric-level resource composability to maximize both performance-per-watt and flexibility. However, practical deployment of true composable on-package architectures at scale requires advanced packaging, robust boot-time configuration logic, and high-bandwidth/energy-efficient on-package links (Fu et al., 2021).

References

Fu et al. (2021). GPU Domain Specialization via Composable On-Package Architecture.

Ihnotic (2024). Scaling to 32 GPUs on a Novel Composable System Architecture.