Papers
Topics
Authors
Recent
Search
2000 character limit reached

Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

Published 15 Apr 2026 in cs.AR | (2604.15379v1)

Abstract: Modern GPUs adopt chiplet-based designs with multiple private cache hierarchies, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads to redundant memory traffic and poor cache utilization in memory-bound workloads such as LLM inference. We present Fleet, a multi-level task model that maps computation to memory scopes. Fleet introduces Chiplet-tasks, a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse. On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 through persistent kernel execution and per-chiplet scheduling. At larger batch sizes, cooperative weight tiling increases L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering 1.27-1.30x speedup over a chiplet-unaware megakernel baseline.

Summary

  • The paper presents Fleet, which introduces a hierarchical task model to exploit chiplet-level cache reuse, reducing latency in memory-bound workloads.
  • It uses cooperative tiling and local L2-centric synchronization to optimize memory operations across chiplets, outperforming traditional flat dispatching.
  • Experimental results on MI350 GPUs show 1.3–1.5× speedups and significant L2 hit rate improvements, highlighting Fleet’s impact on LLM inference performance.

Fleet: A Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

Introduction

GPUs are evolving towards multi-die architectures, with chiplets each hosting private cache resources and local memory hierarchies. Typical programming abstractions in CUDA and HIP, however, remain tied to a monolithic, flat execution hierarchy that cannot capture chiplet-level localities or synchronize at chiplet boundaries. This architectural-programmatic mismatch impairs cache re-use, amplifies redundant memory traffic within large models, and specifically undermines memory-bound workloads such as LLM inference. "Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs" (2604.15379) directly addresses these inefficiencies by introducing a hierarchical, chiplet-aware persistent kernel runtime with explicit task-to-memory binding.

Architectural Motivation: Chiplet-induced NUMA and Cache Utilization

Modern GPUs, such as the AMD Instinct MI350 series, partition their compute resources and caches across multiple XCDs (Accelerator Complex Dies). Each XCD provides private L1 and a non-coherent 4 MB L2 cache, while sharing a device-level 256 MB Infinity (MALL) cache and global HBM3 memory. The device’s memory topology is strictly non-uniform—cache locality within an XCD is high, but remote access to other XCDs incurs significant penalties and explicit software-managed coherence. Figure 1 illustrates the hardware memory hierarchy of the MI350. Figure 1

Figure 1: Each of eight XCDs in an AMD MI350 hosts 32 CUs and a private 4 MB L2 cache; all XCDs share a 256 MB Infinity Cache across HBM3.

A core insight analyzed in the paper is that standard dispatching approaches scatter workgroups indiscriminately across XCDs, leading to redundant movement of identical weight data from HBM into each XCD's private L2 cache and overall cache thrashing during LLM inference. This effect, combined with the lack of explicit programming abstractions for chiplet-level resource affinity, bottlenecks achievable bandwidth and inflates per-token latency in transformer models.

The Fleet Task Model: Hierarchical Decomposition

Fleet introduces a principled, multi-level task decomposition that exposes all relevant hardware boundaries. The abstraction hierarchy is as follows:

  • Wavefront-task: Operates within a single SIMD wavefront using private registers and LDS.
  • CU-task: Maps to a compute unit, utilizing LDS and L2 cache.
  • Chiplet-task (novel): Spans all workers on one XCD, using the XCD-local L2 cache for communication and data reuse.
  • Device-task: Aggregates eight chiplet-tasks spanning the whole device.

This model enables scheduling, data partitions, and synchronization to align with the device’s physical memory and coherence domains. Notably, the Chiplet-task abstraction allows for explicit binding of work/data to chiplets and new forms of intra-chiplet, L2-local cooperative execution, absent in flat grid programming. Figure 2 contrasts chiplet-unaware and Fleet-aware scheduling strategies; Fleet ensures all workers on an XCD access the same weight partition in a coordinated fashion, converting L2 misses into hits.

(Figure 2)

Figure 2: Standard scheduling scatters threads across XCDs, leading to L2 thrashing; Fleet partitions data and aligns all workers on an XCD for coordinated L2 usage.

Cooperative Tiling and L2 Reuse: Optimizing Memory-Bound Linear Layers

The majority of per-token decode in large transformer inference spends time in linear operations with working sets significantly exceeding individual XCD L2 capacities. Fleet exploits its hierarchical model by partitioning the GEMM output column-wise (N-split) to chiplets and then applying a cooperative, M-major tiling traversal within each Chiplet-task. This strategy synchronizes all workers on a chiplet to operate on the same weight columns, maximizing L2 residency and promoting repeated L2 hits before data eviction.

(Figure 3)

Figure 3: M-major tiling (Fleet) ensures spatial-temporal locality by having workers access the same weight columns in sequence, while N-major tiling (standard) destroys locality and inflates L2 misses.

The cache modifier bits (sc1, nt) on AMD GPUs are leveraged to further optimize L2 streaming: weight loads utilize cache-streaming hints to maximize short-lived reuse, and all cooperative communication within a chiplet avoids expensive, device-wide atomic operations or fences. Operator fusion (e.g., fusing SiLU with gate-up projection) removes intermediate buffer writes, contributing further to temporal L2 data reuse.

Hierarchical Scheduling and Synchronization Runtime

Fleet is realized as a persistent kernel runtime wherein a distinguished scheduler workgroup per XCD enqueues tasks for local workers. Chiplet-tasks are broadcast, enabling all workers on an XCD to operate in strict coordination with minimal synchronization overhead.

Hierarchical synchronization is critical under a partitioned cache regime: Fleet confines almost all scheduling, event counters, and atomic updates to the L2 domain of a single XCD. Expensive, global HBM traffic and cross-chiplet fences occur only once per chiplet per event, amortized across all workers. This is in sharp contrast to conventional approaches which may require many unnecessary device-wide atomics.

(Figure 4)

Figure 4: Fleet task model versus standard, and (right) runtime architecture with hierarchical, XCD-local scheduling and minimized global synchronization.

(Figure 5)

Figure 5: Hierarchical synchronization—local counters and communication stay L2-local within each XCD, global event updates (with explicit L2 flush) only occur on task completion, minimizing coherence traffic.

Experimental Results

Evaluations on MI350 with Qwen3-8B in bf16 demonstrate substantial empirical gains for Fleet:

  • At batch sizes 1–8, Fleet achieves 1.3–1.5× lower decode latency versus vLLM, and 1.1–1.2× over a chiplet-unaware megakernel baseline.
  • At batch size 32, Fleet's cooperative scheduling elevates the L2 hit rate from 38.9% (baseline) to 51.0%, reducing HBM reads by 18% and delivering 1.27× speedup. At batch size 64, the hit rate increases to 61.4% (Fleet) versus 39% (baseline), HBM reads are reduced by 37%, and speedup reaches 1.3×.
  • An ablation confirms that at small batch sizes, the majority of Fleet’s advantage derives from reduced dispatcher overhead, while at larger batch sizes, L2 locality and bandwidth savings dominate.

(Figure 6)

Figure 6: Per-token decode latency for Qwen3-8B: Fleet consistently outperforms vLLM and an XCD-unaware persistent kernel baseline, with the largest relative gains at low batch sizes.

(Figure 7)

Figure 7: GEMM roofline analysis—Fleet's L2 reuse shifts operational intensity rightwards, doubling achievable performance relative to standard techniques at moderate batch sizes by mitigating HBM bottlenecks.

Implications and Future Directions

Fleet’s introduction of a Chiplet-task programming and runtime model resolves critical inefficiencies in chiplet GPU architectures, enabling explicit software control over data locality and coordination. The abstraction generalizes to any architecture with partitioned L2 caches, not just AMD's MI350, and can straightforwardly extend to support new GPU backends and future multi-die designs. This work also demonstrates that much of the required “chiplet awareness” can, and should, be surfaced and exploited by the software stack, rather than demanding additional hardware mechanisms.

Theoretically, Fleet’s approach reframes hierarchical scheduling, task fusion, and memory placement as central compiler/runtime problems, potentially unifying persistent kernel and task graph superoptimization. Practically, the demonstrated improvements in per-token latency, HBM bandwidth utilization, and cache efficiency suggest Fleet is well positioned for integration into production LLM inference stacks, especially as model capacity, batch concurrency, and intra-node parallelism continue to escalate.

Further developments could focus on deeper integration with auto-scheduling/superoptimization frameworks (e.g., Mirage), scalable multi-GPU/tensor-parallel deployments, and compiler-driven, cost-model-informed fusion and scheduling that reason directly about chiplet topologies.

Conclusion

Fleet’s hierarchical task abstraction, chiplet-local megakernel runtime, and cooperative L2 tiling collectively resolve the pivotal performance and programmability limitations presented by multi-die GPUs in large-scale inference. By realigning software semantics with hardware topology, Fleet achieves significant improvements in cache utilization, intra-device bandwidth, and end-to-end latency for memory-bound workloads, while cleanly decoupling from device-specific hardware idiosyncrasies. This approach sets a technical foundation for scalable, high-efficiency deployment on future chiplet-based accelerators.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

What this paper is about

This paper explains a new way to run AI programs on modern graphics chips (GPUs) that are built from several smaller “chiplets.” The authors introduce Fleet, a system that organizes work so each chiplet can reuse data in its own local cache, cutting down on long trips to slow memory. This matters a lot for LLMs, where the part that generates each new word (called “decode”) is often slowed by memory, not math.

The big questions the paper asks

  • Today’s GPU software treats the whole chip as one flat space, even though the hardware is split into chiplets with separate caches. Can we change the programming model to match the hardware so we waste less memory bandwidth?
  • Can this help LLMs generate tokens faster by reusing weights (the big matrices) better?
  • Can we do it in a way that avoids launch overhead (waiting to start many small GPU jobs) and keeps latency steady?

How they tackled it (in everyday terms)

Think of the GPU as an apartment building:

  • Each apartment (chiplet) has its own fridge (L2 cache).
  • The building has a shared pantry downstairs (HBM memory), which is slower to reach.
  • If people in different apartments need the same ingredient (weights), you don’t want each person to ride the elevator down to the pantry again and again—you want to share within the same apartment’s fridge.

Fleet reorganizes work so that:

  • People in the same apartment cook parts of the same recipe together, sharing what’s already in their fridge.
  • A “floor manager” (scheduler) in each apartment hands out tasks to roommates (workers), so they fetch and use the same ingredients at the same time.

A new way to think about GPU work

Fleet breaks work into four levels that match where the data lives:

  • Wavefront-task: tiny, runs inside a small team; uses registers/local scratchpad (good for simple element-wise math like activations).
  • CU-task: medium, runs on one compute unit; uses shared on-chip memory (good for attention and normalization).
  • Chiplet-task: new idea in this paper; runs across all workers inside one chiplet, sharing the chiplet’s L2 cache (great for big matrix multiplies in linear layers).
  • Device-task: runs across the whole GPU; coordinates all chiplets.

By making “Chiplet-tasks” a first-class thing, Fleet lets you say “this chunk of work and its data should live together on one chiplet.”

Scheduling inside the GPU “building”

Fleet runs as a single, long-lived “persistent kernel” (think: one big cooking session instead of many tiny ones). In each chiplet:

  • One workgroup acts as the scheduler (the floor manager).
  • The rest are workers (the cooks).
  • The scheduler hands out tasks that are local to the chiplet so workers can reuse the same cached data.

This avoids the delays from launching hundreds of tiny GPU kernels and keeps useful data in the chiplet’s cache between steps.

Sharing data the smart way

For the big matrix multiplies (the heart of LLM layers), Fleet:

  • Splits the output columns across chiplets (each chiplet computes its slice).
  • Coordinates workers so they process batches in an order that maximizes reuse of the same weight tiles (like all roommates using the same open jar before moving to a new one).
  • Uses “streaming” cache hints for weights so they stay just long enough to be reused by neighbors, then make room for the next weights.
  • Avoids clutter: temporary results are written in a way that doesn’t evict valuable weight data from the cache.
  • Fuses small follow-up steps (like SiLU activation) into the same task so data stays on-chip instead of bouncing out and back.

Keeping communication cheap

Talking across chiplets is slower than talking inside one. Fleet uses a “hierarchical” sync plan:

  • Inside a chiplet: workers update local counters (cheap and fast).
  • Across chiplets: only one worker per chiplet sends a global signal when the chiplet’s part is done (reduces expensive cross-chiplet messages).

What they found

Tested on an AMD Instinct MI350 GPU running the Qwen3-8B LLM:

  • At small batch sizes (1–8), Fleet reduces per-token decode latency by 1.3–1.5× compared to vLLM by using a persistent kernel and per-chiplet scheduling.
  • At larger batch sizes, coordinating how weights are shared raises the chiplet cache (L2) hit rate a lot:
    • From 12% to 54% at batch size 32
    • From 39% to 61% at batch size 64
  • This cuts traffic to slow memory (HBM) by up to 37%.
  • Overall, Fleet is 1.27–1.30× faster than a similar “megakernel” that doesn’t know about chiplets.

Why this matters: Most time during LLM decode is spent moving weights, not multiplying them. Better cache reuse directly speeds up token generation.

Why it matters

  • Faster, steadier responses: Lower latency for token generation helps chatbots, code assistants, and voice agents feel more responsive.
  • Better use of modern hardware: Today’s top GPUs are chiplets with separate caches. Fleet’s “Chiplet-task” idea teaches software to use this design well.
  • Less wasted memory traffic: Sharing data within a chiplet cuts duplicate loads, saving energy and bandwidth.
  • Portable thinking: While chiplet designs differ across companies, the core idea—map work to where the data lives—can apply broadly.

The takeaway

Fleet aligns the way we schedule work with the way modern GPUs are built. By introducing Chiplet-tasks and running everything in a single, persistent session, it keeps hot data local, reduces memory trips, and speeds up LLM inference. It’s a simple idea—keep work and data together—but applied carefully, it delivers real, measurable speedups.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list distills what remains missing, uncertain, or unexplored in the paper, with concrete directions for follow-on research:

  • Portability to other vendors: No evaluation on NVIDIA Blackwell or other multi-die GPUs; unclear how Fleet’s chiplet-tasks, cache modifiers, and synchronization map to architectures with globally coherent L2 and different cache hierarchies.
  • Generality across chiplet topologies: Behavior with different numbers/sizes of chiplets, varying L2/LLC capacities, and heterogeneous per-die capabilities is not studied.
  • Role of the shared LLC (Infinity Cache): The paper alludes to characterizing LLC behavior but provides no quantitative results; how cooperative tiling interacts with LLC, and whether LLC sharing undermines L2-locality gains, remains unclear.
  • Auto-selection of partitioning schemes: Criteria and mechanisms for switching between N-split and K-split strategies (batch size, model dimension thresholds) are heuristic and not auto-tuned or formally derived.
  • Lack of auto-tuning: Tile sizes, traversal order, and cache modifier settings are fixed or inspired by prior work; no autotuner explores trade-offs across workloads, hardware, and batch sizes.
  • Scheduling policy robustness: The per-chiplet scheduler’s round-robin and broadcast strategies are not compared against alternatives (work stealing, load-aware distribution, latency vs. throughput objectives).
  • Quantifying scheduling overheads: Only a qualitative note that 3.1% of CUs act as schedulers; no microbenchmarks isolate overheads, especially for very short tasks or high event rates.
  • Fairness and multi-tenancy: A persistent megakernel monopolizes CUs; how Fleet behaves with concurrent kernels, MPS/MIG-like partitioning, or time-slicing/preemption is unaddressed.
  • Dynamic batching/arrival variability: Handling fluctuating request rates, heterogeneous sequences, and priority scheduling in serving scenarios is not evaluated.
  • Correctness and memory-model guarantees: The hierarchical synchronization protocol (L2-local vs. device-scope) lacks formal correctness proofs under weak memory models and corner cases (e.g., device preemption, context switches).
  • Sensitivity to cache modifiers: The three-tier cache policy (streaming weights, NT stores for activations, scheduler polling) is not sensitivity-tested across models, precisions, and hardware revisions; risks of suboptimal eviction or coherence pitfalls remain.
  • Impact on power/energy: Persistent kernels, aggressive polling, and hierarchical fences may affect energy; no measurement of energy-use or thermal effects.
  • Limited operator coverage: Focus is on linear layers and some attention/CU tasks; sparse kernels, normalization variants, quantized/decompressed weights, and other ops (e.g., rotary embeddings, KV cache management) receive limited treatment.
  • Prefill and training phases: Fleet is optimized for memory-bound decode; compute-bound prefill and training workloads (with different reuse patterns and synchronization) are not analyzed.
  • Cross-token reuse opportunities: Whether weights/activations can be kept resident across decode steps to exploit temporal reuse beyond a single token is not explored.
  • Interaction with weight compression: Modern inference uses int8/FP8 or W4A8 with on-the-fly dequantization; effects on L2 residency, tiling, and cache modifiers are not investigated.
  • Compiler/codegen maturity: The task graph generation and code generation pipeline (e.g., integrating cache-scoped instructions) are not fully described; tooling for safely emitting sc0/sc1/NT variants and verifying correctness is lacking.
  • Programmer burden and API design: How developers specify chiplet-tasks, events, and cache policies at a high level (and how errors are prevented) is not elaborated; integration with mainstream frameworks (PyTorch, Triton, TVM) is open.
  • Load imbalance at small batch sizes: When bs < number of chiplets, N-split can underutilize chiplets/CUs; mechanisms for elastic repartitioning or hybrid splits are not presented.
  • Latency vs. throughput trade-offs: Fleet’s gains are shown for latency; whether similar benefits hold at high-throughput regimes or for large batches with sophisticated cuBLAS-like tiling is unclear.
  • Quantifying fence/atomic costs: The hierarchical event design is motivated qualitatively; no direct measurements isolate the overheads of buffer_wbl2 and sc0/sc1 atomics vs. L2-local operations.
  • Multi-GPU/scale-out scenarios: How Fleet’s per-chiplet abstraction composes across multiple GPUs (NVLink/xGMI), with cross-device caches and network bandwidth, remains unexplored.
  • Failure modes and deadlock/livelock: No analysis of edge cases in event-driven schedules (missed signals, queue overflow, scheduler starvation, or pathological task graphs).
  • Security/isolation concerns: Chiplet-local caches and shared LLC raise questions about isolation in multi-tenant settings; Fleet’s implications for side channels or isolation are not discussed.
  • Sensitivity across models and sizes: Results are mostly for Qwen3-8B; behavior for larger/smaller models, longer context lengths, mixture-of-experts, and encoder–decoder architectures is not provided.
  • Preemption and context switching: Persistent kernels can be preempted; the impact on event counters, L2 state, and correctness after resumption is not addressed.
  • Hardware feature dependency: Reliance on AMD-specific HW_ID and sc0/sc1/NT modifiers may limit portability; abstractions for cross-vendor support are not defined.
  • End-to-end system evaluation: Integration with realistic serving stacks (scheduler, tokenizer, networking, batching policies) and tail-latency distributions is not shown; reported gains are kernel/device-centric.

Practical Applications

Overview

The paper introduces Fleet, a hierarchical, chiplet-aware task model and persistent-kernel runtime for multi-die GPUs. It exposes a new “Chiplet-task” abstraction that binds work and data to a chiplet’s private L2 cache and adds per-chiplet scheduling, cooperative tiling, and hierarchical synchronization. On AMD Instinct MI350 with Qwen3-8B, Fleet reduces per-token decode latency by 1.3–1.5× at batch sizes 1–8 versus vLLM, increases L2 hit rates substantially at larger batches (e.g., 12%→54% at bs=32), reduces HBM traffic by up to 37%, and delivers 1.27–1.30× speedups over a chiplet-unaware megakernel baseline.

Below are practical applications enabled by these findings, methods, and innovations.

Immediate Applications

These can be deployed now using current AMD MI300/MI350-class GPUs and ROCm/HIP toolchains.

  • LLM inference acceleration in production serving stacks (software/cloud, finance, healthcare, education, contact centers)
    • Use case: Integrate Fleet’s persistent-kernel runtime and chiplet-aware scheduling into inference frameworks (e.g., vLLM, SGLang, LMDeploy) to reduce per-token latency and HBM traffic for memory-bound decode.
    • What to implement:
    • Replace per-operator launches with a single persistent megakernel.
    • Partition GEMM across chiplets (N-split) and use M-major traversal so workers on the same chiplet access the same weight tiles, converting L2 misses to hits.
    • Fuse elementwise ops (e.g., SiLU) into GEMMs to keep intermediates in registers/LDS.
    • Apply cache modifiers (e.g., streaming for weights, non-temporal for transient stores).
    • Expected impact: 1.3–1.5× lower latency at bs=1–8; 1.27–1.30× speedups and up to 37% HBM traffic reduction at larger batches (bs=32–64).
    • Dependencies/assumptions: AMD MI300/MI350 with private, non-coherent L2 per chiplet; ROCm/HIP ISA support for cache scope/NT bits; framework integration effort; persistent kernels may affect preemption/multi-tenancy.
  • Cost and energy efficiency improvements for managed AI services (cloud, platforms)
    • Use case: Offer Fleet-optimized AMD instances or managed endpoints with lower cost-per-token and lower energy per request due to reduced HBM bandwidth consumption.
    • Tools/products/workflows: “FleetRT” runtime library packaged as a backend for popular serving frameworks; autoscaling tuned for deterministic latency across batch sizes.
    • Dependencies/assumptions: Datacenter GPUs with multi-die topology; SRE policies accommodating persistent kernels; monitoring to track L2 hits/HBM bandwidth.
  • On-prem/enterprise deployments needing strict latency SLAs (healthcare, finance, customer support)
    • Use case: Real-time clinical note summarizers, financial copilots, voice/chat assistants with sub-10 ms per-token latency on AMD MI350 nodes.
    • Workflow: Deploy Fleet-enabled inference services inside regulated environments to meet latency and data residency constraints.
    • Dependencies/assumptions: HIP/ROCm availability on-prem; validation for regulated environments; latency gains most pronounced in memory-bound decode.
  • Chiplet-aware BLAS/GEMM kernels and libraries (HPC, scientific computing, energy)
    • Use case: Provide a ROCm/hipBLAS add-on that partitions large GEMMs across chiplets with cooperative tiling and hierarchical synchronization.
    • Tools/products/workflows: A “chiplet-aware GEMM” library for batched GEMV/GEMM used by recommender systems, physics solvers, or scientific post-processing pipelines that are memory-bound.
    • Dependencies/assumptions: Benefits strongest when weight-like operands dominate working set and L2 reuse can be coordinated; tuning for 4 MB L2 per chiplet.
  • Compiler/runtime schedule templates (ML compilers)
    • Use case: Add Fleet-like “Chiplet-task” schedules and hierarchical sync to TVM/Triton templates for AMD GPUs (e.g., an N-split template for decode GEMMs).
    • Tools/products/workflows: Auto-schedule knobs for M-major traversal, device-vs-chiplet task granularity, and cache policies; dynamic selection based on batch size.
    • Dependencies/assumptions: Backend access to ISA features (scope bits, buffer_wbl2); per-chiplet ID access for schedulers; correctness testing for hierarchical sync.
  • Performance engineering and observability (devtools)
    • Use case: Create profiling/monitoring tools that surface per-chiplet L2 hit rates, HBM traffic, and chiplet-task occupancy to validate locality strategies.
    • Tools/products/workflows: “Fleet Profiler” plugin for rocprof/Perfetto/Grafana tracking L2/TCC metrics by XCD and kernel segment.
    • Dependencies/assumptions: Availability of per-chiplet counters/telemetry; low overhead instrumentation.
  • Teaching and curriculum content on chiplet-NUMA GPU programming (academia)
    • Use case: Lab modules using Fleet concepts (chiplet tasks, cooperative L2 reuse, hierarchical synchronization) to teach modern multi-die GPU performance engineering.
    • Dependencies/assumptions: Access to AMD multi-die GPUs in teaching labs; simplified examples (e.g., batched GEMV microbenchmarks).

Long-Term Applications

These require additional research, scaling, portability engineering, or ecosystem changes.

  • Standardized chiplet-aware programming models and APIs (ecosystem/policy/standards)
    • Use case: Extend CUDA/HIP/SYCL/OpenMP offload with chiplet-scoped tasks, locality hints, and memory scope controls to make chiplet-awareness portable.
    • Potential outcomes: API-level “chiplet group” abstractions, per-chiplet synchronization primitives, and standard cache modifier policies.
    • Dependencies/assumptions: Vendor and standards-body buy-in; specification and conformance tests; cross-vendor architectural differences (coherent vs. private L2).
  • Porting Fleet concepts to other vendors and architectures (software/hardware co-design)
    • Use case: Adapt Chiplet-task scheduling to NVIDIA Blackwell (dual-die with globally coherent L2) and future multi-die parts across vendors.
    • Expected changes: Different coherence semantics, cache sizes, and per-die latencies; need to tune traversal and synchronization strategies to architecture.
    • Dependencies/assumptions: Access to low-level scope controls; profiling hooks for per-die metrics; potential smaller gains with globally coherent L2.
  • Automatic task graph partitioning and autotuning in ML compilers (software/ML systems)
    • Use case: Integrate auto-partitioners into TVM, PyTorch Inductor, or XLA that map ops to wavefront/CU/Chiplet/Device tasks based on working set and batch size.
    • Products/workflows: Learned cost models exploring N-split vs. K-split hybrids, fusion opportunities, and cache policy assignments per op and batch regime.
    • Dependencies/assumptions: Robust hardware models for multi-die NUMA effects; generalization across models; training data for cost models.
  • Training and multi-GPU/cluster extensions (HPC/AI training)
    • Use case: Apply hierarchical tasks to training loops (e.g., optimizer updates, KV cache writes, activation checkpointing) and across devices (per-GPU tasks with NVLink/IF-aware partitioning).
    • Workflows: Cluster-level “Device-task” composition to coordinate per-GPU chiplet tasks, reducing interconnect pressure for memory-bound phases.
    • Dependencies/assumptions: More complex dependency graphs; interaction with distributed training paradigms (TP/PP/FSDP); careful overlap of comm/compute.
  • Datacenter scheduling and multi-tenancy integration (cloud/platforms)
    • Use case: K8s/NVML/ROCm operators that recognize persistent, chiplet-aware megakernels and schedule tenants to reduce cross-chiplet interference and maintain fairness.
    • Products/workflows: “Chiplet-aware GPU Operator” that controls residency, time-slicing, and isolation; SLA-aware admission control for persistent kernels.
    • Dependencies/assumptions: OS/driver-level support for chiplet partitioning; preemption-friendly megakernel designs; telemetry for fairness and isolation.
  • Hardware/firmware features co-designed with Fleet-like runtimes (hardware vendors)
    • Use case: Hardware support for per-chiplet dispatch queues, fast in-silicon chiplet barriers, and programmable cache policies per access class.
    • Outcomes: Lower-latency cross-chiplet synchronization, reduced need for explicit fences (buffer_wbl2), and improved programmability for locality.
    • Dependencies/assumptions: Silicon changes across generations; verification effort; balance with general-purpose programming models.
  • Sector-specific low-latency applications beyond LLMs (robotics, autonomous systems, AR/VR, healthcare)
    • Use cases:
    • Robotics/autonomy: On-robot multi-die accelerators running real-time language/vision models with chiplet-locality scheduling for deterministic latency.
    • AR/VR and gaming AI: Lower-latency on-device assistants or NPC reasoning on multi-die GPUs.
    • Healthcare imaging and decision support: Memory-bound post-processing kernels (e.g., certain transforms) scheduled chiplet-locally for lower turnaround time.
    • Dependencies/assumptions: Availability of multi-die accelerators in edge/embedded form factors; model/operator mapping to memory-bound patterns.
  • Sustainability and public-sector procurement policies (policy)
    • Use case: Include chiplet-aware optimization requirements or incentives in “green AI” guidelines and public-sector RFPs to reduce datacenter energy (via lower HBM traffic).
    • Dependencies/assumptions: Clear, auditable metrics (e.g., per-token energy); vendor-neutral guidance; training for procurement and compliance teams.

Notes on feasibility across items:

  • Gains depend on memory-bound phases (e.g., LLM decode); compute-bound workloads may see limited benefit.
  • Batch-size regime matters: persistent-kernel/scheduling reduces overhead at small batches; cooperative L2 reuse increases importance at moderate batches; K-split GEMMs may dominate at very large batches.
  • Persistent kernels can complicate preemption, sharing, and debugging; productionization requires careful ops integration.
  • Low-level cache scope/NT controls and fences are currently vendor-specific; broad portability requires standardization.

Glossary

  • Accelerator Complex Die (XCD): A chiplet die used in AMD GPUs that integrates multiple compute units and a private L2 cache. "eight XCDs (Accelerator Complex Dies)"
  • bf16: A 16-bit floating-point format (bfloat16) commonly used in deep learning to reduce memory bandwidth while maintaining range. "in bf16"
  • buffer_wbl2: An AMD GPU ISA instruction that writes back dirty L2 cache lines to make data visible across chiplets. "the producer must issue buffer_wbl2 to write back dirty L2 lines"
  • cache-streaming: A cache allocation policy that allows short-lived reuse and prioritizes eviction of streamed lines; on AMD set via sc1=1, nt=1. "cache-streaming (sc1=1, nt=1)"
  • chiplet: A separate silicon die within a multi-die GPU package that has its own local resources and caches. "binds work and data to a chiplet"
  • chiplet swizzling: A platform-specific technique to remap or distribute data/work across chiplets for locality or balance. "chiplet swizzling"
  • Chiplet-task: A Fleet task type that spans all workers on a single chiplet to coordinate via its shared L2. "Chiplet-tasks, a new abstraction"
  • Coherent_Cache_Bypass: An AMD L2 cache policy that ensures cross-chiplet visibility by bypassing or writing back modified lines before reissuing reads. "Coherent_Cache_Bypass policy"
  • Compute Unit (CU): A GPU execution block that runs workgroups and provides L1/LDS resources. "32 CUs per XCD"
  • CUDA/HIP: GPU programming models from NVIDIA and AMD, respectively, that define threads, workgroups, and grids. "The CUDA/HIP programming model was designed for monolithic GPUs"
  • CUDA/HIP graph capture: A mechanism to record and replay sequences of GPU kernels to reduce launch overhead. "CUDA and HIP graph capture"
  • CU-task: A Fleet task type that occupies a single compute unit and uses LDS/L2 for intra-task data. "CU-tasks occupy a single compute unit"
  • Device-task: A Fleet task that coordinates all chiplets on a device, completing only when all chiplet subtasks finish. "a device-task, comprised of eight Chiplet-tasks"
  • flat_atomic_add: An AMD GPU atomic operation used here with GPU-scope to update global counters across chiplets. "flat_atomic_add with sc0 sc1"
  • GEMM: General Matrix-Matrix Multiply, the core linear algebra operation in neural network layers. "For an [M,K] * [K,N] GEMM"
  • GEMV: General Matrix-Vector Multiply; batched forms appear in decode workloads. "Optimizing Batched GEMV"
  • HBM: High Bandwidth Memory attached to the GPU package, providing very high throughput to device DRAM. "reducing HBM traffic by up to 37\%"
  • Infinity Cache: AMD’s large on-package last-level cache shared across chiplets (also called MALL). "MALL (Infinity Cache)"
  • Infinity Fabric: AMD’s on-package interconnect that links chiplets and I/O dies. "traverses the Infinity Fabric to the IO die"
  • IO die: A die in the GPU package that routes memory and coherence traffic between chiplets and HBM. "to the IO die"
  • K-split: A GEMM partitioning strategy that splits along the reduction (K) dimension, requiring partial-result reductions. "An alternative strategy partitions the reduction dimension (K-split)"
  • KV cache: The key–value cache holding past attention states for efficient autoregressive decoding. "KV cache"
  • last-level cache (LLC): The largest on-package cache level before HBM; on AMD GPUs this is the Infinity Cache. "last-level cache (LLC)"
  • L2 cache: A mid-level on-chip cache; on chiplet GPUs it is partitioned and private to each chiplet. "a private 4\,MB L2 cache (TCC)"
  • LDS (Local Data Share): On-chip shared memory per CU used for fast intra-workgroup communication. "using registers and LDS"
  • megakernel: A single, large persistent kernel that fuses many operators to reduce launch overhead and enable data reuse. "chiplet-unaware megakernel baseline"
  • Mirage Persistent Kernel: A system for building persistent kernels that Fleet extends with chiplet-aware features. "extends the Mirage Persistent Kernel system"
  • M-major traversal: A tile traversal order that advances primarily along the M (batch/row) dimension to improve weight reuse in L2. "windowed M-major traversal"
  • M-split traversal: A strategy assigning disjoint M-tiles to chiplets so each works on different rows without cooperative weight reuse. "M-split traversal."
  • multi-die: A GPU packaging style that integrates multiple dies (chiplets) in one package for scalability. "multi-die designs"
  • non-temporal (NT): A memory access modifier indicating data should bypass or minimally use cache because it has little reuse. "non-temporal bit (NT)"
  • N-major ordering: A tile traversal that advances along the N (column) dimension; here it reduces L2 reuse across workers. "N-major ordering were used instead"
  • N-split: A GEMM partitioning strategy that splits the output columns across chiplets, avoiding cross-chiplet reductions. "Fleet partitions GEMM output columns across XCDs (N-split)"
  • NUMA-like: Refers to non-uniform memory access effects within a single device due to chiplet-local caches. "introduces NUMA-like programming effects"
  • persistent kernel: A kernel that stays resident on the GPU to execute many operations/tasks, amortizing launch costs. "persistent kernel programming"
  • RoPE: Rotary positional embeddings, an attention mechanism component for encoding relative positions. "RoPE"
  • RMSNorm: Root Mean Square Layer Normalization, a normalization technique used in many transformer models. "RMSNorm"
  • SC1/SC0: AMD instruction bits that set the coherence scope (e.g., wave, group, device, system) for memory operations. "two scope bits (SC1, SC0)"
  • split-K GEMM: A GEMM technique that splits along K and later reduces partial results, often used to improve parallelism. "split-K GEMM"
  • TCC: AMD’s L2 cache controller block; here used to denote the per-chiplet L2. "a private 4\,MB L2 cache (TCC)"
  • threadfence: A GPU memory fence that ensures prior writes are visible before proceeding, used before cross-chiplet signaling. "issues a single threadfence"
  • wavefront: The SIMD execution grouping on AMD GPUs (typically 64 threads) analogous to a warp. "a single wavefront (64 threads on AMD)"
  • wavefront-task: A Fleet task type that runs within one wavefront for fine-grained, element-wise operations. "wavefront-tasks operate within a single wavefront"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 216 likes about this paper.