Swizzled Head-first Mapping in GPU Scheduling
- The paper introduces Swizzled Head-first Mapping, a NUMA-aware strategy that confines each attention compute cluster to a single chiplet to enhance cache reuse.
- It employs a swizzle via index arithmetic to map all workgroups of an attention head contiguously, thereby reducing remote memory access and fragmentation.
- Experimental evaluations on AMD MI300X demonstrate up to 50% throughput improvement and L2 cache hit rates of 90–96% compared to conventional scheduling methods.
Swizzled Head-first Mapping is a spatially-aware GPU kernel scheduling strategy that aligns attention heads or grouped-query attention (GQA) groups with physical Non-Uniform Memory Access (NUMA) domains on modern multi-chiplet (multi-die) GPU architectures. Its core design objective is to exploit intra-chiplet cache reuse by co-locating all computation for an attention compute cluster (ACC)—the set of workgroups sharing the same key and value tensors (K, V)—on a single chiplet (XCD), thereby minimizing cross-die memory traffic and maximizing cache efficiency. This technique is particularly vital on architectures such as AMD's MI300X, which feature NUMA effects due to decentralized compute, cache, and HBM memory controllers per XCD, leading to highly non-uniform memory access latencies and bandwidths.
1. NUMA Effects and Motivation
The advent of multi-chiplet GPU architectures for AI workloads, such as AMD's MI300X with eight XCDs per device, has introduced pronounced NUMA effects: each XCD possesses its own compute units, a 4 MB L2 cache, and HBM3 memory controllers. Reads and writes to local memory and caches are low-latency and high-bandwidth, while accesses to remote XCDs incur greater latency and reduced bandwidth. The cache hierarchy is fragmented, with per-XCD L2 subsystems that do not maintain coherence across chiplets. Without NUMA-aware scheduling, logical units of computation (e.g., attention heads) are dispersed across XCDs, resulting in frequent high-latency HBM accesses, minimal cache reuse, and degraded throughput. Prior research on GEMM workloads demonstrated L2 hit rates improving from 43% to 92% under NUMA-aware mapping, highlighting the importance of localizing computation to a single NUMA domain in memory-bound kernels.
2. Structure of Attention Kernel Computation
Multi-head attention (MHA) and grouped-query attention (GQA) decompose the model’s representational space into multiple attention heads or groups. In practical implementations such as FlashAttention2, the query tensor Q is divided into row blocks, each processed by a workgroup. Within a head or GQA group, all workgroups require access to the same key and value tensors K and V. This forms the attention compute cluster (ACC)—the natural cache-sharing unit. Localizing all computation for an ACC on a single XCD maximizes L2 cache reuse and minimizes remote memory transactions. When batches, heads, or sequence lengths are large, the importance of such data locality is amplified; fragmenting a single ACC across XCDs causes sharp increases in cache misses and redundant movement of K and V between memory and cache hierarchies.
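To make the ACC structure concrete, the following sketch enumerates a FlashAttention2-style launch grid and groups its workgroups by the K/V pair they share. It is a minimal plain-Python illustration; the function name, the assumed 128-row query block, and the grid ordering are assumptions, not the paper's implementation.

```python
import math

def attention_grid(seq_len, block_m, num_heads, batch):
    """Enumerate a FlashAttention2-style launch grid and group the
    workgroups into ACCs: all row blocks that read the same K/V pair."""
    num_row_blocks = math.ceil(seq_len / block_m)
    grid = (num_row_blocks, num_heads, batch)

    accs = {}  # (batch, head) -> workgroups sharing that head's K and V
    for b in range(batch):
        for h in range(num_heads):
            accs[(b, h)] = [(i, h, b) for i in range(num_row_blocks)]
    return grid, accs

# Example: a 128k-token prefill with 128 heads (as in the evaluation
# below) and an assumed 128-row query block.
grid, accs = attention_grid(seq_len=128 * 1024, block_m=128,
                            num_heads=128, batch=1)
print(grid)               # (1024, 128, 1)
print(len(accs[(0, 0)]))  # 1024 workgroups all reusing the same K/V
```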
3. Mapping Algorithms: Baselines and Limitations
Prior attention kernel scheduling approaches, agnostic to NUMA, frequently employ:
- Naive Block-First: Iterates over all heads for each row block, round-robining workgroups across XCDs with no regard for locality.
- Naive Head-First: Processes all blocks of one head but distributes these across XCDs, partially fragmenting ACCs.
- Swizzled Block-First (GQA-aware): Co-locates each GQA group with an XCD, performing well only when the number of GQA groups matches the number of XCDs.
Under scenarios typical in modern LLMs (e.g., 128 heads, 8 XCDs), all of these methods fragment ACCs to some degree: each XCD’s cache must hold data for multiple ACCs concurrently, eroding cache effectiveness and, for the block-first variants, driving L2 hit rates below 1% at scale.
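The capacity pressure these baselines create can be illustrated with a short simulation. It assumes, purely for illustration, that the hardware dispatches workgroup p to XCD p mod 8 (the round-robin behavior noted above) and that 16 workgroups are co-resident per XCD; neither figure nor the code comes from the paper.

```python
NUM_XCDS, WAVE = 8, 16   # assumed: 8 XCDs, 16 co-resident workgroups each
R, H = 1024, 128         # row blocks per head, attention heads (batch = 1)

# Naive block-first issue order: every head of row block 0, then block 1, ...
order = [(i, h) for i in range(R) for h in range(H)]

# Assign the first wave of workgroups to XCDs round-robin by linear ID
# and record which heads (ACCs) each XCD is executing concurrently.
first_wave = {x: set() for x in range(NUM_XCDS)}
for pid, (_, h) in enumerate(order[: NUM_XCDS * WAVE]):
    first_wave[pid % NUM_XCDS].add(h)

# Every XCD's first wave mixes workgroups from 16 different heads, so its
# 4 MB L2 must hold the K/V working sets of 16 ACCs at once.
print({x: len(heads) for x, heads in first_wave.items()})
# {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16}
```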
4. Swizzled Head-first Mapping: Principles and Implementation
The Swizzled Head-first Mapping strategy directly addresses NUMA fragmentation by strictly confining each ACC to a single XCD. The scheduling algorithm remaps linear workgroup IDs so that all blocks belonging to an attention head (or GQA group) are mapped contiguously to a designated XCD. The pseudocode executes a "swizzle" via index arithmetic, ensuring that for every batch and head, all needed workgroups are scheduled sequentially on the same chiplet before moving to the next head/XCD assignment.
The essential logic is:
- For a launch grid of (num_row_blocks, H, B) workgroups, map head h of batch b to XCD (b · H + h) mod N_XCD (sketched in code at the end of this section).
- Process all row blocks of Q for that head on the assigned XCD.
- Move to the next head/XCD assignment in a round-robin fashion.
This design ensures:
- The per-XCD L2 cache stores and reuses only one ACC at a time.
- Memory movement is always local, in alignment with physical memory topology.
- Kernel modifications are minimal—only index arithmetic is changed, leaving computational logic intact, thus enabling rapid integration into codebases such as Triton and compatibility with ROCm.
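As a concrete illustration of the index arithmetic, the sketch below remaps a linear workgroup ID to a (batch, head, row block) assignment so that every row block of a head lands on one XCD. It is a minimal plain-Python sketch, assuming the round-robin dispatch rule (workgroup p runs on XCD p mod 8) and that the number of (batch, head) pairs divides evenly across XCDs; the function name and exact remapping formula are illustrative, not the paper's reference kernel.

```python
NUM_XCDS = 8  # MI300X: eight XCDs per device (assumed dispatch: pid % 8)


def swizzled_head_first(pid, num_row_blocks, num_heads, batch):
    """Map a linear workgroup ID to (batch, head, row_block) such that
    all row blocks of a given (batch, head) pair run on a single XCD."""
    assert (batch * num_heads) % NUM_XCDS == 0  # simplifying assumption

    xcd = pid % NUM_XCDS     # XCD this workgroup is dispatched to
    slot = pid // NUM_XCDS   # its position within that XCD's work list

    # Each XCD walks its assigned heads one full head at a time:
    # heads xcd, xcd + NUM_XCDS, xcd + 2*NUM_XCDS, ... (flattened over batch).
    flat_head = xcd + NUM_XCDS * (slot // num_row_blocks)
    row_block = slot % num_row_blocks

    b, h = divmod(flat_head, num_heads)
    return b, h, row_block


# Quick check: with 128 heads, batch 1, and 16 row blocks, the 16
# workgroups of every head resolve to exactly one XCD.
if __name__ == "__main__":
    B, H, R = 1, 128, 16
    xcds_per_head = {}
    for pid in range(B * H * R):
        b, h, _ = swizzled_head_first(pid, R, H, B)
        xcds_per_head.setdefault((b, h), set()).add(pid % NUM_XCDS)
    assert all(len(s) == 1 for s in xcds_per_head.values())
```

In a Triton kernel, the same arithmetic would typically be applied to the value returned by tl.program_id before computing block offsets, leaving the rest of the kernel untouched.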
5. Experimental Evaluation and Comparative Results
Empirical evaluation is conducted on AMD's MI300X, with all attention kernels implemented in Triton and profiled using ROCProfiler. Testing scenarios include MHA and GQA with variable batch sizes, head counts up to 128, and sequence lengths in the hundreds of thousands (e.g., DeepSeekV3 prefill with 128 heads at 128k tokens).
- Throughput: Swizzled Head-first Mapping achieves up to 50% higher throughput compared to block-first strategies under high head count and long context length (e.g., block-first drops to 0.65x efficiency at 128k tokens and 128 heads).
- L2 Cache Hit Rate: Naive/block-first strategies cause L2 hit rates to collapse below 1%. In contrast, Swizzled Head-first maintains 90–96% L2 hit rates in all regimes considered, directly underlying the observed speedup.
- Robustness: The approach holds for both MHA and GQA, and extends to the backward pass of FlashAttention2, large batch sizes, and diverse model configurations.
- Minimal Implementation Overhead: The change is restricted to scheduling math; kernel functionality and compute structure are otherwise unaltered.
| Mapping Strategy | L2 Hit Rate (128 heads, 128k tokens) | Relative Throughput | Notes |
|---|---|---|---|
| Swizzled Head-first | 90–96% | 100% | One ACC per XCD |
| Naive Head-first | 40–60% | ~90% | Some cache fragmentation |
| Swizzled Block-first | ~1% | 70–76% | Good for GQA = XCD only |
| Naive Block-first | ~1% | <65% | Severe cache fragmentation |
6. Mathematical Model and Scheduling Formalization
Formally, for H attention heads, B batches, context length L, and N_XCD XCDs, the Swizzled Head-first assignment is

XCD(b, h) = (b · H + h) mod N_XCD

for head index h ∈ {0, …, H−1} and batch index b ∈ {0, …, B−1}. The set of workgroups for (b, h), across all row blocks i = 0, …, ⌈L/B_r⌉ − 1 (where B_r is the query row-block size), executes on this assigned XCD. The mapping is spatially coherent and arithmetically simple, yielding maximal cache reuse and minimal cross-chiplet transfers.
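A quick numeric check of this assignment, using the 128-head, 8-XCD configuration discussed above (illustrative code, not from the paper):

```python
# Round-robin head-to-XCD assignment: XCD(b, h) = (b*H + h) mod N_XCD.
H, B, N_XCD = 128, 1, 8

def xcd(b, h):
    return (b * H + h) % N_XCD

print([xcd(0, h) for h in range(10)])  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]

# Each XCD receives H*B / N_XCD = 16 complete heads, and every row block
# of a given head executes on that head's single assigned XCD.
assert all(sum(xcd(0, h) == x for h in range(H)) == H * B // N_XCD
           for x in range(N_XCD))
```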
7. Implications and Outlook
NUMA-aware spatial scheduling, epitomized by Swizzled Head-first Mapping, is now essential to realizing the capabilities of contemporary chiplet-based GPU systems for large-scale attention workloads. Its adoption provides up to 50% performance gains and order-of-magnitude improvements in L2 utilization, and it generalizes to both MHA and GQA. For current and emerging LLM deployments—featuring high head counts (e.g., Llama-3 405B, DeepSeekV3) and long-sequence requirements—such techniques are a prerequisite for hardware efficiency. The method’s compatibility with standard tooling (Triton, ROCm) and its negligible engineering overhead make broad adoption practical.
As multi-die architectures proliferate, spatially-aware kernel scheduling will underpin scalable AI training and inference, and Swizzled Head-first Mapping represents a foundation for further research and practical deployment in NUMA-affected computational environments.