Swizzled Head-First Mapping for AI GPUs
- SHFM is a NUMA-aware scheduling strategy that remaps attention head tiles to the same chiplet, significantly boosting L2 cache efficiency.
- The technique optimizes multi-head attention by confining K/V matrix reuse within a single NUMA domain, reducing remote memory accesses and bandwidth bottlenecks.
- Evaluated on the AMD MI300X, SHFM achieves up to 1.50× throughput improvement and 90–96% L2 hit rates compared to traditional scheduling methods.
Swizzled Head-First Mapping (SHFM) is a workgroup scheduling strategy that addresses non-uniform memory access (NUMA) effects on multi-chiplet AI GPUs in large-scale attention workloads such as multi-head attention (MHA). SHFM remaps workgroups so that all tiles belonging to a given attention head are dispatched to the same GPU chiplet (an XCD, which acts as one NUMA domain), improving L2 cache efficiency and end-to-end attention throughput. The technique was introduced and evaluated on the AMD MI300X platform, where it delivered up to 1.50× higher throughput than state-of-the-art attention scheduling approaches (Choudhary et al., 3 Nov 2025).
1. NUMA Effects in Multi-Chiplet AI GPUs
Modern high-performance AI GPUs, exemplified by the AMD MI300X, are architected as multi-chiplet devices. Each chiplet, or XCD, comprises dedicated compute units (CUs), a private L2 cache (4 MB per XCD), and local memory controllers interfaced to individual HBM stacks. This topology results in spatial memory partitioning: memory access cost (in latency and bandwidth) depends on whether the access targets the local resources of a chiplet or traverses inter-chiplet links to remote resources. “Local” accesses—CU to its own XCD’s L2 and local HBM—exhibit low latency and high bandwidth, while “remote” accesses—CU to a different XCD’s L2 or HBM—are characterized by 2–3× higher latency and reduced bandwidth.
By default, GPU kernel schedulers dispatch workgroups (WGs) in a round-robin fashion across XCDs, leading to fragmentation of data locality when consecutive WGs share data, as is typical in attention mechanisms. In such scenarios, redundant global memory fetches proliferate, reducing L2 hit rates and impeding overall throughput.
2. Challenges of Conventional Scheduling for Multi-Head Attention
Typical MHA implementations, such as FlashAttention2, decompose the Q matrix into row blocks (“tiles”), assigning each block to a WG. Within this configuration, all WGs associated with a particular attention head collectively require the same K and V matrices, forming an Attention Compute Cluster (ACC). Legacy scheduling approaches—such as Block-First and Head-First, even when naively “swizzled”—tend to scatter the tiles of a single ACC across multiple XCDs:
- Block-First (round-robin across heads): Splits every ACC across all chiplets, resulting in near-zero reuse at the L2 level (∼1% L2 hit rate at high head counts).
- Head-First (all blocks of one head, then next head): Improves temporal coherence but still distributes an ACC’s tiles over several chiplets, yielding modest reuse (40–60% L2 hit rates in the large-H regime).
These patterns introduce critical NUMA-induced inefficiencies:
- Excessive latency: Remote L2 accesses predominate.
- Bandwidth saturation: Redundant fetches for the same K/V data across XCDs.
- Collapsed cache locality: Inability to confine ACC working sets to a single L2 cache.
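The scattering effect for the Head-First case can be illustrated with a small simulation. This is a minimal sketch, not code from the paper: it assumes a single batch element, that workgroup `wid` is dispatched to XCD `wid % num_xcds` (the round-robin policy described above), and the function and parameter names are hypothetical.

```python
def xcds_per_head_head_first(num_heads, tiles_per_head, num_xcds):
    """Count how many distinct XCDs the tiles of each head land on.

    Assumes naive Head-First ordering (all tiles of head 0, then head 1, ...)
    and round-robin dispatch of consecutive workgroup IDs across XCDs.
    """
    touched = [set() for _ in range(num_heads)]
    for wid in range(num_heads * tiles_per_head):
        head = wid // tiles_per_head   # Head-First: tiles of a head are consecutive
        xcd = wid % num_xcds           # assumed round-robin dispatch
        touched[head].add(xcd)
    return [len(xcds) for xcds in touched]


# Illustrative sizes only: 16 heads, 32 tiles per head, 8 XCDs.
counts = xcds_per_head_head_first(num_heads=16, tiles_per_head=32, num_xcds=8)
print(counts)
```

Under these assumptions every head's tiles reach `min(tiles_per_head, num_xcds)` chiplets, so each XCD has to fetch that head's K/V data into its own L2.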
3. SHFM Algorithm and Mapping Function
SHFM resolves NUMA-induced inefficiency in attention kernels with a lightweight remapping of workgroup IDs prior to dispatch, ensuring all tiles for a given attention head (and corresponding K/V payloads) reside and execute within a single chiplet’s NUMA domain. The mapping operates as follows:
Let
- B = batch size
- H = number of heads
- T = number of tiles per head (per batch), i.e., the number of Q row blocks
- C = number of chiplets (XCDs)
- w = original linear WG ID
Decompose w into
- b (batch index)
- r (intra-batch offset)
- h (head index)
- t (tile-block index)
Assuming H mod C = 0 for even head distribution:
- H / C (heads per chiplet)
- c(h) (chiplet assignment for head h)
The remapped WGID (the “swizzle”) is computed from these quantities so that all WGs for a head, over all tiles and batches, land on the same chiplet. The remapping takes roughly 10 lines of code in a Triton kernel and requires no changes to the main compute loop.
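One way such a remapping could look is sketched below in plain Python. It is an illustration under stated assumptions, not the formula from the paper: it assumes hardware slot `pid` executes on XCD `pid % num_xcds`, that heads are assigned to XCDs in contiguous groups, and that the head count is divisible by the XCD count. The function name `shfm_remap` and its parameters are hypothetical.

```python
def shfm_remap(pid, batch, num_heads, tiles_per_head, num_xcds):
    """Map a hardware workgroup slot to a logical (batch, head, tile) triple.

    Assumptions (not taken from the paper): slot `pid` runs on XCD
    `pid % num_xcds`, and heads are grouped contiguously per XCD.
    """
    assert num_heads % num_xcds == 0, "sketch requires an even head split"
    heads_per_xcd = num_heads // num_xcds

    xcd = pid % num_xcds             # XCD assumed to execute this slot
    local = pid // num_xcds          # position within that XCD's queue
    per_head = batch * tiles_per_head  # work items of one head over all batches

    head = xcd * heads_per_xcd + local // per_head
    rem = local % per_head
    b = rem // tiles_per_head
    tile = rem % tiles_per_head
    return b, head, tile


# Every slot on XCD x only ever receives heads owned by XCD x.
B, H, T, C = 2, 16, 4, 8
for pid in range(B * H * T):
    _, head, _ = shfm_remap(pid, B, H, T, C)
    assert head // (H // C) == pid % C
```

In a kernel, the returned indices would replace the ones normally derived from the program ID, which is why the main compute loop can stay unchanged.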
4. Implementation on AMD MI300X
On MI300X hardware, there are 8 chiplets (C=8), each with 38 CUs, 16 KB L1 per CU, and a private 4 MB L2. Memory is provided by HBM3, with an aggregate bandwidth of 5.3 TB/s. SHFM is implemented by augmenting the attention kernel (FlashAttention2 forward and backward) in Triton: at program launch, the remapping logic computes the new workgroup ID for dispatch. No modification to the underlying Q/K/V math or memory access patterns is needed; only the launch order and XCD affinity of WGs are altered.
This method relies on the scheduler’s implicit guarantee of static (non-dynamically load-balanced) per-XCD queues at a workgroup granularity.
5. Performance Metrics and Experimental Results
Experimental evaluation on AMD MI300X utilized ROCm Profiler v3 for hardware counter analysis. The test suite swept MHA and GQA workloads across:
- Context lengths
- Batch sizes
- Heads
- Head dimension = 128; BLOCK_M × BLOCK_N = 128 × 64
- GQA: ; ; head_dim=128
Metrics:
- Throughput: in tokens × heads/s
- L2 cache-hit rate (from ROCProfiler)
- Results normalized to SHFM = 1.0
Key findings:
- Up to 1.50× higher forward-pass throughput for MHA at ,
- SHFM achieves a 90–96% L2 hit rate, versus:
  - Naive Block-First: ∼1% (at high head counts), throughput ≈ 0.55×
  - Naive Head-First: 40–60%, throughput ≈ 0.90×
  - Swizzled Block-First: 10–70%, throughput ≈ 0.65–0.80×
- For GQA, both SHFM and Swizzled Block-First sustain ≈1.0×; Naive Block-First drops to 0.7× at high
- DeepSeek-V3 prefill (): SHFM = 1.0 versus Naive Block-First ≈ 0.65×
- Backward pass acceleration up to 1.10× at ,
6. Analysis, Limitations, and Generalization
SHFM’s efficacy stems from complete ACC co-location: each head’s K/V matrices are loaded once per chiplet, then reused from L2 for all tiles and batches. The resulting L2 hit rates (90–96% in the evaluation) reduce off-chip traffic by 2–5×, removing memory stalls and permitting near-peak compute throughput.
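A rough working-set calculation helps put the 4 MB per-XCD L2 in context. The sketch below is illustrative only: the sequence lengths and the 2-byte element size (for example FP16 or BF16) are assumptions, not configurations reported in the source, and the helper name is hypothetical.

```python
L2_BYTES_PER_XCD = 4 * 1024 * 1024  # private L2 per XCD on MI300X


def kv_bytes_per_head(seq_len, head_dim, bytes_per_elem=2):
    """Size of one head's K and V matrices for a single batch element."""
    # K and V each hold seq_len x head_dim elements.
    return 2 * seq_len * head_dim * bytes_per_elem


for seq_len in (2048, 4096, 8192):  # assumed context lengths
    size = kv_bytes_per_head(seq_len, head_dim=128)
    print(seq_len, size, size / L2_BYTES_PER_XCD)
```

With a head dimension of 128 and 2-byte elements, one head's K/V footprint reaches the full 4 MB at a context length of 8192. Capacity is only part of the picture, since co-located tiles that stream the same K/V blocks close together in time can still share cache lines.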
Limitations include:
- Requires the number of heads to be divisible by the number of chiplets (H mod C = 0, an integral head-to-chiplet mapping); otherwise, head imbalance may cause chiplet underutilization.
- Potential incompatibility with hardware schedulers supporting large chunk sizes or dynamic load balancing; software and driver validation is necessary in those cases.
- Provides no benefit on monolithic GPUs with unified L2 cache (lacking NUMA domains).
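The head-imbalance limitation can be quantified with a short calculation. This is a sketch under simplifying assumptions that are not from the source: every head represents the same amount of work, whole heads are pinned to XCDs as evenly as possible, and utilization is measured against the busiest XCD. The function name is hypothetical.

```python
def chiplet_utilization(num_heads, num_xcds):
    """Return per-XCD head counts and utilization relative to the busiest XCD."""
    base, extra = divmod(num_heads, num_xcds)
    # The first `extra` XCDs take one additional head each.
    loads = [base + 1 if x < extra else base for x in range(num_xcds)]
    busiest = max(loads)
    utilization = num_heads / (num_xcds * busiest) if busiest else 0.0
    return loads, utilization


print(chiplet_utilization(16, 8))  # even split
print(chiplet_utilization(12, 8))  # uneven split
```

With 12 heads on 8 XCDs, four chiplets carry two heads while the other four carry one, so under these assumptions only 75% of the aggregate capacity is busy while the heaviest chiplets finish.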
SHFM generalizes to any tile-partitioned algorithm with significant inter-tile reuse, such as GEMM and convolution kernels. Equivalent swizzle logic can be applied to other NUMA-exposing GPU architectures (e.g., NVIDIA Rubin Ultra). Directions for extension include auto-tuning head-to-chiplet assignments for non-integral scenarios and integrating predictive, straggler-aware load-balancing.
7. Significance and Future Directions
SHFM offers a portable, minimal-overhead solution for NUMA optimization in multi-head attention on next-generation, disaggregated GPU architectures. By enforcing spatiotemporal affinity of per-head working sets to private NUMA domains, the technique effectively transforms memory bandwidth-limited attention kernels into compute-bound operations. As advanced multi-die architectures proliferate, the adoption and further generalization of per-tile, NUMA-aware swizzling methods like SHFM are poised to become foundational for maximizing performance of AI training and inference workloads on future accelerators (Choudhary et al., 3 Nov 2025).