Swizzled Head-first Mapping in GPU Scheduling
- The paper introduces Swizzled Head-first Mapping, a NUMA-aware strategy that confines each attention compute cluster to a single chiplet to enhance cache reuse.
- It employs a swizzle via index arithmetic to map all workgroups of an attention head contiguously, thereby reducing remote memory access and fragmentation.
- Experimental evaluations on AMD MI300X demonstrate up to 50% throughput improvement and L2 cache hit rates of 90–96% compared to conventional scheduling methods.
Swizzled Head-first Mapping is a spatially-aware GPU kernel scheduling strategy that aligns attention heads or grouped-query attention (GQA) groups with physical Non-Uniform Memory Access (NUMA) domains on modern multi-chiplet (multi-die) GPU architectures. Its core design objective is to exploit intra-chiplet cache reuse by co-locating all computation for an attention compute cluster (ACC)—the set of workgroups sharing the same key and value tensors (K, V)—on a single chiplet (XCD), thereby minimizing cross-die memory traffic and maximizing cache efficiency. This technique is particularly vital on architectures such as AMD's MI300X, which feature NUMA effects due to decentralized compute, cache, and HBM memory controllers per XCD, leading to highly non-uniform memory access latencies and bandwidths.
1. NUMA Effects and Motivation
The advent of multi-chiplet GPU architectures for AI workloads, such as AMD's MI300X with eight XCDs per device, has introduced pronounced NUMA effects: each XCD possesses its own compute units, a 4 MB L2 cache, and HBM3 memory controllers. Reads and writes to local memory and caches are low-latency and high-bandwidth, while accesses to remote XCDs incur greater latency and reduced bandwidth. The cache hierarchy is fragmented, with per-XCD L2 subsystems that do not maintain coherence across chiplets. Without NUMA-aware scheduling, logical units of computation (e.g., attention heads) are dispersed across XCDs, resulting in frequent high-latency HBM accesses, minimal cache reuse, and degraded throughput. Prior research on GEMM workloads demonstrated L2 hit rates improving from 43% to 92% under NUMA-aware mapping, highlighting the importance of localizing computation to a single NUMA domain in memory-bound kernels.
2. Structure of Attention Kernel Computation
Multi-head attention (MHA) and grouped-query attention (GQA) decompose the model’s representational space into multiple attention heads or groups. In practical implementations such as FlashAttention2, the query tensor Q is divided into row blocks, each processed by a workgroup. Within a head or GQA group, all workgroups require access to the same key and value tensors K and V. This forms the attention compute cluster (ACC)—the natural cache-sharing unit. Localizing all computation for an ACC on a single XCD maximizes L2 cache reuse and minimizes remote memory transactions. When batches, heads, or sequence lengths are large, the importance of such data locality is amplified; fragmenting a single ACC across XCDs causes sharp increases in cache misses and redundant movement of K and V between memory and cache hierarchies.
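To make the ACC structure concrete, the following sketch enumerates a FlashAttention2-style launch grid and groups its workgroups by the K/V pair they share. It is a minimal plain-Python illustration; the function name, the assumed 128-row query block, and the grid ordering are assumptions, not the paper's implementation.

```python
import math

def attention_grid(seq_len, block_m, num_heads, batch):
    """Enumerate a FlashAttention2-style launch grid and group the
    workgroups into ACCs: all row blocks that read the same K/V pair."""
    num_row_blocks = math.ceil(seq_len / block_m)
    grid = (num_row_blocks, num_heads, batch)

    accs = {}  # (batch, head) -> workgroups sharing that head's K and V
    for b in range(batch):
        for h in range(num_heads):
            accs[(b, h)] = [(i, h, b) for i in range(num_row_blocks)]
    return grid, accs

# Example: a 128k-token prefill with 128 heads (as in the evaluation
# below) and an assumed 128-row query block.
grid, accs = attention_grid(seq_len=128 * 1024, block_m=128,
                            num_heads=128, batch=1)
print(grid)               # (1024, 128, 1)
print(len(accs[(0, 0)]))  # 1024 workgroups all reusing the same K/V
```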
3. Mapping Algorithms: Baselines and Limitations
Prior attention kernel scheduling approaches, agnostic to NUMA, frequently employ:
- Naive Block-First: Iterates over all heads for each row block, round-robining workgroups across XCDs with no regard for locality.
- Naive Head-First: Processes all blocks of one head but distributes these across XCDs, partially fragmenting ACCs.
- Swizzled Block-First (GQA-aware): Co-locates each GQA group with an XCD, performing well only when the number of GQA groups matches the number of XCDs.
Under scenarios typical in modern LLMs (e.g., 128 heads, 8 XCDs), all of these methods fragment ACCs to some degree: each XCD’s cache must hold data for multiple ACCs concurrently, eroding cache effectiveness and, for the block-first variants, driving L2 hit rates below 1% at scale.
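The capacity pressure these baselines create can be illustrated with a short simulation. It assumes, purely for illustration, that the hardware dispatches workgroup p to XCD p mod 8 (the round-robin behavior noted above) and that 16 workgroups are co-resident per XCD; neither figure nor the code comes from the paper.

```python
NUM_XCDS, WAVE = 8, 16   # assumed: 8 XCDs, 16 co-resident workgroups each
R, H = 1024, 128         # row blocks per head, attention heads (batch = 1)

# Naive block-first issue order: every head of row block 0, then block 1, ...
order = [(i, h) for i in range(R) for h in range(H)]

# Assign the first wave of workgroups to XCDs round-robin by linear ID
# and record which heads (ACCs) each XCD is executing concurrently.
first_wave = {x: set() for x in range(NUM_XCDS)}
for pid, (_, h) in enumerate(order[: NUM_XCDS * WAVE]):
    first_wave[pid % NUM_XCDS].add(h)

# Every XCD's first wave mixes workgroups from 16 different heads, so its
# 4 MB L2 must hold the K/V working sets of 16 ACCs at once.
print({x: len(heads) for x, heads in first_wave.items()})
# {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16}
```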
4. Swizzled Head-first Mapping: Principles and Implementation
The Swizzled Head-first Mapping strategy directly addresses NUMA fragmentation by strictly confining each ACC to a single XCD. The scheduling algorithm remaps linear workgroup IDs so that all blocks belonging to an attention head (or GQA group) are mapped contiguously to a designated XCD. The pseudocode executes a "swizzle" via index arithmetic, ensuring that for every batch and head, all needed workgroups are scheduled sequentially on the same chiplet before moving to the next head/XCD assignment.
The essential logic is:
- For a launch grid of (num_row_blocks, H, B) workgroups, map head h of batch b to XCD (b · H + h) mod N_XCD (sketched in code at the end of this section).
- Process all row blocks of Q for that head on the assigned XCD.
- Move to the next head/XCD assignment in a round-robin fashion.
This design ensures:
- The per-XCD L2 cache stores and reuses only one ACC at a time.
- Memory movement is always local, in alignment with physical memory topology.
- Kernel modifications are minimal—only index arithmetic is changed, leaving computational logic intact, thus enabling rapid integration into codebases such as Triton and compatibility with ROCm.
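As a concrete illustration of the index arithmetic, the sketch below remaps a linear workgroup ID to a (batch, head, row block) assignment so that every row block of a head lands on one XCD. It is a minimal plain-Python sketch, assuming the round-robin dispatch rule (workgroup p runs on XCD p mod 8) and that the number of (batch, head) pairs divides evenly across XCDs; the function name and exact remapping formula are illustrative, not the paper's reference kernel.

```python
NUM_XCDS = 8  # MI300X: eight XCDs per device (assumed dispatch: pid % 8)


def swizzled_head_first(pid, num_row_blocks, num_heads, batch):
    """Map a linear workgroup ID to (batch, head, row_block) such that
    all row blocks of a given (batch, head) pair run on a single XCD."""
    assert (batch * num_heads) % NUM_XCDS == 0  # simplifying assumption

    xcd = pid % NUM_XCDS     # XCD this workgroup is dispatched to
    slot = pid // NUM_XCDS   # its position within that XCD's work list

    # Each XCD walks its assigned heads one full head at a time:
    # heads xcd, xcd + NUM_XCDS, xcd + 2*NUM_XCDS, ... (flattened over batch).
    flat_head = xcd + NUM_XCDS * (slot // num_row_blocks)
    row_block = slot % num_row_blocks

    b, h = divmod(flat_head, num_heads)
    return b, h, row_block


# Quick check: with 128 heads, batch 1, and 16 row blocks, the 16
# workgroups of every head resolve to exactly one XCD.
if __name__ == "__main__":
    B, H, R = 1, 128, 16
    xcds_per_head = {}
    for pid in range(B * H * R):
        b, h, _ = swizzled_head_first(pid, R, H, B)
        xcds_per_head.setdefault((b, h), set()).add(pid % NUM_XCDS)
    assert all(len(s) == 1 for s in xcds_per_head.values())
```

In a Triton kernel, the same arithmetic would typically be applied to the value returned by tl.program_id before computing block offsets, leaving the rest of the kernel untouched.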
5. Experimental Evaluation and Comparative Results
Empirical evaluation is conducted on AMD's MI300X, with all attention kernels implemented in Triton and profiled using ROCProfiler. Testing scenarios include MHA and GQA with variable batch sizes, head counts up to 128, and sequence lengths in the hundreds of thousands (e.g., DeepSeekV3 prefill with 128 heads at 128k tokens).
- Throughput: Swizzled Head-first Mapping achieves up to 50% higher throughput compared to block-first strategies under high head count and long context length (e.g., block-first drops to 0.65x efficiency at 128k tokens and 128 heads).
- L2 Cache Hit Rate: Naive/block-first strategies cause L2 hit rates to collapse below 1%. In contrast, Swizzled Head-first maintains 90–96% L2 hit rates in all regimes considered, directly underlying the observed speedup.
- Robustness: The approach holds for both MHA and GQA, and extends to the backward pass of FlashAttention2, large batch sizes, and diverse model configurations.
- Minimal Implementation Overhead: The change is restricted to scheduling math; kernel functionality and compute structure are otherwise unaltered.
| Mapping Strategy | L2 Hit Rate (128 heads, 128k tokens) | Relative Throughput | Notes |
|---|---|---|---|
| Swizzled Head-first | 90–96% | 100% | One ACC per XCD |
| Naive Head-first | 40–60% | ~90% | Some cache fragmentation |
| Swizzled Block-first | ~1% | 70–76% | Good for GQA = XCD only |
| Naive Block-first | ~1% | <65% | Severe cache fragmentation |
6. Mathematical Model and Scheduling Formalization
Formally, for H attention heads, B batches, context length L, and N_XCD XCDs, the Swizzled Head-first assignment is

XCD(b, h) = (b · H + h) mod N_XCD

for head index h ∈ {0, …, H−1} and batch index b ∈ {0, …, B−1}. The set of workgroups for (b, h), across all row blocks i = 0, …, ⌈L/B_r⌉ − 1 (where B_r is the query row-block size), executes on this assigned XCD. The mapping is spatially coherent and arithmetically simple, yielding maximal cache reuse and minimal cross-chiplet transfers.
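A quick numeric check of this assignment, using the 128-head, 8-XCD configuration discussed above (illustrative code, not from the paper):

```python
# Round-robin head-to-XCD assignment: XCD(b, h) = (b*H + h) mod N_XCD.
H, B, N_XCD = 128, 1, 8

def xcd(b, h):
    return (b * H + h) % N_XCD

print([xcd(0, h) for h in range(10)])  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]

# Each XCD receives H*B / N_XCD = 16 complete heads, and every row block
# of a given head executes on that head's single assigned XCD.
assert all(sum(xcd(0, h) == x for h in range(H)) == H * B // N_XCD
           for x in range(N_XCD))
```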
7. Implications and Outlook
NUMA-aware spatial scheduling, epitomized by Swizzled Head-first Mapping, is now essential to realizing the capabilities of contemporary chiplet-based GPU systems for large-scale attention workloads. Its adoption provides up to 50% performance gains and order-of-magnitude improvements in L2 utilization, and it generalizes to both MHA and GQA. For current and emerging LLM deployments—featuring high head counts (e.g., Llama-3 405B, DeepSeekV3) and long-sequence requirements—such techniques are a prerequisite for hardware efficiency. The method’s compatibility with standard tooling (Triton, ROCm) and its negligible engineering overhead make broad adoption practical.
As multi-die architectures proliferate, spatially-aware kernel scheduling will underpin scalable AI training and inference, and Swizzled Head-first Mapping represents a foundation for further research and practical deployment in NUMA-affected computational environments.