Swizzled Head-First Mapping for AI GPUs

Updated 13 March 2026
  • SHFM is a NUMA-aware scheduling strategy that remaps attention head tiles to the same chiplet, significantly boosting L2 cache efficiency.
  • The technique optimizes multi-head attention by confining K/V matrix reuse within a single NUMA domain, reducing remote memory accesses and bandwidth bottlenecks.
  • Evaluated on the AMD MI300X, SHFM achieves up to 1.50× higher throughput than traditional scheduling methods, with L2 hit rates of 90–96%.

Swizzled Head-First Mapping (SHFM) is a workgroup scheduling strategy designed to address non-uniform memory access (NUMA) effects on multi-chiplet AI GPUs, specifically in large-scale attention workloads such as multi-head attention (MHA). SHFM remaps workgroups so that all tiles belonging to a given attention head are dispatched to the same GPU chiplet (an XCD, which acts as a NUMA domain), improving L2 cache efficiency and end-to-end attention throughput. The technique was introduced and evaluated on the AMD MI300X platform, where it delivered up to 1.50× higher throughput than state-of-the-art attention scheduling approaches (Choudhary et al., 3 Nov 2025).

1. NUMA Effects in Multi-Chiplet AI GPUs

Modern high-performance AI GPUs, exemplified by the AMD MI300X, are architected as multi-chiplet devices. Each chiplet, or XCD, comprises dedicated compute units (CUs), a private L2 cache (4 MB per XCD), and local memory controllers interfaced to individual HBM stacks. This topology results in spatial memory partitioning: memory access cost (in latency and bandwidth) depends on whether the access targets the local resources of a chiplet or traverses inter-chiplet links to remote resources. “Local” accesses—CU to its own XCD’s L2 and local HBM—exhibit low latency and high bandwidth, while “remote” accesses—CU to a different XCD’s L2 or HBM—are characterized by 2–3× higher latency and reduced bandwidth.

By default, GPU kernel schedulers dispatch workgroups (WGs) in a round-robin fashion across XCDs, leading to fragmentation of data locality when consecutive WGs share data, as is typical in attention mechanisms. In such scenarios, redundant global memory fetches proliferate, reducing L2 hit rates and impeding overall throughput.
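The fragmentation can be illustrated with a minimal sketch. It assumes a simplified dispatch model in which workgroup `w` is placed on XCD `w mod C`; the helper name `xcd_of` and the tile count `M` are illustrative choices, not values taken from the paper.

```python
# Simplified model (an assumption): round-robin dispatch places
# workgroup w on XCD (w mod C).
C = 8   # XCDs on MI300X
M = 16  # tiles per head (illustrative value)

def xcd_of(wgid: int, num_xcds: int = C) -> int:
    """XCD that receives a workgroup under the assumed round-robin dispatch."""
    return wgid % num_xcds

# Consecutive workgroup IDs 0..M-1 hold the tiles of one attention head,
# all of which read the same K/V data.
head0_xcds = [xcd_of(w) for w in range(M)]
distinct_xcds = len(set(head0_xcds))

print(head0_xcds)
print(f"tiles of one head touch {distinct_xcds} of {C} XCDs")
```

Under this model every XCD ends up fetching the same K/V data into its own private L2, which is the redundancy the section describes.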

2. Challenges of Conventional Scheduling for Multi-Head Attention

Typical MHA implementations, such as FlashAttention2, decompose the Q matrix into row blocks (“tiles”), assigning each block to a WG. Within this configuration, all WGs associated with a particular attention head collectively require the same K and V matrices, forming an Attention Compute Cluster (ACC). Legacy scheduling approaches—such as Block-First and Head-First, even when naively “swizzled”—tend to scatter the tiles of a single ACC across multiple XCDs:

  • Block-First (round-robin across heads): Splits every ACC across all chiplets, resulting in near-zero reuse at the L2 level (∼1% L2 hit rate at high head counts).
  • Head-First (all blocks of one head, then next head): Improves temporal coherence but still distributes an ACC’s tiles over several chiplets, yielding modest reuse (40–60% L2 hit rates in the large-H regime).

These patterns introduce critical NUMA-induced inefficiencies:

  • Excessive latency: Remote L2 accesses predominate.
  • Bandwidth saturation: Redundant fetches for the same K/V data across XCDs.
  • Collapsed cache locality: Inability to confine ACC working sets to a single L2 cache.

3. SHFM Algorithm and Mapping Function

SHFM resolves NUMA-induced inefficiency in attention kernels with a lightweight remapping of workgroup IDs prior to dispatch, ensuring all tiles for a given attention head (and corresponding K/V payloads) reside and execute within a single chiplet’s NUMA domain. The mapping operates as follows:

Let

  • $B$ = batch size
  • $H$ = number of heads
  • $M$ = number of tiles per head (per batch), with $M = \lceil N_{CTX}/\text{BLOCK\_M} \rceil$
  • $C$ = number of chiplets (XCDs)
  • $w \in [0, B \cdot H \cdot M)$ = original linear WG ID

Decompose $w$ as

  • $b = \lfloor w / (H \cdot M) \rfloor$ (batch index)
  • $r = w \bmod (H \cdot M)$ (intra-batch offset)
  • $h = \lfloor r / M \rfloor$ (head index)
  • $t = r \bmod M$ (tile-block index)

Assuming $H \bmod C = 0$ for even head distribution:

  • $G = H / C$ (heads per chiplet)
  • $x = \lfloor h / G \rfloor$ (chiplet assignment for head $h$)
  • $h_{local} = h \bmod G$

The remapped WGID (“swizzle”) is:

$$\text{new\_wgid} = b \cdot (C \cdot G \cdot M) + x \cdot (G \cdot M) + (h \bmod G) \cdot M + t$$

This ensures all WGs for a head (over all tiles and batches) reside on the same chiplet. The mechanism comprises ∼10 lines of code in a Triton kernel, with no changes required to the main compute loop.
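One way to realize this placement can be sketched in plain Python. The sketch is not the authors' Triton code. It assumes the hardware places workgroup `pid` on XCD `pid mod C`, and it inverts that dispatch so each XCD's queue is filled only with tiles of the heads assigned to it ($x = \lfloor h / G \rfloor$). The function names `decompose` and `shfm_assign` are hypothetical.

```python
from collections import defaultdict

def decompose(w, H, M):
    """Split a linear head-first workgroup ID into (batch, head, tile)."""
    b, r = divmod(w, H * M)   # batch index, intra-batch offset
    h, t = divmod(r, M)       # head index, tile-block index
    return b, h, t

def shfm_assign(pid, B, H, M, C):
    """Map a hardware workgroup ID to the (batch, head, tile) it should compute.

    ASSUMPTION: the hardware places workgroup `pid` on XCD `pid % C`.
    """
    assert H % C == 0, "sketch requires H mod C == 0"
    G = H // C                      # heads per chiplet
    x = pid % C                     # XCD this workgroup lands on (assumed)
    slot = pid // C                 # position within that XCD's queue
    b, rem = divmod(slot, G * M)    # batch index
    h_local, t = divmod(rem, M)     # head within chiplet, tile index
    h = x * G + h_local             # global head, so that x == h // G
    return b, h, t

B, H, M, C = 2, 16, 4, 8
seen = set()
xcds_per_head = defaultdict(set)
for pid in range(B * H * M):
    b, h, t = shfm_assign(pid, B, H, M, C)
    seen.add((b, h, t))
    xcds_per_head[h].add(pid % C)
    # The linear head-first ID of this work item round-trips through decompose.
    assert decompose(b * H * M + h * M + t, H, M) == (b, h, t)

all_covered = len(seen) == B * H * M   # every (b, h, t) computed exactly once
one_xcd_per_head = all(len(s) == 1 for s in xcds_per_head.values())
print(all_covered, one_xcd_per_head)
```

The check at the end confirms the two properties the text requires: the remapping is a permutation of the work items, and every tile of a head, across all batches, lands on a single XCD under the assumed dispatch model.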

4. Implementation on AMD MI300X

On MI300X hardware, there are 8 chiplets (C=8), each with 38 CUs, 16 KB L1 per CU, and a private 4 MB L2. Memory is provided by HBM3, with aggregate bandwidth of 5.3 TB/s. SHFM is implemented by augmenting the attention kernel (FlashAttention2 forward and backward) in Triton: at program launch, the remapping logic computes the new workgroup ID for dispatch. No modification to the underlying Q/K/V math or memory access patterns is needed; only the launch order and XCD affinity of WGs are altered.

This method relies on the scheduler’s implicit guarantee of static (non-dynamically load-balanced) per-XCD queues at a workgroup granularity.

5. Performance Metrics and Experimental Results

Experimental evaluation on AMD MI300X utilized ROCm Profiler v3 for hardware counter analysis. The test suite swept MHA and GQA workloads across:

  • Context lengths $N_{CTX} \in \{8\text{K}, 32\text{K}, 128\text{K}\}$
  • Batch sizes $B \in \{1, 2, 4, 8\}$
  • Heads $H \in \{8, 16, 32, 64, 128\}$
  • Head dimension = 128; BLOCK_M × BLOCK_N = 128 × 64
  • GQA: $H_Q = 32, 64, 128$; $H_K = 8$; head_dim = 128

Metrics:

  • Throughput: $T = (B \times H \times N_{CTX})/\text{runtime}$ in tokens × heads/s
  • L2 cache-hit rate $\varphi = \frac{\text{hits}}{\text{hits} + \text{misses}}$ (from ROCProfiler)
  • Results normalized to SHFM = 1.0
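The metric definitions above translate directly into a few helper functions. The sketch below is illustrative only: the function names are hypothetical, and the runtimes and counter values are made-up placeholders rather than measurements from the evaluation.

```python
import math

def throughput(B, H, n_ctx, runtime_s):
    """T = (B * H * N_CTX) / runtime, in tokens x heads per second."""
    return (B * H * n_ctx) / runtime_s

def l2_hit_rate(hits, misses):
    """phi = hits / (hits + misses), from L2 hardware counters."""
    return hits / (hits + misses)

def normalize_to(reference, values):
    """Express every entry relative to the reference entry (reference = 1.0)."""
    return {name: v / values[reference] for name, v in values.items()}

# Tile count for the longest context in the sweep: M = ceil(N_CTX / BLOCK_M).
n_ctx, block_m = 128 * 1024, 128
tiles_per_head = math.ceil(n_ctx / block_m)

# Placeholder runtimes in seconds (NOT measured data).
runtimes = {"reference": 0.010, "baseline": 0.0125}
tp = {name: throughput(1, 128, n_ctx, rt) for name, rt in runtimes.items()}
rel = normalize_to("reference", tp)

phi = l2_hit_rate(hits=900, misses=100)  # placeholder counter values
print(tiles_per_head, rel, phi)
```

Normalizing to the reference schedule makes results comparable across configurations whose absolute throughput differs by orders of magnitude.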

Key findings:

  • Up to 1.50× higher forward-pass throughput for MHA at $H = 128$, $N_{CTX} = 128\text{K}$
  • SHFM achieves $\varphi \approx$ 90–96% L2 hit rate, versus:
    • Naive Block-First: $\varphi \approx 1\%$, throughput ≈ 0.55×
    • Naive Head-First: $\varphi \approx$ 40–60%, throughput ≈ 0.90×
    • Swizzled Block-First: $\varphi \approx$ 10–70%, throughput ≈ 0.65–0.80×
  • For GQA, both SHFM and Swizzled Block-First sustain ≈1.0×; Naive Block-First drops to 0.7× at high $H_Q$
  • DeepSeek-V3 prefill ($H = 128$): SHFM = 1.0 versus Naive Block-First ≈ 0.65×
  • Backward pass acceleration up to 1.10× at $N_{CTX} = 128\text{K}$, $B = 2$

6. Analysis, Limitations, and Generalization

SHFM’s efficacy stems from complete ACC co-location: each head’s K/V matrices are loaded once per chiplet, then L2-reused for all tiles and batches. The resulting L2 hit rates (typically $>90\%$) reduce off-chip traffic by 2–5×, removing memory stalls and permitting near-peak compute throughput.

Limitations include:

  • Requires $H \bmod C = 0$ (integral head-to-chiplet mapping); otherwise, head imbalance may cause chiplet underutilization.
  • Potential incompatibility with hardware schedulers supporting large chunk sizes or dynamic load balancing; software and driver validation is necessary in those cases.
  • Provides no benefit on monolithic GPUs with unified L2 cache (lacking NUMA domains).
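The underutilization in the first limitation can be estimated with a simple model. The sketch assumes that every head costs the same, that whole heads are spread as evenly as possible over chiplets, and that the kernel finishes when the most heavily loaded chiplet finishes. The function name `chiplet_utilization` is hypothetical.

```python
import math

def chiplet_utilization(H, C):
    """Fraction of chiplet-time kept busy when whole heads are pinned to chiplets.

    ASSUMPTIONS: equal cost per head, heads spread as evenly as possible,
    and runtime set by the most heavily loaded chiplet.
    """
    max_heads = math.ceil(H / C)   # load of the busiest chiplet
    return H / (C * max_heads)

util_even = chiplet_utilization(16, 8)    # H mod C == 0
util_uneven = chiplet_utilization(12, 8)  # four chiplets get 2 heads, four get 1
util_few = chiplet_utilization(4, 8)      # fewer heads than chiplets

print(util_even, util_uneven, util_few)
```

In this model, 12 heads on 8 chiplets leave a quarter of the chiplet-time idle, and any configuration with fewer heads than chiplets leaves some XCDs with no work at all.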

SHFM generalizes to any tile-partitioned algorithm with significant inter-tile reuse, such as GEMM and convolution kernels. Equivalent swizzle logic can be applied to other NUMA-exposing GPU architectures (e.g., NVIDIA Rubin Ultra). Directions for extension include auto-tuning head-to-chiplet assignments for non-integral scenarios and integrating predictive, straggler-aware load-balancing.

7. Significance and Future Directions

SHFM offers a portable, minimal-overhead solution for NUMA optimization in multi-head attention on next-generation, disaggregated GPU architectures. By enforcing spatiotemporal affinity of per-head working sets to private NUMA domains, the technique effectively transforms memory bandwidth-limited attention kernels into compute-bound operations. As advanced multi-die architectures proliferate, the adoption and further generalization of per-tile, NUMA-aware swizzling methods like SHFM are poised to become foundational for maximizing performance of AI training and inference workloads on future accelerators (Choudhary et al., 3 Nov 2025).
