CDNA3 & CDNA4 AMD Platforms
- CDNA3 and CDNA4 AMD platforms are state-of-the-art data center GPU architectures that employ a chiplet-based design with advanced compute units and register hierarchies.
- They feature a sophisticated memory subsystem and chiplet-aware scheduling to optimize high-throughput AI operations like dense matrix multiplication and attention.
- The HipKittens framework leverages optimized programming primitives and explicit register control to achieve near-peak hardware efficiency on these platforms.
The CDNA3 and CDNA4 AMD platforms comprise the third and fourth generations of AMD's data center GPU architectures, designed to deliver state-of-the-art compute and memory bandwidth for AI and high-performance computing workloads. Each is deployed in a leading AMD GPU model: CDNA3 in the MI325X and CDNA4 in the MI355X. Both platforms use a chiplet-based design and architecturally significant features such as advanced register hierarchies, high-throughput memory systems, and specialized instructions to accelerate dense matrix multiplication (GEMM), attention, and other AI-centric operations. Work on programming models for these platforms, exemplified by the HipKittens (HK) framework, has shown that high-performance kernels can be developed without hand-written assembly while attaining near-peak hardware efficiency.
1. Microarchitecture of CDNA3 and CDNA4
CDNA3 and CDNA4 GPUs are organized as 8 chiplets ("XCDs"), each implementing a scalable array of Compute Units (CUs): 38 CUs per XCD on the CDNA3 MI325X and 32 CUs per XCD on the CDNA4 MI355X. Each CU contains 4 SIMD units, with each SIMD executing 64-lane wavefronts in lockstep. The register file provides 512 32-bit registers per SIMD, split evenly into 256 vector general-purpose registers (VGPRs) and 256 accumulator general-purpose registers (AGPRs) when a single wave executes per SIMD.
Key architectural elements:
| Resource | CDNA3 MI325X | CDNA4 MI355X |
|---|---|---|
| CUs per XCD | 38 | 32 |
| SIMDs per CU | 4 | 4 |
| Registers/SIMD | 512 × 32 bits | 512 × 32 bits |
| Shared Memory (LDS)/CU | 64 KB | 160 KB |
| Private L2/XCD | 4 MB | 4 MB |
| HBM Bandwidth | 6 TB/s | 8 TB/s |
The hierarchical memory subsystem situates a private 4 MB L2 cache on each XCD, with a global last-level cache (LLC) positioned before the high-bandwidth memory (HBM). Each CU pairs an L1 data cache with the software-managed Local Data Share (LDS, referred to below as SMEM). Data movement and inter-chiplet communication are managed via AMD's Infinity Fabric, which also orchestrates traffic between the global, L2, and LLC layers. Hardware workgroup scheduling assigns thread blocks to XCDs in round-robin fashion.
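For reference, the per-device figures above (CU count, wavefront width, shared-memory and L2 sizes) can be checked at runtime through the standard HIP device-properties API. A minimal sketch:

```cpp
// Minimal sketch: query the topology figures referenced above via the HIP
// runtime. On MI325X/MI355X the 8 XCDs are typically exposed as one logical
// device, so multiProcessorCount reports the total CU count across chiplets.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, /*deviceId=*/0) != hipSuccess) return 1;
    std::printf("arch:           %s\n", prop.gcnArchName);
    std::printf("CUs (total):    %d\n", prop.multiProcessorCount);
    std::printf("wavefront size: %d\n", prop.warpSize);          // 64 on CDNA
    std::printf("LDS per block:  %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("L2 cache:       %d bytes\n", prop.l2CacheSize);
    return 0;
}
```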
2. High-Performance Kernel Primitives and Hardware Mapping
HipKittens abstracts and implements a set of programming primitives that leverage the tile-oriented and asynchronous execution model required for maximal efficiency on CDNA3 and CDNA4. These primitives include:
A. Tile Data Structures
- Register tiles (`r<dtype, M, N, layout, mfma_shape>`): represent matrix tiles in registers, mapping directly to MFMA (Matrix Fused Multiply-Add) instructions.
- Shared-memory tiles (`s<dtype, M, N, layout>`): configured to select and swizzle SMEM instructions (e.g., `ds_read_b128`, `ds_read_b96`, `ds_write_b64`) for bank-conflict-free data access.
- Bulk compute (`mma(A, B, C)`): exposes MFMA instructions directly to programmers for efficient matrix operations without additional overhead (illustrated in the sketch after this list).
- Global-to-shared tile loads: choose among `buffer_load_dwordx4` and `buffer_load_dwordx3` based on datatype to optimize throughput.
- Vector operations: single-instruction vector ops (`add`, `mul`, `exp2`, `log`, etc.) provided for common per-tile reductions and activations.
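To make the mapping concrete, the sketch below composes these primitives into a single-tile GEMM step. It is illustrative only: the spellings follow the primitive shapes listed above, while the exact HipKittens namespaces, template parameters, and load/store helpers should be taken from the project headers.

```cpp
// Illustrative only: types and helpers mirror the shapes above (r<...>,
// s<...>, mma, tile loads), not the exact HipKittens signatures.
__global__ void gemm_tile_step_sketch(const bf16* A, const bf16* B, float* C) {
    __shared__ s<bf16, 128, 64> sA;          // shared tile; layout picks the
    __shared__ s<bf16, 64, 128> sB;          // swizzle for ds_read/ds_write

    r<bf16, 16, 16, row>  rA;                // register tiles sized to an
    r<bf16, 16, 16, col>  rB;                // MFMA-friendly shape
    r<float, 16, 16, row> acc;               // accumulator tile
    zero(acc);

    load(sA, A);                             // global -> shared (buffer_load_dwordx4)
    load(sB, B);
    __syncthreads();

    load(rA, sA);                            // shared -> registers (ds_read_b128)
    load(rB, sB);
    mma(acc, rA, rB, acc);                   // lowers to MFMA instructions

    store(C, acc);                           // registers -> global
    // exp2(acc, acc); add(acc, acc, acc);   // per-tile vector ops look similar
}
```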
B. Explicit Register Scheduling
To address HIPCC’s inability to bind AGPRs as MFMA operands, HipKittens exposes a "pinned register range" API. Developers explicitly specify register allocation for tiles; this mechanism closes a 20% performance gap relative to raw assembly for attention backwards kernels.
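The underlying mechanism can be illustrated with AMDGPU inline-assembly register-class constraints, which is roughly what such pinning lowers to; the snippet below is a conceptual sketch, not the HipKittens API itself.

```cpp
// Conceptual sketch: "a" constraints name AGPRs and "v" constraints name
// VGPRs in AMDGPU inline asm. HIPCC alone will not place MFMA accumulators in
// AGPRs under programmer control; a pinned register range fixes which
// physical registers a tile occupies so MFMA operands land where intended.
__device__ void agpr_roundtrip(float &x) {
    float staged;
    // Move a value into an accumulator register...
    asm volatile("v_accvgpr_write_b32 %0, %1" : "=a"(staged) : "v"(x));
    // ...and read it back into a vector register.
    asm volatile("v_accvgpr_read_b32 %0, %1" : "=v"(x) : "a"(staged));
}
```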
C. Wave-level Overlap Schedules
- 8-wave ping-pong: Assigns 8 waves per output tile (2 waves per SIMD); compute and prefetch roles alternate across a conditional barrier and swap at each iteration (sketched below). This schedule matches or surpasses hand-optimized assembly for BF16/FP8 GEMM and attention forward.
- 4-wave interleave: Deploys one wave per SIMD, tightly interleaving each tile's compute and load operations in micro-pipelines; crucial for kernels whose compute and memory demands are imbalanced, such as attention backward and memory-bound kernels.
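A minimal sketch of the 8-wave ping-pong control flow follows. The helper calls are placeholders (commented out) and the wave-to-role mapping is illustrative, but the alternation of compute and prefetch roles around a block-wide barrier is the essential pattern.

```cpp
// Sketch of the 8-wave ping-pong schedule (2 waves per SIMD across 4 SIMDs).
// Waves 0-3 form one group, waves 4-7 the other; the groups swap between the
// compute role and the prefetch role every iteration.
__global__ void pingpong_sketch(int k_tiles) {
    const int wave  = threadIdx.x / 64;    // 64-lane wavefronts, 8 waves/block
    const int group = wave / 4;            // group 0 or group 1

    for (int k = 0; k < k_tiles; ++k) {
        if ((k & 1) == group) {
            // prefetch role: stage the next tile, global -> shared
            // load_tile_to_smem(k + 1);   // hypothetical helper
        } else {
            // compute role: shared -> registers, then MFMA on tile k
            // mma_on_resident_tile(k);    // hypothetical helper
        }
        __syncthreads();                   // lowers to s_barrier; roles swap
    }
}
```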
D. Chiplet-Aware Grid Scheduling
Naïve row-major thread block assignment leads to poor L2 hit rates (e.g., 36% for a 14592×14592 FP16 GEMM) and underutilizes L2/LLC. HipKittens implements two-level swizzling (sketched below): (1) chunking blocks to match XCD affinity, and (2) tiling the grid vertically to enhance L2 data reuse, enabling up to a 15% end-to-end GEMM speedup.
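A sketch of the two-level swizzle is below; the panel height and the assumption that the grid divides evenly are illustrative simplifications rather than the exact HipKittens mapping.

```cpp
// Sketch of the two-level swizzle described above. The hardware dispatches
// workgroups to the 8 XCDs round-robin, so step (1) regroups blocks that land
// on the same XCD onto contiguous output tiles (sharing that XCD's private
// L2); step (2) walks the output grid in tall, narrow column panels to raise
// L2/LLC reuse of the A and B operand tiles.
struct Tile { int row, col; };

__device__ Tile swizzle_block(int bid, int grid_m, int grid_n) {
    constexpr int NUM_XCD = 8;
    constexpr int PANEL   = 4;   // panel height in tiles (illustrative)
    // Assumes grid_m * grid_n is a multiple of NUM_XCD and grid_m of PANEL.

    // (1) XCD-affinity chunking: blocks {i, i+8, i+16, ...} share an XCD, so
    //     give them contiguous logical tile indices.
    int per_xcd = grid_m * grid_n / NUM_XCD;
    int linear  = (bid % NUM_XCD) * per_xcd + (bid / NUM_XCD);

    // (2) Vertical tiling: traverse the grid in PANEL-tall column strips so
    //     consecutive blocks reuse the same operand columns in L2.
    int panel  = linear / (PANEL * grid_n);
    int in_pan = linear % (PANEL * grid_n);
    Tile t;
    t.row = panel * PANEL + (in_pan % PANEL);
    t.col = in_pan / PANEL;
    return t;
}
```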
3. Performance Modeling and Analytical Approach
Performance evaluation relies on roofline-style models and occupancy/utilization metrics:
- Roofline throughput: attainable FLOP/s ≈ min(peak FLOP/s, arithmetic intensity × HBM bandwidth), with arithmetic intensity measured in FLOPs per byte of HBM traffic.
- Effective bandwidth: bytes moved / kernel time, compared against the HBM ceiling.
- Cache efficiency: hit rate = hits / (hits + misses), tracked at L2 and LLC.
- Occupancy: resident waves per SIMD / maximum waves per SIMD.
- Compute utilization: achieved FLOP/s / peak FLOP/s for the given datatype.
These models inform tuning for tile sizes, wave schedulers, and grid partitioning to extract theoretical peak or near-peak platform performance.
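As a concrete instance of the roofline bound above, the helper below computes attainable throughput from a kernel's FLOP count and HBM traffic. The peak figures are parameters the reader supplies for the target datatype and platform; this is a sketch, not part of HipKittens.

```cpp
// Roofline helper: attainable throughput is the lesser of peak compute and
// arithmetic intensity times memory bandwidth.
#include <algorithm>

// flops: total floating-point operations of the kernel
// bytes: total HBM traffic in bytes
// peak_tflops / peak_tbps: hardware ceilings for the chosen datatype/platform
double roofline_tflops(double flops, double bytes,
                       double peak_tflops, double peak_tbps) {
    double intensity = flops / bytes;              // FLOP per byte
    return std::min(peak_tflops, intensity * peak_tbps);
}
```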
4. Empirical Results: Microbenchmarks and Kernel Performance
Evaluations were performed using 500 warmup and 100 timed runs over random inputs across CDNA3/4 and reference NVIDIA platforms for direct comparison. Key results include:
| Kernel / Platform | HipKittens (HK) | AITER asm | NVIDIA TMA/CUTLASS | Triton (AMD) |
|---|---|---|---|---|
| BF16 GEMM, MI325X (CDNA3) | 1470 TFLOP/s | — | — | 550–800 TFLOP/s |
| BF16 GEMM, MI355X (CDNA4) | 1610 TFLOP/s | 1620 TFLOP/s | 1538 TFLOP/s | 550–800 TFLOP/s |
| FP8 GEMM, MI355X | 3327 TFLOP/s | 3300 TFLOP/s | — | 2800 TFLOP/s |
| Attention Forward (GQA/MHA) | 1.0–2.1× AITER | — | up to 4.5× Triton | — |
| Attention Backward (GQA, non-causal) | 1091 TFLOP/s (4-wave), 894 TFLOP/s (8-wave) | 490 TFLOP/s | — | — |
| Memory-bound kernels | 1.1–2.2× AITER / torch.compile | — | — | — |
Notably, HK attention backward with the 4-wave interleave achieves 2.3× higher throughput than hand-optimized AITER assembly. In several scenarios, particularly attention backward for GQA and memory-bound kernels, HipKittens exceeds all available kernel baselines, while Triton implementations on AMD trail HK by a substantial margin.
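The timing methodology (500 warmup runs followed by 100 timed runs) can be reproduced with a standard hipEvent harness. The sketch below assumes a `launch` callback that enqueues the kernel under test; it is illustrative, not the evaluation harness behind the numbers above.

```cpp
// Sketch of the warmup-then-time methodology using HIP events.
#include <hip/hip_runtime.h>

float time_kernel_ms(void (*launch)(), int warmup = 500, int iters = 100) {
    for (int i = 0; i < warmup; ++i) launch();   // warm caches and clocks

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    hipEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    hipEventRecord(stop);
    hipEventSynchronize(stop);

    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    hipEventDestroy(start);
    hipEventDestroy(stop);
    return ms / iters;                           // average time per launch
}
```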
5. Kernel Launch Patterns and Example Pseudocode
High-performing kernel launches on CDNA4 follow an explicit thread and memory hierarchy:
```cpp
dim3 grid(ceil(N / 256), ceil(M / 256));
GEMM_BF16<<<grid, 256, SMEM_bytes>>>(globals);
```
Inside the block:
- Allocate shared memory: `bf16 A[2][2][128×64]`, `B[2][2][128×64]`
- Allocate register tiles: `rA[HALF×64]`, `rB[HALF×64]`, accumulators `C[2][2]`
- Prefetch SMEM pages, employing staggered barriers
- Hot loop over K-dimension tiles (skeleton sketched after this list):
  - Wave 1: load HBM→SMEM, then SMEM→registers via `ds_read`
  - Wave 0: MFMA on the current tile
  - Swap tile roles and synchronize with `s_barrier`
- Epilogue: accumulate and write results back to global memory
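A compilable skeleton of this hot loop, with the HipKittens tile operations left as commented placeholders, looks roughly as follows:

```cpp
// Skeleton of the hot loop above. Tile movement and MFMA calls are commented
// placeholders standing in for HipKittens primitives; the control flow
// mirrors the list: double-buffered SMEM pages, odd waves staging data, even
// waves computing, roles swapped across an s_barrier.
#include <hip/hip_runtime.h>

__global__ void gemm_bf16_hotloop_sketch(int k_tiles) {
    extern __shared__ unsigned char smem[];   // A[2][2][128x64] and B[2][2][128x64]
                                              // pages carved out of SMEM_bytes
    const int wave = threadIdx.x / 64;        // 8 waves per 256-thread block
    int buf = 0;                              // which double-buffer page is current

    // prefetch_page(smem, buf, /*k=*/0);     // hypothetical: HBM -> SMEM
    __syncthreads();

    for (int k = 0; k < k_tiles; ++k) {
        if (wave & 1) {
            // odd waves: stage page k+1 (HBM -> SMEM), then ds_read the
            // current page into register tiles
            // prefetch_page(smem, buf ^ 1, k + 1);
            // ds_read_tiles(/*rA, rB,*/ smem, buf);
        } else {
            // even waves: MFMA-accumulate on the register tiles already loaded
            // mfma_accumulate(/*C_acc, rA, rB*/);
        }
        __syncthreads();                      // s_barrier: swap tile roles
        buf ^= 1;
    }

    // epilogue: reduce the C accumulators and write them back to HBM
    // store_output(/*C_acc*/);
}
```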
The attention forward kernel (8-wave ping-pong) similarly interleaves wave roles, with explicit SMEM management and synchronization barriers at each stage.
6. Abstractions, Algorithmic Lessons, and Porting Insights
Analysis reveals that tile-structured abstractions—register/shared-memory tiles with vector/bulk operations—transfer directly from NVIDIA to AMD, though AMD requires hardware-specific adjustments to tile layout and instruction selection (MFMA shapes, SMEM banking).
Key implementation findings:
- Traditional NVIDIA wave specialization underperforms on AMD without fine-grained compiler control; new generic scheduling patterns (8-wave ping-pong, 4-wave interleave) recover or surpass assembly efficiency for both compute- and memory-bound kernels.
- Direct AGPR/VGPR control, bypassing HIPCC, is indispensable for extremely register-intensive kernels.
- Chiplet-aware scheduling, rather than row-major block allocation, is necessary to approach the platform’s full memory bandwidth; naïve assignments can result in up to 20% throughput loss on CDNA4.
- Identical 8-wave scheduling code paths on CDNA3 and CDNA4 deliver over 95% of peak hardware performance, with only minor adjustments for shared-memory size differences.
By codifying these AMD-specific strategies into compact, high-level primitives within the HipKittens framework, domain experts can develop high-performance kernels on CDNA3 and CDNA4 efficiently, without requiring manual assembly optimization for each kernel variant. The entire HipKittens framework and associated kernel suite are available at https://github.com/HazyResearch/HipKittens.