
CDNA3 & CDNA4 AMD Platforms

Updated 12 November 2025
  • CDNA3 and CDNA4 AMD platforms are state-of-the-art data center GPU architectures that employ a chiplet-based design with advanced compute units and register hierarchies.
  • They feature a sophisticated memory subsystem and chiplet-aware scheduling to optimize high-throughput AI operations like dense matrix multiplication and attention.
  • The HipKittens framework leverages optimized programming primitives and explicit register control to achieve near-peak hardware efficiency on these platforms.

The CDNA3 and CDNA4 AMD platforms are the third and fourth generations of AMD's data center GPU architectures, designed to deliver state-of-the-art compute and memory bandwidth for AI and high-performance computing workloads. Each platform ships in a flagship AMD GPU: CDNA3 in the MI325X and CDNA4 in the MI355X. Both use a chiplet-based design together with architecturally significant features such as advanced register hierarchies, high-throughput memory systems, and specialized instructions that accelerate dense matrix multiplication (GEMM), attention, and other AI-centric operations. Work on programming models for these platforms, exemplified by the HipKittens (HK) framework, has shown that high-performance kernels can be developed without raw assembly: the framework abstracts away assembly intricacies while attaining near-peak hardware efficiency.

1. Microarchitecture of CDNA3 and CDNA4

CDNA3 and CDNA4 GPUs are organized as 8 chiplets ("XCDs"), each implementing a scalable array of Compute Units (CUs): 38 CUs per XCD on the CDNA3 MI325X and 32 CUs per XCD on the CDNA4 MI355X. Each CU contains 4 SIMD units, and each SIMD executes 64-lane wavefronts in lockstep. The register file provides 512 32-bit registers per SIMD, divided evenly between 256 vector general-purpose registers (VGPRs) and 256 accumulator general-purpose registers (AGPRs) when a single wave executes per SIMD.

Key architectural elements:

Resource              | CDNA3 MI325X  | CDNA4 MI355X
CUs per XCD           | 38            | 32
SIMDs per CU          | 4             | 4
Registers per SIMD    | 512 × 32-bit  | 512 × 32-bit
Shared memory per CU  | 128 KB        | 128–256 KB
Private L2 per XCD    | 4 MB          | 4 MB
HBM bandwidth         | 8 TB/s        | 8 TB/s

The hierarchical memory subsystem situates a private 4 MB L2 cache on each XCD, with a global last-level cache (LLC) positioned before the high-bandwidth memory (HBM). The programmable L1 data cache is co-resident with the Local Data Share (SMEM) in each CU. Data movement and inter-chiplet communication are managed via AMD’s Infinity Fabric, which also orchestrates traffic between global, L2, and LLC layers. Hardware workgroup scheduling assigns thread blocks to XCDs in a round-robin fashion.
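As a concrete illustration of the round-robin dispatch, the mapping from a flattened block index to an XCD can be modeled as below. This is a minimal sketch assuming 8 XCDs and a 1-D block index; the real hardware scheduler follows this pattern but is not directly programmer-visible.

// Minimal model of round-robin workgroup dispatch across chiplets.
// Assumption: 8 XCDs and a flattened 1-D block index.
constexpr int kNumXcds = 8;

inline int xcd_for_block(int linear_block_id) {
    // Consecutive blocks land on consecutive XCDs, so blocks that share data
    // rarely share an XCD-private L2 unless the grid is reordered (Section 2D).
    return linear_block_id % kNumXcds;
}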

2. High-Performance Kernel Primitives and Hardware Mapping

HipKittens abstracts and implements a set of programming primitives that leverage the tile-oriented and asynchronous execution model required for maximal efficiency on CDNA3 and CDNA4. These primitives, illustrated by the sketch following the list below, include:

A. Tile Data Structures

  • Register tiles (r<dtype, M, N, layout, mfma_shape>): Represent M×N matrix tiles in registers, mapping directly to MFMA (Matrix Fused Multiply-Add) instructions.
  • Shared-memory tiles (s<dtype, M, N, layout>): Configured to select and swizzle SMEM instructions (e.g., ds_read_b128, ds_read_b96, ds_write_b64) for bank-conflict-free data access.
  • Bulk compute (mma(A, B, C)): Exposes MFMA instructions directly to programmers for efficient matrix operations without additional overhead.
  • Global-to-shared tile loads: Choose between buffer_load_dwordx4 and buffer_load_dwordx3 based on datatype to optimize throughput.
  • Vector operations: Single-instruction vector ops (add, mul, exp2, log, etc.) provided for common per-tile reductions and activations.
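A minimal sketch of how these primitives compose inside a kernel is shown below. The template arguments follow the forms listed above, but the specific names (layout tags, MFMA shape tags, load overloads, tile coordinates) are illustrative stand-ins, not the verbatim HipKittens headers.

// Illustrative composition of the tile primitives above (names are stand-ins).
using tileA = r<bf16, 16, 64, row_layout, mfma_16x16>;   // register tile for A
using tileB = r<bf16, 16, 64, col_layout, mfma_16x16>;   // register tile for B
using tileC = r<float, 16, 16, row_layout, mfma_16x16>;  // FP32 accumulator tile
using smemA = s<bf16, 16, 64, row_layout>;               // swizzled shared-memory tile

__shared__ smemA sA;
tileA rA; tileB rB; tileC rC;

load(sA, A_global, tile_coord);  // global -> shared (buffer_load_dwordx4/x3 path)
load(rA, sA);                    // shared -> register (ds_read_b128/b96 path)
load(rB, B_global, tile_coord);  // global -> register
mma(rA, rB, rC);                 // bulk MFMA: rC += rA x rB
exp2(rC, rC);                    // per-tile vector op (e.g., softmax exponent)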

B. Explicit Register Scheduling

To address HIPCC’s inability to bind AGPRs as MFMA operands, HipKittens exposes a "pinned register range" API. Developers explicitly specify register allocation for tiles; this mechanism closes a ~20% performance gap relative to raw assembly for attention backwards kernels.
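The low-level mechanism such an API can wrap is register-class pinning through inline assembly constraints. The fragment below is a minimal illustration, assuming LLVM's AMDGPU "a" (AGPR) constraint; it shows the compiler mechanism, not the HipKittens pinned-register API itself.

#include <hip/hip_runtime.h>

// 4 x f32 accumulator fragment (clang extended-vector type).
typedef float floatx4 __attribute__((ext_vector_type(4)));

__device__ void pin_to_agpr(floatx4& acc) {
    // The "a" constraint selects the AGPR register class on AMDGPU targets;
    // the empty asm body forces 'acc' to be materialized in AGPRs at this point.
    asm volatile("" : "+a"(acc));
}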

C. Wave-level Overlap Schedules

  • 8-wave ping-pong: Assigns 8 waves per grid tile (2 waves per SIMD); compute and prefetch operations alternate via a conditional barrier, with roles swapped each iteration. This schedule matches or surpasses hand-optimized assembly for BF16/FP8 GEMM and forward attention.
  • 4-wave interleave: Deploys one wave per SIMD, tightly interleaving each tile’s compute and load operations in micro-pipelines, which is crucial for compute/memory-imbalanced kernels (a control-flow sketch follows this list).
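The control-flow skeleton below sketches the 8-wave ping-pong schedule in HIP C++. It is illustrative only: compute_stage() and prefetch_stage() are placeholder stand-ins for the HipKittens tile loads and MFMA calls, and the wave-to-role assignment is one possible mapping.

#include <hip/hip_runtime.h>

// Placeholder stages: stand-ins for tile loads and MFMA compute.
__device__ void compute_stage(int t);    // MFMA over the tiles loaded for step t
__device__ void prefetch_stage(int t);   // HBM -> SMEM (and SMEM -> registers) for step t

__global__ void pingpong_skeleton(int T) {
    const int wave = threadIdx.x / 64;   // 64-lane wavefronts, 8 waves per block
    int role = (wave < 4) ? 0 : 1;       // 0 = compute, 1 = prefetch

    for (int t = 0; t < T; ++t) {
        if (role == 0) {
            compute_stage(t);            // consume the stage filled last iteration
        } else {
            prefetch_stage(t + 1);       // fill the other SMEM stage for the next iteration
        }
        __syncthreads();                 // conditional barrier (s_barrier) before swapping
        role ^= 1;                       // swap compute and prefetch roles
    }
}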

D. Chiplet-Aware Grid Scheduling

Naïve row-major thread block assignment leads to poor L2 hit rates (e.g., ~36% for a 14592×14592 FP16 GEMM) and underutilizes the L2/LLC. HipKittens implements two-level swizzling: (1) it chunks blocks to match XCD affinity, and (2) it tiles the grid vertically to enhance L2 data reuse, enabling up to a 15% end-to-end GEMM speedup. A sketch of such a swizzle follows.
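The device function below sketches a two-level swizzle of this kind. It is illustrative rather than the exact HipKittens scheme, and assumes 8 XCDs, a 1-D launch of num_tiles_m × num_tiles_n blocks, a grid size divisible by 8, and num_tiles_m divisible by the band height.

#include <hip/hip_runtime.h>

// Two-level grid swizzle (illustrative; not the exact HipKittens implementation).
__device__ void swizzled_tile(int num_tiles_m, int num_tiles_n,
                              int* tile_m, int* tile_n) {
    constexpr int kNumXcds = 8;
    constexpr int kBandH   = 8;   // output-tile rows per vertical band (tunable)

    const int bid     = blockIdx.x;
    const int total   = num_tiles_m * num_tiles_n;
    const int per_xcd = total / kNumXcds;

    // Level 1: undo the hardware round-robin so each XCD works on a
    // contiguous chunk of output tiles (better XCD-private L2 reuse).
    const int chunked = (bid % kNumXcds) * per_xcd + bid / kNumXcds;

    // Level 2: walk each chunk in tall, narrow bands so consecutive blocks
    // reuse the same operand columns resident in L2/LLC.
    const int band    = chunked / (kBandH * num_tiles_n);
    const int in_band = chunked % (kBandH * num_tiles_n);
    *tile_m = band * kBandH + in_band % kBandH;
    *tile_n = in_band / kBandH;
}

A kernel using this scheme derives its output-tile coordinates by calling swizzled_tile() at entry instead of reading blockIdx directly.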

3. Performance Modeling and Analytical Approach

Performance evaluation relies on roofline-style models and occupancy/utilization metrics:

  • Roofline throughput: GFLOP/s = (# FMA ops) / time_s
  • Effective bandwidth: BW_eff = (bytes moved) / time_s
  • Cache efficiency: BW_eff = BW_LLC × Hit_LLC + BW_L2 × Hit_L2
  • Occupancy: Occupancy = (active waves per CU) / (max waves per CU)
  • Compute utilization: Utilization_compute = (cycles with MFMA) / (total cycles)

These models inform tuning for tile sizes, wave schedulers, and grid partitioning to extract theoretical peak or near-peak platform performance.
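For instance, the following host-side helpers mirror these definitions when post-processing profiler counters (e.g., from rocprof); the field and function names are illustrative.

// Host-side helpers implementing the formulas above.
struct KernelSample {
    double fma_ops;        // MFMA multiply-add operations issued
    double bytes_moved;    // HBM traffic in bytes
    double seconds;        // measured kernel time
    double active_waves;   // average active waves per CU
    double max_waves;      // hardware wave slots per CU
    double mfma_cycles;    // cycles with an MFMA instruction in flight
    double total_cycles;   // total kernel cycles
};

inline double gflops(const KernelSample& s)     { return s.fma_ops / s.seconds / 1e9; }
inline double bw_eff_gbs(const KernelSample& s) { return s.bytes_moved / s.seconds / 1e9; }
inline double occupancy(const KernelSample& s)  { return s.active_waves / s.max_waves; }
inline double mfma_util(const KernelSample& s)  { return s.mfma_cycles / s.total_cycles; }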

4. Empirical Results: Microbenchmarks and Kernel Performance

Evaluations were performed using 500 warmup and 100 timed runs over random N(0,1) inputs across CDNA3/4 and reference NVIDIA platforms for direct comparison. Key results include:

Kernel / Platform                                  | HipKittens (HK)                     | AITER asm | NVIDIA TMA/CUTLASS | Triton (AMD)
BF16 GEMM, MI325X (CDNA3)                          | 1470 TFLOP/s                        | —         | —                  | 550–800
BF16 GEMM, MI355X (CDNA4)                          | 1610 TFLOP/s                        | 1620      | 1538               | 550–800
FP8 GEMM, MI355X                                   | 3327 TFLOP/s                        | 3300      | —                  | 2800
Attention forward (GQA/MHA)                        | 1.0–2.1× AITER                      | —         | up to 4.5× Triton  | —
Attention backward (GQA non-causal, d=128, seq=8K) | 1091 (4-wave), 894 (8-wave)         | 490       | —                  | —
Memory-bound kernels                               | 1.1–2.2× over AITER, torch.compile  | —         | —                  | —

Notably, HK attention backward with the 4-wave interleave achieves 2.3× higher throughput than hand-optimized AITER assembly. In several scenarios, HipKittens exceeds all available kernel baselines by 1.2–2.4×, particularly for d=64 attention, GQA backward, and memory-bound cases. Triton implementations on AMD trail HK by factors of 1.3–3×.
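The timing methodology described above (warmup iterations followed by averaged timed iterations) can be reproduced with standard HIP event timing, as sketched below; the launch callable is a placeholder for any kernel under test.

#include <hip/hip_runtime.h>

// Average per-launch time in milliseconds for an arbitrary kernel launch.
template <typename LaunchFn>
float time_kernel_ms(LaunchFn launch, int warmup = 500, int iters = 100) {
    for (int i = 0; i < warmup; ++i) launch();   // warm caches and clocks

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    hipEventRecord(start, 0);
    for (int i = 0; i < iters; ++i) launch();
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    hipEventDestroy(start);
    hipEventDestroy(stop);
    return ms / iters;
}

A launch wrapper such as [&]{ GEMM_BF16<<<grid, 256, SMEM_bytes>>>(globals); } can be passed directly as the launch argument.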

5. Kernel Launch Patterns and Example Pseudocode

High-performing kernel launches on CDNA4 follow an explicit thread and memory hierarchy:

// One 256-thread block per 256×256 output tile (integer ceiling division).
dim3 grid((N + 255) / 256, (M + 255) / 256);
GEMM_BF16<<<grid, 256, SMEM_bytes>>>(globals);

Inside the block:

  • Allocate shared memory: bf16 A[2][2][128×64], B[2][2][128×64]
  • Allocate register tiles: rA[HALF×64], rB[HALF×64], accumulators C[2][2]
  • Prefetch SMEM pages and employ staggered barriers
  • Hot loop for t ∈ 0…T:
    • Wave 1: load HBM→SMEM, then SMEM→registers via ds_read
    • Wave 0: MFMA on the current tile
    • Swap tile roles, synchronize with s_barrier
  • Epilogue: accumulate and write back to global memory (a code skeleton follows this list)
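The skeleton below mirrors that block structure in HIP C++; it is a sketch, with the tile loads, ds_read paths, and MFMA calls summarized as comments rather than spelled out.

#include <hip/hip_runtime.h>
#include <hip/hip_bf16.h>

__global__ void gemm_bf16_skeleton(const __hip_bfloat16* A, const __hip_bfloat16* B,
                                   float* C, int M, int N, int K) {
    // Dynamic SMEM (SMEM_bytes at launch), partitioned into double-buffered
    // A and B pages of 2 x 2 x 128 x 64 bf16 elements each.
    extern __shared__ unsigned char smem_raw[];
    __hip_bfloat16* sA = reinterpret_cast<__hip_bfloat16*>(smem_raw);
    __hip_bfloat16* sB = sA + 2 * 2 * 128 * 64;

    // Register tiles rA, rB and the FP32 accumulators C[2][2] would live here.

    // Prologue: prefetch stage 0 into SMEM with staggered per-wave barriers.
    for (int t = 0; t < K / 64; ++t) {
        // Loader waves: HBM -> SMEM for the next stage, then SMEM -> registers via ds_read.
        // Compute waves: MFMA on the registers filled from the current stage.
        __syncthreads();   // s_barrier: swap tile roles before the next iteration
    }

    // Epilogue: reduce accumulator tiles and write the 256 x 256 output tile to C.
}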

The attention forward kernel for d=128 (8-wave ping-pong) involves interleaved wave roles, explicit SMEM management, and synchronization barriers at each stage.

6. Abstractions, Algorithmic Lessons, and Porting Insights

Analysis reveals that tile-structured abstractions—register/shared-memory tiles with vector/bulk operations—transfer directly from NVIDIA to AMD, though AMD requires hardware-specific adjustments to tile layout and instruction selection (MFMA shapes, SMEM banking).

Key implementation findings:

  • Traditional NVIDIA wave specialization underperforms on AMD without fine-grained compiler control; new generic scheduling patterns (8-wave ping-pong, 4-wave interleave) recover or surpass assembly efficiency for both compute- and memory-bound kernels.
  • Direct AGPR/VGPR control, bypassing HIPCC, is indispensable for extremely register-intensive kernels.
  • Chiplet-aware scheduling, rather than row-major block allocation, is necessary to approach the platform’s full memory bandwidth; naïve assignments can result in up to 20% throughput loss on CDNA4.
  • Identical 8-wave scheduling code paths on CDNA3 and CDNA4 deliver over 95% of peak hardware performance, with only minor adjustments for shared-memory size differences (a configuration sketch follows this list).
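A minimal sketch of that single-source pattern, treating the per-CU shared-memory budget as the only compile-time knob; the budget values are taken from the table in Section 1 and the type names are illustrative.

// Single code path, parameterized only by the per-CU shared-memory budget.
template <int kSmemPerCuBytes>
struct KernelConfig {
    static constexpr int kStages        = 2;                          // double buffering
    static constexpr int kBytesPerStage = kSmemPerCuBytes / kStages;  // SMEM per pipeline stage
};

using MI325XConfig = KernelConfig<128 * 1024>;   // CDNA3 (128 KB per CU, per the table above)
using MI355XConfig = KernelConfig<256 * 1024>;   // CDNA4 (upper bound from the table above)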

By codifying these AMD-specific strategies into compact, high-level primitives within the HipKittens framework, domain experts can develop high-performance kernels on CDNA3 and CDNA4 efficiently, without requiring manual assembly optimization for each kernel variant. The entire HipKittens framework and associated kernel suite are available at https://github.com/HazyResearch/HipKittens.
