CDNA3 & CDNA4 AMD Platforms
- CDNA3 and CDNA4 AMD platforms are state-of-the-art data center GPU architectures that employ a chiplet-based design with advanced compute units and register hierarchies.
- They feature a sophisticated memory subsystem and chiplet-aware scheduling to optimize high-throughput AI operations like dense matrix multiplication and attention.
- The HipKittens framework leverages optimized programming primitives and explicit register control to achieve near-peak hardware efficiency on these platforms.
The CDNA3 and CDNA4 AMD platforms comprise the third and fourth generations of AMD's data center GPU architectures, designed to deliver state-of-the-art compute and memory bandwidth for AI and high-performance computing workloads. Each is deployed in a leading AMD GPU model: CDNA3 in the MI325X and CDNA4 in the MI355X. Both platforms use a chiplet-based design and architecturally significant features such as advanced register hierarchies, high-throughput memory systems, and specialized instructions to accelerate dense matrix multiplication (GEMM), attention, and other AI-centric operations. Work on programming models for these platforms, exemplified by the HipKittens (HK) framework, has shown that high-performance kernels can be developed without hand-written assembly while attaining near-peak hardware efficiency.
1. Microarchitecture of CDNA3 and CDNA4
CDNA3 and CDNA4 GPUs are organized as 8 chiplets ("XCDs"), each implementing a scalable array of Compute Units (CUs): 38 CUs per XCD on the CDNA3 MI325X and 32 CUs per XCD on the CDNA4 MI355X. Each CU contains 4 SIMD units, with each SIMD executing 64-lane wavefronts in lockstep. The register file provides 512 32-bit registers per SIMD, split evenly into 256 vector general-purpose registers (VGPRs) and 256 accumulator general-purpose registers (AGPRs) when a single wave executes per SIMD.
Key architectural elements:
| Resource | CDNA3 MI325X | CDNA4 MI355X |
|---|---|---|
| CUs per XCD | 38 | 32 |
| SIMDs per CU | 4 | 4 |
| Registers/SIMD | 512 × 32 bits | 512 × 32 bits |
| Shared Memory (LDS)/CU | 64 KB | 160 KB |
| Private L2/XCD | 4 MB | 4 MB |
| HBM Bandwidth | 6 TB/s | 8 TB/s |
The hierarchical memory subsystem situates a private 4 MB L2 cache on each XCD, with a global last-level cache (LLC) positioned before the high-bandwidth memory (HBM). Each CU pairs an L1 data cache with the software-managed Local Data Share (LDS, referred to below as SMEM). Data movement and inter-chiplet communication are managed via AMD's Infinity Fabric, which also orchestrates traffic between the global, L2, and LLC layers. Hardware workgroup scheduling assigns thread blocks to XCDs in round-robin fashion.
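For reference, the per-device figures above (CU count, wavefront width, shared-memory and L2 sizes) can be checked at runtime through the standard HIP device-properties API. A minimal sketch:

```cpp
// Minimal sketch: query the topology figures referenced above via the HIP
// runtime. On MI325X/MI355X the 8 XCDs are typically exposed as one logical
// device, so multiProcessorCount reports the total CU count across chiplets.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, /*deviceId=*/0) != hipSuccess) return 1;
    std::printf("arch:           %s\n", prop.gcnArchName);
    std::printf("CUs (total):    %d\n", prop.multiProcessorCount);
    std::printf("wavefront size: %d\n", prop.warpSize);          // 64 on CDNA
    std::printf("LDS per block:  %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("L2 cache:       %d bytes\n", prop.l2CacheSize);
    return 0;
}
```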
2. High-Performance Kernel Primitives and Hardware Mapping
HipKittens abstracts and implements a set of programming primitives that leverage the tile-oriented and asynchronous execution model required for maximal efficiency on CDNA3 and CDNA4. These primitives include:
A. Tile Data Structures
- Register tiles (`r<dtype, M, N, layout, mfma_shape>`): represent matrix tiles in registers, mapping directly to MFMA (Matrix Fused Multiply-Add) instructions.
- Shared-memory tiles (`s<dtype, M, N, layout>`): configured to select and swizzle SMEM instructions (e.g., `ds_read_b128`, `ds_read_b96`, `ds_write_b64`) for bank-conflict-free data access.
- Bulk compute (`mma(A, B, C)`): exposes MFMA instructions directly to programmers for efficient matrix operations without additional overhead (illustrated in the sketch after this list).
- Global-to-shared tile loads: choose among `buffer_load_dwordx4` and `buffer_load_dwordx3` based on datatype to optimize throughput.
- Vector operations: single-instruction vector ops (`add`, `mul`, `exp2`, `log`, etc.) provided for common per-tile reductions and activations.
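To make the mapping concrete, the sketch below composes these primitives into a single-tile GEMM step. It is illustrative only: the spellings follow the primitive shapes listed above, while the exact HipKittens namespaces, template parameters, and load/store helpers should be taken from the project headers.

```cpp
// Illustrative only: types and helpers mirror the shapes above (r<...>,
// s<...>, mma, tile loads), not the exact HipKittens signatures.
__global__ void gemm_tile_step_sketch(const bf16* A, const bf16* B, float* C) {
    __shared__ s<bf16, 128, 64> sA;          // shared tile; layout picks the
    __shared__ s<bf16, 64, 128> sB;          // swizzle for ds_read/ds_write

    r<bf16, 16, 16, row>  rA;                // register tiles sized to an
    r<bf16, 16, 16, col>  rB;                // MFMA-friendly shape
    r<float, 16, 16, row> acc;               // accumulator tile
    zero(acc);

    load(sA, A);                             // global -> shared (buffer_load_dwordx4)
    load(sB, B);
    __syncthreads();

    load(rA, sA);                            // shared -> registers (ds_read_b128)
    load(rB, sB);
    mma(acc, rA, rB, acc);                   // lowers to MFMA instructions

    store(C, acc);                           // registers -> global
    // exp2(acc, acc); add(acc, acc, acc);   // per-tile vector ops look similar
}
```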
B. Explicit Register Scheduling
To address HIPCC’s inability to bind AGPRs as MFMA operands, HipKittens exposes a "pinned register range" API. Developers explicitly specify register allocation for tiles; this mechanism closes a 20% performance gap relative to raw assembly for attention backwards kernels.
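The underlying mechanism can be illustrated with AMDGPU inline-assembly register-class constraints, which is roughly what such pinning lowers to; the snippet below is a conceptual sketch, not the HipKittens API itself.

```cpp
// Conceptual sketch: "a" constraints name AGPRs and "v" constraints name
// VGPRs in AMDGPU inline asm. HIPCC alone will not place MFMA accumulators in
// AGPRs under programmer control; a pinned register range fixes which
// physical registers a tile occupies so MFMA operands land where intended.
__device__ void agpr_roundtrip(float &x) {
    float staged;
    // Move a value into an accumulator register...
    asm volatile("v_accvgpr_write_b32 %0, %1" : "=a"(staged) : "v"(x));
    // ...and read it back into a vector register.
    asm volatile("v_accvgpr_read_b32 %0, %1" : "=v"(x) : "a"(staged));
}
```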
C. Wave-level Overlap Schedules
- 8-wave ping-pong: Assigns 8 waves per output tile (2 waves per SIMD); compute and prefetch roles alternate across a conditional barrier and swap at each iteration (sketched below). This schedule matches or surpasses hand-optimized assembly for BF16/FP8 GEMM and attention forward.
- 4-wave interleave: Deploys one wave per SIMD, tightly interleaving each tile's compute and load operations in micro-pipelines; crucial for kernels whose compute and memory demands are imbalanced, such as attention backward and memory-bound kernels.
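A minimal sketch of the 8-wave ping-pong control flow follows. The helper calls are placeholders (commented out) and the wave-to-role mapping is illustrative, but the alternation of compute and prefetch roles around a block-wide barrier is the essential pattern.

```cpp
// Sketch of the 8-wave ping-pong schedule (2 waves per SIMD across 4 SIMDs).
// Waves 0-3 form one group, waves 4-7 the other; the groups swap between the
// compute role and the prefetch role every iteration.
__global__ void pingpong_sketch(int k_tiles) {
    const int wave  = threadIdx.x / 64;    // 64-lane wavefronts, 8 waves/block
    const int group = wave / 4;            // group 0 or group 1

    for (int k = 0; k < k_tiles; ++k) {
        if ((k & 1) == group) {
            // prefetch role: stage the next tile, global -> shared
            // load_tile_to_smem(k + 1);   // hypothetical helper
        } else {
            // compute role: shared -> registers, then MFMA on tile k
            // mma_on_resident_tile(k);    // hypothetical helper
        }
        __syncthreads();                   // lowers to s_barrier; roles swap
    }
}
```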
D. Chiplet-Aware Grid Scheduling
Naïve row-major thread block assignment leads to poor L2 hit rates (e.g., 36% for a 14592×14592 FP16 GEMM) and underutilizes L2/LLC. HipKittens implements two-level swizzling (sketched below): (1) chunking blocks to match XCD affinity, and (2) tiling the grid vertically to enhance L2 data reuse, enabling up to a 15% end-to-end GEMM speedup.
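A sketch of the two-level swizzle is below; the panel height and the assumption that the grid divides evenly are illustrative simplifications rather than the exact HipKittens mapping.

```cpp
// Sketch of the two-level swizzle described above. The hardware dispatches
// workgroups to the 8 XCDs round-robin, so step (1) regroups blocks that land
// on the same XCD onto contiguous output tiles (sharing that XCD's private
// L2); step (2) walks the output grid in tall, narrow column panels to raise
// L2/LLC reuse of the A and B operand tiles.
struct Tile { int row, col; };

__device__ Tile swizzle_block(int bid, int grid_m, int grid_n) {
    constexpr int NUM_XCD = 8;
    constexpr int PANEL   = 4;   // panel height in tiles (illustrative)
    // Assumes grid_m * grid_n is a multiple of NUM_XCD and grid_m of PANEL.

    // (1) XCD-affinity chunking: blocks {i, i+8, i+16, ...} share an XCD, so
    //     give them contiguous logical tile indices.
    int per_xcd = grid_m * grid_n / NUM_XCD;
    int linear  = (bid % NUM_XCD) * per_xcd + (bid / NUM_XCD);

    // (2) Vertical tiling: traverse the grid in PANEL-tall column strips so
    //     consecutive blocks reuse the same operand columns in L2.
    int panel  = linear / (PANEL * grid_n);
    int in_pan = linear % (PANEL * grid_n);
    Tile t;
    t.row = panel * PANEL + (in_pan % PANEL);
    t.col = in_pan / PANEL;
    return t;
}
```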
3. Performance Modeling and Analytical Approach
Performance evaluation relies on roofline-style models and occupancy/utilization metrics:
- Roofline throughput: attainable FLOP/s ≈ min(peak FLOP/s, arithmetic intensity × HBM bandwidth), with arithmetic intensity measured in FLOPs per byte of HBM traffic.
- Effective bandwidth: bytes moved / kernel time, compared against the HBM ceiling.
- Cache efficiency: hit rate = hits / (hits + misses), tracked at L2 and LLC.
- Occupancy: resident waves per SIMD / maximum waves per SIMD.
- Compute utilization: achieved FLOP/s / peak FLOP/s for the given datatype.
These models inform tuning for tile sizes, wave schedulers, and grid partitioning to extract theoretical peak or near-peak platform performance.
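As a concrete instance of the roofline bound above, the helper below computes attainable throughput from a kernel's FLOP count and HBM traffic. The peak figures are parameters the reader supplies for the target datatype and platform; this is a sketch, not part of HipKittens.

```cpp
// Roofline helper: attainable throughput is the lesser of peak compute and
// arithmetic intensity times memory bandwidth.
#include <algorithm>

// flops: total floating-point operations of the kernel
// bytes: total HBM traffic in bytes
// peak_tflops / peak_tbps: hardware ceilings for the chosen datatype/platform
double roofline_tflops(double flops, double bytes,
                       double peak_tflops, double peak_tbps) {
    double intensity = flops / bytes;              // FLOP per byte
    return std::min(peak_tflops, intensity * peak_tbps);
}
```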
4. Empirical Results: Microbenchmarks and Kernel Performance
Evaluations were performed using 500 warmup and 100 timed runs over random inputs across CDNA3/4 and reference NVIDIA platforms for direct comparison. Key results include:
| Kernel / Platform | HipKittens (HK) | AITER asm | NVIDIA TMA/CUTLASS | Triton (AMD) |
|---|---|---|---|---|
| BF16 GEMM, MI325X (CDNA3) | 1470 TFLOP/s | — | — | 550–800 TFLOP/s |
| BF16 GEMM, MI355X (CDNA4) | 1610 TFLOP/s | 1620 TFLOP/s | 1538 TFLOP/s | 550–800 TFLOP/s |
| FP8 GEMM, MI355X | 3327 TFLOP/s | 3300 TFLOP/s | — | 2800 TFLOP/s |
| Attention Forward (GQA/MHA) | 1.0–2.1× AITER | — | up to 4.5× Triton | — |
| Attention Backward (GQA, non-causal) | 1091 TFLOP/s (4-wave), 894 TFLOP/s (8-wave) | 490 TFLOP/s | — | — |
| Memory-bound kernels | 1.1–2.2× AITER / torch.compile | — | — | — |
Notably, HK attention backward with the 4-wave interleave achieves 2.3× higher throughput than hand-optimized AITER assembly. In several scenarios, particularly attention backward for GQA and memory-bound kernels, HipKittens exceeds all available kernel baselines, while Triton implementations on AMD trail HK by a substantial margin.
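The timing methodology (500 warmup runs followed by 100 timed runs) can be reproduced with a standard hipEvent harness. The sketch below assumes a `launch` callback that enqueues the kernel under test; it is illustrative, not the evaluation harness behind the numbers above.

```cpp
// Sketch of the warmup-then-time methodology using HIP events.
#include <hip/hip_runtime.h>

float time_kernel_ms(void (*launch)(), int warmup = 500, int iters = 100) {
    for (int i = 0; i < warmup; ++i) launch();   // warm caches and clocks

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    hipEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    hipEventRecord(stop);
    hipEventSynchronize(stop);

    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    hipEventDestroy(start);
    hipEventDestroy(stop);
    return ms / iters;                           // average time per launch
}
```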
5. Kernel Launch Patterns and Example Pseudocode
High-performing kernel launches on CDNA4 follow an explicit thread and memory hierarchy:
```cpp
dim3 grid(ceil(N / 256), ceil(M / 256));
GEMM_BF16<<<grid, 256, SMEM_bytes>>>(globals);
```
Inside the block:
- Allocate shared memory: `bf16 A[2][2][128×64]`, `B[2][2][128×64]`
- Allocate register tiles: `rA[HALF×64]`, `rB[HALF×64]`, accumulators `C[2][2]`
- Prefetch SMEM pages, employing staggered barriers
- Hot loop over K-dimension tiles (skeleton sketched after this list):
  - Wave 1: load HBM→SMEM, then SMEM→registers via `ds_read`
  - Wave 0: MFMA on the current tile
  - Swap tile roles and synchronize with `s_barrier`
- Epilogue: accumulate and write results back to global memory
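A compilable skeleton of this hot loop, with the HipKittens tile operations left as commented placeholders, looks roughly as follows:

```cpp
// Skeleton of the hot loop above. Tile movement and MFMA calls are commented
// placeholders standing in for HipKittens primitives; the control flow
// mirrors the list: double-buffered SMEM pages, odd waves staging data, even
// waves computing, roles swapped across an s_barrier.
#include <hip/hip_runtime.h>

__global__ void gemm_bf16_hotloop_sketch(int k_tiles) {
    extern __shared__ unsigned char smem[];   // A[2][2][128x64] and B[2][2][128x64]
                                              // pages carved out of SMEM_bytes
    const int wave = threadIdx.x / 64;        // 8 waves per 256-thread block
    int buf = 0;                              // which double-buffer page is current

    // prefetch_page(smem, buf, /*k=*/0);     // hypothetical: HBM -> SMEM
    __syncthreads();

    for (int k = 0; k < k_tiles; ++k) {
        if (wave & 1) {
            // odd waves: stage page k+1 (HBM -> SMEM), then ds_read the
            // current page into register tiles
            // prefetch_page(smem, buf ^ 1, k + 1);
            // ds_read_tiles(/*rA, rB,*/ smem, buf);
        } else {
            // even waves: MFMA-accumulate on the register tiles already loaded
            // mfma_accumulate(/*C_acc, rA, rB*/);
        }
        __syncthreads();                      // s_barrier: swap tile roles
        buf ^= 1;
    }

    // epilogue: reduce the C accumulators and write them back to HBM
    // store_output(/*C_acc*/);
}
```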
The attention forward kernel (8-wave ping-pong) similarly interleaves wave roles, with explicit SMEM management and synchronization barriers at each stage.
6. Abstractions, Algorithmic Lessons, and Porting Insights
Analysis reveals that tile-structured abstractions—register/shared-memory tiles with vector/bulk operations—transfer directly from NVIDIA to AMD, though AMD requires hardware-specific adjustments to tile layout and instruction selection (MFMA shapes, SMEM banking).
Key implementation findings:
- Traditional NVIDIA wave specialization underperforms on AMD without fine-grained compiler control; new generic scheduling patterns (8-wave ping-pong, 4-wave interleave) recover or surpass assembly efficiency for both compute- and memory-bound kernels.
- Direct AGPR/VGPR control, bypassing HIPCC, is indispensable for extremely register-intensive kernels.
- Chiplet-aware scheduling, rather than row-major block allocation, is necessary to approach the platform’s full memory bandwidth; naïve assignments can result in up to 20% throughput loss on CDNA4.
- Identical 8-wave scheduling code paths on CDNA3 and CDNA4 deliver over 95% of peak hardware performance, with only minor adjustments for shared-memory size differences.
By codifying these AMD-specific strategies into compact, high-level primitives within the HipKittens framework, domain experts can develop high-performance kernels on CDNA3 and CDNA4 efficiently, without requiring manual assembly optimization for each kernel variant. The entire HipKittens framework and associated kernel suite are available at https://github.com/HazyResearch/HipKittens.