
CDNA3 & CDNA4 AMD Platforms

Updated 12 November 2025
  • CDNA3 and CDNA4 AMD platforms are state-of-the-art data center GPU architectures that employ a chiplet-based design with advanced compute units and register hierarchies.
  • They feature a sophisticated memory subsystem and chiplet-aware scheduling to optimize high-throughput AI operations like dense matrix multiplication and attention.
  • The HipKittens framework leverages optimized programming primitives and explicit register control to achieve near-peak hardware efficiency on these platforms.

The CDNA3 and CDNA4 AMD platforms are the third and fourth generations of AMD's data center GPU architectures, designed to deliver state-of-the-art compute and memory bandwidth for AI and high-performance computing workloads. Each is deployed in flagship AMD accelerators: CDNA3 in the MI325X and CDNA4 in the MI355X. Both platforms use a chiplet-based design together with architecturally significant features such as deep register hierarchies, high-throughput memory systems, and specialized instructions that accelerate dense matrix multiplication (GEMM), attention, and other AI-centric operations. Work on programming models for these platforms, exemplified by the HipKittens (HK) framework, has shown that high-performance kernels can be developed without raw assembly while still attaining near-peak hardware efficiency.

1. Microarchitecture of CDNA3 and CDNA4

CDNA3 and CDNA4 GPUs are organized as 8 chiplets ("XCDs"), each implementing a scalable array of Compute Units (CUs): 32 CUs per XCD in CDNA4 MI355X, and 38 CUs per XCD in CDNA3 MI325X. Each CU contains 4 SIMD units, with each SIMD executing 64-lane wavefronts in lockstep. The register architecture provides 512 registers of 32 bits per SIMD, divided evenly between 256 vector general-purpose registers (VGPRs) and 256 accumulator general-purpose registers (AGPRs) when executing a single wave per SIMD.
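
A quick way to read these numbers: assuming the 512 × 32-bit register budget is per lane (the usual interpretation for a 64-lane SIMD), the physical register file works out to 128 KiB per SIMD. The short C++ snippet below is plain arithmetic over the figures above, nothing more.

#include <cstdio>

int main() {
    // Figures from the paragraph above; the per-lane reading of the
    // 512-register budget is an assumption of this sketch.
    constexpr int lanes_per_wave = 64;   // wavefront width
    constexpr int regs_per_lane  = 512;  // 256 VGPRs + 256 AGPRs
    constexpr int bytes_per_reg  = 4;    // 32-bit registers
    constexpr int simds_per_cu   = 4;

    constexpr int reg_file_per_simd = regs_per_lane * lanes_per_wave * bytes_per_reg;
    std::printf("register file per SIMD: %d KiB\n", reg_file_per_simd / 1024);                   // 128 KiB
    std::printf("register file per CU:   %d KiB\n", simds_per_cu * reg_file_per_simd / 1024);
    return 0;
}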

Key architectural elements:

Resource                     CDNA3 (MI325X)    CDNA4 (MI355X)
CUs per XCD                  38                32
SIMDs per CU                 4                 4
Registers per SIMD           512 × 32-bit      512 × 32-bit
Shared memory (LDS) per CU   128 KB            128–256 KB
Private L2 per XCD           4 MB              4 MB
HBM bandwidth                8 TB/s            8 TB/s

The hierarchical memory subsystem situates a private 4 MB L2 cache on each XCD, with a global last-level cache (LLC) positioned before the high-bandwidth memory (HBM). The programmable L1 data cache is co-resident with the Local Data Share (SMEM) in each CU. Data movement and inter-chiplet communication are managed via AMD’s Infinity Fabric, which also orchestrates traffic between global, L2, and LLC layers. Hardware workgroup scheduling assigns thread blocks to XCDs in a round-robin fashion.
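
The round-robin workgroup-to-XCD assignment mentioned above is easy to state in code. The host-side sketch below is illustrative only (the mapping is done in hardware); the 8-XCD constant comes from the chiplet count given earlier.

#include <cstdio>

constexpr int kNumXCDs = 8;  // XCDs per GPU, as described above

// Linear workgroup index -> chiplet (XCD) under hardware round-robin scheduling.
int xcd_of_block(int linear_block_id) {
    return linear_block_id % kNumXCDs;
}

int main() {
    for (int blk = 0; blk < 16; ++blk)
        std::printf("block %2d -> XCD %d\n", blk, xcd_of_block(blk));
    return 0;
}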

2. High-Performance Kernel Primitives and Hardware Mapping

HipKittens abstracts and implements a set of programming primitives that leverage the tile-oriented and asynchronous execution model required for maximal efficiency on CDNA3 and CDNA4. These primitives include:

A. Tile Data Structures

  • Register tiles (r<dtype, M, N, layout, mfma_shape>): Represent M×N matrix tiles in registers, mapping directly to MFMA (Matrix Fused Multiply-Add) instructions.
  • Shared-memory tiles (s<dtype, M, N, layout>): Configured to select and swizzle SMEM instructions (e.g., ds_read_b128, ds_read_b96, ds_write_b64) for bank-conflict-free data access.
  • Bulk compute (mma(A, B, C)): Exposes MFMA instructions directly to programmers for efficient matrix operations without additional overhead.
  • Global-to-shared tile loads: Choose between buffer_load_dwordx4 and buffer_load_dwordx3 based on data type to maximize throughput.
  • Vector operations: Single-instruction vector ops (add, mul, exp2, log, etc.) provided for common per-tile reductions and activations.
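
To make the tile vocabulary above concrete, the toy C++ below defines a register-tile type parameterized the same way (dtype, M, N) and a reference mma(A, B, C) with C += A·B semantics. This is a host-only illustration of the abstraction, not the HipKittens types, which additionally carry layout and MFMA-shape parameters and map onto hardware registers.

#include <array>
#include <cstdio>

template <typename T, int M, int N>
struct RegTile {                       // one M x N tile held "in registers"
    std::array<T, M * N> data{};
    T& at(int r, int c)       { return data[r * N + c]; }
    T  at(int r, int c) const { return data[r * N + c]; }
};

// Reference semantics of mma(A, B, C): C += A * B over M x K / K x N tiles.
template <typename T, int M, int K, int N>
void mma(const RegTile<T, M, K>& A, const RegTile<T, K, N>& B, RegTile<T, M, N>& C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C.at(i, j) += A.at(i, k) * B.at(k, j);
}

int main() {
    RegTile<float, 16, 16> A, B, C;
    A.at(0, 0) = 2.0f;
    B.at(0, 0) = 3.0f;
    mma(A, B, C);
    std::printf("C[0][0] = %.1f\n", C.at(0, 0));  // 6.0
    return 0;
}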

B. Explicit Register Scheduling

To address HIPCC's inability to bind AGPRs as MFMA operands, HipKittens exposes a "pinned register range" API: developers explicitly specify the register allocation for tiles. This mechanism closes a roughly 20% performance gap relative to raw assembly for attention backward kernels.
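
The pinned-register idea can be illustrated with a small compile-time sketch: the developer, not the compiler, decides which contiguous block of registers backs each tile, and the ranges are checked for overlap at compile time. The types and names below are hypothetical illustrations of the concept, not the actual HipKittens API.

#include <cstdio>

// Hypothetical descriptor: a tile claims registers [first, first + count).
template <int First, int Count>
struct PinnedRange {
    static constexpr int first = First;  // first physical register index
    static constexpr int count = Count;  // number of 32-bit registers
};

// Example: a 64-register accumulator tile pinned at a0..a63, with the next
// tile placed immediately after it, so MFMA operands are never shuffled or
// spilled by the compiler.
using AccumTileRegs = PinnedRange<0, 64>;
using NextTileRegs  = PinnedRange<AccumTileRegs::first + AccumTileRegs::count, 64>;

static_assert(NextTileRegs::first == AccumTileRegs::first + AccumTileRegs::count,
              "ranges must be packed back-to-back without overlap");

int main() {
    std::printf("accumulator tile: regs %d..%d\n",
                AccumTileRegs::first, AccumTileRegs::first + AccumTileRegs::count - 1);
    std::printf("next tile:        regs %d..%d\n",
                NextTileRegs::first, NextTileRegs::first + NextTileRegs::count - 1);
    return 0;
}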

C. Wave-level Overlap Schedules

  • 8-wave ping-pong: Assigns 8 waves per grid tile (2 waves per SIMD); compute and prefetch operations are alternated via a conditional barrier, with roles swapped at each iteration. Achieves or surpasses hand-optimized assembly for BF16/FP8 GEMM and attention (forward).
  • 4-wave interleave: Deploys one wave per SIMD, tightly interleaving tiles’ compute and load operations in micro-pipelines, crucial for imbalanced compute/memory-bound kernels.
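
A minimal HIP-style sketch of the 8-wave ping-pong schedule is shown below: with 512 threads per block, each of the four SIMDs hosts two waves, one computing while the other prefetches, and the roles swap every iteration. The fake_* device functions are placeholders for the copy and MFMA bodies, not HipKittens or AITER calls.

#include <hip/hip_runtime.h>

__device__ void fake_prefetch_tile(int /*stage*/) { /* HBM -> SMEM copy elided */ }
__device__ void fake_mfma_on_tile(int /*stage*/)  { /* MFMA calls elided */ }

__global__ void pingpong_sketch(int num_iters) {
    const int wave_id = threadIdx.x / warpSize;  // 0..7 with 512 threads and 64-lane waves
    int compute_role  = wave_id & 1;             // paired waves start in opposite roles

    for (int t = 0; t < num_iters; ++t) {
        if (compute_role)
            fake_mfma_on_tile(t & 1);            // consume the buffer that is ready
        else
            fake_prefetch_tile((t + 1) & 1);     // fill the other buffer
        __syncthreads();                         // conditional-barrier point before roles swap
        compute_role ^= 1;                       // swap compute/prefetch roles each iteration
    }
}

int main() {
    pingpong_sketch<<<dim3(1), dim3(512)>>>(8);
    return hipDeviceSynchronize() == hipSuccess ? 0 : 1;
}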

D. Chiplet-Aware Grid Scheduling

Naïve row-major thread block assignment leads to poor L2 hit rates (≈36% for a 14592×14592 FP16 GEMM) and underutilizes the L2/LLC. HipKittens implements two-level swizzling: (1) chunking blocks to match XCD affinity, and (2) tiling the grid vertically to enhance L2 data reuse, enabling up to 15% end-to-end GEMM speedup.
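
The sketch below spells out the two-level swizzle on the host side: consecutive hardware block ids (which land on XCDs round-robin) are regrouped so that each XCD works on adjacent output tiles, and tiles are then traversed in tall, narrow column strips to improve L2 reuse. The grid and strip sizes are illustrative, and the real HipKittens swizzle may differ in its details.

#include <cstdio>
#include <utility>

constexpr int kNumXCDs = 8;

// Map a hardware-assigned linear block id to a logical (row, col) output tile.
std::pair<int, int> swizzle(int linear_id, int grid_rows, int grid_cols, int strip_cols) {
    // Level 1: blocks land on XCDs round-robin (id % 8), so give each XCD a
    // contiguous chunk of the logical tile space for L2 affinity.
    const int per_xcd    = (grid_rows * grid_cols) / kNumXCDs;
    const int xcd        = linear_id % kNumXCDs;
    const int idx_in_xcd = linear_id / kNumXCDs;
    const int logical    = xcd * per_xcd + idx_in_xcd;

    // Level 2: walk the grid in vertical strips of `strip_cols` columns so
    // neighbouring blocks reuse the same operand panels.
    const int strip  = logical / (grid_rows * strip_cols);
    const int within = logical % (grid_rows * strip_cols);
    const int row    = within / strip_cols;
    const int col    = strip * strip_cols + within % strip_cols;
    return {row, col};
}

int main() {
    // 64x64 grid of output tiles (illustrative), traversed in 2-column strips.
    for (int id = 0; id < 16; ++id) {
        auto [r, c] = swizzle(id, 64, 64, 2);
        std::printf("block %2d -> tile (%2d, %2d)\n", id, r, c);
    }
    return 0;
}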

3. Performance Modeling and Analytical Approach

Performance evaluation relies on roofline-style models and occupancy/utilization metrics:

  • Roofline throughput: $\mathrm{GFLOP/s} = \frac{\#\,\text{FMA ops}}{\text{time (s)}}$
  • Effective bandwidth: $\mathrm{BW}_{\mathrm{eff}} = \frac{\text{bytes moved}}{\text{time (s)}}$
  • Cache efficiency: $\mathrm{BW}_{\mathrm{eff}} = \mathrm{BW}_{\mathrm{LLC}} \cdot \mathrm{Hit}_{\mathrm{LLC}} + \mathrm{BW}_{\mathrm{L2}} \cdot \mathrm{Hit}_{\mathrm{L2}}$
  • Occupancy: $\mathrm{Occupancy} = \frac{\text{active waves per CU}}{\text{max waves per CU}}$
  • Compute utilization: $\mathrm{Utilization}_{\mathrm{compute}} = \frac{\text{cycles with MFMA}}{\text{total cycles}}$
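
Plugging the reported MI355X BF16 GEMM numbers into these formulas gives a feel for the regime these kernels operate in. The snippet below is plain arithmetic using figures from this article; the minimum-traffic assumption (each operand read or written exactly once) is an idealization.

#include <cstdio>

int main() {
    // Square BF16 GEMM; the 14592 problem size and 1610 TFLOP/s figure are
    // taken from this article, the rest is arithmetic.
    const double M = 14592, N = 14592, K = 14592;
    const double flops  = 2.0 * M * N * K;                 // one FMA counts as 2 FLOPs
    const double tflops = 1610.0;                          // measured HK throughput on MI355X
    const double time_s = flops / (tflops * 1e12);

    // Idealized HBM traffic: read A and B once, write C once (BF16 = 2 bytes/element).
    const double bytes_min = 2.0 * (M * K + K * N + M * N);
    const double bw_eff    = bytes_min / time_s;           // lower bound on effective bandwidth

    std::printf("kernel time        ~ %.2f ms\n", time_s * 1e3);
    std::printf("minimum traffic    ~ %.2f GB\n", bytes_min / 1e9);
    std::printf("BW_eff lower bound ~ %.2f TB/s\n", bw_eff / 1e12);
    return 0;
}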

These models inform tuning for tile sizes, wave schedulers, and grid partitioning to extract theoretical peak or near-peak platform performance.

4. Empirical Results: Microbenchmarks and Kernel Performance

Evaluations were performed using 500 warmup and 100 timed runs over random $\mathcal{N}(0,1)$ inputs across CDNA3/CDNA4 and reference NVIDIA platforms for direct comparison. Key results include:

Kernel / Platform                                    HipKittens (HK)                      AITER asm   NVIDIA TMA/CUTLASS   Triton (AMD)
BF16 GEMM, MI325X (CDNA3)                            1470 TFLOP/s                         –           –                    550–800
BF16 GEMM, MI355X (CDNA4)                            1610 TFLOP/s                         1620        1538                 550–800
FP8 GEMM, MI355X                                     3327 TFLOP/s                         3300        –                    2800
Attention forward (GQA/MHA)                          1.0–2.1× AITER, up to 4.5× Triton    –           –                    –
Attention backward (GQA non-causal, d=128, seq=8K)   1091 (4-wave), 894 (8-wave)          490         –                    –
Memory-bound kernels                                 1.1–2.2× vs. AITER, torch.compile    –           –                    –

Absolute entries are throughput in TFLOP/s; entries of the form "a×" are HK speedups over the named baseline.

Notably, HK attention backward with the 4-wave interleave achieves 2.3× higher throughput than hand-optimized AITER assembly. In several scenarios, HipKittens exceeds all available kernel baselines by 1.2–2.4×, particularly for d=64 attention, GQA backward, and memory-bound cases. Triton implementations on AMD trail HK by factors of 1.3–3×.

5. Kernel Launch Patterns and Example Pseudocode

High-performing kernel launches on CDNA4 follow an explicit thread and memory hierarchy:

// Round up so partial tiles are covered (integer-division ceil).
dim3 grid((N + 255) / 256, (M + 255) / 256);
GEMM_BF16<<<grid, 256, SMEM_bytes>>>(globals);

Inside the block:

  • Allocate shared memory: bf16 A[2][2][128×64], B[2][2][128×64]
  • Allocate register tiles: rA[HALF×64], rB[HALF×64], accumulator C[2][2]
  • Prefetch SMEM pages, employing staggered barriers
  • Hot loop over t ∈ 0…T:
    • Wave 1: Load HBM→SMEM, then SMEM→registers via ds_read
    • Wave 0: MFMA on the current tile
    • Swap tile roles, synchronize with s_barrier
  • Epilogue: accumulate and write back to global memory (a HIP-style sketch of this structure follows the list)
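
The block structure above can be summarized as a HIP kernel skeleton. The sketch below keeps only the double-buffered LDS layout and the barrier that precedes each buffer swap; the load, ds_read, and MFMA bodies are elided comments rather than HipKittens code, and bf16 values are carried as raw 16-bit words.

#include <hip/hip_runtime.h>
#include <cstdint>

__global__ void gemm_bf16_skeleton(const uint16_t* A, const uint16_t* B, float* C, int K) {
    // Dynamically sized LDS, provided by the SMEM_bytes launch argument above:
    // double-buffered 128x64 bf16 panels for A and B, matching the allocation list.
    extern __shared__ uint16_t smem[];
    uint16_t* sA = smem;                        // [2][2][128*64]
    uint16_t* sB = smem + 2 * 2 * 128 * 64;     // [2][2][128*64]

    int buf = 0;
    // Prologue: prefetch the first K-slice into buffer 0 with staggered barriers (elided).

    for (int t = 0; t < K / 64; ++t) {
        const int next = buf ^ 1;
        // Producer waves: HBM -> SMEM into sA/sB buffer `next`, then SMEM -> registers
        // via ds_read for the panels they own (elided).
        // Consumer waves: MFMA over the registers staged from buffer `buf` (elided).
        __syncthreads();                        // s_barrier before the buffers swap roles
        buf = next;
    }

    // Epilogue: reduce accumulator tiles and write the 256x256 C block back (elided).
    (void)A; (void)B; (void)C; (void)sA; (void)sB;
}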

The attention forward kernel for d=128 (8-wave ping-pong) involves interleaved wave roles, explicit SMEM management, and synchronization barriers at each stage.

6. Abstractions, Algorithmic Lessons, and Porting Insights

Analysis reveals that tile-structured abstractions—register/shared-memory tiles with vector/bulk operations—transfer directly from NVIDIA to AMD, though AMD requires hardware-specific adjustments to tile layout and instruction selection (MFMA shapes, SMEM banking).

Key implementation findings:

  • Traditional NVIDIA-style warp (wave) specialization underperforms on AMD without fine-grained compiler control; new generic scheduling patterns (8-wave ping-pong, 4-wave interleave) recover or surpass assembly efficiency for both compute- and memory-bound kernels.
  • Direct AGPR/VGPR control, bypassing HIPCC, is indispensable for extremely register-intensive kernels.
  • Chiplet-aware scheduling, rather than row-major block allocation, is necessary to approach the platform’s full memory bandwidth; naïve assignments can result in up to 20% throughput loss on CDNA4.
  • Identical 8-wave scheduling code paths on CDNA3 and CDNA4 deliver over 95% of peak hardware performance, with only minor adjustments for shared-memory size differences.

By codifying these AMD-specific strategies into compact, high-level primitives within the HipKittens framework, domain experts can develop high-performance kernels on CDNA3 and CDNA4 efficiently, without requiring manual assembly optimization for each kernel variant. The entire HipKittens framework and associated kernel suite are available at https://github.com/HazyResearch/HipKittens.
