
HipKittens: Tile-Based Framework for AMD AI Kernels

Updated 12 November 2025
  • HipKittens is a C++-embedded tile-based programming framework that abstracts GPU operations for high-performance AI kernel development on AMD CDNA-series GPUs.
  • The framework simplifies AMD-specific memory management with direct global-to-shared transfers and bank-conflict-free swizzling, while its cache-aware work scheduling boosts large-GEMM throughput by up to 15%.
  • HipKittens employs advanced asynchronous execution models, such as 8-wave ping-pong and 4-wave interleave, to achieve competitive performance against hand-tuned assembly.

HipKittens (HK) is a C++-embedded, tile-based programming framework designed to achieve high performance in AI kernel development for AMD CDNA-series GPUs. HK generalizes the lightweight, bulk-operator abstractions introduced by NVIDIA-centric domain-specific languages (DSLs) such as ThunderKittens (TK), while adapting their instantiation to the unique register file, memory hierarchy, and wave scheduling semantics of AMD hardware. The framework provides an interface for creating performant kernels that match or exceed the best hand-tuned assembly implementations on AMD platforms, specifically targeting GEMM, attention (including SDPA, GQA), LayerNorm, rotary embeddings, and operations using reduced-precision data types (BF16, FP8, FP6). HK is structured to simplify the development of cross-platform, vendor-agnostic AI kernels by decoupling the tile abstraction from its hardware-specific realization.

1. Motivation and Scope

The primary motivation for HipKittens is to break the "CUDA moat" by providing AI kernel authors with a reusable interface for AMD hardware, avoiding both raw assembly (as in AITER kernels) and slow compiler-generated code (such as that produced by Triton or HIPCC). Existing tile and bulk-operator abstractions, while successful on NVIDIA architectures, required re-examination for effective deployment on AMD CDNA3/4 due to differences in register and memory organization and wavefront execution.

HipKittens was developed to:

  • Validate the generality of the tile + bulk-operator abstraction for AMD GPUs.
  • Identify and redesign critical algorithms underlying tile instantiation, memory movement, and scheduling to exploit AMD's architectural features.
  • Deliver competitive performance for BF16/FP8/FP6 GEMMs, attention (forward and backward), LayerNorm, rotary embedding, and similar AI kernel workloads, matching or exceeding AMD's hand-tuned assembly kernels and outperforming compiler-generated code.

2. Tile-Based Abstractions and Bulk Operators

At the core of HipKittens is the tile abstraction—a parameterized data structure representing a dense submatrix with configurable data type, dimensions, and layout, allocated either in registers or shared memory. Bulk-operator interfaces are provided for essential operations such as mma (mapping to CDNA4's MFMA instruction), addition, exponentiation, and summation, all directly wrapping underlying assembly routines.

The tile-to-thread-block mapping is characterized by:

  • Each thread block computing a BLOCK_M × BLOCK_N output tile, subdivided among WARPS_M × WARPS_N waves. For instance, a typical BF16 GEMM uses BLOCK_M = BLOCK_N = 256, split into WARPS_M = 2, WARPS_N = 4.
  • Each individual wave loads a register tile with shape REG_M × K_STEP or REG_N × K_STEP (e.g., 128 × 64), enabling highly parallelized input operand buffering.

Wavefront execution leverages AMD's SIMD architecture, where a wave is composed of 64 threads. Shared-memory tiles are partitioned using subtile_inplace into wave-specific register tiles before issuing MFMA instructions (e.g., mfma_ABt for $A \cdot B^\top$).
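
To make this wave-level decomposition concrete, the following minimal host-side C++ sketch (not HipKittens code; the constant names simply mirror the parameters above) prints the output subtile each wave owns for the 256 × 256 block split across 2 × 4 waves:

```cpp
// Illustrative sketch: how a 256x256 block-level output tile is partitioned
// among a 2x4 grid of waves, each owning a 128x64 output subtile.
// Plain C++ for illustration only; constants mirror the example in the text.
#include <cstdio>

int main() {
    constexpr int BLOCK_M = 256, BLOCK_N = 256;  // block-level output tile
    constexpr int WARPS_M = 2, WARPS_N = 4;      // wave grid within the block
    constexpr int REG_M = BLOCK_M / WARPS_M;     // 128 output rows per wave
    constexpr int REG_N = BLOCK_N / WARPS_N;     // 64 output cols per wave

    for (int wave = 0; wave < WARPS_M * WARPS_N; ++wave) {
        const int wm = wave / WARPS_N;           // wave row index
        const int wn = wave % WARPS_N;           // wave column index
        std::printf("wave %d -> output subtile origin (%d, %d), shape %dx%d\n",
                    wave, wm * REG_M, wn * REG_N, REG_M, REG_N);
    }
    return 0;
}
```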

Work distribution is managed by a cache-aware grid scheduling formula, which improves performance through balanced L2/LLC cache usage. Specifically:

  • $blocks = G_x \cdot G_y$ (number of blocks per batch)
  • $blocks\_per\_cycle = nXCD \cdot C$, with $limit = \lfloor blocks / blocks\_per\_cycle \rfloor \cdot blocks\_per\_cycle$
  • For $wgid \in [0, limit)$:
    • $xcd = wgid \bmod nXCD$
    • $local = \lfloor wgid / nXCD \rfloor$
    • $chunk = \lfloor local / C \rfloor$, $pos = local \bmod C$
    • $new\_id = chunk \cdot blocks\_per\_cycle + xcd \cdot C + pos$
  • Mapping $new\_id \rightarrow (row, col)$ is performed by windowing with height $W$ to ensure cache-aware locality (as detailed in the source).

This algorithm models effective bandwidth via

$Bandwidth = LLC\_bw \cdot LLC\_hit\% + L2\_bw \cdot L2\_hit\%$

and achieves up to 15% higher throughput on large GEMMs by optimizing L2/LLC reuse.
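
The remapping step can be written compactly as follows. This is a plain C++ sketch of the published formula, not the HipKittens source; the final windowed $new\_id \rightarrow (row, col)$ step is omitted, since its exact form is detailed in the source:

```cpp
// Cache-aware workgroup-id remapping, following the formula above.
// wgid  : original flat workgroup id in [0, blocks)
// blocks: Gx * Gy, the number of blocks per batch
// nXCD  : number of XCD chiplets
// C     : blocks scheduled per XCD per "cycle"
// IDs beyond `limit` (the tail) are left unchanged.
#include <cstdint>

inline uint32_t remap_wgid(uint32_t wgid, uint32_t blocks,
                           uint32_t nXCD, uint32_t C) {
    const uint32_t blocks_per_cycle = nXCD * C;
    const uint32_t limit = (blocks / blocks_per_cycle) * blocks_per_cycle;
    if (wgid >= limit) return wgid;          // tail blocks keep their id

    const uint32_t xcd   = wgid % nXCD;      // chiplet this id is assigned to
    const uint32_t local = wgid / nXCD;      // index within that chiplet
    const uint32_t chunk = local / C;
    const uint32_t pos   = local % C;
    return chunk * blocks_per_cycle + xcd * C + pos;
}
```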

3. Memory Hierarchy Optimizations and AMD-Specific Features

Efficient management of global, shared, and register memory—accounting for AMD-specific constraints—is a principal feature of HipKittens:

  • Global to Shared Memory: HK exploits CDNA4's buffer_load_dword{x3,x4} to load tiles directly into LDS (shared memory) without first passing through registers, using buffer_load_dwordx4 for BF16/FP8 and the x3 variant for FP6, and synchronizing via s_waitcnt counters (e.g., vmcnt) to overlap memory transactions with computation.
  • Shared to Register Transfer: Automatic computation of bank-conflict-free swizzle patterns is implemented for common tile layouts; for example, a 16×32 BF16 row-major tile is swizzled so that ds_read_b128 accesses do not collide, as demonstrated in the source's swizzle visualizations (the general style is sketched after this list).
  • Register to MFMA: Multiple MFMA shapes (such as 16×16×32 or 32×32×16) are supported, parameterized by the developer. Register tile assignment is explicit via "pinned register tiles," working around HIPCC's inability to properly utilize AGPRs in MFMA instructions.
  • Compiler limitations are circumvented using LLVM AMDGPU intrinsics (__builtin_amdgcn_sched_barrier, sched_group_barrier, s_setprio) to optimize scheduling between vector ALU, matrix (MFMA), and memory instructions, replacing pervasive in-line assembly with higher-level hints.
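
As an example of the swizzling idea, the sketch below applies a common XOR-based pattern at 128-bit granularity (eight BF16 elements, the access width of ds_read_b128). The exact pattern HipKittens derives per tile shape may differ; this is only meant to make the general technique concrete:

```cpp
// XOR-based shared-memory swizzle sketch at 16-byte (128-bit) granularity.
// A (row, chunk) coordinate is remapped so that lanes reading the same
// logical column of consecutive rows land in different LDS banks.
// Illustrative only; not the pattern HipKittens computes.
#include <cstdint>

// chunks_per_row must be a power of two (e.g., 4 for a 32-wide BF16 row,
// i.e., 64 bytes split into four 16-byte chunks).
inline uint32_t swizzle_chunk_index(uint32_t row, uint32_t chunk,
                                    uint32_t chunks_per_row) {
    const uint32_t swizzled = chunk ^ (row % chunks_per_row);  // permute within the row
    return row * chunks_per_row + swizzled;                    // flat chunk index in LDS
}
```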

4. Fine-Grained Asynchronous Execution Patterns

HipKittens critically evaluates asynchronous execution models for AMD. Standard producer–consumer specialization, while effective on some NVIDIA hardware, leads to underutilization on AMD because static register partitioning starves producer waves of registers (e.g., a 4-producer/8-consumer schedule achieves 893 TFLOPS, far below the 1610 TFLOPS of an unspecialized schedule on CDNA4).

HK instead leverages two major scheduling paradigms:

  1. 8-Wave Ping-Pong: Two waves per SIMD alternately execute compute and memory operations, synchronized via s_barrier. Each operates on large MFMA tiles, providing competitive performance with only ~50 lines of code in the compute loop. This schedule suffices for most GEMM and attention forward kernels.
  2. 4-Wave Interleave: Each SIMD hosts a single wave executing a fine-grained, interleaved sequence of loads and MFMAs, coordinated through multiplexed cluster barriers. Though increasing loop complexity (~200 lines of code), this yields higher utilization for compute- or memory-imbalanced kernels, such as attention backward.

Both paradigms reuse the same tile abstraction, supporting kernel code portability and simplicity.
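
The 8-wave ping-pong loop, for instance, has roughly the following structure. This is a structural HIP/C++ sketch, not the framework's kernel: the tile loads and MFMA issues are left as comments rather than concrete HipKittens calls, the wave-pairing shown is an illustrative assumption, and only the builtins used are real LLVM AMDGPU intrinsics:

```cpp
// Structural sketch of an 8-wave ping-pong main loop (HIP/C++).
#include <hip/hip_runtime.h>

__global__ void pingpong_sketch(int k_tiles) {
    // Two 64-thread waves share one SIMD; wave-id parity stands in for the
    // pair index (an illustrative proxy for the real wave-to-SIMD mapping).
    const int wave_in_pair = (threadIdx.x / 64) & 1;

    for (int k = 0; k < k_tiles; ++k) {
        const bool compute_phase = ((k & 1) == wave_in_pair);
        if (compute_phase) {
            __builtin_amdgcn_s_setprio(1);   // keep the MFMA pipe fed
            // ... issue MFMAs on register tiles staged by the partner wave ...
            __builtin_amdgcn_s_setprio(0);
        } else {
            // ... stage the next K-step: global->LDS and LDS->register copies ...
        }
        __builtin_amdgcn_s_barrier();        // hand off roles to the partner wave
    }
}
```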

5. Empirical Performance Evaluation

Comprehensive benchmarking on AMD CDNA4 hardware (e.g., MI355X) demonstrates HipKittens’ competitive and, in some instances, superior performance relative to both hand-tuned assembly and compiler-generated baselines:

  • GEMM (up to 16k×16k): HK BF16 kernels reach approximately 2.5 PFLOPS, matching AITER assembly and outperforming Triton by factors ranging from 1.3× to 3.0×.
  • FP8/FP6 GEMMs: HK attains 5.0 PFLOPS (FP8), with FP6 kernels using optimized loads and shared→register transfers surpassing AMD CK and matching FP8 performance on certain matrix shapes.
  • Attention Kernels (Forward, Backward): With 8-wave schedules, attention forward (GQA/MHA, with head dims of 64/128) achieves 1.0–2.1× speedup over AITER and 1.3–4.5× over PyTorch SDPA and Triton. Attention backward (GQA, $d = 64$) using 8-wave outpaces AITER by 1.8×, increasing to 2.3× with 4-wave interleaved scheduling.
  • Memory-Bound Kernels: Fused dropout + residual + LayerNorm and rotary-embedding kernels exhibit consistent 1.1–2.2× advantages over torch.compile and AITER assembly.
  • Bandwidth-Aware Scheduling: The described cache-aware schedule confers up to 15% additional performance on GEMMs with $M = N = K = 14592$, attributed to coordinated L2/LLC reuse via the provided analytical model.

6. Implications, Limitations, and Future Directions

HipKittens demonstrates that a minimal, tile-based programming API can express and enable peak kernel performance across both NVIDIA and AMD GPUs. However, optimal performance demands that tiles, memory transports, and wave schedules be tailored to each architecture’s distinctive features. This architectural adaptability is embodied in the HK design, where the high-level DSL remains vendor-agnostic but the back-end primitives (including tile swizzles, MFMA shape selection, wave scheduling, and cache modeling) are specialized.

Extensions under current development include enhancing FP6 kernel support (with improved global→LDS and LDS→register data flows), implementing dynamic autotuning for W and C parameters in the scheduling layer, and supporting next-generation AMD matrix instructions and multi-chiplet topologies.

The open-sourcing of HipKittens is intended to accelerate research on unified kernel frameworks spanning the wider GPU landscape, enabling portable, high-performance AI infrastructure.
