HipKittens: Fast and Furious AMD Kernels (2511.08083v1)
Abstract: AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.
Explain it Like I'm 14
Overview
This paper introduces HipKittens (HK), a new way to write super-fast programs (called “kernels”) for AMD graphics cards (GPUs) that power AI. Until now, the fastest AMD kernels were written in raw assembly—a very low-level, hard-to-write language used by only a few experts. HipKittens makes high performance much easier to achieve by giving developers simple, reusable building blocks (like LEGO tiles) that map AI algorithms efficiently onto AMD hardware. The big idea: the same “tile-based” programming style that worked well for NVIDIA GPUs can be adapted—and rethought—to run great on AMD too.
What questions does the paper ask?
The paper focuses on three simple questions:
- Can the tile-based programming style (popular on NVIDIA) also work on AMD GPUs?
- If yes, what changes are needed to make AMD kernels fast without using raw assembly?
- Can those changes lead to AMD kernels that match or beat the best hand-tuned code and compilers across common AI tasks (like matrix multiplication and attention)?
How did the researchers approach the problem?
The authors designed HipKittens, a C++ framework with a small set of “opinionated” primitives (simple, well-chosen tools) that help developers write fast kernels for AMD. Think of it like giving chefs a compact set of reliable, high-quality cookware instead of a cluttered kitchen.
They focused on three main areas:
1) Tiles and memory: making data movement fast
- Tiles are like neatly arranged slices of a big spreadsheet. Working on tiles lets the code use the GPU’s memory more efficiently.
- Registers: These are tiny, very fast “pockets” inside the GPU where data is kept while computing. On AMD, the compiler (the “translator” from C++ to machine code) sometimes blocks the best use of these pockets. HipKittens lets developers “pin” registers—meaning they can directly choose which pockets to use and avoid slow extra moves.
- Shared memory: This is a small, fast “whiteboard” that waves (groups of 64 threads) in the same block use together. If many threads hit the same slot (“bank”) of the whiteboard at the same time, they collide (“bank conflicts”), which slows things down. HipKittens adds “swizzling”—a clever rearranging of how data is laid out—so threads don’t bump into each other (a small sketch of this idea follows after this list).
- Global memory: This is the large, slow storage (HBM). HipKittens uses AMD’s direct async loads from HBM to shared memory to avoid unnecessary detours through registers.
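To make the bank-conflict idea concrete, here is a minimal host-side C++ sketch (not HipKittens' actual layout code). It models a row-major tile of 4-byte words, assumes the 64-bank, 32-bit-wide LDS described for ds_read_b128 in the glossary, and applies a generic XOR swizzle; the tile width and the exact swizzle formula are illustrative assumptions, not HK's.

```cpp
// Host-side C++ sketch (illustrative, not HipKittens' actual layout code).
// It models a row-major tile of 4-byte words in LDS and counts how many
// distinct banks a "column read" by a 64-lane wave touches, with and
// without an XOR swizzle. The bank count (64 banks, 32 bits wide) follows
// the paper's description of ds_read_b128; the swizzle itself is a generic
// example, not HK's exact formula.
#include <cstdio>
#include <set>

constexpr int kBanks = 64;   // 32-bit banks, per the paper's LDS description
constexpr int kCols  = 64;   // words per tile row (assumption for the demo)
constexpr int kLanes = 64;   // one AMD wave

int bank_of(int word_index) { return word_index % kBanks; }

// Naive row-major address of element (row, col).
int naive_index(int row, int col) { return row * kCols + col; }

// XOR-swizzled address: permute the column by the row so that lanes
// walking down a column spread across all banks.
int swizzled_index(int row, int col) { return row * kCols + (col ^ (row % kBanks)); }

int main() {
  const int col = 3;  // every lane reads the same logical column
  std::set<int> naive_banks, swizzled_banks;
  for (int lane = 0; lane < kLanes; ++lane) {
    naive_banks.insert(bank_of(naive_index(lane, col)));
    swizzled_banks.insert(bank_of(swizzled_index(lane, col)));
  }
  // 1 distinct bank => a 64-way conflict; 64 distinct banks => conflict-free.
  std::printf("naive layout:    %zu distinct banks\n", naive_banks.size());
  std::printf("swizzled layout: %zu distinct banks\n", swizzled_banks.size());
  return 0;
}
```

Running it prints 1 distinct bank for the naive layout versus 64 for the swizzled one, which is exactly the difference between a fully serialized access and a conflict-free one.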
2) Schedules: overlapping compute and memory
GPUs are fast when they can compute and load data at the same time. NVIDIA commonly uses a “producer-consumer” schedule where some waves fetch data (producers) and others do math (consumers). On AMD, this pattern wastes registers (because AMD allocates registers statically to all waves), which hurts performance.
HipKittens introduces two simple, general-purpose schedules:
- 8-wave ping-pong: Two waves per processor unit alternate roles—one does math while the other prefetches the next data, then they swap. Imagine two teammates passing a ball back and forth in perfect rhythm. This works great when compute and memory times are similar, and it keeps code compact.
- 4-wave interleave: One wave per unit does both memory and compute in a carefully staggered sequence. This is more fine-grained and can be faster when workloads are unbalanced (more compute-heavy or memory-heavy), but the code is longer.
These schedules are reusable across many AI kernels (GEMMs and attention) and avoid AMD’s producer-consumer pitfalls.
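The ping-pong idea is easiest to see as a double-buffered loop. Below is a minimal host-side C++ sketch of the control flow only; prefetch_tile and mma_tile are placeholder names (not HK's API), standing in for the asynchronous HBM-to-LDS copy and the MFMA work that the paired waves overlap on the GPU.

```cpp
// Host-side C++ sketch of the control flow behind a "ping-pong" schedule:
// while one tile is being multiplied, the next tile is prefetched into the
// other half of a double buffer, and the roles swap each iteration. On a
// real GPU the prefetch would be an async HBM->LDS copy overlapped with
// MFMA instructions, guarded by synchronization.
#include <cstdio>

struct Tile { int k_step = -1; };            // stand-in for a K-slice of A/B in LDS

void prefetch_tile(Tile& dst, int k_step) {  // models the "memory" phase of a wave pair
  dst.k_step = k_step;
}

void mma_tile(const Tile& src, double& acc) { // models the "compute" phase (MFMAs)
  acc += src.k_step;                          // placeholder for matrix multiply-accumulate
}

int main() {
  const int k_steps = 8;
  Tile buf[2];                 // double buffer in shared memory
  double acc = 0.0;

  prefetch_tile(buf[0], 0);    // prologue: fill the first buffer
  for (int k = 0; k < k_steps; ++k) {
    int cur = k & 1, nxt = cur ^ 1;
    if (k + 1 < k_steps) prefetch_tile(buf[nxt], k + 1);  // overlap: load next slice...
    mma_tile(buf[cur], acc);                              // ...while computing on this one
    // On the GPU, a barrier/wait goes here so a buffer isn't reused too early.
  }
  std::printf("accumulated %.0f (0+1+...+%d)\n", acc, k_steps - 1);
  return 0;
}
```

The 4-wave interleave schedule described above follows the same overlap principle, but staggers the memory and compute steps at a finer granularity within a single wave's instruction stream.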
3) Cache-aware work distribution across chiplets
Modern GPUs are built from multiple “chiplets”—like a city made of neighborhoods. Each neighborhood has its own local cache (L2), and the whole city shares a larger cache (LLC). If you schedule work poorly, neighborhoods fetch different parts of the data and miss cache reuse, which wastes bandwidth.
HipKittens adds a “chiplet swizzling” algorithm:
- It groups work so blocks assigned to the same chiplet reuse the same rows/columns of input (boosting L2 cache hits).
- It coordinates across chiplets so they also reuse data in the shared cache (LLC).
- Two tunable knobs control this balance: window height (W) for forming reusable rectangles of work, and chunk size (C) for keeping related work on the same chiplet.
In plain terms: it plans the order of tasks so nearby teams reuse the same tools and don’t constantly run across town to fetch supplies.
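As a rough illustration of how W and C could shape the launch order, here is a hypothetical host-side C++ remap (the paper's Algorithm 1 is not reproduced here and may differ): flat block IDs are walked down W-row windows, so the same W row blocks of one input stay hot in cache while the other input's columns are swept, and runs of C consecutive IDs are kept on the same XCD.

```cpp
// Host-side C++ sketch of a chiplet-aware block ordering in the spirit of
// HK's "chiplet swizzle". This is a hypothetical reconstruction for
// illustration only: a window height W groups output tiles into W-row
// bands, and a chunk size C keeps runs of C consecutive tiles on the same
// XCD so each chiplet's L2 sees related data.
#include <cstdio>

constexpr int kGridRows = 8, kGridCols = 8;  // output tiles (small demo grid)
constexpr int W = 4;                          // window height (tunable knob)
constexpr int C = 4;                          // chunk size   (tunable knob)
constexpr int kXCDs = 8;                      // chiplets on MI355X

struct Block { int row, col, xcd; };

// Map a flat launch index to an output tile and a chiplet.
Block remap(int id) {
  const int band_size = W * kGridCols;        // tiles per W-row band
  int band = id / band_size;
  int rem  = id % band_size;
  int col  = rem / W;                          // sweep columns within the band...
  int row  = band * W + rem % W;               // ...walking W rows in each column
  int xcd  = (id / C) % kXCDs;                 // chunks of C tiles share a chiplet
  return {row, col, xcd};
}

int main() {
  for (int id = 0; id < kGridRows * kGridCols; ++id) {
    Block b = remap(id);
    std::printf("block %2d -> tile(%d,%d) on XCD %d\n", id, b.row, b.col, b.xcd);
  }
  return 0;
}
```

In the 8x8 demo grid, blocks 0-3 land on XCD 0 and form a 4x1 strip that reuses one column block of the second input, while the whole first band keeps the same four row blocks of the first input resident as it sweeps across columns.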
What did they find, and why does it matter?
The authors tested HipKittens on AMD’s MI325X (CDNA3) and MI355X (CDNA4) GPUs across common AI workloads:
- GEMMs (matrix multiplications), the backbone of deep learning
- Attention (including multi-head attention, MHA, and group-query attention, GQA), a key part of transformers
- Memory-bound ops like rotary embeddings (RoPE) and layer normalization
Key results:
- HipKittens kernels match or beat AMD’s hand-optimized assembly in several cases (BF16 and FP8 GEMMs, attention forward).
- They consistently outperform compiler-generated kernels (e.g., up to 3× faster than Triton for BF16 GEMMs, up to 2× faster than Mojo’s attention forwards).
- In areas where assembly libraries don’t cover all shapes or features (like some attention forms or memory-heavy kernels), HipKittens sometimes wins by 1.2–10×.
- The simple 8-wave ping-pong schedule was often enough to reach top performance; the 4-wave interleave schedule pushed even further on tough cases like GQA backward attention.
- Importantly, tile-based programming from ThunderKittens (originally for NVIDIA) transfers to AMD—with new AMD-specific tweaks—suggesting a unified tile-based model can work across different GPU brands.
Why this matters:
- Fast AMD kernels without raw assembly lower the barrier for developers and reduce reliance on a single vendor (the “hardware lottery”).
- It helps use AMD’s strong memory and compute capabilities fully, unlocking more total compute for AI training and inference.
- A common, portable programming style across vendors speeds up innovation and reduces the time-to-performance for new hardware features.
What are the broader implications?
HipKittens shows that a small, well-designed set of primitives—tiles, schedules that overlap compute and memory, and cache-aware grid planning—can give developers the control they need to write high-performance AMD AI kernels without resorting to fragile, hard-to-scale assembly. This:
- Opens the path to a unified, tile-based programming layer that works across GPU vendors.
- Makes high-performance AI more accessible, so more people and teams can optimize new workloads quickly.
- Helps the AI ecosystem tap more hardware capacity, which is crucial to training larger models and serving them efficiently.
The framework is open source (https://github.com/HazyResearch/HipKittens), so others can build on it. If you think of AI as a race car, HipKittens is like a high-quality toolkit that lets more mechanics tune engines for different brands—fast, safely, and in a repeatable way.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of unresolved issues and concrete research directions suggested by the paper.
- Generality across AMD generations and product lines: HK is validated on CDNA3 (MI325X) and CDNA4 (MI355X) only; it is unclear how the proposed primitives and schedules transfer to future CDNA (e.g., CDNA5) or RDNA compute GPUs with different bank widths, LDS behavior, or matrix core shapes.
- Cross-vendor unification: The paper claims tile primitives can generalize across vendors but HK is AMD-only; there is no demonstration of one unified DSL running unchanged on NVIDIA and AMD, nor an ABI/IR layer that isolates vendor-specific assembly wrapping.
- Compiler evolution sensitivity: Pinned register tiles bypass HIPCC constraints, but the long-term compatibility with evolving ROCm/HIPCC (e.g., if AGPR usage rules change, register allocation policies shift, or ISA constraints tighten) is unexamined; a fall-back path and automated detection are needed.
- Safety and correctness of explicit register pinning: There is no formal verification or tooling to ensure pinned-register schedules avoid hazards (bank conflicts, false dependencies, live range overlap, spills, or deadlocks); methods to statically or dynamically validate correctness would be beneficial.
- Undocumented phase behavior: The LDS phase execution order and bank behavior are stated to be undocumented; the “solver” used to infer phases is not described, validated, or released. A general, provably correct method (and dataset of validations across ISA revisions) is missing.
- Swizzle coverage and automation: HK supports swizzles for “commonly co-occurring” layouts; there is no systematic coverage plan for less common MFMA shapes or mixed layouts, nor an automated swizzle generator that guarantees bank-conflict freedom across all supported instructions.
- Tails and non-multiple tile sizes: Tiles are restricted to multiples of matrix core shapes; handling of boundary tiles (“tails”) and correctness/performance tradeoffs for non-divisible dimensions is not detailed.
- Asynchronous HBM→LDS addressing safety: The paper notes swizzling via HBM addresses for direct async loads but does not discuss bounds checking, alignment constraints, or protection against out-of-range addressing in complex tiling schemes.
- Scheduling selection heuristics: While 8-wave ping-pong and 4-wave interleave are proposed, there is no principled heuristic or auto-tuner to select between them based on workload compute/memory balance, register pressure, or LDS saturation at runtime.
- Pipeline depth vs tile size tradeoffs: The interaction between pipeline depth, MFMA instruction granularity, and register/shared-memory pressure is not fully characterized; a more general model or tuner could automate these choices across problem sizes and dtypes.
- Synchronization overhead characterization: Shared-memory atomics are asserted to have “negligible overhead,” but detailed measurements across workloads (including contention scenarios) and comparison to hardware barriers (if/when available) are missing.
- Grid scheduling generality: The XCD swizzle algorithm targets GEMM; extension to attention, reductions, batched/strided GEMMs, non-rectangular tilings, and kernels with irregular reuse patterns is not explored.
- Automatic tuning for W and C parameters: The choice of window height W and chunk size C is empirical; an analytical model or auto-tuner using hardware counters (e.g., L2/LLC hit rates, latency) could generalize the scheduling across shapes and system states.
- Robustness to runtime variability: Cache-aware scheduling assumes round-robin XCD assignment; sensitivity to driver/runtime changes, multi-tenant interference, and concurrent kernel execution is not analyzed.
- End-to-end model impact: Benchmarks report TFLOPs/s and bandwidth, but there is no end-to-end training/inference evaluation (e.g., tokens/sec, time-to-train, energy efficiency) for full models (LLMs with GQA/MHA), making practical impact uncertain.
- Numerical stability and accuracy: FP8/MXFP formats are benchmarked for speed, but there is no analysis of numerical accuracy, scaling strategies, accumulation precision, or training convergence for mixed precision pipelines on AMD.
- Wider workload coverage: Key AI workloads such as convolution, MoE expert routing/dispatch, sparse/structured kernels, batched GEMM variants, KV cache management, prefix-decoding, and comms overlap (RCCL) are not covered; the generality of HK across these remains open.
- Interoperability with frameworks: Integration details for PyTorch (autograd, graph capture, stream semantics, tensor layout compatibility), RCCL, and ROCm libraries are not specified; friction points and required APIs are unclear.
- Performance counter instrumentation: The cache cost model relies on L2/LLC hit rates; more granular stall analysis (e.g., MFMA issue, LDS contention, memory pipeline occupancy) and fine-grained counters could refine tuning and attribution of bottlenecks.
- Fairness and reproducibility of baselines: Some baselines (e.g., Mojo nightly) and ROCm 7.0 preview docker suggest volatile environments; reproducibility across software versions, tuning parity, and fair parameterization (especially for assembly libraries) need tightening.
- Energy and thermal constraints: No measurements of power, energy per operation, or thermal throttling behavior are provided; optimization might shift if energy efficiency is targeted.
- Multi-GPU/distributed considerations: How chiplet-aware scheduling interacts with multi-GPU training, pipeline/tensor parallelism, and communication-compute overlap is unexplored.
- Portability of LDS and MFMA assumptions: The chosen swizzles and schedules rely on specific LDS bank widths and MFMA instruction sets; future ISA changes (e.g., new MFMA shapes, altered LDS banking) may break assumptions without an abstraction layer.
- Handling dynamic shapes and sequence lengths: Real workloads vary sequence length and head dimensions; adaptive tiling, scheduling, and cache-aware ordering for dynamic shapes are not addressed.
- Tooling for developers: Debugging, profiling, and ergonomics for pinned registers, LDS swizzles, and schedule design are not described; developer-facing tooling (visualizers, static analyzers, correctness tests) would help adoption.
- Interaction with future hardware features: If AMD introduces features analogous to NVIDIA’s mbarriers, WGMMA-like shared-memory operands, or register reallocation, how HK’s schedules would evolve is an open design question.
- Release completeness: Appendices and code listings referenced for kernels are incomplete in the text; ensuring all algorithms (phase solver, swizzles, grid schedulers) are fully documented and open-sourced would aid reproduction and extension.
Practical Applications
Immediate Applications
Based on the paper’s results and released code, the following applications can be deployed now on AMD CDNA3/CDNA4 hardware (e.g., MI325X, MI355X) with ROCm 7.0.
- High-performance GEMM and attention kernels for AMD clusters
- Sector: software/cloud, AI infrastructure, finance (risk analytics), energy (grid simulations), healthcare (imaging, genomics), academia (HPC/ML research)
- What: Drop-in HipKittens kernels for BF16 and FP8 GEMM, MHA/GQA attention forward/backward, RoPE, and LayerNorm that match or beat AMD’s assembly baselines and outperform compiler-generated kernels (e.g., Triton) by up to 3×.
- Tools/products/workflows: HipKittens library bindings; ROCm/PyTorch custom ops; model training/inference pipelines that swap in HK kernels for matmul and attention blocks.
- Dependencies/assumptions: ROCm 7.0, AMD Instinct GPUs (CDNA3/CDNA4), problem shapes aligned with HK tile constraints, Python bindings except where FP8 support is still experimental in PyTorch ROCm.
- Cost and energy efficiency improvements for AI training and inference on AMD
- Sector: cloud providers, MLOps, sustainability/green IT
- What: Use HK’s 8-wave ping-pong schedule and chiplet-aware grid ordering to raise arithmetic intensity, increase L2/LLC hit rates, and reduce HBM traffic—improving throughput (TFLOPS) and energy efficiency at data-center scale.
- Tools/products/workflows: HK scheduling templates integrated into training loops; ROCm profiler-guided tuning using rocprofv3 counters (LDS bank conflicts, L2/LLC hit rates); automated kernel selection per workload shape.
- Dependencies/assumptions: Accurate cache behavior modeling on target GPUs; cluster monitoring to verify energy savings; availability of ROCm performance counters.
- Diversified compute procurement and deployment beyond CUDA-centric stacks
- Sector: industry policy, government labs, cloud marketplaces
- What: Immediate ability to deploy competitive AMD GPU SKUs for large-scale AI workloads using HK kernels, reducing CUDA lock-in (“moat”) and enabling multi-vendor capacity scaling.
- Tools/products/workflows: AMD-backed model training services; HIP/ROCm-based AI stacks; internal procurement guidelines highlighting HK availability and performance.
- Dependencies/assumptions: Organizational willingness to diversify; qualification of AMD stacks in existing CI/CD; availability of required model architectures and data loaders on ROCm.
- Compiler and DSL engineering improvements using HK’s primitives
- Sector: software tooling, compilers, MLIR/LLVM ecosystem
- What: Incorporate HK’s pinned-register control, MFMA-sized tiles, bank-conflict-free swizzles, and 8-wave/4-wave schedules into HIPCC/Triton/CuTe/Gluon backends to close performance gaps on AMD.
- Tools/products/workflows: New MLIR passes for register pinning and phase-aware swizzling; Triton/TileLang schedules that mimic HK’s 8-wave ping-pong; ROCm library updates.
- Dependencies/assumptions: Compiler teams’ bandwidth; alignment with AMD ISA constraints (e.g., AGPR usage policies).
- Teaching and curriculum for modern GPU kernel design
- Sector: education, academia (systems/architecture/ML)
- What: Use HK to teach tile-based programming, register scheduling, shared memory banking, and chiplet-aware cache strategies with hands-on labs on AMD GPUs.
- Tools/products/workflows: Course modules and lab notebooks showing MFMA tiles, LDS swizzles, wave scheduling; reproducible performance microbenchmarks with rocprofv3.
- Dependencies/assumptions: Access to AMD GPU nodes; instructor familiarity with HIP/C++ and ROCm.
- Performance profiling and debugging workflows for cache and shared memory
- Sector: software performance engineering
- What: Adopt HK’s diagnostics to detect and fix LDS bank conflicts and poor L2/LLC reuse in production kernels.
- Tools/products/workflows: Profiling harnesses capturing SQ_LDS_BANK_CONFLICT, SQ_INSTS_LDS, and L2/LLC hit rates; automated remediation via HK's swizzles and grid rescheduling.
- Dependencies/assumptions: Stable ROCm perf APIs; developer access to kernel code for remediation; problem shapes that benefit from cache-aware remapping.
- Accelerated domain workloads through matmul/attention substitutions
- Sector: healthcare (foundation models for imaging/text), finance (time-series transformers), energy (forecasting, HPC simulations), scientific computing
- What: Replace bottleneck kernels in domain models (matmul, attention, norms, RoPE) with HK versions to reduce training/inference time.
- Tools/products/workflows: PyTorch-ROCm custom ops / FX graph passes to rewrite critical nodes to HK kernels; validation harnesses to ensure numerics match.
- Dependencies/assumptions: Model graph is compatible with HK kernel interfaces; FP formats (BF16/FP8) acceptable for accuracy targets; numerical equivalence verified.
- Multi-tenant runtime plugins that reduce cross-chiplet cache thrash
- Sector: cloud runtime, job schedulers
- What: Apply HK’s XCD swizzle (window/chunk scheduling) to reorder block launches for better cache locality across tenants, especially in shared clusters.
- Tools/products/workflows: HIP runtime plugin or ROCm scheduler hook that remaps (b.x, b.y) per Algorithm 1; per-job tuning of window W and chunk C.
- Dependencies/assumptions: Ability to intercept and remap grid launches; minimal interference with framework-level launch policies; cache topology matches MI325X/MI355X assumptions.
Long-Term Applications
These depend on broader adoption, further research, or hardware/compiler evolution.
- A unified, vendor-agnostic tile-based kernel DSL
- Sector: software tooling, cross-vendor compute
- What: Converge ThunderKittens (NVIDIA) and HipKittens (AMD) primitives into a single DSL that targets multiple accelerators (NVIDIA, AMD, future architectures).
- Tools/products/workflows: Common tile APIs; auto-selection of vendor-specific swizzles/schedules; shared test suites; cross-vendor autotuners.
- Dependencies/assumptions: Vendor ISA harmonization or abstraction layers; mature compiler backends for MFMA/WGMMA equivalents; sustained open-source collaboration.
- AI-assisted kernel generation targeting HK primitives
- Sector: AI tooling for systems
- What: RL/agentic systems that learn to assemble HK tiles and schedules (8-wave/4-wave) to produce near-assembly performance, avoiding reward hacking by grounding in perf counters.
- Tools/products/workflows: Training environments with rocprofv3 metrics; reward models for cache reuse and bank conflicts; generative coders constrained by the HK API.
- Dependencies/assumptions: Stable perf measurements; robust anti-reward-hacking designs; datasets of high-quality AMD kernels.
- Compiler evolution: full register control and AGPR-friendly pipelines
- Sector: compilers/LLVM/MLIR
- What: HIPCC and MLIR passes that support using AGPRs as inputs to matrix instructions and expose programmer hints to steer register lifetimes—so developers don’t need to bypass compilers.
- Tools/products/workflows: New IR intrinsics; register-pinning metadata; matrix-phase-aware memory ops.
- Dependencies/assumptions: ISA/legal constraints; vendor roadmap alignment; backward compatibility.
- Generalizing chiplet-aware scheduling to more workloads and hardware
- Sector: HPC, AI systems
- What: Extend XCD swizzle to convolutions, MoE routing, sparse GEMM, pipeline-parallel training, and cross-node NUMA-aware placement; port to other chiplet designs (e.g., future NVIDIA/Intel/accelerators).
- Tools/products/workflows: Cache-aware launch planners; workload-shape-specific heuristics; cluster schedulers that couple kernel-level cache reuse with node-level NUMA.
- Dependencies/assumptions: Detailed cache topology per vendor; APIs to control launch order; accurate cost models across ops.
- End-to-end ROCm integration and standardized kernel packs
- Sector: ML frameworks, package ecosystems
- What: Curated “HK Kernel Packs” that frameworks (PyTorch, JAX-on-ROCm) can select at runtime based on shape/dtype, with safe fallbacks to compilers.
- Tools/products/workflows: Runtime kernel registries; graph rewrite passes; CI for numerical fidelity and perf regression.
- Dependencies/assumptions: Framework buy-in; robust ABI stability; FP8 support standardized.
- Sector-specific AI platforms on AMD
- Sector: healthcare, finance, energy
- What: Full-stack AMD-optimized platforms (data loaders, training loops, inference servers) for regulated sectors where cost/performance matters (e.g., medical imaging foundation models, risk engines, grid forecasting).
- Tools/products/workflows: Domain libraries pre-wired to HK kernels; compliance-ready deployment blueprints; perf+accuracy validation suites.
- Dependencies/assumptions: Sector validation, accuracy/tolerance for BF16/FP8; long-term ROCm support; security/compliance certifications.
- Real-time and edge robotics on AMD (as AMD’s GPU portfolio broadens)
- Sector: robotics/autonomy
- What: Apply HK-style scheduling and bank-conflict mitigation to time-sensitive kernels (attention, sensor fusion) as AMD releases edge-friendly accelerators.
- Tools/products/workflows: Embedded runtime variants; real-time scheduling heuristics; minimal-latency memory layouts.
- Dependencies/assumptions: Availability of suitable AMD edge GPUs; deterministic kernel timing; robotics framework integration.
- Sustainability and policy: compute diversification and carbon-aware scheduling
- Sector: public policy, sustainability
- What: Use HK-enabled AMD performance to support policies encouraging multi-vendor compute procurement and carbon-aware workload placement (e.g., favoring cache-efficient kernels to reduce power).
- Tools/products/workflows: Procurement guidelines; carbon accounting tied to perf counters; scheduling policies privileging high arithmetic intensity.
- Dependencies/assumptions: Access to energy telemetry; policy coordination across agencies; standardized metrics.
- Curriculum standardization for cross-accelerator GPU programming
- Sector: education
- What: Establish a cross-vendor course track where students learn unified tile abstractions, cache models, and scheduling patterns applicable to AMD/NVIDIA/others.
- Tools/products/workflows: Shared labs, vendor-neutral DSLs, performance challenges.
- Dependencies/assumptions: Ongoing vendor support and academic partnerships.
- Safety and reliability engineering for high-performance kernels
- Sector: software assurance
- What: Formalize testing harnesses and verification for HK-style kernels (numerical stability, overflow handling in FP8, edge-case shapes) used in critical systems.
- Tools/products/workflows: Property-based tests; perf/accuracy Pareto dashboards; fallback/rollback mechanisms.
- Dependencies/assumptions: Sector requirements (medical/financial standards); tooling maturity for low-level kernel verification.
Cross-cutting assumptions and dependencies
- Hardware: AMD CDNA3/CDNA4 availability; future architectures may change register, cache, and shared memory behaviors.
- Software stack: ROCm 7.0 maturity; PyTorch ROCm feature parity (e.g., stable FP8); HIPCC compiler capabilities.
- Workload shapes: HK tile constraints require M/N/K multiples of MFMA shapes; attention head dims and sequence lengths affect schedule choice (8-wave vs 4-wave).
- Portability: While tiles generalize, layouts and phases are vendor-specific; chiplet-aware scheduling must be retuned per hardware generation.
- Expertise: Effective use of HK benefits from developers with HIP/C++ and performance profiling experience; broader adoption may need abstractions in higher-level frameworks.
Glossary
- AGPRs (Accumulator General-Purpose Registers): A class of AMD registers used to hold accumulator values, distinct from vector GPRs. "HIPCC prevents the HIP developer from using certain types of registers (AGPRs) as input operands to matrix instructions."
- arithmetic intensity: The ratio of computation to memory traffic; higher values generally yield better performance. "AMD's static register allocation means that producer waves consume registers without contributing to computation, which limits the output tile size that gets computed per thread block and thus the kernel's arithmetic intensity."
- bank conflicts: Performance-degrading events when multiple threads access the same shared memory bank simultaneously. "On the left is an unswizzled layout suffering from 2-way bank conflicts."
- BF16 (bfloat16): A 16-bit floating-point format with 8-bit exponent and 7-bit mantissa, commonly used in AI workloads. "On the MI355X, wave specialization achieves just 80\% of peak BF16 GEMM performance (Tab.~\ref{tab:producer_consumer_ablations})."
- chiplet architectures: GPU designs composed of multiple smaller dies (chiplets) interconnected, affecting cache hierarchy and scheduling. "Chiplet architectures are becoming the dominant path to GPU scaling---NVIDIA Blackwell uses 2 chips, AMD MI355X uses 8---but existing frameworks ignore their hierarchical cache structures, leaving performance untapped."
- CUTLASS: NVIDIA’s CUDA Templates for Linear Algebra Subroutines and Solvers, a performance-oriented CUDA library for GEMM and related kernels. "using low level CUDA/CUTLASS, it took two years between the H100 GPU's release and the release of peak performance open-source attention kernels~\cite{dao2024flashattention3}."
- ds_read_b128: An AMD shared-memory load instruction that reads 128 bits per access with specific banking behavior. "ds_read_b128 accesses shared memory through 64 banks, each 32-bits wide, and correspond the individual cells and numbers in the figure."
- GEMM (General Matrix Multiply): A core linear algebra operation computing D = A·B + C. "In a GEMM kernel (D = A·B + C), each thread block computes a distinct tile of the output matrix D."
- HBM (High Bandwidth Memory): High-speed external memory used by GPUs, sitting below LLC in the hierarchy. "All CUs share a large, slow global memory (HBM) and a last level cache (LLC) sits between L2 and HBM."
- HIP (Heterogeneous-Compute Interface for Portability): AMD’s C++ GPU programming model analogous to CUDA. "CUDA / HIP C++ gets compiled (via NVCC, HIPCC) to assembly, and the compiler may introduce its own instruction reordering and register lifetime tracking."
- HIPCC: The HIP C++ compiler for AMD GPUs. "However, while the hardware does support using AGPRs as input to matrix core instructions, HIPCC does not."
- L2 cache: A per-chiplet (per-XCD) on-GPU cache level offering higher bandwidth and lower latency than LLC/HBM. "Each XCD shares a non-programmable 4MB L2 cache."
- LLC (Last-Level Cache): The cache level shared across chiplets, sitting between L2 and HBM. "All CUs share a large, slow global memory (HBM) and a last level cache (LLC) sits between L2 and HBM."
- LLVM: A compiler infrastructure used by GPU compilers and DSLs; supports hints to guide code generation. "LLVM accepts compiler hints, which let the developer guide the compiler's behavior."
- matrix cores: Specialized GPU execution units for matrix operations (tensor/matrix math). "Threads execute instructions on physical execution units (ALU, FMA, matrix cores), which are specialized for different types of compute."
- matrix fused multiply add (MFMA): AMD’s matrix multiply-accumulate instruction family for high-throughput matrix ops. "A SIMD's compute wave executes matrix fused multiply add (MFMA) instructions while its paired memory wave prefetches the next data, hiding memory effectively."
- mbarriers: NVIDIA hardware synchronization primitives used in advanced GPU kernel pipelines. "NVIDIA kernels implement wave specialization using dedicated memory access hardware (tma), asynchronous matrix multiplies which accept operands directly from shared or tensor memory (wgmma, tcgen05), deep pipelines enabled by large shared memory per processor (B200 has 40\% larger SRAM than AMD MI355X per processor), register reallocation (where the register-efficiency of TMA lets producers give their registers to consumers), and hardware synchronization primitives (mbarriers)."
- occupancy: The degree to which GPU execution resources are kept busy by active threads/waves. "A few basic kernel patterns help developers achieve high occupancy, or schedule workers (waves on AMD, warps on NVIDIA) onto the different hardware execution units."
- pinned registers: Explicitly assigned physical registers that bypass compiler allocation for precise control. "The developer pins the registers belonging to each tile, rather than letting HIPCC manage the registers."
- producer-consumer (pattern): A scheduling pattern where some waves handle memory (producers) while others compute (consumers). "Modern NVIDIA kernels have consolidated around wave specialization (producer-consumer) scheduling patterns~\cite{dao2024flashattention3, spector2025tk, spector2025look, 25-comet, wang2025tilelang}."
- register reallocation: A feature where register resources are dynamically reassigned across waves to improve efficiency. "register reallocation (where the register-efficiency of TMA lets producers give their registers to consumers)"
- ROCm: AMD’s open software stack for GPU computing. "All kernels are benchmarked in AMD's recently released beta Docker using ROCm 7.0 (\url{rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta})."
- shared memory (LDS on AMD): Programmable on-chip memory shared among threads/waves in a block with banked access constraints. "Each CU has an L1 cache and shared memory that can be accessed by multiple waves in the same thread block."
- SIMD (Single Instruction, Multiple Data): Execution units processing multiple data elements with one instruction; AMD schedules waves per SIMD. "CUs organize their hardware resources in 4 'single instruction, multiple data' (SIMD) units."
- swizzling (tile/layout swizzle): A permuted memory layout designed to avoid bank conflicts and optimize access patterns. "{HK} introduces a general 8-wave ping-pong schedule to overlap compute and memory, programmer controlled register allocation, and efficient shared memory and chiplet-aware swizzling algorithms to enable a suite of high performance AMD AI kernels."
- TMA (Tensor Memory Accelerator): NVIDIA hardware for asynchronous memory movement that can bypass registers. "Like TMA, these loads bypass the register file."
- Triton: A high-level GPU compiler for writing custom kernels; easier to use but may lag peak performance/features. "Compilers like Triton~\cite{triton} are simpler to use, but sacrifice performance and struggle to quickly support new hardware features~\cite{spector2025tk, triton2025gluon}."
- VGPRs (Vector General-Purpose Registers): AMD vector registers used for general computation and as MFMA operands. "For kernels with a single SIMD per wave, the hardware splits the registers into 256 vector general-purpose registers (VGPRs) and 256 accumulator registers (AGPRs)."
- v_accvgpr_read: An AMD instruction to move data from accumulator registers to vector registers. "For workloads that involve both matrix and vector operations (e.g., attention backwards), kernels compiled via HIPCC would need to generate redundant \verb|v_accvgpr_read| instructions that move data from AGPRs to VGPRs prior to issuing matrix instructions."
- wave: A group of 64 threads that execute in lockstep on a SIMD on AMD GPUs. "waves, groups of 64 threads, execute in lockstep on individual SIMDs"
- wave specialization: A scheduling strategy assigning distinct roles to waves (memory vs compute) to build deep pipelines. "The wave specialization pattern dominates NVIDIA kernels and DSLs; producer waves handle memory operations while consumers execute bulk compute operations over large tiles."
- XCD (Accelerator Complex Die): A chiplet containing a cluster of compute units that share an L2 cache. "AMD MI355X GPUs contain 256 CUs organized into 8 accelerator complex dies (XCDs) of 32 CUs in a chiplet layout."