ThunderKittens: High-Performance CUDA Kernels

Updated 17 April 2026

ThunderKittens (TK) is a CUDA framework that employs a tile-centric design using 16×16 register tiles to efficiently leverage tensor cores and TMA.
Its pipelined LCSF model overlaps load, compute, and store stages to hide latency and achieve up to 800 TFLOPS on key AI kernels like GEMM and attention models.
With extensions like ParallelKittens, TK scales to multi-GPU environments by streamlining communication and persistent grid scheduling for distributed AI workloads.

ThunderKittens (TK) is a CUDA framework for writing high-performance, maintainable AI kernels by employing minimal but comprehensive abstractions at warp, thread-block, and grid levels. It was designed to counter the limitations of hand-written custom kernels, which often fail to approach architectural throughput even for well-studied primitives. TK is tile-centric, offering register-level 16×16 matrix tiles as its core data structure, leveraging hardware features such as NVIDIA’s tensor cores and Tensor Memory Accelerator (TMA). By abstracting boilerplate and emphasizing concise kernel expressivity, TK enables efficient mapping of AI operations to diverse GPU architectures while outperforming or matching leading bespoke and compiler-generated baselines across GEMM, attention, state-space models, and more (Spector et al., 2024).

1. Tile-Centric Warp-Level Abstractions

TK’s foundational abstraction is the 16×16 “register tile” (type rt_bf\<16,W>), where W is the column dimension, typically a multiple of 16. Each tile is stored in registers and mapped so that 32 threads in a warp own contiguous row elements, which aligns with the hardware organization of tensor-core (MMA) units. This enables bulk vectorized loads/stores and arithmetic instructions tailored for data-parallel matrix operations.

Shared-memory staging of tiles also receives careful engineering: for any compile-time tile width, TK selects one of three “swizzled” layouts (32-/64-/128-byte) to minimize bank conflicts, ensuring efficient use of both software-managed shared memory (SMEM) and hardware accelerators (TMA, WGMMA). All layout choices are computed at compile time, and the user interface exposes only the tile-centric API, hiding hardware-specific details entirely (Spector et al., 2024).

On-tile compute primitives follow a PyTorch-inspired API supporting major MMA, FMA, pointwise, and reduction operations, such as:

warpgroup::mm_ABt(att, q_reg, k_reg); (matrix multiply)
sub_row(att, att, max_vec); (row-wise subtraction)
exp(att, att); (elementwise exponentiation) TK maps these calls directly to tensor-core instructions with the necessary register arrangements, minimizing encoded instruction overhead.

2. Thread-Block-Level Producer–Consumer Pipelining

Thread-block kernels employ the “LCSF” (Load–Compute–Store–Finish) template. Developers supply up to four small lambdas for each stage, and TK orchestrates overlap between asynchronous producer warps (handling HBM loads/stores via TMA) and consumer warps (executing compute over on-SMEM/register tiles). Data transfer and computation are thus maximally overlapped, with synchronization and data-ready signaling handled by lightweight mechanisms (arrive(...)).

Pipeline depth is directly parameterizable (default: $S=4$ ), and increasing $S$ allows hiding more global memory latency: $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$ on $4096 \times 4096$ GEMM (H100). Warp occupancy (producer:consumer ratio) is also easily tuned, exposing performance–register pressure trade-offs (Spector et al., 2024).

3. Grid-Level Scheduling and Persistent Blocks

TK provides two grid-level optimizations. First, it supports persistent kernel launches: all SMs are occupied by a grid of long-lived blocks, which self-manage local work assignment, thus amortizing CUDA grid launch overhead. This produces significant throughput gains (e.g., 10–15% for GEMM with large $K$ ). Second, it enables systematic exploration of 3D tiling order for grid launch (block-major, $M$ -major, $N$ -major, $K$ -major etc.), materially affecting L2 reuse and global bandwidth utilization. As shown empirically, block reorderings can yield $>2\times$ throughput improvements (e.g., 392 to 805 TFLOPS for $16$k GEMM) purely by affecting data reuse in cache (Spector et al., 2024).

4. Performance Benchmarks and Kernel Applications

Extensive empirical evaluation demonstrates that <40–50 line TK kernels match or outperform custom and heavily optimized baselines:

GEMM: parity to cuBLAS at all square problem sizes up to $S$ 0k, both peaking at ~800 TFLOPS on H100.
Attention: matches or exceeds FlashAttention-3 (e.g., $S$ 1 TFLOPS forward, $S$ 2 TFLOPS backward vs. $S$ 3 and $S$ 4 TFLOPS, respectively), with up to 10–40% gains on backward pass.
Linear Attention: 14 $S$ 5 speedup over Triton-based FLA ( $S$ 6 TFLOPS).
FFT-based state space models: 7–8 $S$ 7 speedup over published baselines.

A salient property is the minimal device code required: complex fused attention, GEMM, and state-space kernels can typically be written with less than 50 lines of device-side logic, including all synchronization (Spector et al., 2024).

5. Best Practices and Key Design Principles

TK’s effectiveness derives from:

Tile-First Abstraction: All data movement/storage/compute derived from a fixed 16×16 (or 16× $S$ 8) tile, mapping perfectly to tensor core units.
Automatic Swizzled Layouts: Static compile-time layouts providing TMA/WGMMA compatibility and bank conflict avoidance.
Unified Pipeline Template: A single LCSF template suffices for all studied AI kernels, with 3–4 pipeline stages generally optimal to overlap $S$ 9500 cycles of global memory latency.
Explicit Occupancy/Staging Tuning: Direct control over occupancy and pipeline depth affords Pareto-optimal trade-offs between register pressure and arithmetic utilization.
Persistent Grids and Block-Order Tiling: Support for persistent blocks and block-order scheduling maximizes L2/data reuse and grid-level throughput.
C++ CUDA Embedding: Seamless recourse to inline PTX and all CUDA features for expert tuning without stepping outside the TK API.

Rule-of-thumb settings: tile size $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$0, pipeline stages $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$1–$\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$2, producer:consumer ratio between $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$3 and $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$4, block-ordering by $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$5 or $\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$6 for L2 reuse (Spector et al., 2024).

6. Systematic Multi-GPU Extension via ParallelKittens

ParallelKittens (PK) extends TK’s principles to multi-GPU settings, targeting scalable performance under the increasing constraints of inter-GPU bandwidth. PK introduces the Parallel Global Layout (PGL)—symmetric memory regions allocated and peer-accessible across GPUs—and eight tile-level primitives leveraging device-initiated TMA transfers, in-network multicast/reduction, and fine-grained signaling primitives. PK’s unified Load-Compute-Store-Communicate (LCSC) template generalizes the LCSF scheme: compute and communication roles are divided among SMs, synchronizing at tile granularity.

Primitives such as store_async, store_add_async, reduce, all_reduce, signal, signal_all, wait, and barrier expose the full suite of one-way, pipelined, TMA-driven communication and synchronization, removing costly two-way handshakes typical in libraries such as NCCL/NVSHMEM (Sul et al., 17 Nov 2025).

Empirical results indicate 2.33$\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$7 speedup (data/tensor parallel workloads), 4.08$\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$8 (sequence parallel workloads), and up to 1.22$\begin{array}{c|cccc} \text{Stages }S & 1 & 2 & 3 & 4 \ \hline \text{TFLOPS} & 260 & 484 & 683 & 760 \ \end{array}$9 (expert parallelism) with consistently minimal additional code, validating the hypothesis that TK’s tile-centric abstractions systematically extend to the multi-GPU regime (Sul et al., 17 Nov 2025).

7. Adoption, Accessibility, and Implementation Experience

Because all abstractions in TK are embedded in C++, developers retain full CUDA functionality, including direct use of special hardware instructions. Empirical studies show that even undergraduate students with no prior CUDA experience could implement major AI primitives (such as FlashAttention-3, cuBLAS-equivalent GEMM) at state-of-the-art performance within weeks, illustrating the practical simplicity and pedagogical clarity of the design (Spector et al., 2024).

A plausible implication is that ThunderKittens may significantly lower both the barrier to entry and code maintenance overhead for researchers targeting high-performance CUDA kernels, provided their workloads fit the tiled, warp/block/grid scheduling regime. The systematic extension via ParallelKittens confirms that these principles generalize to complex communication-limited distributed AI settings (Sul et al., 17 Nov 2025).

Markdown Report Issue Upgrade to Chat

References (2)

ThunderKittens: Simple, Fast, and Adorable AI Kernels (2024)

ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to ThunderKittens (TK).

ThunderKittens: High-Performance CUDA Kernels

1. Tile-Centric Warp-Level Abstractions

2. Thread-Block-Level Producer–Consumer Pipelining

3. Grid-Level Scheduling and Persistent Blocks

4. Performance Benchmarks and Kernel Applications

5. Best Practices and Key Design Principles

6. Systematic Multi-GPU Extension via ParallelKittens

7. Adoption, Accessibility, and Implementation Experience

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

ThunderKittens: High-Performance CUDA Kernels

1. Tile-Centric Warp-Level Abstractions

2. Thread-Block-Level Producer–Consumer Pipelining

3. Grid-Level Scheduling and Persistent Blocks

4. Performance Benchmarks and Kernel Applications

5. Best Practices and Key Design Principles

6. Systematic Multi-GPU Extension via ParallelKittens

7. Adoption, Accessibility, and Implementation Experience

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research