Papers
Topics
Authors
Recent
Search
2000 character limit reached

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Published 25 Apr 2026 in cs.LG, cs.AI, and cs.AR | (2604.23466v1)

Abstract: NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.

Summary

  • The paper introduces NVIDIA’s CuTile abstraction to simplify GPU kernel programming for AI workloads on Hopper and Blackwell GPUs.
  • It compares CuTile performance against cuBLAS, Triton, WMMA, and SIMT, showing up to 79% GEMM throughput and 2.5× fused attention gains on B200.
  • The study highlights cross-architecture optimization challenges, emphasizing the need for improved compiler tuning for workstation-class GPUs.

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Introduction and Overview

The paper "Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs" (2604.23466) addresses the challenges associated with high-performance GPU kernel programming, particularly as they relate to modern transformer architectures and dense AI workloads. CuTile, a Python-based, tile-centric abstraction introduced by NVIDIA, is proposed to simplify GPU kernel development without sacrificing Tensor Core and Tensor Memory Accelerator functionality. This study constitutes the first independent, cross-architecture evaluation of CuTile on NVIDIA's Hopper and Blackwell GPUs.

Methodology and Implementation Variants

The study evaluates CuTile's performance against established methods including cuBLAS, Triton, WMMA, and raw SIMT models. It focuses on several representative AI workloads: General Matrix Multiply (GEMM), fused multi-head attention, and end-to-end LLM inference. The GPUs assessed include the H100 NVL (Hopper), B200, and RTX PRO 6000 (Blackwell Server Edition). The study examines not only raw performance but also portability across architectures, an area where Triton has established strength through its ability to sustain high performance across diverse hardware configurations.

GEMM Evaluation

CuTile's GEMM performance demonstrated substantial efficiency, achieving 52-79% of the throughput of the optimized cuBLAS library across Blackwell architectures, despite requiring far fewer lines of code than WMMA kernels. The gains CuTile presents over WMMA underscore its practicality as it notably simplifies the kernel development process for Tensor Cores without matching cuBLAS's upper limit of performance. Figure 1

Figure 1: GEMM performance (TFLOP/s, BF16) across four square matrix sizes on three GPUs. cuBLAS dominates on all platforms. CuTile (available only on Blackwell) achieves 52-79% of cuBLAS, significantly outperforming WMMA. Raw SIMT is omitted (6-8 TFLOP/s) for visual clarity.

Fused Attention Performance

CuTile shines in its application to fused multi-head attention, especially on the B200 GPU where it achieves up to 1{,}007 TFLOP/s, outperforming FlashAttention-2 by 2.5 times. However, this performance boost is highly architecture-dependent, as evidenced by CuTile's reduced efficacy on the RTX PRO 6000 where throughput reaches only 53% of FlashAttention-2, revealing significant cross-architecture optimization gaps. Figure 2

Figure 2: Fused attention throughput (TFLOP/s) vs. sequence length (BF16, causal, batch=8). On the B200 (right), CuTile (red) dramatically outscales all other implementations, reaching up to 1{,}007 TFLOP/s.

CuTile Attention Paradox

The substantial performance variance between the B200 and RTX PRO 6000 platforms, despite identical CuTile kernel code, underscores a critical gap in compiler optimization specifically targeted at different Blackwell variants. This points to the importance of future compiler maturity and enhancements to better cater to workstation-class GPUs. Figure 3

Figure 3: The CuTile attention paradox: identical kernel code produces 2.51 times higher throughput than FlashAttention-2 on B200 (solid lines) but 47% lower on RTX PRO 6000 (dashed lines). This 5.6 times cross-architecture gap is the largest observed for any abstraction.

Implications and Future Directions

CuTile provides a substantial productivity benefit by allowing high-performance kernel expression in tens of lines of Python code. Its applicability is most pronounced on the B200 (datacenter-class Blackwell GPUs) due to impressive performance in fused attention tasks. However, the workstation-class RTX PRO 6000 exposes a need for significant cross-architecture tuning improvements. This suggests caution in deploying CuTile widely without thorough, hardware-specific benchmarking. Figure 4

Figure 4: Normalized performance heatmap. Left: GEMM throughput as percentage of cuBLAS. Right: attention throughput as percentage of FlashAttention-2. CuTile on B200 achieves up to 251% of FA2 (dark green), while CuTile on RTX PRO 6000 reaches only 53% (orange-red).

Conclusion

CuTile introduces a promising paradigm shift towards simplified, highly efficient kernel development for AI workloads within NVIDIA's ecosystem. It demonstrates potential as a practical replacement for more complex hand-written approaches on Blackwell architecture, specifically for attention workloads on datacenter-grade GPUs. However, issues of portability and optimization across different architectures remain, emphasizing the importance of targeted compiler advancements. As CUDA Toolkit and tileiras compiler technologies evolve, CuTile's utility and range are expected to expand, potentially offering comprehensive GPU kernel solutions for AI inference.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 34 likes about this paper.