Fearless Concurrency on the GPU

Published 14 Jun 2026 in cs.PL | (2606.15991v1)

Abstract: Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees. We present cuTile Rust, a tile-based system for safe, idiomatic GPU kernel authoring in Rust. cuTile Rust extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control. The system also provides a composable host execution model spanning synchronous launches, asynchronous pipelines, and CUDA graph replay. Our evaluation shows that these abstractions can preserve performance on high-end GPUs. On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise. Grout, a cuTile-Rust-based inference engine, exercises cuTile Rust across an end-to-end Qwen3 inference path. In batch-1 decode, Grout reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang and consistent with an HBM roofline sanity check.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces cuTile Rust, which extends Rust's ownership model to GPU kernels using a tile-based execution model to ensure data-race freedom.
It establishes a safe host–device interface with macro-generated type checking, allowing zero-cost safety in high-throughput GPU computations.
Evaluation shows performance nearly identical to unsafe baselines, validated by benchmarks and LLM inference use in the Grout engine.

Fearless Concurrency on the GPU: Extending Safe Rust Programming to Tile-Based GPU Kernels

Introduction

"Fearless Concurrency on the GPU" (2606.15991) addresses the longstanding gap between static safety guarantees in CPU-side systems programming and custom GPU kernel development. The Rust language’s affine type system enforces zero-cost safety and aliasing invariants, which prevent data races in concurrent CPU applications. However, existing approaches to authoring GPU kernels in Rust (e.g., rust-gpu, rust-cuda, cuda-oxide) require pervasive unsafe code, particularly due to the mismatch between Rust’s ownership model and typical SPMD GPU execution, compromising the core tenet of fearless concurrency. This work proposes and evaluates cuTile Rust, a programmable tile-based system that preserves Rust’s safety properties for both host and device code in GPU-accelerated computations.

System Model and Programming Abstractions

Tile-Based Execution and Partitioned Ownership

cuTile Rust extends the notion of ownership and borrowing from Rust to the tile-based GPU programming model. The system operates by logically partitioning mutable outputs into disjoint sub-tensors (tiles), assigning exclusive access to each tile program within a SPMD kernel launch. Immutable tensors are broadcast as shared references. This mapping is enforced so that at any point, the borrow checker ensures the absence of aliasing between mutable and shared accesses, even across the host—device boundary.

Tile computation is described as a grid of sequential programs, each operating with single-threaded semantics over tiles. The correspondence between host-side tensor partitioning and device-side tile dispatching enables static partition invariants to be discharged at compile time, typically eliminating dynamic checks from the inner execution path, except in cases explicitly marked as unsafe.

Safe Host–Device Interface and Kernel Launch Protocol

The host-to-device type mapping is materialized through a macro-generated, type-checked launch boundary. Data transfer and ownership acquisition are encoded as part of the kernel parameter protocol, using traits to enforce the prepare–recover phase: arguments are relinquished before launch and restored after stream synchronization. Thus, the semantics of borrowing, moving, and referencing tensors are precisely maintained across the launch boundary, transparently to the kernel author.

The host execution model composes device operations as lazy, typed expressions (DeviceOp). These operations can be executed synchronously, through async Rust task scheduling, or captured as CUDA Graphs for efficient replay. The system provides mechanisms for monadic chaining (then), operation fusion, and parallelism over complex host-side pipelines, including CUDA Graph scoped composition with borrow checking to ensure correct lifetime management.

Token-Based Memory Ordering Semantics

cuTile Rust leverages Tile IR’s explicit token-based memory ordering to ensure correctness in device-side mutation. Token threading establishes a sequential happens-before relation between loads and stores to mutable tensor views within a tile program, reflecting Rust’s intra-thread ordering invariants. In contrast, loads from immutable (&Tensor) types are unconstrained and amenable to compiler reordering for throughput; thus, partitioned access guarantees and token ordering compositionally enforce data-race freedom per the Tile IR memory model.

Expressiveness and Escape Hatches

cuTile Rust introduces MappedPartitonMut types, extending tensor partitions to support scheduling where a kernel may need to mutate multiple output sub-tiles sequentially (e.g., persistent GEMM). Bounded dimension iterators and branded partition indices provide static proofs of access disjointness within each kernel launch. For cases where tile-based statically-checked access is insufficient or too restrictive for optimal performance (e.g., fused or highly specialized attention kernels), unchecked access and raw pointer-based APIs are admitted only inside explicit unsafe code sections.

Evaluation

Zero-Overhead Safety in High-Throughput Kernels

Empirical results on the NVIDIA B200 GPU demonstrate that cuTile Rust achieves nearly transparent performance compared to both cuTile Python and low-level unsafe Rust baselines. For memory-bound, element-wise operations, both safe and unsafe Rust implementations reach the device’s theoretical DRAM limit (7 TB/s for two reads, one write). For compute-bound kernels (e.g., GEMM with f16, $M = N = K = 8192$ ), safe cuTile Rust attains 2.07 PFlop/s, which is 96.4% of cuBLAS and indistinguishable from the raw-pointer, manual partitioning baseline.

Host Execution Models: Synchronous, Asynchronous, and CUDA Graph Replay

Host pipeline composition (chains of up to 1000 kernels) was profiled in sync, async, and CUDA graph replay execution, showing that the cost is amortized once pipelining is leveraged, with CUDA graph replay achieving near-optimal dispatch limits. Asynchronous execution naturally overlaps host work (e.g., I/O, pre-processing), enabling a single host thread to saturate device streams and minimizing CPU footprint compared to thread-per-stream models.

End-to-End LLM Inference: Grout Engine

cuTile Rust serves as the foundation for Grout, a Qwen3 LLM batch-1 inference engine. Across RTX 5090 and B200, Grout matches or surpasses the decode throughput and prompt latency of state-of-the-art inference systems like vLLM and SGLang. Notably, Qwen3-4B decode achieves 154.7 tok/s on the RTX 5090 (74.7% of HBM roofline), and Qwen3-32B reaches 80.1 tok/s on the B200 (66.7% of HBM roofline). Grout fuses key operations and relies on cuTile Rust’s composable, safe kernel path for the majority of non-GEMM operators, using explicit unsafe only where required by performance constraints.

Data-Race Freedom and Theoretical Properties

The safe tensor API construction and kernel launch protocol provide formal guarantees of data-race freedom under the Tile IR weakly ordered memory model. The mapped partition injectivity and intra-tile token chains, together with Rust’s affine types, structurally preclude race conditions. Data-race causing bugs (e.g., index swaps leading to overlapping writes) cannot be expressed in safe cuTile Rust code.

cuTile Rust crystallizes several advances in tile-based and safe GPU programming. Distinct from efforts like Rust-CUDA, rust-gpu, or cubecl, which focus primarily on enabling Rust codegen or multi-backend support, cuTile Rust targets a high-level kernel programming model enforcing host–device ownership contracts. Compared to existing tile systems (e.g., Triton, ThunderKittens, Pallas), cuTile Rust is unique in statically checking access disjointness and memory safety, rather than relying on post-hoc verification or runtime detection. It contrasts with DSLs like Descend or Mojo in that it leverages Rust’s existing type system, avoiding the need to develop or maintain additional language infrastructure.

Limitations and Future Directions

Tile-based programming abstracts over SIMT-level control, thus precluding some low-level warp- or shared-memory-specialized kernels. While GEMM and a significant class of tensor computations map cleanly, complete elimination of raw pointer escape hatches for all relevant patterns requires further API generalization. Quantifying the impact of async execution on CPU utilization and power for real-world workloads (especially in edge or server environments) remains open. Extending partitioning principles for cross-device, multi-GPU, and collective operation safety offers a promising avenue for future research.

Conclusion

cuTile Rust demonstrates that it is possible to achieve data-race-free, zero-cost, safe GPU kernel programming in Rust for a wide range of tensor computations by mapping the affine ownership model onto a tile-based abstraction. Systematic composition of safe tensor and partition APIs results in performance that is within the noise floor of equivalent unsafe or vendor-library code for practical workloads in modern deep learning, as evidenced by both microbenchmarks and LLM inference evaluation. The work conclusively shows that static safety guarantees, often presumed infeasible in high-performance device-side programming, can be realized without a compromise in expressiveness or efficiency for a critical subset of GPU workloads.

Markdown Report Issue