- The paper introduces cuTile Rust, which extends Rust's ownership model to GPU kernels using a tile-based execution model to ensure data-race freedom.
- It establishes a safe host–device interface with macro-generated type checking, allowing zero-cost safety in high-throughput GPU computations.
- Evaluation shows performance nearly identical to unsafe baselines, validated by benchmarks and LLM inference use in the Grout engine.
Fearless Concurrency on the GPU: Extending Safe Rust Programming to Tile-Based GPU Kernels
Introduction
"Fearless Concurrency on the GPU" (2606.15991) addresses the longstanding gap between static safety guarantees in CPU-side systems programming and custom GPU kernel development. The Rust language’s affine type system enforces zero-cost safety and aliasing invariants, which prevent data races in concurrent CPU applications. However, existing approaches to authoring GPU kernels in Rust (e.g., rust-gpu, rust-cuda, cuda-oxide) require pervasive unsafe code, particularly due to the mismatch between Rust’s ownership model and typical SPMD GPU execution, compromising the core tenet of fearless concurrency. This work proposes and evaluates cuTile Rust, a programmable tile-based system that preserves Rust’s safety properties for both host and device code in GPU-accelerated computations.
System Model and Programming Abstractions
Tile-Based Execution and Partitioned Ownership
cuTile Rust extends the notion of ownership and borrowing from Rust to the tile-based GPU programming model. The system operates by logically partitioning mutable outputs into disjoint sub-tensors (tiles), assigning exclusive access to each tile program within a SPMD kernel launch. Immutable tensors are broadcast as shared references. This mapping is enforced so that at any point, the borrow checker ensures the absence of aliasing between mutable and shared accesses, even across the host—device boundary.
Tile computation is described as a grid of sequential programs, each operating with single-threaded semantics over tiles. The correspondence between host-side tensor partitioning and device-side tile dispatching enables static partition invariants to be discharged at compile time, typically eliminating dynamic checks from the inner execution path, except in cases explicitly marked as unsafe.
Safe Host–Device Interface and Kernel Launch Protocol
The host-to-device type mapping is materialized through a macro-generated, type-checked launch boundary. Data transfer and ownership acquisition are encoded as part of the kernel parameter protocol, using traits to enforce the prepare–recover phase: arguments are relinquished before launch and restored after stream synchronization. Thus, the semantics of borrowing, moving, and referencing tensors are precisely maintained across the launch boundary, transparently to the kernel author.
The host execution model composes device operations as lazy, typed expressions (DeviceOp). These operations can be executed synchronously, through async Rust task scheduling, or captured as CUDA Graphs for efficient replay. The system provides mechanisms for monadic chaining (then), operation fusion, and parallelism over complex host-side pipelines, including CUDA Graph scoped composition with borrow checking to ensure correct lifetime management.
Token-Based Memory Ordering Semantics
cuTile Rust leverages Tile IR’s explicit token-based memory ordering to ensure correctness in device-side mutation. Token threading establishes a sequential happens-before relation between loads and stores to mutable tensor views within a tile program, reflecting Rust’s intra-thread ordering invariants. In contrast, loads from immutable (&Tensor) types are unconstrained and amenable to compiler reordering for throughput; thus, partitioned access guarantees and token ordering compositionally enforce data-race freedom per the Tile IR memory model.
Expressiveness and Escape Hatches
cuTile Rust introduces MappedPartitonMut types, extending tensor partitions to support scheduling where a kernel may need to mutate multiple output sub-tiles sequentially (e.g., persistent GEMM). Bounded dimension iterators and branded partition indices provide static proofs of access disjointness within each kernel launch. For cases where tile-based statically-checked access is insufficient or too restrictive for optimal performance (e.g., fused or highly specialized attention kernels), unchecked access and raw pointer-based APIs are admitted only inside explicit unsafe code sections.
Evaluation
Zero-Overhead Safety in High-Throughput Kernels
Empirical results on the NVIDIA B200 GPU demonstrate that cuTile Rust achieves nearly transparent performance compared to both cuTile Python and low-level unsafe Rust baselines. For memory-bound, element-wise operations, both safe and unsafe Rust implementations reach the device’s theoretical DRAM limit (7 TB/s for two reads, one write). For compute-bound kernels (e.g., GEMM with f16, M=N=K=8192), safe cuTile Rust attains 2.07 PFlop/s, which is 96.4% of cuBLAS and indistinguishable from the raw-pointer, manual partitioning baseline.
Host Execution Models: Synchronous, Asynchronous, and CUDA Graph Replay
Host pipeline composition (chains of up to 1000 kernels) was profiled in sync, async, and CUDA graph replay execution, showing that the cost is amortized once pipelining is leveraged, with CUDA graph replay achieving near-optimal dispatch limits. Asynchronous execution naturally overlaps host work (e.g., I/O, pre-processing), enabling a single host thread to saturate device streams and minimizing CPU footprint compared to thread-per-stream models.
End-to-End LLM Inference: Grout Engine
cuTile Rust serves as the foundation for Grout, a Qwen3 LLM batch-1 inference engine. Across RTX 5090 and B200, Grout matches or surpasses the decode throughput and prompt latency of state-of-the-art inference systems like vLLM and SGLang. Notably, Qwen3-4B decode achieves 154.7 tok/s on the RTX 5090 (74.7% of HBM roofline), and Qwen3-32B reaches 80.1 tok/s on the B200 (66.7% of HBM roofline). Grout fuses key operations and relies on cuTile Rust’s composable, safe kernel path for the majority of non-GEMM operators, using explicit unsafe only where required by performance constraints.
Data-Race Freedom and Theoretical Properties
The safe tensor API construction and kernel launch protocol provide formal guarantees of data-race freedom under the Tile IR weakly ordered memory model. The mapped partition injectivity and intra-tile token chains, together with Rust’s affine types, structurally preclude race conditions. Data-race causing bugs (e.g., index swaps leading to overlapping writes) cannot be expressed in safe cuTile Rust code.
cuTile Rust crystallizes several advances in tile-based and safe GPU programming. Distinct from efforts like Rust-CUDA, rust-gpu, or cubecl, which focus primarily on enabling Rust codegen or multi-backend support, cuTile Rust targets a high-level kernel programming model enforcing host–device ownership contracts. Compared to existing tile systems (e.g., Triton, ThunderKittens, Pallas), cuTile Rust is unique in statically checking access disjointness and memory safety, rather than relying on post-hoc verification or runtime detection. It contrasts with DSLs like Descend or Mojo in that it leverages Rust’s existing type system, avoiding the need to develop or maintain additional language infrastructure.
Limitations and Future Directions
Tile-based programming abstracts over SIMT-level control, thus precluding some low-level warp- or shared-memory-specialized kernels. While GEMM and a significant class of tensor computations map cleanly, complete elimination of raw pointer escape hatches for all relevant patterns requires further API generalization. Quantifying the impact of async execution on CPU utilization and power for real-world workloads (especially in edge or server environments) remains open. Extending partitioning principles for cross-device, multi-GPU, and collective operation safety offers a promising avenue for future research.
Conclusion
cuTile Rust demonstrates that it is possible to achieve data-race-free, zero-cost, safe GPU kernel programming in Rust for a wide range of tensor computations by mapping the affine ownership model onto a tile-based abstraction. Systematic composition of safe tensor and partition APIs results in performance that is within the noise floor of equivalent unsafe or vendor-library code for practical workloads in modern deep learning, as evidenced by both microbenchmarks and LLM inference evaluation. The work conclusively shows that static safety guarantees, often presumed infeasible in high-performance device-side programming, can be realized without a compromise in expressiveness or efficiency for a critical subset of GPU workloads.