Create a Video View Paper

Fearless Concurrency on the GPU

This presentation explores how cuTile Rust brings Rust's compile-time safety guarantees to GPU kernel programming by mapping ownership and borrowing semantics onto a tile-based execution model. We examine how the system achieves data-race-free GPU kernels without performance overhead, demonstrated through benchmarks showing performance within noise of unsafe baselines and real-world LLM inference engines matching state-of-the-art throughput.

Script

Writing fast GPU code today means writing unsafe code. Every custom kernel, every memory access, every thread synchronization is a potential data race that the compiler cannot catch. cuTile Rust changes that by extending Rust's ownership system all the way to GPU kernels, making data-race-free device code not just possible but natural.

The core insight is tile-based partitioning with provable ownership. The system divides output tensors into disjoint tiles and assigns each tile exclusive access to one kernel instance, while immutable inputs are safely broadcast as shared references. Rust's borrow checker enforces these partitions statically, so aliasing violations are caught at compile time, not runtime.

Safety crosses the host to device boundary through a type-checked launch protocol. Tensor ownership is transferred at kernel launch and recovered after synchronization, with the borrow checker verifying correctness. On the device, token-based memory ordering enforces sequential happens-before relations within each tile, compositionally guaranteeing data-race freedom under the GPU memory model.

The safety guarantees come at zero performance cost. On the NVIDIA B200, safe cuTile Rust achieves 2.07 petaflops for matrix multiplication, which is 96.4% of cuBLAS and indistinguishable from raw pointer baselines. Memory-bound operations saturate the 7 terabyte per second DRAM limit regardless of whether the code is safe or unsafe.

Grout, a production Large Language Model inference engine built on cuTile Rust, demonstrates real-world viability. Running Qwen3 models, it matches vLLM and SGLang throughput while keeping the majority of kernels in safe code. On the RTX 5090, Qwen3 4B decodes at 154.7 tokens per second, reaching 74.7% of the memory bandwidth ceiling, with explicit unsafe required only for highly specialized fused kernels.

This work proves that compile-time safety and zero-cost performance are not mutually exclusive in GPU programming. By mapping ownership semantics onto tile abstractions, a significant class of high-performance kernels can be written without sacrificing either correctness or speed. To explore how safe systems unlock fearless concurrency across more domains, visit EmergentMind.com and create your own video.