cuTile Rust: Safe & Efficient GPU Kernels

Updated 17 June 2026
  • cuTile Rust is a tile-based system that extends Rust's safety model to GPU kernels by enforcing static ownership and aliasing rules across the CPU/GPU boundary.
  • It employs tile partitioning, token ordering, and type-driven kernel signatures to maintain Rust’s core invariants without incurring performance overhead.
  • cuTile Rust enables composable synchronous, asynchronous, and CUDA graph-based kernel execution, matching the performance of unsafe Rust and coming within a few percent of vendor libraries.

cuTile Rust is a tile-based system for authoring safe, idiomatic GPU kernels in the Rust programming language. It extends Rust's core ownership and aliasing invariants—originally developed to guarantee data-race-free, zero-cost concurrency on the CPU—onto custom GPU kernels. Through a combination of tile partitioning, token ordering, and type-driven kernel signatures, cuTile Rust allows the full static guarantees of Rust (including the &T and &mut T contract) to apply across the CPU/GPU boundary without imposing performance overhead. It supports composable host execution, synchronous and asynchronous pipelines, and CUDA graph replay, offering high throughput and competitive latency for workloads such as LLM inference (Elibol et al., 14 Jun 2026).

1. Motivation and Safety Challenges in GPU Kernel Programming

Standard Rust eliminates data races on the CPU via its ownership discipline—enforced statically by the type checker and the concept of "aliasing XOR mutability". On the CPU, no two threads may simultaneously acquire mutable access to the same object. Applying these safety guarantees to custom GPU kernels written in Rust introduces unique challenges:

  • Host-device kernel launches normally force kernel parameters to collapse down to raw pointers, erasing any &T/&mut T distinction. As such, the responsibility for aliasing is informally shifted onto the programmer.
  • GPU device-side code, typically written in "unsafe Rust," has no means to express or statically enforce that distinct thread blocks operate on disjoint regions of memory.
  • Heavily used tensor patterns—elementwise operations, GEMM, and reductions—require elaborate manual bounds checks and partitioning schemes in unsafe code, undermining Rust's promise of safety and clarity.

The primary motivation for cuTile Rust is to restore Rust's safety discipline across device launches, bringing statically checked, race-free kernel development to high-performance GPU contexts.
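
The first challenge above can be illustrated on the CPU alone. The sketch below is not cuTile Rust code; the function names `scale_into`, `scale_into_raw`, and `scale_in_place_raw` are hypothetical. It contrasts a signature that keeps the &mut T / &T distinction with a raw-pointer signature of the kind a kernel launch typically ends up with, where the compiler can no longer see whether the arguments overlap.

```rust
/// Safe signature: `out` is exclusive and `x` is shared, so the borrow checker
/// rejects any call in which both arguments refer to the same buffer.
fn scale_into(out: &mut [f32], x: &[f32], factor: f32) {
    for (o, xi) in out.iter_mut().zip(x) {
        *o = xi * factor;
    }
}

/// Raw-pointer signature, similar in spirit to what a kernel launch passes.
/// Nothing in the types says whether `out` and `x` overlap.
///
/// # Safety
/// Both pointers must be valid for `n` consecutive `f32` values.
unsafe fn scale_into_raw(out: *mut f32, x: *const f32, n: usize, factor: f32) {
    for i in 0..n {
        unsafe {
            // Read, then write, through raw pointers; overlap is not checked.
            let v = *x.add(i);
            *out.add(i) = v * factor;
        }
    }
}

/// Calls the raw version with the same buffer as input and output.
/// The compiler accepts this, whereas the safe call
/// `scale_into(&mut buf, &buf, 2.0)` would be rejected at compile time.
fn scale_in_place_raw(buf: &mut [f32], factor: f32) {
    let n = buf.len();
    let p = buf.as_mut_ptr();
    // SAFETY: `p` is valid for `n` elements, and raw pointers may alias.
    unsafe { scale_into_raw(p, p as *const f32, n, factor) }
}
```

In this particular case the aliased call happens to produce the intended result, but correctness now rests on the programmer's reasoning rather than on the type checker.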

2. Tile-Based Ownership and Kernel Interface

cuTile Rust addresses these safety challenges via a tile-based partitioning model and a type-level grammar for kernel signatures:

  • Partitioning Mutable Outputs: Before a device launch, any mutable tensor (e.g., z: Tensor<T, S>) must be explicitly partitioned on the host:
    let z = api::zeros::<f32>([1024]).partition([128]);
    This transforms z into a Partition<Tensor<f32>> tiled in 128-element chunks (1024 / 128 = 8 tiles), mapping each thread block to an exclusive 128-element slice. At launch, the partition is borrowed by or moved to the GPU, with each CUDA block assigned ownership of a disjoint sub-tensor.
  • Enforced Entry Point Grammar: Kernel signatures accept:
    • &mut Tensor<T, S> for exclusive writes
    • &Tensor<T, S> for shared reads
    • MappedPartitionMut<T, S, M> for complex one-to-many mappings
    • Scalars or raw pointers (*mut T) only in unsafe fn
  • Tile-to-Block Mapping: For a tensor shape and a tiling vector $B = (B_0, \ldots, B_{d-1})$, the number of blocks per dimension is

$$G_i = \frac{\text{tensor\_shape}_i}{B_i}$$

with each CUDA block responsible for a tile of size $B$. Tile IR manages shared memory, vectorization, and thread cooperation internally.

  • Token-Ordered Mutability: Loads and stores through &mut Tensor are linked by token chains in the intermediate representation to ensure sequential consistency within each tile, preserving Rust's semantics. Immutable reads may be reordered for bandwidth.
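
The tile-to-block mapping above amounts to a per-dimension division. The helper below is a minimal sketch of that arithmetic; `grid_dims` is a hypothetical name, not a cuTile Rust function. It assumes every dimension is an exact multiple of its tile extent, because the formula as stated uses plain division.

```rust
/// Number of blocks per dimension, G_i = shape_i / B_i.
/// Returns `None` if the ranks differ, a tile extent is zero, or a dimension
/// is not an exact multiple of its tile extent (assumption of this sketch).
fn grid_dims(shape: &[usize], tile: &[usize]) -> Option<Vec<usize>> {
    if shape.len() != tile.len() {
        return None;
    }
    shape
        .iter()
        .zip(tile)
        .map(|(&s, &b)| if b != 0 && s % b == 0 { Some(s / b) } else { None })
        .collect()
}
```

For the partition example above, `grid_dims(&[1024], &[128])` yields a single dimension of 8 blocks.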

A kernel's generated launch wrapper reconstructs these safe, disjoint tensor views in each thread block, permitting safe, idiomatic programming without dynamic bounds checks. Local opt-out via unchecked_accesses and raw pointers remains available for specialized, low-level workloads (Elibol et al., 14 Jun 2026).
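
A CPU-only analogue may help convey the idea of disjoint per-block views. The sketch below uses only the Rust standard library, with scoped threads standing in for thread blocks; `add_tiled` is a hypothetical name and this is not how cuTile Rust is implemented. Each thread receives one exclusive tile of the output, and the compiler verifies that the tiles do not overlap.

```rust
use std::thread;

/// Computes `z[i] = x[i] + y[i]`, giving each "block" (here, a scoped CPU
/// thread) exclusive ownership of one `tile`-element chunk of `z`.
fn add_tiled(z: &mut [f32], x: &[f32], y: &[f32], tile: usize) {
    assert!(tile > 0 && z.len() % tile == 0, "sketch assumes exact tiling");
    assert!(x.len() == z.len() && y.len() == z.len());
    thread::scope(|s| {
        // `chunks_exact_mut` yields non-overlapping `&mut [f32]` tiles, so the
        // borrow checker proves that no two threads write the same element.
        for (block, z_tile) in z.chunks_exact_mut(tile).enumerate() {
            let start = block * tile;
            let x_tile = &x[start..start + tile]; // shared, read-only view
            let y_tile = &y[start..start + tile]; // shared, read-only view
            s.spawn(move || {
                for ((zi, xi), yi) in z_tile.iter_mut().zip(x_tile).zip(y_tile) {
                    *zi = xi + yi;
                }
            });
        }
    });
}
```

The inputs are shared by reference across all threads while the output is split into exclusive pieces, which mirrors the &Tensor / &mut Tensor roles in the kernel signature grammar.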

3. Host Execution Model and Composability

cuTile Rust enables uniform and composable kernel launches on the host, abstracting over synchronous, asynchronous, and graph-based execution through its DeviceOp trait:

  • DeviceOp Abstraction: Each launch is wrapped as a lazy computation implementing DeviceOp, with methods:
    • .sync() for blocking synchronous execution,
    • .await for async execution (integrates with the Rust async ecosystem),
    • .graph() for capturing the operation into a CUDA graph for later replay.
  • Typed Launch, Preparation, and Recovery: Proc macros auto-generate launch-wrapper types (e.g., AddLaunch for a kernel add). The launch executes in several steps:

    1. .prepare() each argument, disabling host access,
    2. JIT-compile or retrieve device code (Tile IR → cubin),
    3. Launch kernel with typed arguments,
    4. Recover each partitioned tensor, guaranteeing by construction that the host borrow checker prevents aliasing until kernel recovery.
  • Composable Pipelines and CUDA Graph Capture: cuTile Rust operations admit functional composition for complex workflows and offer seamless CUDA graph capture:

    let op = kernel::add(z, x, y)
               .then(|(z,_,_)| kernel::scale(z, factor))
               .then(|z| kernel::relu(z));
    let (_z, _x, _y) = op.graph()?;
    op.replay()?;
    CUDA graph scoping allows batch capture and replay of multiple kernels, enabling low-latency, high-throughput execution on real hardware (Elibol et al., 14 Jun 2026).
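
The lazy, composable style described in this section can be modeled on the CPU. The following is a sketch under stated assumptions, not the cuTile Rust API: the trait name `DeviceOp` and the methods `sync` and `then` come from the text, but their signatures here are guesses, and `Lazy`, `Then`, `Buf`, and the kernel stand-ins `add`, `scale`, and `relu` are hypothetical. The async and CUDA graph paths are not modeled.

```rust
/// Hypothetical stand-in for a lazy device operation; not the cuTile Rust trait.
trait DeviceOp: Sized {
    type Output;

    /// Executes the operation. Nothing runs before this is called.
    fn run(self) -> Self::Output;

    /// Blocking execution, loosely mirroring `.sync()`.
    fn sync(self) -> Self::Output {
        self.run()
    }

    /// Chains a follow-up operation that consumes this operation's output.
    fn then<F, Next>(self, f: F) -> Then<Self, F>
    where
        F: FnOnce(Self::Output) -> Next,
        Next: DeviceOp,
    {
        Then { first: self, f }
    }
}

/// A leaf operation wrapping a closure that is evaluated on demand.
struct Lazy<F>(F);

impl<T, F: FnOnce() -> T> DeviceOp for Lazy<F> {
    type Output = T;
    fn run(self) -> T {
        (self.0)()
    }
}

/// Two operations composed in sequence.
struct Then<A, F> {
    first: A,
    f: F,
}

impl<A, F, Next> DeviceOp for Then<A, F>
where
    A: DeviceOp,
    F: FnOnce(A::Output) -> Next,
    Next: DeviceOp,
{
    type Output = Next::Output;
    fn run(self) -> Next::Output {
        (self.f)(self.first.run()).run()
    }
}

type Buf = Vec<f32>;

/// CPU stand-in for an elementwise add kernel; takes and returns ownership.
fn add(mut z: Buf, x: Buf, y: Buf) -> impl DeviceOp<Output = (Buf, Buf, Buf)> {
    Lazy(move || {
        for ((zi, xi), yi) in z.iter_mut().zip(&x).zip(&y) {
            *zi = xi + yi;
        }
        (z, x, y) // buffers are handed back to the caller after the "launch"
    })
}

/// CPU stand-in for a scaling kernel.
fn scale(mut z: Buf, factor: f32) -> impl DeviceOp<Output = Buf> {
    Lazy(move || {
        z.iter_mut().for_each(|v| *v *= factor);
        z
    })
}

/// CPU stand-in for a ReLU kernel.
fn relu(mut z: Buf) -> impl DeviceOp<Output = Buf> {
    Lazy(move || {
        z.iter_mut().for_each(|v| *v = (*v).max(0.0));
        z
    })
}
```

With these definitions, `add(z, x, y).then(|(z, _, _)| scale(z, 2.0)).then(|z| relu(z)).sync()` builds the whole pipeline first and runs it only when `sync` is called. Because each stage takes its buffers by value, the caller cannot touch them while the pipeline owns them, which is the property the prepare and recover steps are described as providing.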

4. Performance Characterization

cuTile Rust is evaluated on high-end NVIDIA B200 GPUs. Its safe abstractions show no measurable overhead relative to unsafe Rust and come within a few percent of hand-tuned baselines:

Bandwidth-Bound Elementwise Operations

  • Problem: $N = 2^{28}$, tile size 128
  • Results:

| Variant | Throughput (TB/s) | Hardware Peak (TB/s) |
|---------------|-------------------|----------------------|
| Safe Rust | 7.02 | 7.68 |
| Unsafe Rust | 7.02 | 7.68 |
| cuTile Python | 7.01 | 7.68 |

The safe API yields no measurable performance loss compared to the unsafe variant.
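
As a worked check on the figures reported above, the fraction of hardware peak reached by the Rust variants and by cuTile Python is, respectively:

```latex
\frac{7.02\ \text{TB/s}}{7.68\ \text{TB/s}} \approx 0.914
\qquad
\frac{7.01\ \text{TB/s}}{7.68\ \text{TB/s}} \approx 0.913
```

All three variants therefore sit at roughly 91% of the stated peak bandwidth.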

Compute-Bound GEMM

  • Configuration: f16, square matrix $N = 8192$
  • Results:

| Variant | Performance (PFlop/s) | Fraction of cuBLAS (%) |
|------------------|-----------------------|------------------------|
| Safe Rust cuTile | 2.07 | 96.4 |
| Unsafe Rust | 2.08 | 96.7 |
| cuTile Python | 2.04 | 94.9 |
| cuBLAS Baseline | 2.15 | 100 |

  • Implication: The safe interface preserves both bandwidth and computational throughput within a few percent of hardware-optimized libraries (Elibol et al., 14 Jun 2026).
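
To put the GEMM rates in terms of time, the sketch below converts a sustained rate into an implied duration for one multiply. It assumes the common $2N^3$ operation count for a square GEMM, which the source does not state explicitly; the function names `gemm_flops` and `gemm_time_ms` are hypothetical.

```rust
/// Floating-point operations for an N x N x N matrix multiply under the
/// common 2 * N^3 convention (one multiply and one add per inner-product term).
fn gemm_flops(n: u64) -> u64 {
    2 * n * n * n
}

/// Implied time in milliseconds for one multiply at a sustained rate given
/// in PFlop/s (1 PFlop/s = 1e15 floating-point operations per second).
fn gemm_time_ms(n: u64, pflops: f64) -> f64 {
    gemm_flops(n) as f64 / (pflops * 1e15) * 1e3
}
```

Under that assumption, $N = 8192$ gives $2^{40} \approx 1.1 \times 10^{12}$ operations, so 2.07 PFlop/s corresponds to roughly 0.53 ms per multiply, compared with roughly 0.51 ms at the cuBLAS rate of 2.15 PFlop/s.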

5. End-to-End Application: Grout Inference Engine

Grout exemplifies the practical utility of cuTile Rust in an end-to-end LLM inference scenario:

  • Workflow: Supports fused-norm, QK-attention, and KV-cache writes; utilizes cuBLAS for large GEMMs. Combines safe kernels with unsafe sections where necessary for performance.
  • Throughput:
    • Qwen3-4B on RTX 5090: 171 tokens/s (roofline: 229 tokens/s)
    • Qwen3-32B on B200: 82 tokens/s (roofline: 123 tokens/s)
  • Significance: Grout's batch-1 decode throughput reaches 66–75% of the theoretical memory-bandwidth-limited roofline, outperforming vLLM and SGLang by 5–10% in long-prompt regimes.
  • Prefill Latency: For prompt lengths $p \leq 512$, Grout achieves a prefill latency of 200 ms at $p = 512$, which is 10–20% lower than vLLM and SGLang.

These results confirm the viability of cuTile Rust for production-grade, high-throughput inference workloads (Elibol et al., 14 Jun 2026).
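
The roofline percentages follow directly from the throughput figures listed above. The sketch below reproduces that arithmetic and adds a simple bandwidth-bound model for batch-1 decode. The model (every generated token streams a fixed number of bytes from device memory) is an assumption used for illustration and may differ from how the source derives its roofline; both function names are hypothetical.

```rust
/// Fraction of a roofline bound achieved by a measured rate.
fn roofline_fraction(measured: f64, roofline: f64) -> f64 {
    measured / roofline
}

/// Upper bound on batch-1 decode tokens/s if every token must stream
/// `bytes_per_token` from device memory at `bandwidth_bytes_per_s`.
/// Simplified model; ignores compute, caches, and kernel launch overhead.
fn decode_roofline_tokens_per_s(bandwidth_bytes_per_s: f64, bytes_per_token: f64) -> f64 {
    bandwidth_bytes_per_s / bytes_per_token
}
```

Applied to the reported numbers, `roofline_fraction(171.0, 229.0)` is about 0.747 for Qwen3-4B on the RTX 5090 and `roofline_fraction(82.0, 123.0)` is about 0.667 for Qwen3-32B on the B200.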

6. Synthesis, Limitations, and Future Directions

cuTile Rust demonstrates that Rust's ownership and aliasing guarantees can be enforced in custom GPU kernels at zero runtime cost by partitioning tensors into disjoint tiles at launch and maintaining token-ordered mutability within thread blocks. The approach:

  • Ensures data-race freedom and sequential semantics through static typing and tile partitioning.
  • Supports synchronous, asynchronous, and CUDA graph execution styles via a uniform API.
  • Matches the performance of unsafe Rust and comes within a few percent of mature vendor libraries (96.4% of cuBLAS on the GEMM benchmark).
  • Enables practical, high-throughput applications (as evidenced by end-to-end inference evaluations).

Planned advances include extending safe API coverage to more fused kernels, developing a unified safe SPIM-SIMT model for fine-grained intrinsics, supporting cross-GPU or multi-machine execution within the same ownership discipline, and incorporating more powerful async scheduling for heterogeneous I/O and compute (Elibol et al., 14 Jun 2026).

A plausible implication is that cuTile Rust paves the way for widespread, safe adoption of Rust in heterogeneous, memory-unsafe programming contexts such as GPU computing, reconciling strong concurrency safety with the performance demands of modern machine learning and scientific workloads.
