
CuTe-DSL: Hierarchical Tensor Layout Language

Updated 8 March 2026
  • CuTe-DSL is a domain-specific language that provides an algebraic framework for describing and verifying hierarchical tensor layouts using mathematical and categorical foundations.
  • It offers a Python-based API enabling static analysis and optimized code generation for CUDA/CUTLASS kernels, facilitating precise tensor operations in HPC and deep learning.
  • The DSL streamlines tensor computations by enforcing memory safety through algebraic composition and static checks, accelerating development and adaptation to new hardware.

CuTe-DSL is a domain-specific language and Python-based API designed for the concise and rigorous description, manipulation, and code generation of hierarchical data layouts in high-performance tensor computations, especially those targeting GPU tensor instructions and CUDA/CUTLASS-based kernels. CuTe-DSL is grounded in the mathematical and categorical foundations of CuTe layouts, providing a high-level algebra for reasoning about and generating correct, high-performance implementations for matrix operations, tensor contractions, and memory movement in deep learning and HPC workloads (Cecka, 2 Mar 2026, Carlisle et al., 9 Jan 2026).

1. Hierarchical Layout Models and Algebraic Foundations

At the core of CuTe-DSL is a strict extension of the standard shape–stride tensor model to hierarchical (potentially deeply nested) shapes and strides. A shape $S$ is a possibly nested tuple of positive integers, $S \in \mathrm{HTuple}(\mathbb{Z}^+)$, with rank $r$ and total size $\#S = \prod_{i=0}^{r-1} S_i$. A stride $D$ is a congruent, potentially nested tuple, such that $S \sim D$. The layout $L = D \circ S$ is a function from in-bounds coordinates $Z(S)$ to offsets in memory, $L(c) = \sum_{i=0}^{r-1} c_i D_i$, generalizing the traditional $L(i) = D \cdot \mathrm{idx2crd}(i)$ for flat tensors.
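
The definition above can be made concrete with a few lines of plain Python. This is a minimal sketch and not the CuTe-DSL API: the helper names `flatten`, `size`, `idx2crd`, and `crd2offset` are chosen here for illustration, and the index-to-coordinate map assumes the leftmost mode varies fastest.

```python
from math import prod

def flatten(t):
    """Flatten a possibly nested tuple of ints into a flat tuple."""
    if isinstance(t, int):
        return (t,)
    return tuple(x for item in t for x in flatten(item))

def size(shape):
    """Total number of coordinates, #S."""
    return prod(flatten(shape))

def idx2crd(idx, shape):
    """Linear index -> coordinate congruent with `shape` (leftmost mode fastest)."""
    if isinstance(shape, int):
        return idx
    coord = []
    for mode in shape:
        n = size(mode)
        coord.append(idx2crd(idx % n, mode))
        idx //= n
    return tuple(coord)

def crd2offset(coord, stride):
    """L(c) = sum_i c_i * D_i, recursing through congruent nested tuples."""
    if isinstance(stride, int):
        return coord * stride
    return sum(crd2offset(c, d) for c, d in zip(coord, stride))

# A nested shape and a congruent nested stride.
shape = ((2, 2), (4, 2))
stride = ((1, 8), (2, 16))
offsets = [crd2offset(idx2crd(i, shape), stride) for i in range(size(shape))]
```

With these definitions, linear index 5 maps to the coordinate `((1, 0), (1, 0))` and hence to offset `1*1 + 1*2 = 3`; the 32 offsets together cover `0..31` exactly once.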

The layout algebra comprises homomorphisms defined on layouts, including:

  • Concatenation: $L = (L_0, \ldots, L_n)$ acts modewise, $L(c_0, \ldots, c_n) = \sum_{i=0}^{n} L_i(c_i)$.
  • Coalescence: Reduces a hierarchical layout to minimal depth while preserving its functional image, ensuring $\mathrm{depth}(R) \leq 1$ and $R(i) = L(i)$ for all valid indices.
  • Composition: Given $R = A \circ B$, $B(c)$ must always yield a valid coordinate for $A$. Composition is associative and admits identities $I_S$.
  • Complement: $L^*$ maps the ordered complement of $\mathrm{image}(L)$ in the codomain, producing the “rest” offsets.
  • Logical Product: $A \otimes B = (A, A^* \circ B)$ forms a blocked/partitioned layout by combining tiling and offset patterns.
  • Logical Divide: $A \oslash B = A \circ (B, B^*_{\#A})$ partitions $A$ into $B$-hits and the residual, ensuring surjectivity onto $Z_{\#A}$.
  • Inversion: Left and right inverses for layouts are defined, with true inverses when bijectivity holds.
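
As an illustration of coalescence, the following sketch merges adjacent modes of a flat layout whenever the second mode continues the first contiguously (previous extent times previous stride equals the next stride). It is a simplified stand-in written for this article, not the CuTe-DSL implementation; the function names and the flat-tuple representation are assumptions.

```python
def offset(idx, shape, stride):
    """Evaluate a flat layout at a linear index (leftmost mode fastest)."""
    total = 0
    for s, d in zip(shape, stride):
        total += (idx % s) * d
        idx //= s
    return total

def coalesce(shape, stride):
    """Merge adjacent modes without changing the index -> offset function."""
    out_shape, out_stride = [], []
    for s, d in zip(shape, stride):
        if s == 1:
            continue  # size-1 modes contribute nothing to the offset
        if out_shape and out_shape[-1] * out_stride[-1] == d:
            out_shape[-1] *= s  # this mode extends the previous one contiguously
        else:
            out_shape.append(s)
            out_stride.append(d)
    if not out_shape:
        return (1,), (0,)
    return tuple(out_shape), tuple(out_stride)

merged = coalesce((2, 4, 1, 8), (1, 2, 7, 8))        # collapses to one mode
unchanged = coalesce((2, 2, 4, 2), (1, 8, 2, 16))    # no adjacent modes merge
```

The first layout collapses to a single mode of extent 64 and stride 1, while the second cannot be simplified because no neighbouring pair of modes is contiguous. In both cases the coalesced layout returns the same offset as the original for every index.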

These operations enable a rich system for construction, composition, partition, fission, and verification of data and thread layouts with static, algebraic correctness guarantees (Cecka, 2 Mar 2026).
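
Complement and logical divide can be checked by brute force on a small flat example. The sketch below is an illustrative model only: it assumes positive strides that nest (each stride divides the next after sorting), represents layouts as flat `(shape, stride)` pairs, and evaluates composition as plain function composition rather than producing a closed-form layout.

```python
def offset(idx, shape, stride):
    """Evaluate a flat layout at a linear index (leftmost mode fastest)."""
    total = 0
    for s, d in zip(shape, stride):
        total += (idx % s) * d
        idx //= s
    return total

def complement(shape, stride, cosize):
    """Layout of the offsets in [0, cosize) that the given layout skips."""
    out_shape, out_stride = [], []
    current = 1
    for d, s in sorted(zip(stride, shape)):
        if d == 0 or d % current != 0:
            raise ValueError("strides must be positive and nest in this sketch")
        if d // current > 1:
            out_shape.append(d // current)
            out_stride.append(current)
        current = s * d
    rest = -(-cosize // current)  # ceiling division
    if rest > 1:
        out_shape.append(rest)
        out_stride.append(current)
    return tuple(out_shape), tuple(out_stride)

A = ((4, 2, 3), (2, 1, 8))            # layout being divided, #A = 24
B = ((4,), (2,))                      # tiler: 4 elements at stride 2
B_rest = complement(*B, 24)           # the "rest" layout B*_{#A}
tiler = (B[0] + B_rest[0], B[1] + B_rest[1])   # concatenation (B, B*)

# A ⊘ B = A ∘ (B, B*), evaluated pointwise as a function on indices.
divided = [offset(offset(i, *tiler), *A) for i in range(24)]
```

Here `B_rest` comes out as `((2, 3), (1, 8))`, the concatenated tiler is a bijection on `0..23`, and `divided` agrees index by index with the flat layout of shape `(2, 2, 2, 3)` and stride `(4, 1, 2, 8)`, whose first two modes enumerate the elements selected by `B`.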

2. Categorical Structure and Theoretical Guarantees

CuTe-DSL operations are grounded in a categorical framework, particularly the categories Tuple and Nest:

  • Tuple: Objects are tuples of positive integers (flat shapes), and morphisms are tractable pointed maps preserving elementwise equality, with constraints ensuring no codomain index is hit more than once.
  • Nest: Extends Tuple to nested tuples, pairing flattenings with a parenthesization profile.

Morphisms correspond to layouts, and layout algebraic operations are directly translated to categorical composition (e.g., $L_{g\circ f} = L_g \circ L_f$), logical division, and product. Non-degenerate flat layouts are in bijection with tuple morphisms of standard form, ensuring correctness, injectivity (no memory aliasing), and compactness (full coverage) by construction.

Category-theoretic identities (associativity, invertibility, distributivity) are enforced, so that layout rewrites and manipulations in CuTe-DSL are always valid provided admissibility constraints are satisfied (Carlisle et al., 9 Jan 2026).
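
The two properties named above, injectivity and compactness, can be tested exhaustively for small flat layouts. The sketch below is a brute-force check written for illustration; the categorical framework establishes these properties structurally rather than by enumeration, and the helper names are assumptions.

```python
from itertools import product

def image(shape, stride):
    """All offsets produced by a flat layout, one per in-bounds coordinate."""
    return [
        sum(c * d for c, d in zip(coord, stride))
        for coord in product(*(range(s) for s in shape))
    ]

def is_injective(shape, stride):
    """True if no two coordinates alias the same memory offset."""
    offs = image(shape, stride)
    return len(set(offs)) == len(offs)

def is_compact(shape, stride):
    """True if the offsets are exactly 0 .. size-1, each hit once."""
    offs = image(shape, stride)
    return sorted(offs) == list(range(len(offs)))

row_major = ((4, 8), (8, 1))    # injective and compact
padded = ((4, 8), (16, 1))      # injective, but leaves gaps between rows
aliased = ((4, 8), (1, 2))      # coordinates (2, 0) and (0, 1) share offset 2
```

Only `row_major` passes both checks; `padded` is injective without being compact, and `aliased` fails injectivity.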

3. CuTe-DSL API: Syntax, Semantics, and Static Analysis

CuTe-DSL is delivered as a compact Python API in which every algebraic layout operation has strict, statically checked semantics:

A = Tensor(p, Layout((M, K), (stride0, stride1)))   # MxK matrix, arbitrary strides

TV = Layout(((4,8),2), ((16,1),8)) # ThreadValue layout for Ampere
A_TV = compose(A, TV)              # Schedules A by thread/block pattern

tiled = zipped_divide(A, (4,8))    # Partition into (tile_coord, grid_coord)

L = Layout(((2,2),(4,2)), ((1,8),(2,16)))
L_min = coalesce(L)                # Flatten to single stride tuple

A_inv = A.layout.right_inverse()
coords = A_inv(0)                  # Retrieve coordinates at given offset

The API is designed so that all operations—composition, division, product, coalescence, inversion—mirror the theoretical algebra exactly. All shape/stride congruence, composition admissibility, tiling/division preconditions, and rank/dimension matching are enforced at compile time. Any violation aborts code generation with informative diagnostics.

Compile-time algebraic reasoning eliminates runtime overhead, and all layout algebra compiles away to direct index computations or hardware intrinsics.
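
One of the checks listed above, shape/stride congruence, is easy to model. The sketch below rejects a mismatched pair when the layout object is constructed; `CheckedLayout` and `congruent` are hypothetical names used only for this illustration, and a plain Python exception stands in for the DSL's compile-time diagnostics.

```python
def congruent(a, b):
    """True if two possibly nested tuples have the same nesting profile."""
    if isinstance(a, int) and isinstance(b, int):
        return True
    if isinstance(a, tuple) and isinstance(b, tuple):
        return len(a) == len(b) and all(congruent(x, y) for x, y in zip(a, b))
    return False

class CheckedLayout:
    """Hypothetical layout wrapper that validates congruence on construction."""

    def __init__(self, shape, stride):
        if not congruent(shape, stride):
            raise ValueError(
                f"shape {shape} and stride {stride} are not congruent"
            )
        self.shape = shape
        self.stride = stride

ok = CheckedLayout(((4, 8), 2), ((16, 1), 8))   # nesting profiles match

try:
    CheckedLayout((4, 8), (1,))                 # rank mismatch
    rejected = False
except ValueError:
    rejected = True
```

Failing early in this way means an ill-formed layout never reaches the stage where index arithmetic would be generated from it.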

4. Integration with High-Performance Computing Kernels

CuTe-DSL’s primary application is the specification and generation of specialized CUDA/CUTLASS kernels for GPU tensor operations. The DSL enables:

  • Separation and composition of thread- and data-layouts by treating both as first-class layouts with algebraic manipulation.
  • Expression of hardware-specific patterns, such as tensor-core tiling, partitioning, and swizzling, by selecting or composing hardware-prescribed layouts.
  • Automatic generation of CUDA/CUTLASS code: The Python front-end statically emits optimized C++ templates or device PTX instructions, embedding only the necessary address arithmetic. Example flow:

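A = Tensor(pA, Layout((M,K), (lda,1)))   # MxK, row-major with leading dimension lda
B = Tensor(pB, Layout((N,K), (ldb,1)))   # NxK, so that C[m,n] = sum_k A[m,k] * B[n,k]
C = Tensor(pC, Layout((M,N), (ldc,1)))   # MxN accumulator
TV = Layout(((4,8),2), ((16,1),8))       # (thread, value) -> offset within a tile
C_TV = compose(C, TV)                    # View of C indexed by (thread_id, val_id)

@kernel
def gemm_kernel(A, B, C_TV):
    # Schematic: m and n denote the C coordinates owned by (thread_id, val_id)
    for k in range(K):
        C_TV[thread_id, val_id] += A[m, k] * B[n, k]

gemm_kernel.compile(block=(4,2,1), grid=(M//4, N//8, 1))  # Emitted as CUTLASS GEMM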

  • Compile-time static verification, including tiler compatibility, stride/shape congruence, and algorithmic preconditions, to guarantee code correctness and preempt runtime errors.
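
The tiler-compatibility check and the `(tile_coord, grid_coord)` addressing produced by a tiled division can be sketched for a two-dimensional tensor as follows. This is an illustrative model under simple assumptions (flat 2-D shape, tile extents that must divide the tensor extents); `tile_offset` is a hypothetical helper, not a CuTe-DSL function.

```python
def tile_offset(tile_coord, grid_coord, shape, stride, tile):
    """Offset of element `tile_coord` inside tile `grid_coord` of a 2-D tensor."""
    (rows, cols), (tile_rows, tile_cols) = shape, tile
    if rows % tile_rows or cols % tile_cols:
        raise ValueError(f"tile {tile} does not evenly divide shape {shape}")
    (i, j), (gi, gj) = tile_coord, grid_coord
    row = gi * tile_rows + i      # global row index
    col = gj * tile_cols + j      # global column index
    return row * stride[0] + col * stride[1]

shape, stride, tile = (8, 16), (16, 1), (4, 8)   # row-major 8x16, 4x8 tiles

# Enumerate every (tile_coord, grid_coord) pair once.
all_offsets = [
    tile_offset((i, j), (gi, gj), shape, stride, tile)
    for gi in range(shape[0] // tile[0])
    for gj in range(shape[1] // tile[1])
    for i in range(tile[0])
    for j in range(tile[1])
]
```

For instance, element `(1, 2)` of tile `(1, 1)` is global element `(5, 10)` at offset `5*16 + 10 = 90`, and enumerating all tiles visits each of the 128 offsets exactly once. A shape such as `(6, 16)` would be rejected because 4 does not divide 6.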

Kernels emitted by CuTe-DSL are reported to carry no dynamic layout overhead and to match or exceed hand-optimized code generation in performance, an effect attributed to more pervasive algebraic fusion (Cecka, 2 Mar 2026).

5. Case Studies, Adoption, and Practical Impact

The CuTe layout algebra underlying CuTe-DSL is the foundation of NVIDIA CUTLASS v3 and v4, where it reduced the code implementing tensor layouts from ~55 K lines (v2) to ~3 K lines (v3) with no degradation in performance. This transition enables:

  • Rapid support for novel hardware tensor instructions, by defining a single new layout that propagates through all relevant algorithms and kernels.
  • Generic drivers for GEMM, gather, scatter, and attention that adapt to new layout/tile shapes without driver-level change; only the thread-value layout requires extension for new hardware.
  • Facilitation of high-performance primitives (e.g., FlashAttention, IO-aware attention) which can directly use DSL layout specifications for correctness in shared memory access and avoidance of resource conflicts.
  • Substantial developer productivity increase: Studies report 2–10× reduction in development effort relative to hand-tuned PTX, driven by DSL-level abstraction and static verification (Cecka, 2 Mar 2026).

Compile-time algebraic resolution guarantees that only semantically valid layouts propagate to kernel generation, with all arithmetic divisibility and rank invariants checked and fused before any code emission.

6. DSL Structure, Safety, and Future Directions

CuTe-DSL is structured around pure, composable functions on layouts with strong static analysis. The types, operations, and constraints (admissibility, congruence) are enforced at construction or composition time. Formal properties proven in the foundational work—including associativity and invertibility—allow safe DSL-level rewrites and aggressive algebraic fusion.

Compatibility with CUTLASS is guaranteed due to the bijection between categorical morphisms and legal CuTe layouts (Carlisle et al., 9 Jan 2026). This ensures that all emitted code is correct by construction, and no invalid memory layouts are admitted.

Future evolution of CuTe-DSL may incorporate semantic visual editors, enabling “drawn” composition of layout graphs that then generate fully algebraic, statically verified Python/C++/CUDA bindings. This suggests expanding the applicability of the DSL beyond existing high-performance CUDA-focused pipelines toward a general framework for any tensorized architecture.

CuTe-DSL is distinct from other domain-specific languages for low-level layout or assembly programming (e.g., for the tile assembly model in self-assembly simulation (0903.0889)) by virtue of its emphasis on hierarchical, algebraically structured tensor layouts and hardware interaction. While both approaches use compositional operations and exploit internal DSLs in Python, CuTe-DSL is optimized for dense numerical computing, tensor core scheduling, and formal static verification of all layout properties, as opposed to manual or simulated tile assembly. Its categorical grounding and practical deployment in widely used systems (such as CUTLASS) further differentiate it in both theoretical rigor and industrial significance.
