
CuTe Layout Representation and Algebra

Published 2 Mar 2026 in cs.MS and cs.PL | (2603.02298v1)

Abstract: Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.

Authors (1)

Summary

  • The paper presents a novel formal specification and algebra for hierarchical tensor layouts, ensuring static correctness and optimized memory operations.
  • It introduces hierarchical representations and algebraic operations, such as composition and inversion, to streamline tensor transformations.
  • The framework reduces code complexity and supports diverse instruction-specific tensor operations, validated by real-world CUTLASS deployments.

CuTe Layout Representation and Algebra: Formal Specification and Algebraic Manipulation of Tensor Layouts

Motivation and Context

Modern high-performance computing and deep learning workloads increasingly depend on specialized GPU hardware instructions, notably Tensor Cores and hierarchical memory copy operations, that prescribe complex, fixed data layouts. This architectural evolution imposes stringent requirements on data layout correctness throughout the entire software pipeline. Conventional flat-shape and flat-stride representations (typified by BLAS, numpy.ndarray, torch.tensor, and std::mdspan) are insufficient to systematically and statically handle these requirements as architectures demand ever more sophisticated layout propagation and transformation.

CuTe provides a formal mathematical specification for tensor layouts and their manipulation, introducing both (1) hierarchical layout representation and (2) an algebra of layout operations—enabling concise expression, static analysis, and manipulation of any layout relevant for modern tensor instruction sets and corresponding thread arrangements.

Hierarchical Layout Representation

CuTe generalizes traditional flat shapes (Tuples of positive integers for tensor extents) to hierarchical tuples (HTuples), allowing layouts to represent nested and non-trivially grouped modes. This generalization arises naturally from tensor contraction folding and canonical loop transformation: any tensor computation (including contraction, GEMM, batched-GEMM, and convolution) can be folded into a canonical form where coordinate mapping is bijective between 1D and ND index spaces.

Layouts are then defined as the composition of a shape S (an HTuple of extents) with a congruent stride D (an HTuple of integers or integer-semimodule elements), giving L = D ∘ S. The layout function maps any compatible coordinate (integral, natural, or hierarchical) to a memory offset or coordinate.
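For the flat case, this mapping can be sketched in a few lines of Python; the function name is illustrative, not CuTe's actual C++ API. The offset of a coordinate is simply its inner product with the strides.

```python
# Sketch of a flat layout: shape and stride tuples define a map from
# an N-D coordinate to a linear memory offset via an inner product.
# This mirrors the L = D . S idea for the non-hierarchical case.

def crd2offset(coord, stride):
    """Offset of a coordinate: sum_i coord[i] * stride[i]."""
    return sum(c * d for c, d in zip(coord, stride))

# A 4x8 column-major layout: shape (4, 8), stride (1, 4).
shape, stride = (4, 8), (1, 4)
assert crd2offset((2, 3), stride) == 2 * 1 + 3 * 4  # == 14
```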

(Figure 1)

Figure 1: Hierarchical folding of a 2 × 2 × 2 tensor into matrix forms, contrasting the limits of flat and hierarchical layout representations.

Generic algorithms written in terms of logical coordinates (matrix/vector indices) abstract away access patterns; loop transformations and partitioning operations are entirely reduced to operations on layout representations.

Coordinate Sets, Strides, and Algebraic Structure

The HTuple formalism admits multiple compatible coordinate sets per shape; coordinate mappings (idx2crd, crd2idx) are formally specified via colexicographical bijections. Strides generalize to arbitrary integer-semimodules, enabling layouts that represent not only memory offsets but also exotic thread arrangements or swizzle patterns (bank-conflict avoidance, SIMD/SIMT mapping).
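A minimal sketch of these bijections for a flat shape, assuming colexicographical order (first mode varies fastest). idx2crd and crd2idx are CuTe's operation names, but the bodies here are an illustrative Python rendering, not the library's implementation.

```python
# Colexicographical bijection between 1-D indices and N-D coordinates
# for a flat shape: the first mode varies fastest.

def idx2crd(idx, shape):
    crd = []
    for s in shape:
        crd.append(idx % s)
        idx //= s
    return tuple(crd)

def crd2idx(crd, shape):
    idx, scale = 0, 1
    for c, s in zip(crd, shape):
        idx += c * scale
        scale *= s
    return idx

shape = (4, 8)
# Round trip over the whole 1-D domain of 32 indices.
assert all(crd2idx(idx2crd(k, shape), shape) == k for k in range(32))
assert idx2crd(6, shape) == (2, 1)  # k = i + 4*j with (i, j) = (2, 1)
```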

Linear-algebraic properties emerge: layouts admit matrix-vector interpretations, including binary F₂ layouts pertinent to Bit Permute Complement (BPC) and Bit Matrix Multiply Complement (BMMC) GPU memory access transformations [Edelman:IndexTransforms, Cormen:FastPermuting, Bouverot:AffineIndex, Tillet:LinearLayouts].

Tensor Abstraction, Slicing, and Static Analysis

By binding a layout to an accessor (a random-access pointer or logical iterator), CuTe defines the tensor abstraction. Slicing (partial evaluation) is formalized as producing subtensors with partially applied coordinates and a modified offset/accessor state. Slicing at arbitrary modes is supported, but ranged slicing is intentionally excluded: such patterns must instead go through explicit partitioning and reshaping transformations, which preserves static modular composition, compile-time verifiability, and the performance of generic algorithms.
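A hedged sketch of partial evaluation on a flat layout (slice_layout is a hypothetical helper, not CuTe's interface): fixing one coordinate folds its contribution into a base offset and leaves a lower-rank layout over the remaining modes.

```python
# Slicing as partial evaluation: fix some coordinates of a flat
# shape/stride layout, producing a sub-layout plus a base offset.

def slice_layout(shape, stride, fixed):
    """fixed: dict mapping mode index -> coordinate value. Returns
    (sub_shape, sub_stride, base_offset) for the resulting subtensor."""
    base = sum(stride[i] * v for i, v in fixed.items())
    keep = [i for i in range(len(shape)) if i not in fixed]
    return (tuple(shape[i] for i in keep),
            tuple(stride[i] for i in keep),
            base)

# Fix column j = 5 of a 6x12 column-major tensor (stride (1, 6)):
sub_shape, sub_stride, base = slice_layout((6, 12), (1, 6), {1: 5})
assert (sub_shape, sub_stride, base) == ((6,), (1,), 30)
```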

(Figure 2)

Figure 2: Slicing of a 6 × 12 tensor along various logical boundaries using CuTe layouts.

Generic Algorithm Implementation and Applications

CuTe enables generic reference implementations for both COPY and GEMM (matrix multiplication) that apply across all layouts congruent in size, rank, and requisite logical constraints. In practice, this decouples kernels from particular layout implementations and directly supports all variants required for BLAS, BLIS, and advanced tensor contractions (GETT, CONV), including partitioning for instruction-specific layouts.
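The decoupling can be illustrated with a layout-generic COPY sketch in Python (illustrative only; CuTe's actual implementation is C++ templates over tensors). The algorithm walks logical coordinates; the source and destination layouts supply the offsets, so any two layouts of equal logical shape work, regardless of stride order.

```python
from itertools import product

def crd2offset(coord, stride):
    return sum(c * d for c, d in zip(coord, stride))

def generic_copy(src, src_stride, dst, dst_stride, shape):
    """Copy every element addressed by its logical coordinate; the
    memory access pattern is determined entirely by the strides."""
    for crd in product(*(range(s) for s in shape)):
        dst[crd2offset(crd, dst_stride)] = src[crd2offset(crd, src_stride)]

# "Transpose" a 2x3 row-major matrix into column-major storage purely
# by handing the same generic COPY two different stride tuples.
src = [0, 1, 2, 3, 4, 5]   # shape (2, 3), stride (3, 1): row-major
dst = [None] * 6           # shape (2, 3), stride (1, 2): column-major
generic_copy(src, (3, 1), dst, (1, 2), (2, 3))
assert dst == [0, 3, 1, 4, 2, 5]
```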

The empirical evidence is strong: about 3,000 lines of CuTe subsume the roughly 300 manually implemented layouts (some 55,000 lines) of legacy CUTLASS v2, while also representing layouts the legacy code could not. This is corroborated by production deployment as the foundational layer of CUTLASS v3/v4 and the CuTe DSL, covering multiple generations of Tensor Core and copy instructions.

Layout Algebra: Concatenation, Coalescence, Composition, Inverse, Complement

CuTe introduces a rich algebra over layouts, all statically verifiable and composable:

  • Concatenation: Direct sum of layout modes, functional addition of separate layouts.
  • Coalesce: Reduction to minimal-rank (flattened) layout, preserving integral mapping.
  • Composition: Group functional composition of layouts, implementing reshaping, permuting, tiling, and partitioning. Associativity and compatibility conditions are formally defined.
  • Inverse (Left, Right, Full): Extracts coordinate-mapping from offsets; useful for vectorization, partitioning, and instruction admissibility analysis.
  • Complement: Exhaustive enumeration of domain elements not represented in a layout; enables logical divide and product operations.
  • Logical Divide/Product: Systematic partitioning of layouts—for tiling, grid extraction, and block decomposition—without manual error-prone implementation.
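One of these operations, coalesce, can be sketched for the flat case (a simplified illustration, not CuTe's implementation): adjacent modes merge when the second mode's stride equals the first mode's shape times its stride, so the flattened layout computes identical offsets.

```python
# Sketch of coalesce on a flat layout: merge a mode into its
# predecessor whenever the two together address a contiguous range.

def coalesce(shape, stride):
    out_s, out_d = [shape[0]], [stride[0]]
    for s, d in zip(shape[1:], stride[1:]):
        if d == out_s[-1] * out_d[-1]:   # contiguous with previous mode
            out_s[-1] *= s
        else:
            out_s.append(s)
            out_d.append(d)
    return tuple(out_s), tuple(out_d)

# (4, 8):(1, 4) is really just a contiguous 32-element vector ...
assert coalesce((4, 8), (1, 4)) == ((32,), (1,))
# ... but (4, 8):(1, 8) leaves a gap after each column and cannot merge.
assert coalesce((4, 8), (1, 8)) == ((4, 8), (1, 8))
```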

(Figure 3)

Figure 3: Thread-value partitioning layout for Ampere Tensor Core, showcasing static separation of layout metadata from runtime assignment.

(Figure 4)

Figure 4: Visualization of blocked and raked products for systematic tile/grid decomposition of tensors.

Implications and Future Directions

CuTe formally bridges tensor-centric software and hardware abstraction, promoting:

  • Strong static correctness guarantees: Layout propagation, transformation, and verification are performed algebraically, avoiding runtime errors due to misaligned or invalid layouts.
  • High software velocity: Drastic reduction in code size and maintenance for layout implementations; adaptation across generations of hardware instructions is realized via compositional algebra.
  • Algorithmic uniformity and orthogonality: Algorithms remain invariant under layout permutation and partitioning patterns, supporting both generic reasoning and instruction-specific adaptations.
  • Zero performance overhead: Empirical results in CUTLASS and state-of-the-art models (FlashAttention generations) demonstrate no penalty versus hand-tuned kernels [Dao:FlashAttention, Dao:FlashAttention2, Dao:FlashAttention3].

(Figure 5)

Figure 5: TMEM load-store instruction offset mapping via CuTe layout representations; full versatility for static analysis of hardware-specific addressing.

The formalization paves the way for broader compiler and DSL development (CuTe DSL, Graphene IR), verified tensor language compilers [Liu:ATL2], and systematic AI model acceleration pipelines. Future work will further integrate layout algebra within type-safe, formally verified environments and extend to new architectural paradigms as hardware evolves.

Conclusion

CuTe advances the formalization, representation, and manipulation of tensor layouts, providing hierarchical, compositional, and algebraic frameworks suitable for the full span of tensor-centric computing on modern GPUs. Its adoption has resulted in substantial improvements in correctness, extensibility, and development efficiency, with immediate applicability and strong empirical support in production and research systems. As architectures continue to evolve, CuTe's approach ensures that layout management remains robust, generic, and statically verifiable.


Explain it Like I'm 14

Overview

This paper introduces CuTe (CUDA Tensors), a new, math-based way to describe how multi‑dimensional data (tensors) are stored and used on modern GPUs. Today’s GPUs have special “tensor core” instructions that work fastest when data is arranged in very specific patterns. If your data isn’t laid out exactly right, your program can be slow or even wrong. CuTe gives programmers a clear system to describe these complex data layouts and the steps (like tiling and reshaping) needed to feed them into fast GPU instructions.

What is the paper trying to do?

The paper focuses on two simple ideas:

  • Make a powerful, precise way to describe tensor layouts, including nested (hierarchical) shapes, not just flat rows and columns.
  • Create an “algebra” (a toolkit of operations) to transform layouts safely and predictably. These operations include things like combining dimensions, splitting them into tiles, reordering them, and checking that everything still lines up.

Put simply: CuTe helps you say “what my data looks like” and “how I need to rearrange it” so the GPU can do its work quickly and correctly.

How does it work? (Methods and analogies)

Think of data layouts like building addresses:

  • A flat address might be “House 12,” which is easy but limited.
  • A hierarchical address is like “Building A, Floor 3, Room 12.” It’s more detailed and matches how modern GPUs expect data to be organized: layers within layers.

CuTe formalizes this with:

  • Hierarchical “shapes”: sizes of your data arranged inside other sizes (like a matrix of tiles, each tile holding mini-blocks).
  • “Strides”: the step size to jump in memory when you move along one part of the shape. A stride is like how many rooms you walk past when you move from one floor number to the next.
  • “Coordinates”: ways to refer to data locations. You can use a single number (like a seat number), or multi‑part coordinates (row, column), or even nested coordinates (tile, row‑in‑tile, column‑in‑tile). CuTe defines clean rules to convert between these coordinate types.

CuTe also defines a layout as a function:

  • First, coordinates map to a “natural” multi‑dimensional position.
  • Then, that position maps to a memory offset using strides.
  • Layout = “where” × “how far to step” — so you get the exact memory address for any element.
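A tiny worked version of this idea, using the building analogy from above (the numbers are made up for illustration):

```python
# Toy "address" example: a hierarchical coordinate (floor, room) maps
# to one flat number via a stride, just like a layout maps (row, col)
# to a memory offset.
rooms_per_floor = 10

def room_number(floor, room):
    return floor * rooms_per_floor + room   # the "stride" of floor is 10

assert room_number(0, 7) == 7    # ground floor, room 7
assert room_number(2, 3) == 23   # two floors up: step past 2 * 10 rooms
```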

To help programmers, CuTe adds a layout algebra, which is a set of safe building blocks:

  • Concatenation: connect dimensions end-to-end (like stacking rows).
  • Coalescence: collapse nested pieces into a flat one if possible (like flattening tiles into a single big grid).
  • Composition: chain layout transforms (like applying two map steps in a row).
  • Division and tiling: slice a big dimension into smaller tiles (like cutting a pizza into slices).
  • Inversion: undo a transform (go back from tiles to the original view).
  • Complementation: form the “other part” of a split (if you take one slice, this defines the rest).

Two more important ideas make CuTe practical:

  • Canonical loops: rewrite messy loops (with weird start/stop/step values) into clean, standard loops. This makes programs easier to read and transform.
  • Tensor folding: any complicated tensor operation can be reshaped into a standard, widely-optimized operation called batched GEMM (many small matrix multiplies at once). Folding is like grouping certain dimensions together to match what the hardware likes, without actually copying data.
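Folding without copying can be shown in a few lines of Python (an illustrative sketch, not CuTe code): the same eight stored values of a 2×2×2 tensor can be read either through a cube view or through a folded 2×4 matrix view, just by grouping the last two modes.

```python
# "Folding" without copying: one buffer, two views. The 2x2x2 view
# uses strides (1, 2, 4); grouping modes j and k into one mode of
# size 4 with stride 2 gives a 2x4 matrix view of the same memory.
data = list(range(8))

def at3(i, j, k):            # 2x2x2 cube view
    return data[i * 1 + j * 2 + k * 4]

def at2(i, jk):              # folded 2x4 matrix view
    return data[i * 1 + jk * 2]

# Element (i, j, k) of the cube equals element (i, j + 2*k) of the matrix.
assert all(at3(i, j, k) == at2(i, j + 2 * k)
           for i in range(2) for j in range(2) for k in range(2))
```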

What did they find, and why is it important?

The authors show that CuTe:

  • Makes writing high-performance GPU code simpler and less error-prone. You don’t mix “how to access memory” with “what the math does.” Instead, you keep the math clean and let the layout describe access patterns.
  • Lets compilers and tools check (at compile-time) that your data matches what the hardware expects, catching problems early.
  • Expresses common GPU patterns — tiling, partitioning, thread arrangements — clearly and compactly.
  • Generalizes well: the same code can handle many different shapes and layouts because the layout algebra does the reshaping for you.

This isn’t just theory. CuTe is already used in real systems:

  • It’s the foundation of NVIDIA’s CUTLASS library (a popular GPU math library).
  • It has powered fast implementations in cutting-edge ML systems like FlashAttention.
  • It’s used in various compilers and tools, such as CuTe DSL and Graphene.

What’s the impact?

Modern GPUs demand specific data layouts for best speed. As models and datasets get bigger and more complicated, hand‑crafting these layouts becomes painful and fragile. CuTe provides:

  • A common language to describe and transform layouts safely.
  • A way to write generic, reusable algorithms that still hit peak performance.
  • Strong guardrails for correctness: mismatches and mistakes are caught early.
  • A foundation that can adapt to future GPU features (new tensor instructions, deeper memory hierarchies).

In short, CuTe helps bridge the gap between elegant math (tensors and matrices) and the real-world details of high-speed GPU memory, so developers can build fast, reliable software for AI and scientific computing without getting lost in indexing headaches.

Knowledge Gaps

Below is a consolidated list of knowledge gaps, limitations, and open questions that remain unresolved by the paper. Each item is phrased to be concrete and actionable for future research.

  • Formalize the “layout algebra” operators (concatenation, coalescence, composition, complementation, division, tiling, inversion): define their exact semantics, closure properties, associativity/commutativity/idempotence, invertibility conditions, and interaction laws, with constructive algorithms and complexity bounds.
  • Specify and justify the use of non-integer stride codomains (integer-semimodules such as Q or F2): when is inner_product meaningfully mapped to memory addresses; what correctness guarantees exist; and provide worked examples for non-integer or boolean strides.
  • Provide static analyses or type-level constraints that guarantee bounds-preserving transformations under composition; currently out-of-bounds coordinates are defined without a mechanism to prevent OOB accesses in generated code.
  • Develop a hardware-aware cost model connecting CuTe layouts to GPU performance (coalescing, bank conflicts, register pressure, occupancy), and automatic scheduling/tiling strategies derived from the layout algebra.
  • Present empirical evaluation across multiple architectures (Volta, Turing, Ampere, Hopper, Blackwell) and workloads (GEMM, CONV, general tensor contractions), including comparisons to BLAS, Triton Linear Layouts, and ThunderKittens, and quantify developer effort/error reduction.
  • Describe an algorithmic pipeline for folding arbitrary einsum/tensor contractions to batched-GEMM using CuTe: criteria for grouping modes (multi-modes), handling broadcasting/dilation/padding, and deciding optimality/uniqueness.
  • Clarify semantics and constraints for negative strides, zero-sized dimensions, and overlapping/aliased views; define which cases are allowed and the resulting behavior of inner_product and layout composition.
  • Generalize coordinate bijections beyond colexicographical order (e.g., lexicographical, Morton/Z-order, Hilbert curves): expressibility within CuTe, correctness laws, and performance implications.
  • Formalize thread layout representation: how warps, blocks, and grid hierarchies are modeled as layouts; constraints to avoid divergence and ensure coalesced accesses; and mapping to hardware-prescribed instructions.
  • Address dynamic shapes and runtime variability: support for extent polymorphism, runtime shape refinement, and integration with JIT (CuTe DSL) while preserving static verification guarantees.
  • Extend the framework to sparsity: representation of block-sparse and CSR/COO layouts, operators for composing dense and sparse layouts, and correctness/performance trade-offs.
  • Define robust interoperability with NumPy, PyTorch, and std::mdspan: zero-copy conversions to hierarchical layouts, aliasing safety rules, and round-trip fidelity for complex folds.
  • Incorporate element size, padding, and alignment into the formalism: how byte sizes and alignment constraints interact with strides, compatibility, and congruence.
  • Revisit the integer-semimodule assumptions: specify how subtraction/negative offsets are supported (do some operations require an additive identity or group structure?), particularly for inversion and reversal.
  • Establish decidability and complexity of verifying layout equivalence/compatibility and of composing/inverting deep hierarchical layouts; provide practical algorithms with worst-case bounds.
  • Prove that composing transform layouts with data layouts preserves the semantics of the original loop nests; state precise conditions for semantics-preserving transformations (e.g., reduction modes, associativity/commutativity assumptions).
  • Propose debugging and tooling: visualization of hierarchical shapes/strides, traceability from runtime faults back to algebraic expressions, and automated tests to catch mis-specification.
  • Explore multi-GPU/distributed layouts: how to represent device-spanning layouts, communication/computation partitioning, and integration with NCCL/collectives while maintaining correctness/performance.
  • Analyze numerical consequences of loop/layout transformations on floating-point reproducibility and accumulation order; define policies or constraints to control non-associative effects.
  • Demonstrate extensibility beyond GEMM/CONV to modern primitives (attention, FFT, complex reductions): identify missing algebraic primitives and provide canonical layout formulations and examples.

Practical Applications

Immediate Applications

Below are practical, deployable applications that leverage CuTe’s hierarchical layout representation and layout algebra today, based on its production use (e.g., CUTLASS v3, CuTe DSL, Graphene) and the paper’s demonstrated capabilities.

  • Peak-performance GEMM/tensor-contraction kernels with compile-time layout verification (software, AI/ML, HPC)
    • Use CuTe to fold arbitrary tensor contractions into canonical batched-GEMM and implement them with CUTLASS-backed kernels; statically verify architecturally prescribed layouts (e.g., Tensor Cores).
    • Tools/workflows: CUTLASS v3 (CuTe-based), CuTe DSL, Graphene compiler; integrate into inference/training stacks or HPC kernels (e.g., chemistry, CFD).
    • Assumptions/dependencies: CUDA toolchain; NVIDIA GPUs with Tensor Cores (Volta→Blackwell); developer familiarity with CuTe concepts and CUDA C++.
  • Rapid development of custom CUDA kernels with safe indexing and separation of concerns (software engineering, AI/ML)
    • Replace ad-hoc pointer arithmetic with CuTe layouts to reduce bugs, ease maintenance, and parameterize kernels across shapes/layouts without code duplication.
    • Tools/workflows: adopt CuTe layout algebra for tiling/permutation; unit tests use CuTe’s equivalence between 1D and ND coordinates.
    • Assumptions/dependencies: C++ template metaprogramming (or CuTe DSL for Python), CUTLASS adoption.
  • Hardware-prescribed tensor-core instruction integration (mma/wgmma) and multi-dimensional copy pipelines (AI/ML, HPC)
    • Encode complex, fixed input/output layouts required by Tensor Cores and rank-5 copy instructions; propagate layouts end-to-end to ensure correctness and throughput.
    • Tools/workflows: CUTLASS GEMM, cp.async and Hopper/Blackwell bulk copy orchestration via CuTe layouts.
    • Assumptions/dependencies: Availability of Hopper/Blackwell features; correct alignment and shape constraints.
  • Memory movement and coalescing optimization for multi-dim data (systems, GPU runtime)
    • Use hierarchical shapes/strides to express shared/global-memory transfers that match hardware patterns; exploit coalescence and bank-conflict avoidance informed by layout algebra.
    • Tools/workflows: CuTe composition and coalescence operations; static inspection of stride patterns.
    • Assumptions/dependencies: Knowledge of memory hierarchy; adherence to architectural alignment and transaction sizes.
  • Generic tiling and partitioning for high-throughput kernels (AI/ML, HPC)
    • Apply CuTe’s composition/division/tiling operations to express CTA→warp→thread tiling and thread–data mapping in a single, verifiable abstraction.
    • Tools/workflows: layout composition pipelines in CUTLASS; autotuning over tile shapes with fixed algorithm logic.
    • Assumptions/dependencies: Target-dependent tile choices; register/shared-memory budgets.
  • Implementation and optimization of state-of-the-art attention kernels (AI/ML)
    • Extend or customize FlashAttention-style kernels by expressing data/thread arrangements with CuTe; maintain correctness while exploring new tiling and fusion patterns.
    • Tools/workflows: CUTLASS v3, CuTe DSL; integrate in PyTorch custom ops via CUDA extensions.
    • Assumptions/dependencies: Proper numerical/stability handling remains algorithmic; GPU residency and memory caps.
  • Compiler backends and codegen passes that understand layouts (compiler tooling, AI/ML)
    • Integrate CuTe’s algebra into compilers (e.g., Graphene) to derive index expressions, generate complex copy/compute schedules, and statically verify layout constraints.
    • Tools/workflows: MLIR/ISL-inspired passes can consume CuTe-like IR; Graphene uses CuTe already.
    • Assumptions/dependencies: IR integration effort; consistent shape metadata across passes.
  • Portable high-performance primitives across folded layouts (HPC, scientific computing, robotics, graphics)
    • Provide a single kernel that supports row-major, col-major, batched, and complex folded views (e.g., 3D convolutions as batched-GEMM) without copies.
    • Tools/workflows: employ hierarchical shapes to admit multiple coordinate systems; coalesce where possible.
    • Assumptions/dependencies: When coalescing is not possible, performance depends on hardware instruction patterns and strides.
  • Static analysis for correctness and performance properties (software QA, safety)
    • Use CuTe to detect out-of-bounds, stride mismatches, mismatched folding, and non-coalesced patterns at compile time.
    • Tools/workflows: compile-time reasoning via template types and layout equivalence; property checks in unit tests.
    • Assumptions/dependencies: Sufficiently static shapes/strides for compile-time evaluation; integration into CI.
  • Accelerated application domains via drop-in library use (healthcare imaging, energy/geoscience, finance risk)
    • Adopt CuTe-backed CUTLASS kernels for MRI/CT reconstruction, seismic imaging, portfolio optimization—benefiting from verified layouts and optimal tensor-core usage.
    • Tools/workflows: link against CUTLASS; replace bespoke GEMM code paths with CuTe-powered primitives.
    • Assumptions/dependencies: Data shapes compatible with folding into batched-GEMM; conversion costs if upstream format differs.
  • Education and pedagogy in tensor computations and loop transformations (academia)
    • Teach folding of contractions to batched-GEMM, index transforms, and tiling through CuTe’s shape/stride abstractions; illustrate canonical loop nests.
    • Tools/workflows: classroom demos with CuTe DSL; visualization of idx2crd/crd2idx mappings.
    • Assumptions/dependencies: Students need basic CUDA and linear algebra background.
  • Interoperation with std::mdspan and existing tensor libraries (C++ ecosystem)
    • Map flat-shape/stride views to hierarchical CuTe layouts (and vice versa) to incrementally modernize codebases and retain legacy interfaces.
    • Tools/workflows: adapters from mdspan/NumPy/PyTorch strides to CuTe HTuples; coalescing where possible.
    • Assumptions/dependencies: Some folded shapes may not admit flat strides; require hierarchical handling in kernels.

Long-Term Applications

These opportunities build on CuTe’s specification and early deployments but need further research, standardization, or ecosystem maturation.

  • Cross-vendor and cross-accelerator layout standardization (ecosystem, policy, software)
    • Generalize CuTe concepts to AMD/Intel GPUs, CPUs, NPUs, and SYCL/HIP backends; inform MLIR/LLVM dialects for portable tensor layouts and transformations.
    • Potential outputs: an open layout-IR standard; vendor-agnostic compiler passes.
    • Dependencies: community consensus; mapping to non-NVIDIA instruction sets and memory hierarchies.
  • Automated kernel synthesis and verified code generation (compilers, formal methods)
    • Combine CuTe’s algebra with auto-tuners and SMT/ISL-based provers to automatically generate layout-correct, performance-optimized kernels with formal guarantees.
    • Potential outputs: proof-carrying kernels; verified scheduling/tiling libraries.
    • Dependencies: solver integration; tractable specifications for complex kernels.
  • Runtime-adaptive layout selection and transformation (systems, MLOps)
    • Dynamically choose/compose layouts based on batch size, sparsity, or hardware availability; compile or specialize kernels on the fly via CuTe DSL.
    • Potential outputs: adaptive inference servers; AOT/JIT pipelines that encapsulate layout choices.
    • Dependencies: low-overhead JIT, caching, and shape-polymorphic interfaces.
  • Co-design of future tensor instructions with software-friendly layout algebra (hardware-software co-design)
    • Use CuTe to inform instruction-set design (e.g., copy/mma layouts) so software can easily compose and verify pipelines from day one.
    • Potential outputs: ISA proposals accompanied by formal layout specs and testable reference layouts.
    • Dependencies: collaboration with hardware vendors; early access to microarchitectural constraints.
  • Unified modeling of addresses, threads, and memory spaces via richer semimodules (systems research)
    • Exploit integer-semimodules to encode composite codomains (e.g., address + lane + memory-space) for holistic mapping of threads to data across hierarchies.
    • Potential outputs: single-spec mappings for CTA/warp/thread and global/shared/register spaces.
    • Dependencies: language support, compiler/runtime metadata; developer tooling.
  • Sparse and irregular tensor support (AI/ML, HPC)
    • Extend hierarchical shapes/strides to capture block-sparse and irregular layouts; fold sparse contractions into canonical primitives with verifiable indexing.
    • Potential outputs: generalized GETT for sparse tensors; hybrid sparse-dense kernels.
    • Dependencies: sparse format taxonomy; cost models for coalescence and gather/scatter.
  • Distributed and topology-aware layouts for multi-GPU/cluster execution (HPC, cloud)
    • Compose intra-GPU and inter-GPU/node layouts to align with network topologies (NVLink, InfiniBand) and reduce communication/stragglers.
    • Potential outputs: collective-friendly tiling; partitioning that maps to process grids automatically.
    • Dependencies: runtime integration (NCCL/MPI), topology discovery, scheduling support.
  • Database and analytics acceleration via layout transforms (data systems)
    • Apply layout algebra to re-shape columnar/row-major data for GPU operators (joins, aggregations), minimizing copies and ensuring coalesced access.
    • Potential outputs: GPU query engines with verifiable memory access plans.
    • Dependencies: tight coupling with storage formats (Arrow, Parquet) and ETL pipelines.
  • Safety and certification for compute in regulated domains (autonomy, healthcare)
    • Use compile-time layout verification to reduce indexing errors and certify memory safety properties in safety-critical kernels.
    • Potential outputs: toolchains producing auditable artifacts for kernel memory access.
    • Dependencies: standards engagement (e.g., ISO 26262), domain audits, reproducible builds.
  • Curriculum and workforce development in high-performance tensor programming (academia, industry training)
    • Build courses, labs, and certifications around CuTe-based loop transformations, folding, and layout reasoning for next-gen GPU programmers.
    • Potential outputs: open courseware, reference kernels, interactive visual tools.
    • Dependencies: teaching materials, industry partnerships, open-source exemplars.

Glossary

  • Ampere: NVIDIA GPU architecture generation optimized for tensor operations. "This capability expanded in Turing~\cite{NVIDIA:Turing} and Ampere~\cite{NVIDIA:Ampere} with specialized instructions for structured matrix movement within the GPU memory hierarchy."
  • Batch modes: Indices that appear in all tensors of a contraction and represent independent batches processed together. "Batch modes, $\ell$: Appear in $A$, $B$, and $C$."
  • batched-GEMM: A matrix multiplication performed across a batch dimension, enabling many small GEMMs to be computed efficiently. "any tensor contraction can be folded into a canonical {\tt batched-GEMM} and evaluated with a trivial reference implementation composed of four nested loops:"
  • Blackwell: NVIDIA GPU architecture generation advancing tensor-oriented features. "The Hopper~\cite{NVIDIA:Hopper} and Blackwell~\cite{NVIDIA:Blackwell} architectures further advance this paradigm"
  • BLAS: Basic Linear Algebra Subprograms; a standard library of optimized linear algebra routines. "BLAS provides efficient and portable implementations of core linear algebra operations"
  • BLIS: BLAS-like Library Instantiation Software; a framework that generalizes and optimizes BLAS routines. "The BLAS-like Library Instantiation Software (BLIS) framework~\cite{VanZee:BLIS} extends GEMM by supporting non-unit strides in both row and column modes simultaneously,"
  • Category theory: A branch of mathematics using objects and morphisms to study structures and transformations. "Colfax Research~\cite{Colfax:CategoryCuTe} analyzes CuTe layouts and some operations on them in the context of category theory."
  • Coalesced representation: A flattened layout view where hierarchical structure is merged into a contiguous form. "the flat representation is called the {\em coalesced} version of the CuTe representation,"
  • Colexicographical bijection: A one-to-one mapping between coordinate systems ordered by the last index varying fastest. "This transformation is the colexicographical bijection, \verb|(i,j) = (k%4,k/4)|, between 2D coordinates \verb|(i,j)| and 1D coordinates \verb|k|."
  • Colexicographical ordering: An ordering where the last coordinate is the primary comparator, then preceding ones recursively. "In this work, we choose the colexicographical ordering, <<, of coordinates defined by:"
  • Compatibility: A partial order on shapes ensuring they have the same total size and aligned hierarchical structure. "Compatibility, \preceq, is a partial order on the set of shapes."
  • Composition (layout): Functional combination of layout transforms to produce new indexing behavior. "See Section~\ref{sec:composition} for details on layout composition and application to generic partitioning."
  • Concatenation (layout operation): Combining layouts or dimensions end-to-end within the CuTe algebra. "including concatenation, coalescence, composition, complementation, division, tiling, and inversion"
  • CONV: Convolution operation, here treated as a tensor contraction compatible with folded layouts. "including any matrix-multiplication ({\tt GEMM}), tensor contraction ({\tt GETT}), and convolution ({\tt CONV}),"
  • Coordinate set: The set of valid indices for a shape, possibly as Cartesian products of ranges. "A coordinate set is a set $Z_N = \{0, 1, 2, \ldots, N-1\}$ of non-negative integers or a Cartesian product of coordinate sets, $Z_N \times Z_M = Z_{(N,M)}$."
  • CUTLASS: NVIDIA’s CUDA Templates for Linear Algebra Subroutines and Solvers; a GPU library leveraging CuTe. "CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL."
  • CuTe: A mathematical specification and set of abstractions for tensor layouts and operations. "We present CuTe, a novel mathematical specification for representing and manipulating tensors."
  • CuTe DSL: A Python-based domain-specific language for compiling CUDA linear algebra software using CuTe concepts. "forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL."
  • Division (layout operation): Splitting or factoring layout dimensions in the CuTe algebra to alter indexing structure. "including concatenation, coalescence, composition, complementation, division, tiling, and inversion"
  • Einstein notation: Convention where repeated indices imply summation without explicit sigma notation. "Summation is implied over repeated indices that appear only on a single side of an equation (Einstein notation),"
  • einsum: Interfaces in NumPy and PyTorch implementing Einstein summation for concise tensor contractions. "Contractions of this form are expressed compactly in the {\tt numpy.einsum} and {\tt torch.einsum} interfaces, for instance."
  • FlashAttention: An optimized attention algorithm for LLMs with memory-efficient kernels. "including FlashAttention and each of its evolving generations~\cite{Dao:FlashAttention, Dao:FlashAttention2, Dao:FlashAttention3},"
  • Folding (tensor folding): Grouping tensor modes to reinterpret data shape without copying. "This is referred to as tensor {\em folding}."
  • GEMM: General Matrix Multiply; the core matrix multiplication routine widely used and optimized. "Among BLAS primitives, the GEneral Matrix Multiply (GEMM) routine is easily the most optimized and widely used operation"
  • GETT: Generalized tensor contraction primitive analogous to GEMM for tensors. "including any matrix-multiplication ({\tt GEMM}), tensor contraction ({\tt GETT}), and convolution ({\tt CONV}),"
  • Graphene tensor compiler: A compiler framework where CuTe represents tensor operations. "CuTe has been used within the Graphene tensor compiler~\cite{Hagedorn:Graphene}, where it plays a critical role in representing tensor operations."
  • HeLayers: A library for homomorphic encryption layers influencing dense tensor layout generalizations. "Independent generalizations of dense tensor representations include HeLayers~\cite{Aharoni:HeLayers}, ThunderKittens~\cite{Spector:ThunderKittens}, and the Linear Layouts~\cite{Tillet:LinearLayouts} approach used in OpenAI's Triton compiler~\cite{Tillet:TritonAI}."
  • Hopper: NVIDIA GPU architecture introducing advanced tensor copy and core capabilities. "The Hopper~\cite{NVIDIA:Hopper} and Blackwell~\cite{NVIDIA:Blackwell} architectures further advance this paradigm, introducing copy instructions for efficiently transferring rank-5 tensors between global and shared memory and further expanding tensor core capabilities."
  • HTuple: Hierarchical tuple datatype allowing nested tuples used to represent shapes and strides. "An $HTuple(\mathcal{T})$ is either an element of set $\mathcal{T}$ or a $Tuple(HTuple(\mathcal{T}))$."
  • Integer-semimodule: Algebraic structure with associative addition and integer scalar multiplication used for stride elements. "An integer-semimodule is a set $M$ equipped with an associative addition, $M + M \to M$, and a scalar multiplication, $Z \cdot M \to M$."
  • Inversion (layout operation): Reversing a layout transformation within CuTe’s algebra to recover prior indexing. "including concatenation, coalescence, composition, complementation, division, tiling, and inversion"
  • ISL (integer set relations): The Integer Set Library context for analyzing layout relations and transformations. "analyzes CuTe and Linear Layouts~\cite{Tillet:LinearLayouts} in the context of integer set relations (ISL),"
  • Layout algebra: The set of operations over layouts enabling manipulation, derivation, and analysis of tensor views. "a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion --"
  • Layout function: The mapping from coordinates to a stride codomain defined by composing shape and stride. "A layout $L = D \circ S$ is the functional composition of a shape $S$ and a stride $D$, where $S \sim D$."
  • Linear Layouts: A representation of tensor layouts based on $\mathbb{F}_2$ linear algebra used in the Triton compiler. "Linear Layouts are based on $\mathbb{F}_2$ linear algebra and provide a more general representation of tensor layouts as well as an avenue for layout analysis and generation."
  • Matricization: Restructuring tensors into matrices to leverage BLAS routines for contractions. "Conventional approaches for computing general tensor contractions rely on matricization, which involves logically or explicitly restructuring tensor data to perform computations"
  • mdspan: A C++ facility for multidimensional array views with shape and stride metadata. "and in C++, {\tt std::mdspan}"
  • Multi-indices: Combined indices used in tensor notation to fold modes for canonical forms. "a key insight motivating CuTe is the use of multi-indices in tensor notation to enable the transformation of arbitrary tensor contractions into a canonical batched-GEMM primitive."
  • Multi-mode: A grouped set of tensor modes treated as a single dimension after folding. "where each mode may be a single mode or a group of modes, which we call a {\em multi-mode}."
  • Non-integral strides: Stride values not restricted to integers, enabling more general layout mappings. "CuTe supports these representations and strictly expands on them with generalizations to hierarchical shapes and strides to represent more complex layouts, non-integral strides, and non-integral layout codomains."
  • Semi-affine striding: A striding pattern where offsets follow affine relationships but may be non-uniform across coordinates. "A key observation is that the $4 \times 20$ matrix can also be interpreted as an $80$-element vector with non-uniform, semi-affine striding,"
  • Shape:Stride representation: The paired description of tensor shape with corresponding stride defining a layout. "Because there is a one-to-one correspondence between the $\text{Shape}:\text{Stride}$ information and the loop nest itself,"
  • Stream-K algorithm: A GPU kernel scheduling strategy used in high-performance GEMM implementations. "the C++ implementation of CuTe has been used in implementations of the Stream-K algorithm~\cite{Osama:StreamK}"
  • Strided-batched GEMM: A BLAS extension allowing batches of GEMMs with constant strides between matrices. "The strided-batched GEMM extension to BLAS further generalizes the primitive and allows its application to even more tensor contractions,"
  • Stride: The mapping weights (possibly hierarchical) that convert coordinates into offsets in a layout's codomain. "A stride $D$ for a shape $S$ is an $HTuple(\mathcal{D})$ that is congruent with the shape, $S \sim D$."
  • Tensor Cores: Specialized hardware units for fast small-matrix multiplications. "NVIDIA's Volta architecture~\cite{NVIDIA:Volta} introduced Tensor Cores, enabling efficient small-matrix multiplications directly in hardware."
  • Tensor contraction: The summation over shared indices across tensors to produce a result tensor. "An instance of a tensor contraction is"
  • Tensor instructions: Hardware-level operations optimized for tensor data movement and computation. "Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions,"
  • ThunderKittens: A library providing bespoke GPU layout and access-pattern types oriented to hardware requirements. "Thunderkittens implements a wide variety of bespoke types for register memory, shared memory, row/column-major tiles, row/column-major tiles of row/column-major subtiles, and prescribed access patterns for warps and threads."
  • Tiling: Partitioning data or loops into blocks to improve locality and performance. "enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions."
  • Triton compiler: A GPU programming system where Linear Layouts are used to represent tensor layouts. "the Linear Layouts~\cite{Tillet:LinearLayouts} approach used in OpenAI's Triton compiler~\cite{Tillet:TritonAI}."
  • Turing: NVIDIA GPU architecture generation with expanded tensor instruction support. "This capability expanded in Turing~\cite{NVIDIA:Turing} and Ampere~\cite{NVIDIA:Ampere} with specialized instructions"
  • Volta: NVIDIA GPU architecture generation that introduced Tensor Cores. "NVIDIA's Volta architecture~\cite{NVIDIA:Volta} introduced Tensor Cores,"
  • Warps: Groups of GPU threads that execute in lockstep on the hardware. "prescribed access patterns for warps and threads."
  • Weak congruence: A partial-order relation indicating one HTuple’s profile is at least as refined as another. "Weak Congruence, \lesssim, is a partial order on HTuples."
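The glossary's layout-function, Shape:Stride, and colexicographical-bijection entries fit together: a flat layout maps a 1-D index to an offset by first converting it to a coordinate (first mode varying fastest) and then taking the inner product with the strides. A minimal Python sketch of this idea (illustrative only, not the CuTe/CUTLASS API; function names are our own):

```python
def coordinate(k, shape):
    """Colexicographical bijection: 1-D index k -> coordinate tuple.
    The first mode varies fastest, e.g. for shape (4, 5): k -> (k % 4, k // 4)."""
    coord = []
    for s in shape:
        coord.append(k % s)
        k //= s
    return tuple(coord)

def layout(k, shape, stride):
    """Apply a flat Shape:Stride layout: inner product of coordinate and stride."""
    return sum(c * d for c, d in zip(coordinate(k, shape), stride))

# Column-major 4 x 5 matrix: shape (4, 5), stride (1, 4) -> identity mapping.
assert layout(7, (4, 5), (1, 4)) == 7
# Row-major 4 x 5 matrix: shape (4, 5), stride (5, 1);
# k = 7 -> coordinate (3, 1) -> offset 3*5 + 1*1 = 16.
assert layout(7, (4, 5), (5, 1)) == 16
```

The same inner-product rule extends to CuTe's hierarchical shapes and strides by recursing into nested tuples.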
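The batch-mode, folding, and batched-GEMM entries can likewise be made concrete: a contraction whose batch mode appears in all three tensors folds into the four-nested-loop reference form the paper mentions. A hedged sketch checked against `numpy.einsum` (the shapes and index names here are our own example, not taken from the paper):

```python
import numpy as np

# Contraction C[m, n, l] = sum_k A[m, k, l] * B[k, n, l],
# where l is a batch mode appearing in A, B, and C.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))   # modes (m, k, l)
B = rng.standard_normal((4, 2, 5))   # modes (k, n, l)

# Reference result via Einstein-notation interface.
C_ref = np.einsum("mkl,knl->mnl", A, B)

# Folded batched-GEMM form: four nested loops over (l, m, n, k).
C = np.zeros((3, 2, 5))
for l in range(5):
    for m in range(3):
        for n in range(2):
            for k in range(4):
                C[m, n, l] += A[m, k, l] * B[k, n, l]

assert np.allclose(C, C_ref)
```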

Open Problems

We found no open problems mentioned in this paper.

