CuTe Layout Representation and Algebra
Abstract: Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.
Explain it Like I'm 14
Overview
This paper introduces CuTe (CUDA Tensors), a new, math-based way to describe how multi‑dimensional data (tensors) are stored and used on modern GPUs. Today’s GPUs have special “tensor core” instructions that work fastest when data is arranged in very specific patterns. If your data isn’t laid out exactly right, your program can be slow or even wrong. CuTe gives programmers a clear system to describe these complex data layouts and the steps (like tiling and reshaping) needed to feed them into fast GPU instructions.
What is the paper trying to do?
The paper focuses on two simple ideas:
- Make a powerful, precise way to describe tensor layouts, including nested (hierarchical) shapes, not just flat rows and columns.
- Create an “algebra” (a toolkit of operations) to transform layouts safely and predictably. These operations include things like combining dimensions, splitting them into tiles, reordering them, and checking that everything still lines up.
Put simply: CuTe helps you say “what my data looks like” and “how I need to rearrange it” so the GPU can do its work quickly and correctly.
How does it work? (Methods and analogies)
Think of data layouts like building addresses:
- A flat address might be “House 12,” which is easy but limited.
- A hierarchical address is like “Building A, Floor 3, Room 12.” It’s more detailed and matches how modern GPUs expect data to be organized: layers within layers.
CuTe formalizes this with:
- Hierarchical “shapes”: sizes of your data arranged inside other sizes (like a matrix of tiles, each tile holding mini-blocks).
- “Strides”: the step size to jump in memory when you move along one part of the shape. A stride is like how many rooms you walk past when you move from one floor number to the next.
- “Coordinates”: ways to refer to data locations. You can use a single number (like a seat number), or multi‑part coordinates (row, column), or even nested coordinates (tile, row‑in‑tile, column‑in‑tile). CuTe defines clean rules to convert between these coordinate types.
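The nested-coordinate idea can be made concrete with a tiny Python sketch. The function name, tile sizes, and ordering below are illustrative assumptions for this explainer, not CuTe's API:

```python
# Toy nested-coordinate conversion: a 1D "seat number" k is split into
# (tile, (row_in_tile, col_in_tile)) for 2x2 tiles covering 8 elements.
# Tile sizes and names are illustrative assumptions, not CuTe's API.
def seat_to_nested(k, rows=2, cols=2):
    tile, within = divmod(k, rows * cols)
    # Within the tile, the row index varies fastest (colexicographic).
    row, col = within % rows, within // rows
    return tile, (row, col)

for k in range(8):
    print(k, "->", seat_to_nested(k))
```

Running this shows how the same element can be addressed by a single number or by a nested coordinate, which is exactly the equivalence CuTe's conversion rules formalize.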
CuTe also defines a layout as a function:
- First, coordinates map to a “natural” multi‑dimensional position.
- Then, that position maps to a memory offset using strides.
- Layout = “where” × “how far to step” — so you get the exact memory address for any element.
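Those two steps (coordinate lookup, then a stride inner product) can be sketched in a few lines of Python. This is a flat, non-hierarchical illustration with an assumed 4x8 column-major shape; CuTe itself generalizes both pieces to nested shapes and strides:

```python
# Minimal flat-layout sketch (illustration only; CuTe generalizes this
# to hierarchical shapes and strides).

def idx2crd(k, shape):
    """Convert a 1D index k to a multi-dim coordinate; colexicographic
    order, so the leftmost mode varies fastest."""
    crd = []
    for s in shape:
        crd.append(k % s)
        k //= s
    return tuple(crd)

def crd2offset(crd, stride):
    """Inner product of coordinate and stride gives the memory offset."""
    return sum(c * d for c, d in zip(crd, stride))

# A 4x8 column-major layout: shape (4, 8), stride (1, 4).
shape, stride = (4, 8), (1, 4)
for k in range(6):
    crd = idx2crd(k, shape)
    print(k, "->", crd, "-> offset", crd2offset(crd, stride))
```

Swapping the stride to `(8, 1)` turns the same shape into a row-major layout without touching the loop or the data, which is the separation of concerns the section describes.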
To help programmers, CuTe adds a layout algebra, which is a set of safe building blocks:
- Concatenation: connect dimensions end-to-end (like stacking rows).
- Coalescence: collapse nested pieces into a flat one if possible (like flattening tiles into a single big grid).
- Composition: chain layout transforms (like applying two map steps in a row).
- Division and tiling: slice a big dimension into smaller tiles (like cutting a pizza into slices).
- Inversion: undo a transform (go back from tiles to the original view).
- Complementation: form the “other part” of a split (if you take one slice, this defines the rest).
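As a toy illustration of what coalescence means, one can check that a nested layout and a flat one visit exactly the same memory offsets. This brute-force enumeration is for intuition only and is not CuTe's actual coalescence algorithm:

```python
def offsets(shape, stride):
    """Enumerate the memory offsets of a flat layout in colex order
    (brute force, for illustration only)."""
    total = 1
    for s in shape:
        total *= s
    out = []
    for k in range(total):
        off, kk = 0, k
        for s, d in zip(shape, stride):
            off += (kk % s) * d
            kk //= s
        out.append(off)
    return out

# (2, 2):(1, 2) visits offsets 0, 1, 2, 3 -- exactly the flat layout
# 4:1, so the two modes coalesce into one.
assert offsets((2, 2), (1, 2)) == offsets((4,), (1,))

# (2, 2):(1, 4) does NOT coalesce: it skips offsets 2 and 3.
assert offsets((2, 2), (1, 4)) == [0, 1, 4, 5]
```

The second case shows why coalescence is only possible "if possible": when strides leave gaps, the nested structure carries real information and cannot be flattened away.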
Two more important ideas make CuTe practical:
- Canonical loops: rewrite messy loops (with weird start/stop/step values) into clean, standard loops. This makes programs easier to read and transform.
- Tensor folding: any complicated tensor operation can be reshaped into a standard, widely-optimized operation called batched GEMM (many small matrix multiplies at once). Folding is like grouping certain dimensions together to match what the hardware likes, without actually copying data.
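The batched-GEMM target that folding aims for is just four nested loops. Here is a reference sketch in Python, where plain nested lists stand in for tensors (real implementations use tiled GPU kernels, but the loop structure is the canonical form the paper describes):

```python
def batched_gemm(A, B, C, L, M, N, K):
    """Reference batched-GEMM: C[l][m][n] += sum over k of
    A[l][m][k] * B[l][k][n]. Nested lists stand in for tensors."""
    for l in range(L):              # batch mode
        for m in range(M):          # output row mode
            for n in range(N):      # output column mode
                for k in range(K):  # contracted (reduction) mode
                    C[l][m][n] += A[l][m][k] * B[l][k][n]

# One batch of 2x2 matrices.
A = [[[1, 2], [3, 4]]]
B = [[[5, 6], [7, 8]]]
C = [[[0, 0], [0, 0]]]
batched_gemm(A, B, C, 1, 2, 2, 2)
print(C)  # [[[19, 22], [43, 50]]]
```

Folding means choosing how to group a complicated contraction's indices into these four roles (l, m, n, k) so that this one loop nest covers it, without copying data.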
What did they find, and why is it important?
The authors show that CuTe:
- Makes writing high-performance GPU code simpler and less error-prone. You don’t mix “how to access memory” with “what the math does.” Instead, you keep the math clean and let the layout describe access patterns.
- Lets compilers and tools check (at compile-time) that your data matches what the hardware expects, catching problems early.
- Expresses common GPU patterns — tiling, partitioning, thread arrangements — clearly and compactly.
- Generalizes well: the same code can handle many different shapes and layouts because the layout algebra does the reshaping for you.
This isn’t just theory. CuTe is already used in real systems:
- It’s the foundation of NVIDIA’s CUTLASS library (a popular GPU math library).
- It has powered fast implementations in cutting-edge ML systems like FlashAttention.
- It’s used in various compilers and tools, such as CuTe DSL and Graphene.
What’s the impact?
Modern GPUs demand specific data layouts for best speed. As models and datasets get bigger and more complicated, hand‑crafting these layouts becomes painful and fragile. CuTe provides:
- A common language to describe and transform layouts safely.
- A way to write generic, reusable algorithms that still hit peak performance.
- Strong guardrails for correctness: mismatches and mistakes are caught early.
- A foundation that can adapt to future GPU features (new tensor instructions, deeper memory hierarchies).
In short, CuTe helps bridge the gap between elegant math (tensors and matrices) and the real-world details of high-speed GPU memory, so developers can build fast, reliable software for AI and scientific computing without getting lost in indexing headaches.
Knowledge Gaps
Below is a consolidated list of knowledge gaps, limitations, and open questions that remain unresolved by the paper. Each item is phrased to be concrete and actionable for future research.
- Formalize the “layout algebra” operators (concatenation, coalescence, composition, complementation, division, tiling, inversion): define their exact semantics, closure properties, associativity/commutativity/idempotence, invertibility conditions, and interaction laws, with constructive algorithms and complexity bounds.
- Specify and justify the use of non-integer stride codomains (integer-semimodules such as Q or F2): when is inner_product meaningfully mapped to memory addresses; what correctness guarantees exist; and provide worked examples for non-integer or boolean strides.
- Provide static analyses or type-level constraints that guarantee bounds-preserving transformations under composition; currently out-of-bounds coordinates are defined without a mechanism to prevent OOB accesses in generated code.
- Develop a hardware-aware cost model connecting CuTe layouts to GPU performance (coalescing, bank conflicts, register pressure, occupancy), and automatic scheduling/tiling strategies derived from the layout algebra.
- Present empirical evaluation across multiple architectures (Volta, Turing, Ampere, Hopper, Blackwell) and workloads (GEMM, CONV, general tensor contractions), including comparisons to BLAS, Triton Linear Layouts, and ThunderKittens, and quantify developer effort/error reduction.
- Describe an algorithmic pipeline for folding arbitrary einsum/tensor contractions to batched-GEMM using CuTe: criteria for grouping modes (multi-modes), handling broadcasting/dilation/padding, and deciding optimality/uniqueness.
- Clarify semantics and constraints for negative strides, zero-sized dimensions, and overlapping/aliased views; define which cases are allowed and the resulting behavior of inner_product and layout composition.
- Generalize coordinate bijections beyond colexicographical order (e.g., lexicographical, Morton/Z-order, Hilbert curves): expressibility within CuTe, correctness laws, and performance implications.
- Formalize thread layout representation: how warps, blocks, and grid hierarchies are modeled as layouts; constraints to avoid divergence and ensure coalesced accesses; and mapping to hardware-prescribed instructions.
- Address dynamic shapes and runtime variability: support for extent polymorphism, runtime shape refinement, and integration with JIT (CuTe DSL) while preserving static verification guarantees.
- Extend the framework to sparsity: representation of block-sparse and CSR/COO layouts, operators for composing dense and sparse layouts, and correctness/performance trade-offs.
- Define robust interoperability with NumPy, PyTorch, and std::mdspan: zero-copy conversions to hierarchical layouts, aliasing safety rules, and round-trip fidelity for complex folds.
- Incorporate element size, padding, and alignment into the formalism: how byte sizes and alignment constraints interact with strides, compatibility, and congruence.
- Revisit the integer-semimodule assumptions: specify how subtraction/negative offsets are supported (do some operations require an additive identity or group structure?), particularly for inversion and reversal.
- Establish decidability and complexity of verifying layout equivalence/compatibility and of composing/inverting deep hierarchical layouts; provide practical algorithms with worst-case bounds.
- Prove that composing transform layouts with data layouts preserves the semantics of the original loop nests; state precise conditions for semantics-preserving transformations (e.g., reduction modes, associativity/commutativity assumptions).
- Propose debugging and tooling: visualization of hierarchical shapes/strides, traceability from runtime faults back to algebraic expressions, and automated tests to catch mis-specification.
- Explore multi-GPU/distributed layouts: how to represent device-spanning layouts, communication/computation partitioning, and integration with NCCL/collectives while maintaining correctness/performance.
- Analyze numerical consequences of loop/layout transformations on floating-point reproducibility and accumulation order; define policies or constraints to control non-associative effects.
- Demonstrate extensibility beyond GEMM/CONV to modern primitives (attention, FFT, complex reductions): identify missing algebraic primitives and provide canonical layout formulations and examples.
Practical Applications
Immediate Applications
Below are practical, deployable applications that leverage CuTe’s hierarchical layout representation and layout algebra today, based on its production use (e.g., CUTLASS v3, CuTe DSL, Graphene) and the paper’s demonstrated capabilities.
- Peak-performance GEMM/tensor-contraction kernels with compile-time layout verification (software, AI/ML, HPC)
- Use CuTe to fold arbitrary tensor contractions into canonical batched-GEMM and implement them with CUTLASS-backed kernels; statically verify architecturally prescribed layouts (e.g., Tensor Cores).
- Tools/workflows: CUTLASS v3 (CuTe-based), CuTe DSL, Graphene compiler; integrate into inference/training stacks or HPC kernels (e.g., chemistry, CFD).
- Assumptions/dependencies: CUDA toolchain; NVIDIA GPUs with Tensor Cores (Volta→Blackwell); developer familiarity with CuTe concepts and CUDA C++.
- Rapid development of custom CUDA kernels with safe indexing and separation of concerns (software engineering, AI/ML)
- Replace ad-hoc pointer arithmetic with CuTe layouts to reduce bugs, ease maintenance, and parameterize kernels across shapes/layouts without code duplication.
- Tools/workflows: adopt CuTe layout algebra for tiling/permutation; unit tests use CuTe’s equivalence between 1D and ND coordinates.
- Assumptions/dependencies: C++ template metaprogramming (or CuTe DSL for Python), CUTLASS adoption.
- Hardware-prescribed tensor-core instruction integration (mma/wgmma) and multi-dimensional copy pipelines (AI/ML, HPC)
- Encode complex, fixed input/output layouts required by Tensor Cores and rank-5 copy instructions; propagate layouts end-to-end to ensure correctness and throughput.
- Tools/workflows: CUTLASS GEMM, cp.async and Hopper/Blackwell bulk copy orchestration via CuTe layouts.
- Assumptions/dependencies: Availability of Hopper/Blackwell features; correct alignment and shape constraints.
- Memory movement and coalescing optimization for multi-dim data (systems, GPU runtime)
- Use hierarchical shapes/strides to express shared/global-memory transfers that match hardware patterns; exploit coalescence and bank-conflict avoidance informed by layout algebra.
- Tools/workflows: CuTe composition and coalescence operations; static inspection of stride patterns.
- Assumptions/dependencies: Knowledge of memory hierarchy; adherence to architectural alignment and transaction sizes.
- Generic tiling and partitioning for high-throughput kernels (AI/ML, HPC)
- Apply CuTe’s composition/division/tiling operations to express CTA→warp→thread tiling and thread–data mapping in a single, verifiable abstraction.
- Tools/workflows: layout composition pipelines in CUTLASS; autotuning over tile shapes with fixed algorithm logic.
- Assumptions/dependencies: Target-dependent tile choices; register/shared-memory budgets.
- Implementation and optimization of state-of-the-art attention kernels (AI/ML)
- Extend or customize FlashAttention-style kernels by expressing data/thread arrangements with CuTe; maintain correctness while exploring new tiling and fusion patterns.
- Tools/workflows: CUTLASS v3, CuTe DSL; integrate in PyTorch custom ops via CUDA extensions.
- Assumptions/dependencies: Proper numerical/stability handling remains algorithmic; GPU residency and memory caps.
- Compiler backends and codegen passes that understand layouts (compiler tooling, AI/ML)
- Integrate CuTe’s algebra into compilers (e.g., Graphene) to derive index expressions, generate complex copy/compute schedules, and statically verify layout constraints.
- Tools/workflows: MLIR/ISL-inspired passes can consume CuTe-like IR; Graphene uses CuTe already.
- Assumptions/dependencies: IR integration effort; consistent shape metadata across passes.
- Portable high-performance primitives across folded layouts (HPC, scientific computing, robotics, graphics)
- Provide a single kernel that supports row-major, col-major, batched, and complex folded views (e.g., 3D convolutions as batched-GEMM) without copies.
- Tools/workflows: employ hierarchical shapes to admit multiple coordinate systems; coalesce where possible.
- Assumptions/dependencies: When coalescing is not possible, performance depends on hardware instruction patterns and strides.
- Static analysis for correctness and performance properties (software QA, safety)
- Use CuTe to detect out-of-bounds, stride mismatches, mismatched folding, and non-coalesced patterns at compile time.
- Tools/workflows: compile-time reasoning via template types and layout equivalence; property checks in unit tests.
- Assumptions/dependencies: Sufficiently static shapes/strides for compile-time evaluation; integration into CI.
- Accelerated application domains via drop-in library use (healthcare imaging, energy/geoscience, finance risk)
- Adopt CuTe-backed CUTLASS kernels for MRI/CT reconstruction, seismic imaging, portfolio optimization—benefiting from verified layouts and optimal tensor-core usage.
- Tools/workflows: link against CUTLASS; replace bespoke GEMM code paths with CuTe-powered primitives.
- Assumptions/dependencies: Data shapes compatible with folding into batched-GEMM; conversion costs if upstream format differs.
- Education and pedagogy in tensor computations and loop transformations (academia)
- Teach folding of contractions to batched-GEMM, index transforms, and tiling through CuTe’s shape/stride abstractions; illustrate canonical loop nests.
- Tools/workflows: classroom demos with CuTe DSL; visualization of idx2crd/crd2idx mappings.
- Assumptions/dependencies: Students need basic CUDA and linear algebra background.
- Interoperation with std::mdspan and existing tensor libraries (C++ ecosystem)
- Map flat-shape/stride views to hierarchical CuTe layouts (and vice versa) to incrementally modernize codebases and retain legacy interfaces.
- Tools/workflows: adapters from mdspan/NumPy/PyTorch strides to CuTe HTuples; coalescing where possible.
- Assumptions/dependencies: Some folded shapes may not admit flat strides; require hierarchical handling in kernels.
Long-Term Applications
These opportunities build on CuTe’s specification and early deployments but need further research, standardization, or ecosystem maturation.
- Cross-vendor and cross-accelerator layout standardization (ecosystem, policy, software)
- Generalize CuTe concepts to AMD/Intel GPUs, CPUs, NPUs, and SYCL/HIP backends; inform MLIR/LLVM dialects for portable tensor layouts and transformations.
- Potential outputs: an open layout-IR standard; vendor-agnostic compiler passes.
- Dependencies: community consensus; mapping to non-NVIDIA instruction sets and memory hierarchies.
- Automated kernel synthesis and verified code generation (compilers, formal methods)
- Combine CuTe’s algebra with auto-tuners and SMT/ISL-based provers to automatically generate layout-correct, performance-optimized kernels with formal guarantees.
- Potential outputs: proof-carrying kernels; verified scheduling/tiling libraries.
- Dependencies: solver integration; tractable specifications for complex kernels.
- Runtime-adaptive layout selection and transformation (systems, MLOps)
- Dynamically choose/compose layouts based on batch size, sparsity, or hardware availability; compile or specialize kernels on the fly via CuTe DSL.
- Potential outputs: adaptive inference servers; AOT/JIT pipelines that encapsulate layout choices.
- Dependencies: low-overhead JIT, caching, and shape-polymorphic interfaces.
- Co-design of future tensor instructions with software-friendly layout algebra (hardware-software co-design)
- Use CuTe to inform instruction-set design (e.g., copy/mma layouts) so software can easily compose and verify pipelines from day one.
- Potential outputs: ISA proposals accompanied by formal layout specs and testable reference layouts.
- Dependencies: collaboration with hardware vendors; early access to microarchitectural constraints.
- Unified modeling of addresses, threads, and memory spaces via richer semimodules (systems research)
- Exploit integer-semimodules to encode composite codomains (e.g., address + lane + memory-space) for holistic mapping of threads to data across hierarchies.
- Potential outputs: single-spec mappings for CTA/warp/thread and global/shared/register spaces.
- Dependencies: language support, compiler/runtime metadata; developer tooling.
- Sparse and irregular tensor support (AI/ML, HPC)
- Extend hierarchical shapes/strides to capture block-sparse and irregular layouts; fold sparse contractions into canonical primitives with verifiable indexing.
- Potential outputs: generalized GETT for sparse tensors; hybrid sparse-dense kernels.
- Dependencies: sparse format taxonomy; cost models for coalescence and gather/scatter.
- Distributed and topology-aware layouts for multi-GPU/cluster execution (HPC, cloud)
- Compose intra-GPU and inter-GPU/node layouts to align with network topologies (NVLink, InfiniBand) and reduce communication/stragglers.
- Potential outputs: collective-friendly tiling; partitioning that maps to process grids automatically.
- Dependencies: runtime integration (NCCL/MPI), topology discovery, scheduling support.
- Database and analytics acceleration via layout transforms (data systems)
- Apply layout algebra to re-shape columnar/row-major data for GPU operators (joins, aggregations), minimizing copies and ensuring coalesced access.
- Potential outputs: GPU query engines with verifiable memory access plans.
- Dependencies: tight coupling with storage formats (Arrow, Parquet) and ETL pipelines.
- Safety and certification for compute in regulated domains (autonomy, healthcare)
- Use compile-time layout verification to reduce indexing errors and certify memory safety properties in safety-critical kernels.
- Potential outputs: toolchains producing auditable artifacts for kernel memory access.
- Dependencies: standards engagement (e.g., ISO 26262), domain audits, reproducible builds.
- Curriculum and workforce development in high-performance tensor programming (academia, industry training)
- Build courses, labs, and certifications around CuTe-based loop transformations, folding, and layout reasoning for next-gen GPU programmers.
- Potential outputs: open courseware, reference kernels, interactive visual tools.
- Dependencies: teaching materials, industry partnerships, open-source exemplars.
Glossary
- Ampere: NVIDIA GPU architecture generation optimized for tensor operations. "This capability expanded in Turing~\cite{NVIDIA:Turing} and Ampere~\cite{NVIDIA:Ampere} with specialized instructions for structured matrix movement within the GPU memory hierarchy."
- Batch modes: Indices that appear in all tensors of a contraction and represent independent batches processed together.
- batched-GEMM: A matrix multiplication performed across a batch dimension, enabling many small GEMMs to be computed efficiently. "any tensor contraction can be folded into a canonical {\tt batched-GEMM} and evaluated with a trivial reference implementation composed of four nested loops:"
- Blackwell: NVIDIA GPU architecture generation advancing tensor-oriented features. "The Hopper~\cite{NVIDIA:Hopper} and Blackwell~\cite{NVIDIA:Blackwell} architectures further advance this paradigm"
- BLAS: Basic Linear Algebra Subprograms; a standard library of optimized linear algebra routines. "BLAS provides efficient and portable implementations of core linear algebra operations"
- BLIS: BLAS-like Library Instantiation Software; a framework that generalizes and optimizes BLAS routines. "The BLAS-like Library Instantiation Software (BLIS) framework~\cite{VanZee:BLIS} extends GEMM by supporting non-unit strides in both row and column modes simultaneously,"
- Category theory: A branch of mathematics using objects and morphisms to study structures and transformations. "Colfax Research~\cite{Colfax:CategoryCuTe} analyzes CuTe layouts and some operations on them in the context of category theory."
- Coalesced representation: A flattened layout view where hierarchical structure is merged into a contiguous form. "the flat representation is called the {\em coalesced} version of the CuTe representation,"
- Colexicographical bijection: A one-to-one mapping between coordinate systems ordered by the last index varying fastest. "This transformation is the colexicographical bijection, \verb|(i,j) = (k%4,k/4)|, between 2D coordinates \verb|(i,j)| and 1D coordinates \verb|k|."
- Colexicographical ordering: An ordering where the last coordinate is the primary comparator, then preceding ones recursively. "In this work, we choose the colexicographical ordering of coordinates"
- Compatibility: A partial order on shapes ensuring they have the same total size and aligned hierarchical structure. "Compatibility is a partial order on the set of shapes."
- Composition (layout): Functional combination of layout transforms to produce new indexing behavior. "See Section~\ref{sec:composition} for details on layout composition and application to generic partitioning."
- Concatenation (layout operation): Combining layouts or dimensions end-to-end within the CuTe algebra. "including concatenation, coalescence, composition, complementation, division, tiling, and inversion"
- CONV: Convolution operation, here treated as a tensor contraction compatible with folded layouts. "including any matrix-multiplication ({\tt GEMM}), tensor contraction ({\tt GETT}), and convolution ({\tt CONV}),"
- Coordinate set: The set of valid indices for a shape, possibly as Cartesian products of ranges. "A coordinate set is a set of non-negative integers or a Cartesian product of coordinate sets."
- CUTLASS: NVIDIA’s CUDA Templates for Linear Algebra Subroutines and Solvers; a GPU library leveraging CuTe. "CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL."
- CuTe: A mathematical specification and set of abstractions for tensor layouts and operations. "We present CuTe, a novel mathematical specification for representing and manipulating tensors."
- CuTe DSL: A Python-based domain-specific language for compiling CUDA linear algebra software using CuTe concepts. "forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL."
- Division (layout operation): Splitting or factoring layout dimensions in the CuTe algebra to alter indexing structure. "including concatenation, coalescence, composition, complementation, division, tiling, and inversion"
- Einstein notation: Convention where repeated indices imply summation without explicit sigma notation. "Summation is implied over repeated indices that appear only on a single side of an equation (Einstein notation),"
- einsum: Interfaces in NumPy and PyTorch implementing Einstein summation for concise tensor contractions. "Contractions of this form are expressed compactly in the {\tt numpy.einsum} and {\tt torch.einsum} interfaces, for instance."
- FlashAttention: An optimized attention algorithm for LLMs with memory-efficient kernels. "including FlashAttention and each of its evolving generations~\cite{Dao:FlashAttention, Dao:FlashAttention2, Dao:FlashAttention3},"
- Folding (tensor folding): Grouping tensor modes to reinterpret data shape without copying. "This is referred to as tensor {\em folding}."
- GEMM: General Matrix Multiply; the core matrix multiplication routine widely used and optimized. "Among BLAS primitives, the GEneral Matrix Multiply (GEMM) routine is easily the most optimized and widely used operation"
- GETT: Generalized tensor contraction primitive analogous to GEMM for tensors. "including any matrix-multiplication ({\tt GEMM}), tensor contraction ({\tt GETT}), and convolution ({\tt CONV}),"
- Graphene tensor compiler: A compiler framework where CuTe represents tensor operations. "CuTe has been used within the Graphene tensor compiler~\cite{Hagedorn:Graphene}, where it plays a critical role in representing tensor operations."
- HeLayers: A library for homomorphic encryption layers influencing dense tensor layout generalizations. "Independent generalizations of dense tensor representations include HeLayers~\cite{Aharoni:HeLayers}, ThunderKittens~\cite{Spector:ThunderKittens}, and the Linear Layouts~\cite{Tillet:LinearLayouts} approach used in OpenAI's Triton compiler~\cite{Tillet:TritonAI}."
- Hopper: NVIDIA GPU architecture introducing advanced tensor copy and core capabilities. "The Hopper~\cite{NVIDIA:Hopper} and Blackwell~\cite{NVIDIA:Blackwell} architectures further advance this paradigm, introducing copy instructions for efficiently transferring rank-5 tensors between global and shared memory and further expanding tensor core capabilities."
- HTuple: Hierarchical tuple datatype allowing nested tuples used to represent shapes and strides. "An HTuple is either an element of a set or a tuple of HTuples."
- Integer-semimodule: Algebraic structure with associative addition and integer scalar multiplication used for stride elements. "An integer-semimodule is a set equipped with an associative addition and a scalar multiplication."
- Inversion (layout operation): Reversing a layout transformation within CuTe’s algebra to recover prior indexing. "including concatenation, coalescence, composition, complementation, division, tiling, and inversion"
- ISL (integer set relations): The Integer Set Library context for analyzing layout relations and transformations. "analyzes CuTe and Linear Layouts~\cite{Tillet:LinearLayouts} in the context of integer set relations (ISL),"
- Layout algebra: The set of operations over layouts enabling manipulation, derivation, and analysis of tensor views. "a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion --"
- Layout function: The mapping from coordinates to the stride codomain defined by composing a shape and a stride. "A layout is the functional composition of a shape $S$ and a stride $D$, $S \sim D$"
- Linear Layouts: A representation of tensor layouts based on F2 linear algebra used in the Triton compiler. "Linear Layouts are based on $\mathbb{F}_2$ linear algebra and provide a more general representation of tensor layouts as well as an avenue for layout analysis and generation."
- Matricization: Restructuring tensors into matrices to leverage BLAS routines for contractions. "Conventional approaches for computing general tensor contractions rely on matricization, which involves logically or explicitly restructuring tensor data to perform computations"
- mdspan: A C++ facility for multidimensional array views with shape and stride metadata. "and in C++, {\tt std::mdspan}"
- Multi-indices: Combined indices used in tensor notation to fold modes for canonical forms. "a key insight motivating CuTe is the use of multi-indices in tensor notation to enable the transformation of arbitrary tensor contractions into a canonical batched-GEMM primitive."
- Multi-mode: A grouped set of tensor modes treated as a single dimension after folding. "where each mode may be a single mode or a group of modes, which we call a {\em multi-mode}."
- Non-integral strides: Stride values not restricted to integers, enabling more general layout mappings. "CuTe supports these representations and strictly expands on them with generalizations to hierarchical shapes and strides to represent more complex layouts, non-integral strides, and non-integral layout codomains."
- Semi-affine striding: A striding pattern where offsets follow affine relationships but may be non-uniform across coordinates. "A key observation is that the $4 {\times} 2080$-element vector with non-uniform, semi-affine striding,"
- Shape:Stride representation: The paired description of tensor shape with corresponding stride defining a layout. "Because there is a one-to-one correspondence between the $\text{Shape}:\text{Stride}$ information and the loop nest itself,"
- Stream-K algorithm: A GPU kernel scheduling strategy used in high-performance GEMM implementations. "the C++ implementation of CuTe has been used in implementations of the Stream-K algorithm~\cite{Osama:StreamK}"
- Strided-batched GEMM: A BLAS extension allowing batches of GEMMs with constant strides between matrices. "The strided-batched GEMM extension to BLAS further generalizes the primitive and allows its application to even more tensor contractions,"
- Stride: The mapping weights (possibly hierarchical) that convert coordinates into offsets in a layout's codomain.
- Tensor Cores: Specialized hardware units for fast small-matrix multiplications. "NVIDIA's Volta architecture~\cite{NVIDIA:Volta} introduced Tensor Cores, enabling efficient small-matrix multiplications directly in hardware."
- Tensor contraction: The summation over shared indices across tensors to produce a result tensor. "An instance of a tensor contraction is"
- Tensor instructions: Hardware-level operations optimized for tensor data movement and computation. "Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions,"
- Thunderkittens: A library providing bespoke GPU layout and access-pattern types oriented to hardware requirements. "Thunderkittens implements a wide variety of bespoke types for register memory, shared memory, row/column-major tiles, row/column-major tiles of row/column-major subtiles, and prescribed access patterns for warps and threads."
- Tiling: Partitioning data or loops into blocks to improve locality and performance. "enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions."
- Triton compiler: A GPU programming system where Linear Layouts are used to represent tensor layouts. "the Linear Layouts~\cite{Tillet:LinearLayouts} approach used in OpenAI's Triton compiler~\cite{Tillet:TritonAI}."
- Turing: NVIDIA GPU architecture generation with expanded tensor instruction support. "This capability expanded in Turing~\cite{NVIDIA:Turing} and Ampere~\cite{NVIDIA:Ampere} with specialized instructions"
- Volta: NVIDIA GPU architecture generation that introduced Tensor Cores. "NVIDIA's Volta architecture~\cite{NVIDIA:Volta} introduced Tensor Cores,"
- Warps: Groups of GPU threads that execute in lockstep on the hardware. "prescribed access patterns for warps and threads."
- Weak congruence: A partial-order relation indicating one HTuple's profile is at least as refined as another. "Weak Congruence is a partial order on HTuples."