
Cypress Programming Model for GPU Tensor Ops

Updated 12 December 2025
  • Cypress Programming Model is a task-and-tensor abstraction framework that enables efficient, asynchronous GPU computing using sequential programming semantics.
  • It employs explicit mapping specifications to control processor allocation, memory placement, and pipeline depth, ensuring deterministic scheduling and fine-grained performance tuning.
  • The compiler pipeline performs dependence analysis, vectorization, copy elimination, and warp specialization to generate optimized CUDA C++ kernels that rival hand-tuned libraries.

The Cypress programming model provides an abstraction layer for constructing efficient, asynchronous tensor computations on modern GPUs, particularly those with multiple domain-specific fixed-function units such as NVIDIA's Hopper architecture. Cypress is designed to enable programmers to compose complex GPU programs using sequential, task-based semantics while capturing the hierarchical, asynchronous behavior required to optimally leverage capabilities such as Tensor Cores and asynchronous data movement (TMA), without manual management of synchronization, explicit memory movement, or event placement (Yadav et al., 9 Apr 2025).

1. Core Abstractions: Tasks, Tensors, and Sequential Semantics

Cypress is structured around the concept of tasks, which are designated functions operating on one or more first-class tensors. Each task is declared with a unique name, a kind (k ∈ {inner, leaf}), and, for every tensor argument, a privilege (p ∈ {read, write, read-write}). Example syntax appears as:

def f@{gemm,Inner}(C: tensor[2, f16]{write}, A: tensor[2, f16]{read}, B: tensor[2, f16]{read}):
    ...body...

  • Inner tasks can partition inputs, perform computations, and recursively launch subtasks but cannot invoke arbitrary code outside the task model.
  • Leaf tasks may invoke arbitrary code such as CUDA intrinsics but are terminal and do not launch subtasks.

Task launches are always interpreted by the compiler as strictly "in order," so the user does not write constructs such as cudaMemcpyAsync, events, or synchronizations. All inter-task communication, scheduling, and synchronization are implicit within the model, and explicit concurrency is not exposed to the programmer. This ensures that user code is free from data races, ordering hazards, or missing barriers.

Cypress tensors are value objects with a static rank r, a precise element type τ (e.g., f16), and a dynamic shape vector s ∈ ℕ^r. Partitioning operators such as partition_by_blocks or partition_by_mma provide the mechanisms for hierarchical subdivision of tensors in sync with the computation, making code fully compositional.
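As an illustration of the partitioning idea (plain Python, not Cypress syntax), a rank-2 tensor can be subdivided into a grid of equally sized tiles; `partition_by_blocks` here is a hypothetical stand-in for the Cypress operator of the same name:

```python
def partition_by_blocks(t, block_shape):
    """Hypothetical stand-in for Cypress's partition_by_blocks:
    split a rank-2 tensor (a list of rows) into a grid of tiles."""
    rows, cols = len(t), len(t[0])
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0, "shape must tile evenly"
    return [[[row[j:j + bc] for row in t[i:i + br]]
             for j in range(0, cols, bc)]
            for i in range(0, rows, br)]

A = [[r * 8 + c for c in range(8)] for r in range(8)]  # an 8x8 tensor
tiles = partition_by_blocks(A, (4, 4))                 # 2x2 grid of 4x4 tiles
```

Because each tile is itself a tensor value, the same operator can be applied recursively at the next level of the hierarchy, which is what makes the code compositional.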

For example, a GEMM inner task partitions its operand tensors for Tensor Core instructions:

Ap = partition_by_mma(A, WGMMA_64×256×16, warp, "A")
Bp = partition_by_mma(B, WGMMA_64×256×16, warp, "B")
Cp = partition_by_mma(C, WGMMA_64×256×16, warp, "C")

The strictly sequential semantics of tasks and loops allow the compiler to internally construct a dataflow graph of asynchronous operations connected by events, enforcing the program’s "as written" order.
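That construction can be sketched as a toy Python model (illustrative only, not the Cypress internals): declared read/write privileges on tensor arguments induce event edges between sequentially launched tasks, so any two launches with no dependence between them remain free to overlap.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    deps: list = field(default_factory=list)

class TaskGraph:
    """Toy dependence tracker: sequential launches plus read/write
    privileges yield RAW/WAR/WAW event edges, as in the 'as written'
    order described above."""
    def __init__(self):
        self.last_writer = {}  # tensor -> event that last wrote it
        self.readers = {}      # tensor -> events reading it since last write

    def launch(self, name, reads=(), writes=()):
        ev = Event(name)
        for t in reads:
            if t in self.last_writer:                 # RAW dependence
                ev.deps.append(self.last_writer[t])
            self.readers.setdefault(t, []).append(ev)
        for t in writes:
            ev.deps.extend(self.readers.pop(t, []))   # WAR dependences
            if t in self.last_writer:                 # WAW dependence
                ev.deps.append(self.last_writer[t])
            self.last_writer[t] = ev
        return ev

g = TaskGraph()
copy_a = g.launch("copy_A", writes=["A_sh"])   # independent of copy_B
copy_b = g.launch("copy_B", writes=["B_sh"])
mma = g.launch("wgmma", reads=["A_sh", "B_sh"], writes=["C"])
```

In this sketch the two copies carry no edges between them (they may run concurrently), while the compute launch waits on both — the same structure the compiler extracts from sequential Cypress code.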

2. Mapping Specification: Explicit Resource Control

Each Cypress program is concretized for a target architecture via an explicit mapping specification. The mapping specification prescribes where each task instance runs, where its tensors are materialized, and how its child launches are dispatched. The fields in a mapping statement are:

  1. instance: unique name for a task variant instantiation
  2. variant: which task implementation (inner or leaf) to use
  3. proc: processor level (HOST, BLOCK, WARPGROUP, WARP, THREAD)
  4. mems: memory allocation for each tensor (GLOBAL, SHARED, REGISTER, NONE)
  5. tunables: compile-time tunable parameters (e.g., tile sizes)
  6. calls: which task instances to dispatch for each launch in the task body
  7. warpspecialize (optional): instructs the compiler to partition DMA and compute warps
  8. pipeline (optional): software pipeline depth for TMA-TensorCore loops

Example mapping specification for a block-level GEMM task:

TaskMapping(
  instance="gemm_block",
  variant="gemm_block",
  proc=BLOCK,
  mems=[GLOBAL,GLOBAL,GLOBAL],
  tunables={"W":64},
  calls=["clear","gemm_tile","copy"],
  warpspecialize=true,
  pipeline=3
)

This explicit mapping fully binds abstract tasks to the desired hardware processing units, memory spaces, and execution strategies, decoupling correctness from mapping choices.
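The eight fields above can be modeled as a small Python record (purely illustrative; the real specification is a Cypress-level construct, and the enum members simply mirror the processor and memory levels listed earlier):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Proc(Enum):
    HOST = auto(); BLOCK = auto(); WARPGROUP = auto(); WARP = auto(); THREAD = auto()

class Mem(Enum):
    GLOBAL = auto(); SHARED = auto(); REGISTER = auto(); NONE = auto()

@dataclass
class TaskMapping:
    """Toy mirror of a Cypress mapping statement: one entry per field."""
    instance: str
    variant: str
    proc: Proc
    mems: list                              # one Mem per tensor argument
    tunables: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)
    warpspecialize: bool = False
    pipeline: int = 1

m = TaskMapping(instance="gemm_block", variant="gemm_block", proc=Proc.BLOCK,
                mems=[Mem.GLOBAL] * 3, tunables={"W": 64},
                calls=["clear", "gemm_tile", "copy"],
                warpspecialize=True, pipeline=3)
```

Because the mapping lives outside the task body, swapping `proc`, `mems`, or `pipeline` retargets the same program without touching its (already correct) sequential logic.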

3. Compiler Pipeline: From Tasks to Optimized Kernels

The Cypress compiler lowers high-level task-based programs to CUDA C++ through five structured stages:

  • Dependence Analysis: Recursively traverses the task tree per mapping spec, emitting copies, allocating buffers, and recording explicit data-flow dependencies as SSA 'events,' with fine-grained event tracking for synchronization.
  • Vectorization: Flattens program hierarchy by injecting explicit GPU thread identifiers and converting loop-based iteration into vectorized code across GPU execution levels. Translation introduces synchronization constructs only where cross-thread dependencies exist.
  • Copy Elimination: Applies rewrite rules to remove redundant buffer copies (e.g., self-copies, duplicate copies, spill-and-hoist patterns), minimizing unnecessary data movement and associated synchronization.
  • Resource Allocation and Warp Specialization: Allocates tensors in shared memory using interference graph coloring. If warpspecialize=true, partitions tasks into (DMA warp) and (compute warps) for maximal hardware overlap, inserting barriers at necessary partition boundaries and pipelining the main computational loop exactly as in hand-optimized libraries (e.g., CUTLASS).
  • CUDA C++ Code Generation: Emits device kernels, functions, lambdas, and scalar code for each program level, lowering events to CUDA synchronization primitives or hardware-specific barriers, and invoking low-level libraries where relevant.
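The copy-elimination stage can be illustrated with a toy rewrite pass (not the actual Cypress rules): reads are forwarded through earlier copies, self-copies are dropped, and copies whose destination is never read afterwards are removed as dead.

```python
def eliminate_copies(ops):
    """Toy copy elimination over (kind, dst, src) triples: forward
    reads through copy chains, drop self-copies and dead copies."""
    alias, rewritten = {}, []
    for kind, dst, src in ops:
        src = alias.get(src, src)      # forward through copy chains
        if kind == "copy":
            if dst == src:             # self-copy: drop immediately
                continue
            alias[dst] = src
        rewritten.append((kind, dst, src))
    out = []
    for idx, (kind, dst, src) in enumerate(rewritten):
        # a copy whose destination no later op reads is dead
        if kind == "copy" and all(s != dst for _, _, s in rewritten[idx + 1:]):
            continue
        out.append((kind, dst, src))
    return out

prog = [("copy", "t1", "A"),       # A -> t1
        ("copy", "t2", "t1"),      # duplicate: t2 is really a copy of A
        ("compute", "C", "t2")]
opt = eliminate_copies(prog)       # -> [("compute", "C", "A")]
```

Eliminating both staging copies also eliminates the synchronization that would have guarded them, which is why this pass reduces not just data movement but event traffic.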

A representative generated kernel for block-level GEMM includes explicit shared memory ring buffers, staged prefetching with TMA, barriers, and inner loops unrolled to fit the pipeline depth, closely modeling the manual approach of expert CUDA programmers.

4. Explicit Control over Partitioning, Memory, and Asynchrony

Cypress exposes every major performance-relevant parameter to the mapping and partitioning mechanisms:

  • Partitioning is available at all levels and exploits both spatial tiling and hardware-specific blocking for tensor operations.
  • Memory placement for each tensor argument is under full programmer control, allowing precise optimization of shared versus global memory usage.
  • Overlap of TMA data movement and Tensor Core computation is first-class and statically scheduled, yielding tightly overlapped asynchronous execution with minimal idle cycles and a schedule fixed at compile time rather than discovered at runtime.

The model encapsulates hierarchical parallelism (block, warp, thread), task tree structure, and explicit mapping, jointly yielding transparent, deterministic scheduling.
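The statically scheduled overlap can be sketched as a toy schedule generator (plain Python, illustrative only; no real GPU work): a depth-N ring of shared-memory slots lets the TMA-style prefetch of a later tile be issued while an earlier tile is being computed.

```python
def pipelined_schedule(num_tiles, depth):
    """Emit a (stage, tile, slot) trace for a depth-N software pipeline
    over a ring buffer: prologue fills the ring, then each compute on
    tile i frees its slot for the prefetch of tile i + depth."""
    trace = []
    for i in range(min(depth, num_tiles)):      # prologue: fill the ring
        trace.append(("prefetch", i, i % depth))
    for i in range(num_tiles):                  # steady state + epilogue
        trace.append(("compute", i, i % depth))
        nxt = i + depth
        if nxt < num_tiles:
            trace.append(("prefetch", nxt, nxt % depth))
    return trace

trace = pipelined_schedule(num_tiles=6, depth=3)
```

Every prefetch of a tile precedes its compute, and each slot is reused exactly `depth` tiles later — the invariant that barriers enforce in the generated kernels described above.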

5. Benchmarks and Performance for Tensor Algorithms

Extensive benchmarking on NVIDIA H100 (80 GB SXM5) demonstrates Cypress’s capacity to match or surpass high-performance vendor and research-generated kernels on standard workloads:

| Benchmark | Cypress vs cuBLAS | Cypress vs Triton | Other systems |
|---|---|---|---|
| GEMM (FP16, 256–16K) | 0.88–1.06× | 1.05–1.11× | Triton: 0.93–1.02× of cuBLAS |
| Batched GEMM (L=64) | 0.90–1.08× | 1.03–1.15× | — |
| Dual-GEMM | ~1.36–1.40× | — | Triton: slower by >30% |
| GEMM+Reduction | ~2.02–2.18× | — | Triton: ~0.9 TFLOP/s |
| FlashAttention-2 | 0.87–1.06× | — | Matches ThunderKittens |
| FlashAttention-3 | 0.80–0.98× | — | Triton lags by 1.5–2.2× |

Major factors underpinning this competitive performance include: fully explicit hierarchical partitioning; explicit overlap and mapping of asynchronous data movement and computation; aggressive static scheduling and copy elimination removing runtime waste; and optimization of register and shared memory allocation for maximal resource reuse and pipeline depth (Yadav et al., 9 Apr 2025).

6. Significance and Relationship to Prior Systems

Cypress elevates the abstraction for asynchronous GPU programming by unifying hierarchical control, explicit memory placement, and static scheduling while offering a task-and-tensor programming interface with strictly sequential user semantics. All critical performance parameters—tiling, memory hierarchy, computational overlap, pipeline depth, and warp specialization—are explicitly surfaced in the mapping layer. In contrast to prior higher-level models (e.g., Triton), which may rely on heuristics or restrict direct control, Cypress exposes and composes these mechanisms, yielding deterministic, verified code generation and reproducible, near-optimal performance.

Unlike other programming systems that require explicit user-level management of synchronization, memory transfer, or warp-level primitives, Cypress eliminates all such user-side complexity through a combination of advanced dependence analysis, mapping-based specialization, and a robust lowering pipeline that targets the architectural specifics of contemporary GPUs.

7. Conclusion

Cypress presents a cohesive programming model for modern, heterogeneous GPU architectures, abstracting away imperative event and memory management in favor of a compositional, task- and tensor-centric approach with explicit, mappable control over parallelism, asynchrony, and memory hierarchy. Empirical results establish Cypress as matching or exceeding hand-tuned vendor libraries in critical benchmarks, and its design paradigm provides a systematic approach for efficiently orchestrating producer-consumer pipelines across emerging fixed-function units. By aligning code structure with hardware execution hierarchies and static analysis, Cypress circumscribes the challenges of contemporary GPU programming within a tractable, high-level, yet performance-portable model (Yadav et al., 9 Apr 2025).
