TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments
Abstract: Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and cluster-aware control. Our evaluation shows that TLX supports substantial customization with limited development effort while remaining competitive with state-of-the-art implementations. TLX-authored kernels have been deployed in large-scale training and inference production systems. Our code is open sourced at https://github.com/facebookexperimental/triton.
Explain it Like I'm 14
TLX: A simple explanation for teens
What is this paper about?
This paper introduces TLX, a new way to write fast programs for GPUs (the chips that train and run AI). TLX sits between two existing styles:
- Very low-level, hands-on control (like CUDA), which is powerful but hard to write.
- Higher-level, easier code (like Triton), which is friendly but sometimes can't use the newest hardware tricks quickly.
TLX adds a "missing middle" called MIMW (Multi-Instruction, Multi-Warp). In simple terms, it lets small teams of threads (called warp groups) do different jobs at the same time and coordinate smoothly. Think of a busy kitchen: one team chops, another cooks, another plates, and they keep passing ingredients along without waiting for the whole kitchen to stop and sync at every step.
What were the authors trying to do?
The paper's main goals are:
- Create a middle-ground programming model (MIMW) where groups of threads can follow different plans and still work together efficiently.
- Build TLX as an add-on to Triton so you keep Triton's ease for regular math while gaining fine control where it matters.
- Show that TLX can reach top-tier speed in real AI training and inference (making predictions) without tons of extra coding.
How did they do it? (With simple analogies)
To make this work, the authors add a few practical tools to Triton:
- Warp specialization (team roles): Instead of every mini-team doing the exact same steps, different warp groups get roles:
  - "Producers" bring data in (like chopping veggies).
  - "Consumers" compute with it (like cooking).
  - "Epilogue" teams finish up (like plating).
  - These teams run in parallel and hand off work with lightweight signals.
- Local memory control (shared scratchpads): Teams share fast on-chip "scratchpad" memory (like a shared counter space in the kitchen). TLX lets you:
  - Allocate these buffers explicitly.
  - Control their layout (how food is arranged on the counter) so hardware can access it quickly.
  - Safely reuse the same space across stages without collisions.
- Asynchronous operations (no unnecessary waiting): TLX makes it normal to load data, compute, and sync at the same time, like a well-timed assembly line. Instead of stopping everyone with big red lights, TLX uses small "barriers" (like traffic lights for just the roads that need it) so only the teams that must wait actually wait.
- Cluster-level control (teams of teams): Modern GPUs run several thread blocks together as a "cluster." TLX adds:
  - CLC (Cluster Launch Control): a hardware-backed queue so blocks can grab the next piece of work dynamically (like cooks pulling the next order ticket), which helps balance uneven workloads.
  - DSM (Distributed Shared Memory): lets one block write into another block's scratchpad directly (passing plates without going back to the pantry).
  - Multi-CTA instructions: some new GPU features let two blocks act as one for a single big math instruction (like two cooks lifting one heavy pot together); TLX makes this safe and explicit.
- Under the hood (compiler integration): TLX isn't a whole new language. It embeds its features into Triton's compiler so the "how teams coordinate" decisions stay visible all the way to machine code. The compiler:
  - Tracks buffer layouts and aliases.
  - Inserts needed conversions only when necessary.
  - Lowers everything to the right hardware instructions for NVIDIA (and works with AMD too).
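The team-role, scratchpad-reuse, and small-barrier ideas above can be mimicked on a CPU with ordinary threads. The sketch below is plain Python, not TLX or Triton code: the names `run_pipeline`, `NUM_STAGES`, and `NUM_TILES` are made up for illustration, and the semaphores only stand in for the per-slot signals a real kernel would use. It shows a producer, a consumer, and an epilogue role sharing a small ring of reusable slots, where each role waits only on the slot it needs.

```python
import queue
import threading

NUM_STAGES = 2   # reusable scratchpad slots (illustrative value)
NUM_TILES = 8    # tiles of work flowing through the pipeline


def run_pipeline(num_tiles=NUM_TILES, num_stages=NUM_STAGES):
    slots = [None] * num_stages
    # One "empty" and one "full" signal per slot, so a role waits only on
    # the slot it needs instead of on a pipeline-wide stop.
    empty = [threading.Semaphore(1) for _ in range(num_stages)]
    full = [threading.Semaphore(0) for _ in range(num_stages)]
    handoff = queue.Queue()
    results = []

    def producer():
        for i in range(num_tiles):
            s = i % num_stages
            empty[s].acquire()                    # wait until slot s may be overwritten
            slots[s] = [i, i + 1, i + 2, i + 3]   # stand-in for loading a tile
            full[s].release()                     # signal: slot s holds fresh data

    def consumer():
        for i in range(num_tiles):
            s = i % num_stages
            full[s].acquire()                     # wait until slot s has data
            acc = sum(x * x for x in slots[s])    # stand-in for the math
            empty[s].release()                    # slot s can now be reused
            handoff.put((i, acc))
        handoff.put(None)                         # tell the epilogue to stop

    def epilogue():
        while True:
            item = handoff.get()
            if item is None:
                break
            results.append(item)                  # stand-in for writing results out

    threads = [threading.Thread(target=f) for f in (producer, consumer, epilogue)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

With two slots, the producer can fill one slot while the consumer works on the other; raising `num_stages` lets the producer run further ahead at the cost of more scratchpad space.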
What did they find, and why does it matter?
- Strong performance: TLX gets performance competitive with state-of-the-art hand-tuned code (like CUDA or expert libraries) on key AI kernels such as matrix multiplication (GEMM). It works on the newest GPU features (like Hopper and Blackwell tensor cores) and has been deployed in large-scale real-world training and inference systems at Meta.
- Faster development with enough control: Because TLX keeps Triton's "tile-based" style for regular math and only adds detail where needed, developers can move faster than if they had to write everything in low-level CUDA.
- Usability in practice: In a survey of 127 advanced students, TLX compared well to other systems on common low-level tasks, and stood out for cluster control (the "teams of teams" part), suggesting programmers can use these extra controls without falling back to fully manual programming.
Why it matters: GPUs are evolving quickly with more special-purpose parts (tensor cores, new memory engines) and more asynchronous behavior (things running in parallel). If a language hides too much, the compiler must "catch up" to every new hardware trick, which takes time. If it exposes too much, programmers get overwhelmed. TLX's middle ground helps both performance and productivity right away.
What's the big picture impact?
- Faster AI at lower effort: Teams can build high-performance GPU kernels more quickly, which helps train models faster and run them more efficiently.
- Future-proofing: As GPUs add new features (bigger teams, new memory pathways), TLX's design makes it easier to adopt them without waiting for large compiler overhauls.
- Broad usefulness: It works within the Triton ecosystem, supports NVIDIA and AMD ideas, and is open-sourced (so the community can try, adapt, and extend it): https://github.com/facebookexperimental/triton
In short
TLX gives programmers a practical "middle lane" to orchestrate GPU work: not too low-level to be painful, not too high-level to be slow. By letting small teams of threads (warp groups) take on specialized roles and coordinate with clear, lightweight rules, TLX delivers near-expert performance while keeping code understandable and adaptable for future hardware.
Knowledge Gaps
Unresolved knowledge gaps, limitations, and open questions
Below is a focused list of what the paper leaves missing, uncertain, or unexplored. Each point is phrased to help guide concrete follow-up research or engineering work.
- Formal semantics and verification: No formal memory model is specified for MIMW (especially for DSM/TMA and the "arrive remote, wait local" rule), nor tools to statically check race-freedom, absence of deadlocks, or matched multi-CTA collective participation.
- Portability beyond NVIDIA: The paper asserts AMD support but does not detail how TLX maps DSM, CLC, TMA, WGMMA/tcgen05.mma, or warp-group semantics to AMD (wave32/64) where analogous hardware features may be absent or different; a portability layer and performance portability results are missing.
- Fallback behavior on unsupported hardware: It is unclear how TLX degrades when cluster features (e.g., CLC, DSM, multicast, paired-CTA MMA) are unavailable (older NVIDIA or non-NVIDIA GPUs), and what performance/semantic trade-offs result.
- Automated orchestration synthesis: Warp-group partitioning, barrier placement, stage depths, and role counts are hand-authored; there is no cost model, autotuner, or search procedure to synthesize MIMW schedules automatically.
- Choosing cluster parameters: The selection of CLC stage counts, work granularity (tile size), multicast target sets, and DSM buffer sizes is manual; no guidance or automated optimization is provided to balance throughput, latency, and occupancy.
- Compiler overheads and JIT latency: The paper does not quantify the compile-time costs of TLX's additional passes (layout propagation, alias rewrites, cluster legalization) or their impact on JIT-heavy production serving/training.
- Debugging and profiling: There is no discussion of tooling to debug MIMW kernels (e.g., per-warp-group pipeline traces, barrier state introspection, DSM traffic tracing) or integration with Nsight/rocprof for cluster-aware timelines.
- Safety of multi-CTA collectives: TLX relies on user-specified grouping and synchronization for paired-CTA MMA but offers no static checks to ensure matched issue orders, compatible operand layouts, or deadlock prevention under divergence/preemption.
- Determinism and reproducibility: Dynamic work assignment via CLC and asynchrony may introduce run-to-run variability; the paper does not address how to achieve deterministic execution for training or testing.
- Occupancy and resource budgeting: There is no method to automatically budget registers, shared memory, mbarriers, and TMA descriptors across role-specialized warp groups and CTAs to maximize SM residency; trade-offs are left to manual tuning.
- Limits of layout propagation: The priority-based conflict resolution for local-memory layouts lacks optimality guarantees; the frequency and performance impact of inserted layout conversions and alias conflicts are not quantified.
- Correctness under preemption/MIG/MPS: Effects of GPU preemption, Multi-Instance GPU (MIG), and Multi-Process Service (MPS) on cluster residency, CLC progress, DSM ordering, and multi-CTA collectives are not examined.
- Security and isolation: DSM exposes remote shared memory within a cluster; implications for memory isolation, cross-tenant safety, and failure containment in multi-tenant environments are not discussed.
- Applicability beyond dense ML: Evaluation and design center on tile-regular kernels; it remains unclear how well TLX handles sparse, irregular, or control-heavy workloads and what additional primitives might be needed.
- Evaluation breadth and baselines: Beyond the (partial) GEMM results and a student survey, the paper lacks comprehensive benchmarks (e.g., convolutions, attention, fused pipelines, end-to-end training/inference throughput/latency/energy) and head-to-head comparisons with CUDA, CuTeDSL, ThunderKitten, and TileLang on identical kernels.
- Productivity in real-world teams: Evidence of reduced development time/defect rates in industry settings is missing; the student survey does not measure professional developer effort across hardware generations or maintenance costs.
- Interactions with Triton autotuning and fusion: How TLX composes with Triton's autotuner, kernel fusion, and schedule transformations is not described; potential conflicts between TLX's explicit orchestration and Triton's automatic passes remain unexplored.
- Runtime failure handling: Behavior under asynchronous engine faults (e.g., TMA errors, ECC), barrier misconfigurations, or partial cluster failure is unspecified; mechanisms for error reporting, cancellation, and graceful degradation are absent.
- Cross-SM/chiplet orchestration: TLX focuses on intra-cluster/SM-group control; scaling MIMW concepts to inter-SM, chiplet (XCD) groups, or inter-device (multi-GPU) orchestration is left open.
- Energy efficiency: The impact of deeper asynchrony and overlapping pipelines on power/energy (e.g., DVFS behavior, thermal limits) is not evaluated.
- API stability and extensibility: While claiming extensibility, the paper does not outline versioning, backward compatibility across GPU generations, or a process for adding new hardware collectives without breaking user kernels.
- Termination and fairness in CLC: Guarantees about starvation freedom, fairness across CTAs, and termination detection (especially with irregular runtimes and tails) are not formalized or empirically assessed.
- Handling hardware limits: Practical limits on mbarriers per CTA, shared-memory size, TMA descriptors, and number of concurrent async ops are not documented, nor strategies when kernels approach these limits.
- Framework integration and deployment: Concrete guidance and measurements for integrating TLX kernels with PyTorch/Inductor or other compilers (artifact caching, AOT vs JIT packaging, multi-arch builds) are not provided.
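The determinism gap listed above comes down to a basic property of floating-point arithmetic: addition is not associative, so the order in which partial results are combined can change the answer. The sketch below is a plain-Python illustration, not TLX code; it assumes a kernel whose partial results get summed in whatever order work items happen to finish, which may or may not apply to a given TLX kernel. The names `reduce_in_order` and `partials` are hypothetical.

```python
def reduce_in_order(values, order):
    """Sum values in the given visiting order using float64 arithmetic."""
    total = 0.0
    for idx in order:
        total += values[idx]
    return total


# Four partial results, as if produced by four work items.
partials = [1e16, 1.0, -1e16, 1.0]

# A fixed (static) assignment always combines them in index order.
fixed_order = [0, 1, 2, 3]

# A dynamic assignment might combine them in a different order on another run.
other_order = [0, 2, 1, 3]

if __name__ == "__main__":
    print(reduce_in_order(partials, fixed_order))   # 1.0
    print(reduce_in_order(partials, other_order))   # 2.0
```

In the first order, adding 1.0 to 1e16 is lost to rounding before the large values cancel; in the second, the large values cancel first and both small values survive. Kernels that need run-to-run reproducibility typically fix the combination order regardless of which work item finishes first.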
Practical Applications
Immediate Applications
The following use cases can be deployed now using the open-sourced TLX extension to Triton and current Hopper/Blackwell-class GPUs (and AMD backends where supported):
- High-performance ML kernels (GEMM, attention, fused ops, MoE routing)
  - Sectors: AI/ML, Software, Cloud
  - What: Author CUDA-competitive kernels faster by assigning warp-group roles (producer/compute/epilogue), overlapping TMA/async copy, and using cluster launch control (CLC) for dynamic, persistent execution.
  - Tools/Workflows: TLX in Triton (github.com/facebookexperimental/triton); integrate kernels into PyTorch/Inductor or custom operator stacks; auto-tune TLX knobs (num_warps, num_regs, pipeline depth).
  - Assumptions/Dependencies: Access to GPUs with features like WGMMA, TMA, DSM, and CLC; Triton/TLX runtime compatibility; developer familiarity with warp specialization and barriers.
- Throughput-oriented inference serving (dynamic batching, variable sequence lengths)
  - Sectors: AI/ML, Cloud
  - What: Use CLC to keep clusters resident and load-balanced across heterogeneous requests; reduce memory traffic with TMA multicast and DSM for cross-CTA tile sharing in attention/KV-cache updates.
  - Tools/Workflows: TLX kernels embedded in serving stacks; Nsight or tracing tools to validate wait/arrive timelines; persistent-kernel orchestration via TLX intrinsics.
  - Assumptions/Dependencies: Scheduler integration for persistent kernels; DSM/multicast hardware availability; careful SM occupancy management.
- Production operator modernization for new GPUs
  - Sectors: AI/ML, Software
  - What: Rapidly adopt new hardware features (e.g., Hopper WGMMA, Blackwell tcgen05 paired-CTA MMA) without waiting for compiler "catch-up," by expressing collectives and layouts explicitly.
  - Tools/Workflows: TLX cluster primitives for paired-CTA MMA; explicit layout propagation via TLX RequireLayout/LocalAlias; CI performance gates.
  - Assumptions/Dependencies: Driver/firmware supporting multi-CTA collectives; correct operand partitioning and synchronization; regression/perf testing across GPU generations.
- Scientific/HPC dense compute kernels (linear algebra, FFT/stencil mainloops)
  - Sectors: HPC, Energy, Research
  - What: Build asynchronous pipelines that overlap data movement and compute using MIMW roles; employ DSM for intra-cluster reductions or tile exchanges to avoid global memory round-trips.
  - Tools/Workflows: TLX local-memory control with explicit layouts; cluster-aware scheduling; integrate into domain libraries (e.g., custom GEMM/FFT backends).
  - Assumptions/Dependencies: Regular blocking suitable for Triton/TLX; mapping to AMD backends may require feature fallbacks; validation on realistic problem sizes.
- Faster compiler and systems research
  - Sectors: Academia
  - What: Use TLX's explicit IR constructs (tasks, barriers, layouts) to prototype new lowering passes, auto-tuning strategies, or orchestration policies without reimplementing a compiler.
  - Tools/Workflows: Extend TLX passes (layout propagation, alias rewriting); build search spaces around TLX knobs; evaluate on open production-style kernels.
  - Assumptions/Dependencies: Familiarity with Triton IRs (TTIR/TTGIR); stable API surface of TLX; reproducibility infrastructure.
- GPU programming education and training
  - Sectors: Education
  - What: Teach modern orchestration (MIMW, barriers, DSM, cluster control) with compact, high-level examples that still map directly to hardware behavior.
  - Tools/Workflows: Course labs comparing SIMT/SIMB/MIMW; TLX's survey-validated usability for cluster control; visual timelines from traces.
  - Assumptions/Dependencies: Access to recent GPUs; curated teaching materials; version-locked toolchains for labs.
- Performance debugging and profiling
  - Sectors: Software Tooling
  - What: Leverage TLX's explicit wait/arrive and region boundaries to produce clearer traces, identify pipeline bubbles, and attribute stalls to specific warp-group roles.
  - Tools/Workflows: Nsight/rocprof integration; custom markers from TLX tasks/barriers; microbench suites.
  - Assumptions/Dependencies: Tool support for new barrier/multicast instructions; consistent symbolization through Triton JITs.
- Library engineering and open-source ops refresh
  - Sectors: Software, AI/ML
  - What: Reimplement bottleneck ops (e.g., FlashAttention variants, fused optimizers, layernorm+matmul epilogues) with MIMW pipelines for portable performance.
  - Tools/Workflows: TLX kernel libraries packaged as Python wheels; per-arch code paths via TLX layout/lowering; CI benchmarks (GB200/H100/MI300).
  - Assumptions/Dependencies: Maintenance burden across architectures; ABI stability for downstream frameworks; auto-tuning infra.
- Datacenter efficiency and cost reduction
  - Sectors: Cloud, Energy, Finance (TCO)
  - What: Reduce GPU count and energy by improving kernel efficiency and shortening engineering cycles; deploy TLX-authored kernels already validated in production.
  - Tools/Workflows: Capacity planning models using measured TLX speedups; roll-out via feature flags; energy telemetry.
  - Assumptions/Dependencies: Realized speedups on full training/inference jobs (not just kernels); ops reliability and SRE playbooks for persistent kernels.
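The load-balancing benefit attributed to CLC in the serving use case above can be illustrated with a toy scheduling model. This is a plain-Python sketch under simplifying assumptions, not the CLC hardware interface: work items have known costs, workers are identical, and pulling the next item is free. The function names `static_makespan` and `dynamic_makespan` are hypothetical.

```python
import heapq


def static_makespan(costs, num_workers):
    """Finish time when item i is pre-assigned to worker i % num_workers."""
    loads = [0] * num_workers
    for i, cost in enumerate(costs):
        loads[i % num_workers] += cost
    return max(loads)


def dynamic_makespan(costs, num_workers):
    """Finish time when each worker pulls the next item as soon as it is free."""
    free_at = [0] * num_workers          # min-heap of times at which workers become idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)   # earliest idle worker takes the item
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Uneven workload: two long requests mixed with short ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]

if __name__ == "__main__":
    print(static_makespan(costs, 4))    # 16: one worker gets both long items
    print(dynamic_makespan(costs, 4))   # 9: long items land on different workers
```

Under the fixed round-robin assignment, both long items fall on the same worker; with pull-based assignment, idle workers absorb the short items and the second long item starts early on a different worker. Real kernels also pay a cost for each dynamic fetch, which this model ignores.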
Long-Term Applications
These opportunities require further research, hardware evolution, broader ecosystem integration, or standardization before wide deployment:
- Auto-synthesized orchestration (auto-scheduling of MIMW)
  - Sectors: AI/ML, Software, Academia
  - What: Compilers/search systems that infer warp roles, barrier placement, and CLC/DSM usage from high-level intents, using TLX as a constrained design space.
  - Tools/Workflows: ML-guided schedule search; cost models informed by TLX IR; integrated autotuners.
  - Assumptions/Dependencies: Robust performance models; exploration budgets; generalized correctness checks for concurrency.
- Cross-vendor standardization of MIMW primitives
  - Sectors: Policy, Standards, Semiconductors
  - What: Define a common DSL/IR for warp-group execution, cluster collectives, and local-memory semantics spanning NVIDIA/AMD (and future vendors).
  - Tools/Workflows: Working groups (e.g., Khronos/oneAPI-like); reference conformance suites; TLX-inspired IR proposals.
  - Assumptions/Dependencies: Vendor buy-in; alignment on DSM/cluster semantics; IP considerations.
- Hardware-software co-design for cluster collectives
  - Sectors: Semiconductors, AI/ML, HPC
  - What: Evolve hardware features (e.g., generalized multi-CTA collectives beyond paired-CTA MMA, richer multicast groups, chiplet-scale XCD collectives) with TLX-style first-class abstractions.
  - Tools/Workflows: Co-simulation with TLX IR; prototype ISA extensions; performance studies on representative kernels.
  - Assumptions/Dependencies: Hardware roadmaps; compiler/backend support; power/area trade-offs.
- Verified concurrency for GPU kernels
  - Sectors: Safety-critical (Healthcare, Automotive/Robotics), Tooling
  - What: Static analyzers and model checkers that verify mbarrier protocols, DSM aliasing, and absence of deadlocks/data races in TLX kernels.
  - Tools/Workflows: Formal semantics for TLX tasks/barriers; symbolic execution over TTGIR; CI gating with proofs.
  - Assumptions/Dependencies: Scalable verification methods; annotated kernels; integration with JIT-compiled pipelines.
- Integration with graph compilers and runtimes
  - Sectors: AI/ML, Software
  - What: Make TLX a backend target for TVM/XLA/Inductor, enabling graph-level passes to emit orchestrated kernels with MIMW patterns and cluster residency decisions.
  - Tools/Workflows: IR bridges and pattern libraries; graph-to-TLX scheduling passes; runtime policies for persistent clusters.
  - Assumptions/Dependencies: Stable interop layers; shared scheduling abstractions; runtime support for preemption/fairness.
- Expanded domain coverage (sparse/irregular and non-ML workloads)
  - Sectors: HPC, Graph Analytics, Databases
  - What: Extend MIMW idioms to sparse GEMM/SpMM, graph traversal, and database operators via specialized warp roles and cluster-level work redistribution.
  - Tools/Workflows: New TLX patterns for irregular pipelines; hybrid software queues with CLC; DSM-assisted compaction.
  - Assumptions/Dependencies: Hardware support for fine-grained async/atomics; robust load-balancing strategies; algorithmic advances.
- Edge and real-time systems
  - Sectors: Robotics, Automotive, AR/VR
  - What: Use MIMW pipelines to meet latency budgets by overlapping sensor I/O, pre/post-processing, and core inference on embedded GPUs.
  - Tools/Workflows: TLX profiles tuned for low-power SoCs; QoS-aware persistent kernels; integration with real-time schedulers.
  - Assumptions/Dependencies: TLX backend maturity on embedded platforms; thermal/power constraints; real-time preemption.
- Energy-efficiency policy and procurement guidance
  - Sectors: Policy, Public Sector, Sustainability
  - What: Codify best practices (e.g., cluster residency, multicast, local-memory reuse) into procurement standards and energy-efficiency benchmarks for GPU-accelerated systems.
  - Tools/Workflows: Benchmark suites built on TLX kernels; reporting frameworks linking orchestration patterns to energy metrics.
  - Assumptions/Dependencies: Stakeholder consensus; transparent measurements; evolving hardware baselines.
- Developer experience products around MIMW
  - Sectors: Software Tooling, Developer Platforms
  - What: Visual schedulers, schedule DSLs, and IDE support that generate TLX code from timeline sketches; pattern libraries for common pipelines.
  - Tools/Workflows: Timeline-to-TLX compilers; reusable templates for producer/consumer/MMA/epilogue; linting and refactoring tools.
  - Assumptions/Dependencies: UX research with practitioners; alignment with Triton's evolution; sustained maintenance.
- Runtime and OS policies for persistent clusters
  - Sectors: Cloud, OS/Runtime
  - What: Fair preemption and multi-tenancy policies that coexist with TLX persistent kernels and cluster residency without harming isolation/QoS.
  - Tools/Workflows: Cluster-aware schedulers; kernel-level telemetry; admission control tied to CLC usage.
  - Assumptions/Dependencies: Vendor runtime hooks; predictable preemption latency; security/isolation validation.
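To make the "verified concurrency" item above concrete, here is a very small sketch of the kind of check such tooling might start from. It is plain Python and not based on any TLX or Triton API: each role is a list of hypothetical `("wait", name)` and `("arrive", name)` steps, barriers are modeled as simple counters (real mbarriers use phases and arrival counts, which this ignores), and the check is a greedy simulation rather than a proof.

```python
def runs_to_completion(programs, initial=None):
    """Greedily simulate wait/arrive steps; return True if every role finishes.

    programs: dict mapping role name -> list of ("wait" | "arrive", barrier_name)
    initial:  dict mapping barrier_name -> starting credit count
    """
    counts = dict(initial or {})
    pcs = {role: 0 for role in programs}          # next step index per role
    progressed = True
    while progressed:
        progressed = False
        for role, ops in programs.items():
            while pcs[role] < len(ops):
                kind, bar = ops[pcs[role]]
                if kind == "arrive":
                    counts[bar] = counts.get(bar, 0) + 1
                elif kind == "wait":
                    if counts.get(bar, 0) == 0:
                        break                      # role is blocked for now
                    counts[bar] -= 1
                else:
                    raise ValueError(f"unknown step kind: {kind}")
                pcs[role] += 1
                progressed = True
    return all(pcs[role] == len(ops) for role, ops in programs.items())


# Two-iteration producer/consumer handshake over one buffer slot.
programs = {
    "producer": [("wait", "empty"), ("arrive", "full")] * 2,
    "consumer": [("wait", "full"), ("arrive", "empty")] * 2,
}

if __name__ == "__main__":
    print(runs_to_completion(programs, {"empty": 1}))  # True: slot starts out free
    print(runs_to_completion(programs))                # False: both roles wait forever
```

The second call models a common mistake: forgetting to mark the buffer as initially free, so the producer waits for "empty" while the consumer waits for "full". A production-grade checker would need to reason about all interleavings and about the actual barrier semantics.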
Note on deployment risk and feasibility: Many immediate applications depend on Hopper/Blackwell-class features (e.g., DSM, multicast, paired-CTA MMA, CLC). Where these are absent, fallbacks with reduced benefit may be possible (e.g., emulating DSM via global memory, or replacing paired-CTA MMA with independent tiles); their cost should be accounted for during planning and performance estimation.
Glossary
- Ampere: NVIDIA GPU microarchitecture generation that introduced warp-level matrix instructions. "Ampere's warp-level WMMA"
- ATen: PyTorch's core tensor and operator library used for high-performance kernels. "ATen"
- barrier_expect_bytes: An mbarrier operation (corresponding to PTX's expect-tx mechanism) that declares the expected byte count for an asynchronous response. "publishes the expected response size via barrier_expect_bytes"
- Blackwell: An NVIDIA GPU generation featuring paired-CTA tensor-core collectives. "Blackwell's CTA-pair tcgen05.mma [14]."
- Cluster Launch Control (CLC): Hardware-managed work-queue mechanism enabling dynamic, persistent work distribution across CTAs in a cluster. "Cluster Launch Control (CLC), distributed shared memory (DSM), and multicast are all asynchronous control and communication mechanisms."
- Compute Unit (CU): AMD's per-core GPU execution unit, analogous to NVIDIA's SM. "compute units (CUs)"
- Cooperative Thread Array (CTA): NVIDIA's term for a thread block scheduled on an SM that cooperates via shared memory and sync. "cooperative thread arrays (CTAs)"
- CUDA: NVIDIA's SIMT GPU programming model and toolchain. "CUDA thread-block clustering provides the execution domain"
- CuTeDSL: A CUDA ecosystem domain-specific language for blocked/tiled tensor programs. "CuTeDSL [19]"
- Distributed Shared Memory (DSM): Cluster-scoped shared memory allowing a CTA to address another CTA's shared-memory region. "recent NVIDIA architectures expose DSM: a CTA can directly access the shared memory of another CTA"
- GEMM: General Matrix Multiply, a canonical high-performance linear algebra kernel. "GEMM performance on NVIDIA GB200."
- Gluon: A lower-level GPU programming system exposing fine-grained control akin to SIMT. "systems such as Gluon[29]"
- Hopper: NVIDIA GPU generation introducing warp-group MMA instructions. "Hopper's warp-group WGMMA"
- LDS (Local Data Share): AMD's on-chip software-managed memory analogous to NVIDIA shared memory. "shared memory / LDS"
- mbarrier: A barrier object held in shared memory, used with arrive/wait operations to coordinate asynchronous operations and data dependencies (a synchronization primitive, not a memory fence). "per-stage mbarriers tracking empty and full slots"
- MIMW (Multi-Instruction, Multi-Warp): Programming model where different warp groups run distinct instruction streams with explicit dependencies. "It is built around MIMW (Multi-Instruction, Multi-Warp)"
- multicast: Cluster-level data movement where one TMA transfer populates shared-memory tiles for multiple CTAs. "Another important class is TMA multicast"
- Paired-CTA MMA: A tensor-core collective where two CTAs jointly issue a single MMA instruction and produce a shared result. "paired-CTA MMA: two CTAs jointly issue one tensor-core instruction"
- Persistent execution: Long-lived kernel execution model where CTAs repeatedly fetch new work to improve load balance. "enabling dynamic persistent execution"
- PTX: NVIDIA's low-level virtual ISA used as a target for GPU compilers. "wrap the underlying PTX primitive"
- SIMB (Single-Instruction, Multi-Block): Blocked/tile-centric programming model where one program is replicated across blocks. "single-instruction, multi-block (SIMB)"
- SIMT (Single-Instruction, Multi-Thread): Execution model where threads in a warp follow the same instruction stream over different data. "single-instruction, multi-thread (SIMT)"
- Streaming Multiprocessor (SM): NVIDIA's per-core GPU execution unit hosting warps, registers, and shared memory. "streaming multiprocessors (SMs)"
- Tensor Core: Specialized matrix-multiply hardware units on NVIDIA GPUs for high-throughput MMA. "including tensor cores, asynchronous copy engines, and tensor memory accelerators (TMA)"
- Tensor Memory Accelerator (TMA): Hardware engine for asynchronous tensor-shaped memory transfers and multicast. "A single Tensor Memory Accelerator transfer can populate shared-memory tiles for multiple CTAs in a cluster"
- ThunderKitten: A GPU DSL that extends tile-based programming with more explicit orchestration. "ThunderKitten[26]"
- TileLang: A GPU DSL exploring tile-centric programming with low-level scheduling control. "TileLang[30]"
- TLX (Triton Low-level Language Extensions): An embedded extension to Triton that exposes MIMW orchestration, async ops, and cluster-aware control. "We present TLX (Triton Low-level Language Extensions)"
- Triton: A tile-centric GPU DSL that raises abstraction from threads to blocks for productive high-performance kernels. "Triton[28] raises abstraction from threads to blocked tensor programs"
- TritonGPU: Triton's GPU-target IR layer with canonical layout encodings used during lowering. "TritonGPU's canonical representation"
- TTGIR: Triton GPU Intermediate Representation capturing target-aware GPU semantics. "passes the resulting TTGIR to the same downstream tile-level optimizations."
- TTIR: Triton IR, the intermediate representation for tile-level tensor computations before GPU specialization. "Triton lowers the kernel to TTIR"
- Volta: NVIDIA GPU generation with early tensor-core MMA at quad granularity. "Volta's quad-level MMA"
- Warp: A hardware-executed group of threads that run in lockstep under SIMT. "warps on Nvidia and wavefronts on AMD"
- Warp group: A cooperating set of warps that can specialize in different roles and run distinct instruction streams. "The critical control point is the warp group"
- Warp specialization: Mapping concurrent tasks onto distinct warp groups within a CTA to overlap roles. "Warp specialization [3, 4, 8] maps async tasks onto disjoint warp groups within one CTA."
- Wavefront: AMD's SIMT execution group analogous to NVIDIA's warp. "wavefronts on AMD"
- WGMMA: Warp-Group Matrix Multiply-Accumulate instruction class introduced with Hopper. "Hopper's warp-group WGMMA"
- WMMA: Warp-level Matrix Multiply-Accumulate instruction class, which the paper associates with Ampere. "Ampere's warp-level WMMA"
- XCD Group: A hardware grouping construct (chiplet-local waves) on certain GPUs. "XCD Group A group of waves in the same chiplet."
- tcgen05.mma: Blackwell-generation tensor-core MMA instruction for CTA-pair collectives. "tcgen05.mma"