- The paper introduces ML-Triton, a multi-level compilation framework that aligns optimizations with the hierarchical structure of modern GPUs for enhanced performance.
- ML-Triton extends the Triton language with compiler hints for tiling and a warp-level API, enabling developers to manage resource orchestration and partition workloads explicitly.
- Experimental results demonstrate ML-Triton achieves over 95% of expert kernel performance on Intel GPUs for key AI workloads like GEMM and Attention, validating its effectiveness in delivering near-optimal efficiency.
Introduction
ML-Triton extends the Triton DSL with a multi-level compilation flow that mirrors the inherent hierarchical structure of modern GPUs. It employs a staged lowering strategy that begins at the workgroup level and incrementally refines the intermediate representation down to warp and intrinsic levels. This approach addresses the limitations of the early lowering used in conventional GPU programming flows (e.g., CUDA or SYCL) and brings performance optimizations closer to fine-grained hardware capabilities.
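For context, vanilla Triton already expresses a kernel at the workgroup (program) level; the minimal GEMM below uses only standard Triton constructs (masks omitted for brevity, M/N/K assumed divisible by the block sizes, fp32 output) and is representative of the kind of kernel whose lowering ML-Triton stages through the warp and intrinsic levels.

```python
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                BLOCK_K: tl.constexpr):
    # Each program (workgroup) computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):  # march along the K dimension
        acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```

Note that nothing in this kernel says how the BLOCK_M x BLOCK_N tile is split across warps or threads; that is exactly the decision ML-Triton defers to later passes.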
Multi-Level Compilation Flow
The core contribution of ML-Triton is its multi-level compilation strategy, which systematically lowers high-level Triton semantics to the intrinsic operations supported directly by the GPU hardware. The process is divided into a sequence of passes:
- Convert-Triton-to-TritonGPU-Warp: This pass analyzes the kernel's workload and determines per-warp layout encodings tailored to specific operations such as GEMM and FlashAttention-2. It follows a delayed-lowering strategy, reasoning in terms of sizePerWarp rather than immediately partitioning per thread, so that the intermediate representation retains its higher-level semantic structure.
- Distribute-to-Warps: At this stage, the workload defined at the workgroup level is methodically partitioned across warps according to the layout encoding. The parameters embedded in the BlockedEncoding (e.g., sizePerWarp and WarpsPerCTA) drive a systematic, workload-aware distribution that can exploit blocked load/MMA instructions or similar GPU intrinsics; the first sketch after this list illustrates the arithmetic.
- Match-Target-Size: Operations are refined to match the hardware-supported sizes of LLVM intrinsics. Where necessary, this step fragments operations with tt.extract so that each subdivision conforms to the optimal intrinsic size and aligns exactly with the SIMT/SIMD units; see the second sketch below.
- Convert-TritonGPU-to-LLVM:
The final lowering translates the specialized Triton operations to LLVM IR by mapping them to corresponding LLVM intrinsics. This pass supports both SIMT and SIMD modes, enabling direct hardware utilization on Intel architectures and other target GPUs.
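To make the distribute-to-warps step concrete, here is a minimal plain-Python sketch (not compiler code) of how a warp's sub-tile origin falls out of the BlockedEncoding parameters; the concrete shapes are illustrative assumptions, not values taken from the paper.

```python
# Sketch of distribute-to-warps style partitioning: a warp's sub-tile origin
# is derived from the BlockedEncoding parameters. Shapes are illustrative.

def warp_tile_origin(warp_id, warps_per_cta, size_per_warp):
    """Map a linear warp id to the (row, col) origin of its sub-tile."""
    warps_m, warps_n = warps_per_cta   # e.g., a 4 x 2 grid of warps per workgroup
    tile_m, tile_n = size_per_warp     # e.g., each warp owns a 64 x 128 tile of C
    wm, wn = warp_id // warps_n, warp_id % warps_n
    return wm * tile_m, wn * tile_n

# A 256 x 256 workgroup tile split over 8 warps, each owning 64 x 128:
for w in range(8):
    print(w, warp_tile_origin(w, (4, 2), (64, 128)))
```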
This hierarchical lowering not only decouples inter- and intra-layer optimizations—thus yielding a cleaner compiler design—but also permits more granular control over how workloads are partitioned and executed, maximizing hardware parallelism.
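Similarly, a sketch of the match-target-size idea: a warp-level tile is broken into intrinsic-sized fragments (materialized as tt.extract in the IR). The 8 x 16 shape below is illustrative of a DPAS-style result tile, not a constant from the paper.

```python
# Sketch of match-target-size fragmentation: enumerate the origins of the
# intrinsic-sized pieces a warp tile decomposes into. The 8 x 16 intrinsic
# shape is an illustrative assumption.

def fragment_origins(tile_m, tile_n, intrin_m=8, intrin_n=16):
    assert tile_m % intrin_m == 0 and tile_n % intrin_n == 0
    return [(i, j)
            for i in range(0, tile_m, intrin_m)
            for j in range(0, tile_n, intrin_n)]

# A 64 x 128 warp tile yields 8 * 8 = 64 intrinsic-sized operations:
print(len(fragment_origins(64, 128)))  # 64
```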
Language Extensions and Compiler Hints
ML-Triton introduces extensions to the Triton language that allow developers to inject compiler hints and perform warp-level programming. Key aspects include:
- Compiler hints for tiling: These hints let developers define the tiling strategy (horizontal, vertical, or square) applied at the root operation, which is pivotal for workload-specific optimization given the divergent needs of dense operations like GEMM versus memory-bound attention mechanisms (see the first sketch after this list).
- Warp-level programming interface: An extended warp-level API is provided through constructs such as warp_level metadata and tl.warp_id(). Shared local memory (SLM) allocation can be managed via tl.alloc, and cross-warp communication is enhanced through an augmented tl.reduce. This level of control minimizes dependence on opaque compiler heuristics and allows explicit resource orchestration to improve runtime performance (see the second sketch below).
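The exact hint syntax is not reproduced in this summary, so the sketch below models what the three tiling strategies mean: the hint selects how a linear program id maps onto the 2-D grid of output tiles, with the square case following the grouped ordering familiar from Triton's matmul tutorial.

```python
# Emulation of the three tiling strategies a hint can select. In ML-Triton the
# choice is attached to the root op as a compiler hint; here it is modeled as a
# remapping of a linear program id onto the 2-D grid of output tiles.

def tile_coords(pid, grid_m, grid_n, tiling="horizontal", group=4):
    if tiling == "horizontal":   # row-major walk over output tiles
        return pid // grid_n, pid % grid_n
    if tiling == "vertical":     # column-major walk
        return pid % grid_m, pid // grid_m
    # "square": grouped walk (as in Triton's matmul tutorial) for better L2 reuse
    pids_per_group = group * grid_n
    first_m = (pid // pids_per_group) * group
    rows = min(grid_m - first_m, group)  # the last group may be short
    return (first_m + (pid % pids_per_group) % rows,
            (pid % pids_per_group) // rows)
```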
The ability to combine compiler hints with low-level warp programming offers the dual advantage of rapid prototyping with Triton's high-level abstractions and exhaustive fine-tuning where required; a usage sketch follows.
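A short sketch of the warp-level API in use; tl.warp_id(), tl.alloc, and the cross-warp tl.reduce are the extensions named above, and the exact signatures shown are assumptions rather than the authoritative interface.

```python
import triton
import triton.language as tl

@triton.jit
def per_warp_scale(src_ptr, dst_ptr, scale,
                   NUM_WARPS: tl.constexpr, WARP_BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    wid = tl.warp_id()  # ML-Triton extension: this warp's index in the workgroup
    # Each warp addresses its own WARP_BLOCK-sized slice explicitly, instead of
    # having the compiler distribute one workgroup-sized tensor across warps.
    offs = (pid * NUM_WARPS + wid) * WARP_BLOCK + tl.arange(0, WARP_BLOCK)
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs) * scale)
    # In a FlashAttention-2 style kernel, tl.alloc would stage per-warp partial
    # results in SLM and the augmented tl.reduce would combine them across warps.
```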
Performance Evaluation
The empirical evaluation demonstrates strong performance on Intel's Ponte Vecchio (PVC) Max 1550 GPU, with detailed comparisons against the expert-written XeTLA library:
- GEMM: ML-Triton achieved a geometric mean of 96% of XeTLA's performance on compute-bound GEMM and 94% on memory-bound GEMM, indicating that the multi-level lowering matches the operational granularity required for high-throughput dense computation.
- FlashAttention-2: The gap relative to XeTLA stayed under 5%, demonstrating that the explicit data partitioning between warps, integral to the FlashAttention-2 approach, is effectively supported by ML-Triton's infrastructure.
- Overall: Across these workloads, ML-Triton sustained above 95% of XeTLA's performance, confirming that the staged lowering mechanism and hardware-aware operator splits address both compute-intensive and memory-bound kernels.
These performance benchmarks substantiate that the multi-level compilation not only simplifies the compiler design but also achieves near expert-level kernel performance without the need for highly specialized, hand-tuned implementations.
Conclusion
ML-Triton represents a significant step in the evolution of GPU programming frameworks by introducing a multi-level lowering strategy that mirrors the hardware’s hierarchical structure. With its workload-aware optimizations, explicit compiler hints, and low-level warp-level API, ML-Triton achieves robust performance—attaining above 95% of expert-written kernel efficiency on Intel hardware. The modularity of its compilation passes allows for targeted optimizations at the workgroup, warp, and intrinsic levels, which is instrumental for modern dense operations critical to LLM and AI workloads. This architecture not only simplifies the development process but also brings fine-tuned control necessary for high-performance GPU programming in contemporary accelerator landscapes.