- The paper introduces ML-Triton, a multi-level compilation framework that aligns optimizations with the hierarchical structure of modern GPUs for enhanced performance.
- ML-Triton extends the Triton language with compiler hints for tiling and a warp-level API, enabling developers to manage resource orchestration and partition workloads explicitly.
- Experimental results demonstrate ML-Triton achieves over 95% of expert kernel performance on Intel GPUs for key AI workloads like GEMM and Attention, validating its effectiveness in delivering near-optimal efficiency.
Introduction
ML-Triton extends the Triton DSL with a multi-level compilation flow that mirrors the inherent hierarchical structure of modern GPUs. It employs a staged lowering strategy that begins at the workgroup level and incrementally refines the intermediate representation down to warp and intrinsic levels. This approach addresses the limitations of the early lowering used in conventional GPU programming flows (e.g., CUDA or SYCL) and brings performance optimizations closer to fine-grained hardware capabilities.
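For context, vanilla Triton already expresses a kernel at the workgroup (program) level; the minimal GEMM below uses only standard Triton constructs (masks omitted for brevity, M/N/K assumed divisible by the block sizes, fp32 output) and is representative of the kind of kernel whose lowering ML-Triton stages through the warp and intrinsic levels.

```python
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                BLOCK_K: tl.constexpr):
    # Each program (workgroup) computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):  # march along the K dimension
        acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```

Note that nothing in this kernel says how the BLOCK_M x BLOCK_N tile is split across warps or threads; that is exactly the decision ML-Triton defers to later passes.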
Multi-Level Compilation Flow
The core contribution of ML-Triton is its multi-level compilation strategy, which systematically lowers high-level Triton semantics to the intrinsic operations supported directly by the GPU hardware. The process is divided into a sequence of passes:
- Convert-Triton-to-TritonGPU-Warp: This pass analyzes the kernel's workload and determines per-warp layout encodings tailored to specific operations such as GEMM and FlashAttention-2. It follows a delayed-lowering strategy, reasoning in terms of sizePerWarp rather than immediately partitioning per thread, so that the intermediate representation retains its higher-level semantic structure.
- Distribute-to-Warps: At this stage, the workload defined at the workgroup level is methodically partitioned across warps according to the layout encoding. The parameters embedded in the BlockedEncoding (e.g., sizePerWarp and WarpsPerCTA) drive a systematic, workload-aware distribution that can exploit blocked load/MMA instructions or similar GPU intrinsics; the first sketch after this list illustrates the arithmetic.
- Match-Target-Size: Operations are refined to match the hardware-supported sizes of LLVM intrinsics. Where necessary, this step fragments operations with tt.extract so that each subdivision conforms to the optimal intrinsic size and aligns exactly with the SIMT/SIMD units; see the second sketch below.
- Convert-TritonGPU-to-LLVM:
The final lowering translates the specialized Triton operations to LLVM IR by mapping them to corresponding LLVM intrinsics. This pass supports both SIMT and SIMD modes, enabling direct hardware utilization on Intel architectures and other target GPUs.
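To make the distribute-to-warps step concrete, here is a minimal plain-Python sketch (not compiler code) of how a warp's sub-tile origin falls out of the BlockedEncoding parameters; the concrete shapes are illustrative assumptions, not values taken from the paper.

```python
# Sketch of distribute-to-warps style partitioning: a warp's sub-tile origin
# is derived from the BlockedEncoding parameters. Shapes are illustrative.

def warp_tile_origin(warp_id, warps_per_cta, size_per_warp):
    """Map a linear warp id to the (row, col) origin of its sub-tile."""
    warps_m, warps_n = warps_per_cta   # e.g., a 4 x 2 grid of warps per workgroup
    tile_m, tile_n = size_per_warp     # e.g., each warp owns a 64 x 128 tile of C
    wm, wn = warp_id // warps_n, warp_id % warps_n
    return wm * tile_m, wn * tile_n

# A 256 x 256 workgroup tile split over 8 warps, each owning 64 x 128:
for w in range(8):
    print(w, warp_tile_origin(w, (4, 2), (64, 128)))
```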
This hierarchical lowering not only decouples inter- and intra-layer optimizations—thus yielding a cleaner compiler design—but also permits more granular control over how workloads are partitioned and executed, maximizing hardware parallelism.
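Similarly, a sketch of the match-target-size idea: a warp-level tile is broken into intrinsic-sized fragments (materialized as tt.extract in the IR). The 8 x 16 shape below is illustrative of a DPAS-style result tile, not a constant from the paper.

```python
# Sketch of match-target-size fragmentation: enumerate the origins of the
# intrinsic-sized pieces a warp tile decomposes into. The 8 x 16 intrinsic
# shape is an illustrative assumption.

def fragment_origins(tile_m, tile_n, intrin_m=8, intrin_n=16):
    assert tile_m % intrin_m == 0 and tile_n % intrin_n == 0
    return [(i, j)
            for i in range(0, tile_m, intrin_m)
            for j in range(0, tile_n, intrin_n)]

# A 64 x 128 warp tile yields 8 * 8 = 64 intrinsic-sized operations:
print(len(fragment_origins(64, 128)))  # 64
```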
Language Extensions and Compiler Hints
ML-Triton introduces extensions to the Triton language that allow developers to inject compiler hints and perform warp-level programming. Key aspects include:
- Compiler hints for tiling: These hints let developers define the tiling strategy (horizontal, vertical, or square) applied at the root operation, which is pivotal for workload-specific optimization given the divergent needs of dense operations like GEMM versus memory-bound attention mechanisms (see the first sketch after this list).
- Warp-level programming interface: An extended warp-level API is provided through constructs such as warp_level metadata and tl.warp_id(). Shared local memory (SLM) allocation can be managed via tl.alloc, and cross-warp communication is enhanced through an augmented tl.reduce. This level of control minimizes dependence on opaque compiler heuristics and allows explicit resource orchestration to improve runtime performance (see the second sketch below).
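The exact hint syntax is not reproduced in this summary, so the sketch below models what the three tiling strategies mean: the hint selects how a linear program id maps onto the 2-D grid of output tiles, with the square case following the grouped ordering familiar from Triton's matmul tutorial.

```python
# Emulation of the three tiling strategies a hint can select. In ML-Triton the
# choice is attached to the root op as a compiler hint; here it is modeled as a
# remapping of a linear program id onto the 2-D grid of output tiles.

def tile_coords(pid, grid_m, grid_n, tiling="horizontal", group=4):
    if tiling == "horizontal":   # row-major walk over output tiles
        return pid // grid_n, pid % grid_n
    if tiling == "vertical":     # column-major walk
        return pid % grid_m, pid // grid_m
    # "square": grouped walk (as in Triton's matmul tutorial) for better L2 reuse
    pids_per_group = group * grid_n
    first_m = (pid // pids_per_group) * group
    rows = min(grid_m - first_m, group)  # the last group may be short
    return (first_m + (pid % pids_per_group) % rows,
            (pid % pids_per_group) // rows)
```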
The ability to combine compiler hints with low-level warp programming offers the dual advantage of rapid prototyping with Triton's high-level abstractions and exhaustive fine-tuning where required; a usage sketch follows.
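A short sketch of the warp-level API in use; tl.warp_id(), tl.alloc, and the cross-warp tl.reduce are the extensions named above, and the exact signatures shown are assumptions rather than the authoritative interface.

```python
import triton
import triton.language as tl

@triton.jit
def per_warp_scale(src_ptr, dst_ptr, scale,
                   NUM_WARPS: tl.constexpr, WARP_BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    wid = tl.warp_id()  # ML-Triton extension: this warp's index in the workgroup
    # Each warp addresses its own WARP_BLOCK-sized slice explicitly, instead of
    # having the compiler distribute one workgroup-sized tensor across warps.
    offs = (pid * NUM_WARPS + wid) * WARP_BLOCK + tl.arange(0, WARP_BLOCK)
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs) * scale)
    # In a FlashAttention-2 style kernel, tl.alloc would stage per-warp partial
    # results in SLM and the augmented tl.reduce would combine them across warps.
```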
Performance Evaluation
The empirical evaluation demonstrates strong performance on Intel's Ponte Vecchio (PVC) Max 1550 GPU, with detailed comparisons against the expert-written XeTLA library:
- GEMM: ML-Triton achieved a geometric mean of 96% of XeTLA's performance on compute-bound GEMM and 94% on memory-bound GEMM, indicating that the multi-level lowering matches the operational granularity required for high-throughput dense computation.
- FlashAttention-2: The gap relative to XeTLA stayed under 5%, demonstrating that the explicit data partitioning between warps, integral to the FlashAttention-2 approach, is effectively supported by ML-Triton's infrastructure.
- Overall: Across these workloads, ML-Triton sustained above 95% of XeTLA's performance, confirming that the staged lowering mechanism and hardware-aware operator splits address both compute-intensive and memory-bound kernels.
These performance benchmarks substantiate that the multi-level compilation not only simplifies the compiler design but also achieves near expert-level kernel performance without the need for highly specialized, hand-tuned implementations.
Conclusion
ML-Triton represents a significant step in the evolution of GPU programming frameworks by introducing a multi-level lowering strategy that mirrors the hardware’s hierarchical structure. With its workload-aware optimizations, explicit compiler hints, and low-level warp-level API, ML-Triton achieves robust performance—attaining above 95% of expert-written kernel efficiency on Intel hardware. The modularity of its compilation passes allows for targeted optimizations at the workgroup, warp, and intrinsic levels, which is instrumental for modern dense operations critical to LLM and AI workloads. This architecture not only simplifies the development process but also brings fine-tuned control necessary for high-performance GPU programming in contemporary accelerator landscapes.