- The paper introduces a hierarchical distributed tiling model spanning core, device, and task levels to efficiently manage parallel tensor program execution.
- It fuses computation and communication using JIT-pluggable swizzling modes and optimized code generation, achieving up to 30% speedup over expert-tuned approaches.
- Evaluation shows significant improvements in scalability and performance, with accelerated LLM inference and substantial GPU-hour savings in production training.
DITRON: A Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
Motivation and Positioning
Efficiently programming distributed kernels for large-scale deep learning, especially LLMs and MoE models, remains difficult because of a split between highly optimized but inflexible vendor libraries (e.g., NCCL, cuBLAS) and higher-level tensor compilers (e.g., Triton, TileLang, Pallas) that lack explicit multi-level support for the complex hardware hierarchies of modern clusters. Existing software approaches do not provide an expressive, portable, cluster-scale abstraction that can flexibly overlap computation and communication or adapt to model and hardware heterogeneity with minimal programmer burden. DITRON targets these gaps with a hierarchical distributed programming model spanning core, device, and task abstraction levels atop the widely deployed Triton compiler stack, coupled with a hardware-agnostic IR and optimizations for overlapping and tiling.
System Architecture and Abstraction
DITRON's architecture is decomposed into three layers:
- Front-End: introduces hierarchical tiling at three levels.
  - Core-level: Fine-grained, static-shaped tiles mapped to hardware units (Tensor Cores, TMA engines); designed to be syntax-compatible with standard Triton.
  - Device-level: Coarse-grained, dynamically shaped tiles corresponding to DMA/RDMA domains for intra-node and inter-node data movement. Incorporates asynchronous, shape-dynamic communication, which is critical for MoE-style token routing.
  - Task-level: Global kernel fusion and DAG-level tiling, enabling communication/computation fusion for persistent MegaKernels and reducing kernel launch and synchronization overhead.
- Mid-End: implements compute-communication overlap via distributed swizzling. The compiler formalizes two swizzling modes:
  - Gather: For data-dependent computations (e.g., AllGather+GEMM), enables early issue of remote fetches and HBM caching.
  - Scatter: For computations whose results must be sent to other ranks (e.g., GEMM+ReduceScatter), prioritizes tiles needing long-haul transfer, front-loading their latency.
  The swizzling logic is JIT-pluggable and stateless, exposing practical primitives for kernel developers.
- Back-End: provides a suite of OpenSHMEM-compliant, hardware-agnostic primitives for communication, synchronization, and address mapping, which are instantiated for both NVSHMEM (NVIDIA) and rocSHMEM (AMD) platforms. The compiler stack lowers from Distributed IR to hardware-specific code using LLVM and vendor libraries.
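The two swizzling modes can be illustrated with a minimal Python sketch. It is not DITRON's API: the function names `gather_order` and `scatter_order`, and the simple ring ordering they use, are assumptions made for illustration. The sketch treats every remote rank as equally distant, whereas a real implementation would presumably rank destinations by topology.

```python
def gather_order(rank: int, world_size: int) -> list[int]:
    """Order in which `rank` consumes source shards in an AllGather+GEMM fusion.

    Hypothetical ordering: the local shard comes first because it needs no
    transfer; remaining shards follow in ring order, so remote fetches can be
    issued while tiles from earlier shards are still being computed.
    """
    return [(rank + step) % world_size for step in range(world_size)]


def scatter_order(rank: int, world_size: int) -> list[int]:
    """Order in which `rank` produces destination shards in a GEMM+ReduceScatter fusion.

    Hypothetical ordering: shards bound for remote ranks come first so their
    transfers start early; the shard that stays local is computed last.
    """
    return [(rank + step) % world_size for step in range(1, world_size + 1)]


if __name__ == "__main__":
    for r in range(4):
        print(r, gather_order(r, 4), scatter_order(r, 4))
```

Both functions are pure and depend only on the rank and world size, which mirrors the stateless property described above: the same mapping can be evaluated independently by every tile without shared bookkeeping.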
Optimizations
DITRON's codegen pipeline realizes several hardware- and problem-aware low-level optimizations:
- Low-Latency (LL) Protocols: Specialized protocols that minimize control-path synchronization, analogous to NCCL's LL protocol.
- Device-to-Device (D2D) Copy Fusion: Merges data movement operations into computation/communication kernels to minimize launch jitter and keep occupancy deterministic.
- PCIe-aware Consistency: Implements software barriers via volatile memory semantics to ensure correctness without assuming hardware cache coherence or atomicity.
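The idea behind a low-latency protocol can be sketched as flag-in-band signaling: the payload and a validity flag travel in the same word, so the receiver polls the data slot itself instead of waiting on a separate synchronization step. The Python below is a host-side illustration only. The 4-byte payload plus 4-byte flag layout and the names `ll_pack` and `ll_try_unpack` are assumptions, and the summary does not state which layout DITRON actually uses.

```python
import struct


def ll_pack(payload: int, flag: int) -> bytes:
    """Pack a 32-bit payload and a 32-bit flag into one 8-byte word.

    On a GPU the word would be written with a single 8-byte store so that
    payload and flag become visible together (assumed layout, for illustration).
    """
    return struct.pack("<II", payload & 0xFFFFFFFF, flag & 0xFFFFFFFF)


def ll_try_unpack(word: bytes, expected_flag: int):
    """Return the payload if the flag matches the current round, else None.

    A receiver would call this in a polling loop; a stale or unwritten slot
    carries a different flag and is rejected.
    """
    payload, flag = struct.unpack("<II", word)
    return payload if flag == expected_flag else None


if __name__ == "__main__":
    slot = bytes(8)                      # unwritten slot: flag is 0
    print(ll_try_unpack(slot, 1))        # nothing has arrived for round 1
    slot = ll_pack(0xDEADBEEF, 1)        # sender writes data and flag together
    print(hex(ll_try_unpack(slot, 1)))   # receiver now sees the payload
```

The trade-off in this style of protocol is bandwidth for latency: half of every word carries the flag, but no separate barrier or signal is needed before the data can be consumed.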
Evaluation Highlights
DITRON is systematically evaluated on NVIDIA H800 (Hopper) and AMD GPUs across inference and training contexts, including large models (Qwen3-32B, LLaMA3-70B, Mixtral). Key findings:
- Microbenchmark Speedup: On critical kernel fusion workloads (AG-GEMM, GEMM-RS, GEMM-AR, AG-MoE, MoE-AR), DITRON delivers 6–30% speedup versus expert-tuned CUDA libraries, and 5–30% end-to-end improvement when integrated into vLLM for LLM inference. On AG-MoE, the speedup over the baseline exceeds 19x.
- Module-level and End-to-End Inference: For vLLM inference with long sequences and large batch sizes, DITRON outperforms the baseline by up to 30%, sustaining a throughput of over 17k tokens/s on Qwen3-32B, with scalability maintained as batch size increases. MegaKernel scheduling yields over 6x latency improvement versus PyTorch Eager.
- Training (TP, SP, EP, PP): Achieves >10% MFU improvement and 500k GPU-hour savings per month in production LLM training. For MoE and optimizer primitives, bitwise-accurate acceleration and >20% optimizer step improvements are demonstrated.
- Portability: On AMD, geometric-mean speedup ranges from 2% to 38% over rocBLAS+RCCL. On PCIe GPUs, mean speedup is 8.33x over equivalent cuBLAS+NCCL pipelines.
- Scalability: Both weak and strong scaling are validated up to 128 GPUs, maintaining performance advantage when per-rank problem sizes are sufficient to hide communication.
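The condition in the scalability finding, that per-rank problem sizes must be large enough to hide communication, can be made concrete with an idealized overlap model. This is a back-of-the-envelope sketch rather than the paper's cost model: it assumes perfect overlap with no interference between compute and communication, and the timings in the usage example are hypothetical.

```python
def exposed_comm(t_compute: float, t_comm: float) -> float:
    """Communication time that cannot be hidden behind computation."""
    return max(0.0, t_comm - t_compute)


def overlapped_time(t_compute: float, t_comm: float) -> float:
    """Idealized runtime when communication runs concurrently with compute."""
    return t_compute + exposed_comm(t_compute, t_comm)


def overlap_speedup(t_compute: float, t_comm: float) -> float:
    """Speedup of the overlapped schedule over running the two phases back to back."""
    return (t_compute + t_comm) / overlapped_time(t_compute, t_comm)


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only to show the two regimes.
    print(exposed_comm(8.0, 2.0), overlap_speedup(8.0, 2.0))  # compute-bound
    print(exposed_comm(2.0, 8.0), overlap_speedup(2.0, 8.0))  # communication-bound
```

In this model the speedup is bounded by 2x and is reached only when compute and communication take equal time. Once the per-rank compute time falls below the communication time, the exposed communication grows and the benefit of overlap shrinks, which is consistent with the caveat stated for scaling.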
Implications
DITRON marks a convergence of productivity, flexibility, and hardware efficiency in distributed tensor program development:
- For practitioners, DITRON collapses development complexity, enabling kernel authors to retarget kernels to distributed and heterogeneous settings with minor code modification, thanks to the hierarchical tiling model and hardware-agnostic IR.
- For systems researchers, DITRON's composition with the Triton ecosystem offers a pathway toward auto-tuning and more sophisticated dynamic communication-computation scheduling policies at scale, as kernel logic and communication become pluggable and decoupled from explicit network/cluster details.
- For the theoretical community, DITRON's equivalence or superiority over manual expert scheduling, with natively bitwise-identical results and improved MFU, demonstrates that the compiler-based approach can close the gap with hand-tuned CUDA/NCCL pipelines even for overlapping and fused kernels.
- Toward hardware evolution, the decoupling of codegen from device specifics positions DITRON as a substrate for new backends (custom NICs, FPGAs), where primitives and translation logic can be added with minimal changes at the IR level.
Future Directions
Potential avenues for extension and application include:
- Auto-scheduling and dynamic performance modeling within DITRON's mid-end, leveraging the uniform tile-level view to drive distributed cost modeling.
- Integration with advanced agentic and dynamic model architectures (e.g., dynamic routing MoE, agent frameworks) enabled by device-level dynamic shape abstractions.
- Support for emerging hardware topologies (NVSwitch, composable disaggregated memory, in-network compute), which can be handled by extending back-end primitives.
- Formal verification and correctness-by-construction in distributed synchronization and scheduling, leveraging the explicit barrier/signal model.
Conclusion
DITRON demonstrates that a distributed multi-level tiling compiler can bring together the flexibility of domain-specific languages and the efficiency of hand-optimized libraries, matching or outperforming expert-tuned CUDA pipelines while simplifying kernel and model development for emerging deep learning workloads. Its design, grounded in hierarchical abstraction, overlap-aware scheduling, and hardware-agnostic primitives, positions it as a foundation for next-generation distributed AI systems.
Reference: "DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs" (2605.02953)