DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Published 2 May 2026 in cs.PL | (2605.02953v1)

Abstract: The scaling of LLMs is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, we propose DITRON, a scalable tile-level compiler that democratizes high-performance distributed kernel development. DITRON introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction allows DITRON to support diverse parallelism strategies while abstracting away the complexity of inter-node and intra-node communication. Evaluated across large-scale clusters, DITRON achieves performance parity with or exceeding expert-tuned CUDA libraries, delivering speedups of 6%-30% on isolated kernels and 5%-30% on end-to-end inference in vLLM. Furthermore, DITRON demonstrates strong portability, achieving significant speedups on both NVIDIA and AMD platforms. DITRON has been deployed at the enterprise level for both training and inference. It achieves an MFU improvement of over 10% in training tasks, saving approximately 500,000 GPU hours of training cost per month. For inference tasks, it delivers an end-to-end gain of over 20% and has been applied to cloud service inference and edge inference scenarios.

Authors (19)

Summary

The paper introduces a hierarchical distributed tiling model spanning core, device, and task levels to efficiently manage parallel tensor program execution.
It fuses computation and communication using JIT-pluggable swizzling modes and optimized code generation, achieving up to 30% speedup over expert-tuned approaches.
Evaluation shows significant improvements in scalability and performance, with accelerated LLM inference and substantial GPU-hour savings in production training.

DITRON: A Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Motivation and Positioning

Efficiently programming distributed kernels for large-scale deep learning, especially LLMs and MoE models, remains difficult because of a split between highly optimized but inflexible libraries (e.g., NCCL, cuBLAS) and higher-level tensor compilers (e.g., Triton, TileLang, Pallas) that lack explicit multi-level support for the complex hardware hierarchies of modern clusters. Existing approaches do not provide an expressive, portable, cluster-scale abstraction that can flexibly overlap computation with communication or adapt to model and hardware heterogeneity with little programmer burden. DITRON targets these gaps with a hierarchical distributed programming model spanning core, device, and task abstraction levels, built atop the widely deployed Triton compiler stack and coupled with a hardware-agnostic IR and optimizations for overlapping and tiling.

System Architecture and Abstraction

DITRON's architecture is decomposed into three layers:

Front-End: introduces hierarchical tiling at three levels:
- Core-level: Fine-grained, static-shaped tiles mapped to hardware units (Tensor Cores, TMA engines); designed to be syntax-compatible with standard Triton.
- Device-level: Coarse-grained, dynamically-shaped tiles corresponding to DMA/RDMA domains for intra/inter-node data movement. Incorporates asynchronous, shape-dynamic communication, which is critical for MoE-style token routing.
- Task-level: Global kernel fusion and DAG-level tiling, enabling communication/computation fusion for persistent MegaKernels and reducing kernel launch and synchronization overhead.
Mid-End: implements compute-communication overlap via distributed swizzling. The compiler formalizes two swizzling modes:
- Gather: For computations that depend on communicated data (e.g., AllGather+GEMM), enables early issue of remote fetches and HBM caching.
- Scatter: For computations whose results are communicated (e.g., GEMM+ReduceScatter), prioritizes tiles that need long-haul transfer, front-loading their latency.
The swizzling logic is JIT-pluggable and stateless, exposing practical primitives for kernel developers.
Back-End: provides a suite of OpenSHMEM-compliant, hardware-agnostic primitives for communication, synchronization, and address mapping, which are instantiated for both NVSHMEM (NVIDIA) and rocSHMEM (AMD) platforms. The compiler stack lowers from Distributed IR to hardware-specific code using LLVM and vendor libraries.
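
The two swizzling modes can be illustrated with a small, host-side sketch of tile ordering. This is not DITRON's API: the function names, the ring order used for Gather, and the node-distance ranking used for Scatter are assumptions made for illustration only. The idea it shows is that a stateless function of (rank, world size) decides which output tiles a rank computes first, so that computation starts on data already available locally (Gather) or on results that must travel farthest (Scatter).

```python
def gather_tile_order(rank, world_size, tiles_per_shard):
    """AllGather+GEMM sketch: consume the local shard first, then remote
    shards in ring order (an assumed order), so compute can start while
    remote fetches are still in flight."""
    order = []
    for step in range(world_size):
        src = (rank + step) % world_size  # owner of the shard consumed at this step
        start = src * tiles_per_shard
        order.extend(range(start, start + tiles_per_shard))
    return order


def scatter_tile_order(rank, world_size, tiles_per_shard, ranks_per_node):
    """GEMM+ReduceScatter sketch: compute tiles destined for other nodes
    first, then same-node peers, and the rank's own shard last.
    The three-level distance ranking is a hypothetical cost model."""
    my_node = rank // ranks_per_node

    def distance(dst):
        if dst == rank:
            return 0  # stays local, no transfer needed
        if dst // ranks_per_node == my_node:
            return 1  # intra-node transfer
        return 2      # inter-node (long-haul) transfer

    # Farthest destinations first; ties broken by ring offset from this rank.
    dsts = sorted(range(world_size),
                  key=lambda d: (-distance(d), (d - rank) % world_size))
    order = []
    for dst in dsts:
        start = dst * tiles_per_shard
        order.extend(range(start, start + tiles_per_shard))
    return order


if __name__ == "__main__":
    print(gather_tile_order(rank=1, world_size=4, tiles_per_shard=2))
    print(scatter_tile_order(rank=1, world_size=4, tiles_per_shard=2,
                             ranks_per_node=2))
```

Because both functions depend only on their arguments, they are stateless in the sense described above; a different ordering policy could be swapped in without touching the tile computation itself.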

Optimizations

DITRON's codegen pipeline realizes several hardware- and problem-aware low-level optimizations:

Low-Latency (LL) Protocols: Specialized protocols that minimize control-path synchronization, analogous to NCCL's LL protocol.
Device-to-Device (D2D) Copy Fusion: Merges data movement operations within computation/communication kernels for minimized launch jitter and deterministic occupancy.
PCIe-aware Consistency: Implements software barriers via volatile memory semantics to ensure correctness without assuming hardware cache coherence or atomicity.
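
The idea behind a low-latency, flag-based protocol can be sketched in a few lines. The example below is a host-side simulation, not DITRON code: it assumes the layout commonly described for NCCL's LL protocol, where each 8-byte word carries 4 bytes of payload and a 4-byte flag, so the receiver can tell that data has arrived by polling the data words themselves instead of waiting on a separate signal. Whether DITRON uses this exact layout is not stated here, and the function names are hypothetical.

```python
import struct


def ll_pack(values, seq):
    """Pack each 32-bit payload together with a 32-bit sequence flag
    into one 8-byte word (little-endian)."""
    return [struct.pack("<II", v & 0xFFFFFFFF, seq) for v in values]


def ll_try_unpack(words, seq):
    """Return the payloads if every word carries the expected flag,
    otherwise None (a real receiver would keep polling)."""
    out = []
    for word in words:
        value, flag = struct.unpack("<II", word)
        if flag != seq:
            return None  # this word has not been overwritten yet
        out.append(value)
    return out


if __name__ == "__main__":
    recv_buf = ll_pack([0, 0, 0], seq=0)   # stale contents from a previous round
    incoming = ll_pack([7, 8, 9], seq=1)   # sender's words for the current round

    recv_buf[0] = incoming[0]              # only the first word has landed
    print(ll_try_unpack(recv_buf, seq=1))  # not all flags match yet

    recv_buf[1:] = incoming[1:]            # remaining words land
    print(ll_try_unpack(recv_buf, seq=1))
```

The trade-off in this style of protocol is bandwidth for latency: half of every word is spent on the flag, but no extra synchronization round trip is needed.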

Evaluation Highlights

DITRON is systematically evaluated on NVIDIA H800/Hopper and AMD GPUs across inference and training contexts, including large models (Qwen3-32B, LLaMA3-70B, Mixtral). Key findings:

Microbenchmark Speedup: On critical kernel fusion workloads (AG-GEMM, GEMM-RS, GEMM-AR, AG-MoE, MoE-AR), DITRON delivers 6–30% speedup over expert-tuned CUDA libraries, and 5–30% end-to-end improvement when integrated into vLLM for LLM inference. On AG-MoE, the speedup over the baseline exceeds 19x.
Module-level and End-to-End Inference: For vLLM inference at large sequence lengths and batch sizes, DITRON outperforms the baseline by up to 30%, supporting throughput of over 17k tokens/s on Qwen3-32B, with scalability maintained as batch size increases. MegaKernel scheduling yields over 6x latency improvement versus PyTorch Eager.
Training (TP, SP, EP, PP): Achieves >10% MFU improvement and 500k GPU-hour savings per month in production LLM training. For MoE and optimizer primitives, bitwise-accurate acceleration and >20% optimizer step improvements are demonstrated.
Portability: On AMD, geometric-mean speedup ranges from 2–38% over rocBLAS+RCCL. On PCIe GPUs, mean speedup is 8.33x over equivalent cuBLAS+NCCL pipelines.
Scalability: Both weak and strong scaling are validated up to 128 GPUs, maintaining performance advantage when per-rank problem sizes are sufficient to hide communication.
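
The condition that per-rank problem sizes must be large enough to hide communication can be made concrete with a back-of-the-envelope model. The sketch below is an assumption-laden illustration, not a model taken from the paper: it considers a tensor-parallel AllGather+GEMM in which each rank gathers an M x K activation and multiplies it by its K x (N / W) weight shard, assumes the GEMM runs at peak throughput, and charges a fixed latency per remote shard. All hardware numbers in the usage example are hypothetical.

```python
def ag_gemm_times(m, k, n, world_size, peak_flops, link_bytes_per_s,
                  bytes_per_elem=2, latency_s=5e-6):
    """Estimate (compute_seconds, comm_seconds) for one rank.

    compute: 2 * M * K * (N / W) FLOPs at an assumed peak rate.
    comm:    (W - 1) remote shards, each paying a fixed latency, plus
             the bytes of the (W - 1) / W fraction of the M x K input.
    """
    compute_s = 2.0 * m * k * (n / world_size) / peak_flops
    remote_elems = m * k * (world_size - 1) / world_size
    comm_s = ((world_size - 1) * latency_s
              + remote_elems * bytes_per_elem / link_bytes_per_s)
    return compute_s, comm_s


def comm_is_hidden(m, k, n, world_size, peak_flops, link_bytes_per_s):
    """True when the estimated compute time covers the communication time."""
    compute_s, comm_s = ag_gemm_times(m, k, n, world_size,
                                      peak_flops, link_bytes_per_s)
    return compute_s >= comm_s


if __name__ == "__main__":
    # Hypothetical hardware: 4e14 FLOP/s peak, 2e11 B/s link bandwidth.
    for m in (16, 8192):
        hidden = comm_is_hidden(m, 8192, 32768, 8, 4e14, 2e11)
        print(f"M={m}: communication hidden = {hidden}")
```

In this simplified model the bandwidth term scales with M just as the compute does, so it is the fixed per-message latency that makes small problems communication-bound; real kernels also lose GEMM efficiency at small tile counts, which the model ignores.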

Implications

DITRON marks a convergence of productivity, flexibility, and hardware efficiency in distributed tensor program development:

For practitioners, DITRON collapses development complexity, enabling kernel authors to retarget kernels to distributed and heterogeneous settings with minor code modification, thanks to the hierarchical tiling model and hardware-agnostic IR.
For systems researchers, DITRON's composition with the Triton ecosystem offers a pathway toward auto-tuning and more sophisticated dynamic communication-computation scheduling policies at scale, as kernel logic and communication become pluggable and decoupled from explicit network/cluster details.
For the theoretical community, DITRON's parity with, or advantage over, manual expert scheduling, with natively bitwise-identical results and improved MFU, demonstrates that a compiler-based approach can close the gap with hand-tuned CUDA/NCCL pipelines even for overlapping and fused kernels.
Toward hardware evolution, the decoupling of codegen from device specifics positions DITRON as a substrate for new backends (custom NICs, FPGAs), where primitives and translation logic can be added with minimal changes at the IR level.

Future Directions

Potential avenues for extension and application include:

Auto-scheduling and dynamic performance modeling within DITRON's mid-end, leveraging the uniform tile-level view to drive distributed cost modeling.
Integration with advanced agentic and dynamic model architectures (e.g., dynamic routing MoE, agent frameworks) enabled by device-level dynamic shape abstractions.
Support for emerging hardware topologies (NVSwitch, composable disaggregated memory, in-network compute), which can be handled by extending back-end primitives.
Formal verification and correctness-by-construction in distributed synchronization and scheduling, leveraging the explicit barrier/signal model.

Conclusion

DITRON demonstrates that a distributed multi-level tiling compiler can bring together the flexibility of domain-specific languages and the efficiency of hand-optimized libraries, matching or outperforming expert-tuned CUDA pipelines while simplifying kernel and model development for emerging deep learning workloads. Its design, grounded in hierarchical abstraction, overlapping-aware scheduling, and hardware-agnostic primitives, positions it as a catalytic platform for next-generation distributed AI systems.

Reference: "DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs" (2605.02953)
