- The paper introduces DynaFlow, which decouples logical model definitions from physical operator scheduling to enable flexible intra-device parallelism.
- It exposes a programmable Python API for dynamic scheduling and needs only minimal code changes to integrate with multiple ML frameworks.
- Empirical results show throughput improvements up to 1.29x with efficient resource utilization and targeted hardware optimizations.
DynaFlow: Decoupling Intra-Device Parallelism from Sequential ML Execution
The rapid growth in the scale and heterogeneity of modern ML models, particularly LLMs and diffusion architectures, exposes significant intra-device resource under-utilization during inference and training. These models chain operators with highly diverse compute, memory, and communication requirements, so the sequential execution model built into popular systems (e.g., vLLM, SGLang) leaves resources idle that could otherwise be used. Existing integrations of intra-device parallelism, encompassing strategies such as operator overlap, fine-grained kernel fusion, and input/batch splitting, have delivered notable throughput gains, but at the cost of invasive, model-specific code rewrites and brittle, context-sensitive logic. No single parallelism scheme generalizes across hardware, model architectures, or batch size distributions, forcing developers to maintain custom solutions with substantial engineering overhead.
The fundamental obstacle is the deep coupling of logical model definition with the physical, static execution order in existing ML frameworks. Compiler-based solutions (e.g., XLA) and hand-coded graph rewrites are insufficient to support the runtime dynamicity and workload specialization that effective intra-device parallelism demands.
DynaFlow: System Architecture and Abstractions
DynaFlow is introduced as a system-level substrate positioned between the ML model's logical graph and its concrete physical execution, delivering transparent and flexible intra-device parallelism. It decouples operator execution from model implementation by exposing a programmable execution schedule independent of model definition. The architecture bifurcates into a Python-native, annotation-based frontend for scheduling and partitioning, and a backend that manages asynchronous execution, memory, and compatibility with hardware-specific low-level optimizations (e.g., CUDA Graphs, TorchInductor).
Frontend: Granular, Dynamic Scheduling
The frontend uses TorchDynamo-traced graphs to identify candidate partition points at coarse operator granularity, usually corresponding to nn.Module boundaries or custom PyTorch API calls. Annotations (SplitModule, SplitFunc, and the context manager dynaflow.mark) allow developers to directly encode partitioning rules with minimal code. Programmable scheduling is realized via high-level Python APIs: split() (to subdivide the batch), get_ready_ops() (to dynamically query schedulable subgraphs), and execute() (to map operators/micro-batches to device streams or fused kernels). This architecture exposes both maximal flexibility and runtime dynamicity, while abstracting away the complex coordination and dependency management required for correct, efficient intra-device overlap, fusion, and splitting.
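The scheduling loop described above can be pictured with a small, self-contained toy. This is a sketch under stated assumptions, not DynaFlow's implementation: the `Op` and `ToyScheduler` classes, their method signatures, and the "at most one op per stream per step" policy are all hypothetical, and only the method names `split`, `get_ready_ops`, and `execute` are borrowed from the text. It shows how a user-written policy could interleave two micro-batches so that one micro-batch's communication is issued alongside the other's compute.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    kind: str                                   # "compute" or "comm"
    deps: list = field(default_factory=list)    # names of producer ops


class ToyScheduler:
    """Hypothetical stand-in for a programmable scheduler (not DynaFlow's API)."""

    def __init__(self, ops):
        self.ops = ops
        self.num_mb = 1
        self.step = 0
        self.done = set()     # (micro_batch, op_name) pairs already issued
        self.plan = []        # (step, stream, micro_batch, op_name) in issue order

    def split(self, num_micro_batches):
        # Subdivide the batch; every op now exists once per micro-batch.
        self.num_mb = num_micro_batches

    def get_ready_ops(self):
        # An op is ready when all of its producers ran for the same micro-batch.
        ready = []
        for mb in range(self.num_mb):
            for op in self.ops:
                if (mb, op.name) in self.done:
                    continue
                if all((mb, dep) in self.done for dep in op.deps):
                    ready.append((mb, op))
        return ready

    def execute(self, mb, op, stream):
        # Record the placement; a real backend would launch a kernel here.
        self.plan.append((self.step, stream, mb, op.name))
        self.done.add((mb, op.name))


def dual_batch_overlap(ops):
    """Toy policy: per step, issue at most one op on each stream."""
    sched = ToyScheduler(ops)
    sched.split(2)
    while True:
        ready = sched.get_ready_ops()
        if not ready:
            break
        busy_streams = set()
        for mb, op in ready:
            if op.kind not in busy_streams:
                sched.execute(mb, op, stream=op.kind)
                busy_streams.add(op.kind)
        sched.step += 1
    return sched.plan


layer = [
    Op("attn", "compute"),
    Op("all_reduce", "comm", ["attn"]),
    Op("mlp", "compute", ["all_reduce"]),
]
plan = dual_batch_overlap(layer)
```

In this toy, step 1 of `plan` places the `all_reduce` of micro-batch 0 on the communication stream while `attn` of micro-batch 1 runs on the compute stream. That pairing is the overlap pattern a dual-batch schedule aims for, and the dependency bookkeeping stays inside the scheduler rather than in the model code.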
Backend: Efficient Data-Flow and Operator Management
The DynaFlow backend performs static analysis of graph metadata (reference counting, preallocation signaling for future merges, etc.), then orchestrates runtime execution and garbage-collects intermediates via reference tracking. To avoid the data movement overheads that traditionally consume the performance gains of micro-batch splitting, zero-copy memory pre-allocation is integrated for split/merge boundaries. Critical hardware and runtime optimizations (TorchInductor fusion, CUDA Graphs) are maintained by compiling and capturing at the partitioned subgraph level. Distinct CUDA graphs are reused across multiple micro-batches via an internal pooling mechanism for tractable memory scaling.
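Two of the backend ideas above, reference-tracked cleanup of intermediates and pre-allocated merge buffers, can be illustrated without any GPU code. The following is a minimal sketch using only the Python standard library. The function names, the dictionary-based graph encoding, and the use of `bytearray`/`memoryview` as a stand-in for device tensors are assumptions made for illustration and do not reflect DynaFlow's internals.

```python
def consumer_counts(graph):
    """graph maps op name -> list of input op names.

    Returns, for each op, how many times its output is consumed.
    This is the static-analysis half: it runs once, before execution.
    """
    counts = {name: 0 for name in graph}
    for inputs in graph.values():
        for src in inputs:
            counts[src] += 1
    return counts


def run_with_refcounts(graph, order, compute):
    """Execute ops in `order`, dropping each intermediate after its last use."""
    remaining = consumer_counts(graph)
    live = {}       # outputs currently held in memory
    freed = []      # order in which intermediates were released
    peak = 0        # maximum number of simultaneously live outputs
    for name in order:
        live[name] = compute(name, [live[src] for src in graph[name]])
        peak = max(peak, len(live))
        for src in graph[name]:
            remaining[src] -= 1
            if remaining[src] == 0:
                del live[src]
                freed.append(src)
    return live, freed, peak


def preallocated_merge(micro_batch_sizes, produce):
    """Allocate the merged output once; each micro-batch writes into its slice.

    Because every producer writes in place through a view, the final
    "merge" needs no concatenation and no extra copy.
    """
    out = bytearray(sum(micro_batch_sizes))
    view = memoryview(out)
    offset = 0
    for index, size in enumerate(micro_batch_sizes):
        produce(index, view[offset:offset + size])
        offset += size
    return out


# Diamond-shaped graph: x feeds a and b, which both feed c.
graph = {"x": [], "a": ["x"], "b": ["x"], "c": ["a", "b"]}
live, freed, peak = run_with_refcounts(
    graph, ["x", "a", "b", "c"], lambda name, inputs: name
)


def fill(index, dst):
    # Each micro-batch writes its own marker byte into its slice.
    dst[:] = bytes([index + 1]) * len(dst)


merged = preallocated_merge([3, 2], fill)
```

In the diamond example, `x` is released as soon as `b` has consumed it, and only the graph output `c` remains live at the end. In the merge example, the two micro-batches fill disjoint slices of one buffer, which is the property that lets a split/merge boundary avoid a copy.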
Empirical Evaluation
Integration Overhead and Flexibility
DynaFlow is implemented as a torch.compile backend in 4.1K lines of Python and hooks into any PyTorch-based inference or training pipeline. Empirical measurement shows that enabling intra-device parallelism in six large ML frameworks (vLLM, SGLang, HuggingFace Transformers, Megatron-LM, FastVideo, xDiT) requires minimal code changes: typically <100 lines system-wide and on the order of 10 lines per model for operator annotation.
The expressive power of DynaFlow is exemplified by succinct implementations of strategies such as dual-batch overlap (DBO), NanoFlow-style batch splitting, Token Weave-style communication-computation fusion, and expert-parallel communication overlap. Expressing them through one reusable scheduling interface generalizes optimizations that were previously tied to a specific model or system.
DynaFlow achieves strong empirical results across both synthetic and real-world evaluation scenarios:
- Throughput improvements up to 1.29x over baseline systems and up to 1.1x over existing highly optimized manual integrations (e.g., vLLM native DBO).
- Overlapping and micro-batching show context-sensitive improvements, and DynaFlow's ability to dynamically disable splitting for non-beneficial batch/hardware configurations prevents the performance regressions that are common with static policies.
- Integration of advanced fusion kernels demonstrates the importance of rapid prototyping and validation: suboptimal kernels can be precisely identified as bottlenecks, emphasizing that general framework flexibility is as important as new kernel design.
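The dynamic disabling of splitting mentioned in the list above amounts to a per-batch decision made at runtime. A minimal sketch of such a policy follows, assuming a simple token-count threshold. The function names, the threshold of 1024 tokens, and the cap of two micro-batches are hypothetical illustration values, not figures from the paper.

```python
def choose_num_micro_batches(num_tokens,
                             min_tokens_per_micro_batch=1024,
                             max_micro_batches=2):
    """Return 1 (no split) unless every micro-batch stays above the threshold.

    Small batches tend not to saturate the device, so splitting them would
    add launch overhead without creating useful overlap.
    """
    candidates = min(max_micro_batches, num_tokens // min_tokens_per_micro_batch)
    return max(1, candidates)


def split_sizes(num_tokens, num_micro_batches):
    """Divide num_tokens as evenly as possible across the micro-batches."""
    base, extra = divmod(num_tokens, num_micro_batches)
    return [base + (1 if i < extra else 0) for i in range(num_micro_batches)]


def plan_batch(num_tokens):
    """Combine both decisions into the micro-batch sizes for one batch."""
    return split_sizes(num_tokens, choose_num_micro_batches(num_tokens))
```

With these assumed values, `plan_batch(256)` keeps the batch whole while `plan_batch(4096)` yields two equal micro-batches. A static policy that always splits would take the second path in both cases.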
Ablation reveals that the preservation of CUDA Graphs and zero-copy memory pre-allocation is essential to closing the gap between static and dynamic scheduling. CPU execution and initialization overheads are negligible relative to baseline initialization/launch costs and are typically amortized over multi-batch inference/training.
Implications and Future Directions
DynaFlow's abstraction provides a unifying layer that enables ML system designers to flexibly combine parallelism strategies and choose optimal scheduling by workload, architecture, and available hardware. It is well-positioned for future extensions:
- Automated schedule optimization: The abstraction enables integration with cost models, reinforcement learning, or meta-scheduling to dynamically select optimal partitioning and execution policies at runtime.
- Fine-grained hardware adaptation: As new GPU microarchitectures expose new operator overlap/fusion capabilities and kernel launch models, DynaFlow's decoupled scheduling layer will simplify integration.
- Generalization beyond current parallelism forms: Emerging compute paradigms (e.g., megakernels, hierarchical MoE) or alternative accelerators (TPUs, custom ASICs) could be accommodated by extending backend primitives and scheduling abstractions.
- Foundation for workload-sensitive serverless LLM inference: With zero-code-change integration for batch-dependent execution plans, DynaFlow could underpin infrastructure for elastically scalable, heterogeneous ML inference services.
Conclusion
DynaFlow delivers a transparent, flexible, and efficient framework for intra-device parallelism by decoupling logical model graphs from concrete operator scheduling. The system achieves robust throughput improvements with minimal engineering effort and preserves compatibility with existing ML optimizations. DynaFlow's programmable API and efficient backend demonstrate that general-purpose, context-aware intra-device parallelism is tractable and practical, reshaping the engineering landscape for high-performance ML inference and training (2605.21603).