DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

Published 20 May 2026 in cs.DC | (2605.21603v1)

Abstract: Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose DynaFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. DynaFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that DynaFlow can integrate representative parallelism strategies into 6 state-of-the-art ML systems with minimal code changes, achieving up to a 1.29x throughput improvement. DynaFlow is publicly available at https://github.com/uw-syfi/DynaFlow.


Authors (10)

Summary

The paper introduces DynaFlow, which decouples logical model definitions from physical operator scheduling to enable flexible intra-device parallelism.
It employs a programmable Python API for dynamic scheduling and minimal code changes across multiple ML frameworks.
Empirical results show throughput improvements up to 1.29x with efficient resource utilization and targeted hardware optimizations.

DynaFlow: Decoupling Intra-Device Parallelism from Sequential ML Execution

Motivation and Problem Formulation

The rapid growth in the scale and heterogeneity of modern ML models, particularly LLMs and diffusion architectures, exposes significant intra-device resource under-utilization during inference and training. These models chain operators with highly diverse compute, memory, and communication requirements, so the sequential execution model built into popular systems (e.g., vLLM, SGLang) leaves resources idle unnecessarily. Existing integrations of intra-device parallelism, encompassing strategies such as operator overlap, fine-grained kernel fusion, and input/batch splitting, have delivered notable throughput gains, but at the cost of invasive, model-specific code rewrites and brittle, context-sensitive logic. No single parallelism scheme generalizes across hardware, model architectures, and batch size distributions, forcing developers to maintain custom solutions with substantial engineering overhead.

The fundamental obstacle is the deep coupling of logical model definition with the physical, static execution order in existing ML frameworks. Compiler-based solutions (e.g., XLA) and hand-coded graph rewrites are insufficient to support the runtime dynamicity and workload specialization that effective intra-device parallelism demands.

DynaFlow: System Architecture and Abstractions

DynaFlow is introduced as a system-level substrate positioned between the ML model's logical graph and its concrete physical execution, delivering transparent and flexible intra-device parallelism. It decouples operator execution from model implementation by exposing a programmable execution schedule independent of model definition. The architecture bifurcates into a Python-native, annotation-based frontend for scheduling and partitioning, and a backend that manages asynchronous execution, memory, and compatibility with hardware-specific low-level optimizations (e.g., CUDA Graphs, TorchInductor).

Frontend: Granular, Dynamic Scheduling

The frontend uses TorchDynamo-traced graphs to identify candidate partition points at coarse operator granularity, usually corresponding to nn.Module boundaries or custom PyTorch API calls. Annotations (SplitModule, SplitFunc, and the context manager dynaflow.mark) allow developers to directly encode partitioning rules with minimal code. Programmable scheduling is realized via high-level Python APIs: split() (to subdivide the batch), get_ready_ops() (to dynamically query schedulable subgraphs), and execute() (to map operators/micro-batches to device streams or fused kernels). This architecture exposes both maximal flexibility and runtime dynamicity, while abstracting away the complex coordination and dependency management required for correct, efficient intra-device overlap, fusion, and splitting.
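As a rough illustration of the scheduling loop described above, the following toy sketch runs a custom schedule over a small dependency graph. The class ToyScheduleContext, its constructor, the operator names, and all method signatures are assumptions made for this example; only the method names split, get_ready_ops, and execute come from the text, and DynaFlow's real interface may differ.

```python
class ToyScheduleContext:
    """Hypothetical stand-in for a scheduling context; not DynaFlow's real API."""

    def __init__(self, deps):
        # deps maps each operator name to the operators it depends on.
        self.deps = {op: set(d) for op, d in deps.items()}
        self.num_micro_batches = 1
        self.done = set()   # (op, micro_batch) pairs already issued
        self.trace = []     # (op, micro_batch, stream) in issue order

    def split(self, n):
        """Subdivide the batch into n micro-batches."""
        self.num_micro_batches = n

    def get_ready_ops(self):
        """Return (op, micro_batch) pairs whose dependencies have all run."""
        ready = []
        for mb in range(self.num_micro_batches):
            for op, d in self.deps.items():
                if (op, mb) in self.done:
                    continue
                if all((p, mb) in self.done for p in d):
                    ready.append((op, mb))
        return ready

    def execute(self, op, mb, stream):
        """Record that `op` for micro-batch `mb` was issued on `stream`."""
        self.done.add((op, mb))
        self.trace.append((op, mb, stream))


def overlap_schedule(ctx):
    """Issue communication ops on stream 1 and everything else on stream 0."""
    ctx.split(2)
    while True:
        ready = ctx.get_ready_ops()
        if not ready:
            break
        for op, mb in ready:
            stream = 1 if op.startswith("comm") else 0
            ctx.execute(op, mb, stream)
    return ctx.trace


# A three-operator chain: attention, then an all-reduce, then the MLP.
deps = {"attn": [], "comm_allreduce": ["attn"], "mlp": ["comm_allreduce"]}
trace = overlap_schedule(ToyScheduleContext(deps))
for entry in trace:
    print(entry)
```

The point of the sketch is that the schedule is an ordinary Python function, separate from the model definition: swapping overlap_schedule for a different policy changes the execution order without touching the graph.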

Backend: Efficient Data-Flow and Operator Management

The DynaFlow backend performs static analysis of graph metadata (reference counting, preallocation signaling for future merges, etc.), then orchestrates runtime execution and garbage-collects intermediates via reference tracking. To avoid the data movement overheads that traditionally consume the performance gains of micro-batch splitting, zero-copy memory pre-allocation is integrated for split/merge boundaries. Critical hardware and runtime optimizations (TorchInductor fusion, CUDA Graphs) are maintained by compiling and capturing at the partitioned subgraph level. Distinct CUDA graphs are reused across multiple micro-batches via an internal pooling mechanism for tractable memory scaling.
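The zero-copy idea can be sketched in plain Python, assuming a merged output buffer that is allocated once and handed to each micro-batch as a view. The function names and the use of array and memoryview are illustrative assumptions; the actual backend operates on device tensors, and its allocator is not shown in the source.

```python
from array import array


def run_split_zero_copy(batch, num_parts, op):
    """Apply `op` per micro-batch, writing into one pre-allocated output buffer."""
    n = len(batch)
    out = array("d", [0.0]) * n          # merged output, allocated once up front
    src_view = memoryview(batch)
    out_view = memoryview(out)
    step = max(1, (n + num_parts - 1) // num_parts)
    for start in range(0, n, step):
        stop = min(start + step, n)
        # Slicing a memoryview yields a view onto the same memory, not a copy,
        # so each micro-batch writes directly into its slot of the merged output.
        op(src_view[start:stop], out_view[start:stop])
    return out                           # no concatenation step is needed


def scale_by_two(src, dst):
    """Example operator: element-wise doubling from src into dst."""
    for i in range(len(src)):
        dst[i] = src[i] * 2.0


batch = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
merged = run_split_zero_copy(batch, 2, scale_by_two)
print(list(merged))
```

Because every micro-batch writes into its own slice of the final buffer, the merge at the end of a split region costs nothing, which is the property the pre-allocation signaling is meant to guarantee.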

Empirical Evaluation

Integration Overhead and Flexibility

DynaFlow is implemented as a torch.compile backend in 4.1K lines of Python and hooks into PyTorch-based inference and training pipelines. Measurements show that enabling intra-device parallelism in six large ML frameworks (vLLM, SGLang, HuggingFace Transformers, Megatron-LM, FastVideo, and xDiT) requires minimal code changes: typically fewer than 100 lines system-wide and about 10 lines per model for operator annotation.

The expressive power of DynaFlow is exemplified by succinct implementations of strategies such as dual-batch overlap (DBO), NanoFlow-style batch splitting, TokenWeave-style communication-computation fusion, and expert-parallel communication overlap. Expressing them through a single interface generalizes previously model- or system-specific optimizations into a unified, reusable scheduling layer.
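To make the benefit of dual-batch overlap concrete, here is a toy timing model. It assumes a stack of identical layers, each with one compute step followed by one communication step, a single compute stream and a single communication stream, and micro-batch times equal to half the full-batch time multiplied by a hypothetical split_penalty. All numbers are invented for illustration and are not measurements from the paper.

```python
def sequential_time(layers, compute, comm):
    """Unsplit execution: every layer runs compute, then communication."""
    return layers * (compute + comm)


def dual_batch_overlap_time(layers, compute, comm, split_penalty=1.0):
    """Makespan with two micro-batches, overlapping compute and communication."""
    c = compute / 2 * split_penalty   # per-micro-batch compute time
    m = comm / 2 * split_penalty      # per-micro-batch communication time
    compute_free = 0.0                # when the compute stream is next idle
    comm_free = 0.0                   # when the communication stream is next idle
    ready = [0.0, 0.0]                # when each micro-batch's latest output exists
    for _ in range(layers):
        for mb in (0, 1):             # compute for both micro-batches
            start = max(compute_free, ready[mb])
            compute_free = start + c
            ready[mb] = compute_free
        for mb in (0, 1):             # communication, overlapping later compute
            start = max(comm_free, ready[mb])
            comm_free = start + m
            ready[mb] = comm_free
    return max(ready)


baseline = sequential_time(layers=4, compute=4.0, comm=2.0)
ideal = dual_batch_overlap_time(layers=4, compute=4.0, comm=2.0)
penalized = dual_batch_overlap_time(layers=4, compute=4.0, comm=2.0, split_penalty=1.5)
print(baseline, ideal, penalized)  # 24.0 17.0 25.5
```

With perfectly efficient micro-batches the overlap hides most of the communication time, but once smaller batches run less efficiently (here, a penalty of 1.5) the split schedule is slower than the sequential baseline. This is the kind of context sensitivity that motivates a programmable, rather than hard-coded, schedule.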

Throughput and Performance Analysis

DynaFlow achieves strong empirical results across both synthetic and real-world evaluation scenarios:

Throughput improvements up to 1.29x over baseline systems and up to 1.1x over existing highly optimized manual integrations (e.g., vLLM native DBO).
Overlapping and micro-batching show context-sensitive improvements, and DynaFlow’s ability to dynamically disable splitting for non-beneficial batch/hardware configurations prevents performance regression that is common with static policies.
Integration of advanced fusion kernels demonstrates the importance of rapid prototyping and validation: suboptimal kernels can be precisely identified as bottlenecks, emphasizing that general framework flexibility is as important as new kernel design.
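A dynamic policy that turns splitting off for unfavorable configurations might look like the sketch below. The function names, the threshold of 1024 tokens, and the cap of two micro-batches are hypothetical values chosen for illustration; in practice such thresholds would come from profiling a specific model and device.

```python
def choose_num_micro_batches(num_tokens, min_tokens_per_micro_batch=1024,
                             max_micro_batches=2):
    """Return 1 (no split) unless every micro-batch would stay large enough."""
    if min_tokens_per_micro_batch <= 0 or max_micro_batches < 1:
        raise ValueError("thresholds must be positive")
    parts = min(max_micro_batches, num_tokens // min_tokens_per_micro_batch)
    return max(1, parts)


def micro_batch_sizes(num_tokens, parts):
    """Divide num_tokens into `parts` sizes that differ by at most one."""
    base, extra = divmod(num_tokens, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


for tokens in (256, 1500, 2049, 8192):
    parts = choose_num_micro_batches(tokens)
    print(tokens, parts, micro_batch_sizes(tokens, parts))
```

Small batches fall through to the unsplit path, so the scheduler never pays the splitting overhead when there is too little work to overlap.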

Ablation shows that preserving CUDA Graphs and using zero-copy memory pre-allocation are both essential to closing the gap between static and dynamic scheduling. CPU execution and initialization overheads are negligible relative to baseline initialization and launch costs, and are typically amortized over multi-batch inference or training.

Implications and Future Directions

DynaFlow’s abstraction provides a unifying layer that enables ML system designers to flexibly combine parallelism strategies and choose optimal scheduling by workload, architecture, and available hardware. It is well-positioned for future extensions:

Automated schedule optimization: The abstraction enables integration with cost models, reinforcement learning, or meta-scheduling to dynamically select optimal partitioning and execution policies at runtime.
Fine-grained hardware adaptation: As new GPU microarchitectures expose new operator overlap/fusion capabilities and kernel launch models, DynaFlow’s decoupled scheduling layer will simplify integration.
Generalization beyond current parallelism forms: Emerging compute paradigms (e.g., megakernels, hierarchical MoE) or alternative accelerators (TPUs, custom ASICs) could be accommodated by extending backend primitives and scheduling abstractions.
Foundation for workload-sensitive serverless LLM inference: With batch-dependent execution plans that can be integrated with minimal code changes, DynaFlow could underpin infrastructure for elastically scalable, heterogeneous ML inference services.

Conclusion

DynaFlow delivers a transparent, flexible, and efficient framework for intra-device parallelism by decoupling logical model graphs from concrete operator scheduling. The system achieves robust throughput improvements with minimal engineering effort and preserves compatibility with existing ML optimizations. DynaFlow’s programmable API and efficient backend demonstrate that general-purpose, context-aware intra-device parallelism is tractable and practical, reshaping the engineering landscape for high-performance ML inference and training (2605.21603).
