VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU

Published 4 May 2026 in cs.DC | (2605.03190v1)

Abstract: Modern GPUs increasingly rely on specialized and asynchronous hardware units to deliver high performance. Yet these units are often underutilized because today's GPU software stacks still organize programming and execution around a monolithic kernel model that mismatches asynchronous hardware. To address this issue, Virtual Decoupled Engines (VDCores) presents a new decoupled programming and execution model for asynchronous GPUs. VDCores abstracts asynchronous hardware execution units as resource-isolated virtual cores and represents workloads as dependency-connected micro-operations (micro-ops). This abstraction removes static orchestration from the programmer, enables automatic overlap of memory and compute based on dependency and resource readiness, and thereby improves utilization of asynchronous hardware resources. Realizing such a decoupled abstraction efficiently on today's GPUs is itself challenging; VDCores addresses this through a GPU-specialized programming model and GPU runtime design that preserves the flexibility while minimizing implementation overhead. Across four LLM inference workloads on GH200, H100, and RTX 6000 Pro GPUs, VDCores significantly improves decoding throughput by 24% on average and by up to 77% under dynamic inputs, while reducing kernel programming and specialization effort by 90%. We have open-sourced VDCores at https://github.com/vdcores/vdcores.


Authors (4)

Summary

The paper introduces a resource-decoupled model that partitions GPU resources into virtual compute and memory cores, enabling fine-grained scheduling via dependency-linked micro-ops.
The paper demonstrates a 1.31× geometric mean throughput improvement with up to 77% latency reduction and up to 90% fewer lines of code compared to fused-kernel systems.
The paper’s design supports dynamic workload adaptation through run-time opportunistic scheduling and dynamic operator fusion that efficiently overlaps compute and memory operations.

VDCores: Resource-Decoupled Programming and Execution for Asynchronous GPU Architectures

Motivation: Modern GPU Utilization Challenges

Contemporary GPU architectures, exemplified by NVIDIA's Hopper and Blackwell and AMD's CDNA series, have increasingly adopted specialized, asynchronous hardware units such as tensor cores and hardware-assisted asynchronous memory transfer engines. While these hardware advancements expose substantial task-level and pipeline parallelism, the prevailing software execution model based on monolithic kernels remains fundamentally coupled to synchronous, data-parallel designs. This persistent reliance creates a severe mismatch: kernel-centric execution statically orchestrates data movement, computation, and synchronization, which produces complex, non-modular code that is brittle to input dynamics and utilizes the hardware poorly, especially with dynamic or variable-shape workloads.

Autonomous operator fusion, megakernel formation, and warp specialization within existing frameworks—CUTLASS, MPK, ThunderKittens—improve intra-kernel overlap and partial pipelining, but fundamentally retain the kernel/task as the granularity of composition. This static orchestration creates explicit performance cliffs when resource usage or operator dependencies change dynamically, particularly in latency-sensitive LLM inference scenarios.

The VDCores Model: Decoupling Resource Domains

VDCores introduces an explicit resource-decoupled abstraction for GPU programming and execution on asynchronous hardware (2605.03190). This model decomposes the hardware into software-managed virtual memory cores (VMCs) and virtual compute cores (VCCs), each operating as an independently scheduled execution context. The principal unit of scheduling and programming is a fine-grained dependency-linked micro-operation ($\mu$op), rather than the opaque kernel.

Key features include:

Direct Exposure of Hardware Units: Programmers target architectural units (e.g., tensor core, TMA engine) directly, rather than embedding asynchronous behaviors within general-purpose SIMT kernels.
Dependency Graph Specification: Workloads are lowered to a directed acyclic graph (DAG) of $\mu$ops, with explicit inter-$\mu$op dependencies encoding scheduling constraints, dataflow, and ordering.
Run-Time Opportunistic Scheduling: When dependencies are satisfied and the resource is available, $\mu$ops are scheduled immediately, enabling automatic, fine-grained overlap of compute and memory, and dynamically adapting to runtime resource readiness.
Resource Isolation: Each VMC or VCC executes its assigned $\mu$op flow independently, facilitating robust resource allocation, real-time load balancing, and efficient fusion via short-circuiting intermediate memory copies.
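
The scheduling behavior described in the features above can be illustrated with a toy simulation. The following is a minimal Python sketch, not the VDCores runtime or its API: the `MicroOp` class, the `schedule` function, the cycle counts, and the two-tile load/compute/store workload are all hypothetical and chosen only for illustration. It assumes a single VMC and a single VCC, and starts a $\mu$op as soon as its dependencies have finished and its core is idle.

```python
from dataclasses import dataclass, field


@dataclass
class MicroOp:
    """Hypothetical micro-op record (not the VDCores encoding)."""
    name: str
    core: str                 # "VMC" (memory) or "VCC" (compute)
    cycles: int               # illustrative duration in time units
    deps: list = field(default_factory=list)  # names that must finish first


def schedule(uops):
    """Greedy opportunistic scheduling on one VMC and one VCC.

    At each time step, every idle core starts the first pending micro-op
    of its kind whose dependencies have all finished.
    Returns (timeline, makespan); timeline holds (name, core, start, end).
    """
    pending = list(uops)
    done_at = {}                          # name -> finish time
    busy_until = {"VMC": 0, "VCC": 0}
    timeline = []
    limit = sum(op.cycles for op in uops) + len(uops)
    t = 0
    while pending:
        if t > limit:                     # guard against cyclic dependencies
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for core in busy_until:
            if busy_until[core] > t:
                continue                  # this virtual core is still busy
            for op in pending:
                ready = all(done_at.get(d, float("inf")) <= t for d in op.deps)
                if op.core == core and ready:
                    done_at[op.name] = t + op.cycles
                    busy_until[core] = t + op.cycles
                    timeline.append((op.name, core, t, t + op.cycles))
                    pending.remove(op)
                    break                 # one micro-op per core per step
        t += 1
    return timeline, max(done_at.values(), default=0)


# Two independent tiles, each a load -> compute -> store chain.
uops = [
    MicroOp("load0", "VMC", 2),
    MicroOp("mma0", "VCC", 3, ["load0"]),
    MicroOp("store0", "VMC", 1, ["mma0"]),
    MicroOp("load1", "VMC", 2),
    MicroOp("mma1", "VCC", 3, ["load1"]),
    MicroOp("store1", "VMC", 1, ["mma1"]),
]
timeline, makespan = schedule(uops)
for entry in timeline:
    print(entry)
print("makespan:", makespan)
```

In this toy workload the second load runs on the VMC while the first compute $\mu$op runs on the VCC, so the makespan is 9 time units instead of the 12 that a fully serial execution would take. No overlap was written into the program; it follows from the dependency list and core availability alone.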

Key System Design and Implementation

VDCores’ runtime effectively virtualizes asynchronous hardware resources. Its primary contributions are:

Software-Pipelined Virtual Cores: VMCs and VCCs internally adopt a pipelined microarchitecture that decodes, registers, issues, and executes $\mu$ops cooperatively using SIMT and CUDA hardware resources. VMCs sustain $>10$k tile-level ops/s to avoid memory bandwidth bottlenecks on H100-class hardware. Explicitly decomposed pipelines, SIMD/SIMT allocation, and parallel resource management reduce head-of-line blocking and give fine-grained control over operator fusion, overlapping communication and computation phases.
Scalable Dependency Management: Dependency tracking is statically or semi-statically encoded into $\mu$ops, with at most one inter-memory dependency and compact depId signaling fields. Virtual flows (chains of dependent $\mu$ops) are locally ordered, while inter-flow dependencies are resolved through lightweight queue message passing; this design eschews the expensive dynamic dependency scoreboards typical of fully dynamic dataflow machines.
Deadlock-Free Adaptive Scheduling: The compiler/runtime co-design ensures resource-absence deadlocks are avoided by in-order resource allocation with out-of-order execution. The runtime linearizes allocation events, but once resources are assigned, execution can reorder and overlap based on flow readiness.
Dynamic Fusion and Data Placement: Operator fusion is implemented by promoting intermediate data to shared memory when applicable, leveraging the fact that separate $\mu$ops for compute and memory can be recomposed at runtime for locality or overlap, irrespective of static kernel specialization.
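
The combination of in-order resource allocation and out-of-order execution mentioned above can be sketched with a small model. This is an illustrative Python toy under stated assumptions, not the VDCores implementation: the `run` function, the flow tuples, and the notion of interchangeable buffer "slots" are hypothetical. It assumes each flow needs a fixed number of slots, that slots are granted strictly in program order, and that an allocated flow executes as soon as its input is ready.

```python
import heapq
from collections import deque


def run(flows, total_slots):
    """Toy model: in-order allocation, out-of-order completion.

    flows: list of (name, slots_needed, ready_time, duration) in program order.
    Returns (allocation_order, completion_order).
    """
    queue = deque(flows)
    free = total_slots
    running = []                      # min-heap of (finish_time, name, slots)
    alloc_order, done_order = [], []
    now = 0
    while queue or running:
        # In-order allocation: only the head of the queue may take slots,
        # so a later flow can never grab resources ahead of an earlier one.
        while queue and queue[0][1] <= free:
            name, slots, ready, duration = queue.popleft()
            free -= slots
            alloc_order.append(name)
            start = max(now, ready)   # wait for input data if necessary
            heapq.heappush(running, (start + duration, name, slots))
        if not running:
            # Nothing is in flight and the head still does not fit.
            raise RuntimeError(f"flow {queue[0][0]} exceeds the slot pool")
        # Out-of-order execution: the flow that finishes first retires first,
        # regardless of allocation order, and returns its slots to the pool.
        now, name, slots = heapq.heappop(running)
        free += slots
        done_order.append(name)
    return alloc_order, done_order


# Flow A is allocated first but its input arrives late (ready_time = 5).
flows = [("A", 2, 5, 2), ("B", 1, 0, 1), ("C", 2, 0, 1)]
alloc_order, done_order = run(flows, total_slots=3)
print("allocated:", alloc_order)
print("completed:", done_order)
```

Here allocation follows program order (A, B, C), but B completes before A because A is still waiting for its input. In this model, as long as every flow's demand fits in the pool, the head of the queue is eventually served once earlier flows retire, which is the intuition behind avoiding resource-absence deadlock.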

Quantitative Evaluation

VDCores is evaluated on LLM inference for multiple architectures (GH200, H100, RTX 6000 Pro), across Qwen3 and Llama3 models. Strong empirical evidence includes:

Throughput: VDCores consistently outperforms both operator-per-kernel (vLLM, SGLang) and highly optimized megakernel (MPK, ThunderKittens) baselines. The geometric mean speedup is $1.31\times$ (23% average latency reduction), with a best-case improvement of […] (77% reduction) in dynamic regimes.
Reduction in Programming Complexity: Implementation effort is reduced by up to 90% in lines of code over fused-kernel systems.
Dynamic Adaptation: For variable-input workloads (e.g., LoRA adapters, mixed-sequence batches), VDCores achieves up to […] makespan improvement by dynamically recomposing $\mu$op flows at runtime, completely avoiding inefficient static kernel/schedule selection.
Resource Efficiency: Peak throughput reaches […] of device memory bandwidth and […] of peak FLOPS for kernels with large I/O granularity, and the profile-aggregated runtime overhead is only 3.1% of total core time.
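
Speedup factors and percentage latency reductions are two views of the same measurement, and converting between them makes the figures easier to compare. The short Python sketch below is plain arithmetic, not code from the paper; the function names are hypothetical. It assumes that a speedup $s$ means the new latency is $1/s$ of the baseline latency.

```python
def latency_reduction(speedup):
    """Fractional latency reduction implied by a speedup factor s: 1 - 1/s."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction):
    """Inverse conversion: speedup implied by a fractional latency reduction."""
    return 1.0 / (1.0 - reduction)


geo_mean_speedup = 1.31   # geometric-mean figure quoted in the summary
reduction = latency_reduction(geo_mean_speedup)
print(f"{reduction * 100:.1f}% latency reduction")  # prints 23.7% latency reduction
```

Under this reading, the $1.31\times$ geometric-mean speedup corresponds to roughly a 23.7% latency reduction, which is in line with the approximately 23% quoted for the average case.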

Theoretical and Practical Implications

The VDCores model aligns more closely with classic dataflow and decoupled access-execute (DAE) architectures, but with strong domain- and hardware-specific constraints. Unlike prior dataflow machines, VDCores uses software-managed, dependency-restricted flows to minimize runtime synchronization cost, and exposes a programmer- and compiler-facing abstraction suitable for modern ML workloads.

Practical implications are substantial:

Simplification and Portability: Model-invariant $\mu$op libraries and resource-isolated execution enable compositional optimization and fast porting across hardware generations (Hopper, Blackwell), supporting future architectural extensions (e.g., SM-group shared memory, persistent tensor cores).
Unified Heterogeneity: The same model can generalize beyond GPUs (e.g., AWS Trainium, FPGAs, disaggregated compute/memory nodes), since resource domains are explicit, and coordination is message-passing, not implicit kernel launches.
Compiler and Learning Integration: The separation of fine-grained backend handlers and system-level dynamic scheduling creates a robust interface for learned compilers to map operators into hardware-friendly, dependency-wise $\mu$op graphs, enabling future reinforcement-learning-based or metaheuristic optimization layers.
Dynamic Workload Adaptation: VDCores is robust to batch fragmentation, context switching, and in-flight scheduling, issues present in LLM serving, RAG pipelines, or variable-length sequence models.

Future Directions

Several avenues are enabled by this abstraction:

Advanced Compiler Support: Integration with high-level MLIR-like compilers and advanced auto-schedulers can leverage VDCores’ $\mu$op-level granularity for schedule search, hardware-targeted cost modeling, and operator fusion.
Hybrid and Disaggregated Architectures: As device heterogeneity expands, the VDCores approach provides a clean cross-device ABI for pipeline construction, enabling unified pipelines with CPUs, GPUs, TPUs, DPUs, and memory tiering using explicit dependency graphs.
Learning-Based Runtime Scheduling: Adaptive policies for dynamic tiling, fusion, and resource mapping can be implemented as learned agents, with fine-grained resource feedback available from the runtime.

Conclusion

VDCores presents a resource-decoupled programming and execution architecture for asynchronous GPUs, establishing micro-operation graphs and virtual cores as the fundamental scheduling and programming units. This yields significantly improved hardware utilization, lower inference latency, and a substantial reduction in programming complexity over contemporary kernel- and megakernel-based runtime designs. The architecture also provides a future-proof path to decoupled, cross-device, and learning-augmented accelerators for dynamic deep learning workloads (2605.03190).
