- The paper introduces a resource-decoupled model that partitions GPU resources into virtual compute and memory cores, enabling fine-grained scheduling via dependency-linked micro-ops.
- The paper demonstrates a 1.31× geometric mean throughput improvement with up to 77% latency reduction and up to 90% fewer lines of code compared to fused-kernel systems.
- The paper's design supports dynamic workload adaptation through run-time opportunistic scheduling and dynamic operator fusion that efficiently overlaps compute and memory operations.
VDCores: Resource-Decoupled Programming and Execution for Asynchronous GPU Architectures
Motivation: Modern GPU Utilization Challenges
Contemporary GPU architectures, exemplified by NVIDIA's Hopper and Blackwell and AMD's CDNA series, have increasingly adopted specialized, asynchronous hardware units such as tensor cores and hardware-assisted asynchronous memory transfer engines. While these hardware advancements expose substantial task-level and pipeline parallelism, the prevailing software execution model based on monolithic kernels remains fundamentally coupled to synchronous, data-parallel designs. This persistent reliance creates a severe mismatch: kernel-centric execution statically orchestrates data movement, computation, and synchronization, which yields complex, non-modular code that is brittle to input dynamics and underutilizes the hardware, especially with dynamic or variable-shape workloads.
Autonomous operator fusion, megakernel formation, and warp specialization within existing frameworks (CUTLASS, MPK, ThunderKittens) improve intra-kernel overlap and partial pipelining, but fundamentally retain the kernel/task as the granularity of composition. This static orchestration creates explicit performance cliffs when resource usage or operator dependencies change dynamically, particularly in latency-sensitive LLM inference scenarios.
The VDCores Model: Decoupling Resource Domains
VDCores introduces an explicit resource-decoupled abstraction for GPU programming and execution on asynchronous hardware (2605.03190). This model decomposes the hardware into software-managed virtual memory cores (VMCs) and virtual compute cores (VCCs), each operating as an independently scheduled execution context. The principal unit of scheduling and programming is a fine-grained dependency-linked micro-operation (μop), rather than the opaque kernel.
Key features include:
- Direct Exposure of Hardware Units: Programmers target architectural units (e.g., tensor core, TMA engine) directly, rather than embedding asynchronous behaviors within general-purpose SIMT kernels.
- Dependency Graph Specification: Workloads are lowered to a directed acyclic graph (DAG) of μops, with explicit inter-μop dependencies encoding scheduling constraints, dataflow, and ordering.
- Run-Time Opportunistic Scheduling: When dependencies are satisfied and the resource is available, μops are scheduled immediately, enabling automatic, fine-grained overlap of compute and memory, and dynamically adapting to runtime resource readiness.
- Resource Isolation: Each VMC or VCC executes its assigned μop flow independently, facilitating robust resource allocation, real-time load balancing, and efficient fusion via short-circuiting intermediate memory copies.
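The dependency-graph and opportunistic-scheduling features can be illustrated with a small host-side simulation. This is a minimal sketch under stated assumptions, not the VDCores runtime: the `MicroOp` class, the `schedule` function, the single VMC and single VCC, and the cycle counts are all hypothetical names and values chosen for illustration. Each μop starts as soon as its dependencies have finished and its core is idle.

```python
from dataclasses import dataclass, field


@dataclass
class MicroOp:
    """Hypothetical μop record; field names are illustrative assumptions."""
    name: str
    core: str                      # "VMC" (memory) or "VCC" (compute)
    cycles: int                    # made-up duration in abstract time units (>= 1)
    deps: list = field(default_factory=list)  # names of prerequisite μops


def schedule(uops):
    """Greedy simulation: at each time step, start every μop whose
    dependencies have finished and whose virtual core is idle."""
    horizon = sum(u.cycles for u in uops)   # serial execution is an upper bound
    finish = {}                             # μop name -> finish time
    busy_until = {"VMC": 0, "VCC": 0}       # one core of each kind (assumption)
    timeline = []                           # (name, core, start, end)
    pending = list(uops)
    t = 0
    while pending:
        if t > horizon:
            raise ValueError("unsatisfiable dependencies (cycle or unknown name)")
        for u in list(pending):
            deps_done = all(finish.get(d, float("inf")) <= t for d in u.deps)
            if deps_done and busy_until[u.core] <= t:
                finish[u.name] = t + u.cycles
                busy_until[u.core] = finish[u.name]
                timeline.append((u.name, u.core, t, finish[u.name]))
                pending.remove(u)
        t += 1
    return timeline, max(finish.values(), default=0)


# Two tiles, each lowered to load -> mma -> store.
uops = [
    MicroOp("load0", "VMC", 2),
    MicroOp("mma0", "VCC", 3, ["load0"]),
    MicroOp("store0", "VMC", 1, ["mma0"]),
    MicroOp("load1", "VMC", 2),
    MicroOp("mma1", "VCC", 3, ["load1"]),
    MicroOp("store1", "VMC", 1, ["mma1"]),
]

timeline, makespan = schedule(uops)
for name, core, start, end in timeline:
    print(f"{name:7s} on {core}: [{start}, {end})")
print("makespan:", makespan, "serial:", sum(u.cycles for u in uops))
```

In this toy run the second load executes on the memory core while the first matrix multiply occupies the compute core, so the simulated makespan is shorter than the serial sum without any hand-written pipelining in the μop list itself.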
Key System Design and Implementation
VDCores' runtime effectively virtualizes asynchronous hardware resources. Its primary contributions are:
- Software-Pipelined Virtual Cores: VMCs and VCCs internally adopt a pipelined microarchitecture: decoding, registering, issuing, and executing μops cooperatively using SIMT and CUDA hardware resources. VMCs sustain >10k tile-level ops/s to avoid memory bandwidth bottlenecks on H100-class hardware. Explicit decomposed pipelines, SIMD/SIMT allocation, and parallel resource management reduce head-of-line blocking and micromanage operator fusion, overlapping communication and computation phases.
- Scalable Dependency Management: Dependency tracking is statically or semi-statically encoded into μops, with at most one inter-memory dependency and compact depId signaling fields. Virtual flows (chains of dependent μops) are locally ordered, while inter-flow dependencies are resolved through lightweight queue message passing; this design eschews expensive dynamic dependency scoreboards typical in fully dynamic dataflow machines.
- Deadlock-Free Adaptive Scheduling: The compiler/runtime co-design ensures resource-absence deadlocks are avoided by in-order resource allocation with out-of-order execution. The runtime linearizes allocation events, but once resources are assigned, execution can reorder and overlap based on flow readiness.
- Dynamic Fusion and Data Placement: Operator fusion is implemented by promoting intermediate data to shared memory when applicable, leveraging the fact that separate μops for compute and memory can be recomposed at runtime for locality or overlap, irrespective of static kernel specialization.
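The "in-order resource allocation with out-of-order execution" policy can be sketched as a toy simulation. The sketch assumes a single pool of interchangeable buffer slots and μops that request all of their slots at once; the `run` function, the request tuples, and the capacities are hypothetical and are not taken from the VDCores implementation. Only the request at the head of the queue may allocate, so no μop holds a partial allocation while waiting for more, and completions are free to happen in a different order.

```python
from collections import deque


def run(requests, capacity):
    """Simulate in-order allocation with out-of-order completion.

    requests: list of (name, slots_needed, exec_time) in program order.
    Returns (allocation_order, completion_order).
    """
    if any(slots > capacity for _, slots, _ in requests):
        raise ValueError("a request exceeds total capacity and can never run")
    queue = deque(requests)
    free = capacity
    running = []                   # (finish_time, name, slots)
    alloc_order, done_order = [], []
    t = 0
    while queue or running:
        # Retire finished μops and return their slots to the pool.
        for item in sorted(running):
            if item[0] <= t:
                running.remove(item)
                free += item[2]
                done_order.append(item[1])
        # In-order allocation: only the head of the queue may take slots,
        # even if a later, smaller request would fit right now.
        while queue and queue[0][1] <= free:
            name, slots, duration = queue.popleft()
            free -= slots
            running.append((t + duration, name, slots))
            alloc_order.append(name)
        t += 1
    return alloc_order, done_order


requests = [("A", 2, 5), ("B", 1, 1), ("C", 2, 2), ("D", 1, 1)]
alloc_order, done_order = run(requests, capacity=4)
print("allocated:", alloc_order)
print("completed:", done_order)
```

With these made-up numbers, allocation follows program order while the long-running first request completes last. The trade-off of this design choice is visible too: the last request waits behind the third even when one slot is free, which is the price of never letting a later request starve an earlier one.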
Quantitative Evaluation
VDCores is evaluated on LLM inference for multiple architectures (GH200, H100, RTX 6000 Pro), across Qwen3 and Llama3 models. Strong empirical evidence includes:
- Throughput: VDCores consistently outperforms both operator-per-kernel (vLLM, SGLang) and highly optimized megakernel (MPK, ThunderKittens) baselines. The geometric mean speedup is 1.31× (23% average latency reduction), with a best-case improvement of […] (77% reduction) in dynamic regimes.
- Reduction in Programming Complexity: Implementation effort is reduced by up to 90% in lines of code over fused-kernel systems.
- Dynamic Adaptation: For variable-input workloads (e.g., LoRA adapters, mixed-sequence batches), VDCores achieves up to […] makespan improvement by dynamically recomposing μop flows at runtime, completely avoiding inefficient static kernel/schedule selection.
- Resource Efficiency: Peak throughput reaches […] of device memory bandwidth and […] of peak FLOPS for kernels with large I/O granularity, and the profile-aggregated runtime overhead is only 3.1% of total core time.
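The speedup and latency-reduction figures above can be related with simple arithmetic. The sketch below assumes that "speedup" means baseline latency divided by new latency for the same work, so that reduction = 1 - 1/speedup; the summary does not state this conversion, and the helper names are hypothetical. Because a geometric mean of speedups does not convert exactly into an average latency reduction, the result is only a consistency check.

```python
def latency_reduction(speedup: float) -> float:
    """Fractional latency reduction implied by a speedup factor
    (assumes speedup = old_latency / new_latency)."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional latency reduction
    (inverse of latency_reduction, same assumption)."""
    return 1.0 / (1.0 - reduction)


# 1.31x corresponds to roughly a 23.7% reduction under this assumption.
print(f"{latency_reduction(1.31):.3f}")
# A 77% reduction corresponds to roughly 4.35x under this assumption.
print(f"{speedup_from_reduction(0.77):.2f}")
```

Under this assumption the 1.31× figure is approximately consistent with the reported 23% average reduction. The value derived from the 77% reduction is an implication of the assumed formula, not a number reported in the text.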
Theoretical and Practical Implications
The VDCores model aligns more closely with classic dataflow and decoupled access-execute (DAE) architectures, but with strong domain- and hardware-specific constraints. Unlike prior dataflow machines, VDCores uses software-managed, dependency-restricted flows to minimize runtime synchronization cost, and exposes a programmer- and compiler-facing abstraction suitable for modern ML workloads.
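For readers unfamiliar with decoupled access-execute, the pattern can be sketched on the host with two threads joined by a bounded queue. This is only an analogy under stated assumptions: the stage functions, the queue depth, and the sum-of-squares "compute" are hypothetical stand-ins, and nothing here reflects how VDCores maps the idea onto GPU hardware. The access stage runs ahead of the execute stage by at most `depth` tiles, and the queue is the only coordination point between them.

```python
import queue
import threading


def access_stage(tiles, q):
    """Access side: fetch tiles and hand them to the execute side."""
    for tile in tiles:
        q.put(list(tile))          # stand-in for a load into a staging buffer
    q.put(None)                    # sentinel: no more tiles


def execute_stage(q, results):
    """Execute side: consume tiles in arrival order and compute on them."""
    while True:
        tile = q.get()
        if tile is None:
            break
        results.append(sum(x * x for x in tile))  # stand-in for real compute


def run_dae(tiles, depth=2):
    """Run both stages concurrently; `depth` bounds how far access runs ahead."""
    q = queue.Queue(maxsize=depth)
    results = []
    producer = threading.Thread(target=access_stage, args=(tiles, q))
    consumer = threading.Thread(target=execute_stage, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


print(run_dae([[1, 2], [3, 4], [5]]))
```

The bounded queue plays the role that message passing between flows plays in the description above: neither side inspects the other's state, and backpressure arises naturally when the queue is full.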
Practical implications are substantial:
- Simplification and Portability: Model-invariant μop libraries and resource-isolated execution enable compositional optimization and fast porting across hardware generations (Hopper, Blackwell), supporting future architectural extensions (e.g., SM-group shared memory, persistent tensor cores).
- Unified Heterogeneity: The same model can generalize beyond GPUs (e.g., AWS Trainium, FPGAs, disaggregated compute/memory nodes), since resource domains are explicit and coordination is message-passing, not implicit kernel launches.
- Compiler and Learning Integration: The separation of fine-grained backend handlers and system-level dynamic scheduling creates a robust interface for learned compilers to map operators into hardware-friendly, dependency-linked μop graphs, enabling future reinforcement-learning-based or metaheuristic optimization layers.
- Dynamic Workload Adaptation: VDCores is robust to batch fragmentation, context switching, and in-flight scheduling, issues present in LLM serving, RAG pipelines, or variable-length sequence models.
Future Directions
Several avenues are enabled by this abstraction:
- Advanced Compiler Support: Integration with high-level MLIR-like compilers and advanced auto-schedulers can leverage VDCores' μop-level granularity for schedule search, hardware-targeted cost modeling, and operator fusion.
- Hybrid and Disaggregated Architectures: As device heterogeneity expands, the VDCores approach provides a clean cross-device ABI for pipeline construction, enabling unified pipelines with CPUs, GPUs, TPUs, DPUs, and memory tiering using explicit dependency graphs.
- Learning-Based Runtime Scheduling: Adaptive policies for dynamic tiling, fusion, and resource mapping can be implemented as learned agents, with fine-grained resource feedback available from the runtime.
Conclusion
VDCores presents a resource-decoupled programming and execution architecture for asynchronous GPUs, establishing micro-operation graphs and virtual cores as the fundamental scheduling and programming units. This yields significantly improved hardware utilization, lower inference latency, and a substantial reduction in programming complexity over contemporary kernel- and megakernel-based runtime designs. The architecture also provides a future-proof path to decoupled, cross-device, and learning-augmented accelerators for dynamic deep learning workloads (2605.03190).