VDCores Model: Virtual Decoupled GPU Cores
- VDCores is a novel GPU programming model that abstracts hardware as resource-isolated virtual cores and expresses workloads as dependency-connected micro-ops.
- It decouples low-level orchestration from application logic, enabling dynamic scheduling and improved hardware utilization for complex inference tasks.
- The model features a low-overhead, highly concurrent C++/CUDA micro-op API, reaching about 94% mean SM utilization and a 1.31× geometric-mean throughput gain over expert-tuned megakernel baselines.
VDCores, or Virtual Decoupled Cores, refers to a resource-decoupled programming and execution model specifically designed for asynchronous GPUs. It provides a principled abstraction of GPU hardware units as resource-isolated virtual cores (compute and memory) and expresses workloads as dependency-connected micro-operations (μ-ops) rather than as monolithic kernels. This abstraction decouples programming from low-level hardware orchestration, enables dynamic, fine-grained scheduling, and significantly improves hardware utilization for irregular and memory/computation-overlapping workloads such as LLM inference. The VDCores model is realized via a C++/CUDA-flavored micro-op API and a GPU runtime that provides high concurrency, low overhead, and efficient cross-core coordination (He et al., 4 May 2026).
1. Formal Abstraction and Mathematical Model
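Let $\mathcal{H}$ denote the set of all asynchronous GPU hardware execution units, such as Tensor Core pipelines and TMA (Tensor Memory Accelerator) engines. VDCores virtualizes this set into two disjoint families, virtual compute cores (VCCs) and virtual memory cores (VMCs):
$$\mathcal{H} \;\longrightarrow\; \mathcal{C}_{\mathrm{VCC}} \cup \mathcal{C}_{\mathrm{VMC}}, \qquad \mathcal{C}_{\mathrm{VCC}} \cap \mathcal{C}_{\mathrm{VMC}} = \emptyset$$
A fixed pool of virtual cores is used, each mapped to a single SM’s hardware resources. Workloads are decomposed into a set $\mathcal{U}$ of fine-grained μ-ops, partitioned into compute and memory μ-ops as:
$$\mathcal{U} = \mathcal{U}_{\mathrm{comp}} \cup \mathcal{U}_{\mathrm{mem}}, \qquad \mathcal{U}_{\mathrm{comp}} \cap \mathcal{U}_{\mathrm{mem}} = \emptyset$$
Dependencies among μ-ops are expressed as a directed acyclic graph $G = (\mathcal{U}, E)$, where an edge $(u_i, u_j) \in E$ requires $u_j$ to wait for $u_i$'s result.
Each μ-op $u \in \mathcal{U}$ is annotated with:
- $\mathrm{core}(u)$: assigned virtual core
- $\mathrm{deps}(u)$: (direct) predecessor dependencies
- $\mathrm{flow}(u)$: virtual-flow ID for preserving per-flow ordering
Readiness at time $t$ is given by:
$$\mathrm{ready}(u, t) = \bigwedge_{p \in \mathrm{deps}(u)} \mathrm{done}(p, t)$$
where $\mathrm{done}(p, t)$ holds once $p$ has completed by time $t$.
Each virtual core $c$ manages a local queue $Q_c$ of μ-ops with satisfied dependencies; execution is scheduled opportunistically and supports flow bypassing to avoid head-of-line blocking.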
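The readiness rule and flow bypassing described above can be sketched as a small host-side model. This is only an illustration under stated assumptions, not the VDCores runtime: the `MicroOp` struct, its field names, and `pick_next` are hypothetical. For simplicity, the queue here also holds μ-ops that are not yet ready, so that bypassing is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_set>
#include <vector>

// Hypothetical host-side model of a μ-op; all names are illustrative.
struct MicroOp {
    int id;                 // index into the μ-op table
    int core;               // assigned virtual core
    int flow;               // virtual-flow ID
    std::vector<int> deps;  // direct predecessors (indices into the table)
};

// A μ-op is ready once every direct predecessor has completed.
bool is_ready(const MicroOp& u, const std::vector<bool>& done) {
    for (int p : u.deps) {
        if (!done[static_cast<std::size_t>(p)]) return false;
    }
    return true;
}

// Picks the next μ-op from one core's queue, or returns -1 if none can run.
// Order within a flow is preserved: once a flow's oldest queued μ-op is
// blocked, later μ-ops of that flow are skipped, while μ-ops of other
// flows may bypass it (no head-of-line blocking across flows).
int pick_next(std::deque<int>& queue,
              const std::vector<MicroOp>& ops,
              const std::vector<bool>& done) {
    std::unordered_set<int> blocked_flows;
    for (auto it = queue.begin(); it != queue.end(); ++it) {
        const MicroOp& u = ops[static_cast<std::size_t>(*it)];
        if (blocked_flows.count(u.flow) != 0) continue;
        if (is_ready(u, done)) {
            const int id = *it;
            queue.erase(it);
            return id;
        }
        blocked_flows.insert(u.flow);
    }
    return -1;
}
```

A caller would mark `done[id] = true` after executing the returned μ-op and call `pick_next` again. A real runtime would track readiness with counters or barriers rather than rescanning dependency lists.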
2. Programming Interface and Micro-Op API
VDCores introduces a concise API for μ-op definition, decoupling logic and orchestration. For example, a matrix-vector tile μ-op may be implemented as:
[Code listing: matrix-vector tile μ-op (not preserved in this copy)]
Channels (ctx.m2c, ctx.c2m) implement local FIFO queues between VMCs and VCCs; operations like pop_wait, push, and alloc_registers constitute the μ-op runtime interface. VDCores ships with ~30 built-in μ-ops (loads, stores, TMA ops, fused GEMM tiles, control loops, barriers) and allows extension with new μ-ops in under 50 lines, without monolithic kernel refactoring (He et al., 4 May 2026).
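To show the general shape of a μ-op written against these channel operations, here is a host-side mock in plain C++. Only the names `m2c`, `c2m`, `pop_wait`, `push`, and `alloc_registers` come from the text. The `Tile`, `Channel`, and `Context` types, every signature, and `gemv_tile_uop` itself are assumptions made for this sketch and should not be read as the actual VDCores API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical tile type: row-major rows x cols values.
struct Tile {
    std::vector<float> data;
    std::size_t rows;
    std::size_t cols;
};

// Host-side stand-in for a local FIFO between a VMC and a VCC.
struct Channel {
    std::deque<Tile> fifo;
    void push(Tile t) { fifo.push_back(std::move(t)); }
    // On the GPU this would wait for a tile; the mock requires one to exist.
    Tile pop_wait() {
        assert(!fifo.empty());
        Tile t = std::move(fifo.front());
        fifo.pop_front();
        return t;
    }
};

struct Context {
    Channel m2c;  // memory core -> compute core: loaded tiles
    Channel c2m;  // compute core -> memory core: results to store
    // Mock register allocation: returns n zero-initialized accumulators.
    std::vector<float> alloc_registers(std::size_t n) {
        return std::vector<float>(n, 0.0f);
    }
};

// Matrix-vector tile μ-op: computes y = A * x for one tile of A.
void gemv_tile_uop(Context& ctx, const std::vector<float>& x) {
    Tile a = ctx.m2c.pop_wait();          // consume a loaded tile
    assert(x.size() == a.cols);
    std::vector<float> acc = ctx.alloc_registers(a.rows);
    for (std::size_t r = 0; r < a.rows; ++r) {
        for (std::size_t c = 0; c < a.cols; ++c) {
            acc[r] += a.data[r * a.cols + c] * x[c];
        }
    }
    ctx.c2m.push(Tile{std::move(acc), a.rows, 1});  // hand result to memory core
}
```

The μ-op body contains only the tile computation. Where the tile comes from and where the result goes is left to whatever feeds `m2c` and drains `c2m`, which is the separation of logic from orchestration that the API aims for.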
3. GPU Runtime Architecture and Scheduling
At initialization, VDCores launches a persistent kernel on each GPU SM, which simultaneously manages the executors for that SM's virtual cores.
Each executor implements a two-stage pipeline:
- Control-flow unit (CFU): for μ-op decoding, register management, address arithmetic
- Execution units (EUs): actual μ-op computation
CFUs enqueue decoded μ-ops into small FIFOs for associated EUs. EUs execute, then notify readiness of dependents (μ-ops or shared-memory regions) using CUDA asynchronous barriers.
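The decode-then-execute handoff can be modeled on the host as a bounded FIFO between two stages. The sketch below is an assumption-laden illustration: `BoundedFifo`, `Decoded`, and `run_pipeline` are hypothetical names, the FIFO depth of 2 is arbitrary, and a plain counter decrement stands in for the CUDA asynchronous barriers the text mentions. It also assumes the program is listed in a valid topological order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Small fixed-capacity FIFO standing in for the CFU -> EU queue.
template <typename T, std::size_t N>
class BoundedFifo {
public:
    bool try_push(const T& v) {
        if (size_ == N) return false;  // full: the producer must stall
        slots_[(head_ + size_) % N] = v;
        ++size_;
        return true;
    }
    bool try_pop(T& out) {
        if (size_ == 0) return false;  // empty: the consumer idles
        out = slots_[head_];
        head_ = (head_ + 1) % N;
        --size_;
        return true;
    }
private:
    std::array<T, N> slots_{};
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};

// Hypothetical decoded μ-op: its id plus the ids of μ-ops waiting on it.
struct Decoded {
    int id;
    std::vector<int> dependents;
};

// Runs the decode (CFU) and execute (EU) stages in lockstep and returns the
// execution order. pending[i] counts unfinished predecessors of μ-op i and
// is decremented when the EU "notifies" dependents.
std::vector<int> run_pipeline(const std::vector<Decoded>& program,
                              std::vector<int>& pending) {
    BoundedFifo<Decoded, 2> fifo;
    std::vector<int> order;
    std::size_t next = 0;
    while (order.size() < program.size()) {
        // CFU stage: decode until the FIFO pushes back.
        while (next < program.size() && fifo.try_push(program[next])) ++next;
        // EU stage: execute one μ-op, then release its dependents.
        Decoded d{};
        if (fifo.try_pop(d)) {
            assert(pending[static_cast<std::size_t>(d.id)] == 0);
            order.push_back(d.id);
            for (int dep : d.dependents) --pending[static_cast<std::size_t>(dep)];
        }
    }
    return order;
}
```

Keeping the FIFO small bounds how far decoding can run ahead of execution, while still letting the two stages overlap.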
Global μ-op firing and data-flow depend on message-passing FIFOs:
- VMC→VCC "m2c" queues: loaded tiles for compute
- VCC→VMC "c2m" queues: used tiles for store/forward
- VMC→VMC queues: inter-memory μ-op region handoff
To approach peak hardware bandwidth, the scheduler targets decoding and dispatching a μ-op roughly every 90 GPU cycles on H100. Through software pipelining and SIMT-based bitmask register allocation, VDCores attains >94% of peak memory throughput and >82% of peak FLOPS on isolated kernels, with ~3.1% overall cycle overhead (He et al., 4 May 2026).
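One way to see why the dispatch interval matters is a back-of-envelope estimate of how many bytes each μ-op must move to keep one SM's share of memory bandwidth busy. The helper below is a sketch and not a formula from the source. The hardware figures in `h100_example` (3.35e12 B/s bandwidth, 132 SMs, 1.7e9 Hz clock) are assumed for illustration only.

```cpp
#include <cassert>

// Bytes one μ-op must move so that a single dispatch every
// `cycles_per_dispatch` cycles keeps one SM's share of memory bandwidth busy.
double min_bytes_per_uop(double bandwidth_bytes_per_s, int num_sms,
                         double clock_hz, double cycles_per_dispatch) {
    assert(bandwidth_bytes_per_s > 0 && num_sms > 0 && clock_hz > 0);
    const double bytes_per_cycle_per_sm =
        bandwidth_bytes_per_s / num_sms / clock_hz;
    return bytes_per_cycle_per_sm * cycles_per_dispatch;
}

// Illustrative inputs only (assumed, not taken from the source):
// 3.35e12 B/s memory bandwidth, 132 SMs, 1.7e9 Hz SM clock, 90 cycles.
double h100_example() {
    return min_bytes_per_uop(3.35e12, 132, 1.7e9, 90.0);
}
```

With these assumed figures the estimate comes out at roughly 1.3 KB per μ-op. Under this simple model, smaller transfers would need a faster dispatch rate to keep the memory system saturated.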
4. Decoupled vs. Monolithic GPU Execution
Traditional monolithic (megakernel) programming fuses compute and memory orchestration in a large kernel, with asynchronous operations explicitly embedded. This approach incurs significant code size, tuning complexity, and pipeline inefficiency due to hardware underutilization when operator boundaries or pipeline "bubbles" arise.
VDCores’ resource- and schedule-decoupled paradigm:
- Exposes each hardware unit as a virtual core
- Represents kernel logic as μ-op DAGs with explicit producer-consumer edges
- Schedules μ-ops dynamically at runtime per dependency and available resource
The resulting benefits include:
- Automatic overlap of memory and compute (bubbles are filled)
- Dynamic μ-op fusion (e.g., inter-operator store→load optimized on-the-fly)
- No need for large, statically tuned fused kernels
This enables substantially less code, higher resource utilization, and reduced specialization effort compared to monolithic baselines (He et al., 4 May 2026).
5. Performance Evaluation and Programming Effort
End-to-end LLM inference over representative models (Qwen1.7B, Qwen8B, Llama1B, Llama8B), using 64-step decoding, demonstrates:
- 1.31× geometric-mean throughput vs. expert-tuned megakernel baselines
- Up to 1.68× speedup for particular batch sizes; up to 6.18× for uneven context distributions
- Dynamic LoRA serving achieves up to 3.47× shorter makespan than S-LoRA staging
On H100, GH200, RTX 6000 Pro:
- Mean SM utilization rises from ~70% (monolithic) to ~94% (VDCores)
- Memory bandwidth utilization is near-peak, except at pipeline startup/tail
Summary of measured metrics:
| Metric | Monolithic Baseline | VDCores |
|---|---|---|
| SM utilization | ~70% | ~94% |
| Decoding throughput (avg.) | – | +24% |
| Specialization effort (LoC) | ~2–6K (GPU code) | ~741 |
| Code reduction | – | ~90% |
Complete Llama1B/8B end-to-end inference is implemented in 6 reusable μ-ops and ~741 LoC with VDCores, compared to 8–14 monolithic tasks, 3–8 fused variants, and 2–6K LoC in baselines (He et al., 4 May 2026).
6. Limitations and Prospective Extensions
Current VDCores restrictions include:
- Limited dependency expressivity: Each memory μ-op can have at most one inter-memory μ-op edge and an optional local compute dependency; richer DAGs with broader fan-in/fan-out are not directly supported.
- Operational overheads on sub-KB μ-ops: The model is optimized for coarse-tile granularity (≥4 KB), with higher overhead at finer scales.
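The dependency restriction above can be expressed as a simple graph check. The sketch below encodes one reading of it: each memory μ-op has at most one incoming edge from another memory μ-op and at most one incoming edge from a compute μ-op, and that compute μ-op must run on the same SM. The `Node` layout, the `sm` field, and the choice to count only incoming edges are assumptions, not details confirmed by the source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Memory, Compute };

// Hypothetical μ-op graph node; names are illustrative.
struct Node {
    Kind kind;
    int sm;                 // SM hosting the μ-op's virtual core (assumed notion of "local")
    std::vector<int> deps;  // direct predecessors (indices into the graph)
};

// Returns true if every memory μ-op has at most one memory predecessor and
// at most one compute predecessor, with the compute predecessor on the same SM.
bool memory_deps_supported(const std::vector<Node>& g) {
    for (const Node& n : g) {
        if (n.kind != Kind::Memory) continue;
        int mem_edges = 0;
        int compute_edges = 0;
        for (int p : n.deps) {
            const Node& pred = g[static_cast<std::size_t>(p)];
            if (pred.kind == Kind::Memory) {
                ++mem_edges;
            } else {
                if (pred.sm != n.sm) return false;  // compute dependency is not local
                ++compute_edges;
            }
        }
        if (mem_edges > 1 || compute_edges > 1) return false;
    }
    return true;
}
```

A front end could run a check like this before lowering a workload, then rewrite any offending fan-in (for example, by chaining memory μ-ops) when the graph does not fit.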
Proposed future directions:
- Virtual-core expansion: Supporting future asynchronous hardware (e.g., inter-SM TMA, Blackwell ML cores)
- Compiler integration: Direct μ-op and dependency emission from tensor-IR compilers, eliminating fused kernel generation
- Multi-GPU/disaggregated execution: Abstracting PCIe/TMA and NVLink transfers as memory μ-ops on new VMCs
These extensions suggest applicability across next-generation GPU architectures and broader accelerator fabrics (He et al., 4 May 2026).