Virtual Decoupled Engines (VDCores)

Updated 4 July 2026

VDCores is a decoupled GPU programming model that abstracts hardware execution units as virtual cores and schedules dependency-connected micro-ops.
It uses a persistent, pipelined runtime that dynamically overlaps tasks, yielding a 24% average decoding-throughput improvement and up to 77% under dynamic inputs.
The model simplifies GPU programming by reducing kernel specialization efforts and automating micro-op fusion, enhancing both performance and programmability.

Virtual Decoupled Engines (VDCores) are a decoupled programming and execution model for modern asynchronous GPUs in which asynchronous hardware execution units are abstracted as resource-isolated virtual cores and workloads are represented as dependency-connected micro-operations ($\mu$ops) (He et al., 4 May 2026). In its formal usage, the term denotes the GPU system introduced in “VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU” (He et al., 4 May 2026). In a broader interpretive sense, closely related decoupling patterns appear in work on disaggregated NUMA placement, virtual ECUs, virtual-node execution for deep learning, row-scale compute-power contracts, per-phase inverter control, and auxiliary-core execution of kernel phases, although those systems do not define VDCores as a common cross-domain abstraction (Lakew et al., 2 Jan 2025).

1. Origin and problem formulation

VDCores arises from the observation that modern GPUs increasingly rely on specialized and asynchronous hardware units, yet GPU software is still organized around a monolithic kernel model that was designed for more synchronous SIMT execution (He et al., 4 May 2026). The motivating hardware trends explicitly include Tensor Cores, hardware-assisted asynchronous memory movement, Tensor Memory Accelerator (TMA) on Hopper, asynchronous tensor-core execution mechanisms such as WGMMA, and, more generally, distinct internal resource domains for memory movement and compute. The central claim is that the kernel abstraction packages multiple resources and execution phases into one opaque unit with a single launch boundary and completion boundary, thereby creating execution bubbles, underutilized asynchronous units, and missed overlap opportunities, especially under dynamic workloads (He et al., 4 May 2026).

The paper identifies a coupled programmability and utilization problem. Programmers must manually orchestrate memory movement, tensor-core execution, synchronization, pipelining, multibuffering, warp specialization, and producer/consumer coordination. At the same time, the hardware cannot opportunistically reuse partially idle resources across kernel boundaries because memory and compute are bundled into the same execution object. VDCores addresses this mismatch by replacing monolithic kernels with a finer-grained model based on virtual memory cores, virtual compute cores, and dependency-connected $\mu$ops, with execution driven by dependency readiness and resource readiness rather than kernel-launch order (He et al., 4 May 2026).

A common misconception is that VDCores is simply another persistent-kernel or CUDA Graphs optimization. The paper argues for a stronger shift in abstraction: streams and kernel launches still compose execution out of whole kernels; CUDA Graphs reduce launch overhead but remain mostly a static graph of kernels; persistent kernels and megakernels still pack orchestration into statically designed tasks. VDCores instead lowers below the kernel abstraction and resolves overlap dynamically (He et al., 4 May 2026).

2. Core abstractions

The central abstractions of VDCores are Virtual Memory Cores (VMCs), Virtual Compute Cores (VCCs), and dependency-connected $\mu$ops (He et al., 4 May 2026). A VMC handles memory and control $\mu$ops, memory movement, local-memory management, loop/control state, and communication. A VCC handles compute resources including register files, SIMT cores, and asynchronous matrix-computation units such as tensor cores. A $\mu$op is the smallest unit of programming and execution in VDCores and is intended to fit within one resource domain when the needed resource is ready.

Abstraction	Role	Representative content
VMC	Memory and control execution	load, store, local-memory management, loop/control state
VCC	Compute execution	register files, SIMT cores, tensor cores
$\mu$op	Smallest programming and execution unit	memory $\mu$ops, compute $\mu$ops, control $\mu$ops

High-level ML operators are lowered into a $\mu$op graph composed of per-core $\mu$op streams connected by dependency edges. The paper’s operator examples include matrix-vector multiplication and RoPE (He et al., 4 May 2026). The programming model includes compute, memory, and shared control $\mu$ops for common ML patterns; control $\mu$ops such as loop and continue_if; dynamic address generation using flags and accumulator registers; local-memory and global-memory paths; and local dependency resolution for one-to-one memory dependencies.
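As a rough illustration of per-core streams connected by dependency edges, the following Python sketch lowers a tiled matrix-vector product into one VMC stream and one VCC stream and then issues them by dependency readiness. The class `MicroOp`, the functions `lower_gemv` and `issue_by_readiness`, the op names, and the one-op-per-core-per-step timing are all assumptions made for this example; none of them is taken from the VDCores API.

```python
from dataclasses import dataclass, field

@dataclass
class MicroOp:
    """Hypothetical micro-op record; field names are illustrative only."""
    name: str
    core: str                          # "vmc" or "vcc"
    deps: list = field(default_factory=list)

def lower_gemv(num_tiles):
    """Toy lowering of a tiled matrix-vector product into two streams."""
    if num_tiles < 1:
        raise ValueError("need at least one tile")
    streams = {"vmc": [], "vcc": []}
    prev_fma = None
    for t in range(num_tiles):
        load = MicroOp(f"load_tile{t}", "vmc")
        # Each compute op waits for its tile and for the previous partial sum.
        deps = [load.name] + ([prev_fma.name] if prev_fma else [])
        fma = MicroOp(f"fma_tile{t}", "vcc", deps)
        streams["vmc"].append(load)
        streams["vcc"].append(fma)
        prev_fma = fma
    streams["vmc"].append(MicroOp("store_out", "vmc", [prev_fma.name]))
    return streams

def issue_by_readiness(streams):
    """Each core issues its stream in order, one op per step, once deps are done."""
    done, steps = set(), []
    cursor = {core: 0 for core in streams}
    remaining = sum(len(s) for s in streams.values())
    while remaining:
        issued = []
        for core, stream in streams.items():
            i = cursor[core]
            if i < len(stream) and all(d in done for d in stream[i].deps):
                issued.append(stream[i].name)
                cursor[core] += 1
        if not issued:
            raise RuntimeError("no micro-op is ready: dependency cycle")
        done.update(issued)            # results become visible at the next step
        remaining -= len(issued)
        steps.append(tuple(issued))
    return steps

steps = issue_by_readiness(lower_gemv(2))
for step in steps:
    print(step)
```

In this toy run the second tile load is issued in the same step as the first compute op, which is the kind of memory/compute overlap the model is meant to expose without a hand-written pipeline.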

The abstraction removes static orchestration from the programmer. Programmers define what each $\mu$op does, what resources it consumes and produces, and what dependencies exist; they do not need to fully hard-code the exact overlap schedule, the exact software pipeline structure across all resources, fixed fusion variants, or fixed synchronization orchestration inside one megakernel (He et al., 4 May 2026). One important consequence is dynamic fusion: if neighboring memory $\mu$ops have a one-to-one dependency on the same VMC, the global-memory communication path can be rewritten as store.local and load.local, turning global-memory communication into shared-memory communication without writing a new fused kernel.
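The fusion rule described above can be sketched as a small rewrite pass. This is a minimal Python model, not the paper's implementation: the `MemOp` record, its field names, the `fuse_local` function, and the string op kinds `store.global` and `load.global` are hypothetical, and only the `store.local` / `load.local` targets come from the text.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class MemOp:
    """Hypothetical memory micro-op; field names are illustrative only."""
    op_id: int
    kind: str                  # "store.global", "load.global", ...
    vmc: int                   # virtual memory core that runs this op
    buffer: str                # logical buffer written or read
    dep: Optional[int] = None  # the single producer this op waits on

def fuse_local(ops):
    """Rewrite store/load pairs to local memory when the dependency is
    one-to-one and both micro-ops are placed on the same VMC."""
    by_id = {op.op_id: op for op in ops}
    consumers = {}
    for op in ops:
        if op.dep is not None:
            consumers[op.dep] = consumers.get(op.dep, 0) + 1
    out = dict(by_id)
    for op in ops:
        producer = by_id.get(op.dep) if op.dep is not None else None
        if producer is None or op.kind != "load.global":
            continue
        one_to_one = consumers[producer.op_id] == 1
        if (producer.kind == "store.global" and one_to_one
                and producer.vmc == op.vmc and producer.buffer == op.buffer):
            out[producer.op_id] = replace(producer, kind="store.local")
            out[op.op_id] = replace(op, kind="load.local")
    return [out[op.op_id] for op in ops]

ops = [
    MemOp(0, "store.global", vmc=0, buffer="q_proj"),
    MemOp(1, "load.global", vmc=0, buffer="q_proj", dep=0),  # same VMC: fusable
    MemOp(2, "store.global", vmc=0, buffer="k_proj"),
    MemOp(3, "load.global", vmc=1, buffer="k_proj", dep=2),  # different VMC
]
fused = fuse_local(ops)
print([op.kind for op in fused])
```

Only the first pair is rewritten; the second keeps its global-memory path because producer and consumer sit on different VMCs.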

3. Execution model and runtime organization

VDCores uses a persistent launch model in which each virtual core is launched once as a persistent kernel at the beginning of execution, and new $\mu$ops are streamed to it as requests arrive (He et al., 4 May 2026). The runtime is described as analogous to a pipelined superscalar microarchitecture, but implemented on the GPU. This organization is designed to preserve flexibility while minimizing implementation overhead on current GPUs.

Within a VMC, each $\mu$op goes through a 2-stage execution pipeline: a Front-end Control-Flow Unit (CFU) for register state, control $\mu$ops, address generation, and decode, followed by Load Units (LDUs) and Store Units (STUs) for dependency resolution and memory movement. The paper reports that this pipelining alone improves effective $\mu$op throughput relative to a naive single-loop design (He et al., 4 May 2026). Within each stage, SIMT cooperation is used to accelerate interpretation and control: shared-memory allocation state is tracked as a 32-bit bitmask, allocator threads probe candidate positions starting from their thread ID, and warp voting selects the first available location.
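The bitmask allocator can be modeled on the host to make the probing and voting steps concrete. The Python sketch below emulates a 32-lane warp: each lane tests the window of slots starting at its own lane ID, a ballot collects the votes, and the lowest voting lane wins. The function names and the contiguous-run interpretation for multi-slot requests are assumptions for this example; on a GPU the vote would typically be a warp ballot followed by a find-first-set.

```python
WARP_SIZE = 32  # one bit per shared-memory slot, one probing lane per bit

def ballot(predicates):
    """Emulate a warp vote: bit i is set when lane i votes true."""
    mask = 0
    for lane, vote in enumerate(predicates):
        if vote:
            mask |= 1 << lane
    return mask

def first_set_lane(mask):
    """Index of the lowest set bit, or -1 when no lane voted true."""
    return (mask & -mask).bit_length() - 1 if mask else -1

def allocate(used_mask, n=1):
    """Return (start, new_mask); start is -1 when no run of n free slots exists."""
    run = (1 << n) - 1
    votes = ballot(
        # Lane `lane` probes slots lane .. lane + n - 1.
        lane + n <= WARP_SIZE and (used_mask & (run << lane)) == 0
        for lane in range(WARP_SIZE)
    )
    start = first_set_lane(votes)
    if start < 0:
        return -1, used_mask
    return start, used_mask | (run << start)

def release(used_mask, start, n=1):
    """Clear n slot bits beginning at `start`."""
    return used_mask & ~(((1 << n) - 1) << start)

mask = 0b1011                       # slots 0, 1, and 3 are in use
slot_a, mask = allocate(mask)       # single slot
slot_b, mask = allocate(mask, n=2)  # two adjacent slots
print(slot_a, slot_b, bin(mask))
```

The sequential loop in `ballot` stands in for what all lanes would evaluate simultaneously; the selection logic is otherwise the same.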

A VCC manages register files, SIMT cores, and asynchronous matrix-computation units. On H100, the example configuration is two EUs per SM, each EU handling 128 SIMT threads. The control-flow unit and execution unit form a software pipeline, with CFU state stored in EU-shared memory; for each instruction, the CFU runs first on the EU’s threads, updates state, and then yields to the EU to execute the $\mu$op. The motivation is that more than 98.6% of execution time is spent in the EU (He et al., 4 May 2026).

Dependencies are resolved by FIFO message queues between virtual cores, implemented with mbarrier instructions. VCCs only communicate with VMCs on the same execution unit, while all VMCs can talk to each other. Communication includes VMC-to-VCC transfer of shared-memory regions loaded from global memory, VCC-to-VMC transfer of used or generated memory regions, and VMC-to-VMC ownership transfer through depId-based global queues (He et al., 4 May 2026). If an execution unit blocks on a message such as pop_wait(), it is descheduled and resumed when the message arrives.
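The blocking behavior of a queue pop can be modeled with cooperative tasks. The following Python sketch is a host-side analogy only: the real runtime uses mbarrier-based queues on the GPU, whereas here `Fifo`, `Scheduler`, and the generator tasks `vmc` and `vcc` are hypothetical names. A consumer that finds its queue empty is parked and only becomes runnable again when a producer pushes a message.

```python
from collections import deque

class Fifo:
    """Toy FIFO message queue with a list of parked consumers."""
    def __init__(self):
        self.items = deque()
        self.waiters = deque()

class Scheduler:
    """Tasks are generators that yield a Fifo to wait on, or None to continue."""
    def __init__(self):
        self.ready = deque()
        self.trace = []

    def push(self, queue, msg):
        queue.items.append(msg)
        if queue.waiters:                       # wake one parked consumer
            self.ready.append(queue.waiters.popleft())

    def run(self):
        while self.ready:
            task = self.ready.popleft()
            try:
                wait_on = next(task)
            except StopIteration:
                continue
            if wait_on is None or wait_on.items:
                self.ready.append(task)         # still runnable
            else:
                self.trace.append("park")
                wait_on.waiters.append(task)    # descheduled until a push

def vmc(sched, queue, regions):
    for region in regions:
        sched.trace.append(f"vmc load {region}")
        sched.push(queue, region)               # hand the region to the VCC
        yield None

def vcc(sched, queue, count):
    done = 0
    while done < count:
        while not queue.items:
            yield queue                         # analogue of pop_wait()
        region = queue.items.popleft()
        sched.trace.append(f"vcc compute {region}")
        done += 1
        yield None

sched, q = Scheduler(), Fifo()
sched.ready.append(vcc(sched, q, 2))            # consumer starts first and blocks
sched.ready.append(vmc(sched, q, ["a", "b"]))
sched.run()
print(sched.trace)
```

The trace begins with a park event because the consumer runs before any region has been loaded, then alternates loads and computes as messages arrive.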

To avoid arbitrary full dataflow scheduling overhead, the runtime uses virtual flows. The $\mu$op generator assigns each memory $\mu$op a virtualFlowId; $\mu$ops with direct dependencies are assigned to the same virtual flow, while independent $\mu$ops go to different virtual flows. At runtime, instructions within the same virtual flow execute in order, while instructions from different virtual flows may be reordered and overlapped. The paper states that enabling virtual-flow assignment gives a further 5% performance improvement on top of cross-task overlap (He et al., 4 May 2026).
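One simple way to realize the assignment rule is to let a memory $\mu$op inherit the flow of the op it depends on and to open a new flow otherwise. The Python sketch below assumes each op has at most one earlier memory dependency, given as an index or `None`; the function names and the input encoding are illustrative and not the generator's actual interface.

```python
def assign_virtual_flows(deps):
    """deps[i] is the index of the earlier memory micro-op that op i
    depends on, or None. Returns one virtual-flow ID per op."""
    flow_of = []
    next_flow = 0
    for i, dep in enumerate(deps):
        if dep is None:
            flow_of.append(next_flow)       # independent op: open a new flow
            next_flow += 1
        else:
            if not 0 <= dep < i:
                raise ValueError(f"op {i} must depend on an earlier op")
            flow_of.append(flow_of[dep])    # dependent op: stay in producer's flow
    return flow_of

def ops_per_flow(flow_of):
    """Group op indices by flow; each list must run in order, while
    different lists may be interleaved or overlapped by the runtime."""
    flows = {}
    for i, flow in enumerate(flow_of):
        flows.setdefault(flow, []).append(i)
    return flows

deps = [None, 0, None, 2, 1]
flow_of = assign_virtual_flows(deps)
print(flow_of)
print(ops_per_flow(flow_of))
```

Here ops 0, 1, and 4 form one ordered chain and ops 2 and 3 form another, so the runtime would be free to overlap the two chains.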

The design is explicitly constrained. Each memory $\mu$op has at most one inter-memory-$\mu$op dependency plus an optional dependency to a compute $\mu$op on the same SM. Deadlock avoidance follows the rule “in-order allocation and out-of-order execution”: the $\mu$op generator constructs a baseline $\mu$op order that is deadlock-free under in-order execution, the runtime preserves this order for resource allocation, and later execution can proceed out of order when dependencies are ready (He et al., 4 May 2026).
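The rule can be illustrated with a toy cycle-level model in Python. Everything here is an assumption made for the example: the `Op` record, the `ready_at` field that stands in for real dependency resolution, the slot counts, and the one-cycle execution time. The point is only that slots are granted strictly in baseline order while any allocated op may execute as soon as it is ready.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    slots: int      # memory slots the micro-op needs
    ready_at: int   # cycle at which its dependency is satisfied (toy input)

def simulate(ops, total_slots):
    if any(op.slots > total_slots for op in ops):
        raise ValueError("an op can never fit in the slot pool")
    free, cursor, cycle = total_slots, 0, 0
    allocated, alloc_order, exec_order = [], [], []
    while len(exec_order) < len(ops):
        # In-order allocation: stop at the first op that does not fit.
        while cursor < len(ops) and ops[cursor].slots <= free:
            op = ops[cursor]
            free -= op.slots
            allocated.append(op)
            alloc_order.append(op.name)
            cursor += 1
        # Out-of-order execution: run every allocated op that is ready.
        for op in [o for o in allocated if o.ready_at <= cycle]:
            allocated.remove(op)
            exec_order.append(op.name)
            free += op.slots                # slots are released on completion
        cycle += 1
    return alloc_order, exec_order

ops = [Op("A", 2, ready_at=3), Op("B", 1, ready_at=0), Op("C", 2, ready_at=1)]
alloc_order, exec_order = simulate(ops, total_slots=3)
print(alloc_order)
print(exec_order)
```

B executes before A even though A was allocated first, while C must wait for slots because allocation never skips ahead of the baseline order.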

4. Evaluation and empirical results

The end-to-end evaluation uses H100, GH200, and RTX 6000 Pro GPUs, with a deployment configuration of 1 VMC and 2 VCCs per SM and an 8 KB memory slot size (He et al., 4 May 2026). The runtime executors and $\mu$op generator comprise 5K lines of C++/CUDA, while higher-level APIs and PyTorch integration comprise 4K Python LoC. Workloads are four representative LLM inference workloads (Qwen3-1.7B, Qwen3-8B, Llama3.2-1B, and Llama3.1-8B) under offline decoding with KV cache and paged attention, fixed batch for 64 steps from a 128-token context, and batch sizes 1 to 8. Additional dynamic studies use dynamic uneven context lengths and dynamic LoRA serving. Baselines are vLLM, SGLang, Mirage, ThunderKittens-llama1B, and Torch + ThunderKittens (He et al., 4 May 2026).

Across 48 evaluated combinations of model, hardware, and batch size, VDCores consistently achieves the lowest per-token decoding latency, delivers 1.31x geometric-mean speedup over the best baseline in each setting, and yields a 23% average reduction in per-token latency, with up to 1.68x performance improvement (He et al., 4 May 2026). The abstract reports 24% average decoding throughput improvement and up to 77% under dynamic inputs. The gains are especially strong on smaller models, shorter tasks, and dynamic workloads, where shorter tasks amplify the cost of bubbles and boundary underutilization in kernel-based execution.
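Speedup factors and latency reductions are two views of the same ratio, which makes the figures above easy to cross-check. The short Python sketch below uses the standard identity reduction = 1 - 1/speedup; the function name is illustrative, and because the reported 1.31x is a geometric mean while the 23% is an average, the two are expected to agree only approximately.

```python
def latency_reduction(speedup):
    """Fractional latency reduction implied by a speedup factor."""
    return 1.0 - 1.0 / speedup

for label, speedup in [("geometric-mean speedup", 1.31), ("best-case speedup", 1.68)]:
    print(f"{label} {speedup:.2f}x -> {latency_reduction(speedup):.1%} lower latency")
```

A 1.31x speedup corresponds to roughly 24% lower latency, in line with the reported 23% average, and the 1.68x best case corresponds to roughly 40% lower latency.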

Ablation results attribute the dominant share of the benefit to cross-task overlap. Without it, VDCores is comparable to or sometimes slower than baselines; virtual-flow assignment adds another 5% (He et al., 4 May 2026). Dynamic fusion closely matches manual fused-kernel benefits for QKV-Projection + RoPE and the MLP block. For Embedding + RMS, which VDCores fuses automatically, runtime is reduced from 4.30 $\mu$s to 3.10 $\mu$s, a 28% improvement. Under four H100 dynamic-attention regimes, VDCores beats Mirage in all regimes and beats ThunderKittens by 15% on average. In dynamic LoRA serving, it improves makespan, and its scheduling time is 1.4 ms, compared with 27000 ms for Mirage, 2.28 ms for TK (attn), 352 ms for vLLM, and 261 ms for SGLang (He et al., 4 May 2026).

The paper also presents a strong programmability result. For the evaluated Llama3-1B pipeline, VDCores uses 6 reusable $\mu$ops, 741 LoC, and 0 fused tasks, compared with 2,339 LoC and 3 fused tasks for Mirage, 2,065 LoC and 5 fused tasks for TK-llama1b, and 6,424 LoC and 8 fused tasks for vLLM. This is summarized as roughly 90% reduction in kernel programming and specialization effort (He et al., 4 May 2026).

At operator level, VDCores stays within 8% of peak performance on average, sustains over 82% of peak FLOPS, and sustains 93% of peak memory bandwidth on H100 (He et al., 4 May 2026). Runtime overhead is quantified by startup latency of 422 ns, $\mu$op initialization interval of 22 ns, average memory load of 34 ns, average memory store of 43 ns, average memory control of 22 ns, and aggregate runtime overhead of 3.1% of total core time. A notable caveat is that hand-specialized megakernels can approach VDCores in narrow regimes: ThunderKittens-Llama1B reaches about 96% of VDCores performance on a heavily hand-specialized Llama1B batch-size-1 setting (He et al., 4 May 2026).

5. Relation to adjacent decoupling paradigms

The exact term VDCores belongs to the asynchronous-GPU system of (He et al., 4 May 2026), but several adjacent systems can be read as domain-specific analogues of resource decoupling. In disaggregated NUMA infrastructure, “Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems” studies virtual machines whose vCPUs and memory are composed from a physically disaggregated, cache-coherent shared-memory pool spanning multiple servers (Lakew et al., 2 Jan 2025). Its mapping algorithm pins virtual cores and may remap VM layouts based on application sensitivity, interference class, and observed hardware counter behavior. The paper does not define a VDCores API, but it identifies control-plane requirements that closely resemble a compute/memory decoupling substrate: assign each vCPU to exactly one core, avoid assigning more than one vCPU to a core, minimize server and NUMA spreading, prefer local memory or close NUMA connectivity for remote-sensitive workloads, avoid harmful colocation pairs, minimize reshuffle, and trigger remapping only when measured degradation exceeds threshold. On its six-server, 288-core, 1176-GB platform, it reports overall average improvement as “up-to 50x” versus the default Linux scheduler (Lakew et al., 2 Jan 2025).

In deep learning systems, “VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware” introduces virtual node processing, in which the batch is partitioned across virtual nodes rather than hardware accelerators, and multiple virtual nodes may be mapped to each accelerator and processed sequentially within the same step (Or et al., 2020). This is a coarser-grained abstraction than VDCores: it preserves synchronous data-parallel semantics rather than lowering execution below the kernel level. Still, it demonstrates stable logical execution units distinct from hardware, many-to-one realization via sequential execution, elasticity, and heterogeneous assignment. Its reported gains include up to 48% lower job completion times with resource elasticity and up to 42% higher throughput with heterogeneous training (Or et al., 2020).

Industrial virtual-engineering work shows another variant of decoupling. “Full Virtualization of Renault's Engine Management Software and Application to System Development” virtualizes the full engine management application software scope as a virtual ECU on a Windows PC in closed loop with an engine model (Wissel et al., 2018). This is full application-level functional virtualization, not full hardware/platform virtualization and not production-binary equivalence. The contribution is architectural and workflow-oriented: a generated virtual operating system from an OS specification table, per-module wrapper generation, separate code generation per module, and closed-loop execution with plant models. The process reduces integrated feedback latency from weeks or months to minutes, with code generation per module about 40 s, full vECU compilation about 90 s, and incremental rebuild of the entire vECU in under 3 min when one module changes (Wissel et al., 2018).

Power-delivery research offers a physically grounded, rather than software-grounded, analogue. “Cognition Engines: A Row-Scale HVDC Architecture for Computational Continuity of AI” proposes a row-scale HVDC architecture in which DRUs provide fast energy via controlled droop, SSTs regulate average power with bounded ramps, distributed film capacitance and clamps absorb the first edge, and the row boundary exports a bounded, contract-compliant power profile (Churnock, 16 Sep 2025). This is not a virtualized software substrate, but it is a decoupled compute-power engine with an explicit continuity contract covering a steady-state band, a bound on transient deviation, a recovery time specified in milliseconds, a margin requirement, no reverse power flow at the PCC, and no high-frequency export at the PCC. The paper’s own limitation is that it is an architectural positioning rather than a deployment report (Churnock, 16 Sep 2025).

OS and control research reveal yet other interpretations. “Transkernel: Bridging Monolithic Kernels to Peripheral Cores” offloads specific kernel phases from a monolithic kernel to a low-power peripheral core using cross-ISA dynamic binary translation and a narrow stable binary interface of 12 functions and 1 variable, yielding 34% energy reduction in its ARM prototype (Guo et al., 2018). “A New Virtual Oscillator based Grid-forming Controller with Decoupled Control Over Individual Phases and Improved Performance of Unbalanced Fault Ride-through” decomposes synchronization through positive-, negative-, and zero-sequence oscillators while giving each phase its own reference and nested control loops, thereby functioning as a per-phase decoupled control architecture under unbalanced faults (Ghosh et al., 2022). These works do not define VDCores, but they show that virtualized or decoupled “engines” can be interpreted at the levels of OS phases, inverter phases, shared-memory resources, and power-delivery cells.

6. Limits, interpretations, and research directions

VDCores is not an unconstrained dataflow machine and should not be described as arbitrary full dataflow scheduling (He et al., 4 May 2026). Its dependency model is intentionally restricted: each memory $\mu$op has at most one inter-memory-$\mu$op dependency plus an optional dependency to a compute $\mu$op on the same SM; within an individual LDU or STU, execution is still in order; VCC communication is local to the same execution unit; and the evaluated implementation is tuned around current NVIDIA GPU execution semantics and asynchronous features. The system is therefore most compelling where dynamic overlap opportunities are substantial, especially in LLM decoding, dynamic sequence lengths, dynamic batching, and LoRA serving (He et al., 4 May 2026).

A second misconception is that “virtual” in VDCores means hardware-independence in the broadest possible sense. The GPU paper instead presents a highly GPU-specialized programming model and GPU runtime design. The broader interpretive literature reinforces that decoupling is typically constrained by the substrate: disaggregated NUMA systems are limited by multi-level NUMA distance and interference (Lakew et al., 2 Jan 2025); VirtualFlow is limited to synchronous data-parallel training (Or et al., 2020); Renault’s virtual ECU omits hand-coded basic software, real hardware timing behavior, and preemptive multitasking effects (Wissel et al., 2018); Cognition Engines are physically implemented row-scale HVDC cells rather than software-only abstractions (Churnock, 16 Sep 2025); and transkernel relies on ISA similarity, shared addressability, and hot-path specialization (Guo et al., 2018).

The most defensible broader interpretation is therefore narrow but useful: VDCores names a concrete asynchronous-GPU model, while also exemplifying a more general systems pattern in which logical execution units are made distinct from the physical resource bundles that traditionally define them. This suggests that future decoupled engines will need to combine multiple control dimensions that the adjacent literature has already made explicit: topology-aware mapping of virtual compute and memory resources (Lakew et al., 2 Jan 2025), many-to-one remapping and weighted synchronization under changing hardware allocations (Or et al., 2020), generated scheduling layers that frontload system integration (Wissel et al., 2018), explicit power-continuity contracts at the row boundary (Churnock, 16 Sep 2025), and narrow stable virtualization boundaries for offloaded execution phases (Guo et al., 2018). A plausible implication is that the long-term significance of VDCores lies less in any single runtime mechanism than in the broader shift from monolithic, hardware-shaped execution objects toward dependency-driven, resource-isolated, and remappable execution engines.

The VDCores implementation evaluated in (He et al., 4 May 2026) has been open sourced at https://github.com/vdcores/vdcores.