VDCores Model: Virtual Decoupled GPU Cores

Updated 2 June 2026

VDCores is a novel GPU programming model that abstracts hardware as resource-isolated virtual cores and expresses workloads as dependency-connected micro-ops.
It decouples low-level orchestration from application logic, enabling dynamic scheduling and improved hardware utilization for complex inference tasks.
The model provides a C++/CUDA micro-op API with low overhead and high concurrency, reaching about 94% mean SM utilization and a 1.31× geometric-mean throughput gain over expert-tuned megakernel baselines.

VDCores, or Virtual Decoupled Cores, refers to a resource-decoupled programming and execution model specifically designed for asynchronous GPUs. It provides a principled abstraction of GPU hardware units as resource-isolated virtual cores (compute and memory) and expresses workloads as dependency-connected micro-operations (μ-ops) rather than as monolithic kernels. This abstraction decouples programming from low-level hardware orchestration, enables dynamic, fine-grained scheduling, and significantly improves hardware utilization for irregular and memory/computation-overlapping workloads such as LLM inference. The VDCores model is realized via a C++/CUDA-flavored micro-op API and a GPU runtime that provides high concurrency, low overhead, and efficient cross-core coordination (He et al., 4 May 2026).

1. Formal Abstraction and Mathematical Model

Let $H$ denote the set of all asynchronous GPU hardware execution units, such as Tensor-Core pipelines and TMA (tensor-memory-accelerator) engines. VDCores virtualizes this set into two disjoint families:

$VCC = \{vcc_1, \ldots, vcc_n\}$ : virtual compute cores
$VMC = \{vmc_1, \ldots, vmc_m\}$ : virtual memory cores

A fixed pool $C = VCC \cup VMC$ of $|C| \ll |H|$ virtual cores is used, each mapped to a single SM’s hardware resources. Workloads are decomposed into a set $U$ of fine-grained μ-ops, partitioned as:

$U = U_{compute} \cup U_{mem} \cup U_{ctrl}$

Dependencies among μ-ops are expressed as a directed acyclic graph $G = (U, E)$, where an edge $(u \to v) \in E$ requires $v$ to wait for $u$'s result.

Each μ-op $u \in U$ is annotated with:

$core(u) \in C$: assigned virtual core
$deps(u) = \{v \in U : (v \to u) \in E\}$: (direct) predecessor dependencies
$flow(u)$: virtual-flow ID for preserving per-flow ordering

Readiness at time $t$ is given by:

$ready(u, t) = \bigwedge_{v \in deps(u)} done(v, t)$

where $done(v, t)$ holds once $v$ has completed by time $t$.

Each virtual core $c \in C$ manages a local queue $Q_c$ of μ-ops with satisfied dependencies; execution is scheduled opportunistically and supports flow bypassing to avoid head-of-line blocking.
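The readiness rule and flow bypassing can be illustrated with a small host-side C++ sketch. This is not the VDCores runtime: `MicroOp`, `is_ready`, and `pop_with_bypass` are hypothetical names, and the sketch assumes a core-local queue that may also hold μ-ops still waiting on predecessors, so that bypassing has something to skip.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_set>
#include <vector>

// Hypothetical host-side model of a micro-op; field names are illustrative.
struct MicroOp {
    int id;                 // unique, non-negative micro-op identifier
    int flow;               // virtual-flow ID (per-flow order must be kept)
    std::vector<int> deps;  // IDs of direct predecessors
};

// Readiness: true when every direct predecessor of u has completed.
bool is_ready(const MicroOp& u, const std::unordered_set<int>& done) {
    for (std::size_t i = 0; i < u.deps.size(); ++i) {
        if (done.count(u.deps[i]) == 0) return false;
    }
    return true;
}

// Pick the next micro-op from a core-local queue with flow bypassing.
// Scan from the head; a flow with an earlier, still-queued micro-op is
// skipped to keep per-flow order. Returns the issued ID, or -1 if none.
int pop_with_bypass(std::deque<MicroOp>& queue,
                    const std::unordered_set<int>& done) {
    std::unordered_set<int> blocked_flows;
    for (std::size_t i = 0; i < queue.size(); ++i) {
        const MicroOp& u = queue[i];
        assert(u.id >= 0);  // -1 is reserved as the "nothing issued" result
        if (blocked_flows.count(u.flow) == 0 && is_ready(u, done)) {
            int picked = u.id;
            queue.erase(queue.begin() + static_cast<std::ptrdiff_t>(i));
            return picked;
        }
        blocked_flows.insert(u.flow);  // later micro-ops of this flow wait
    }
    return -1;
}
```

In this model, a blocked μ-op at the head of the queue delays only its own flow; a ready μ-op from a different flow further back can still issue.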

2. Programming Interface and Micro-Op API

VDCores introduces a concise API for defining μ-ops that keeps application logic separate from orchestration. A matrix-vector tile μ-op, for example, can be implemented directly against this API.

Channels (ctx.m2c, ctx.c2m) implement local FIFO queues between VMCs and VCCs; operations like pop_wait, push, and alloc_registers constitute the μ-op runtime interface. VDCores ships with ~30 built-in μ-ops (loads, stores, TMA ops, fused GEMM tiles, control loops, barriers) and allows extension with new μ-ops in under 50 lines, without monolithic kernel refactoring (He et al., 4 May 2026).
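To make the channel-based style concrete, here is a minimal single-threaded C++ mock of a matrix-vector tile μ-op. Only the names `m2c`, `c2m`, `pop_wait`, `push`, and `alloc_registers` come from the description above; their signatures, the `Tile`, `Channel`, and `Context` types, and the function `matvec_tile_uop` are assumptions made for illustration, not the actual VDCores API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// A tile of a row-major matrix plus the matching slice of the input vector.
struct Tile {
    std::size_t rows, cols;
    std::vector<float> a;  // rows * cols matrix entries
    std::vector<float> x;  // cols vector entries
};

// Single-threaded stand-in for a local FIFO channel between virtual cores.
template <typename T>
struct Channel {
    std::deque<T> fifo;
    void push(const T& v) { fifo.push_back(v); }
    // On a GPU this would wait on a barrier; here the item must already exist.
    T pop_wait() {
        assert(!fifo.empty());
        T v = fifo.front();
        fifo.pop_front();
        return v;
    }
};

// Hypothetical micro-op context: m2c carries loaded tiles, c2m carries results.
struct Context {
    Channel<Tile> m2c;
    Channel<std::vector<float> > c2m;
    // Stand-in for register allocation: returns n zeroed accumulators.
    std::vector<float> alloc_registers(std::size_t n) {
        return std::vector<float>(n, 0.0f);
    }
};

// Matrix-vector tile micro-op: consume one tile, compute y = A x, hand y back.
void matvec_tile_uop(Context& ctx) {
    Tile t = ctx.m2c.pop_wait();
    assert(t.a.size() == t.rows * t.cols && t.x.size() == t.cols);
    std::vector<float> acc = ctx.alloc_registers(t.rows);
    for (std::size_t r = 0; r < t.rows; ++r)
        for (std::size_t c = 0; c < t.cols; ++c)
            acc[r] += t.a[r * t.cols + c] * t.x[c];
    ctx.c2m.push(acc);
}
```

The μ-op body contains only tile arithmetic and channel operations; where the tile came from and where the result goes is left to whichever memory μ-ops sit on the other end of the channels.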

3. GPU Runtime Architecture and Scheduling

At initialization, VDCores launches one persistent kernel per GPU SM. Each kernel instance concurrently manages:

One VMC executor (memory μ-ops)
Two VCC executors (compute μ-ops)

Each executor implements a two-stage pipeline:

Control-flow unit (CFU): for μ-op decoding, register management, address arithmetic
Execution units (EUs): actual μ-op computation

CFUs enqueue decoded μ-ops into small FIFOs for associated EUs. EUs execute, then notify readiness of dependents (μ-ops or shared-memory regions) using CUDA asynchronous barriers.
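The two-stage structure can be modeled on the host with a small bounded FIFO between a decode stage and an execute stage. The sketch below is a simplified, sequential illustration; `DecodedFifo` and `run_executor` are hypothetical names, μ-ops are reduced to integer IDs, and real executors run the two stages concurrently on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity ring buffer standing in for the small CFU -> EU FIFO.
struct DecodedFifo {
    std::vector<int> slots;
    std::size_t head, count;
    explicit DecodedFifo(std::size_t capacity)
        : slots(capacity), head(0), count(0) {
        assert(capacity > 0);
    }
    bool full() const { return count == slots.size(); }
    bool empty() const { return count == 0; }
    void push(int v) {
        assert(!full());
        slots[(head + count) % slots.size()] = v;
        ++count;
    }
    int pop() {
        assert(!empty());
        int v = slots[head];
        head = (head + 1) % slots.size();
        --count;
        return v;
    }
};

// Run one executor until every micro-op in `program` has executed.
// Each loop iteration models one step: the EU drains one decoded micro-op
// (if any), then the CFU decodes the next one (if the FIFO has room).
// Returns the micro-op IDs in the order the EU executed them.
std::vector<int> run_executor(const std::vector<int>& program,
                              std::size_t fifo_capacity) {
    DecodedFifo fifo(fifo_capacity);
    std::vector<int> executed;
    std::size_t next = 0;
    while (executed.size() < program.size()) {
        if (!fifo.empty()) executed.push_back(fifo.pop());  // EU stage
        if (next < program.size() && !fifo.full()) {        // CFU stage
            fifo.push(program[next]);
            ++next;
        }
    }
    return executed;
}
```

The FIFO lets decoding run ahead of execution by up to `fifo_capacity` μ-ops, which is the property that hides decode latency behind execution in a pipeline of this shape.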

Global μ-op firing and data-flow depend on message-passing FIFOs:

VMC→VCC "m2c" queues: loaded tiles for compute
VCC→VMC "c2m" queues: used tiles for store/forward
VMC→VMC queues: inter-memory μ-op region handoff

To approach peak hardware bandwidth, the scheduler targets one decode-and-dispatch roughly every 90 GPU cycles on H100. Through software pipelining and SIMT-based bitmask register allocation, VDCores attains >94% of peak memory throughput and >82% of peak FLOPS on isolated kernels, with ~3.1% overall cycle overhead (He et al., 4 May 2026).
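A back-of-the-envelope calculation shows how a dispatch interval relates to the amount of data each memory μ-op must move. The helper below is an illustrative sketch, not part of VDCores; it assumes bandwidth is shared evenly across SMs and that each SM dispatches one memory μ-op per interval.

```cpp
#include <cassert>

// Bytes each SM must move per clock cycle to saturate device memory bandwidth.
double bytes_per_cycle_per_sm(double mem_bandwidth_bytes_per_s, int num_sms,
                              double clock_hz) {
    assert(num_sms > 0 && clock_hz > 0.0);
    return mem_bandwidth_bytes_per_s / (num_sms * clock_hz);
}

// Smallest payload per memory micro-op that keeps the memory pipe full when
// one micro-op is dispatched every `dispatch_cycles` cycles on each SM.
double min_bytes_per_uop(double mem_bandwidth_bytes_per_s, int num_sms,
                         double clock_hz, double dispatch_cycles) {
    assert(dispatch_cycles > 0.0);
    return bytes_per_cycle_per_sm(mem_bandwidth_bytes_per_s, num_sms, clock_hz)
           * dispatch_cycles;
}
```

With approximate H100 SXM figures (about 3.35 TB/s of HBM bandwidth, 132 SMs, and an assumed clock near 1.7 GHz), `min_bytes_per_uop(3.35e12, 132, 1.7e9, 90.0)` comes out at roughly 1.3 KB. Under these assumptions, a μ-op dispatched every 90 cycles has to carry on the order of a kilobyte or more for dispatch not to become the bottleneck.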

4. Decoupled vs. Monolithic GPU Execution

Traditional monolithic (megakernel) programming fuses compute and memory orchestration in a large kernel, with asynchronous operations explicitly embedded. This approach incurs significant code size, tuning complexity, and pipeline inefficiency due to hardware underutilization when operator boundaries or pipeline "bubbles" arise.

VDCores’ resource- and schedule-decoupled paradigm:

Exposes each hardware unit as a virtual core
Represents kernel logic as μ-op DAGs with explicit producer-consumer edges
Schedules μ-ops dynamically at runtime per dependency and available resource

The resulting benefits include:

Automatic overlap of memory and compute (bubbles are filled)
Dynamic μ-op fusion (e.g., inter-operator store→load optimized on-the-fly)
No need for large, statically tuned fused kernels

This enables substantially less code, higher resource utilization, and reduced specialization effort compared to monolithic baselines (He et al., 4 May 2026).
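The benefit of overlapping memory and compute can be estimated with a simple timing model. The sketch below is an idealized illustration with hypothetical function names: it assumes identical tiles, a fixed load time and compute time per tile, and enough buffering for the next load to proceed while the current tile is being computed.

```cpp
#include <algorithm>
#include <cassert>

// Schedule with no overlap: each tile is loaded, then computed, in sequence.
double serialized_time(int tiles, double load, double compute) {
    assert(tiles >= 0);
    return tiles * (load + compute);
}

// Decoupled schedule: while the compute core works on tile i, the memory core
// loads tile i+1. The first load and the last compute cannot be hidden.
double overlapped_time(int tiles, double load, double compute) {
    assert(tiles >= 0);
    if (tiles == 0) return 0.0;
    return load + (tiles - 1) * std::max(load, compute) + compute;
}
```

For four tiles with a load time of 2 units and a compute time of 3 units, the serialized schedule takes 20 units and the overlapped one takes 14. The gap grows with the number of tiles and is largest when load and compute times are balanced.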

5. Performance Evaluation and Programming Effort

End-to-end LLM inference over representative models (Qwen1.7B, Qwen8B, Llama1B, Llama8B), using 64-step decoding, demonstrates:

1.31× geometric-mean throughput vs. expert-tuned megakernel baselines
Up to 1.68× speedup for particular batch sizes; up to 6.18× for uneven context distributions
Dynamic LoRA serving achieves up to 3.47× faster makespan versus S-LoRA staging

On H100, GH200, RTX 6000 Pro:

Mean GPU–SM utilization rises from ~70% (monolithic) to ~94% (VDCores)
Memory bandwidth utilization is near-peak, except at pipeline startup/tail

The table below summarizes the measured metrics:

Metric	Monolithic Baseline	VDCores
SM utilization	~70%	~94%
Decoding throughput (avg.)	–	+24%
Specialization effort (LoC)	~2–6K (GPU code)	~741
Code reduction	–	~90%

Complete Llama1B/8B end-to-end inference is implemented in 6 reusable μ-ops and ~741 LoC with VDCores, compared to 8–14 monolithic tasks, 3–8 fused variants, and 2–6K LoC in baselines (He et al., 4 May 2026).
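As a rough consistency check on the reported figures (simple arithmetic on the numbers above, not an additional measurement), the code-reduction and utilization ratios work out to:

$1 - \frac{741}{6000} \approx 0.88, \qquad 1 - \frac{741}{2000} \approx 0.63$

$\frac{0.94}{0.70} \approx 1.34$

The ~90% code reduction in the table therefore corresponds to the upper end of the 2–6K LoC baseline range, and the SM utilization gain taken alone is a factor of about 1.34.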

6. Limitations and Prospective Extensions

Current VDCores restrictions include:

Limited dependency expressivity: Each memory μ-op can have at most one inter-memory μ-op edge and an optional local compute dependency; richer DAGs with broader fan-in/fan-out are not directly supported.
Operational overheads on sub-KB μ-ops: The model is optimized for coarse-tile granularity (≥4 KB), with higher overhead at finer scales.
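The dependency restriction can be expressed as a simple graph check. The sketch below is one possible reading, not the VDCores implementation: it interprets the limits as applying to the incoming dependencies of each memory μ-op, and `CoreKind`, `UopNode`, and `memory_deps_supported` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum CoreKind { kMemory, kCompute };

// Hypothetical micro-op record: its core kind and its predecessors' indices.
struct UopNode {
    CoreKind kind;
    std::vector<std::size_t> deps;
};

// Returns true if every memory micro-op depends on at most one other memory
// micro-op and at most one compute micro-op (assumed reading of the limits).
bool memory_deps_supported(const std::vector<UopNode>& graph) {
    for (std::size_t i = 0; i < graph.size(); ++i) {
        if (graph[i].kind != kMemory) continue;
        int mem_edges = 0, compute_edges = 0;
        for (std::size_t k = 0; k < graph[i].deps.size(); ++k) {
            std::size_t d = graph[i].deps[k];
            assert(d < graph.size());  // dependency must name a valid node
            if (graph[d].kind == kMemory) ++mem_edges;
            else ++compute_edges;
        }
        if (mem_edges > 1 || compute_edges > 1) return false;
    }
    return true;
}
```

A graph that fails such a check would need to be restructured, for example by chaining memory μ-ops, before it fits the currently supported dependency shapes.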

Proposed future directions:

Virtual-core expansion: Supporting future asynchronous hardware (e.g., inter-SM TMA, Blackwell ML cores)
Compiler integration: Direct μ-op and dependency emission from tensor-IR compilers, eliminating fused kernel generation
Multi-GPU/disaggregated execution: Abstracting PCIe/TMA and NVLink transfers as memory μ-ops on new VMCs

These extensions suggest applicability across next-generation GPU architectures and broader accelerator fabrics (He et al., 4 May 2026).

References (1)

VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU (2026)