
Virtual Memory Cores (VMCs) in GPUs

Updated 2 June 2026
  • Virtual Memory Cores (VMCs) are software-managed abstractions that virtualize GPU memory pipelines to decouple and optimize memory operations.
  • They enable fine-grained dependency scheduling of micro-operations to maximize GPU utilization and achieve dynamic compute-memory overlap.
  • The VDCores framework, which pairs VMCs with virtual compute cores, delivers significant throughput gains and reduced code complexity compared to traditional monolithic kernel designs.

VDCores is a framework for resource-decoupled programming and execution on asynchronous GPUs. It virtualizes specialized GPU hardware into software-managed "virtual cores", orchestrates computation through fine-grained, dependency-connected micro-operations, and provides a runtime and programming interface aimed at maximizing utilization of GPU resources such as Tensor Core pipelines and Tensor Memory Accelerator (TMA) engines. VDCores enables significant throughput gains and code reduction for LLM inference workloads, with automatic compute-memory overlap and dynamic fusion capabilities, while minimizing monolithic kernel code (He et al., 4 May 2026).

1. Mathematical Foundation and Abstraction

VDCores formalizes GPU hardware resource management via the introduction of "virtual cores". Given the set $H$ of all asynchronous hardware execution units on the GPU, VDCores defines two disjoint families:

  • $VCC = \{vcc_1, \dots, vcc_n\}$: virtual compute cores.
  • $VMC = \{vmc_1, \dots, vmc_m\}$: virtual memory cores.

The full pool $C = VCC \cup VMC$ orchestrates software-level execution across physical GPU resources, with each virtual core associated to one streaming multiprocessor (SM).

Workloads are decomposed into a set $U$ of micro-operations (μ-ops), partitioned into compute ($U_{compute}$), memory ($U_{mem}$), and control ($U_{ctrl}$) μ-ops. Dependencies among μ-ops are encoded as a directed acyclic graph $G = (U, E)$, where $(u \rightarrow v) \in E$ indicates that $v$ awaits completion of $u$. Each μ-op $u \in U$ carries:

  • A target virtual core in $C$.
  • A dependency set of μ-ops in $U$.
  • A flow identifier for in-order execution.

The runtime manages ready μ-ops using per-core queues and issues any μ-op with satisfied dependencies. To mitigate head-of-line blocking, the per-core queues are flow-partitioned and flows may be bypassed if stalled.
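The issue rule described above can be illustrated with a small host-side sketch. It is not the VDCores runtime: the `MicroOp` fields, the `CoreScheduler` class, and the choice to scan flows in ascending id order are all assumptions made for illustration. The sketch keeps one in-order queue per flow and skips any flow whose head μ-op still has unmet dependencies.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <set>
#include <vector>

// Hypothetical μ-op record; field names are illustrative, not the VDCores API.
struct MicroOp {
    int id;                 // unique, non-negative identifier
    int core;               // target virtual core
    int flow;               // flow identifier; same-flow μ-ops issue in order
    std::vector<int> deps;  // ids of μ-ops that must complete first
};

// Scheduler for one virtual core, with one in-order queue per flow.
class CoreScheduler {
public:
    void enqueue(const MicroOp& op) {
        assert(op.id >= 0);  // -1 is reserved as the "nothing ready" result
        flows_[op.flow].push_back(op);
    }

    // Record that a μ-op (possibly on another core) has completed.
    void complete(int id) { done_.insert(id); }

    // Issue the head of the first flow whose head μ-op has all dependencies
    // satisfied. Stalled flows are bypassed. Returns -1 if nothing is ready.
    int issue_next() {
        for (auto& entry : flows_) {
            std::deque<MicroOp>& queue = entry.second;
            if (queue.empty() || !ready(queue.front())) continue;
            int id = queue.front().id;
            queue.pop_front();
            return id;
        }
        return -1;
    }

private:
    bool ready(const MicroOp& op) const {
        for (int d : op.deps)
            if (done_.count(d) == 0) return false;
        return true;
    }
    std::map<int, std::deque<MicroOp>> flows_;  // flow id -> in-order queue
    std::set<int> done_;                        // ids of completed μ-ops
};

// Usage: flow 0 is stalled on an external μ-op 99, so flow 1 issues first.
std::vector<int> demo_issue_order() {
    CoreScheduler s;
    s.enqueue({1, 0, 0, {99}});  // flow 0, waits on μ-op 99
    s.enqueue({2, 0, 0, {1}});   // flow 0, waits on μ-op 1
    s.enqueue({3, 0, 1, {}});    // flow 1, no dependencies
    std::vector<int> order;
    order.push_back(s.issue_next());  // flow 0 is bypassed
    order.push_back(s.issue_next());  // nothing is ready yet
    s.complete(99);
    order.push_back(s.issue_next());
    s.complete(1);
    order.push_back(s.issue_next());
    return order;
}
```

Without flow partitioning, μ-op 3 would sit behind the stalled μ-op 1 in a single queue, which is the head-of-line blocking the text refers to.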

This abstraction separates hardware orchestration from user code, allowing the runtime to automatically interleave memory and compute operations based on dynamic resource readiness (He et al., 4 May 2026).

2. Programming Interface and μ-Op API

VDCores presents a C++/CUDA-like API for users to define μ-ops and their handlers. A μ-op handler, such as a tiled matrix-vector multiplication, can be expressed in a concise form.

Key runtime services exposed to μ-op handlers are:

  • alloc_registers(): register allocation.
  • push(), pop_wait(): FIFO-based communication between VMC and VCC.

Beyond these services, μ-ops are pure user code.
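To make the division of labor concrete, here is a host-side mock of a tiled matrix-vector multiplication split into a memory handler and a compute handler. Only the service names `alloc_registers()`, `push()`, and `pop_wait()` come from the text; their signatures, the `Tile` and `MockRuntime` types, and the handler names are assumptions, and the real VDCores handlers run on the GPU rather than on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical tile message passed from the memory core to the compute core.
struct Tile {
    std::size_t first_row;    // index of the first matrix row in this tile
    std::vector<float> data;  // row-major tile contents
};

// Host-side mock of the runtime services; signatures are assumed.
struct MockRuntime {
    std::deque<Tile> m2c;  // VMC -> VCC FIFO

    // Mock register allocation: returns zero-initialised accumulators.
    std::vector<float> alloc_registers(std::size_t n) {
        return std::vector<float>(n, 0.0f);
    }
    void push(Tile t) { m2c.push_back(std::move(t)); }
    // A real pop_wait() would block until a tile arrives; the mock asserts.
    Tile pop_wait() {
        assert(!m2c.empty());
        Tile t = std::move(m2c.front());
        m2c.pop_front();
        return t;
    }
};

// Memory μ-op handler (VMC side): stream matrix A to the compute core in row tiles.
void load_tiles(MockRuntime& rt, const std::vector<float>& A,
                std::size_t rows, std::size_t cols, std::size_t tile_rows) {
    for (std::size_t r = 0; r < rows; r += tile_rows) {
        std::size_t n = std::min(tile_rows, rows - r);
        rt.push(Tile{r, std::vector<float>(A.begin() + r * cols,
                                           A.begin() + (r + n) * cols)});
    }
}

// Compute μ-op handler (VCC side): consume tiles and accumulate y = A * x.
std::vector<float> gemv_tiles(MockRuntime& rt, const std::vector<float>& x,
                              std::size_t rows, std::size_t num_tiles) {
    std::vector<float> y = rt.alloc_registers(rows);
    const std::size_t cols = x.size();
    for (std::size_t t = 0; t < num_tiles; ++t) {
        Tile tile = rt.pop_wait();
        std::size_t n = tile.data.size() / cols;  // rows held by this tile
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                y[tile.first_row + i] += tile.data[i * cols + j] * x[j];
    }
    return y;
}

// Usage: a 3x2 matrix split into tiles of at most 2 rows.
std::vector<float> demo_gemv() {
    MockRuntime rt;
    std::vector<float> A = {1, 2, 3, 4, 5, 6};
    std::vector<float> x = {1, 1};
    load_tiles(rt, A, 3, 2, 2);
    return gemv_tiles(rt, x, 3, rt.m2c.size());
}
```

In this mock the memory handler finishes before the compute handler starts. In the framework described here the two would run concurrently on separate executors, with the FIFO providing the synchronization.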

VDCores ships with approximately 30 built-in μ-ops (covering loads, stores, TMA ops, fused GEMM tiles, control loops, barriers) and supports user-defined extensions, typically requiring under 50 lines of code per μ-op. This modular structure obviates extensive kernel refactoring or complex fused kernel bodies (He et al., 4 May 2026).

3. Runtime Architecture and Scheduling

At startup, VDCores launches a persistent kernel per GPU SM, each instantiating:

  • One VMC executor (for memory μ-ops).
  • Two VCC executors (for compute μ-ops).

Within an executor, operations are divided between a control-flow unit (CFU, responsible for μ-op decoding, register management, address computation) and one or more execution units (EUs, executing μ-ops). The CFU enqueues μ-ops into EUs via local FIFOs.

Inter-executor dependencies are coordinated via:

  • VMC→VCC ("m2c") message FIFOs for memory-to-compute tile deliveries.
  • VCC→VMC ("c2m") FIFOs returning used tiles.
  • VMC→VMC FIFOs for relay of global-memory regions.
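The m2c and c2m FIFOs together form a credit loop: a tile buffer is either free (queued in c2m) or filled (queued in m2c), so the memory executor can never run further ahead than the number of buffers. The single-threaded simulation below sketches that idea under stated assumptions; the function name, the `LoopStats` type, and the step-based timing model are hypothetical and are not taken from VDCores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>

// Result of the simulated tile loop (illustrative type).
struct LoopStats {
    int tiles_consumed;          // tiles processed by the compute executor
    std::size_t peak_in_flight;  // most tiles queued in m2c at any time
};

// Simulate a VMC that may fill up to `vmc_steps_per_vcc_step` tiles for
// every tile the VCC consumes, using `num_slots` shared tile buffers.
LoopStats run_tile_loop(int total_tiles, std::size_t num_slots,
                        int vmc_steps_per_vcc_step) {
    assert(num_slots >= 1 && vmc_steps_per_vcc_step >= 1);
    std::queue<std::size_t> m2c;  // filled slots: VMC -> VCC
    std::queue<std::size_t> c2m;  // free slots:   VCC -> VMC
    for (std::size_t s = 0; s < num_slots; ++s) c2m.push(s);  // all start free

    LoopStats stats{0, 0};
    int produced = 0;
    while (stats.tiles_consumed < total_tiles) {
        // VMC executor: fill a free slot whenever one has been returned.
        for (int k = 0; k < vmc_steps_per_vcc_step; ++k) {
            if (produced < total_tiles && !c2m.empty()) {
                m2c.push(c2m.front());
                c2m.pop();
                ++produced;
            }
        }
        stats.peak_in_flight = std::max(stats.peak_in_flight, m2c.size());
        // VCC executor: consume one tile and hand its slot back.
        if (!m2c.empty()) {
            c2m.push(m2c.front());
            m2c.pop();
            ++stats.tiles_consumed;
        }
    }
    return stats;
}
```

For example, `run_tile_loop(10, 3, 4)` lets the memory side attempt four fills per compute step, yet the number of tiles in flight stays capped at the three available slots.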

The CFU must decode and dispatch a μ-op approximately every 90 GPU cycles to saturate the memory bandwidth of high-end GPUs (e.g., the H100's ~3.3 TB/s). Techniques including software pipelining, SIMT-based bit-mask allocations, and multiplexing multiple flows per SM enable VDCores to achieve >94% of peak memory throughput and >82% of peak FLOPS (on isolated kernels), while runtime overhead remains around 3.1% of cycles, as measured by NVIDIA Nsight Compute (NCU) (He et al., 4 May 2026).

4. Model Comparison: VDCores vs. Monolithic Kernels

Traditional GPU programming paradigms deploy monolithic "megakernels" per operator. Achieving compute-memory overlap requires manually embedding asynchronous copy and compute primitives, warp-specialization patterns, and complex software pipelines. This approach entails:

  • Substantial code size and tuning cost (hundreds of thousands of lines of CUDA and metaprogramming).
  • Runtime underutilization of TMA and tensor cores due to orchestration rigidity and pipeline bubbles (He et al., 4 May 2026).

VDCores instead:

  • Exposes each asynchronous resource as an independent virtual core.
  • Represents producer-consumer relations as fine-grained μ-op DAGs.
  • Defers scheduling to runtime, opportunistically filling execution bubbles and achieving dynamic fusion (on-the-fly combining of memory μ-ops without additional kernel code).

The result is fully automatic compute-memory overlap, dynamic elimination of pipeline bubbles, and no large monolithic kernel bodies: μ-op fusion occurs at runtime with zero extra code (He et al., 4 May 2026).

5. Empirical Evaluation and Efficiency Gains

Extensive benchmarking on LLM inference workloads (Qwen1.7B, Qwen8B, Llama1B, Llama8B, 64-step decoding) across three modern NVIDIA GPUs (H100, GH200, RTX 6000 Pro) demonstrates:

  • 1.31× geometric-mean throughput gain versus expert-tuned monolithic and megakernel baselines.
  • Up to 1.68× speedup for select batch sizes, and up to 6.18× in scenarios with highly uneven sequence lengths.
  • Dynamic LoRA serving achieves up to 3.47× lower makespan than state-of-the-art S-LoRA staging.

Resource utilization metrics show average GPU-SM usage increasing from ~70% (monolithic) to ~94% (VDCores), with near-peak memory bandwidth utilization throughout most of execution.

Programming effort for complete model pipelines is sharply reduced: implementing Llama1B/8B end-to-end requires only 6 reusable μ-ops and approximately 741 lines of GPU code, with no fused kernel variants. Megakernel approaches, by contrast, require 8–14 distinct tasks, 3–8 fused variants, and up to 6,000 lines of code. This corresponds to approximately 90% less GPU code and complete avoidance of per-model fusion passes (He et al., 4 May 2026).

6. Limitations and Prospective Developments

The current VDCores dependency model imposes several restrictions:

  • Each memory μ-op can express at most one inter-memory-μ-op edge and an optional local compute dependency; more general DAGs with arbitrary fan-in/fan-out are not supported directly.
  • The framework targets operations at coarse-tile granularity (≥4 KB); fine-grained sub-kilobyte streaming μ-ops incur substantial overhead.
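The first restriction can be read as a constraint on fan-in: a memory μ-op may depend on at most one other memory μ-op and at most one local compute μ-op. The sketch below encodes that reading and checks whether a general dependency list fits it. This is one interpretation of the text, and every name in it (`MemOpDeps`, `Dep`, `restrict_deps`) is hypothetical rather than part of VDCores.

```cpp
#include <cassert>
#include <vector>

// Hypothetical encoding of a memory μ-op's dependencies in the restricted model.
struct MemOpDeps {
    int mem_dep = -1;      // id of the single memory μ-op dependency, or -1
    int compute_dep = -1;  // id of the local compute dependency, or -1
};

enum class Kind { Memory, Compute };

// One edge of a general dependency DAG, pointing at the producer μ-op.
struct Dep {
    int id;     // non-negative id of the producer μ-op
    Kind kind;  // whether the producer is a memory or compute μ-op
};

// Try to express a general dependency list in the restricted form.
// Returns false when the fan-in exceeds one edge of either kind.
bool restrict_deps(const std::vector<Dep>& deps, MemOpDeps& out) {
    out = MemOpDeps{};
    for (const Dep& d : deps) {
        assert(d.id >= 0);  // -1 is reserved for "no dependency"
        int& slot = (d.kind == Kind::Memory) ? out.mem_dep : out.compute_dep;
        if (slot != -1) return false;  // second edge of the same kind
        slot = d.id;
    }
    return true;
}
```

A μ-op that waits on two earlier loads would fail this check. Under this reading, such a pattern would have to be rewritten, for instance as a chain in which each load depends on the previous one.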

Proposed directions for extending VDCores include:

  • Addition of virtual-core types to support emerging asynchronous GPU engines (e.g., inter-SM TMA, novel ML cores in next-generation architectures).
  • Deep compiler integration to emit μ-ops and dependencies as bytecode from tensor-IR without fusing kernel logic.
  • Generalization to multi-GPU and heterogeneous fabrics by encoding PCIe/TMA/NVLink transfers as memory μ-ops handled by additional VMCs (He et al., 4 May 2026).