
CUDA Unified Memory Overview

Updated 14 June 2026
  • CUDA Unified Memory is a virtual memory abstraction that unifies host and device memories, enabling direct pointer access and transparent page migration.
  • It employs unified page tables, cache-coherent NVLink interconnects, and NUMA-aware policies to optimize performance for high-performance computing workloads.
  • Modern implementations on platforms like the Grace Hopper Superchip demonstrate significant speedups by reducing explicit memory copies through dynamic migration and prefetching techniques.

CUDA Unified Memory (UM) defines a virtual memory abstraction that unifies host (CPU) and device (GPU) memories into a coherent address space, delegating placement and migration to hardware and system software. UM eliminates the need for explicit memory copying in heterogeneous programming, enabling both CPUs and GPUs to directly dereference pointers to shared data. Modern implementations—culminating in the Grace Hopper Superchip platform—extend this model with architectural innovations in MMU design, cache coherence, and NUMA-aware page-migration schemes, providing new trade-offs in programmability and performance for scientific and high-performance computing workloads.

1. Architectural Foundations and Evolution

Early CUDA GPUs required explicit cudaMemcpy calls between host and device memory spaces. Unified Memory, introduced in CUDA 6 and generalized with hardware page faulting on Pascal-class and later GPUs, established a single virtual address space shared by CPU and GPU; its driver component is commonly referred to as Unified Virtual Memory (UVM). On such platforms, memory allocated with cudaMallocManaged is visible to and accessible by both CPU and GPU, with transparent page-fault-driven migration (Garg et al., 2018, Gu et al., 2020).

On discrete pre-Grace Hopper GPUs, this abstraction is implemented through distinct CPU (system) and GPU page tables. The CUDA driver orchestrates on-demand migration and page table updates, typically at 4 KB page granularity. Caches and TLBs on each processor are made coherent through protocol-driven shootdowns, and device or host page faults trigger software logic to remap and migrate the physical memory. Oversubscription is supported by evicting cold pages according to LRU-style heuristics (Gu et al., 2020, Garg et al., 2018).
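The fault-and-evict behaviour described above can be illustrated with a toy model. This is a minimal sketch only: the class name `DeviceMemoryModel`, the helper `count_faults`, and the strict LRU policy are assumptions made for illustration, and the real driver uses its own LRU-style heuristics that are not reproduced here.

```python
from collections import OrderedDict


class DeviceMemoryModel:
    """Toy model of fault-driven migration with strict LRU eviction.

    Hypothetical illustration; not the CUDA driver's actual policy.
    """

    def __init__(self, capacity_pages):
        self.capacity_pages = capacity_pages
        self.resident = OrderedDict()  # page number -> None, oldest first
        self.faults = 0
        self.evictions = 0

    def access(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)  # mark as most recently used
            return
        self.faults += 1  # page is not resident: fault and migrate it in
        if len(self.resident) >= self.capacity_pages:
            self.resident.popitem(last=False)  # evict the least recently used page
            self.evictions += 1
        self.resident[page] = None


def count_faults(capacity_pages, trace):
    """Replay a page-access trace and return (faults, evictions)."""
    model = DeviceMemoryModel(capacity_pages)
    for page in trace:
        model.access(page)
    return model.faults, model.evictions


# A working set of 4 pages, swept three times in the same order.
trace = [0, 1, 2, 3] * 3
fits = count_faults(4, trace)            # device memory holds the working set
oversubscribed = count_faults(3, trace)  # device memory is one page too small
print(fits, oversubscribed)
```

In this model a cyclic sweep over a working set that is only one page larger than device memory faults on every access, which is the thrashing pattern associated with oversubscription.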

Grace Hopper Superchip hardware re-architects this process by supplying a single, integrated system page table managed by the Arm SMMUv3 unit, shared between Grace CPU and H100 GPU. All allocations—whether via malloc, cudaMallocManaged, or new—resolve via this unified MMU. Any virtual address can map to either the CPU’s LPDDR5X (up to 480 GB) or GPU’s HBM3 (up to 96 GB) memory banks, under a flat virtual address space (Schieffer et al., 2024, Li et al., 2024).

2. Hardware Mechanisms: Page Tables, Coherence, and Interconnects

Integrated System Page Table

Grace Hopper eliminates the traditional GPU-only page table for cudaMalloc-managed allocations. Instead, both system-allocated and managed memory populate the same system page table. Empty PTEs (Page Table Entries) are initialized on first touch—either by CPU or GPU—triggering the OS to assign physical memory on the accessing NUMA domain (Schieffer et al., 2024).

Memory coherence and remote load/store become possible through the NVLink Chip-2-Chip (C2C) interconnect, which supports coherency at cache-line granularity. Following the AMBA CHI protocol, CPU and GPU can snoop and maintain coherence without software intervention. Measured peak bandwidth approaches 375 GB/s for host-to-device and 297 GB/s for device-to-host transfers, with aggregate GPU↔HBM3e bandwidth reported up to 3.7 TB/s (Li et al., 2024, Schieffer et al., 2024). Atomic operations across the link are enforced in hardware, so basic loads, stores, and atomics on shared data require no software-managed coherence.

3. Programming Models, Memory Policies, and Migration

Allocation Models

  • System-Allocated Memory (malloc/new): Allocations reside only in the system page table. On Grace Hopper, a first-touch policy is employed: the physical backing of a page is determined by which processor (CPU or GPU) first accesses it. Address Translation Services (ATS) accelerate TLB fills, and cache coherence removes the need for explicit page migration for most remote references (Schieffer et al., 2024, Li et al., 2024).
  • CUDA Managed Memory (cudaMallocManaged): These allocations expose a unified virtual address, but physical residency is tracked and migrated by the CUDA runtime. On pre-Grace platforms, two distinct page tables (system, GPU) are used. Page faults initiate migration and page table entry updates; migration can be guided by user hints (cudaMemAdvise*) or prefetch primitives (cudaMemPrefetchAsync). On Grace Hopper, cudaMallocManaged pages that are GPU-resident are mapped into HBM3 via the GPU page table with 2 MB granularity (Schieffer et al., 2024).

First-Touch Semantics

PTE initialization occurs lazily. When a first access triggers a page fault (either via the SMMU or GPU MMU), the OS or driver assigns a physical page in local memory, populates TLBs, and resumes execution. Explicitly "touching" pages from the CPU (prior to kernel launch) can eliminate GPU-side replay of SMMU page faults, reducing large kernel startup latencies in GPU-initialized workloads (Schieffer et al., 2024).
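First-touch placement can be sketched as a small simulation. The class `FirstTouchPageTable`, the 64 KB page size, and the "cpu"/"gpu" labels are assumptions chosen for illustration; the sketch models only where a page lands on its first access, not the fault-handling cost.

```python
PAGE_SIZE = 64 * 1024  # assumed 64 KB system pages


class FirstTouchPageTable:
    """Toy page table: a page is placed in the memory of its first accessor."""

    def __init__(self):
        self.placement = {}  # virtual page number -> "cpu" or "gpu"
        self.faults = {"cpu": 0, "gpu": 0}

    def access(self, address, processor):
        page = address // PAGE_SIZE
        if page not in self.placement:
            # Empty PTE: the first access faults and fixes the placement.
            self.faults[processor] += 1
            self.placement[page] = processor
        return self.placement[page]


table = FirstTouchPageTable()
buffer_bytes = 4 * PAGE_SIZE

# The CPU initializes the first half of the buffer ...
for addr in range(0, buffer_bytes // 2, PAGE_SIZE):
    table.access(addr, "cpu")

# ... then a GPU kernel reads the whole buffer.
gpu_view = [table.access(addr, "gpu") for addr in range(0, buffer_bytes, PAGE_SIZE)]
print(gpu_view, table.faults)
```

Pages the CPU touched first stay CPU-resident and are reached remotely by the GPU, while untouched pages fault on the GPU side; touching every page from the CPU beforehand leaves no first-touch faults for the kernel in this model.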

Page Migration and Access Patterns

On-demand page migration mechanisms ensure locality but incur a significant penalty on first access, estimated on Grace Hopper as $T_{\mathrm{mig}} \approx S_{\mathrm{page}} / B_{\mathrm{copy}} + T_{\mathrm{lat}}$, where $S_{\mathrm{page}}$ is the page size, $B_{\mathrm{copy}}$ is the measured NVLink bandwidth ($\sim 350$ GB/s), and $T_{\mathrm{lat}}$ is the page-fault handling latency (tens to hundreds of $\mu$s) (Schieffer et al., 2024). Managed memory maintains background heuristics to prefetch, duplicate, or pin pages based on previous access patterns (Chien et al., 2019, Garg et al., 2018).
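The migration-cost estimate can be evaluated numerically. The sketch below assumes the quoted bandwidth of about 350 GB/s and a fault latency of 50 µs, a value picked from within the stated range of tens to hundreds of microseconds; the function name `migration_time_us` is hypothetical.

```python
def migration_time_us(page_bytes, copy_bandwidth_gbs=350.0, fault_latency_us=50.0):
    """Estimate the per-page migration time in microseconds.

    Transfer term (page size / bandwidth) plus fault-handling latency.
    The default latency is an assumed value, not a measurement.
    """
    transfer_us = page_bytes / (copy_bandwidth_gbs * 1e9) * 1e6
    return transfer_us + fault_latency_us


for label, size in [("4 KB", 4 * 1024), ("64 KB", 64 * 1024), ("2 MB", 2 * 1024 * 1024)]:
    print(f"{label}: {migration_time_us(size):.2f} us")
```

Under these assumptions the transfer term is roughly 6 µs even for a 2 MB page, so the fault-handling latency dominates the cost of migrating a single page.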

Access policies can be tuned:

  • Eager migration: lower thresholds for hot-page detection, rapidly promoting pages to HBM3 after limited accesses.
  • Lazy migration: delay migrations when large, infrequently reused working sets would otherwise flood device memory.
  • Auto migration thresholds and explicit cudaMemPrefetchAsync can be combined for optimal page placement ahead of iterative compute or oversubscription (Schieffer et al., 2024).
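The difference between eager and lazy migration can be illustrated with a threshold-based toy policy. The function `pages_migrated` and its per-page access counter are assumptions for illustration and do not describe the actual hardware counters or driver thresholds.

```python
from collections import Counter


def pages_migrated(trace, threshold):
    """Return the set of pages promoted to device memory for an access trace.

    Hypothetical policy: a page is promoted once its remote-access count
    reaches `threshold`.
    """
    counts = Counter()
    promoted = set()
    for page in trace:
        if page in promoted:
            continue  # already local, no further remote accesses to count
        counts[page] += 1
        if counts[page] >= threshold:
            promoted.add(page)
    return promoted


# Page 0 is hot; pages 1 to 4 are touched once each.
trace = [0, 1, 0, 2, 0, 3, 0, 4, 0]
eager = pages_migrated(trace, threshold=1)  # low threshold: everything moves
lazy = pages_migrated(trace, threshold=3)   # high threshold: only the hot page moves
print(sorted(eager), sorted(lazy))
```

A low threshold promotes every touched page, including the ones used only once, whereas a higher threshold keeps the cold pages in host memory and moves only the hot page.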

4. Performance Properties and Quantitative Evaluation

Benchmark Results

Comprehensive studies on Grace Hopper using six representative HPC applications (Qiskit, BFS, Needleman-Wunsch, Pathfinder, Hotspot, and SRAD) demonstrate that system-allocated memory can be 1.2×–1.8× faster than managed memory in CPU-initialized patterns, due to lower page-fault and migration overheads. Conversely, for pure GPU-init workloads, managed memory's 2 MB GPU page table enables a 1.1×–1.3× speedup over system-allocated memory, as initial page faults are handled in bulk (Schieffer et al., 2024).

In streaming kernels, effective bandwidth within local HBM3 approaches 3.2 TB/s, while remote CPU-to-GPU transfers achieve 320 GB/s over NVLink C2C (Schieffer et al., 2024, Li et al., 2024).
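The gap between the two bandwidth figures translates directly into streaming time. The sketch below uses the 3.2 TB/s and 320 GB/s values quoted above; the 8 GB buffer size and the function name `stream_time_ms` are assumptions, and latency and overlap effects are ignored.

```python
def stream_time_ms(num_bytes, bandwidth_gbs):
    """Time in milliseconds to stream num_bytes at a sustained bandwidth in GB/s."""
    return num_bytes / (bandwidth_gbs * 1e9) * 1e3


size = 8 * 10**9  # assumed 8 GB buffer
local_ms = stream_time_ms(size, 3200)  # data resident in HBM3
remote_ms = stream_time_ms(size, 320)  # data read remotely over NVLink C2C
print(f"local: {local_ms:.1f} ms, remote: {remote_ms:.1f} ms")
```

A single pass over remote data costs about ten times as much as a pass over local data in this model, which is why migration pays off mainly when data is reused.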

Automatic BLAS offloading tools that exploit UMA (Unified Memory Architecture) on Grace Hopper demonstrate up to 3.3× speedup for BLAS-heavy scientific codes, compared to CPU-only execution, with nearly all memcpy overheads eliminated relative to PCIe-based platforms (Li et al., 2024).

Page Size and Migratory Trade-offs

The system page size can be set to either 4 KB or 64 KB at the OS level. Larger pages (64 KB) significantly reduce allocation and deallocation overhead (by up to 15×; deallocation with 4 KB pages is up to 38× slower) and accelerate GPU first-touch PTE creation, which is especially evident in the large initialization phases of scientific workloads (Schieffer et al., 2024). However, larger pages can pull excessive cold data into HBM if auto-migration is not tuned conservatively.
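Part of the page-size effect follows from simple counting: fewer, larger pages mean fewer page table entries to create and tear down. The sketch below counts the pages needed for an assumed 1 GiB allocation; `pages_needed` is a hypothetical helper, and the count alone does not reproduce the measured overhead ratios.

```python
def pages_needed(allocation_bytes, page_bytes):
    """Number of pages (and thus PTEs) needed to back an allocation."""
    return -(-allocation_bytes // page_bytes)  # ceiling division on integers


GIB = 1 << 30  # assumed allocation size of 1 GiB
for label, page_bytes in [("4 KB", 4 << 10), ("64 KB", 64 << 10), ("2 MB", 2 << 20)]:
    print(f"{label} pages: {pages_needed(GIB, page_bytes)}")
```

Moving from 4 KB to 64 KB pages cuts the number of entries by a factor of 16 for the same allocation.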

5. Practical Programming Recommendations and Usage Modes

| Mode | Porting Effort | Best For (Pattern) |
|---|---|---|
| System-Allocated (malloc/new) | Minimal (drop explicit data copies) | CPU-init, mixed-access |
| cudaMallocManaged | Low–Moderate | Pure GPU-init, large datasets |
| Explicit Memcpy | High | Legacy code, non-unified |

For data initialized on the CPU, malloc- or new-based allocations, backed by the system page table, yield the best overall performance, particularly with large page sizes and tuned migration thresholds. For GPU-initialized data, cudaMallocManaged can be advantageous because of the 2 MB GPU page table mapping, provided that the kernel touches all pages at its start or that a pre-touch pass on the CPU runs before kernel launch (Schieffer et al., 2024).

Pre-launch migration via cudaMemPrefetchAsync can eliminate runtime stall in both memory-constrained and iterative kernels, restoring effective device-side bandwidth even when pages initially reside on host memory (Chien et al., 2019, Gu et al., 2020, Schieffer et al., 2024).
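The benefit of bulk prefetching over fault-driven migration can be approximated with a deliberately pessimistic model. The sketch assumes faults are serviced strictly one page at a time, a 50 µs cost per fault, a one-off 50 µs setup cost for the prefetch, and about 350 GB/s of copy bandwidth; real drivers batch faults, so the gap shown here is an upper bound for illustration only. Both function names are hypothetical.

```python
def on_demand_time_us(buffer_bytes, page_bytes, bandwidth_gbs=350.0, fault_latency_us=50.0):
    """Fault-driven migration, assuming faults are handled serially (pessimistic)."""
    pages = -(-buffer_bytes // page_bytes)  # ceiling division
    per_page_us = page_bytes / (bandwidth_gbs * 1e9) * 1e6 + fault_latency_us
    return pages * per_page_us


def prefetch_time_us(buffer_bytes, bandwidth_gbs=350.0, setup_latency_us=50.0):
    """Bulk prefetch: one assumed setup cost, then a bandwidth-bound copy."""
    return setup_latency_us + buffer_bytes / (bandwidth_gbs * 1e9) * 1e6


buffer_bytes = 1 << 30  # assumed 1 GiB buffer
demand_us = on_demand_time_us(buffer_bytes, 64 << 10)
bulk_us = prefetch_time_us(buffer_bytes)
print(f"on demand: {demand_us / 1e3:.1f} ms, prefetch: {bulk_us / 1e3:.1f} ms")
```

In this model the per-fault latency, not the copy itself, accounts for almost all of the on-demand time, which is the stall that a pre-launch prefetch removes.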

6. Limitations, Trade-Offs, and Platform-Specific Caveats

UVM and UM simplify programming but introduce potential pitfalls. Page-fault latency, especially with irregular or random access patterns, can severely degrade performance, particularly on platforms with limited interconnect bandwidth (e.g., PCIe Gen3 vs. NVLink). Oversubscription (working set > device memory) can lead to thrashing; kernels with small or scattered accesses suffer the most, sometimes experiencing slowdowns of more than 100× compared to traditional data movement models (Gu et al., 2020, Chien et al., 2019). Adaptive migration policies and data-structure reorganization to favor streaming access patterns are recommended (Garg et al., 2018, Gu et al., 2020).
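One reason small or scattered accesses suffer is traffic amplification: a whole page moves even when the kernel uses only a few bytes of it. The sketch below computes that ratio; `traffic_amplification` is a hypothetical helper, and the 4-byte access size is an assumption. It illustrates the mechanism and does not reproduce the slowdown figures reported above.

```python
def traffic_amplification(page_bytes, useful_bytes_per_page):
    """Ratio of bytes migrated to bytes actually used per page."""
    if not 0 < useful_bytes_per_page <= page_bytes:
        raise ValueError("useful bytes must be in the range 1..page_bytes")
    return page_bytes / useful_bytes_per_page


# A kernel that reads one 4-byte value from each page it touches.
scattered_4k = traffic_amplification(4 << 10, 4)
scattered_64k = traffic_amplification(64 << 10, 4)
# A streaming kernel that consumes every byte of each migrated page.
streaming = traffic_amplification(64 << 10, 64 << 10)
print(scattered_4k, scattered_64k, streaming)
```

Reorganizing data so that each migrated page is consumed in full brings the ratio back to one, which is the rationale for favoring streaming access patterns.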

Platform-specific results confirm that memory advice (cudaMemAdvise*) and prefetching (cudaMemPrefetchAsync) offer the greatest benefit on PCIe-connected GPUs, with in-memory execution speedups of up to 50%, but are less effective or even detrimental on NVLink-equipped platforms (e.g., Power9-Volta), especially under oversubscription (Chien et al., 2019).

7. Future Directions and Research Areas

Ongoing research explores hardware-transparent prefetchers within GPU MMUs, pattern-aware eviction schemes, and cooperative page-table caching at the cluster or NUMA level. Benchmark suites such as UVMBench provide systematic evaluation of these strategies across application domains, access pattern regularity, and scaling regimes (Gu et al., 2020). Grace Hopper's approach—integrated system page tables, hardware coherence, and fine-grained remote memory access—presents a template for future tightly coupled CPU–GPU platforms, but optimal usage remains contingent on thoughtful kernel patterning, memory allocation, and explicit prefetch policy design (Schieffer et al., 2024, Li et al., 2024).

