Unified Memory Architecture (UMA) Overview

Updated 30 December 2025
  • Unified Memory Architecture (UMA) is a design where CPUs and GPUs share a single coherent memory space, eliminating the need for separate host and accelerator memory.
  • UMA enables transparent data access with enforced cache coherence across heterogeneous cores, improving effective bandwidth and reducing data-movement latency in HPC and AI workloads.
  • UMA supports dynamic page migration and unified allocations via advanced programming models, simplifying application porting while optimizing memory performance.

A Unified Memory Architecture (UMA) is a hardware and system design paradigm in which multiple processor types—most notably CPUs and GPUs—share access to a single, coherent physical (and virtual) memory space. UMA eliminates the distinction between host and accelerator memory, enabling both explicit and implicit accesses to shared data at high bandwidth and low latency, and supporting cache coherence across heterogeneous cores and memory types. In recent system deployments such as NVIDIA’s Grace-Hopper (GH200) and AMD’s MI300A APUs, UMA underpins a transformative shift in high-performance computing (HPC), artificial intelligence, and deep learning workloads, fundamentally altering both hardware composition and software workflows (Li et al., 19 Apr 2024, Schieffer et al., 10 Jul 2024, Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024, Zhang et al., 2023).

1. Principles and Hardware Implementation

UMA hardware consolidates physical memory (e.g., DRAM, HBM, SSD/Flash) under a unified address space, enforcing cache coherence and providing direct loads, stores, and atomics for all participating processors. In GH200, the 72-core Arm Grace CPU and the H100 GPU share a 48-bit virtual address space, connected by the NVLink-C2C interconnect (~900 GB/s), which supports MESI-like coherence over LPDDR5 and HBM3e. The Arm SMMUv3 manages unified page tables for both CPU and GPU, enabling cacheline-granularity loads and stores to remote memory and transparent on-demand migration via hardware access counters and an OS-managed "first-touch" policy (Schieffer et al., 10 Jul 2024, Li et al., 19 Apr 2024).
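
A minimal CUDA C++ sketch of this transparent-access path, assuming a GH200-class system where ATS/SMMUv3 lets the GPU dereference system-allocated pointers as described above; the kernel, buffer size, and scaling factor are illustrative and not taken from the cited papers.

```cpp
// Sketch: a plain malloc() buffer is touched by the CPU and then dereferenced
// directly by a GPU kernel over the shared 48-bit virtual address space; no
// cudaMalloc or cudaMemcpy staging is involved. Illustrative only.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double *x, size_t n, double a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                               // direct store into system-allocated memory
}

int main() {
    const size_t n = 1 << 24;
    double *x = (double *)malloc(n * sizeof(double));   // system allocator, no CUDA allocation call
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;          // CPU first touch places pages in LPDDR5

    unsigned blocks = (unsigned)((n + 255) / 256);
    scale<<<blocks, 256>>>(x, n, 2.0);                  // GPU uses the very same pointer
    cudaDeviceSynchronize();                            // coherence: CPU now sees the GPU stores

    printf("x[0] = %f\n", x[0]);                        // expect 2.0
    free(x);
    return 0;
}
```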

AMD’s MI300A integrates CPU and GPU chiplets with eight HBM3 stacks (128 GiB, 5.3 TB/s peak), using Infinity Fabric as both the data and coherence backbone. The system employs Infinity Cache (256 MiB, up to 17.2 TB/s), with CPU and GPU page tables synchronized under Linux HMM, supporting unified page-level allocation (via hipMalloc, hipHostMalloc, or malloc + hipHostRegister) as well as automatic physical page migration and fragmentation handling (Wahlgren et al., 18 Aug 2025).
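
A companion HIP sketch of the three allocation paths named above (hipMalloc, plain malloc, and malloc + hipHostRegister), assuming an MI300A-class APU with XNACK/HMM enabled so that all three pointers are valid on both CPU and GPU; the kernel and sizes are illustrative.

```cpp
// Sketch: the three MI300A allocation paths discussed above all yield pointers
// that CPU and GPU can dereference, since a single coherent HBM3 pool backs
// the APU. Illustrative only; build with hipcc, run with XNACK/HMM enabled.
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

__global__ void add_one(double *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0;
}

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(double);
    unsigned blocks = (unsigned)((n + 255) / 256);

    double *a = nullptr;
    hipMalloc(&a, bytes);                        // path 1: hipMalloc (device heap, CPU-visible on the APU)

    double *b = (double *)malloc(bytes);         // path 2: plain malloc, mapped on demand via HMM

    double *c = (double *)malloc(bytes);         // path 3: malloc + hipHostRegister (pre-registered pages)
    hipHostRegister(c, bytes, hipHostRegisterDefault);

    double *ptrs[] = {a, b, c};
    for (double *p : ptrs) {
        for (size_t i = 0; i < n; ++i) p[i] = 0.0;       // CPU writes through the shared pointer
        add_one<<<blocks, 256>>>(p, n);                  // GPU reads/writes the same pointer
        hipDeviceSynchronize();
        printf("p[0] = %f\n", p[0]);                     // expect 1.0
    }

    hipHostUnregister(c);
    free(c);
    free(b);
    hipFree(a);
    return 0;
}
```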

G10 further extends UMA concepts to the integration of GPU memory (HBM), host DRAM, and PCIe-attached SSD, all mapped in a unified virtual address space. Page table entries encode device location, and smart compiler/runtime cooperation supports preemptive tensor migration across device boundaries (Zhang et al., 2023).

2. Memory Hierarchy, Coherence, and Bandwidth

UMA designs organize memory hierarchies to maximize effective bandwidth and minimize latency. Key structural elements include:

| Processor  | Memory Type | Bandwidth    | Cache Coherence Mechanism      |
|------------|-------------|--------------|--------------------------------|
| GH200 CPU  | LPDDR5      | 480–500 GB/s | Arm AMBA CHI, MESI-like        |
| GH200 GPU  | HBM3e       | 3.4–3.7 TB/s | L2 shared, MESI across domains |
| MI300A CPU | HBM3        | 208 GB/s     | Infinity Fabric, MESI          |
| MI300A GPU | HBM3        | 3.5–3.6 TB/s | Infinity Cache, page fragments |

Remote accesses (e.g., CPU to GPU memory) traverse cache-coherent fabrics (NVLink C2C, Infinity Fabric) with latencies in the sub-microsecond to several microseconds range, depending on cache/TLB state, page size (e.g., 64 KB pages reduce fault overhead), and access patterns. Large hardware caches (Infinity Cache, per-core/SM L2) absorb most traffic, with hit rates up to 95% reducing off-chip bandwidth pressure by 5×. Bandwidth is maximized when allocations are physically contiguous and evenly striped across memory channels (Li et al., 19 Apr 2024, Schieffer et al., 10 Jul 2024, Wahlgren et al., 18 Aug 2025).

In systems integrating SSD, bandwidth constraints are managed by device selection (GPU, host DRAM, flash) per page; compiler-driven migration schedules exploit tensor inactivity intervals and predict safe prefetch times based on bandwidth availability (Zhang et al., 2023).

3. Memory Management and Page Migration Policies

UMA software stacks implement page allocation, migration, and mapping using hardware page tables, OS kernel services, and runtime drivers with fine-grained page fragmentation and migration control. Core allocation strategies:

  • First-Touch Policy: Physical page allocation is deferred until the first access by the CPU or GPU, which then determines placement in system or device memory. For system malloc(), page tables are populated at OS page sizes (4 KB/64 KB); for cudaMallocManaged, 2 MB pages are migrated on demand upon the first remote touch (see the sketch after this list) (Schieffer et al., 10 Jul 2024).
  • Hardware Migration Counters: Hot pages (default threshold: 256 GPU accesses) are auto-migrated to GPU memory. Eviction is triggered by similar access counters for CPU workloads. Migration occurs in 64 KB granularity, minimizing overhead in reuse-dominated patterns (Schieffer et al., 10 Jul 2024).
  • Explicit vs. Unified Allocations: Legacy explicit copy models (cudaMemcpy) are replaced by direct pointer passing and unified allocations, removing software DMA overhead. On MI300A, hipMalloc-backed memory achieves lower TLB misses and higher bandwidth, while stack/static allocations require heap conversion for full UMA performance (Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024).
  • Oversubscription Handling: In managed memory, shortage triggers evictions and migrations that can thrash if working set lacks reuse; in system memory, remote cacheline loads incur no migration cost, and page faults occur once at first touch (Schieffer et al., 10 Jul 2024).
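
The following CUDA C++ sketch contrasts the first two policies in the list above on a GH200-class system: a system allocation whose placement is fixed by first touch, and a managed allocation whose 2 MB pages migrate on demand. The placement comments reflect the behavior described above; the kernel, sizes, and values are illustrative.

```cpp
// Sketch: (1) system malloc with GPU first touch -> pages placed in HBM, later
// CPU reads typically served as remote coherent cacheline loads;
// (2) cudaMallocManaged with CPU first touch -> pages start in LPDDR5 and
// migrate to HBM in 2 MB granules at the kernel's first remote touch.
// Illustrative only; actual behavior depends on driver and counter settings.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void fill(double *x, size_t n, double v) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

int main() {
    const size_t n = 1 << 22, bytes = n * sizeof(double);
    unsigned blocks = (unsigned)((n + 255) / 256);

    // (1) System allocation: the GPU touches the pages first.
    double *sys = (double *)malloc(bytes);
    fill<<<blocks, 256>>>(sys, n, 1.0);          // first touch on the GPU -> HBM placement
    cudaDeviceSynchronize();
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += sys[i];  // typically remote cacheline loads, no migration
    printf("sum = %.0f\n", s);                   // expect n

    // (2) Managed allocation: the CPU touches the pages first.
    double *man = nullptr;
    cudaMallocManaged(&man, bytes);
    for (size_t i = 0; i < n; ++i) man[i] = 1.0; // first touch on the CPU -> LPDDR5
    fill<<<blocks, 256>>>(man, n, 2.0);          // first remote touch -> on-demand 2 MB migration
    cudaDeviceSynchronize();

    cudaFree(man);
    free(sys);
    return 0;
}
```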

Compiler-driven tensor liveness and interval analysis, as in G10, permits optimal pre-eviction, bandwidth scheduling, and prefetch to satisfy compute-data overlap constraints (Zhang et al., 2023).
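
As a conceptual illustration of this interval-driven planning, the following C++ sketch decides, for each tensor's inactive interval, whether offloading pays off and when the prefetch must start so that it completes before the next use. The data structures and the single-link bandwidth model are simplifying assumptions and do not reproduce G10's actual compiler analysis.

```cpp
// Conceptual sketch: plan eviction/prefetch for tensors with long inactive
// intervals, under a simple single-link bandwidth model. Not G10's algorithm.
#include <vector>

struct TensorInterval {
    double size_gb;         // tensor footprint in GB
    double inactive_start;  // time (s) of the last use before the idle gap
    double next_use;        // time (s) of the next kernel that needs the tensor
};

struct MigrationPlan {
    bool   offload;         // is eviction worthwhile at all?
    double evict_at;        // start eviction when the tensor goes idle
    double prefetch_at;     // start prefetch so it completes just before next use
};

std::vector<MigrationPlan> plan_migrations(const std::vector<TensorInterval> &tensors,
                                           double link_gb_per_s) {
    std::vector<MigrationPlan> plans;
    for (const auto &t : tensors) {
        double xfer = t.size_gb / link_gb_per_s;     // one-way transfer time
        double gap  = t.next_use - t.inactive_start; // length of the idle interval
        bool offload = gap > 2.0 * xfer;             // both transfers must hide inside the gap
        plans.push_back({offload, t.inactive_start, t.next_use - xfer});
    }
    return plans;
}
```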

4. Programming Models and Software Portability

UMA reduces code modification requirements, simplifies accelerator offload, and enables incremental porting of large-scale applications. Programming models are characterized by:

  • Transparent Pointer Sharing: Applications allocate memory via malloc/new or standard language allocators; CPU and GPU dereference identical pointers into shared physical memory (Tandon et al., 1 May 2024, Wahlgren et al., 18 Aug 2025).
  • OpenMP 5.2 Unified Shared Memory: The directives #pragma omp requires unified_shared_memory and #pragma omp target teams distribute parallel for allow host and device executors to operate on the same buffer, without explicit map, enter/exit data, or synchronization clauses (see the example after this list) (Tandon et al., 1 May 2024).
  • Dynamic Binary Instrumentation: Automatic BLAS offloading tools intercept Level-3 BLAS symbol entries and insert trampolines that redirect calls to cuBLAS or the original CPU routines based on runtime analysis, without recompilation or code changes; LD_PRELOAD applies the interposition at load time (a conceptual interposer is sketched after this list) (Li et al., 19 Apr 2024).
  • Legacy Application Enablement: Large C++/Fortran codes (OpenFOAM, PARSEC, MuST) port with minimal changes; STL containers may require custom allocators, and adapting static-managed variables is necessary for peak UMA performance (Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024).
  • Multi-Device and Storage Integration: G10 demonstrates compiler-level extraction of deep learning tensor lifetimes, enabling cross-device page migration (GPU↔DRAM↔SSD) with neither application intervention nor performance loss (Zhang et al., 2023).
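
The first sketch below shows the unified-shared-memory pattern from the OpenMP item above: a heap buffer allocated with new is updated inside a target region with no map or enter/exit data clauses. It assumes an OpenMP 5.x offload compiler on a UMA system; the size is illustrative.

```cpp
// Sketch: OpenMP unified shared memory -- host and device operate on the same
// heap buffer, with no explicit mapping clauses. Illustrative only.
#include <cstdio>

#pragma omp requires unified_shared_memory

int main() {
    const int n = 1 << 20;
    double *x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;             // host initialization

    #pragma omp target teams distribute parallel for    // same pointer, no map()/enter/exit data
    for (int i = 0; i < n; ++i) x[i] *= 2.0;

    printf("x[0] = %f\n", x[0]);                        // expect 2.0
    delete[] x;
    return 0;
}
```

The second sketch illustrates the interception mechanism from the dynamic-binary-instrumentation item: a shared library interposes the Fortran dgemm_ symbol via LD_PRELOAD and trampolines large calls to cuBLAS, passing the caller's host pointers straight through as UMA permits. The size threshold, build line, and library name are assumptions for illustration, not the cited tool's actual policy.

```cpp
// Sketch: LD_PRELOAD interposer that redirects large dgemm_ calls to cuBLAS
// and forwards small ones to the original CPU BLAS. Illustrative only.
#include <dlfcn.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

typedef void (*dgemm_fn)(const char*, const char*, const int*, const int*, const int*,
                         const double*, const double*, const int*, const double*, const int*,
                         const double*, double*, const int*);

extern "C" void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
                       const int *k, const double *alpha, const double *A, const int *lda,
                       const double *B, const int *ldb, const double *beta,
                       double *C, const int *ldc) {
    static dgemm_fn cpu_dgemm = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");  // original CPU symbol
    static cublasHandle_t handle = nullptr;
    if (!handle) cublasCreate(&handle);

    // Illustrative heuristic: keep small problems on the CPU.
    if ((long long)*m * *n * *k < (1LL << 24)) {
        cpu_dgemm(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }
    cublasOperation_t opA = (*ta == 'N' || *ta == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    cublasOperation_t opB = (*tb == 'N' || *tb == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    // Under UMA the caller's host pointers are GPU-accessible, so they are
    // handed to cuBLAS without any explicit staging copies.
    cublasDgemm(handle, opA, opB, *m, *n, *k, alpha, A, *lda, B, *ldb, beta, C, *ldc);
    cudaDeviceSynchronize();   // make the result in C visible before returning to the caller
}
// Build (illustrative): nvcc -Xcompiler -fPIC -shared interpose.cu -o libblasoffload.so -lcublas -ldl
// Use:   LD_PRELOAD=./libblasoffload.so ./legacy_app
```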

5. Quantitative Performance Characterization

Empirical studies substantiate UMA’s impact on memory efficiency and performance.

  • Bandwidth and Latency: NVLink C2C sustains roughly 900 GB/s between Grace and Hopper, device HBM delivers 3.4–3.7 TB/s, and remote coherent accesses complete within sub-microsecond to few-microsecond latencies depending on cache/TLB state and page size (Schieffer et al., 10 Jul 2024, Li et al., 19 Apr 2024).
  • Benchmarks and Applications:
    • DGEMM (GH200, strategy 2): 2.0 TFLOP/s realized without explicit copies, vs. 85 GFLOP/s on CPU (Li et al., 19 Apr 2024).
    • PARSEC Si1947H604: 3.3× speedup using first-touch migration; migration overhead ~10 s amortized over ~445 matrix uses (Li et al., 19 Apr 2024).
    • MuST: 2.0× speedup for zgemm; automatic offload approaches native GPU performance (Li et al., 19 Apr 2024).
    • OpenFOAM CFD: a single MI300A timestep completes in roughly ¼ of the H100 and ⅕ of the MI210 wall-clock time; eliminating data transfers and page migration saves more than 65% of runtime (Tandon et al., 1 May 2024).
  • Memory Savings:
    • Four of six Rodinia apps saved 10–44% of resident memory via UM allocation, avoiding buffer duplication (Wahlgren et al., 18 Aug 2025).
    • G10: overall throughput reaches 90.3% of ideal unlimited DRAM; up to 1.75× speedup over FlashNeuron, with only 1–6% of kernels slowed vs. ideal (Zhang et al., 2023).

6. Limitations, Optimization Strategies, and Future Directions

UMA systems exhibit architectural and systems-level trade-offs that shape optimization and future research:

  • Page Placement and Migration Tuning: Control of placement (first-touch, hot page thresholds) is application-dependent. Automated profiling and page-placement hints remain open areas (Li et al., 19 Apr 2024, Schieffer et al., 10 Jul 2024).
  • Bandwidth Granularity and Fragmentation: Large page sizes (64 KB, 2 MB) reduce fault overhead but may over-migrate unused data; finer granularity favors small working sets. Allocator choice directly influences TLB reach and cache stripe alignment (Wahlgren et al., 18 Aug 2025, Schieffer et al., 10 Jul 2024).
  • Coherence and Contention Costs: Atomic operations across CPU/GPU can induce contention; isolated workloads scale well, but mixed access needs double buffering or fine-grain synchronization for peak performance (Wahlgren et al., 18 Aug 2025).
  • Extensions and Tool Coverage: Existing tools often target Level-3 BLAS; coverage of Level-1/2 routines and legacy libraries (LAPACK, ScaLAPACK, MPI) is in progress. Storage-level UMA requires expanded hardware interfaces and compiler/runtime integration (Li et al., 19 Apr 2024, Zhang et al., 2023).
  • Porting Guidelines: Pinning data until first device use, overlapping computation with migration, and explicitly prefetching GPU-initialized buffers to the device improve performance (see the sketch after this list). STL and system allocators should be adapted to favor contiguous allocations for high cache hit rates (Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024).
  • Memory Oversubscription: System memory with page-fault-only-once semantics is preferred under oversubscription; managed memory eviction can thrash without tensor reuse (Schieffer et al., 10 Jul 2024, Zhang et al., 2023).
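
A short CUDA C++ sketch of the prefetch and placement-hint guidance referenced in the porting-guidelines item above. The hint choices and sizes are illustrative; the calls shown (cudaMemAdvise, cudaMemPrefetchAsync) are the standard CUDA managed-memory hints rather than anything specific to the cited papers.

```cpp
// Sketch: prefetch a GPU-initialized managed buffer to the device before its
// first kernel touch, and use placement/access hints so that later host reads
// are served remotely instead of bouncing pages. Illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void init(float *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0f;
}

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);
    int dev = 0;
    cudaGetDevice(&dev);

    float *x = nullptr;
    cudaMallocManaged(&x, bytes);

    // Placement hints: keep pages resident on the GPU, but map them for the
    // CPU so host reads do not trigger migrations back and forth.
    cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, dev);
    cudaMemAdvise(x, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    // Device prefetch before the first GPU touch avoids demand page faults.
    cudaMemPrefetchAsync(x, bytes, dev, 0);
    init<<<(unsigned)((n + 255) / 256), 256>>>(x, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);   // host read of a GPU-resident page
    cudaFree(x);
    return 0;
}
```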

7. Impact and Significance in High-Performance Computing

UMA architectures fundamentally simplify programming models, accelerate porting, and optimize memory usage and throughput in diverse scientific, AI, and simulation workloads. They eliminate explicit data transfers and page migration overhead, preserve cache and TLB locality, and deliver multi-teraflop performance with minimal code adaptation.

In HPC, UMA enables legacy CPU-centric codes to leverage GPUs seamlessly, drives 2–5× speedups in real applications, and supports scalable shared memory for concurrent computation, data analytics, and deep learning. Compiler-/runtime synergy in tensor migration (G10) extends UMA to storage-level capacity, achieving near-ideal throughput and transparent scaling (Li et al., 19 Apr 2024, Schieffer et al., 10 Jul 2024, Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024, Zhang et al., 2023).

A plausible implication is continued architectural innovation in coherent cache fabrics, migration policy automation, and storage integration, with the long-term direction pointing toward fully transparent, unified, and high-throughput memory systems across heterogeneous hardware.
