Unified Physical Memory (UPM)
- Unified Physical Memory (UPM) merges the memories of disparate processors into a single, coherent physical address space.
- By eliminating explicit data transfers and duplicated buffers, UPM matches or exceeds the performance of explicitly managed memory while cutting memory usage by up to 44%.
- The design simplifies software development and optimizes system performance by enabling direct, concurrent memory access across heterogeneous processors.
Unified Physical Memory (UPM) refers to an architectural paradigm and system design goal in which disparate processing elements (CPU, GPU, accelerators, PIM units, etc.) operate over a single, physically integrated memory address space with full hardware-level coherence and symmetry of access. UPM enables all processors, regardless of type, to access the same physical memory regions directly, eliminating the need for explicit data transfers, staging buffers, or duplicate per-device allocations. This model unifies memory management, allocation, and access, shifting away from the long-established paradigm of device-specific memory spaces and explicit host–device coordination. UPM has recently transitioned from theoretical proposals into production systems, notably with the deployment of architectures such as AMD’s MI300A Accelerated Processing Unit (APU), which underlies flagship systems like the El Capitan supercomputer (Wahlgren et al., 18 Aug 2025).
1. Architectural Foundations of UPM
UPM is distinguished by the absence of explicit boundaries between host and device memories. In AMD MI300A, both Zen 4 CPU cores and CDNA compute units (GPU) directly share the high-bandwidth memory (HBM) subsystem via an integrated, coherent memory fabric (Tandon et al., 1 May 2024, Wahlgren et al., 18 Aug 2025). All allocations, whether via malloc, new, or device-specific routines (e.g., hipMalloc), place memory in this global space, accessible by any core type at the hardware paging level. This is in direct contrast with the Unified Virtual Memory (UVM) and Shared Virtual Memory (SVM) schemes prevalent on previous generations, which merely virtualize the address space or shadow-copy data between siloed memory pools.
A core property of UPM is that for every virtual address $p$ in the application’s address space,

$$\mathrm{PA}_{\mathrm{CPU}}(p) = \mathrm{PA}_{\mathrm{GPU}}(p),$$

i.e., all such addresses are resolved to the same physical location, regardless of the access initiator. On MI300A, key memory latencies and bandwidths reflect the unified topology: CPU L1 latency ≈1 ns, GPU L1 ≈57 ns, shared HBM latencies of 236–350 ns, and aggregate GPU bandwidth reaching 3.5–3.6 TB/s (Wahlgren et al., 18 Aug 2025).
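As an illustration of this property, the minimal sketch below (assuming a UPM system such as MI300A where system-allocated memory is directly GPU-accessible; the kernel and sizes are illustrative) passes a plain malloc pointer to a HIP kernel and reads the result back on the CPU, with no hipMemcpy anywhere:

```cpp
// Minimal UPM sketch: one malloc'd buffer, touched by both CPU and GPU.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(double* x, size_t n, double a) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;                       // GPU writes straight into the malloc'd pages
}

int main() {
  const size_t n = 1 << 20;
  double* x = static_cast<double*>(malloc(n * sizeof(double)));   // ordinary host allocation
  for (size_t i = 0; i < n; ++i) x[i] = 1.0;                      // CPU initializes the data in place

  const unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
  scale<<<blocks, 256>>>(x, n, 2.0);          // same pointer, same physical pages
  hipDeviceSynchronize();

  printf("x[0] = %f\n", x[0]);                // CPU reads the GPU's result directly
  free(x);
  return 0;
}
```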
2. System Software Support and Memory Management
System software for UPM must efficiently handle allocation, page faulting, TLB management, and coherency across heterogeneous processor types. Allocation strategies (malloc, hipMalloc, hipHostMalloc, hipMallocManaged) differ in when and how physical pages are mapped. On-demand (e.g., malloc) allocates quickly but incurs first-touch page faults, while up-front allocators (e.g., hipMalloc) eagerly allocate physical pages—trading off initial allocation time for reduced runtime page fault overhead (Wahlgren et al., 18 Aug 2025).
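As a rough sketch (assuming a HIP/ROCm environment on a UPM system; the 1 GiB size is arbitrary), the snippet below contrasts when each of these allocation paths commits physical pages:

```cpp
// Allocation-path sketch: all four pointers are usable by CPU and GPU on a UPM
// system, but they differ in when physical pages are committed.
#include <hip/hip_runtime.h>
#include <cstdlib>

int main() {
  const size_t bytes = size_t(1) << 30;

  // On-demand: returns quickly; pages are committed at first touch (CPU or GPU fault).
  void* a = malloc(bytes);

  // Up-front: physical pages are committed eagerly at allocation time.
  void* b = nullptr;
  hipMalloc(&b, bytes);

  // Pinned host allocation: also eagerly committed and GPU-accessible.
  void* c = nullptr;
  hipHostMalloc(&c, bytes, hipHostMallocDefault);

  // Managed allocation: on UPM hardware it resolves to the same shared HBM pool.
  void* d = nullptr;
  hipMallocManaged(&d, bytes, hipMemAttachGlobal);

  hipFree(b); hipHostFree(c); hipFree(d); free(a);
  return 0;
}
```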
Page fault handling is streamlined: faults can be triggered by either CPU or GPU, but latency is minimized by “pre-faulting” on the CPU so subsequent GPU accesses only incur minor fault resolution. Typical major fault latency for the GPU is 18–22 μs, but CPU-handled faults are resolved in 9–11 μs. Additionally, larger and more contiguous memory fragments (enabled by up-front allocation) yield fewer TLB misses and more efficient use of hierarchical cache (e.g., only ≈158K TLB misses with hipMalloc in the STREAM benchmark, compared to over 1M with malloc) (Wahlgren et al., 18 Aug 2025).
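The pre-faulting pattern described above can be sketched as follows (a hypothetical fragment, again assuming GPU-accessible system allocations; the memset stands in for any cheap CPU-side first touch):

```cpp
// Pre-faulting sketch: a CPU-side touch commits the pages via the faster CPU
// fault path (~9-11 us per major fault) so the GPU avoids the ~18-22 us
// major-fault cost on its first access.
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <cstring>

__global__ void init(float* x, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 1.0f;
}

int main() {
  const size_t n = size_t(1) << 26;
  float* x = static_cast<float*>(malloc(n * sizeof(float)));

  std::memset(x, 0, n * sizeof(float));       // pre-fault on the CPU

  const unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
  init<<<blocks, 256>>>(x, n);                // GPU accesses now resolve as minor faults
  hipDeviceSynchronize();

  free(x);
  return 0;
}
```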
Infinity Cache utilization is likewise maximized by ensuring physically interleaved (evenly distributed) pages; biased CPU malloc can disrupt this interleaving, reducing bandwidth and raising access latency as the working set scales toward cache capacity.
3. Application Porting Strategies and Programming Implications
Porting applications to exploit UPM involves removing explicit device-host memory management and leveraging a single allocation for each dataset. For concurrent CPU–GPU access, double-buffering is recommended: two buffers are continuously swapped between CPU- and GPU-resident computations, synchronized via barriers or handshakes, thus removing the need for device-to-host copies. Scientific applications that rely on querying “free” device or host memory must now turn to system counters (e.g., libnuma, /proc/meminfo) since the separation is gone (Wahlgren et al., 18 Aug 2025). Static and stack variables require caution: the GPU cannot directly access static host memory, necessitating managed or, preferably, dynamic (heap) allocation.
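A minimal sketch of this double-buffering pattern is shown below, assuming (per Section 1) that hipMalloc allocations are directly CPU-accessible on a UPM system; the kernel body, CPU work, step count, and buffer sizes are placeholders:

```cpp
// Double-buffering sketch: the GPU processes one buffer while the CPU works on
// the other; the buffers swap roles each step and no host<->device copy occurs.
#include <hip/hip_runtime.h>
#include <utility>

__global__ void gpu_stage(double* buf, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = buf[i] * 0.5 + 1.0;       // placeholder GPU work
}

void cpu_stage(double* buf, size_t n) {
  for (size_t i = 0; i < n; ++i) buf[i] += 2.0; // placeholder CPU work
}

int main() {
  const size_t n = 1 << 22;
  double *a = nullptr, *b = nullptr;
  hipMalloc(reinterpret_cast<void**>(&a), n * sizeof(double));  // both buffers live in the shared HBM pool
  hipMalloc(reinterpret_cast<void**>(&b), n * sizeof(double));
  hipMemset(a, 0, n * sizeof(double));
  hipMemset(b, 0, n * sizeof(double));
  hipDeviceSynchronize();

  const unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
  for (int step = 0; step < 10; ++step) {
    gpu_stage<<<blocks, 256>>>(a, n);           // GPU works on buffer a ...
    cpu_stage(b, n);                            // ... while the CPU works on buffer b
    hipDeviceSynchronize();                     // handshake before swapping roles
    std::swap(a, b);
  }

  hipFree(a); hipFree(b);
  return 0;
}
```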
When porting from explicitly managed models (e.g., using hipGetMemInfo for buffer sizing, or partial host↔device data transfers in pipelines), the UPM model eliminates those steps. If C++ standard containers such as std::vector are involved, memory allocators should be customized to use the unified memory pool (e.g., via hipMalloc) to avoid performance penalties from fragmented page faults.
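One possible shape for such an allocator is sketched below; the class name HipAllocator is hypothetical, and error handling is reduced to throwing std::bad_alloc:

```cpp
// Minimal hipMalloc-backed allocator so std::vector storage lives in the
// unified pool and is committed up front rather than via first-touch faults.
#include <hip/hip_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <class T>
struct HipAllocator {
  using value_type = T;
  HipAllocator() = default;
  template <class U> HipAllocator(const HipAllocator<U>&) {}

  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (hipMalloc(&p, n * sizeof(T)) != hipSuccess) throw std::bad_alloc();
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) noexcept { hipFree(p); }
};

template <class T, class U>
bool operator==(const HipAllocator<T>&, const HipAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const HipAllocator<T>&, const HipAllocator<U>&) { return false; }

// A vector whose .data() pointer can be handed directly to GPU kernels.
using UnifiedVector = std::vector<double, HipAllocator<double>>;
```

With such an allocator, a container's data() pointer can be handed to kernels as-is, avoiding both staging copies and first-touch page-fault storms from the default heap allocator.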
4. Empirical Performance and Memory Efficiency
Applications ported to run atop UPM on MI300A consistently match or outperform their explicitly managed counterparts. Notable results include a 19% reduction in execution time in backprop, an 86% reduction in compute time within dwt2d (when explicit transfers are removed), and parity or better for all other test kernels (Wahlgren et al., 18 Aug 2025). Importantly, UPM enables substantial memory savings, with reductions of up to 44% recorded for applications such as hotspot and nn, as duplicated host/device buffers are replaced by a single shared allocation (Wahlgren et al., 18 Aug 2025).
UPM eliminates device-to-host latency and energy costs, removing the software overhead and data movement barriers typical of UVM or SVM-based systems (which historically suffered 2–3× slowdowns during page migrations or thrashing (Cooper et al., 10 May 2024)). Performance in UPM is further enhanced by the ability to exploit highly tuned allocation, TLB configuration, and cache mapping; however, poor allocator or container patterns can still induce avoidable runtime overhead.
5. Hardware and Architectural Impact
Unified memory architectures instantiate system-wide coherence and eliminate per-device memory boundaries, as demonstrated not only in the AMD MI300A (Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024) but also in the Grace Hopper Superchip, which implements an integrated CPU–GPU system page table with hardware-level address translation and a cache-coherent NVLink-C2C interconnect (Schieffer et al., 10 Jul 2024). In these systems, both CPU and GPU can perform memory operations at cache-line granularity across physical memory, sustained by hardware-accelerated address lookup and consistent cache coherence protocols. This transparency allows direct, fine-grained access to all memory by any core and supports access patterns ranging from contiguous block streaming to the irregular, scattered fetches typical of HPC workloads.
Infinity Cache (on MI300A) further augments bandwidth and reduces effective memory latency, but only if allocations are channel-interleaved. System-scope atomic operations, both integer and floating-point, are natively supported across processor boundaries with coherent semantics, and their performance matches or surpasses that of the best discrete devices, depending on contention and working-set locality (Wahlgren et al., 18 Aug 2025).
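A sketch of such cross-processor atomics is given below, assuming a ROCm/HIP version that exposes the *_system atomic variants and a C++20 compiler for std::atomic_ref; the counter layout and iteration counts are illustrative:

```cpp
// System-scope atomics sketch: GPU threads and the CPU increment the same
// counter concurrently, relying on hardware coherence across the APU.
#include <hip/hip_runtime.h>
#include <atomic>
#include <cstdio>

__global__ void gpu_count(unsigned int* counter, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd_system(counter, 1u);   // system-scope atomic, visible to the CPU
}

int main() {
  unsigned int* counter = nullptr;
  hipHostMalloc(reinterpret_cast<void**>(&counter), sizeof(unsigned int), hipHostMallocDefault);
  *counter = 0;

  const size_t n = 1 << 20;
  const unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
  gpu_count<<<blocks, 256>>>(counter, n);     // kernel runs asynchronously

  std::atomic_ref<unsigned int> cpu_view(*counter);      // C++20
  for (int i = 0; i < 1000; ++i)
    cpu_view.fetch_add(1u, std::memory_order_relaxed);   // CPU updates the same word concurrently

  hipDeviceSynchronize();
  printf("counter = %u (expected %zu)\n", *counter, n + 1000);
  hipHostFree(counter);
  return 0;
}
```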
6. HPC System-Level Implications
UPM has a direct impact on cost-efficiency, scalability, and programming productivity in HPC. With duplicated buffers eliminated, the per-node RAM requirement is reduced by up to 44%, raising the maximum feasible problem size on fixed hardware. The absence of explicit transfer code reduces development time, and performance parity with (or gains over) explicitly managed models enables migration to unified memory without sacrificing speed (Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024).
The unified model particularly benefits data-intensive and memory-bound applications, which previously suffered from expensive host–device synchronization or device-local memory limits. Systems such as El Capitan, composed of thousands of MI300A APUs, leverage this for supporting large-scale simulation and AI with minimized energy and management overhead.
7. Comparative Context and Future Directions
UPM stands in contrast to precursor models: UVM (Unified Virtual Memory) and SVM (Shared Virtual Memory) deliver unified logical spaces but not physical integration, resulting in costly page migrations, aggressive prefetching with potential for thrashing, and complex driver- and application-level coordination (Cooper et al., 10 May 2024). UPM's hardware-level unified physical memory space, as in MI300A and Grace Hopper, removes these barriers and their associated costs.
Challenges remain regarding allocator and software compatibility, complex library interactions (e.g., handling of stack/static variables), and the efficient use of high-speed caches/interleaves. Nonetheless, UPM represents a convergence point in heterogeneous computing, promising unified abstraction, cost-effectiveness, and high performance for future scalable systems.
In sum, UPM enables all processors in a system (CPU, GPU, accelerators) to operate over a single, fully coherent physical memory address space, thereby eliminating duplicated device buffers, minimizing or removing explicit data transfers, reducing per-node memory requirements by up to 44%, matching or exceeding the performance of explicitly managed models, and streamlining application porting in large-scale HPC and AI deployments (Wahlgren et al., 18 Aug 2025, Tandon et al., 1 May 2024, Schieffer et al., 10 Jul 2024).