Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications (2405.06811v1)
Abstract: Discrete GPU accelerators provide massive computing power for supercomputers and data centers, but they maintain a memory domain separate from the host. Explicitly managing data movement between host and device memory in programs is tedious and error-prone. To improve programming portability and productivity, Unified Memory (UM) integrates GPU memory into the host virtual memory system, providing transparent data migration between the two domains and supporting GPU memory oversubscription. Nevertheless, current UM technologies cause significant performance loss for applications. With AMD GPUs increasingly being integrated into the world's leading supercomputers, it is necessary to understand their Shared Virtual Memory (SVM) technology and mitigate its performance impacts. In this work, we delve into the SVM design, examine its interactions with applications' data accesses at fine granularity, quantitatively analyze its performance effects on a range of applications, and identify the performance bottlenecks. Our research reveals that SVM employs an aggressive prefetching strategy for demand paging. This prefetching is efficient when GPU memory is not oversubscribed; however, in tandem with the eviction policy, it causes excessive thrashing and performance degradation for certain applications under oversubscription. We discuss SVM-aware algorithms and SVM design changes that mitigate these performance impacts. To the best of our knowledge, this work is the first in-depth and comprehensive study of SVM technologies.
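To make the programming-model difference concrete, the sketch below contrasts explicit host-to-device copies with an SVM/managed allocation in HIP. It is a minimal illustration, assuming ROCm's `hipMallocManaged` and `hipMemPrefetchAsync` APIs; the kernel, array size, and prefetch hint are illustrative choices, not taken from the paper, and error checks are omitted for brevity.

```cpp
// Minimal HIP sketch: explicit memory management vs. SVM/managed memory.
// Assumes a ROCm/HIP toolchain; sizes and kernel are illustrative only.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale(float* x, float a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // under SVM, each touched page is demand-migrated
}

int main() {
    const size_t n = 1 << 24;             // ~64 MiB of floats (illustrative)
    const size_t bytes = n * sizeof(float);

    // (a) Explicit management: separate host and device buffers, manual copies.
    std::vector<float> h(n, 1.0f);
    float* d = nullptr;
    hipMalloc(&d, bytes);
    hipMemcpy(d, h.data(), bytes, hipMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    hipMemcpy(h.data(), d, bytes, hipMemcpyDeviceToHost);
    hipFree(d);

    // (b) SVM / managed memory: one pointer valid on both host and device;
    // pages migrate on demand (with runtime prefetching) at first GPU touch.
    float* m = nullptr;
    hipMallocManaged(&m, bytes);
    for (size_t i = 0; i < n; ++i) m[i] = 1.0f;      // populated on the host
    hipMemPrefetchAsync(m, bytes, /*device=*/0, /*stream=*/0);  // optional hint
    scale<<<(n + 255) / 256, 256>>>(m, 2.0f, n);
    hipDeviceSynchronize();
    printf("m[0] = %f\n", m[0]);          // host access migrates pages back
    hipFree(m);
    return 0;
}
```

The paper's central observation is that the convenience of path (b) is not free: when the working set exceeds GPU memory, the runtime's aggressive prefetching interacts with the eviction policy and can thrash, whereas the explicit path (a) avoids this at the cost of programmer effort.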
Authors: Bennett Cooper, Thomas R. W. Scogland, Rong Ge