Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps (2405.03838v1)

Published 6 May 2024 in cs.DC

Abstract: CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy efficiency of such systems remains one of the most critical issues. As a single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become a first-class design constraint for HPC systems, such co-scheduling techniques should be well-tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVIDIA's MIG (Multi-Instance GPU) partitions a single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, and power capping based on our scalability/interference modeling, while taking a variety of aspects into account, such as compute/memory intensity and utilization of heterogeneous computational resources (e.g., Tensor Cores). The experimental results indicate that our approach successfully selects a near-optimal combination across multiple different workloads.
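The core idea in the abstract — jointly searching over chip partitions, job-to-instance allocations, and power caps using a scalability/interference model — can be sketched as a small brute-force search. Everything below is a hypothetical illustration: the partition shapes mirror MIG's GPC granularity on an A100-class GPU, but the model coefficients, power-cap values, and the `modeled_throughput` function are invented stand-ins for the paper's fitted models, not the authors' actual methodology.

```python
# Illustrative sketch: pick the best (partition, power-cap) combination for a
# set of co-scheduled jobs by exhaustive search over modeled throughputs.
# All numeric coefficients are assumptions for demonstration only.

from itertools import product

# MIG partitions a GPU at GPC granularity; an A100 has 7 GPCs, so example
# splits include a full-chip instance or several smaller instances.
PARTITIONS = [(7,), (4, 3), (3, 3), (2, 2, 2)]  # GPCs per instance (illustrative)
POWER_CAPS = [150, 200, 250]                     # chip-level caps in watts (illustrative)

def modeled_throughput(gpcs, cap, compute_intensity):
    """Toy scalability model: compute-bound jobs scale with GPC count and the
    power cap, while memory-bound jobs saturate early. Purely illustrative."""
    compute_part = gpcs * min(cap / 250.0, 1.0)
    memory_part = 0.6 + 0.4 * (gpcs / 7.0)
    return compute_intensity * compute_part + (1.0 - compute_intensity) * memory_part

def best_config(jobs):
    """Brute-force the partition x power-cap space for co-scheduled jobs,
    each described by a compute intensity in [0, 1]. Returns the tuple
    (total modeled throughput, partition, power cap)."""
    best = None
    for partition, cap in product(PARTITIONS, POWER_CAPS):
        if len(partition) != len(jobs):
            continue  # one job per GPU instance in this sketch
        total = sum(modeled_throughput(g, cap, ci)
                    for g, ci in zip(partition, jobs))
        if best is None or total > best[0]:
            best = (total, partition, cap)
    return best

# One compute-heavy job co-located with one memory-heavy job: the search
# gives the larger instance to the compute-heavy job.
print(best_config([0.9, 0.2]))
```

The real methodology replaces the toy model with scalability/interference models calibrated from profiling data (e.g., compute/memory intensity and Tensor Core utilization), but the shape of the optimization — enumerate partitionings, allocate jobs, evaluate under each power cap, keep the best — is the same.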

Authors (6)
  1. Eishi Arima (6 papers)
  2. Minjoon Kang (1 paper)
  3. Issa Saba (2 papers)
  4. Josef Weidendorfer (1 paper)
  5. Carsten Trinitis (4 papers)
  6. Martin Schulz (30 papers)
Citations (7)

