Characterizing Network Requirements for GPU API Remoting in AI Applications (2401.13354v1)

Published 24 Jan 2024 in cs.OS and cs.NI

Abstract: GPU remoting is a promising technique for supporting AI applications. Networking plays a key role in enabling remoting; however, the network latency and bandwidth required for efficient remoting are unknown. In this paper, we take a GPU-centric approach to derive the minimum latency and bandwidth requirements for GPU remoting while ensuring no (or little) performance degradation for AI applications. Our study, which includes a theoretical model, demonstrates that with careful remoting design, unmodified AI applications can run on a remoting setup over commodity networking hardware with modest network demands and no overhead, and in some cases with better performance.
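
To make the remoting mechanism concrete, the sketch below shows the general shape of GPU API remoting: a client-side stub intercepts a GPU API call, serializes its arguments, and ships it to a server that executes the call on a physical GPU. The protocol, class names, and server address are hypothetical illustrations of the general pattern, not the system evaluated in the paper.

```python
# A minimal, illustrative sketch of the GPU API remoting pattern the abstract
# refers to: the application-side stub captures a GPU API call, serializes it,
# and forwards it over the network to a server that owns the physical GPU.
# The protocol, names, and server address here are hypothetical, not the
# paper's implementation.
import json
import socket
import struct


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf


def call_remote(sock: socket.socket, request: dict) -> dict:
    """Send one length-prefixed JSON request and block for the reply (one RTT)."""
    data = json.dumps(request).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode())


class RemoteGPU:
    """Client-side stub: each method corresponds to one forwarded API call."""

    def __init__(self, host: str = "gpu-server", port: int = 9000):
        # In a real interposer this connection would be established when the
        # GPU library is first loaded (e.g., via LD_PRELOAD interception).
        self.sock = socket.create_connection((host, port))

    def launch_kernel(self, name: str, args: list) -> dict:
        # Synchronous round trip: latency on this path is what determines
        # whether remoting slows the application down.
        return call_remote(self.sock, {"op": "launch", "kernel": name, "args": args})


# Usage, assuming a server that speaks this toy protocol is listening:
#   gpu = RemoteGPU()
#   reply = gpu.launch_kernel("vector_add", [1024])
```

The per-call round trip in call_remote illustrates why the latency requirement matters: a synchronously forwarded API call cannot complete faster than one network RTT unless calls are batched or issued asynchronously, which is the kind of design choice the paper's GPU-centric analysis is meant to inform.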
