PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation (2405.12079v1)
Abstract: Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level GPU C/R system: it can transparently checkpoint or restore processes that use the GPU without requiring any cooperation from the application, a key feature required by modern systems like the cloud. Moreover, POS is the first OS-level C/R system that can execute C/R concurrently with application execution: a critical feature that can be trivially achieved when processes run only on the CPU, but becomes challenging when they use the GPU. The problem is how to ensure consistency during concurrent execution given the lack of application semantics that transparency implies. CPU processes can leverage the OS and hardware paging to fix inconsistency without application semantics. Unfortunately, the GPU bypasses the OS and paging for high performance. POS fills the semantic gap by speculatively extracting buffer access information of GPU kernels at runtime. Thanks to the simple and well-structured nature of GPU kernels, our speculative extraction (with runtime validation) achieves 100% accuracy on applications ranging from training to inference, in domains spanning vision, LLMs, and reinforcement learning. Based on the extracted semantics, we systematically overlap C/R with application execution and achieve orders of magnitude higher performance than the state-of-the-art OS-level GPU C/R across various tasks, including training fault tolerance, live GPU process migration, and cold start acceleration in GPU-based serverless computing.
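To make the idea of overlapping a checkpoint with execution more concrete, here is a toy Python model of how speculated write sets could keep a concurrent checkpoint consistent. It is only a sketch under stated assumptions, not the POS implementation: the abstract does not describe the mechanism at this level, and every name here (`Kernel`, `ToyCheckpointer`, `speculated_writes`, `actual_writes`) is hypothetical. GPU buffers are modeled as entries in a dictionary, and the "actual" writes are supplied by hand so that validation can be shown.

```python
from dataclasses import dataclass, field


@dataclass
class Kernel:
    name: str
    speculated_writes: set  # buffers the speculation predicts will be written
    actual_writes: dict     # buffer -> new value (ground truth, toy only)


@dataclass
class ToyCheckpointer:
    memory: dict                                  # stand-in for GPU buffers
    snapshot: dict = field(default_factory=dict)  # checkpoint image being built
    pending: list = field(default_factory=list)   # buffers not yet saved
    valid: bool = True                            # False after a harmful misprediction

    def begin(self):
        """Start a checkpoint: every buffer still has to be saved."""
        self.snapshot = {}
        self.pending = sorted(self.memory)
        self.valid = True

    def copy_one(self):
        """Background step: save one pending buffer while the app keeps running."""
        if self.pending:
            name = self.pending.pop(0)
            self.snapshot[name] = self.memory[name]

    def launch(self, kernel):
        """Run a kernel during the checkpoint without corrupting the image."""
        # Save the old contents of every buffer the kernel is predicted to
        # overwrite, so the snapshot still reflects the state at begin().
        for name in kernel.speculated_writes:
            if name in self.pending:
                self.pending.remove(name)
                self.snapshot[name] = self.memory[name]
        # "Execute" the kernel.
        self.memory.update(kernel.actual_writes)
        # Validation: a write outside the predicted set that hit an unsaved
        # buffer means the image can no longer be trusted.
        mispredicted = set(kernel.actual_writes) - kernel.speculated_writes
        if any(name in self.pending for name in mispredicted):
            self.valid = False


# Usage: the checkpoint overlaps with one kernel launch.
ck = ToyCheckpointer({"a": 1, "b": 2, "c": 3})
ck.begin()
ck.copy_one()                                   # saves "a" in the background
ck.launch(Kernel("k1", {"b"}, {"b": 20}))       # "b" is saved before the write
ck.copy_one()                                   # saves "c"
print(ck.snapshot, ck.valid)
```

In this toy, a correct prediction lets the application proceed while the snapshot still captures the state at the moment the checkpoint began. A misprediction that touches an unsaved buffer is detected and flags the image as invalid, which is the role the abstract assigns to runtime validation. How a real system recovers from that case is outside what the abstract states.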