SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation (2402.06194v2)
Abstract: Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure", for AI workloads. Such degradation significantly affects end-to-end performance and conceals performance issues, which complicates root cause analysis for failures and regressions. We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator, which learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector that balances validation time against issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61x. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.