PVF (Parameter Vulnerability Factor): A Scalable Metric for Understanding AI Vulnerability Against SDCs in Model Parameters (2405.01741v3)
Abstract: The reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDCs), that can corrupt model parameters. When this occurs during AI inference/serving, it can lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of this escalating threat, it is crucial to address two key questions: How vulnerable are AI models to parameter corruptions, and how do different components of a model (such as modules and layers) differ in their vulnerability to parameter corruptions? To address these questions systematically, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the Architectural Vulnerability Factor (AVF) from the computer architecture community, which aims to standardize the quantification of AI model vulnerability to parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that parameter will result in an incorrect model output. In this paper, we present several use cases of applying PVF during inference to three types of tasks/models -- recommendation (DLRM), vision classification (CNN), and text classification (BERT) -- along with an in-depth vulnerability analysis of DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, for example by mapping vulnerable parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
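The abstract's definition lends itself to a simple Monte Carlo estimate: repeatedly corrupt a parameter, run inference, and count how often the output changes from the fault-free result. The sketch below illustrates this idea for a PyTorch classification model at the granularity of one parameter tensor, under an assumed single-bit-flip fault model in float32 parameters. It is an assumption-laden illustration, not the paper's implementation; the names `estimate_pvf`, `flip_random_bit`, `param_name`, and `"classifier.weight"` are invented for this example.

```python
import random
import struct

import torch


def flip_random_bit(value: float) -> float:
    # Flip one uniformly chosen bit of a float32 value
    # (assumed single-bit-flip fault model).
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << random.randrange(32)
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    return corrupted


@torch.no_grad()
def estimate_pvf(model, param_name, inputs, golden_labels, n_trials=1000):
    # Monte Carlo PVF estimate for one (float32, contiguous) parameter
    # tensor: the fraction of injected corruptions that change the model's
    # predicted labels away from the fault-free ("golden") ones.
    model.eval()
    flat = dict(model.named_parameters())[param_name].data.view(-1)
    mismatches = 0
    for _ in range(n_trials):
        idx = random.randrange(flat.numel())
        original = flat[idx].item()
        flat[idx] = flip_random_bit(original)     # inject the fault
        corrupted = model(inputs).argmax(dim=-1)  # corrupted prediction
        mismatches += int((corrupted != golden_labels).any())
        flat[idx] = original                      # restore the parameter
    return mismatches / n_trials


# Usage sketch: compute fault-free predictions once, then estimate the PVF
# of a hypothetical parameter tensor named "classifier.weight".
# golden_labels = model(inputs).argmax(dim=-1)
# pvf = estimate_pvf(model, "classifier.weight", inputs, golden_labels)
```

In practice, fault-injection frameworks such as PyTorchFI (cited below) perform this kind of runtime perturbation at scale; the sketch only makes the probability being estimated concrete.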
- Criteo Kaggle Display Advertising dataset: https://ailab.criteo.com/ressources.
- NVIDIA Tesla Release Notes 535.129.03: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html, 2023.
- Resilience assessment of large language models under transient hardware faults. In 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pages 659–670. IEEE, 2023.
- Evaluating and accelerating high-fidelity error injection for HPC. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 577–589. IEEE, 2018.
- Jacob Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Harish Dattatraya Dixit et al. Detecting silent data corruptions in the wild. arXiv preprint arXiv:2203.08989, 2022.
- Peter Hazucha et al. Neutron soft error rate measurements in a 90-nm CMOS process and scaling trends in SRAM from 0.25-μm to 90-nm generation. In IEDM, 2003.
- Understanding and mitigating hardware failures in deep learning training systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–16, 2023.
- Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 588–597, 2019.
- Peter H. Hochschild et al. Cores that don’t count. In Proceedings of the Workshop on Hot Topics in Operating Systems, pages 9–16, 2021.
- Samuel Hsia et al. MP-Rec: Hardware-software co-design to enable multi-path recommendation. In ASPLOS, 2023.
- Xun Jiao et al. An assessment of vulnerability of hardware neural networks to dynamic voltage and temperature variations. In ICCAD, 2017.
- Sung Kim et al. MATIC: Learning around errors for efficient low-voltage neural network accelerators. In DATE, 2018.
- Yann LeCun et al. Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 2, 1989.
- Yann LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Guanpeng Li et al. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In SC, 2017.
- Abdulrahman Mahmoud et al. PyTorchFI: A runtime perturbation tool for DNNs. In DSN-W, 2020.
- Shubhendu S. Mukherjee et al. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), pages 29–40. IEEE, 2003.
- Maxim Naumov et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019.
- Adnan Siraj Rakin et al. Bit-flip attack: Crushing neural network with progressive bit search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1211–1220, 2019.
- Brandon Reagen et al. Ares: A framework for quantifying the resilience of deep neural networks. In DAC, 2018.
- Behrooz Sangchoolie et al. One bit is (not) enough: An empirical study of the impact of single and multiple bit-flip errors. In 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 97–108. IEEE, 2017.
- Xu Sun et al. Exploring the vulnerability of deep neural networks: A study of parameter corruption. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11648–11656, 2021.
- Understanding silent data corruptions in a large production cpu population. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 216–230, 2023.
- Xiang Zhang et al. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28, 2015.