ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures (2310.03841v2)
Abstract: Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability. It is crucial to ensure the correctness of their execution in spite of potential errors such as transient hardware errors. We propose a novel algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis and protection of transformer-based architectures. First, our work develops an efficient process for computing and ranking the resilience of transformer layers. We find that, due to the large size of transformer models, applying traditional network redundancy to a subset of the most vulnerable layers provides high error coverage, albeit with impractically high overhead. We address this shortcoming by providing a software-directed, checksum-based error detection technique aimed at protecting the most vulnerable general matrix multiply (GEMM) layers in transformer models that use either floating-point or integer arithmetic. Results show that our approach achieves over 99% coverage for errors that result in a mismatch, with less than 0.2% computation overhead and 0.01% memory overhead. Lastly, we demonstrate the applicability of our framework on various modern GPU architectures under different numerical precisions. We introduce an efficient self-correction mechanism for resolving detected errors with an average overhead of less than 2% per error.
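The checksum-based GEMM protection the abstract describes follows the classic algorithm-based fault tolerance (ABFT) idea: augment the operands with checksum rows/columns so the product carries redundant sums that expose a corrupted output. Below is a minimal NumPy sketch of that scheme under stated assumptions; the function name `checked_matmul` and the tolerance parameter are illustrative, not taken from the paper, and a production version would fold the checks into the GPU kernels as the paper does.

```python
import numpy as np

def checked_matmul(A, B, atol=1e-6):
    """Multiply A @ B with ABFT-style checksum verification (illustrative sketch).

    A column-checksum row is appended to A and a row-checksum column to B,
    so the augmented product carries sums that can flag a corrupted result.
    """
    # Augment: bottom row of A_aug holds column sums of A,
    # right column of B_aug holds row sums of B.
    A_aug = np.vstack([A, A.sum(axis=0)])
    B_aug = np.hstack([B, B.sum(axis=1, keepdims=True)])

    C_aug = A_aug @ B_aug          # one augmented GEMM
    C = C_aug[:-1, :-1]            # the actual product

    # Verify: the checksum row/column must match sums recomputed from C.
    row_ok = np.allclose(C_aug[-1, :-1], C.sum(axis=0), atol=atol)
    col_ok = np.allclose(C_aug[:-1, -1], C.sum(axis=1), atol=atol)
    return C, (row_ok and col_ok)
```

A mismatch in either checksum localizes the faulty row and column of `C`, which is what makes low-cost self-correction (recomputing only the flagged element) possible.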