Error Checking for Sparse Systolic Tensor Arrays (2402.10850v1)
Abstract: Structured sparsity is an efficient way to prune the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. In such cases, the acceleration of structured-sparse ML models is handled by sparse systolic tensor arrays. The increasing prevalence of ML in safety-critical systems requires enhancing sparse tensor arrays with online error detection to manage random hardware failures. Algorithm-based fault tolerance has been proposed as a low-cost mechanism to check, at runtime, the results of computations against random hardware failures. In this work, we address a key architectural challenge with structured-sparse tensor arrays: how to provide online error checking for a range of structured sparsity levels while maintaining high utilization of the hardware. Experimental results highlight the minimal hardware overhead incurred by the proposed checking logic and its error-detection capability, evaluated by injecting random hardware faults into sparse tensor arrays executing layers of the ResNet50 CNN.
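The abstract relies on classic algorithm-based fault tolerance (ABFT) for matrix operations (Huang and Abraham, cited below): the inputs are augmented with a column-checksum row and a row-checksum column, and any single random fault in the product violates a checksum equality. The following is a minimal, hedged sketch of that checksum scheme in plain Python; it is illustrative only, and a real sparse systolic tensor array would fold the checksum rows and columns into the systolic dataflow rather than compute them separately.

```python
# Sketch of algorithm-based fault tolerance (ABFT) for matrix multiply.
# All helper names here are illustrative, not from the paper.

def matmul(A, B):
    """Plain triple-loop matrix product C = A * B."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def with_column_checksum(A):
    """Append a row holding the sum of each column of A."""
    cols = list(zip(*A))
    return [list(row) for row in A] + [[sum(c) for c in cols]]

def with_row_checksum(B):
    """Append to each row of B an extra column holding its sum."""
    return [list(row) + [sum(row)] for row in B]

def abft_check(C_full):
    """C_full is (n+1) x (p+1); last row/column carry the checksums.
    Returns True iff every row and column sum matches its checksum."""
    n, p = len(C_full) - 1, len(C_full[0]) - 1
    rows_ok = all(sum(C_full[i][:p]) == C_full[i][p] for i in range(n + 1))
    cols_ok = all(sum(C_full[i][j] for i in range(n)) == C_full[n][j]
                  for j in range(p + 1))
    return rows_ok and cols_ok

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# Multiplying the augmented matrices yields a full-checksum result.
C_full = matmul(with_column_checksum(A), with_row_checksum(B))
assert abft_check(C_full)       # fault-free result passes the check

C_full[0][0] += 1               # inject a single random fault
assert not abft_check(C_full)   # the violated checksums flag it
```

The key property exploited in hardware is that the checksum rows and columns are computed by the same multiply-accumulate operations as the data, so the check adds only one extra row and column of work per tile.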
- T. Hoefler et al., “Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 10882–11005, 2021.
- U. Evci et al., “Rigging the lottery: Making all tickets winners,” in Inter. Conf. on Machine Learning, Jul. 2020, pp. 2943–2952.
- A. Mishra et al., “Accelerating sparse deep neural networks,” arXiv preprint arXiv:2104.08378, 2021.
- A. Zhou et al., “Learning N:M fine-grained structured sparse neural networks from scratch,” in Inter. Conf. on Learning Representations (ICLR), May 2021.
- H. T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37–46, 1982.
- A. Samajdar et al., “A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim,” in IEEE Intern. Symp. on Perf. Analysis of Systems and Software (ISPASS), Aug. 2020.
- Z.-G. Liu, P. N. Whatmough, Y. Zhu, and M. Mattina, “S2TA: Exploiting structured sparsity for energy-efficient mobile CNN acceleration,” in IEEE Inter. Symp. on High-Performance Computer Architecture (HPCA), Apr. 2022, pp. 573–586.
- G. Jeong et al., “VEGETA: Vertically-integrated extensions for sparse/dense GEMM tile acceleration on CPUs,” in IEEE Inter. Symp. on High-Performance Computer Architecture (HPCA), Feb. 2023, pp. 259–272.
- R. Salay, R. Queiroz, and K. Czarnecki, “An analysis of ISO 26262: Using machine learning safely in automotive software,” 2017.
- R. L. R. Junior and P. Rech, “Reliability of Google’s tensor processing units for convolutional neural networks,” in IEEE/IFIP Intern. Conf. on Dependable Systems and Networks (DSN), 2022, pp. 25–27.
- S. Kundu, S. Banerjee, A. Raha, S. Natarajan, and K. Basu, “Toward functional safety of systolic array-based deep learning hardware accelerators,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 3, pp. 485–498, 2021.
- K. T. Chitty-Venkata and A. Somani, “Impact of structural faults on neural network performance,” in IEEE Intern. Conf. on Application-specific Systems, Architectures and Processors (ASAP), 2019.
- Y. Zhao, K. Wang, and A. Louri, “FSA: An efficient fault-tolerant systolic array-based DNN accelerator architecture,” in IEEE Intern. Conference on Computer Design (ICCD), 2022, pp. 545–552.
- K. T. Chitty-Venkata and A. K. Somani, “Model compression on faulty array-based neural network accelerator,” in IEEE Pacific Rim Intern. Symp. on Dependable Computing (PRDC), 2020, pp. 90–99.
- J. J. Zhang, K. Basu, and S. Garg, “Fault-tolerant systolic array based accelerators for deep neural network execution,” IEEE Design & Test, vol. 36, no. 5, pp. 44–53, 2019.
- H. Lee, J. Kim, J. Park, and S. Kang, “STRAIT: Self-test and self-recovery for AI accelerator,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2023.
- E. Vacca, G. Ajmone, L. Sterpone et al., “RunSAFER: A novel runtime fault detection approach for systolic array accelerators,” in IEEE International Conference on Computer Design (ICCD), 2023.
- R. Baumann, “Radiation-induced soft errors in advanced semiconductor technologies,” IEEE Trans. on Device and Materials Reliability, vol. 5, no. 3, pp. 305–316, 2005.
- S. Borkar, “Designing reliable systems from unreliable components: the challenges of transistor variability and degradation,” IEEE Micro, vol. 25, no. 6, pp. 10–16, 2005.
- K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Trans. on Computers, vol. C-33, no. 6, pp. 518–528, 1984.
- S.-J. Wang and N. Jha, “Algorithm-based fault tolerance for FFT networks,” IEEE Trans. on Computers, vol. 43, no. 7, pp. 849–854, 1994.
- J. A. Abraham et al., “Fault tolerance techniques for systolic arrays,” Computer, vol. 20, no. 7, pp. 65–75, 1987.
- P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang, J. Chen, and Z. Chen, “Towards practical algorithm based fault tolerance in dense linear algebra,” in Proc. of the ACM Intern. Symp. on High-Performance Parallel and Distributed Computing, 2016, pp. 31–42.
- D. Filippas, N. Margomenos, N. Mitianoudis, C. Nicopoulos, and G. Dimitrakopoulos, “Low-cost online convolution checksum checker,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 2, pp. 201–212, 2022.
- M. Safarpour, R. Inanlou, and O. Silvén, “Algorithm level error detection in low voltage systolic array,” IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 69, no. 2, pp. 569–573, 2021.
- F. Libano, P. Rech, and J. Brunhaver, “Efficient error detection for matrix multiplication with systolic arrays on FPGAs,” IEEE Transactions on Computers, 2023.
- S. Bal, C. S. Mummidi, V. Da Cruz Ferreira, S. Srinivasan, and S. Kundu, “A novel fault-tolerant architecture for tiled matrix multiplication,” in Design, Automation & Test in Europe (DATE), 2023.
- K. He et al., “Deep residual learning for image recognition,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun. 2016.