Cross-Layer Optimization for Fault-Tolerant Deep Learning (2312.13754v1)

Published 21 Dec 2023 in cs.AR, cs.AI, and cs.LG

Abstract: Fault-tolerant deep learning accelerators are the basis for highly reliable deep learning processing and are critical for deploying deep learning in safety-critical applications such as avionics and robotics. Since deep learning is known to be computing- and memory-intensive, traditional fault-tolerant approaches based on redundant computing incur substantial overhead in power consumption and chip area. To this end, we propose to characterize the vulnerability differences of deep learning models across both neurons and the bits of each neuron, and to leverage these differences to enable selective protection of the deep learning processing components at the architecture layer and the circuit layer, respectively. At the same time, we observe a correlation between model quantization and the bit protection overhead of the underlying processing elements of deep learning accelerators, and propose to reduce the bit protection overhead by adding an additional quantization constraint without compromising model accuracy. Finally, we employ a Bayesian optimization strategy to co-optimize the correlated cross-layer design parameters at the algorithm, architecture, and circuit layers, minimizing hardware resource consumption while simultaneously fulfilling multiple user constraints on the reliability, accuracy, and performance of the deep learning processing.
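The co-optimization step described in the abstract lends itself to a compact illustration. Below is a minimal sketch (not the authors' implementation) of constrained Bayesian optimization over three cross-layer knobs, assuming scikit-optimize's `gp_minimize`. The choice of knobs (quantization bit width, protected-PE fraction, hardened MSB count), the constraint thresholds, and all three model functions are illustrative stand-ins for the paper's measured accuracy, reliability, and hardware-cost models.

```python
# Sketch: cross-layer co-optimization via Bayesian optimization.
# All models below are toy stand-ins, not the paper's actual models.
from skopt import gp_minimize
from skopt.space import Integer, Real

# Design space spanning the three layers:
#   w -- quantization bit width (algorithm layer)
#   p -- fraction of vulnerable PEs given redundant copies (architecture layer)
#   m -- number of hardened high-order bits per PE register (circuit layer)
space = [
    Integer(4, 16, name="w"),
    Real(0.0, 1.0, name="p"),
    Integer(0, 8, name="m"),
]

ACC_MIN, REL_MIN = 0.90, 0.999  # user constraints (illustrative values)

def accuracy(w):
    # Toy model: accuracy saturates as bit width grows.
    return 0.95 - 0.4 / w

def reliability(p, m, w):
    # Toy model: protecting more PEs and hardening more of the
    # high-order (most vulnerable) bits leaves fewer unmasked fault sites.
    uncovered = (1.0 - p) * max(w - m, 0) / w
    return 1.0 - 0.01 * uncovered

def hw_cost(w, p, m):
    # Toy model: area grows with bit width, PE redundancy, and hardened bits.
    return w * (1.0 + p) + 0.5 * m

def objective(x):
    w, p, m = x
    cost = hw_cost(w, p, m)
    # Penalty method: fold the accuracy/reliability constraints into
    # the scalar objective that the Bayesian optimizer minimizes.
    cost += 1e3 * max(0.0, ACC_MIN - accuracy(w))
    cost += 1e3 * max(0.0, REL_MIN - reliability(p, m, w))
    return cost

result = gp_minimize(objective, space, n_calls=40, random_state=0)
w, p, m = result.x
print(f"bit width={w}, protected-PE fraction={p:.2f}, hardened MSBs={m}")
```

The penalty terms are the simplest way to apply an off-the-shelf Bayesian optimizer to a constrained problem; the paper's actual formulation of the reliability, accuracy, and performance constraints may differ.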

