YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs

Published 1 Oct 2023 in cs.AR, cs.LG, and cs.PF | (2310.00574v3)

Abstract: We address the challenges associated with deploying neural networks on CPUs, with a particular focus on minimizing inference time while maintaining accuracy. Our novel approach is to use the dataflow (i.e., computation order) of a neural network to explore data reuse opportunities using heuristic-guided analysis and a code generation framework, which enables exploration of various Single Instruction, Multiple Data (SIMD) implementations to achieve optimized neural network execution. Our results demonstrate that the dataflow that keeps outputs in SIMD registers while also maximizing both input and weight reuse consistently yields the best performance for a wide variety of inference workloads, achieving up to 3x speedup for 8-bit neural networks and up to 4.8x speedup for binary neural networks over today's optimized neural network implementations.
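
To make the notion of dataflow as computation order concrete, the following is a minimal sketch of two loop orders for the same 1D convolution. It is not the paper's code generator or its SIMD kernels: the function names, the scalar `acc` standing in for a SIMD register, and the `stores` counter are all assumptions made for illustration. In the first order each output is accumulated locally and written once; in the second each weight is held fixed while partial sums are written back to the output array on every step.

```python
# Illustrative only: plain Python stands in for SIMD code, and `acc` plays the
# role of a register that holds one output until its reduction is complete.

def conv1d_output_stationary(inputs, weights):
    """Accumulate each output locally and store it exactly once."""
    n_out = len(inputs) - len(weights) + 1
    outputs = [0] * n_out
    stores = 0  # number of writes to the output array
    for o in range(n_out):
        acc = 0  # stays "in register" for the whole reduction
        for k, w in enumerate(weights):
            acc += inputs[o + k] * w
        outputs[o] = acc
        stores += 1
    return outputs, stores


def conv1d_weight_stationary(inputs, weights):
    """Hold each weight fixed and update every output's partial sum in memory."""
    n_out = len(inputs) - len(weights) + 1
    outputs = [0] * n_out
    stores = 0  # number of writes to the output array
    for k, w in enumerate(weights):
        for o in range(n_out):
            outputs[o] += inputs[o + k] * w  # partial sum round-trips via memory
            stores += 1
    return outputs, stores
```

Both functions compute the same result; they differ only in how often the output array is written, which is the kind of reuse trade-off a dataflow choice controls. The counts here say nothing about real speedups, which depend on the hardware and the generated SIMD code.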
