OzMAC: An Energy-Efficient Sparsity-Exploiting Multiply-Accumulate-Unit Design for DL Inference (2402.19376v1)

Published 29 Feb 2024 in cs.AR

Abstract: General Matrix Multiply (GEMM) hardware, employing large arrays of multiply-accumulate (MAC) units, performs the bulk of the computation in deep learning (DL). Recent trends have established 8-bit integer (INT8) as the most widely used precision for DL inference. This paper proposes a novel MAC design capable of dynamically exploiting bit sparsity (i.e., the number of '0' bits within a binary value) in input data to achieve significant improvements in area, power, and energy. The proposed architecture, called OzMAC (Omit-zero-MAC), skips over zeros within a binary input value and performs simple shift-and-add-based compute in place of expensive multipliers. We implement OzMAC in SystemVerilog and present post-synthesis performance-power-area (PPA) results using a commercial TSMC N5 (5 nm) process node. Using eight pretrained INT8 deep neural networks (DNNs) as benchmarks, we demonstrate the existence of high bit sparsity in real DNN workloads and show that 8-bit OzMAC significantly improves all three metrics of area, power, and energy, by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precisions (4, 8, 16 bits) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz). When the 8-bit OzMAC's clock frequency is scaled to normalize its throughput against a conventional MAC, it still achieves a 30% improvement in both power and energy.
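
The abstract describes OzMAC's core idea: replace a full multiplier with shift-and-add compute that iterates only over the '1' bits of an input operand, so '0' bits cost nothing. Below is a minimal Python behavioral sketch of that idea; it is an illustration of bit-sparsity exploitation under our own assumptions, not the authors' SystemVerilog RTL, and the function name `oz_mac` is hypothetical.

```python
# Behavioral sketch (not the paper's RTL) of an omit-zero shift-and-add MAC:
# multiply by walking only the '1' bits of the activation and accumulating
# shifted copies of the weight, so zero bits are skipped entirely.

def oz_mac(acc: int, weight: int, activation: int) -> int:
    """Accumulate weight * activation into acc via shift-and-add,
    skipping the '0' bits of the activation (bit sparsity)."""
    sign = -1 if activation < 0 else 1
    a = abs(activation)
    bit_pos = 0
    while a:
        if a & 1:                      # only '1' bits trigger a shift-and-add
            acc += sign * (weight << bit_pos)
        a >>= 1
        bit_pos += 1
    return acc

# Toy usage: INT8-style dot product; sparser bit patterns need fewer adds.
weights = [3, -7, 12, 5]
acts    = [0b00010001, 0b00000010, 0, 0b01000000]
acc = 0
for w, x in zip(weights, acts):
    acc = oz_mac(acc, w, x)
assert acc == sum(w * x for w, x in zip(weights, acts))
print(acc)
```

In this sketch the number of add-and-shift steps equals the number of '1' bits in the activation, which mirrors the paper's observation that high bit sparsity in real DNN workloads translates into less switching activity and energy per MAC operation.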

