Hardware-Aware DNN Compression via Diverse Pruning and Mixed-Precision Quantization (2312.15322v1)

Published 23 Dec 2023 in cs.LG

Abstract: Deep Neural Networks (DNNs) have shown significant advantages in a wide variety of domains. However, DNNs are becoming computationally intensive and energy-hungry at an exponential pace, while at the same time there is a vast demand for running sophisticated DNN-based services on resource-constrained embedded devices. In this paper, we target energy-efficient inference on embedded DNN accelerators. To that end, we propose an automated framework that compresses DNNs in a hardware-aware manner by jointly employing pruning and quantization. We explore, for the first time, per-layer fine- and coarse-grained pruning within the same DNN architecture, in addition to low bit-width mixed-precision quantization for weights and activations. Reinforcement Learning (RL) is used to explore the associated design space and identify the pruning-quantization configuration so that energy consumption is minimized while the prediction-accuracy loss is kept at acceptable levels. Using our novel composite RL agent, we are able to extract energy-efficient solutions without requiring retraining and/or fine-tuning. Our extensive experimental evaluation over widely used DNNs and the CIFAR-10/100 and ImageNet datasets demonstrates that our framework achieves a $39\%$ average energy reduction for a $1.7\%$ average accuracy loss and significantly outperforms state-of-the-art approaches.
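To make the joint design space concrete, the minimal Python sketch below mimics the kind of per-layer search the abstract describes: each layer is assigned a pruning granularity (fine- or coarse-grained), a pruning ratio, and weight/activation bit-widths, and candidate configurations are scored against an energy objective under an accuracy-loss constraint. Everything here is illustrative: the value ranges, the toy energy and accuracy proxies, and the helper names (sample_config, estimate_energy, estimate_accuracy_loss, search) are assumptions, and plain random search stands in for the paper's composite RL agent.

```python
import random

# Hypothetical per-layer design space mirroring the choices described in the
# abstract: pruning granularity, pruning ratio, and weight/activation bit-widths.
PRUNING_TYPES = ["fine", "coarse"]        # fine- or coarse-grained pruning per layer
PRUNING_RATIOS = [0.0, 0.25, 0.5, 0.75]   # fraction of weights (or filters) removed
BIT_WIDTHS = [2, 4, 6, 8]                 # mixed-precision candidates


def sample_config(num_layers):
    """Draw one per-layer pruning/quantization configuration at random."""
    return [
        {
            "prune_type": random.choice(PRUNING_TYPES),
            "prune_ratio": random.choice(PRUNING_RATIOS),
            "w_bits": random.choice(BIT_WIDTHS),
            "a_bits": random.choice(BIT_WIDTHS),
        }
        for _ in range(num_layers)
    ]


def estimate_energy(config):
    """Placeholder energy proxy: fewer bits and more pruning cost less energy.
    A real framework would query a hardware/accelerator energy model instead."""
    return sum((1.0 - c["prune_ratio"]) * c["w_bits"] * c["a_bits"] for c in config)


def estimate_accuracy_loss(config):
    """Placeholder accuracy proxy: aggressive compression incurs larger loss.
    A real framework would evaluate the compressed DNN on a validation set."""
    return sum(c["prune_ratio"] * (8.0 / min(c["w_bits"], c["a_bits"]))
               for c in config) / len(config)


def search(num_layers=8, budget=2000, max_acc_loss=1.7):
    """Random search over the joint space: minimize estimated energy while
    keeping the estimated accuracy loss below a threshold (percentage points)."""
    baseline = estimate_energy(
        [{"prune_ratio": 0.0, "w_bits": 8, "a_bits": 8}] * num_layers)
    best, best_energy = None, float("inf")
    for _ in range(budget):
        cfg = sample_config(num_layers)
        if estimate_accuracy_loss(cfg) > max_acc_loss:
            continue  # reject configurations violating the accuracy constraint
        energy = estimate_energy(cfg)
        if energy < best_energy:
            best, best_energy = cfg, energy
    saving = 100.0 * (1.0 - best_energy / baseline) if best else 0.0
    return best, saving


if __name__ == "__main__":
    cfg, saving = search()
    print(f"Best feasible configuration saves ~{saving:.1f}% energy (toy model)")
```

In the actual framework, the toy proxies would be replaced by an accelerator energy model and validation-set accuracy measurements, and the composite RL agent would steer the sampling toward promising per-layer configurations rather than drawing them uniformly at random.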

Authors (7)
  1. Konstantinos Balaskas (10 papers)
  2. Andreas Karatzas (10 papers)
  3. Christos Sad (1 paper)
  4. Kostas Siozios (9 papers)
  5. Iraklis Anagnostopoulos (18 papers)
  6. Georgios Zervakis (31 papers)
  7. Jörg Henkel (44 papers)
Citations (7)