Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition (2306.05021v2)

Published 8 Jun 2023 in cs.LG and cs.AR

Abstract: Neural network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with hardware customised for each layer, achieving ultra-high throughput and low latency. The deployment of neural networks to such dataflow accelerators is usually hindered by the available on-chip memory, as it is desirable to preload the network weights on-chip to maximise system performance. To address this, networks are usually compressed before deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs is proposed, based on a novel tensor decomposition method called Mixed-TD. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x gains in throughput per DSP compared to state-of-the-art CNN accelerators. Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD
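To make the per-layer choice concrete, the sketch below (not the authors' implementation; see the linked repository for the actual Mixed-TD code) approximates a toy 4D convolution weight of shape (out_channels, in_channels, kH, kW) two ways: a truncated SVD of the flattened kernel and a CPD computed with TensorLy. It then reports the parameter count and reconstruction error of each, which is the kind of layer-specific trade-off a mixed SVD/CPD scheme selects between. The rank values and tensor sizes are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): compare truncated SVD
# and CPD as low-rank approximations of one conv layer's weight tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend('numpy')

def svd_approx(weight, rank):
    """Truncated SVD of the kernel flattened to (C_out, C_in*kH*kW)."""
    c_out, c_in, kh, kw = weight.shape
    mat = weight.reshape(c_out, -1)
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    params = rank * (c_out + c_in * kh * kw)  # two factor matrices
    return approx.reshape(weight.shape), params

def cpd_approx(weight, rank):
    """Rank-R canonical polyadic decomposition of the 4D kernel."""
    cp = parafac(tl.tensor(weight), rank=rank, n_iter_max=200, tol=1e-8)
    approx = tl.cp_to_tensor(cp)
    params = rank * sum(weight.shape)  # one factor matrix per mode
    return np.asarray(approx), params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)  # toy conv weight
    for name, fn in [("SVD", svd_approx), ("CPD", cpd_approx)]:
        approx, params = fn(w, rank=16)
        err = np.linalg.norm(w - approx) / np.linalg.norm(w)
        print(f"{name}: params={params}, relative error={err:.3f}")
```

Under the same rank, the two decompositions yield different parameter counts and accuracies per layer, which is why selecting the decomposition layer by layer can improve throughput per DSP over a single global choice.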

Authors (2)
  1. Zhewen Yu (11 papers)
  2. Christos-Savvas Bouganis (38 papers)

