SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction (2403.18921v1)
Abstract: Convolutional Neural Networks (CNNs) have demonstrated their effectiveness in numerous vision tasks, but their high processing requirements necessitate efficient hardware acceleration to meet application performance targets. On FPGAs, streaming-based dataflow architectures are widely adopted, as layer-wise pipelining and reduced off-chip memory access (by retaining data on-chip) yield significant performance gains. However, modern topologies such as UNet, YOLO, and X3D employ long skip connections that require substantial on-chip storage, limiting the performance achievable by such system architectures. This paper addresses this limitation by introducing weight and activation eviction mechanisms that move data to off-chip memory along the computational pipeline, taking the available compute and memory resources into account. The proposed mechanism is incorporated into an existing toolflow, expanding the design space by utilising off-chip memory as a buffer. This enables the mapping of such modern CNNs to devices with limited on-chip memory under the streaming architecture design approach. SMOF delivers competitive and, in some cases, state-of-the-art performance across a spectrum of computer vision tasks, achieving up to 10.65× throughput improvement over previous works.
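The core idea of eviction, deciding which long skip-connection buffers to spill to off-chip memory so the remainder fits in the on-chip budget, can be illustrated with a minimal sketch. This is an assumption-laden toy model, not the paper's actual design-space exploration: it uses a simple greedy heuristic (evict the largest buffers first) and hypothetical names (`plan_eviction`, `on_chip_budget`).

```python
# Toy illustration of off-chip eviction planning (NOT the SMOF algorithm):
# greedily evict the largest skip-connection buffers to off-chip DRAM
# until the remaining activations fit in the on-chip memory budget.

def plan_eviction(buffer_sizes, on_chip_budget):
    """Return (evicted_indices, on_chip_bytes_used).

    buffer_sizes   -- size in bytes of each skip-connection buffer
    on_chip_budget -- available on-chip (BRAM/URAM) memory in bytes
    """
    # Consider the largest buffers first: evicting them frees the most
    # on-chip memory per off-chip stream that must be added.
    order = sorted(range(len(buffer_sizes)),
                   key=lambda i: buffer_sizes[i], reverse=True)
    evicted = set()
    total = sum(buffer_sizes)
    for i in order:
        if total <= on_chip_budget:
            break               # everything remaining fits on-chip
        evicted.add(i)          # spill this buffer to off-chip memory
        total -= buffer_sizes[i]
    return sorted(evicted), total
```

For example, with three skip buffers of 100, 400, and 250 bytes and a 300-byte budget, the two largest buffers are evicted and only 100 bytes remain on-chip. A real toolflow would additionally weigh off-chip bandwidth contention and pipeline stalls, which this sketch ignores.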
Authors: Petros Toupas, Zhewen Yu, Christos-Savvas Bouganis, Dimitrios Tzovaras