A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE (2401.02721v3)
Abstract: Transformers have been adopted for image recognition tasks and shown to outperform CNNs and RNNs, but they suffer from high training cost and computational complexity. To address these issues, a recent research trend is a hybrid approach that replaces part of a ResNet with an MHSA (Multi-Head Self-Attention) mechanism. In this paper, we propose a lightweight hybrid model that uses a Neural ODE (Ordinary Differential Equation) as a backbone instead of ResNet, so that the number of iterations of a building block can be increased while the same parameters are reused, mitigating the growth in parameter size per iteration. The proposed model is deployed on a modest-sized FPGA device for edge computing. The model is further quantized with a QAT (Quantization-Aware Training) scheme to reduce FPGA resource utilization while suppressing the accuracy loss. The quantized model achieves 79.68% top-1 accuracy on the STL10 dataset, which contains 96$\times$96 pixel images. The weights of the feature extraction network are stored on-chip, eliminating the memory transfer overhead and allowing inference to run seamlessly and faster. The proposed FPGA implementation accelerates the backbone and MHSA parts by 34.01$\times$ and achieves an overall 9.85$\times$ speedup when software pre- and post-processing are taken into account. The FPGA acceleration also leads to 7.10$\times$ better energy efficiency compared to an ARM Cortex-A53 CPU. The proposed lightweight Transformer model is demonstrated on a Xilinx ZCU104 board for the recognition of 96$\times$96 pixel images and can be applied to different image sizes by modifying the pre-processing layer.
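The key idea of the backbone can be illustrated with a minimal sketch (not the authors' implementation): a single convolutional block $f(x)$ is applied repeatedly by a fixed-step Euler solver, so the effective depth grows with the number of iterations while the parameter count stays constant, and one such Neural ODE stage is followed by an MHSA layer as in the hybrid design. Names such as `ODEBlock`, `num_steps`, and the channel/head sizes below are illustrative assumptions.

```python
# Minimal sketch (assumed structure, not the paper's code): a Neural ODE block that
# reuses one set of convolution weights across solver steps, followed by a
# Multi-Head Self-Attention layer, mimicking the "ODE backbone + MHSA" hybrid.
import torch
import torch.nn as nn

class ConvODEFunc(nn.Module):
    """dx/dt = f(x): a small conv block whose weights are shared across all steps."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv2(self.act(self.conv1(x)))

class ODEBlock(nn.Module):
    """Fixed-step Euler integration: x_{k+1} = x_k + h * f(x_k).
    Increasing num_steps deepens the network without adding parameters."""
    def __init__(self, channels: int, num_steps: int = 4):
        super().__init__()
        self.func = ConvODEFunc(channels)
        self.num_steps = num_steps
        self.h = 1.0 / num_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_steps):
            x = x + self.h * self.func(x)  # same parameters reused every iteration
        return x

class HybridBackbone(nn.Module):
    """ODE backbone followed by one MHSA layer over the flattened feature map."""
    def __init__(self, channels: int = 64, num_heads: int = 4, num_steps: int = 4):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1)
        self.ode = ODEBlock(channels, num_steps)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ode(self.stem(x))                   # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C) spatial tokens
        out, _ = self.attn(tokens, tokens, tokens)   # self-attention over tokens
        return out.mean(dim=1)                       # pooled feature for a classifier head

# Example: a 96x96 RGB input, the image size used for STL10 in the paper.
feats = HybridBackbone()(torch.randn(1, 3, 96, 96))
print(feats.shape)  # torch.Size([1, 64])
```

In the actual design, a different solver or iteration count could be chosen, and QAT would wrap such modules with fake-quantization before the network is mapped onto the FPGA; this sketch only shows the parameter-reuse principle.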
- Ikumi Okubo
- Keisuke Sugiura
- Hiroki Matsutani