Vision Transformer Computation and Resilience for Dynamic Inference (2212.02687v3)
Abstract: State-of-the-art deep learning models for computer vision tasks are based on the transformer architecture and are often deployed in real-time applications. In this scenario, the resources available for each inference can vary, so it is useful to dynamically adapt execution to trade accuracy for efficiency. To create dynamic models, we leverage the resilience of vision transformers to pruning and switch between differently scaled versions of a model. Surprisingly, we find that most FLOPs are generated by convolutions, not attention. These relative FLOP counts are not a good predictor of GPU performance, however, since GPUs have special optimizations for convolutions. Some models are fairly resilient and their execution can be adapted without retraining, while all models achieve better accuracy when their alternative execution paths are retrained. These insights mean that we can leverage CNN accelerators and these alternative execution paths to enable efficient, dynamic vision transformer inference. Our analysis shows that this type of dynamic execution saves 28% of energy with a 1.4% accuracy drop for SegFormer (63 GFLOPs) with no additional training, and 53% of energy with a 3.3% accuracy drop for ResNet-50 (4 GFLOPs) by switching between pretrained Once-For-All models.
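Two of the abstract's claims lend themselves to short sketches. First, the per-operator FLOP breakdown behind the "convolutions, not attention" observation can be reproduced with an off-the-shelf counter. The snippet below uses fvcore's `FlopCountAnalysis`, which is an assumed choice of tooling, not necessarily the profiler the paper itself used:

```python
import torch
import torchvision.models as models
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

def flops_by_operator(model: torch.nn.Module, example_input: torch.Tensor) -> None:
    """Print the share of FLOPs contributed by each operator type."""
    analysis = FlopCountAnalysis(model, (example_input,))
    total = analysis.total()
    for op, flops in sorted(analysis.by_operator().items(), key=lambda kv: -kv[1]):
        print(f"{op:>12}: {flops / 1e9:7.2f} GFLOPs ({100 * flops / total:5.1f}%)")

# Any torch.nn.Module works; a hybrid ViT backbone would expose the conv share.
flops_by_operator(models.resnet50(weights=None).eval(), torch.randn(1, 3, 224, 224))
```

Second, the dynamic-inference mechanism itself (re-selecting among scaled versions of a pretrained model whenever the compute budget changes) reduces to a small selection loop. The variant registry, GFLOP figures, and greedy policy below are illustrative assumptions standing in for, e.g., a set of Once-For-All subnets:

```python
import torch
import torchvision.models as models

# Hypothetical registry of (approximate GFLOPs, constructor) pairs; in
# practice each entry would load pretrained weights for a scaled variant.
VARIANTS = [
    (1.8, lambda: models.resnet18(weights=None)),
    (3.7, lambda: models.resnet34(weights=None)),
    (4.1, lambda: models.resnet50(weights=None)),
]

def select_variant(budget_gflops: float):
    """Pick the most accurate (largest) variant that fits the current budget."""
    feasible = [v for v in VARIANTS if v[0] <= budget_gflops]
    if not feasible:  # budget below the cheapest variant: fall back to it
        feasible = [min(VARIANTS, key=lambda v: v[0])]
    gflops, ctor = max(feasible, key=lambda v: v[0])
    return gflops, ctor()

# A runtime would re-run the selection whenever available resources change,
# e.g., when other tenants contend for the same accelerator.
gflops, model = select_variant(budget_gflops=4.0)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(f"ran the {gflops}-GFLOP variant; output shape {tuple(logits.shape)}")
```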
- K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in International Conference on Computer Vision (ICCV), 2017.
- R. Girshick, “Fast R-CNN,” in International Conference on Computer Vision (ICCV), 2015.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision (ECCV), 2016.
- S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
- M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning (ICML), 2019.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision (ECCV), 2020.
- S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “DAB-DETR: Dynamic anchor boxes are better queries for DETR,” in International Conference on Learning Representations (ICLR), 2022.
- Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor DETR: Query design for transformer-based detector,” in AAAI Conference on Artificial Intelligence (AAAI), 2022.
- D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional DETR for fast training convergence,” in International Conference on Computer Vision (ICCV), 2021.
- E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in Advances in Neural Information Processing Systems (NeurIPS), 2021.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in International Conference on Computer Vision (ICCV), 2021.
- OpenAI, “GPT-4V(ision) system card,” 2023. [Online]. Available: https://openai.com/research/gpt-4v-system-card
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
- H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning (ICML), 2021.
- A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in International Conference on Computer Vision (ICCV), 2023.
- A. Hatamizadeh, H. Yin, G. Heinrich, J. Kautz, and P. Molchanov, “Global context vision transformers,” in International Conference on Machine Learning (ICML), 2023.
- C.-F. R. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multi-scale vision transformer for image classification,” in International Conference on Computer Vision (ICCV), 2021.
- X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations (ICLR), 2021.
- X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic DETR: End-to-end object detection with dynamic attention,” in International Conference on Computer Vision (ICCV), 2021.
- M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, and H. Dong, “End-to-end object detection with adaptive clustering transformer,” in British Machine Vision Conference (BMVC), 2021.
- Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient DETR: Improving end-to-end object detector with dense prior,” arXiv preprint arXiv:2104.01318, 2021.
- B. Roh, J. Shin, W. Shin, and S. Kim, “Sparse DETR: Efficient end-to-end object detection with learnable sparsity,” in International Conference on Learning Representations (ICLR), 2022.
- R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in International Conference on Computer Vision (ICCV), 2021.
- H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “DINO: DETR with improved denoising anchor boxes for end-to-end object detection,” in International Conference on Learning Representations (ICLR), 2023.
- Z. Zong, G. Song, and Y. Liu, “DETRs with collaborative hybrid assignments training,” in International Conference on Computer Vision (ICCV), 2023.
- X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in European Conference on Computer Vision (ECCV), 2022.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), 2017.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, 2019.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” Advances in Neural Information Processing Systems (NeurIPS), 2020.
- OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research (JMLR), 2020.
- S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM Computing Surveys (CSUR), 2022.
- D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, 2020.
- M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, M. Jagersand, and H. Zhang, “A comparative study of real-time semantic segmentation for autonomous driving,” in Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
- T. Zhou, F. Porikli, D. J. Crandall, L. Van Gool, and W. Wang, “A survey on deep learning technique for video segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- J. Sun, W. Yi, P. K. Varshney, and L. Kong, “Resource scheduling for multi-target tracking in multi-radar systems with imperfect detection,” IEEE Transactions on Signal Processing, 2022.
- S. Kim, J. Zhao, K. Asanovic, B. Nikolic, and Y. S. Shao, “AuRORA: Virtualized accelerator orchestration for multi-tenant workloads,” in International Symposium on Microarchitecture (MICRO), 2023.
- J. Clemons, I. Frosio, M. Shen, J. M. Alvarez, and S. W. Keckler, “Augmenting legacy networks for flexible inference,” in IEEE Intelligent Vehicles Symposium (IV). IEEE, 2023.
- Y. Wang, J. Shen, T.-K. Hu, P. Xu, T. Nguyen, R. Baraniuk, Z. Wang, and Y. Lin, “Dual dynamic inference: Enabling more efficient, adaptive, and controllable deep inference,” IEEE Journal of Selected Topics in Signal Processing, 2020.
- J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “DeeBERT: Dynamic early exiting for accelerating BERT inference,” Association for Computational Linguistics (ACL), 2020.
- Y. Kaya, S. Hong, and T. Dumitras, “Shallow-deep networks: Understanding and mitigating network overthinking,” in International Conference on Machine Learning (ICML), 2019.
- T.-K. Hu, T. Chen, H. Wang, and Z. Wang, “Triple wins: Boosting accuracy, robustness and efficiency together by enabling input-adaptive inference,” International Conference on Learning Representations (ICLR), 2020.
- A. Graves, “Adaptive computation time for recurrent neural networks,” arXiv preprint arXiv:1603.08983, 2016.
- W. Liu, P. Zhou, Z. Zhao, Z. Wang, H. Deng, and Q. Ju, “FastBERT: A self-distilling BERT with adaptive inference time,” Association for Computational Linguistics (ACL), 2020.
- M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov, “Spatially adaptive computation time for residual networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, “Dynamic neural networks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, “Universal transformers,” International Conference on Learning Representations (ICLR), 2019.
- S. Teerapittayanon, B. McDanel, and H.-T. Kung, “BranchyNet: Fast inference via early exiting from deep neural networks,” in International Conference on Pattern Recognition (ICPR), 2016.
- W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, “BERT loses patience: Fast and robust inference with early exit,” Advances in Neural Information Processing Systems (NeurIPS), 2020.
- T. Tambe, C. Hooper, L. Pentecost, T. Jia, E.-Y. Yang, M. Donato, V. Sanh, P. Whatmough, A. M. Rush, D. Brooks, and G.-Y. Wei, “EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference,” in International Symposium on Microarchitecture (MICRO), 2021.
- X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, “SkipNet: Learning dynamic routing in convolutional networks,” in European Conference on Computer Vision (ECCV), 2018.
- S. Jastrzębski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio, “Residual connections encourage iterative inference,” International Conference on Learning Representations (ICLR), 2018.
- P. Michel, O. Levy, and G. Neubig, “Are sixteen heads really better than one?” Advances in Neural Information Processing Systems (NeurIPS), 2019.
- E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned,” Association for Computational Linguistics (ACL), 2019.
- H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once-for-all: Train one network and specialize it for efficient deployment,” International Conference on Learning Representations (ICLR), 2020.
- T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in European Conference on Computer Vision (ECCV), 2018.
- J. Choquette, “NVIDIA Hopper H100 GPU: Scaling performance,” IEEE Micro, 2023.
- “NVIDIA TensorRT,” https://developer.nvidia.com/tensorrt, 2023.
- S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
- R. Venkatesan, Y. S. Shao, M. Wang, J. Clemons, S. Dai, M. Fojtik, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, Y. Zhang, B. Zimmer, W. J. Dally, J. Emer, S. W. Keckler, and B. Khailany, “MAGNet: A modular accelerator generator for neural networks,” in International Conference on Computer-Aided Design (ICCAD). IEEE, 2019.
- B. Keller, R. Venkatesan, S. Dai, S. G. Tell, B. Zimmer, W. J. Dally, C. T. Gray, and B. Khailany, “A 17–95.6 TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization for transformers in 5nm,” in IEEE Symposium on VLSI Technology and Circuits (VLSI), 2022.
- H. Yu, Z. Yang, L. Tan, Y. Wang, W. Sun, M. Sun, and Y. Tang, “Methods and datasets on semantic segmentation: A review,” Neurocomputing, 2018.
- Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, 2023.
- M. Alam, M. D. Samad, L. Vidyaratne, A. Glandon, and K. M. Iftekharuddin, “Survey on deep neural networks in speech and vision systems,” Neurocomputing, 2020.
- B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014.
- W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in International Conference on Computer Vision (ICCV), 2021.
- E. Xie, W. Wang, W. Wang, P. Sun, H. Xu, D. Liang, and P. Luo, “Segmenting transparent objects in the wild with transformer,” International Joint Conference on Artificial Intelligence (IJCAI), 2021.
- Y. Li, Y. Hu, F. Wu, and K. Li, “DiVIT: Algorithm and architecture co-design of differential attention in vision transformer,” Journal of Systems Architecture, 2022.
- X. Wang, L. L. Zhang, Y. Wang, and M. Yang, “Towards efficient vision transformer inference: a first study of transformers on mobile devices,” in International Workshop on Mobile Computing Systems and Applications (HotMobile), 2022.
- H. You, Z. Sun, H. Shi, Z. Yu, Y. Zhao, Y. Zhang, C. Li, B. Li, and Y. Lin, “ViTCoD: Vision transformer acceleration via dedicated algorithm and accelerator co-design,” International Symposium on High-Performance Computer Architecture (HPCA), 2023.
- “detrex: A research platform for transformer-based object detection algorithms,” https://github.com/IDEA-Research/detrex, 2022.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- K. Zeng, Q. Ma, J. W. Wu, Z. Chen, T. Shen, and C. Yan, “FPGA-based accelerator for object detection: A comprehensive survey,” The Journal of Supercomputing, 2022.
- X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, C. Kozyrakis, and M. Horowitz, “Interstellar: Using Halide’s scheduling language to analyze DNN accelerators,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020.
- H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han, “HAT: Hardware-aware transformers for efficient natural language processing,” Association for Computational Linguistics (ACL), 2020.
- T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, “NetAdapt: Platform-aware neural network adaptation for mobile applications,” in European Conference on Computer Vision (ECCV), 2018.
- V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” Workshop on Energy Efficient Machine Learning and Cognitive Computing co-located with Advances in Neural Information Processing Systems (NeurIPS), 2019.
- S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for BERT model compression,” Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
- H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, and I. King, “BinaryBERT: Pushing the limit of BERT quantization,” in Association for Computational Linguistics (ACL), 2021.
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems (NeurIPS), 2021.
- O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, “Q8BERT: Quantized 8bit BERT,” in Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) co-located with Advances in Neural Information Processing Systems (NeurIPS), 2019.