DyCE: Dynamically Configurable Exiting for Deep Learning Compression and Real-time Scaling (2403.01695v3)
Abstract: Conventional deep learning (DL) model compression and scaling methods focus on altering the model's components, impacting the results across all samples uniformly. However, since samples vary in difficulty, a dynamic model that adapts computation based on sample complexity offers a novel perspective for compression and scaling. Despite this potential, existing dynamic models are typically monolithic and model-specific, limiting their generalizability as broad compression and scaling methods. Additionally, most deployed DL systems are fixed, unable to adjust their scale after deployment and therefore unable to adapt to varying real-time demands. This paper introduces DyCE, a dynamically configurable system that can adjust the performance-complexity trade-off of a DL model at runtime without requiring re-initialization or redeployment on inference hardware. DyCE achieves this by attaching small exit networks to intermediate layers of the original model, allowing computation to terminate early once an acceptable result is obtained. DyCE also decouples the design of the exit system from the base model, facilitating easy adaptation to new base models and potential general use in compression and scaling. We also propose methods for generating optimized configurations and determining the types and positions of exit networks to achieve desired performance and complexity trade-offs. By enabling simple configuration switching, DyCE provides fine-grained performance tuning in real-time. We demonstrate the effectiveness of DyCE through image classification tasks using deep convolutional neural networks (CNNs). DyCE significantly reduces computational complexity by 23.5% for ResNet152 and 25.9% for ConvNextv2-tiny on ImageNet, with accuracy reductions of less than 0.5%.
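The early-exit control flow described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the stages, exit heads, and confidence values are all hypothetical stand-ins, and a "configuration" is modeled simply as a list of per-exit confidence thresholds that can be swapped at runtime without touching the model.

```python
# Hypothetical sketch of DyCE-style early exiting. A configuration is a
# per-exit threshold list; switching it changes the performance-complexity
# trade-off at runtime without re-initializing or redeploying the model.

def run_with_exits(x, stages, exit_heads, thresholds):
    """Run `x` through `stages`; after each stage, consult the matching
    exit head and stop early once its confidence clears the threshold.
    Returns (prediction, number_of_stages_executed)."""
    for i, stage in enumerate(stages):
        x = stage(x)
        confidence, prediction = exit_heads[i](x)
        if confidence >= thresholds[i]:
            return prediction, i + 1
    # Fall through: the last exit's result is returned unconditionally.
    return prediction, len(stages)

# Toy "model": each stage refines a score; each exit head reports a
# confidence that grows with the score's magnitude (values illustrative).
stages = [lambda v: v * 2, lambda v: v + 1, lambda v: v * 3]
exit_heads = [lambda v: (abs(v) / 10.0, "cat" if v > 0 else "dog")] * 3

# An aggressive configuration exits early; a conservative one rarely does.
aggressive = [0.3, 0.3, 0.0]
conservative = [0.9, 0.9, 0.0]

pred_a, cost_a = run_with_exits(2.0, stages, exit_heads, aggressive)
pred_c, cost_c = run_with_exits(2.0, stages, exit_heads, conservative)
```

With the aggressive thresholds the sample exits after one stage, while the conservative thresholds force it through all three; swapping between such configurations is the runtime tuning knob the abstract refers to.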