Real-Time Execution of Large-scale Language Models on Mobile (2009.06823v2)
Abstract: Pre-trained large-scale language models have increasingly demonstrated high accuracy on many NLP tasks. However, limited weight storage and computational speed on hardware platforms have impeded the adoption of pre-trained models, especially in the era of edge computing. In this paper, we seek the best model structure of BERT for a given computation size to match specific devices. We propose the first compiler-aware neural architecture optimization framework. Our framework guarantees that the identified model meets both the resource and real-time specifications of mobile devices, thus achieving real-time execution of large transformer-based models such as BERT variants. We evaluate our model on several NLP tasks, achieving competitive results on well-known benchmarks with lower latency on mobile devices. Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base. Our overall framework achieves up to 7.8x speedup compared with TensorFlow-Lite with only minor accuracy loss.
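The core idea of the abstract, selecting a BERT structure that maximizes quality subject to a device latency budget, can be sketched as a simple constrained search. Everything below is a hypothetical illustration: the linear latency cost model and the accuracy proxy are invented placeholders standing in for the paper's compiler-informed profiling and trained accuracy predictors, not the authors' actual method.

```python
# Hedged sketch of latency-constrained architecture selection.
# The cost model and accuracy proxy are hypothetical placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    num_layers: int
    hidden_size: int


def estimated_latency_ms(cfg: BertConfig) -> float:
    # Placeholder linear cost model; a real framework would profile
    # compiler-optimized kernels on the target mobile device instead.
    return 0.004 * cfg.num_layers * cfg.hidden_size


def accuracy_proxy(cfg: BertConfig) -> float:
    # Placeholder score: larger models score higher, with diminishing returns.
    return 1.0 - 1.0 / (cfg.num_layers * cfg.hidden_size / 768.0)


def select_architecture(candidates, latency_budget_ms):
    """Return the best-scoring config that fits the latency budget."""
    feasible = [c for c in candidates
                if estimated_latency_ms(c) <= latency_budget_ms]
    if not feasible:
        raise ValueError("no candidate meets the latency budget")
    return max(feasible, key=accuracy_proxy)


# Enumerate a small hypothetical search space of BERT variants.
candidates = [BertConfig(l, h) for l in (4, 6, 12) for h in (384, 512, 768)]
best = select_architecture(candidates, latency_budget_ms=20.0)
```

In practice the feasibility check would come from on-device measurements of the compiler-optimized model, which is what makes the search "compiler-aware": candidates are judged by their real post-optimization latency, not a proxy FLOP count.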
- Wei Niu (68 papers)
- Zhenglun Kong (33 papers)
- Geng Yuan (58 papers)
- Weiwen Jiang (62 papers)
- Jiexiong Guan (8 papers)
- Caiwen Ding (98 papers)
- Pu Zhao (82 papers)
- Sijia Liu (204 papers)
- Bin Ren (136 papers)
- Yanzhi Wang (197 papers)