Quantized Transformer Language Model Implementations on Edge Devices (2310.03971v1)

Published 6 Oct 2023 in cs.CL and cs.AR

Abstract: Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for NLP applications, wherein models with millions of parameters are first pre-trained on a large corpus and then fine-tuned for a downstream NLP task. One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency. To overcome these limitations, such models can be converted to an optimized FlatBuffer format tailored for deployment on resource-constrained edge devices. Herein, we evaluate the performance of such FlatBuffer-transformed MobileBERT models, fine-tuned for reputation analysis of English-language tweets in the RepLab 2013 dataset, on three different edge devices. In addition, this study evaluates the deployed models' latency, performance, and resource efficiency. Our experimental results show that, compared to the original BERT-large model, the converted and quantized MobileBERT models have 160$\times$ smaller footprints at a 4.1% drop in accuracy while analyzing at least one tweet per second on edge devices. Furthermore, our study highlights the privacy-preserving aspect of TinyML systems, as all data is processed locally within a serverless environment.

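The conversion workflow the abstract describes (exporting a fine-tuned MobileBERT model to a quantized TensorFlow Lite FlatBuffer) can be sketched roughly as below. This is not code from the paper: the SavedModel directory, output filename, and the choice of post-training dynamic-range quantization are illustrative assumptions.

```python
# Minimal sketch (assumptions noted, not the authors' code): convert a
# fine-tuned MobileBERT classifier to a quantized TensorFlow Lite FlatBuffer
# for deployment on resource-constrained edge devices.
import tensorflow as tf

# Assumed path to a fine-tuned MobileBERT SavedModel (hypothetical name).
saved_model_dir = "mobilebert_replab2013_savedmodel"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Post-training dynamic-range quantization: weights are stored as 8-bit
# integers, shrinking the serialized FlatBuffer footprint.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("mobilebert_replab2013_quant.tflite", "wb") as f:
    f.write(tflite_model)

# On-device inference runs entirely locally through the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_path="mobilebert_replab2013_quant.tflite")
interpreter.allocate_tensors()
```

Full integer quantization with a representative dataset would reduce the footprint further and may be closer to what the paper deploys; dynamic-range quantization is used here only to keep the sketch self-contained.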
Authors (7)
  1. Mohammad Wali Ur Rahman (4 papers)
  2. Murad Mehrab Abrar (6 papers)
  3. Hunter Gibbons Copening (1 paper)
  4. Salim Hariri (13 papers)
  5. Sicong Shao (7 papers)
  6. Pratik Satam (12 papers)
  7. Soheil Salehi (15 papers)
Citations (6)