MeRino: Entropy-driven Design for Generative Language Models on IoT Devices (2403.07921v2)

Published 28 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Generative large language models (LLMs) stand as a revolutionary advancement in the modern era of AI. However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative LLMs. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9x faster on the NVIDIA Jetson Nano with a 5.5x reduction in model size.
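
To make the general idea concrete, the sketch below shows one way an entropy-driven, budget-constrained design loop could be set up and solved by simple enumeration on a CPU. Everything here is an illustrative assumption rather than MeRino's actual formulation: the search space, the parameter estimate (estimate_params), the entropy proxy (entropy_proxy), and the budget constants (VOCAB, PARAM_BUDGET) are placeholders that only demonstrate the pattern of maximizing an entropy-style expressiveness score subject to a resource constraint.

```python
# Illustrative sketch only: a toy entropy-guided architecture search under a
# parameter budget. The entropy proxy, parameter estimate, and search space
# are assumptions for demonstration, not the MP formulation used by MeRino.
import itertools
import math

VOCAB = 32000          # assumed vocabulary size
PARAM_BUDGET = 60e6    # assumed budget for a "mobile-friendly" model

def estimate_params(depth: int, width: int, ffn_ratio: int) -> float:
    """Rough decoder-only transformer parameter count (embeddings + blocks)."""
    embed = VOCAB * width
    attn = 4 * width * width               # q, k, v, and output projections
    ffn = 2 * width * (ffn_ratio * width)  # up- and down-projections
    return embed + depth * (attn + ffn)

def entropy_proxy(depth: int, width: int, ffn_ratio: int) -> float:
    """Toy 'expressiveness' score: deeper and wider stacks score higher, with a
    diminishing-returns log on width (an assumption, not the paper's metric)."""
    return depth * (math.log(width) + math.log(ffn_ratio * width))

def search():
    """Enumerate the (small) design space and keep the best feasible config."""
    space = itertools.product(range(6, 25, 2),       # depth
                              (256, 384, 512, 768),  # hidden width
                              (2, 3, 4))             # FFN expansion ratio
    best, best_score = None, -math.inf
    for depth, width, ratio in space:
        if estimate_params(depth, width, ratio) > PARAM_BUDGET:
            continue  # violates the resource constraint
        score = entropy_proxy(depth, width, ratio)
        if score > best_score:
            best, best_score = (depth, width, ratio), score
    return best, best_score

if __name__ == "__main__":
    (depth, width, ratio), score = search()
    print(f"selected depth={depth}, width={width}, ffn_ratio={ratio}, "
          f"~{estimate_params(depth, width, ratio) / 1e6:.1f}M params, "
          f"score={score:.1f}")
```

Because the feasible space of depth/width/ratio combinations is tiny, exhaustive enumeration stands in here for the paper's mathematical-programming solve; the point is only that the whole design step is a cheap CPU-side optimization rather than a training run.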

Authors (5)
  1. Youpeng Zhao (16 papers)
  2. Ming Lin (65 papers)
  3. Huadong Tang (3 papers)
  4. Qiang Wu (154 papers)
  5. Jun Wang (992 papers)
