MeRino: Entropy-driven Design for Generative Language Models on IoT Devices (2403.07921v2)
Abstract: Generative large language models (LLMs) stand as a revolutionary advancement in the modern era of AI. However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative LLMs. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models in the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9x faster on an NVIDIA Jetson Nano with a 5.5x reduction in model size.
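The abstract does not spell out the optimization itself, but the idea of posing architecture design as a small mathematical program that a CPU can solve in minutes can be illustrated with a toy sketch. The entropy proxy, the parameter estimate, the budget, and the search ranges below are all illustrative assumptions, not MeRino's actual objective or constraints.

```python
# Toy sketch (assumed, not the paper's formulation): pick a decoder-only
# transformer's depth and width by maximizing an entropy-style proxy
# subject to a parameter budget, via cheap CPU-side enumeration.
from itertools import product
import math

PARAM_BUDGET = 70e6  # assumed mobile-scale parameter budget

def approx_params(depth: int, width: int, vocab: int = 50257) -> float:
    """Rough decoder-only transformer size: embeddings plus per-block weights."""
    per_block = 12 * width * width  # approximate attention + MLP weights
    return vocab * width + depth * per_block

def entropy_proxy(depth: int, width: int) -> float:
    """Illustrative information-entropy proxy that grows with depth and width."""
    return depth * math.log(width)

best = None
for depth, width in product(range(4, 25, 2), range(256, 1025, 64)):
    if approx_params(depth, width) > PARAM_BUDGET:
        continue  # violates the resource constraint
    score = entropy_proxy(depth, width)
    if best is None or score > best[0]:
        best = (score, depth, width)

score, depth, width = best
print(f"selected depth={depth}, width={width}, entropy proxy={score:.2f}")
```

Because the search space is a small grid and the objective is a closed-form score rather than trained-model accuracy, the whole selection runs in well under a second on a laptop CPU, which is the spirit of the "nearly zero-cost" design claim.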
Authors: Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang