MELTing point: Mobile Evaluation of Language Transformers
Abstract: Transformers have revolutionized the machine learning landscape, gradually making their way into everyday tasks and equipping our computers with "sparks of intelligence". However, their runtime requirements have prevented them from being broadly deployed on mobile. As personal devices become increasingly powerful and prompt privacy becomes an ever more pressing issue, we explore the current state of mobile execution of LLMs. To achieve this, we have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device across different models, devices and frameworks, including Android, iOS and Nvidia Jetson devices. We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance, tracing their memory and energy requirements along the way. Our analysis is the first systematic study of on-device LLM execution: it quantifies performance, energy efficiency and accuracy across various state-of-the-art models and showcases the state of on-device intelligence in the era of hyperscale models. Results highlight the performance heterogeneity across targets and corroborate that LLM inference is largely memory-bound. Quantization drastically reduces memory requirements and renders execution viable, but at a non-negligible accuracy cost. Given their energy footprint and thermal behavior, continuous execution of LLMs remains elusive, as both factors negatively affect user experience. Last, our experience shows that the ecosystem is still in its infancy, and algorithmic as well as hardware breakthroughs can significantly shift the execution cost. We expect NPU acceleration and framework-hardware co-design to be the biggest bet towards efficient standalone execution, with the alternative of offloading tailored towards edge deployments.