MELTing point: Mobile Evaluation of Language Transformers (2403.12844v4)

Published 19 Mar 2024 in cs.LG

Abstract: Transformers have revolutionized the machine learning landscape, gradually making their way into everyday tasks and equipping our computers with "sparks of intelligence". However, their runtime requirements have prevented them from being broadly deployed on mobile. As personal devices become increasingly powerful and prompt privacy becomes an ever more pressing issue, we explore the current state of mobile execution of LLMs. To achieve this, we have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device, supporting different models, devices and frameworks, including Android, iOS and Nvidia Jetson devices. We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance, tracing their memory and energy requirements along the way. Our analysis is the first systematic study of on-device LLM execution, quantifying performance, energy efficiency and accuracy across various state-of-the-art models, and it showcases the state of on-device intelligence in the era of hyperscale models. Results highlight the performance heterogeneity across targets and corroborate that LLM inference is largely memory-bound. Quantization drastically reduces memory requirements and renders execution viable, but at a non-negligible accuracy cost. Drawing from its energy footprint and thermal behavior, we find that the continuous execution of LLMs remains elusive, as both factors negatively affect user experience. Last, our experience shows that the ecosystem is still in its infancy, and algorithmic as well as hardware breakthroughs can significantly shift the execution cost. We expect NPU acceleration and framework-hardware co-design to be the biggest bets towards efficient standalone execution, with the alternative of offloading tailored towards edge deployments.

Analyzing On-Device Execution of LLMs: A Technical Deep Dive

The research paper "MELTing Point: Mobile Evaluation of Language Transformers" presents a systematic investigation into the execution of LLMs on mobile devices. It examines the feasibility and performance of running LLMs at the consumer edge, with a primary focus on mobile systems. As edge devices gain computational capability, private, efficient, and localized execution of LLMs becomes increasingly plausible. This analysis outlines the paper's findings, which rest on a rigorous methodology and are substantiated by a purpose-built infrastructure termed MELT.

The authors begin by contextualizing the need for running LLMs on-device, emphasizing privacy, decentralization, and the democratization of machine intelligence. The paper is organized around core research questions: the feasibility of on-device deployment, inference performance across heterogeneous consumer devices, and the bottlenecks impeding such deployments. Additionally, it investigates the trade-offs incurred by quantization, a technique commonly employed to reduce model size and memory footprint, albeit at a potential accuracy cost.

Methodology and Infrastructure

MELT, the bespoke infrastructure introduced in the paper, serves as the cornerstone of this research. It provides an integrated pipeline for downloading, converting, deploying, and benchmarking LLMs across a range of targets, including Android, iOS, and Nvidia Jetson platforms, using various execution frameworks. The authors constructed a device farm encompassing high-end and mid-tier devices, accompanied by an elaborate energy-monitoring setup. This infrastructure enabled them to systematically trace performance, energy consumption, and thermal behavior across devices; a simplified sketch of such a headless benchmarking loop appears below.
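
To give a concrete flavor of this kind of automation (MELT's actual tooling is not reproduced here), the following Python sketch shows a minimal headless Android benchmarking loop. The on-device binary name `llm_bench`, its flags, the model file names, and the output format are hypothetical placeholders; only standard `adb` commands are assumed.

```python
# Minimal sketch of a headless Android benchmarking loop in the spirit of MELT.
# The binary "llm_bench", its flags, and the model names are hypothetical;
# MELT's actual tooling differs. Only standard adb commands are assumed.
import json
import subprocess

DEVICE_SERIAL = "emulator-5554"                  # hypothetical device id
MODELS = ["llama-2-7b-q4", "mistral-7b-q4"]      # hypothetical converted models

def adb(*args: str) -> str:
    """Run an adb command against the target device and return its stdout."""
    result = subprocess.run(["adb", "-s", DEVICE_SERIAL, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

def run_benchmark(model: str, prompt_file: str) -> dict:
    # Push the prompt set, then invoke the (hypothetical) on-device benchmark binary.
    adb("push", prompt_file, "/data/local/tmp/prompts.json")
    raw = adb("shell", "/data/local/tmp/llm_bench",
              "--model", f"/data/local/tmp/{model}.bin",
              "--prompts", "/data/local/tmp/prompts.json",
              "--json")
    return json.loads(raw)   # e.g. per-prompt prefill/decode timings

if __name__ == "__main__":
    for model in MODELS:
        metrics = run_benchmark(model, "prompts.json")
        print(model, metrics)
```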

The paper employs a wide array of LLMs, sourced and configured to support different quantization schemes and frameworks, namely MLC-LLM and llama.cpp. The evaluation primarily focuses on conversational agents, leveraging a dataset of multi-turn prompts. Through MELT, the paper automates interaction, monitors power consumption, and records inference performance metrics, offering a granular view of every aspect of on-device LLM execution; the per-prompt summaries such tracing yields resemble the sketch below.
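
As an illustration of how per-prompt metrics can be derived from such traces, the sketch below computes prefill throughput, decode throughput, and energy per token; the field names and the simple rectangle-rule energy integration are our own assumptions, not MELT's code.

```python
# Illustrative per-prompt metric summary; field names and the rectangle-rule
# energy integration are assumptions for the sketch, not MELT's implementation.
from dataclasses import dataclass

@dataclass
class PromptTrace:
    prompt_tokens: int            # tokens consumed during prefill
    generated_tokens: int         # tokens produced during decode
    prefill_s: float              # wall-clock prefill time (seconds)
    decode_s: float               # wall-clock decode time (seconds)
    power_samples_w: list[float]  # instantaneous power draw (watts)
    sample_period_s: float        # power sampling period (seconds)

def summarize(trace: PromptTrace) -> dict:
    # Rectangle-rule integration of the power trace gives total energy in joules.
    energy_j = sum(trace.power_samples_w) * trace.sample_period_s
    total_tokens = trace.prompt_tokens + trace.generated_tokens
    return {
        "prefill_tok_per_s": trace.prompt_tokens / trace.prefill_s,
        "decode_tok_per_s": trace.generated_tokens / trace.decode_s,
        "energy_per_token_j": energy_j / total_tokens,
    }

# Example: 64-token prompt, 128 generated tokens, ~5 W drawn over 12.8 s of sampling.
print(summarize(PromptTrace(64, 128, 0.8, 12.0, [5.0] * 1280, 0.01)))
```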

Key Findings

  1. Performance and Throughput: The paper highlights significant heterogeneity in LLM performance across devices, contingent primarily on model size, framework, and device tier. Notably, prefill throughput is consistently higher than generation throughput, attributable to the compute-bound nature of prefill versus the memory-bound nature of autoregressive decoding.
  2. Energy Efficiency: The research outlines the pronounced energy demands of LLM inference; although quantization reduces memory requirements (at some accuracy cost), the high power draw during inference still poses challenges for sustained on-device execution and significantly impacts user experience.
  3. Quantization Impacts: A notable insight is the precision-versus-accuracy trade-off inherent in quantization. While quantization renders LLMs deployable on resource-constrained devices, it often causes noticeable accuracy degradation, particularly in models quantized below 4-bit precision.
  4. Memory and Computational Bottlenecks: The paper confirms that LLM inference remains predominantly memory-bound; memory bandwidth becomes the critical bottleneck during the decode operation of the generation phase (a back-of-envelope sketch follows this list).
  5. Quality of Experience: The paper emphasizes that on-device deployment of LLMs can adversely affect user experience, noting device responsiveness issues during model loading and execution.
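
To make the memory-bound argument of findings 3 and 4 concrete, the following back-of-envelope sketch (our illustration, not code from the paper; the 7B model size, bit-widths, and 50 GB/s bandwidth are placeholder assumptions) estimates a quantized model's weight footprint and the resulting ceiling on decode throughput, given that each generated token must stream roughly the full weight set from memory.

```python
# Back-of-envelope estimate of quantized-model memory footprint and the
# bandwidth-limited ceiling on decode throughput. All figures are illustrative
# assumptions; real throughput is further reduced by KV-cache traffic,
# activation movement, and thermal throttling.
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, ignoring KV cache and activations."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def decode_ceiling_tok_per_s(footprint_gb: float, mem_bandwidth_gb_s: float) -> float:
    """Upper bound: each decoded token reads (roughly) every weight once."""
    return mem_bandwidth_gb_s / footprint_gb

fp16 = weight_footprint_gb(7, 16)   # ~14 GB: impractical on most phones
q4 = weight_footprint_gb(7, 4)      # ~3.5 GB: fits in high-end phone RAM
print(f"7B fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB")
print(f"decode ceiling at 50 GB/s: {decode_ceiling_tok_per_s(q4, 50):.0f} tok/s")
```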

Future Implications and Research Directions

In view of these findings, the paper speculates on potential shifts in AI and edge computing. It anticipates that future advances will come from algorithmic innovations and hardware evolution, such as more capable neural processing units (NPUs) and hardware-software co-design, aimed at optimizing these memory-intensive workloads. The sustainability concerns discussed also point to the need for cloud-edge hybrid deployments that balance resource efficiency between device and cloud.

Finally, the paper posits the broadened utility of extended on-device capabilities, opening avenues for personalized, multimodal, and context-aware intelligent assistants. This could change how users interact with digital systems, shifting from conventional multi-step processes to natural language-driven workflows backed by robust on-device AI.

The paper contributes valuable benchmarks and insights to the field of mobile AI, setting a foundational platform for the continued exploration and adaptation of LLMs at the edge. As hardware progresses and novel algorithmic techniques emerge, this research paves the path toward efficient, privacy-centric, on-device AI solutions.

Authors (4)
  1. Stefanos Laskaridis (20 papers)
  2. Lorenzo Minto (2 papers)
  3. Hamed Haddadi (131 papers)
  4. Kleomenis Katevas (20 papers)
Citations (10)