MobileQuant: Mobile-friendly Quantization for On-device Language Models (2408.13933v2)

Published 25 Aug 2024 in cs.CL

Abstract: LLMs have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.

MobileQuant: Mobile-friendly Quantization for On-device LLMs

The paper "MobileQuant: Mobile-friendly Quantization for On-device LLMs" addresses the challenge of deploying LLMs on resource-constrained edge devices, such as mobile phones. These devices have limited memory, compute, and energy resources, making traditional deployment of LLMs impractical. MobileQuant aims to mitigate these constraints through an optimized quantization strategy that supports integer-only operations, fully leveraging mobile hardware like Neural Processing Units (NPUs).

Background and Challenges

Quantizing LLMs involves reducing the bit-width of weights and activations to limit memory usage, computational overhead, and energy consumption. Existing methods typically fall into two groups: weight-only quantization (e.g., GPTQ, AWQ) and weight-activation quantization (e.g., SmoothQuant, OmniQuant). Weight-only quantization reduces storage, but activations remain in floating point, so inference still incurs significant energy and latency costs. Weight-activation quantization reduces computational costs by employing fixed-point operators, but often suffers from accuracy degradation, particularly when activations are quantized to 8 bits or fewer.
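To make the distinction concrete, the following is a minimal NumPy sketch of the symmetric quantization step that both families rely on; the shapes, bitwidths, and rounding scheme are illustrative assumptions, not any particular method's implementation.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Symmetric quantization to `num_bits` integers, returned in
    dequantized (simulated) form; a real deployment would keep the
    integer tensor and its scale separately."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096))            # weight matrix (shape assumed)
X = rng.normal(size=(16, 4096)) * 5.0        # activations, wider dynamic range

# Weight-only quantization (e.g. 4-bit weights): storage shrinks, but the
# matmul still runs in floating point because X stays unquantized.
Y_weight_only = X @ fake_quantize(W, num_bits=4).T

# Weight-activation (W8A8) quantization: both operands are quantized, which
# on real hardware lets the matmul run on integer units such as an NPU
# (simulated here by multiplying the dequantized tensors).
Y_w8a8 = fake_quantize(X, num_bits=8) @ fake_quantize(W, num_bits=8).T
```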

The primary innovation of this paper is a method for near-lossless 8-bit activation quantization, which lets LLMs exploit efficient fixed-point operations on mobile hardware. Achieving this requires addressing two key limitations of current methods: weight equivalent transformations cannot be propagated across certain non-linear operators, and activations with large dynamic ranges are difficult to quantize accurately on-device.
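The weight equivalent transformation referenced here comes from prior work such as SmoothQuant and AWQ: a per-channel scale is divided out of the activations and folded into the adjacent weight matrix, leaving the layer output mathematically unchanged while shifting quantization difficulty from activations to weights. The sketch below illustrates that idea under an assumed SmoothQuant-style scale heuristic; MobileQuant's contribution is to extend where such transformations apply and to learn the scales jointly with activation ranges rather than fix them by a closed-form rule.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 512)) * np.linspace(0.1, 20.0, 512)  # channel-wise outliers
W = rng.normal(size=(512, 512))                               # (in_features, out_features)

# Equivalent transformation: Y = X @ W = (X / s) @ (s[:, None] * W).
# A SmoothQuant-style heuristic (alpha assumed) balances the per-channel
# ranges of activations and weights.
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
X_t, W_t = X / s, W * s[:, None]

assert np.allclose(X @ W, X_t @ W_t)        # the layer output is unchanged
# Outlier channels are damped, so a per-tensor int8 scale wastes fewer levels:
print("per-tensor max |X| :", np.abs(X).max())
print("per-tensor max |X_t|:", np.abs(X_t).max())
```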

Contributions

The key contributions of MobileQuant comprise:

  1. Post-Training Quantization Method: A novel approach named MobileQuant, aimed at minimizing accuracy loss while optimizing for on-device deployment.
  2. Weight Transformation Extensions: Expanding existing weight equivalent transformations to support a broader range of layers and jointly optimizing weight transformations with activation ranges.
  3. On-Device Evaluation: A comprehensive integer-only, on-device evaluation demonstrating that MobileQuant significantly reduces latency and energy consumption.

Methodology

MobileQuant employs several design decisions to achieve its goals:

  • Fixed-Point Integer Arithmetic: The method supports int8-int8 operations where feasible, as these are widely optimized on mobile hardware. Where necessary, higher bit-width activations (e.g., int16) are maintained to avoid accuracy losses.
  • Per-Tensor/Channel Quantization: Quantization statistics are computed offline from a calibration set and kept static at inference time, enhancing compatibility with existing hardware (see the calibration sketch after this list).
  • Weight Equivalent Transformation: Transformation parameters are optimized to balance the distribution of weight and activation ranges, ensuring quantized representations are efficient.
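As a concrete picture of the static calibration mentioned above, the sketch below records fixed per-tensor ranges by running a small calibration set through the model once. The dictionary-of-activations interface and the simple min/max estimator are assumptions for illustration; MobileQuant instead learns the range parameters, as discussed next.

```python
import numpy as np

def calibrate_static_ranges(model_fn, calibration_batches, num_bits=8):
    """Run the calibration set once and record a fixed (static) per-tensor
    range for each activation, so scales and zero-points can be baked into
    the deployed integer graph instead of being recomputed per input."""
    running_min, running_max = {}, {}
    for batch in calibration_batches:
        # `model_fn` is assumed to return {activation name -> ndarray}.
        for name, act in model_fn(batch).items():
            running_min[name] = min(running_min.get(name, np.inf), float(act.min()))
            running_max[name] = max(running_max.get(name, -np.inf), float(act.max()))

    qmax = 2 ** num_bits - 1                  # asymmetric uint8 range
    ranges = {}
    for name in running_min:
        scale = max(running_max[name] - running_min[name], 1e-8) / qmax
        zero_point = int(round(-running_min[name] / scale))
        ranges[name] = (scale, zero_point)
    return ranges
```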

These methodological choices are integrated into an end-to-end optimization strategy, in contrast with block-wise methods such as OmniQuant, which optimize each transformer block in isolation. This holistic optimization is shown to improve both the quantization process and final model performance; a simplified sketch of what such joint optimization can look like follows.
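The sketch below gives a flavour of jointly optimizing the weight transformation and activation range parameters end-to-end for a single linear layer, trained to match its full-precision output on calibration data. All names, the straight-through estimator, and the hyperparameters are assumptions chosen for illustration; MobileQuant's actual objective, layer coverage, and schedule are described in the paper.

```python
import torch

def fake_quant(x, scale, num_bits=8):
    # Simulated symmetric quantization; round() is made differentiable with
    # a straight-through estimator so `scale` can receive gradients.
    qmax = 2 ** (num_bits - 1) - 1
    x_s = x / scale
    x_s = x_s + (torch.round(x_s) - x_s).detach()
    return torch.clamp(x_s, -qmax - 1, qmax) * scale

class QuantLinear(torch.nn.Module):
    def __init__(self, weight):
        super().__init__()
        self.weight = torch.nn.Parameter(weight, requires_grad=False)
        # Learnable per-channel equivalent-transformation scale (log domain).
        self.log_s = torch.nn.Parameter(torch.zeros(weight.shape[1]))
        # Learnable per-tensor activation scale (log domain).
        self.log_act_scale = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        s = self.log_s.exp()
        x_q = fake_quant(x / s, self.log_act_scale.exp())       # int8 activations
        w_t = self.weight * s                                    # fold scale into W
        w_q = fake_quant(w_t, w_t.abs().max() / 127)             # int8 weights
        return x_q @ w_q.t()

torch.manual_seed(0)
W = torch.randn(256, 256)
layer = QuantLinear(W)
opt = torch.optim.Adam([p for p in layer.parameters() if p.requires_grad], lr=1e-3)

# End-to-end: the transformation scales and the activation range are updated
# together to minimise the error against the full-precision layer output.
for _ in range(200):
    x = torch.randn(32, 256) * torch.linspace(0.1, 10.0, 256)   # calibration batch
    loss = torch.nn.functional.mse_loss(layer(x), x @ W.t())
    opt.zero_grad()
    loss.backward()
    opt.step()
```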

Experimental Results

The experimental section evaluates MobileQuant on several lightweight LLMs (e.g., TinyLlama, StableLM-2) across multiple benchmarks from the LM Evaluation Harness:

  • Model Accuracy: Compared to other state-of-the-art methods like SmoothQuant and OmniQuant, MobileQuant achieves superior accuracy. For instance, in the W8A8 configuration, it outperforms other methods significantly on common benchmarks such as WikiText, AI2 Reasoning Challenge, and MMLU.
  • On-Device Performance: MobileQuant reduces energy usage by up to 50% and latency by up to 40% during prompt encoding, demonstrating its suitability for real-time mobile applications.
  • Scalability: MobileQuant's end-to-end training method improves model performance with increased training data and epochs, supporting broader applicability across various LLM architectures.

Implications and Speculation on Future Research

The implications of this research are twofold:

  1. Practical Deployment: MobileQuant provides a feasible path for deploying complex LLMs on mobile devices, expanding the accessibility and utility of these powerful models in real-world, everyday applications.
  2. Theoretical Insights: The paper sheds light on how specific quantization techniques can be tailored for edge devices, opening avenues for further improvements in edge computing efficiency and real-time applications.

Future developments could focus on:

  • Larger Models: Extending MobileQuant to support even larger LLMs while maintaining efficiency on mobile hardware could unlock new capabilities in mobile AI.
  • Customized Hardware: As hardware evolves, future research may explore tighter co-designs of quantization methods and dedicated processing units, potentially improving both performance and efficiency further.

Conclusion

MobileQuant represents a significant advance in the practical deployment of LLMs on mobile devices by leveraging optimized quantization techniques. The results indicate substantial improvements in efficiency and accuracy, positioning MobileQuant as a foundational approach in the pursuit of more accessible, real-time LLMs. This research underscores the potential for tailored quantization strategies to overcome existing barriers in mobile AI deployments.

Authors (8)
  1. Fuwen Tan (10 papers)
  2. Royson Lee (19 papers)
  3. Łukasz Dudziak (41 papers)
  4. Shell Xu Hu (18 papers)
  5. Sourav Bhattacharya (75 papers)
  6. Timothy Hospedales (101 papers)
  7. Georgios Tzimiropoulos (86 papers)
  8. Brais Martinez (38 papers)