EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge (2402.10787v1)

Published 16 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Despite the remarkable strides of LLMs in various fields, their wide application on edge devices is limited by their massive parameter counts and computation. To address this, quantization is commonly adopted to generate lightweight LLMs with efficient computation and fast inference. However, Post-Training Quantization (PTQ) methods degrade dramatically in quality when weights, activations, and the KV cache are quantized together to below 8 bits. Moreover, many Quantization-Aware Training (QAT) works quantize only the model weights, leaving the activations untouched, which does not fully exploit the potential of quantization for inference acceleration on the edge. In this paper, we propose EdgeQAT, an Entropy- and Distribution-Guided QAT framework for optimizing lightweight LLMs to achieve inference acceleration on edge devices. We first identify that the performance drop from quantization primarily stems from information distortion in the quantized attention maps, as evidenced by the differing distributions of the quantized queries and keys in the self-attention mechanism. We then propose entropy- and distribution-guided QAT to mitigate this distortion. Moreover, we design a token-importance-aware adaptive method that dynamically quantizes tokens with different bit widths for further optimization and acceleration. Extensive experiments verify substantial improvements with our framework across various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x over FP16 counterparts across multiple edge devices.
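The abstract outlines two mechanisms: an entropy- and distribution-guided QAT objective that counteracts information distortion in quantized attention maps, and a token-importance-aware scheme that assigns different bit widths to different tokens. The sketch below is a minimal, illustrative approximation of these ideas under stated assumptions, not the authors' implementation: the straight-through fake quantizer, the entropy-gap penalty, the attention-based importance score, and the 4/8-bit split are all hypothetical choices for demonstration.

```python
# Illustrative sketch (PyTorch) of (1) fake-quantized attention trained with an
# entropy-style penalty and (2) a toy token-importance bit-width assignment.
# All function names and hyperparameters are hypothetical, not from the paper.
import torch
import torch.nn.functional as F


def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized value; backward passes gradients straight to x.
    return x + (x_q - x).detach()


def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the attention rows, a proxy for information content."""
    return -(attn * torch.log(attn + 1e-12)).sum(dim=-1).mean()


def quantized_attention_with_entropy_loss(q, k, v, num_bits=4):
    """Self-attention on fake-quantized Q/K plus an entropy-gap penalty (hypothetical form)."""
    d = q.shape[-1]
    attn_fp = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    q_q, k_q = fake_quantize(q, num_bits), fake_quantize(k, num_bits)
    attn_q = F.softmax(q_q @ k_q.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Penalize the change in attention-map entropy caused by quantizing Q and K.
    entropy_loss = (attention_entropy(attn_fp) - attention_entropy(attn_q)).abs()
    return attn_q @ fake_quantize(v, num_bits), entropy_loss


def token_bit_widths(attn_q: torch.Tensor, high=8, low=4, keep_ratio=0.5):
    """Toy token-importance rule: tokens that receive more attention get more bits."""
    importance = attn_q.mean(dim=-2)          # how strongly each token is attended to
    k = max(1, int(keep_ratio * importance.shape[-1]))
    topk = importance.topk(k, dim=-1).indices
    bits = torch.full_like(importance, low, dtype=torch.long)
    return bits.scatter(-1, topk, high)       # per-token bit-width assignment


if __name__ == "__main__":
    q = torch.randn(1, 16, 64, requires_grad=True)
    k, v = torch.randn(1, 16, 64), torch.randn(1, 16, 64)
    out, ent_loss = quantized_attention_with_entropy_loss(q, k, v)
    attn_demo = F.softmax(torch.randn(1, 16, 16), dim=-1)
    print(out.shape, ent_loss.item(), token_bit_widths(attn_demo))
```

In a full QAT loop, a penalty of this kind would presumably be added to the task loss, and the per-token bit widths would inform kernel selection at inference time; the paper's exact formulation should be taken from the full text rather than this sketch.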

Authors (14)
  1. Xuan Shen
  2. Zhenglun Kong
  3. Changdi Yang
  4. Zhaoyang Han
  5. Lei Lu
  6. Peiyan Dong
  7. Cheng Lyu
  8. Chih-hsiang Li
  9. Xuehang Guo
  10. Zhihao Shu
  11. Wei Niu
  12. Miriam Leeser
  13. Pu Zhao
  14. Yanzhi Wang