
Squat: Quant Small Language Models on the Edge

Published 16 Feb 2024 in cs.LG, cs.AI, and cs.CL | (2402.10787v2)

Abstract: A growing trend has emerged in designing high-quality Small Language Models (SLMs) with only a few million parameters, driven by increasing concerns over cloud cost, privacy, and latency. Since full-parameter training is feasible for SLMs on mobile devices, Quantization-Aware Training (QAT) can be employed to improve efficiency by reducing computational overhead and memory footprint. However, previous QAT work adopts fine-grained quantization methods designed to compress models with billions of parameters on GPUs; these methods are incompatible with current commodity hardware, such as mobile and edge devices, which relies on Single Instruction Multiple Data (SIMD) instructions, so they generalize poorly to SLMs on mobile devices. In this paper, we propose Squat, an effective QAT framework with deployable quantization for SLMs on mobile devices. Specifically, we propose entropy-guided and distribution-aligned distillation to mitigate the distortion of attention information caused by quantization. In addition, we employ sub-8-bit token-adaptive quantization, assigning different bit widths to tokens according to their importance. Furthermore, we develop a SIMD-based Multi-Kernel Mixed-Precision (MKMP) multiplier to support sub-8-bit mixed-precision MAC operations on mobile devices. Extensive experiments verify the substantial improvements of our method over other QAT methods across various datasets. Moreover, we achieve an on-device speedup of up to 2.37x compared with the FP16 counterpart. Code: https://github.com/shawnricecake/squant
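The distribution-aligned distillation described above can be illustrated with a minimal sketch: a KL-divergence loss that pulls the quantized student's attention distributions toward the full-precision teacher's. This is a generic distribution-alignment loss, not the paper's exact formulation; in particular, the entropy-guided weighting is not reproduced here.

```python
import math

def attention_distillation_loss(teacher_attn, student_attn, eps=1e-9):
    """Mean KL divergence KL(teacher || student) over attention rows.

    teacher_attn / student_attn: lists of attention rows, where each row
    is a probability distribution over key positions (sums to 1).
    A small eps keeps the logs finite for zero entries.
    """
    total, rows = 0.0, 0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(t * (math.log(t + eps) - math.log(s + eps))
                     for t, s in zip(t_row, s_row))
        rows += 1
    return total / rows
```

During QAT, a loss of this form would be added to the task loss so that gradients discourage quantization noise from reshaping the attention maps.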

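The token-adaptive quantization idea can likewise be sketched: each token's activation vector is quantized symmetrically with a bit width chosen by an importance score. The importance heuristic below (tokens at or above the median importance get the higher bit width) is a hypothetical stand-in for the paper's importance metric, used only to make the mechanism concrete.

```python
def token_adaptive_quantize(tokens, importance, bit_choices=(4, 8)):
    """Per-token symmetric quantize/dequantize with adaptive bit widths.

    tokens: list of activation vectors (lists of floats), one per token.
    importance: one score per token; higher means more important.
    Returns (dequantized vectors, bit width used per token).
    """
    threshold = sorted(importance)[len(importance) // 2]
    dequantized, bits_used = [], []
    for vec, imp in zip(tokens, importance):
        bits = bit_choices[1] if imp >= threshold else bit_choices[0]
        qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 7 for 4-bit
        max_abs = max(abs(x) for x in vec) or 1.0
        scale = max_abs / qmax              # per-token symmetric scale
        q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in vec]
        dequantized.append([qi * scale for qi in q])
        bits_used.append(bits)
    return dequantized, bits_used
```

Important tokens keep more precision (smaller quantization error), while the rest drop to sub-8-bit widths; the paper's MKMP multiplier is what makes the resulting mixed-precision MACs efficient under SIMD.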