INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers (2307.03712v1)

Published 7 Jul 2023 in cs.LG, cs.CL, and cs.CV

Abstract: The recent rise of LLMs has resulted in increased efforts towards running LLMs at reduced precision. Running LLMs at lower precision reduces resource requirements and furthers their democratization, enabling users to run billion-parameter LLMs on their personal devices. To supplement this ongoing effort, we propose INT-FP-QSim: an open-source simulator that enables flexible evaluation of LLMs and vision transformers at various numerical precisions and formats. INT-FP-QSim leverages existing open-source repositories such as TensorRT, QPyTorch and AIMET to build a combined simulator that supports various floating-point and integer formats. With the help of our simulator, we survey the impact of different numerical formats on the performance of LLMs and vision transformers at 4-bit weights and 4-bit or 8-bit activations. We also compare recently proposed methods such as Adaptive Block Floating Point, SmoothQuant, GPTQ and RPTQ in terms of model performance. We hope INT-FP-QSim will enable researchers to flexibly simulate models at various precisions to support further research in quantization of LLMs and vision transformers.
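
To make the idea of precision simulation concrete, below is a minimal sketch of the fake-quantization approach such simulators typically rely on: weights and activations are rounded onto a low-precision integer grid and immediately dequantized, so the matrix multiply itself still runs in full precision. The names `fake_quant_int` and `SimulatedQuantLinear` are illustrative placeholders, not the INT-FP-QSim API, and the per-tensor symmetric scheme shown is only one of the many formats the paper surveys.

```python
# Minimal sketch (not the INT-FP-QSim API): simulate W4A8 inference by
# fake-quantizing weights and activations before a full-precision matmul.
import torch
import torch.nn as nn


def fake_quant_int(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Symmetric per-tensor integer fake quantization: quantize, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 7 for INT4, 127 for INT8
    scale = x.abs().max().clamp(min=1e-8) / qmax   # map observed range onto the integer grid
    return (x / scale).round().clamp(-qmax, qmax) * scale


class SimulatedQuantLinear(nn.Module):
    """Wraps nn.Linear so its weights and inputs are rounded to a low-precision grid."""

    def __init__(self, linear: nn.Linear, weight_bits: int = 4, act_bits: int = 8):
        super().__init__()
        self.linear = linear
        self.weight_bits = weight_bits
        self.act_bits = act_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant_int(self.linear.weight, self.weight_bits)
        x_q = fake_quant_int(x, self.act_bits)
        return nn.functional.linear(x_q, w_q, self.linear.bias)


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = SimulatedQuantLinear(nn.Linear(16, 16), weight_bits=4, act_bits=8)
    out = layer(torch.randn(2, 16))
    print(out.shape)  # torch.Size([2, 16])
```

Swapping `fake_quant_int` for a floating-point or block-floating-point rounding routine is, conceptually, how a simulator of this kind would extend to the FP8 and Adaptive Block Floating Point formats mentioned in the abstract.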

References (15)
  1. SmoothQuant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022.
  2. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
  3. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
  4. RPTQ: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023.
  5. VS-Quant: Per-vector scaled quantization for accurate low-precision neural network inference. Proceedings of Machine Learning and Systems, 3:873–884, 2021.
  6. Adaptive block floating-point for analog deep learning hardware. arXiv preprint arXiv:2205.06287, 2022.
  7. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020.
  8. TensorRT - PyTorch Quantization. https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization. Accessed: 2023-05-18.
  9. QPyTorch: A low-precision arithmetic simulation framework. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pages 10–13. IEEE, 2019.
  10. AI Model Efficiency Toolkit (AIMET). https://github.com/quic/aimet. Accessed: 2023-05-18.
  11. FP8 quantization: The power of the exponent. arXiv preprint arXiv:2208.09225, 2022.
  12. A comprehensive study on post-training quantization for large language models. arXiv preprint arXiv:2303.08302, 2023.
  13. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022.
  14. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In International Conference on Machine Learning, pages 19123–19138. PMLR, 2022.
  15. An electro-photonic system for accelerating deep neural networks. arXiv preprint arXiv:2109.01126, 2021.