
Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators (2401.14110v1)

Published 25 Jan 2024 in cs.LG, cs.AI, and cs.AR

Abstract: Most research on the quantization of Deep Neural Networks (DNNs) focuses on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-precision core operations, most significantly the accumulation of products. This high-precision accumulation is gradually becoming the main computational bottleneck, because the use of low-precision accumulators has so far led to significant degradation in performance. In this work, we present a simple method to train and fine-tune high-end DNNs that allows, for the first time, the use of cheaper $12$-bit accumulators with no significant degradation in accuracy. Lastly, we show that as the accumulation precision decreases further, fine-grained gradient approximations can improve DNN accuracy.
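
The abstract's central idea is that the running sum inside a matrix multiply, not just the weights and activations, can be kept at reduced precision. Since only the abstract is available here, the sketch below is not the authors' method; it is a minimal NumPy illustration of how a narrow floating-point accumulator can be simulated in software (in the spirit of low-precision simulation frameworks such as QPyTorch), assuming the accumulator is modeled by rounding every partial sum to a fixed number of mantissa bits. The function names `round_to_mantissa` and `low_precision_dot`, the 7-bit mantissa (a hypothetical 12-bit float with 1 sign and 4 exponent bits), and the vector sizes are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: simulating a reduced-precision accumulator in a dot product.
# Not the paper's algorithm; exponent range and overflow handling are ignored.
import numpy as np


def round_to_mantissa(x, mantissa_bits: int):
    """Round x to `mantissa_bits` explicit mantissa bits (round-to-nearest).

    Mimics the precision loss of a narrow floating-point register.
    """
    x = np.asarray(x, dtype=np.float64)
    mantissa, exponent = np.frexp(x)          # x = mantissa * 2**exponent, |mantissa| in [0.5, 1)
    scale = 2.0 ** mantissa_bits
    mantissa = np.round(mantissa * scale) / scale
    return np.ldexp(mantissa, exponent)


def low_precision_dot(a, b, mantissa_bits: int = 7) -> float:
    """Dot product whose products and running sum are rounded at every step,
    emulating accumulation in a narrow (hypothetical ~12-bit) float format."""
    acc = 0.0
    for ai, bi in zip(a, b):
        prod = round_to_mantissa(ai * bi, mantissa_bits)   # low-precision product
        acc = round_to_mantissa(acc + prod, mantissa_bits)  # narrow accumulator update
    return float(acc)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(1024).astype(np.float32)
    b = rng.standard_normal(1024).astype(np.float32)

    exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
    approx = low_precision_dot(a, b, mantissa_bits=7)
    print(f"fp64 accumulation:            {exact:.6f}")
    print(f"simulated narrow accumulator: {approx:.6f}  (abs err {abs(exact - approx):.2e})")
```

Running the sketch shows the narrow-accumulator result drifting from the fp64 reference as the mantissa width shrinks, which is the accuracy degradation the paper's training and fine-tuning method aims to avoid.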
