Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators (2401.14110v1)
Abstract: The majority of research on the quantization of Deep Neural Networks (DNNs) focuses on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-precision core operations, most significantly the accumulation of products. This high-precision accumulation is gradually becoming the main computational bottleneck, because until now the use of low-precision accumulators has led to a significant degradation in performance. In this work, we present a simple method to train and fine-tune high-end DNNs that allows, for the first time, the use of cheaper 12-bit accumulators with no significant degradation in accuracy. Lastly, we show that as the accumulation precision is decreased further, fine-grained gradient approximations can improve DNN accuracy.
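To make the core idea concrete, the sketch below emulates what a "low-precision accumulator" means in a dot product: the running partial sum is re-quantized to a reduced-precision floating-point format after every multiply-accumulate, instead of being kept in FP32. This is a minimal illustration, not the paper's implementation; the 6-bit mantissa stands in for a hypothetical 12-bit float, the exponent range is left unbounded, and the paper's training method and gradient approximations are not reproduced here.

```python
# Illustrative sketch (assumption, not the paper's code): emulate a reduced-precision
# accumulator by re-quantizing the running partial sum after each multiply-accumulate.
import numpy as np

def quantize_float(x, man_bits=6):
    """Round x to a float with `man_bits` explicit mantissa bits.
    Simplification: no subnormals, no overflow clamping (exponent range unbounded)."""
    if x == 0.0:
        return 0.0
    e = np.floor(np.log2(abs(x)))          # exponent of x
    scale = 2.0 ** (man_bits - e)          # place man_bits bits after the leading 1
    return float(np.round(x * scale) / scale)

def dot_low_precision_acc(a, b, man_bits=6):
    """Dot product whose partial sums are kept in the emulated low-precision format."""
    acc = 0.0
    for ai, bi in zip(a, b):
        acc = quantize_float(acc + float(ai) * float(bi), man_bits)
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)
print("FP32 accumulation        :", float(a @ b))
print("Emulated low-bit accum.  :", dot_low_precision_acc(a, b))
```

Running the two accumulations side by side shows the rounding error that builds up in the partial sums; the paper's contribution is a training/fine-tuning recipe that keeps model accuracy intact despite this error at 12-bit precision.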