Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices (2009.06488v2)

Published 14 Sep 2020 in cs.CV

Abstract: Quantized low-precision neural networks are very popular because they require less computational resources for inference and can provide high performance, which is vital for real-time and embedded recognition systems. However, their advantages are apparent for FPGA and ASIC devices, while general-purpose processor architectures are not always able to perform low-bit integer computations efficiently. The most frequently used low-precision neural network model for mobile central processors is an 8-bit quantized network. However, in a number of cases, it is possible to use fewer bits for weights and activations, and the only problem is the difficulty of efficient implementation. We introduce an efficient implementation of 4-bit matrix multiplication for quantized neural networks and perform time measurements on a mobile ARM processor. It shows 2.9 times speedup compared to standard floating-point multiplication and is 1.5 times faster than 8-bit quantized one. We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset. 4-bit quantization gives 95.0% accuracy and 48% overall inference speedup, while an 8-bit quantized network gives 95.4% accuracy and 39% speedup. The results show that 4-bit quantization perfectly suits mobile devices, yielding good enough accuracy and low inference time.

View on arXiv

Authors (5)

Anton Trusov (3 papers)
Elena Limonova (7 papers)
Dmitry Slugin (2 papers)
Dmitry Nikolaev (33 papers)
Vladimir V. Arlazarov (12 papers)

Citations (15)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices (2009.06488v2)

Summary

Related Papers