Differentiable Fine-grained Quantization for Deep Neural Network Compression (1810.10351v3)

Published 20 Oct 2018 in cs.CV and cs.AI

Abstract: Neural networks have shown great performance in cognitive tasks. When deploying network models on mobile devices with limited resources, weight quantization has been widely adopted. Binary quantization obtains the highest compression rate but usually results in a significant accuracy drop. In practice, 8-bit or 16-bit quantization is often used with the aim of maintaining the same accuracy as the original 32-bit precision. We observe that different layers have different sensitivity to quantization. Thus, judiciously selecting different precisions for different layers/structures can potentially produce more efficient models than traditional quantization methods by striking a better balance between accuracy and compression rate. In this work, we propose a fine-grained quantization approach for deep neural network compression that relaxes the search space of quantization bitwidth from a discrete to a continuous domain. The proposed approach applies gradient-descent-based optimization to generate a mixed-precision quantization scheme that achieves higher accuracy than traditional quantization methods at the same compression rate.
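The abstract does not spell out the relaxation mechanism, but a common way to make a discrete bitwidth choice trainable by gradient descent is to assign each candidate bitwidth a learnable logit and blend the candidate quantizations with a softmax. Below is a minimal PyTorch sketch under that assumption; the helper names (`quantize`, `MixedPrecisionWeight`), the candidate bitwidths, and the straight-through estimator for rounding are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of a continuous relaxation of bitwidth selection
# (illustrative; not the authors' exact method).
import torch
import torch.nn as nn


def quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric quantization of w to the given bitwidth."""
    levels = 2 ** (bits - 1) - 1
    scale = w.abs().max() / levels
    q = torch.round(w / scale).clamp(-levels, levels) * scale
    # Straight-through estimator: forward pass uses the quantized
    # values, backward pass lets gradients flow through w unchanged.
    return w + (q - w).detach()


class MixedPrecisionWeight(nn.Module):
    """A weight tensor whose effective bitwidth is learned.

    One logit per candidate bitwidth; a softmax over the logits
    produces mixing coefficients, so the (relaxed) bitwidth choice
    is differentiable and trainable alongside the weights.
    """

    def __init__(self, shape, candidate_bits=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(shape) * 0.1)
        self.candidate_bits = candidate_bits
        self.alpha = nn.Parameter(torch.zeros(len(candidate_bits)))

    def forward(self) -> torch.Tensor:
        probs = torch.softmax(self.alpha, dim=0)
        # Continuous relaxation: blend the candidate quantizations.
        return sum(p * quantize(self.weight, b)
                   for p, b in zip(probs, self.candidate_bits))

    def chosen_bits(self) -> int:
        # After training, discretize: keep the candidate bitwidth
        # with the largest mixing coefficient for this layer.
        return self.candidate_bits[int(self.alpha.argmax())]
```

In a sketch like this, each layer ends up with its own set of logits, so the discretized result is a per-layer mixed-precision scheme; a target compression rate could plausibly be encouraged by penalizing the expected bitwidth (the softmax-weighted sum of candidate bitwidths) in the training loss, though the abstract does not state how the paper enforces it.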

Authors (7)
  1. Hsin-Pai Cheng (17 papers)
  2. Yuanjun Huang (3 papers)
  3. Xuyang Guo (10 papers)
  4. Yifei Huang (71 papers)
  5. Feng Yan (67 papers)
  6. Hai Li (159 papers)
  7. Yiran Chen (176 papers)
Citations (13)