
Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search (2302.01382v2)

Published 2 Feb 2023 in cs.LG

Abstract: Serving large-scale ML models efficiently and with low latency has become challenging owing to increasing model size and complexity. Quantizing models can simultaneously reduce memory and compute requirements, facilitating their widespread access. However, for large models not all layers are equally amenable to the same numerical precision, and aggressive quantization can lead to unacceptable loss in model accuracy. One approach to prevent this accuracy degradation is mixed-precision quantization, which allows different tensors to be quantized to varying levels of numerical precision, leveraging the capabilities of modern hardware. Such mixed-precision quantization can more effectively allocate numerical precision to different tensors "as needed" to preserve model accuracy while reducing footprint and compute latency. In this paper, we propose a method to efficiently determine quantization configurations of different tensors in ML models using post-training mixed precision quantization. We analyze three sensitivity metrics and evaluate them for guiding the configuration search of two algorithms. We evaluate our method on computer vision and natural language processing tasks and demonstrate latency reductions of up to 27.59% and 34.31% compared to the baseline 16-bit floating point model while guaranteeing no more than 1% accuracy degradation.
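The core idea of sensitivity-guided mixed-precision assignment can be sketched as follows. This is an illustrative simplification, not the authors' exact algorithm: it assumes precomputed per-tensor sensitivity scores (e.g., estimated accuracy drop from quantizing that tensor alone) and greedily quantizes the least-sensitive tensors to a lower precision until an accuracy budget is exhausted. All names and values below are hypothetical placeholders.

```python
# Illustrative sketch of sensitivity-guided mixed-precision assignment.
# Not the paper's exact method: sensitivities and budget are hypothetical.

def assign_precisions(sensitivities, accuracy_budget, low_bits=8, high_bits=16):
    """Assign a bit width per tensor: quantize tensors in order of
    increasing sensitivity until the cumulative estimated accuracy
    degradation would exceed the budget."""
    # Start with every tensor at the safe high precision.
    config = {name: high_bits for name in sensitivities}
    spent = 0.0
    # Least-sensitive tensors are the safest to quantize aggressively.
    for name, s in sorted(sensitivities.items(), key=lambda kv: kv[1]):
        if spent + s > accuracy_budget:
            break
        config[name] = low_bits
        spent += s
    return config

# Hypothetical per-layer sensitivity scores (estimated accuracy drop).
sens = {"conv1": 0.002, "conv2": 0.001, "fc": 0.009}
print(assign_precisions(sens, accuracy_budget=0.01))
# -> {'conv1': 8, 'conv2': 8, 'fc': 16}
```

In this toy run, the two convolution layers are cheap to quantize and fit inside the 1% budget, while the sensitive fully connected layer stays at 16-bit, mirroring the paper's premise that precision should be allocated per tensor "as needed".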

Authors (10)
  1. Clemens JS Schaefer
  2. Elfie Guo
  3. Caitlin Stanton
  4. Xiaofan Zhang
  5. Tom Jablin
  6. Navid Lambert-Shirzad
  7. Jian Li
  8. Chiachen Chou
  9. Siddharth Joshi
  10. Yu Emma Wang
Citations (3)