
An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models (2010.07109v1)

Published 14 Oct 2020 in cs.CL

Abstract: Recently, pre-trained language models like BERT have shown promising performance on multiple natural language processing tasks. However, the application of these models has been limited by their huge size. A popular and efficient way to reduce model size is quantization. Nevertheless, most works on BERT quantization adopt basic linear clustering as the quantization scheme, and few attempt to upgrade it, which significantly limits the performance of quantization. In this paper, we implement k-means quantization and compare its performance with linear quantization on fixed-precision quantization of BERT. Through this comparison, we verify that the effect of upgrading the underlying quantization scheme is underestimated and that k-means quantization has large development potential. In addition, we compare the two quantization schemes on ALBERT models to explore the robustness differences between different pre-trained models.
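The abstract contrasts two underlying schemes for fixed-precision weight quantization: linear (uniform) quantization and k-means (clustering-based) quantization. The sketch below illustrates the difference on a single weight matrix; it is not the paper's implementation. The function names, the 4-bit setting, the matrix size, and the use of scikit-learn's KMeans are illustrative assumptions.

```python
# Illustrative comparison of linear vs. k-means fixed-precision quantization.
# Assumptions (not from the paper): 4-bit precision, a random BERT-like weight
# block, and scikit-learn's KMeans as the clustering routine.
import numpy as np
from sklearn.cluster import KMeans


def linear_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Map weights onto 2**bits evenly spaced levels between min and max."""
    levels = 2 ** bits
    w_min, w_max = w.min(), w.max()
    step = (w_max - w_min) / (levels - 1)
    return np.round((w - w_min) / step) * step + w_min


def kmeans_quantize(w: np.ndarray, bits: int = 4, seed: int = 0) -> np.ndarray:
    """Replace each weight with the centroid of its k-means cluster."""
    levels = 2 ** bits
    km = KMeans(n_clusters=levels, n_init=10, random_state=seed)
    labels = km.fit_predict(w.reshape(-1, 1))
    return km.cluster_centers_[labels].reshape(w.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Small stand-in for one pre-trained weight block (size chosen for speed).
    w = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)
    for name, wq in (("linear", linear_quantize(w)), ("k-means", kmeans_quantize(w))):
        mse = float(np.mean((w - wq) ** 2))
        print(f"{name:8s} 4-bit reconstruction MSE: {mse:.3e}")
```

Because k-means places its quantization levels where the weights actually concentrate rather than on a uniform grid, it typically yields lower reconstruction error at the same bit width, which is the gap the paper investigates.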

Citations (10)
