How Does Quantization Affect Multilingual LLMs? (2407.03211v2)

Published 3 Jul 2024 in cs.CL and cs.LG

Abstract: Quantization techniques are widely used to improve inference speed and deployment of LLMs. While a wide body of work examines the impact of quantization on LLMs in English, none have evaluated across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, which automatic metrics severely underestimate: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks like mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.

The Impact of Quantization on Multilingual LLMs

In the paper titled "How Does Quantization Affect Multilingual LLMs?", Marchisio et al. delve into the nuanced effects of quantization on LLMs that support multiple languages. The authors systematically evaluate quantized multilingual LLMs, focusing on performance degradation across different languages and scales using automatic benchmarks, LLM-as-a-Judge methods, and human evaluations. The paper is novel in its broad assessment of how quantization influences the multilingual capabilities of contemporary LLMs.
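To make the LLM-as-a-Judge component concrete, here is a minimal pairwise-judging sketch in Python. It is illustrative only: the prompt template, the A/B/TIE scoring, and the `judge` callable are assumptions for the sketch, not the paper's actual evaluation protocol.

```python
from typing import Callable

# Hypothetical pairwise LLM-as-a-Judge comparison between a full-precision
# model's answer and a quantized model's answer. `judge` is any function that
# sends a prompt to a judge LLM and returns its text reply (placeholder API).
JUDGE_TEMPLATE = """You are comparing two answers to the same user prompt.

Prompt: {prompt}
Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? Reply with exactly "A", "B", or "TIE"."""

def judge_pair(judge: Callable[[str], str], prompt: str,
               answer_fp16: str, answer_quantized: str) -> str:
    """Ask the judge model to pick the better answer; returns 'A', 'B', or 'TIE'."""
    reply = judge(JUDGE_TEMPLATE.format(
        prompt=prompt, answer_a=answer_fp16, answer_b=answer_quantized))
    return reply.strip().upper()

def quantized_win_rate(judgements: list[str]) -> float:
    """Fraction of prompts where the quantized model's answer ('B') is preferred."""
    return sum(j == "B" for j in judgements) / max(len(judgements), 1)
```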

Key Findings

The paper presents several crucial findings regarding the quantization of multilingual LLMs:

  1. Underestimation by Automatic Metrics:
    • It is shown that automatic evaluation benchmarks significantly underestimate the detrimental effects of quantization. For instance, a performance drop of 1.7% in Japanese tasks (evaluated automatically) corresponds to a notable 16.0% drop when assessed through human evaluation.
  2. Variable Impact by Language:
    • Quantization impacts languages disparately, with non-Latin script languages suffering the most. The paper reports a relative performance change of -0.7% for Latin-script languages versus -1.9% for non-Latin-script languages on the 103 billion parameter model.
  3. Degradation in Challenging Tasks:
    • Certain complex tasks degrade faster under quantization. Mathematical reasoning, for example, shows severe declines: the 35 billion parameter model drops by 13.1% in mathematical reasoning ability when quantized.
  4. Quantization occasionally offers benefits:
    • In some cases, quantization slightly improves model performance: the 35 billion parameter model showed an average performance boost of 1.3% when quantized with W8A8 (a short W8A8 sketch follows this list).
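For reference, "W8A8" means both weights and activations are quantized to 8-bit integers. Below is a minimal PyTorch sketch of a simulated W8A8 linear layer with symmetric per-tensor scales; it illustrates the general technique, not the paper's exact quantization recipe (which may use per-channel or group-wise scales).

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: x ≈ scale * q."""
    scale = (x.abs().max() / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(activation: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Simulated W8A8 matmul: quantize weights and activations to int8,
    accumulate in int32 (as real int8 kernels do), then dequantize."""
    w_q, w_scale = quantize_int8(weight)
    a_q, a_scale = quantize_int8(activation)
    acc = a_q.to(torch.int32) @ w_q.to(torch.int32).T
    return acc.to(torch.float32) * (w_scale * a_scale)

# Compare the full-precision output with the simulated W8A8 output.
w = torch.randn(64, 128)   # (out_features, in_features)
a = torch.randn(4, 128)    # (batch, in_features)
print((a @ w.T - w8a8_linear(a, w)).abs().mean())  # mean quantization error
```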

Implications

The practical implications of these findings are significant, especially considering the trade-offs in deploying LLMs in resource-constrained environments. Quantization is a key technique enabling the deployment of LLMs on devices with limited computational power. However, this paper shows that while quantization can maintain or even enhance performance in certain scenarios, it often results in considerable degradation, notably for complex tasks and non-Latin script languages.
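As a concrete, hypothetical example of such a deployment, the snippet below loads a causal language model in 8-bit via Hugging Face Transformers with bitsandbytes. The model identifier is a placeholder and this is not the serving setup used in the paper; it only shows the kind of low-compute path the authors are concerned about.

```python
# Illustrative 8-bit loading with transformers + bitsandbytes (+ accelerate).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-multilingual-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# A non-Latin-script prompt, the kind of input the paper finds most at risk.
inputs = tokenizer("自然言語処理とは何ですか？", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```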

From a theoretical perspective, the research emphasizes the critical need for language-specific considerations when applying quantization. The disproportionate impact on non-Latin script languages points toward inherent biases in model design and pre-training that favor Latin-script languages. Furthermore, the variability in degradation across different language tasks suggests that model architects and practitioners should focus on developing more nuanced quantization techniques that consider linguistic diversity.

Future Directions

The findings pave the way for future research that can address the identified shortcomings:

  • Advanced quantization techniques: Future studies could develop and refine quantization methods to mitigate the identified performance issues, particularly the significant drops observed in human evaluations for non-Latin scripts.
  • Multilingual-aware training: Incorporating multilingual awareness in the training phase can potentially reduce the disparity in performance drops among different languages.
  • Exploration of post-training quantization strategies: Given that post-training quantization methods like SmoothQuant and group-wise scaling showed varying degrees of efficacy, further exploration in this domain is warranted. The surprising results on cross-lingual language confusion, for instance, highlight the need for targeted enhancements in quantization strategies (a group-wise scaling sketch follows this list).
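To make "group-wise scaling" concrete, the sketch below quantizes a weight matrix to int8 with one scale per contiguous group of values along the input dimension, which limits how far a single outlier can degrade the rest of its row. The group size and the symmetric scheme are assumptions for the sketch, not the specific configuration evaluated in the paper.

```python
import torch

def groupwise_int8_quantize(weight: torch.Tensor, group_size: int = 128):
    """Group-wise symmetric int8 quantization: one scale per `group_size`
    contiguous values along the input dimension."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scales = (w.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    out_features = q.shape[0]
    return (q.to(torch.float32) * scales).reshape(out_features, -1)

# Smaller groups track local magnitudes more closely, at the cost of more scales.
w = torch.randn(256, 1024)
q, s = groupwise_int8_quantize(w, group_size=128)
print((w - dequantize(q, s)).abs().mean())  # mean reconstruction error
```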

Conclusion

Marchisio et al.'s work underscores the intricacies and essential considerations when applying quantization to multilingual LLMs. By delivering a comprehensive analysis that spans multiple languages and tasks, the paper highlights the considerable variation in quantization’s impact, urging a cautious and informed approach to deploying quantized LLMs globally. The findings call for a reassessment of model design choices to ensure fair and efficient multilingual performance, making a compelling case for continued research in this important area.

Authors (7)
  1. Kelly Marchisio
  2. Saurabh Dash
  3. Hongyu Chen
  4. Dennis Aumiller
  5. Ahmet Üstün
  6. Sara Hooker
  7. Sebastian Ruder