
MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness (2503.21135v2)

Published 27 Mar 2025 in cs.LG

Abstract: With the advances in artificial intelligence, Mixture-of-Experts (MoE) has become a mainstream architecture for LLMs, and the demand for compressing these models is increasing. Quantization is an effective method that not only compresses models but also significantly accelerates inference. Existing quantization methods have gradually shifted their focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and is suboptimal for MoE quantization because of MoEs' complex data-model distribution. To address this problem, we decouple the complexity of MoEs' data-model distribution into a multi-stage analysis and reveal MoEs' inherent dynamics. The analysis shows that expert performance in an MoE varies dynamically both within and across data distributions. Based on these findings, we design two quantization strategies with data-model distribution awareness and integrate them into an end-to-end framework for MoE quantization, named MoQa. MoQa uses an expert-level mixed-precision base quantization with distribution awareness. Moreover, MoQa uses a channel-level quantization adjustment to dynamically adapt expert performance to novel distributions. Experiments show that MoQa's base quantization achieves a 0.49~8.51 PPL decrease on known distributions. With the adjustments, MoQa achieves a 2.74~6.44 PPL decrease and 1.85%~3.77% average accuracy improvements on novel distributions. We believe MoQa will play a role in future MoE construction, optimization, and compression.
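The abstract describes two components: an expert-level mixed-precision base quantization and a channel-level adjustment. The sketch below illustrates only the general idea behind the first component and is not the authors' algorithm: it assigns per-expert bit-widths from a simple calibration statistic (here, expert routing frequency, a stand-in proxy) and applies symmetric per-channel round-to-nearest weight quantization. All function names, the bit-width choices, and the budget parameter are assumptions made for illustration.

```python
import torch


def assign_expert_bits(routing_freq, bit_choices=(2, 3, 4, 8), budget_bits=4.0):
    """Hypothetical expert-level mixed-precision assignment.

    Experts that are routed to more often get higher precision; rarely used
    experts get fewer bits, keeping the average bit-width near `budget_bits`.
    """
    order = torch.argsort(routing_freq, descending=True)
    bits = torch.full_like(routing_freq, float(min(bit_choices)))
    # Greedily upgrade the most-used experts while the average budget allows.
    for idx in order:
        for b in sorted(bit_choices, reverse=True):
            trial = bits.clone()
            trial[idx] = b
            if trial.mean() <= budget_bits:
                bits[idx] = b
                break
    return bits.int()


def quantize_expert(weight, n_bits):
    """Symmetric per-output-channel weight quantization (round-to-nearest)."""
    qmax = 2 ** (int(n_bits) - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale  # dequantized ("fake-quant") weights for evaluation


if __name__ == "__main__":
    torch.manual_seed(0)
    num_experts, d_out, d_in = 8, 64, 128
    experts = [torch.randn(d_out, d_in) for _ in range(num_experts)]
    # Stand-in for calibration statistics: how often each expert is selected.
    routing_freq = torch.rand(num_experts)
    routing_freq /= routing_freq.sum()

    bits = assign_expert_bits(routing_freq)
    quantized = [quantize_expert(w, b) for w, b in zip(experts, bits)]
    for i, (w, wq, b) in enumerate(zip(experts, quantized, bits)):
        err = (w - wq).pow(2).mean().sqrt().item()
        print(f"expert {i}: {int(b)} bits, "
              f"freq {routing_freq[i].item():.3f}, RMSE {err:.4f}")
```

In this toy setup, heavily routed experts end up at 8 bits and show lower reconstruction error, while rarely used experts absorb most of the compression; the paper's channel-level adjustment stage, which adapts experts to novel data distributions, is not modeled here.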

