Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2310.02410v1)
Abstract: Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks, including machine translation, thanks to efficient model scaling with expert parallelism. However, this scaling brings a fundamental issue: larger memory consumption and an increased memory bandwidth bottleneck at deployment time. In this paper, we propose Mixture of Quantized Experts (MoQE), a simple weight-only quantization method that applies ultra low-bit (down to 2-bit) quantization only to the expert weights, mitigating the increased memory and latency issues of MoE models. We show that low-bit quantization together with the MoE architecture delivers reliable model performance while significantly reducing the memory footprint, in most cases without any additional training. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In our comprehensive analysis, we show that MoE models with 2-bit expert weights can deliver better model performance than a dense model trained on the same dataset. As a result of low-bit quantization, the model size can be reduced by 79.6% relative to the original half-precision floating point (fp16) MoE model. Combined with an optimized GPU runtime implementation, it also achieves a 1.24X speed-up on A100 GPUs.
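To make the core idea concrete, the sketch below shows what weight-only low-bit quantization of expert weights can look like in PyTorch, using a simple symmetric, per-output-channel, round-to-nearest scheme. This is a minimal illustration under assumptions, not the paper's implementation: the function names, tensor shapes, and the specific 2-bit mapping (codes in {-1, 0, +1}) are all illustrative choices.

```python
# Minimal sketch (assumed, not the authors' code) of weight-only low-bit
# quantization applied only to the expert FFN weights of an MoE layer.
import torch


def quantize_expert_weight(w: torch.Tensor, bits: int = 2):
    """Symmetric per-output-channel round-to-nearest quantization.

    Returns integer codes and a per-channel fp16 scale such that
    w is approximately codes * scale. Simplified scheme for illustration.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 1 for 2-bit codes
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                   # avoid division by zero
    codes = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return codes, scale.to(torch.float16)


def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an fp16 weight matrix from codes and per-channel scales."""
    return codes.to(torch.float16) * scale


# Toy example: quantize only the expert FFN weights, leaving attention and
# shared weights in fp16 (hypothetical sizes for illustration).
d_model, d_ffn, n_experts = 512, 2048, 8
experts = [torch.randn(d_ffn, d_model) for _ in range(n_experts)]

quantized = [quantize_expert_weight(w, bits=2) for w in experts]
recon_err = torch.stack(
    [(dequantize(c, s) - w).abs().mean() for (c, s), w in zip(quantized, experts)]
)
print(f"mean abs reconstruction error per expert: {recon_err.mean():.4f}")
```

Note that the sketch stores the codes as int8 for readability; realizing the reported memory reduction would require packing several low-bit codes per byte and dequantizing on the fly, which is the kind of work an optimized GPU runtime (as mentioned in the abstract) would handle.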
- Young Jin Kim
- Raffy Fahim
- Hany Hassan Awadalla