MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models (2405.18832v1)
Abstract: Mixture-of-Experts (MoE) large language models (LLMs) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU, while computing the remaining $\textit{cold}$ experts inside the host memory device. By replacing the transfer of massive expert parameters with that of small activations, MoNDE enables far more communication-efficient MoE inference, resulting in substantial speedups over existing parameter-offloading frameworks for both encoder and decoder operations.
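The abstract describes a hot/cold split over experts: frequently used experts have their parameters moved to the GPU, while rarely used experts are computed where their weights already reside, so only small activations cross the link. The sketch below illustrates that dispatch idea only; it is not the paper's implementation, and the `hot_fraction` threshold, the single-matmul stand-in experts, and the `compute_on_gpu` / `compute_near_data` placeholders are illustrative assumptions.

```python
# Minimal sketch of hot/cold expert dispatch (illustrative, not MoNDE's code).
import torch

torch.manual_seed(0)
NUM_EXPERTS, D_MODEL = 8, 16
# Stand-in expert weights; in MoNDE these reside in the host memory device.
EXPERT_W = [torch.randn(D_MODEL, D_MODEL) for _ in range(NUM_EXPERTS)]

def compute_on_gpu(e, x):
    # Hot path: the expert's parameters would be transferred to the GPU once,
    # then all tokens routed to it are computed there.
    return x @ EXPERT_W[e]

def compute_near_data(e, x):
    # Cold path: only the small activations x would move to the memory device;
    # the large expert weights never leave host memory.
    return x @ EXPERT_W[e]

def moe_layer(tokens, expert_ids, hot_fraction=0.25):
    # Count how many tokens the router assigned to each expert and treat the
    # most frequently used ones as "hot".
    counts = torch.bincount(expert_ids, minlength=NUM_EXPERTS)
    num_hot = max(1, int(hot_fraction * NUM_EXPERTS))
    hot = set(torch.topk(counts, num_hot).indices.tolist())

    out = torch.empty_like(tokens)
    for e in range(NUM_EXPERTS):
        mask = expert_ids == e
        if mask.any():
            fn = compute_on_gpu if e in hot else compute_near_data
            out[mask] = fn(e, tokens[mask])
    return out

tokens = torch.randn(32, D_MODEL)
expert_ids = torch.randint(0, NUM_EXPERTS, (32,))
print(moe_layer(tokens, expert_ids).shape)  # torch.Size([32, 16])
```

The point of the split is communication volume: a hot expert amortizes one large weight transfer over many tokens, while a cold expert would move far more bytes in weights than in the handful of activations routed to it.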
Authors: Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, Jaewoong Sim