
Fast Inference of Mixture-of-Experts Language Models with Offloading (2312.17238v1)

Published 28 Dec 2023 in cs.LG, cs.AI, and cs.DC

Abstract: With the widespread adoption of LLMs, many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) - a type of model architecture where only a fraction of model layers are active for any given input. This property allows MoE-based LLMs to generate tokens faster than their dense counterparts, but it also increases model size due to having multiple experts. Unfortunately, this makes state-of-the-art MoE LLMs difficult to run without high-end GPUs. In this work, we study the problem of running large MoE LLMs on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.

Introduction

LLMs have revolutionized natural language processing, but deploying them can be resource-intensive due to their massive size. They frequently require several high-end GPUs for operation, which can be a barrier for those without access to such hardware. This challenge is particularly acute with a subclass of LLMs known as Mixture-of-Experts (MoE) models, which offer efficient token generation but have larger model sizes that make them difficult to run on consumer-grade machines.

Addressing the MoE Challenge

The paper focuses on enabling the use of MoE LLMs on hardware with limited GPU memory, which is critical for making these powerful models more accessible. The research builds on parameter offloading techniques to cope with the limited memory of consumer accelerators. The authors develop techniques to run a large MoE model, Mixtral-8x7B, on standard desktop computers and even on free-tier compute instances such as Google Colab.

Offloading Strategy and Mixed Quantization

Two key strategies are introduced: MoE-specific offloading and mixed quantization. The offloading approach exploits regularities in how MoE models activate their experts, informing an improved caching method that keeps recently used experts on the GPU and thereby reduces RAM-to-GPU data transfers, accelerating token generation. In addition, the method speculatively loads experts that are likely to be needed next, based on predictable patterns in expert usage across layers.
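To make this concrete, the sketch below shows one way such an LRU expert cache with speculative prefetching could be structured. It is a minimal illustration under stated assumptions, not the authors' released implementation: load_expert (a loader that copies an expert's weights from host RAM to the GPU), the capacity of two cached experts, and the guessed_experts argument are all hypothetical stand-ins.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache that keeps the most recently used MoE experts resident on the GPU."""

    def __init__(self, load_expert, capacity=2):
        # load_expert(layer_idx, expert_idx) is a hypothetical callback that
        # moves one expert's weights from host RAM to GPU memory.
        self.load_expert = load_expert
        self.capacity = capacity
        self.cache = OrderedDict()  # (layer_idx, expert_idx) -> expert module

    def get(self, layer_idx, expert_idx):
        key = (layer_idx, expert_idx)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        expert = self.load_expert(layer_idx, expert_idx)
        self.cache[key] = expert
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return expert

    def prefetch(self, layer_idx, guessed_experts):
        # Speculatively load experts predicted to be used by an upcoming layer,
        # so the weights are already on the GPU when that layer runs.
        for expert_idx in guessed_experts:
            self.get(layer_idx, expert_idx)
```

In such a design, routing information computed at one layer could be reused to populate guessed_experts for a later layer's prefetch call, hiding part of the transfer latency behind computation.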

Mixed quantization compresses the model parameters to reduce their size, which in turn reduces the amount of data that must be transferred to the GPU. The paper lays out a system design that combines the offloading strategy with a mixed MoE quantization scheme, tailoring the quantization level to different parts of the model. This reduces loading times without severely compromising model quality.
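As an illustration of how a mixed scheme might be expressed, the sketch below maps module names to per-module quantization settings, keeping the shared attention weights at a higher bit width than the much larger expert weights. The specific bit widths, group sizes, and naming patterns are illustrative assumptions, not the exact configuration used in the paper.

```python
# Hypothetical per-module quantization plan: the comparatively small attention
# weights are kept at higher precision, while most of the memory savings come
# from compressing the expert weights more aggressively.
QUANT_PLAN = {
    "self_attn": {"bits": 4, "group_size": 64},
    "experts": {"bits": 3, "group_size": 64},
}

DEFAULT_CONFIG = {"bits": 4, "group_size": 64}


def pick_quant_config(module_name):
    """Return quantization settings for a module based on its name."""
    for pattern, config in QUANT_PLAN.items():
        if pattern in module_name:
            return config
    return DEFAULT_CONFIG
```

With a plan like this, smaller expert tensors both fit more cached experts into a fixed GPU memory budget and take less time to transfer whenever an uncached expert must be loaded.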

Experimental Results and Conclusion

Through comprehensive experiments, the research confirms the efficacy of the caching and offloading techniques. When applied to the Mixtral-8x7B MoE model, they yield substantial increases in token generation speed across multiple hardware configurations. The authors' implementation generates 2-3 tokens per second, depending on the hardware, showing a clear advantage over naive offloading.

This paper offers a significant advancement in the practical deployment of large MoE models, broadening their accessibility. Future work will focus on refining these offloading strategies further and possibly exploring new approaches for speculative expert prediction to enhance performance even on more restricted hardware setups. The source code for this implementation has been made available, encouraging further research and development in this space.

Authors (2)
  1. Artyom Eliseev (1 paper)
  2. Denis Mazur (5 papers)
Citations (23)