
Fast Inference of Mixture-of-Experts Language Models with Offloading (2312.17238v1)

Published 28 Dec 2023 in cs.LG, cs.AI, and cs.DC

Abstract: With the widespread adoption of LLMs, many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) - a type of model architecture where only a fraction of model layers are active for any given input. This property allows MoE-based LLMs to generate tokens faster than their dense counterparts, but it also increases model size due to having multiple experts. Unfortunately, this makes state-of-the-art MoE LLMs difficult to run without high-end GPUs. In this work, we study the problem of running large MoE LLMs on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.

Introduction

LLMs have revolutionized natural language processing, but deploying them can be resource-intensive due to their massive size. They frequently require several high-end GPUs for operation, which can be a barrier for those without access to such hardware. This challenge is particularly acute with a subclass of LLMs known as Mixture-of-Experts (MoE) models, which offer efficient token generation but have larger model sizes that make them difficult to run on consumer-grade machines.

Addressing the MoE Challenge

The paper focuses on enabling the use of MoE LLMs on hardware with limited GPU memory, which is critical for making these powerful models more accessible. The research builds on parameter offloading techniques to cope with the limited memory of consumer accelerators. The authors develop techniques to run a large MoE model, Mixtral-8x7B, on standard desktop computers and even on free-tier compute instances such as Google Colab.

Offloading Strategy and Mixed Quantization

Two key strategies are introduced: MoE-specific offloading and mixed quantization. The offloading approach exploits regularities in how MoE models activate their experts, informing an improved caching method that keeps recently used experts on the GPU and thereby reduces RAM-to-GPU data transfers, accelerating token generation. In addition, the method speculatively loads experts that are likely to be needed next, based on predictable patterns in expert usage across layers.
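To make this concrete, the sketch below shows one way such an LRU expert cache with speculative prefetching could be structured. It is a minimal illustration under stated assumptions, not the authors' released implementation: load_expert (a loader that copies an expert's weights from host RAM to the GPU), the capacity of two cached experts, and the guessed_experts argument are all hypothetical stand-ins.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache that keeps the most recently used MoE experts resident on the GPU."""

    def __init__(self, load_expert, capacity=2):
        # load_expert(layer_idx, expert_idx) is a hypothetical callback that
        # moves one expert's weights from host RAM to GPU memory.
        self.load_expert = load_expert
        self.capacity = capacity
        self.cache = OrderedDict()  # (layer_idx, expert_idx) -> expert module

    def get(self, layer_idx, expert_idx):
        key = (layer_idx, expert_idx)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        expert = self.load_expert(layer_idx, expert_idx)
        self.cache[key] = expert
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return expert

    def prefetch(self, layer_idx, guessed_experts):
        # Speculatively load experts predicted to be used by an upcoming layer,
        # so the weights are already on the GPU when that layer runs.
        for expert_idx in guessed_experts:
            self.get(layer_idx, expert_idx)
```

In such a design, routing information computed at one layer could be reused to populate guessed_experts for a later layer's prefetch call, hiding part of the transfer latency behind computation.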

Mixed quantization compresses the model parameters to reduce their size, which in turn reduces the amount of data that must be transferred to the GPU. The paper lays out a system design that combines the offloading strategy with a mixed MoE quantization scheme, tailoring the quantization level to different parts of the model. This reduces loading times without severely compromising model quality.
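As an illustration of how a mixed scheme might be expressed, the sketch below maps module names to per-module quantization settings, keeping the shared attention weights at a higher bit width than the much larger expert weights. The specific bit widths, group sizes, and naming patterns are illustrative assumptions, not the exact configuration used in the paper.

```python
# Hypothetical per-module quantization plan: the comparatively small attention
# weights are kept at higher precision, while most of the memory savings come
# from compressing the expert weights more aggressively.
QUANT_PLAN = {
    "self_attn": {"bits": 4, "group_size": 64},
    "experts": {"bits": 3, "group_size": 64},
}

DEFAULT_CONFIG = {"bits": 4, "group_size": 64}


def pick_quant_config(module_name):
    """Return quantization settings for a module based on its name."""
    for pattern, config in QUANT_PLAN.items():
        if pattern in module_name:
            return config
    return DEFAULT_CONFIG
```

With a plan like this, smaller expert tensors both fit more cached experts into a fixed GPU memory budget and take less time to transfer whenever an uncached expert must be loaded.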

Experimental Results and Conclusion

Through comprehensive experiments, the research confirms the efficacy of the caching and offloading techniques. When applied to the Mixtral-8x7B MoE model, they yield substantial increases in token generation speed across multiple hardware configurations. The authors' implementation generates 2-3 tokens per second, depending on the hardware, showing a clear advantage over naive offloading.

This paper offers a significant advancement in the practical deployment of large MoE models, broadening their accessibility. Future work will focus on refining these offloading strategies further and possibly exploring new approaches for speculative expert prediction to enhance performance even on more restricted hardware setups. The source code for this implementation has been made available, encouraging further research and development in this space.

Authors (2)
  1. Artyom Eliseev (1 paper)
  2. Denis Mazur (5 papers)
Citations (23)