
MoE-Infinity: Offloading-Efficient MoE Model Serving (2401.14361v2)

Published 25 Jan 2024 in cs.LG and cs.PF

Abstract: This paper presents MoE-Infinity, an offloading-efficient serving system for sparse mixture-of-experts (MoE) models. To optimize offloading, MoE-Infinity achieves novel request-level tracing for expert activation, capturing MoE's sparse execution patterns such as selective activation, group activation, and skewed reuse. Leveraging the request-level trace, MoE-Infinity performs effective expert prefetching and expert caching, achieving high efficiency in transferring model parameters from host memory to GPU memory. Experimental results demonstrate that MoE-Infinity achieves low latency comparable to expensive full-GPU deployments, which require up to 4X more GPU resources than MoE-Infinity. Compared to offloading-supporting LLM serving systems such as DeepSpeed-Inference, Llama.cpp, Mixtral Offloading, and BrainStorm, MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models for a large collection of LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity

Introduction

MOE-INFINITY is a recently developed system that introduces a cost-efficient serving strategy for mixture-of-experts (MoE) models. It addresses a central challenge of MoE deployment: the large parameter counts of these models impose substantial GPU memory costs. MOE-INFINITY navigates these memory constraints with an offloading technique, keeping model parameters in host memory and transferring them to GPU memory on demand, in a manner distinct from the techniques employed in existing systems.

Sequence-Level Expert Activation Tracing

Many of the current offloading systems target dense models and lack the granularity to cater to MoEs’ specific requirements. MOE-INFINITY addresses this by introducing sequence-level expert activation tracing, which preserves the sparse activation and temporal locality characteristics inherent to MoEs. Through this granular tracing, coupled with an optimized construction algorithm for the Expert Activation Matrix Collection (EAMC), it captures a range of activation patterns across sequences to predict and prefetch necessary experts effectively.
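The following is a minimal sketch of how such per-request expert activation tracing and EAMC matching could work. The class and method names, the bounded-capacity replacement heuristic, and the cosine-similarity matching are illustrative assumptions, not MoE-Infinity's exact implementation.

```python
# Hypothetical sketch of request-level expert activation tracing and EAMC
# matching; names and the similarity metric are assumptions for illustration.
import numpy as np


class ExpertActivationTracer:
    def __init__(self, num_layers: int, num_experts: int, capacity: int = 128):
        self.num_layers = num_layers
        self.num_experts = num_experts
        self.capacity = capacity          # bound on the EAMC size
        self.eamc: list[np.ndarray] = []  # Expert Activation Matrix Collection

    def new_request(self) -> np.ndarray:
        # One row per MoE layer, one column per expert; counts activations.
        return np.zeros((self.num_layers, self.num_experts), dtype=np.float32)

    def record(self, eam: np.ndarray, layer: int, expert_ids: list[int]) -> None:
        # Called after the router of each MoE layer for the current request.
        eam[layer, expert_ids] += 1.0

    def finish_request(self, eam: np.ndarray) -> None:
        # Keep a bounded, diverse collection: once full, replace the stored
        # matrix most similar to the new one (simple replacement heuristic).
        if len(self.eamc) < self.capacity:
            self.eamc.append(eam)
        else:
            sims = [self._similarity(eam, m) for m in self.eamc]
            self.eamc[int(np.argmax(sims))] = eam

    def predict_future_experts(self, partial_eam: np.ndarray,
                               current_layer: int) -> np.ndarray:
        # Match the running request's partial trace against stored matrices
        # and reuse the best match's later layers as the activation prediction.
        if not self.eamc:
            return np.zeros((self.num_layers, self.num_experts), dtype=np.float32)
        prefix = partial_eam[: current_layer + 1]
        sims = [self._similarity(prefix, m[: current_layer + 1]) for m in self.eamc]
        best = self.eamc[int(np.argmax(sims))]
        return best[current_layer + 1:]   # predicted activations for later layers

    @staticmethod
    def _similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity over flattened activation matrices.
        a, b = a.ravel(), b.ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
        return float(np.dot(a, b) / denom)
```

The prediction returned for the remaining layers can then drive which experts are copied from host to GPU memory ahead of time.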

Activation-Aware Expert Prefetching and Caching

The system employs novel activation-aware policies for both prefetching and caching of offloaded parameters to reduce latency significantly. For prefetching, MOE-INFINITY weighs the likelihood of an expert's activation against the proximity of the expert's layer to the layer currently being executed, so that the most imminent and most likely experts are fetched first. Caching follows a similar principle: experts are prioritized based on their past activation ratios and their position within the MoE layers, which further boosts the cache hit ratio and reduces execution latency.
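Below is a minimal sketch of how such activation-aware scoring could be expressed. The specific weighting formulas (likelihood divided by layer distance for prefetching, activation ratio divided by steps-until-reuse for caching) are assumptions chosen to illustrate the idea; the paper's concrete formulas may differ.

```python
# Hypothetical activation-aware prefetch and cache scoring; the weighting
# scheme is an illustrative assumption, not MoE-Infinity's exact formula.

def prefetch_priority(pred_activation: float,
                      expert_layer: int,
                      current_layer: int) -> float:
    # Experts with high predicted activation in soon-to-run layers are
    # transferred from host memory to GPU memory first.
    if expert_layer <= current_layer:
        return 0.0                        # layer already executed this pass
    distance = expert_layer - current_layer
    return pred_activation / distance     # likelihood weighted by proximity


def cache_priority(past_activation_ratio: float,
                   expert_layer: int,
                   current_layer: int,
                   num_layers: int) -> float:
    # Frequently reused experts, and experts whose layer will be reached
    # soonest as decoding cycles through the layers again, are evicted last.
    steps_until_needed = (expert_layer - current_layer - 1) % num_layers + 1
    return past_activation_ratio / steps_until_needed


# Example: choose which cached expert to evict when GPU memory is full.
cached = {
    ("layer3", "expert7"): cache_priority(0.6, expert_layer=3,
                                          current_layer=5, num_layers=32),
    ("layer30", "expert1"): cache_priority(0.2, expert_layer=30,
                                           current_layer=5, num_layers=32),
}
evict_first = min(cached, key=cached.get)  # lowest-priority expert goes first
```

In practice these scores would be recomputed as the prefetch predictions are refined layer by layer, so eviction and transfer decisions track the current request rather than a static popularity ranking.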

Evaluation and Performance

MOE-INFINITY outperforms numerous existing systems in both latency and cost. With extensive experimental validation across varied conditions, including different model sizes, batch sizes, and dataset characteristics, the system demonstrates substantial latency reduction (4-20X) and a significant decrease in deployment costs (over 8X). Furthermore, its design elements, such as the EAMC and the continuous refinement of prefetching priorities, sustain robust performance even under distribution shifts in serving scenarios. MOE-INFINITY's efficiency improves further on multi-GPU servers and is expected to benefit from upcoming serving hardware with higher bandwidth between host and GPU memory.

Conclusion

In summary, MOE-INFINITY reflects a carefully designed response to the memory-intensive demands of MoE models. By tailoring offloading, prefetching, and caching strategies specifically to leverage MoE structures, it paves the way for high-throughput, cost-effective AI services. The system's strength lies in its ability to adapt to varying MoE architectures and serving workloads, making it an innovative platform for future explorations into MoE-optimized deployment systems.

Authors (5)
  1. Leyang Xue (16 papers)
  2. Yao Fu (83 papers)
  3. Zhan Lu (7 papers)
  4. Luo Mai (22 papers)
  5. Mahesh Marina (2 papers)