
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (2402.07033v1)

Published 10 Feb 2024 in cs.LG, cs.AI, cs.OS, and cs.DC

Abstract: LLMs based on Mixture-of-Experts (MoE) architecture are showing promising performance on various tasks. However, running them on resource-constrained settings, where GPU memory resources are not abundant, is challenging due to huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize the data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, which exceeds 90GB in parameters, to generate over $3$ tokens per second on a single GPU with 24GB memory, showing an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at \url{https://github.com/efeslab/fiddler}

Overview of Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

The paper introduces Fiddler, an inference engine designed to deploy Mixture-of-Experts (MoE) models efficiently in resource-constrained environments through CPU-GPU orchestration. MoE architectures activate only a small subset of expert networks per token, yet their total parameter count is large, which makes them hard to run under limited GPU memory. Traditional offloading methods incur high overhead by repeatedly moving expert weights between CPU and GPU. Fiddler addresses this by using both CPU memory and CPU computation, minimizing data transfers over the PCIe link.
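To make the sparse-activation property concrete, here is a minimal, self-contained sketch of a Mixtral-style top-k MoE layer in numpy. The dimensions, router, and expert matrices are illustrative stand-ins, not the paper's implementation; the point is that only `top_k` of the `n_experts` expert weights are touched per token, so the rest can stay in CPU memory.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, gate_w, experts, top_k=2):
    """Route one token through its top-k experts (Mixtral-style sketch).

    x: (d,) token activation; gate_w: (d, n_experts) router weights;
    experts: list of (d, d) matrices standing in for expert FFNs.
    """
    logits = x @ gate_w                   # router score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the selected experts
    # Only the chosen experts execute; all others stay idle this step
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top)), top

d, n_experts = 16, 8
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y, chosen = moe_layer(x, gate_w, experts)
print(f"activated experts {sorted(chosen.tolist())} out of {n_experts}")
```

Because the router's choice varies token to token, an offloading system cannot know in advance which expert weights it will need, which is exactly why naive weight movement becomes the bottleneck Fiddler targets.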

Core Contributions

Fiddler's primary innovation lies in its approach to handling expert computations. It selectively uses CPU computational capabilities to execute expert layers, reducing the need to transfer large weights to the GPU. This strategy is particularly effective in single-batch, latency-critical scenarios where small batch sizes exacerbate inefficiencies in data transfer.

  1. CPU Utilization: Fiddler shifts the computation of certain operations to the CPU. This significantly reduces latency associated with transferring large weights over PCIe connections, typically a bottleneck in such setups.
  2. Single-Batch Efficiency: The inference system is optimized for local, single-request processing. By managing expert layers directly on CPUs, Fiddler provides a solution tailored for environments where only one GPU with constrained memory is available.
  3. Performance Improvement: By orchestrating CPU and GPU work, Fiddler achieves substantial speedups, up to 10.1 times faster than existing offloading techniques.
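The trade-off behind the first two points can be sketched as a simple latency comparison: run an expert's FFN on the CPU where its weights already live, or pay a PCIe transfer to bring the weights to the GPU. The cost model and all constants below are illustrative assumptions for a rough sketch, not measurements from the paper.

```python
def choose_placement(batch_tokens, expert_bytes,
                     pcie_gbps=16.0, cpu_s_per_token=0.005):
    """Decide where to execute one MoE expert for a small batch.

    Hypothetical cost model: moving the expert to the GPU costs a fixed
    PCIe weight transfer, while CPU execution costs time proportional to
    the number of tokens. Constants are illustrative placeholders.
    """
    transfer_s = expert_bytes / (pcie_gbps * 1e9)  # one-time weight copy
    cpu_s = batch_tokens * cpu_s_per_token         # CPU compute time
    return "cpu" if cpu_s < transfer_s else "gpu"

# One Mixtral-8x7B expert holds roughly 3 * 4096 * 14336 fp16 weights (~350 MB)
expert_bytes = 3 * 4096 * 14336 * 2
print(choose_placement(batch_tokens=1, expert_bytes=expert_bytes))
print(choose_placement(batch_tokens=100, expert_bytes=expert_bytes))
```

With a single token, CPU execution beats the weight transfer; only at larger batch sizes does copying the expert to the GPU amortize. This is why the approach is framed around single-batch, latency-critical inference.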

Performance Evaluation

Fiddler's performance was evaluated using the Mixtral-8x7B model, whose parameters exceed 90GB. Tests conducted on a Quadro RTX 6000 GPU and an L4 GPU showed a significant throughput improvement, generating over 3 tokens per second, a marked advance over other offloading methods. The evaluation covered a range of input and output token lengths, demonstrating Fiddler's robustness across scenarios.

Implications

Fiddler's design presents noteworthy advancements for MoE model deployment in resource-limited settings. By effectively utilizing heterogeneous hardware resources, it sets a precedent for balancing memory and compute management across CPUs and GPUs. This methodology not only enhances the practical deployment of large-scale LLMs in such environments but also provides a potential blueprint for future optimizations in AI model orchestration.

Future Directions

The development of Fiddler opens new avenues for research, particularly in exploring further efficiency gains in MoE architectures. Future work might investigate the integration of compression techniques with Fiddler's framework, potentially offering enhanced performance without significant loss in model quality. Additionally, adapting Fiddler to support evolving hardware configurations and ensuring compatibility with newer AI models could further its applicability and impact.

In conclusion, Fiddler represents a significant step forward in the efficient and practical deployment of MoE models, providing a solution to circumvent the limitations of current resource-constrained inference approaches by fully exploiting the capabilities of available hardware.

Authors (4)
  1. Keisuke Kamahori
  2. Yile Gu
  3. Kan Zhu
  4. Baris Kasikci
Citations (9)