Analysis of CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
The paper under discussion introduces CMoE (Carved Mixture-of-Experts), a framework that improves the inference efficiency of large language models (LLMs) by transforming dense models into mixture-of-experts (MoE) architectures. It addresses the substantial computational cost of deploying large LLMs in resource-constrained environments, presenting an approach that retains model performance while optimizing for efficiency.
The central idea of CMoE is to exploit the high activation sparsity inherent in the feed-forward networks (FFNs) of LLMs. The framework carves MoE models out of dense checkpoints by reorganizing FFN parameters, without extensive retraining. The carving proceeds in two main phases: efficient expert grouping and training-free router construction.
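To ground the sparsity claim, here is a minimal sketch (not the authors' code) of how per-neuron activation rates in a LLaMA-style gated FFN could be profiled over a small calibration set; the module attribute names (`gate_proj`, `up_proj`) and the activation threshold are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ffn_activation_rates(ffn, hidden_states, threshold=0.1):
    """Estimate how often each intermediate FFN neuron is active.

    ffn: a LLaMA-style gated FFN exposing gate_proj and up_proj (assumed attribute names).
    hidden_states: [num_tokens, hidden_dim] hidden states from a small calibration set.
    threshold: illustrative magnitude cutoff for counting a neuron as "active".
    """
    # Gated intermediate activations, shape [num_tokens, intermediate_dim].
    inter = torch.nn.functional.silu(ffn.gate_proj(hidden_states)) * ffn.up_proj(hidden_states)
    # Fraction of calibration tokens on which each neuron exceeds the threshold.
    return (inter.abs() > threshold).float().mean(dim=0)

# Neurons with near-universal activation rates are candidates for shared experts;
# the remaining, token-dependent neurons are candidates for routed experts.
```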
Methodology Overview
- Expert Grouping:
- Shared Experts: Neurons exhibiting universally high activation rates are grouped into shared experts. These are always active, capturing broad knowledge features.
- Routed Experts: Neurons with more specialized, token-dependent activations are organized into routed experts. The paper formulates this grouping as a linear assignment problem solved with the Jonker-Volgenant algorithm, ensuring balanced expert sizes (a grouping sketch follows this list).
- Router Construction:
- CMoE derives its routing mechanism directly from the dense model's activation statistics, yielding a working router without any retraining (a router sketch follows this list).
- The framework also supports differentiable routing, which adds flexibility and allows performance to be recovered via lightweight fine-tuning.
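The balanced grouping of neurons into routed experts, which the paper casts as a linear assignment problem, can be sketched as follows. The sketch uses scipy.optimize.linear_sum_assignment, whose implementation is a modified Jonker-Volgenant algorithm; the cost definition (distance from each neuron's activation profile to illustrative centroids) and the column-duplication trick for enforcing equal capacities are assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # modified Jonker-Volgenant solver

def balanced_expert_assignment(neuron_features, num_experts):
    """Assign FFN neurons to routed experts with equal expert sizes.

    neuron_features: [num_neurons, feature_dim] array, e.g. each neuron's activation
        profile over a calibration set (an assumed feature choice).
    Returns an expert index for every neuron, shape [num_neurons].
    """
    num_neurons = neuron_features.shape[0]
    assert num_neurons % num_experts == 0, "experts are assumed to be equally sized"
    capacity = num_neurons // num_experts

    # Illustrative centroids: evenly spaced neurons as provisional expert representatives.
    centroids = neuron_features[::capacity][:num_experts]

    # Cost of placing neuron i in expert j: squared distance to that expert's centroid.
    cost = ((neuron_features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)

    # Duplicate each expert column 'capacity' times so the assignment is balanced,
    # then solve the resulting square linear assignment problem.
    expanded_cost = np.repeat(cost, capacity, axis=1)   # [num_neurons, num_neurons]
    rows, cols = linear_sum_assignment(expanded_cost)
    return cols[np.argsort(rows)] // capacity           # expert id per neuron
```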
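Likewise, a training-free router built from dense-model activation statistics might look like the following sketch: each routed expert is represented by an activation-derived centroid (an assumed statistic), and tokens are routed to the top-k most similar experts. Promoting the centroid buffer to an nn.Parameter is one way to make the routing weights differentiable for the lightweight fine-tuning stage.

```python
import torch

class AnalyticRouter(torch.nn.Module):
    """Training-free top-k router built from dense-model activation statistics."""

    def __init__(self, expert_centroids, top_k=2):
        super().__init__()
        # expert_centroids: [num_experts, hidden_dim], e.g. the mean hidden state that
        # most strongly drives each expert's neurons on a calibration set (assumed).
        # Stored as a buffer for the training-free setting; register it as an
        # nn.Parameter instead to make routing differentiable for fine-tuning.
        self.register_buffer("centroids", expert_centroids)
        self.top_k = top_k

    def forward(self, hidden_states):
        # hidden_states: [num_tokens, hidden_dim]
        scores = hidden_states @ self.centroids.T           # [num_tokens, num_experts]
        weights, indices = scores.topk(self.top_k, dim=-1)  # select top-k routed experts
        weights = torch.softmax(weights, dim=-1)            # normalized routing weights
        return indices, weights
```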
Empirical Results
CMoE demonstrates compelling results in both training-free and fine-tuned settings. Even at a reduced activation ratio, CMoE maintains reasonable perplexity while avoiding the large datasets and compute budgets that conventional MoE training requires. On benchmarks such as WikiText-2 and C4, CMoE reaches perplexities as low as $12.73$ with fine-tuning, a substantial improvement over baseline methods such as LLaMA-MoE.
Furthermore, CMoE shows competitive performance on downstream tasks. Across benchmarks such as BoolQ and SciQ, it consistently outperforms the baseline LLaMA-MoE model. After fine-tuning, CMoE recovers most of the dense model's accuracy on SciQ, illustrating its ability to retain substantial performance while improving efficiency.
Implications and Future Work
The introduction of CMoE presents significant implications for the deployment of LLMs, particularly in scenarios with stringent latency and hardware constraints. By demonstrating a viable method to reduce the inference overhead while sustaining high performance, the framework potentially sets a new direction for the development of more efficient LLM architectures.
Looking forward, the research opens up pathways for further exploration into:
- Enhanced routing strategies that could further reduce the need for training while preserving efficiency.
- Adaptations across varying architectures and application domains beyond the LLaMA model, extending CMoE's applicability.
- Integration with other model compression techniques, such as pruning and quantization, to further optimize computational demands.
Overall, CMoE represents a significant step forward in the practical deployment of LLMs, efficiently leveraging existing parameters while minimizing additional computational burdens. Its methodological innovations and empirical results underscore its potential as a transformative approach in the field of efficient machine learning inference.