MC#: Mixture Compressor for Mixture-of-Experts Large Models

Published 13 Oct 2025 in cs.LG and cs.AI | (2510.10962v1)

Abstract: Mixture-of-Experts (MoE) effectively scales LLMs and vision-LLMs (VLMs) by increasing capacity through sparse activation. However, preloading all experts into memory and activating multiple experts per input introduces significant computational and memory overhead, making the expert module a major contributor to model size and inference cost. To address this, we propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning by leveraging the significance of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading costs, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which optimizes bit allocation via linear programming, balancing expert importance and quantization error for a Pareto-optimal trade-off between size and performance. To reduce runtime computation, Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a subset of experts per token, enabling fine-grained control over activation. By combining PMQ's static bit-width optimization with OTP's dynamic routing, MC# achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC# achieves a 6.2 times weight reduction at 2.57 average bits with only a 1.7% accuracy drop across five multimodal benchmarks. Additionally, OTP reduces expert activation over 20% with less than 1% performance degradation, demonstrating strong potential for efficient MoE-based model deployment.


Authors (9)

Summary

The paper demonstrates that MC# achieves a 6.2x weight reduction at 2.57 average bits on DeepSeek-VL2, with only a 1.7% accuracy drop across five multimodal benchmarks, by combining static quantization with dynamic expert pruning.
It introduces Pre-Loading Mixed-Precision Quantization (PMQ), which allocates bit-widths to experts by solving an integer programming problem that balances expert importance against quantization error.
Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to select experts per token at inference time, reducing expert activation by over 20% with less than 1% performance degradation.

Mixture Compressor for Mixture-of-Experts Large Models

The paper introduces MC#, a framework for compressing mixture-of-experts (MoE) models, with a focus on LLMs and vision-LLMs (VLMs). It combines static quantization with dynamic expert pruning to reduce the memory and computational overhead of MoE architectures.

Introduction

MoE is an efficient scaling method for LLMs and VLMs: it expands model capacity with multiple expert branches while keeping activation sparse. However, MoE models are costly to deploy, because all experts must be preloaded into memory and several experts are activated for each input, which makes the expert module a major contributor to model size and inference cost. The paper addresses these challenges with MC#, a unified mixture compressor framework that combines static quantization and dynamic expert pruning.

Figure 1: Comparison of total parameter size and inference activated parameter size on various large models and compressed Mixtral MoE-LLMs and DeepSeek VLMs.

Framework Overview

The MC# framework consists of two primary components:

Pre-Loading Mixed-Precision Quantization (PMQ): PMQ aims to reduce storage and loading overhead by optimizing bit allocation using a linear programming model, considering expert significance and quantization error for achieving a Pareto-optimal trade-off between model size and performance.
Online Top-any Pruning (OTP): OTP enhances inference efficiency by dynamically selecting a subset of experts for activation during runtime. It leverages Gumbel-Softmax sampling to model expert activation as a learnable distribution, allowing fine-grained control over which experts are activated.

Figure 2: Performance across different LLMs and VLMs with varying activated parameters.

Implementation Details

Pre-Loading Mixed-Precision Quantization (PMQ)

PMQ statically compresses the experts of an MoE model by assigning each expert its own bit-width. The allocation is formulated as an integer programming problem (the linear program mentioned in the abstract, with bit-widths restricted to discrete values). It chooses quantization configurations using expert significance metrics derived from access frequency and activation-weighted importance, and balances these against quantization error.
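
The allocation idea can be illustrated with a minimal sketch. This is not the paper's solver: the importance scores, the error table, the candidate bit-widths, and the function name `allocate_bits` are all hypothetical, and the tiny problem is solved by exhaustive search rather than a real integer programming solver. It only shows the shape of the objective: minimize importance-weighted quantization error under an average bit-width budget.

```python
from itertools import product


def allocate_bits(importance, quant_error, bit_choices, avg_bits):
    """Exhaustively solve a toy bit-allocation integer program.

    importance[i]     : hypothetical significance score of expert i
    quant_error[i][b] : hypothetical quantization error of expert i at b bits
    bit_choices       : allowed bit-widths (an assumption, e.g. 1, 2, 3)
    avg_bits          : target average bit-width across experts
    """
    n = len(importance)
    budget = round(avg_bits * n)  # total number of bits that may be spent
    best_cost, best_assignment = float("inf"), None
    # Enumerate every assignment; only feasible for a handful of experts.
    for assignment in product(bit_choices, repeat=n):
        if sum(assignment) > budget:
            continue  # violates the size constraint
        cost = sum(
            importance[i] * quant_error[i][b] for i, b in enumerate(assignment)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost


# Illustrative numbers only: four experts sharing one error table.
importance = [0.5, 0.3, 0.15, 0.05]
error_table = {1: 1.0, 2: 0.3, 3: 0.1}
quant_error = [error_table] * len(importance)

bits, cost = allocate_bits(importance, quant_error, (1, 2, 3), avg_bits=2.0)
print(bits, round(cost, 3))
```

With these made-up inputs, the most important expert receives the most bits and the least important one the fewest, which is the behavior a significance-aware allocation is meant to produce.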

Figure 3: Overview of MC# pipeline including PMQ and OTP stages for experts.

Online Top-any Pruning (OTP)

OTP learns to adjust expert activation dynamically at inference time. It models expert activation as a learnable distribution and uses Gumbel-Softmax sampling to make the selection of active experts differentiable, so a learnable router can be trained to prune experts token by token. This reduces the number of experts computed per token without significant performance degradation.
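
As a rough illustration of the sampling step, the sketch below draws a keep/drop decision for each candidate expert of one token with Gumbel-Softmax. The per-expert (keep, drop) logit pairs, the fallback that always keeps the first expert, and the function names are assumptions made for this example; the paper's router may parameterize the choice differently. The sketch uses only the Python standard library and shows the forward sampling only, not training.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot sample over `logits` at temperature `tau`."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


def sample_expert_mask(keep_drop_logits, tau=1.0, rng=None):
    """Sample a 0/1 mask over one token's candidate experts.

    keep_drop_logits[j] = (logit_keep, logit_drop) for candidate expert j;
    this parameterization is hypothetical.
    """
    rng = rng or random.Random()
    mask = []
    for pair in keep_drop_logits:
        probs = gumbel_softmax(pair, tau, rng)
        mask.append(1 if probs[0] >= probs[1] else 0)  # hard decision
    if not any(mask):
        mask[0] = 1  # assumption: never drop every expert for a token
    return mask


# One token with two routed experts: strongly keep the first, drop the second.
mask = sample_expert_mask([(50.0, -50.0), (-50.0, 50.0)], rng=random.Random(0))
print(mask)
```

During training, the soft probabilities would typically be kept for the backward pass (for example with a straight-through estimator) so that gradients can reach the logits; the hard mask is what determines which experts are actually computed.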

Figure 4: Workflow of OTP demonstrating selective expert activation and pruning.

Experimental Results

The paper reports evaluation results on both MoE-LLMs and MoE-VLMs across various benchmarks. MC# achieves significant compression while maintaining performance comparable to the full-precision baselines. On DeepSeek-VL2, the hybrid compression strategy yields a 6.2x weight reduction at 2.57 average bits, with only a 1.7% accuracy drop across five multimodal benchmarks. Furthermore, the OTP stage reduces expert activation by over 20% with less than 1% performance degradation.
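
A quick back-of-the-envelope check relates the reported average bit-width to the reported weight reduction. The sketch assumes a 16-bit baseline and ignores quantization metadata and any weights left unquantized; both are assumptions for illustration, not details taken from the paper.

```python
def weight_reduction(avg_bits, baseline_bits=16.0):
    """Ratio of baseline to compressed storage per weight (metadata ignored)."""
    return baseline_bits / avg_bits


ratio = weight_reduction(2.57)
print(f"{ratio:.2f}x")  # 6.23x, in line with the reported 6.2x figure
```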

Figure 5: Quantized performance of Mixtral under different mixed-precision strategies.

Conclusion

The study addresses the deployment challenges of MoE models with a two-part compression framework: mixed-precision quantization reduces storage and loading costs, and dynamic pruning reduces runtime computation. Future work could explore further applications in different multimodal settings and optimize these strategies for specific hardware configurations.

Figure 6: Ablation of mask ratio during training showcasing optimal pruning strategies.

Together, these techniques make large MoE models more practical to deploy in diverse computing environments.
