MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Published 8 Apr 2026 in cs.LG and cs.AI | (2604.06798v3)

Abstract: Mixture-of-Experts (MoE) based LLMs offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.


Authors (4)

Summary

The paper presents MoBiE, a PTQ binarization framework that uses CEJD to reduce cross-expert redundancy and preserve key semantic structures.
It leverages GLAS to align weight importance with global loss and employs NGES to confine quantization error, ensuring stable expert routing.
Empirical results demonstrate up to 52.2% perplexity reduction, 43.4% accuracy gains, and over 2× speedup in MoE-based LLMs with significant memory savings.

MoBiE: Efficient Binarization for MoE-based LLMs under Post-Training Quantization

Motivation and Context

Mixture-of-Experts (MoE) architectures have gained prominence in scaling LLMs, exploiting sparse activation to maximize compute efficiency. However, MoE-based LLMs introduce severe memory bottlenecks at deployment, requiring all expert parameters to remain resident despite sparse activation. Weight binarization, offering maximal compression and inference acceleration, is attractive for such resource-constrained scenarios. Existing binarization methods, optimized for dense LLMs, degrade substantially when directly applied to MoE models due to cross-expert redundancy, task-insensitive weight importance estimation, and quantization-induced expert-shift. MoBiE addresses these MoE-specific challenges, introducing a PTQ binarization framework that is explicitly designed for MoE LLMs.

Method: MoBiE Framework and Technical Contributions

MoBiE implements three methodological innovations targeting the key binarization failure modes of MoE:

Cross-Expert Joint Decomposition (CEJD)

CEJD applies a joint SVD across expert weights to uncover their shared semantic structure, extracting a high-precision backbone and binarizing only the expert-specific projections. This reduces redundant storage and confines binarization error to a stable orthogonal basis. The backbone is kept at modest precision (e.g., 8-bit) and adds negligible overhead because it is compact relative to the expert pool.
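The idea can be illustrated with a minimal NumPy sketch. It assumes the experts are concatenated along the input dimension, that the shared backbone is the top-`rank` left singular vectors, and that projections are binarized with a per-row mean-absolute-value scale. The function names `joint_decompose` and `binarize_rows`, the rank, and the synthetic experts are hypothetical choices for illustration and are not taken from the paper; the backbone is left in full precision here rather than quantized to 8 bits.

```python
import numpy as np

def joint_decompose(experts, rank):
    """Shared left basis from the SVD of the horizontally stacked experts."""
    stacked = np.concatenate(experts, axis=1)        # (d_out, E * d_in)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U = U[:, :rank]                                  # shared backbone, orthonormal columns
    projections = [U.T @ W for W in experts]         # expert-specific, (rank, d_in)
    return U, projections

def binarize_rows(M):
    """Sign binarization with one scale per row: alpha_i = mean(|M_i|)."""
    alpha = np.abs(M).mean(axis=1, keepdims=True)
    return alpha * np.where(M >= 0, 1.0, -1.0)

# Synthetic experts that share a rank-4 structure plus small expert-specific noise.
rng = np.random.default_rng(0)
shared = rng.standard_normal((16, 4))
experts = [
    shared @ rng.standard_normal((4, 32)) + 0.01 * rng.standard_normal((16, 32))
    for _ in range(4)
]

U, projections = joint_decompose(experts, rank=4)

# Only the projections are binarized; the backbone U stays high precision.
approx = [U @ binarize_rows(P) for P in projections]

# Error of the decomposition alone (before binarization) for the first expert.
rel_err = np.linalg.norm(U @ projections[0] - experts[0]) / np.linalg.norm(experts[0])
```

In this toy setting the shared basis captures almost all of each expert, so the remaining approximation error comes from binarizing the small projection matrices rather than from the decomposition itself.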

Global Loss-Aligned Saliency (GLAS)

GLAS augments conventional layerwise Hessian saliency with first-order global loss gradients, aligning importance estimation with downstream task objectives. For each output channel $j$, GLAS constructs a Hessian $H_{\text{global}, j}$ weighted by squared global loss gradients derived from calibration data and computes per-weight saliency as $s_{ij} = \frac{w_{ij}^2}{\left([H_{\text{global}, j}^{-1}]_{ii}\right)^2}$, where $w_{ij}$ is the $i$-th weight of channel $j$. This task-aware saliency prioritizes the weights most critical to downstream performance and informs adaptive mixed-order binarization, helping to preserve accuracy.
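A small NumPy sketch of this saliency computation follows. It assumes the per-channel Hessian is the sum of outer products of calibration inputs, each weighted by the squared gradient of the global loss with respect to that channel's output, plus a damping term for invertibility. The name `glas_saliency`, the arguments `X`, `G`, and `damp`, and the damping scheme are hypothetical and may differ from the paper's implementation.

```python
import numpy as np

def glas_saliency(W, X, G, damp=1e-2):
    """Saliency s_ij = w_ij^2 / ([H_j^{-1}]_ii)^2 with a gradient-weighted Hessian.

    W: (d_in, d_out) weights, column j is output channel j.
    X: (n, d_in) calibration inputs.
    G: (n, d_out) gradients of the global loss w.r.t. the layer outputs.
    """
    d_in, d_out = W.shape
    S = np.empty_like(W)
    for j in range(d_out):
        Xw = X * G[:, j:j + 1]                 # scale each sample by its gradient
        H = Xw.T @ Xw                          # sum_t g_tj^2 * x_t x_t^T
        # Damping (an assumption) keeps H invertible.
        H += (damp * np.mean(np.diag(H)) + 1e-8) * np.eye(d_in)
        hinv_diag = np.diag(np.linalg.inv(H))
        S[:, j] = W[:, j] ** 2 / hinv_diag ** 2
    return S

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))
X = rng.standard_normal((64, 8))
G = rng.standard_normal((64, 3))
S = glas_saliency(W, X, G)

# Indices of the most salient weights per channel, e.g. candidates for
# higher-order binarization.
top2_per_channel = np.argsort(-S, axis=0)[:2]
```

Because the Hessian does not depend on the weights, doubling a weight quadruples its saliency, while channels with larger loss gradients get a larger Hessian and therefore higher saliency overall.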

Null-Space Guided Expert-Shift Suppression (NGES)

NGES confines binarization error to the null space of the input activations, a routing-insensitive subspace, and thereby suppresses quantization-induced expert-shift. It does so via implicit row/column scaling vectors that are fused with the existing binarization scales and optimized through a regularized least-squares objective with alternating closed-form updates. This preserves routing stability and prevents the collapse of expert assignment, a major issue in binarized MoE LLMs.
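The reason the null space is a safe place for error can be shown with a short NumPy sketch. This is not the scaling-vector optimization described above; it only demonstrates, under the assumption of a small calibration set `X` with fewer tokens than features, that the component of a binarization error lying in the null space of `X` leaves the layer outputs on those inputs unchanged. The name `null_space_projector` and all shapes are hypothetical.

```python
import numpy as np

def null_space_projector(X, rtol=1e-10):
    """Orthogonal projector onto {v : X v = 0}, from the right singular vectors of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(s > rtol * s[0]))
    N = Vt[rank:].T                       # (d_in, d_in - rank) orthonormal basis
    return N @ N.T

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 32))          # 8 calibration tokens, 32 features
W = rng.standard_normal((32, 4))          # (d_in, d_out) weights

# Plain sign binarization with one scale per output column.
alpha = np.abs(W).mean(axis=0, keepdims=True)
W_bin = alpha * np.where(W >= 0, 1.0, -1.0)

E = W - W_bin                             # raw binarization error
P = null_space_projector(X)
E_null = P @ E                            # error component inside the null space

out_shift_raw = np.linalg.norm(X @ E)        # clearly nonzero
out_shift_null = np.linalg.norm(X @ E_null)  # numerically zero
```

An error confined to this subspace does not change what downstream layers, including the router, see for the calibration inputs, which is the property NGES aims for.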

Numerical Results and Performance Evaluation

MoBiE is systematically evaluated across six diverse MoE-based LLMs (including Qwen, DeepSeek, OLMoE, Mixtral, GPT-OSS) and a suite of benchmarks (WikiText2 perplexity, ARC, HellaSwag, LAMBADA, PIQA, WinoGrande, MMLU, GSM8K, HumanEval).

On Qwen3-30B-A3B, MoBiE achieves a 52.2% perplexity reduction, a 43.4% increase in average zero-shot accuracy, and a >2× inference speedup compared to state-of-the-art baselines, with activated parameter memory reduced by over 90%.
MoBiE consistently outperforms both MoE-specific quantizers (MoEQuant, QuantMoE-Bench, MxMoE) and general low-bit PTQ methods (AWQ, GPTQ, BiLLM, ARB-LLM) under 1–2 bit settings. It matches or exceeds 3-bit GPTQ at less than half the memory cost.
Strong robustness is observed on instruction-tuned models and complex reasoning tasks; performance drop under binarization is minimal compared to catastrophic degradation in prior methods.
The efficiency analysis shows higher quantization throughput and lower memory use without sacrificing accuracy, supporting MoBiE's practicality for deployment.

Empirical Analysis and Ablations

Ablation studies detail the necessity and synergy of MoBiE’s components:

CEJD yields maximal accuracy gains by eliminating cross-expert redundancy.
GLAS brings consistent perplexity reduction across multiple corpora and seamlessly integrates with other PTQ backbones.
NGES uniquely addresses routing distortion, maintaining expert assignment similarity to the full-precision baseline.
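One simple way to quantify expert assignment similarity is the average overlap between the top-k experts chosen by the full-precision and quantized routers. The sketch below is an assumption about how such a measure could be computed; the source does not state the exact metric, and the function names and logits are hypothetical.

```python
def top_k(logits, k):
    """Indices of the k largest router logits for one token."""
    order = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return set(order[:k])

def routing_overlap(fp_logits, q_logits, k):
    """Mean fraction of top-k experts shared between two routers, per token."""
    total = 0.0
    for fp, q in zip(fp_logits, q_logits):
        total += len(top_k(fp, k) & top_k(q, k)) / k
    return total / len(fp_logits)

# Two tokens, four experts, top-2 routing (made-up logits).
fp = [[2.0, 1.0, 0.1, -1.0], [0.3, 0.2, 1.5, 1.4]]
q = [[1.8, 1.1, 0.0, -0.9], [0.3, 1.6, 1.5, 0.2]]

score = routing_overlap(fp, q, k=2)
print(score)  # 0.75: the first token keeps both experts, the second keeps one of two
```

A score of 1.0 means the quantized model routes every token to the same experts as the full-precision model; lower values indicate expert-shift.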

Bitwidth analysis indicates that 8-bit precision for the shared backbone gives the best trade-off between storage and accuracy; lower-bit settings cause sharp performance drops.
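To give a feel for why the backbone bitwidth matters, the following sketch quantizes a random orthonormal matrix with a symmetric per-tensor uniform quantizer at 8, 4, and 2 bits and compares reconstruction error. The quantizer, the function name `quantize_symmetric`, and the stand-in matrix are assumptions for illustration; the paper's backbone quantizer may differ, and reconstruction error is only a proxy for model accuracy.

```python
import numpy as np

def quantize_symmetric(M, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8-bit
    scale = np.abs(M).max() / levels
    q = np.clip(np.round(M / scale), -levels, levels)
    return q * scale

rng = np.random.default_rng(0)
# Stand-in for a shared backbone: a matrix with orthonormal columns.
U, _ = np.linalg.qr(rng.standard_normal((64, 8)))

errors = {
    bits: float(np.linalg.norm(U - quantize_symmetric(U, bits)) / np.linalg.norm(U))
    for bits in (8, 4, 2)
}
```

Under these assumptions the relative error at 8 bits is small, while at 2 bits most entries round to zero and the basis is largely lost.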

Plug-and-play evaluations confirm that GLAS and NGES can augment other PTQ methods, demonstrating their general utility beyond MoBiE.

Practical and Theoretical Implications

MoBiE advances post-training quantization methodology for MoE-based LLMs, enabling extreme model compression without retraining. The framework preserves expert specialization and stable gating, facilitating efficient inference in edge and multi-tenant environments.

Theoretically, joint decomposition exposes the latent structure of expert pools, suggesting directions for further factorization-based compression. Loss-aligned saliency bridges local reconstruction sensitivity and global task relevance. Null-space projection offers principled noise suppression for quantization in models where routing is tightly coupled to expert parameters.

Future Directions

Potential developments include:

Extension of joint decomposition to hierarchical mixtures or dynamic expert pools.
Adaptive calibration and saliency estimation for distribution-shifted deployment scenarios.
Hardware-software co-design for binarized MoE inference (e.g., FPGA/ASIC acceleration, mixed-precision routing).
Investigating combined binarization and pruning or ternary quantization for ultra-low resource constraints.

Conclusion

MoBiE establishes a rigorous PTQ binarization framework for MoE-based LLMs, combining structural decomposition, global saliency alignment, and domain-specific error control to deliver both accuracy and efficiency. Its superior empirical performance across models, datasets, and instruction-tuned settings highlights its value for practical deployment and theoretical understanding of MoE quantization (2604.06798).
