- The paper presents MoBiE, a PTQ binarization framework that uses CEJD to reduce cross-expert redundancy and preserve key semantic structures.
- It leverages GLAS to align weight importance with global loss and employs NGES to confine quantization error, ensuring stable expert routing.
- Empirical results demonstrate up to 52.2% perplexity reduction, 43.4% accuracy gains, and over 2× speedup in MoE-based LLMs with significant memory savings.
MoBiE: Efficient Binarization for MoE-based LLMs under Post-Training Quantization
Motivation and Context
Mixture-of-Experts (MoE) architectures have gained prominence in scaling LLMs, exploiting sparse activation to maximize compute efficiency. However, MoE-based LLMs introduce severe memory bottlenecks at deployment, requiring all expert parameters to remain resident despite sparse activation. Weight binarization, offering maximal compression and inference acceleration, is attractive for such resource-constrained scenarios. Existing binarization methods, optimized for dense LLMs, degrade substantially when directly applied to MoE models due to cross-expert redundancy, task-insensitive weight importance estimation, and quantization-induced expert-shift. MoBiE addresses these MoE-specific challenges, introducing a PTQ binarization framework that is explicitly designed for MoE LLMs.
Method: MoBiE Framework and Technical Contributions
MoBiE implements three methodological innovations targeting the key binarization failure modes of MoE:
Cross-Expert Joint Decomposition (CEJD)
CEJD leverages joint SVD to uncover shared semantic structure across expert weights, extracting a high-precision backbone and binarizing only expert-specific projections. This reduces redundant storage and confines binarization error to a stable orthogonal basis. The backbone is kept at modest precision (e.g., 8-bit), which adds negligible overhead because it is compact relative to the expert pool.
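The idea of a shared backbone plus binarized expert-specific projections can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not stated in the summary above: the experts are concatenated along the input dimension before a single SVD, the backbone is the top-`rank` left singular vectors, the projections are binarized with a per-row mean-absolute scale, and the backbone uses symmetric per-tensor 8-bit quantization. The function names `joint_decompose`, `binarize_rows`, and `quantize_int8` are hypothetical and are not taken from MoBiE.

```python
import numpy as np

def joint_decompose(experts, rank):
    """Shared orthonormal basis from one SVD over all experts (assumed scheme).

    experts: list of arrays of shape (d_out, d_in).
    Returns the basis (d_out, rank) and one projection (rank, d_in) per expert.
    """
    stacked = np.concatenate(experts, axis=1)           # (d_out, E * d_in)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    basis = U[:, :rank]                                 # orthonormal columns
    coeffs = [basis.T @ W for W in experts]             # expert-specific parts
    return basis, coeffs

def binarize_rows(M):
    """Sign binarization with a per-row scale equal to the mean |value|."""
    alpha = np.abs(M).mean(axis=1, keepdims=True)
    return alpha * np.where(M >= 0, 1.0, -1.0)

def quantize_int8(M):
    """Symmetric per-tensor 8-bit quantization; returns integers and scale."""
    scale = np.abs(M).max() / 127.0
    q = np.clip(np.round(M / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy experts that share a rank-4 structure plus small expert-specific noise.
rng = np.random.default_rng(0)
shared = rng.standard_normal((16, 4))
experts = [shared @ rng.standard_normal((4, 32))
           + 0.01 * rng.standard_normal((16, 32)) for _ in range(4)]

basis, coeffs = joint_decompose(experts, rank=4)
q_basis, s_basis = quantize_int8(basis)                 # 8-bit backbone
approx = [(q_basis * s_basis) @ binarize_rows(C) for C in coeffs]
```

In this sketch the binarization error lives in the `rank`-dimensional coefficient space, so it is always expressed in the shared orthonormal basis rather than in each expert's full weight space.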
Global Loss-Aligned Saliency (GLAS)
GLAS augments conventional layerwise Hessian saliency with first-order global loss gradients, aligning importance estimation with downstream task objectives. For each output channel $j$, GLAS constructs a Hessian $H_{\mathrm{global},j}$ weighted by squared global loss gradients derived from calibration data and computes per-weight saliency as $s_{ij} = \frac{w_{ij}^2}{\left([H_{\mathrm{global},j}^{-1}]_{ii}\right)^2}$. This task-aware saliency prioritizes weights most critical to downstream performance and informs adaptive mixed-order binarization, ensuring robust accuracy preservation.
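A minimal sketch of the saliency computation described above, under several assumptions: the per-channel Hessian is taken to be the sum over calibration samples of the squared output gradient times the outer product of the input activation, a small damping term is added before inversion, and weights are stored with shape `(d_in, d_out)` so that `W[i, j]` matches the indices of the saliency formula. The function name `glas_saliency` and the parameter `damp` are hypothetical; the exact construction used by MoBiE may differ.

```python
import numpy as np

def glas_saliency(W, X, G, damp=1e-2):
    """Gradient-weighted Hessian saliency (illustrative, assumed form).

    W: (d_in, d_out) weights, W[i, j] connects input i to output channel j.
    X: (N, d_in) calibration input activations.
    G: (N, d_out) gradients of the global loss w.r.t. the layer outputs.
    Returns S with S[i, j] = W[i, j]**2 / (inv(H_j)[i, i])**2.
    """
    d_in, d_out = W.shape
    S = np.empty((d_in, d_out), dtype=np.float64)
    for j in range(d_out):
        Xw = X * G[:, j:j + 1]              # scale sample n by its gradient g_nj
        H = Xw.T @ Xw                       # sum_n g_nj^2 * x_n x_n^T
        H = H + damp * np.eye(d_in)         # damping keeps H invertible
        h_inv_diag = np.diag(np.linalg.inv(H))
        S[:, j] = W[:, j] ** 2 / h_inv_diag ** 2
    return S

# Toy usage with random calibration data.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
G = rng.standard_normal((64, 3))
W = rng.standard_normal((8, 3))
S = glas_saliency(W, X, G)
salient = S >= np.quantile(S, 0.9)          # e.g. the top 10% of weights
```

One consequence of this form is that a channel whose outputs receive larger loss gradients gets a larger Hessian and therefore higher saliency for the same weight magnitude, which is the sense in which the score is aligned with the global loss.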
Null-Space Guided Expert-Shift Suppression (NGES)
NGES confines binarization error to the null space of input activations, a routing-insensitive subspace which suppresses quantization-induced expert-shift. NGES achieves this via implicit row/column scaling vectors, fused with existing binarization scales and optimized through a regularized least-squares objective with alternating closed-form updates. This approach preserves routing stability and prevents collapse of expert assignment, a major issue in binarized MoE LLMs.
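The alternating closed-form updates can be sketched on a simplified stand-in objective. The sketch below assumes a binarized weight of the form `diag(r) @ B @ diag(c)` with `B = sign(W)`, and minimizes the squared output error on calibration activations plus a ridge penalty that pulls `r` and `c` toward their initial values. Driving the output error toward zero is what pushes the weight error toward directions the activations do not excite. The objective, the names `fit_scales` and `output_error`, and the parameter `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def output_error(W, W_hat, X):
    """Squared Frobenius norm of the weight error as seen by activations X."""
    return np.linalg.norm((W - W_hat) @ X) ** 2

def fit_scales(W, X, lam=1e-3, iters=10):
    """Alternating ridge updates for row scales r and column scales c.

    W: (d_out, d_in) full-precision weights, X: (d_in, N) activations.
    Minimizes ||(W - diag(r) B diag(c)) X||^2
              + lam * (||r - r0||^2 + ||c - c0||^2)   (assumed objective).
    """
    B = np.where(W >= 0, 1.0, -1.0)
    r0 = np.abs(W).mean(axis=1)             # standard binarization scale
    c0 = np.ones(W.shape[1])
    r, c = r0.copy(), c0.copy()
    S = X @ X.T                             # (d_in, d_in) activation Gram matrix
    T = W @ X                               # target outputs
    WS = W @ S
    for _ in range(iters):
        # Row update: one scalar ridge regression per output row.
        P = (B * c) @ X
        r = ((T * P).sum(axis=1) + lam * r0) / ((P * P).sum(axis=1) + lam)
        # Column update: a single ridge system shared by all rows.
        A = B * r[:, None]
        lhs = S * (A.T @ A) + lam * np.eye(c.size)
        rhs = (A * WS).sum(axis=0) + lam * c0
        c = np.linalg.solve(lhs, rhs)
    return r, B, c

# Toy usage: compare against plain per-row sign binarization.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 10))
X = rng.standard_normal((10, 40))
r, B, c = fit_scales(W, X)
plain = np.abs(W).mean(axis=1, keepdims=True) * B
fitted = r[:, None] * B * c[None, :]
err_plain, err_fitted = output_error(W, plain, X), output_error(W, fitted, X)
```

Because each update exactly minimizes the regularized objective in one block of variables, and the penalty is zero at initialization, the fitted output error in this sketch cannot exceed that of the plain binarization. The scales `r` and `c` can be folded into existing per-row and per-column binarization scales, so no extra tensors are needed at inference time.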
MoBiE is systematically evaluated across six diverse MoE-based LLMs (including Qwen, DeepSeek, OLMoE, Mixtral, GPT-OSS) and a suite of benchmarks (WikiText2 perplexity, ARC, HellaSwag, LAMBADA, PIQA, WinoGrande, MMLU, GSM8K, HumanEval).
- On Qwen3-30B-A3B, MoBiE achieves a 52.2% perplexity reduction, a 43.4% increase in average zero-shot accuracy, and a >2× inference speedup compared to state-of-the-art baselines, with activated parameter memory reduced by over 90%.
- MoBiE consistently outperforms both MoE-specific quantizers (MoEQuant, QuantMoE-Bench, MxMoE) and general low-bit PTQ methods (AWQ, GPTQ, BiLLM, ARB-LLM) under 1–2 bit settings. It matches or exceeds 3-bit GPTQ at less than half the memory cost.
- Strong robustness is observed on instruction-tuned models and complex reasoning tasks; performance drop under binarization is minimal compared to catastrophic degradation in prior methods.
- The efficiency analysis reveals quantization throughput increases and memory savings without sacrificing accuracy, demonstrating MoBiE's practicality for deployment.
Empirical Analysis and Ablations
Ablation studies detail the necessity and synergy of MoBiE's components:
- CEJD yields maximal accuracy gains by eliminating cross-expert redundancy.
- GLAS brings consistent perplexity reduction across multiple corpora and seamlessly integrates with other PTQ backbones.
- NGES uniquely addresses routing distortion, maintaining expert assignment similarity to the full-precision baseline.
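Expert assignment similarity can be quantified in several ways, and the summary does not say which metric the paper uses. One simple option, sketched here as an assumption, is the average fraction of top-k experts that the quantized router shares with the full-precision router for each token. The names `topk_experts` and `routing_overlap` are hypothetical.

```python
import numpy as np

def topk_experts(logits, k):
    """Indices of the k largest router logits per token (order ignored)."""
    return np.argpartition(-logits, k - 1, axis=1)[:, :k]

def routing_overlap(logits_fp, logits_q, k=2):
    """Mean fraction of top-k experts shared between two routers.

    logits_fp, logits_q: (num_tokens, num_experts) router logits from the
    full-precision and quantized models on the same tokens.
    """
    top_fp = topk_experts(logits_fp, k)
    top_q = topk_experts(logits_q, k)
    shared = [len(set(a.tolist()) & set(b.tolist()))
              for a, b in zip(top_fp, top_q)]
    return float(np.mean(shared)) / k

# Toy usage: one token with four experts and top-2 routing.
fp = np.array([[3.0, 2.0, 1.0, 0.0]])
q = np.array([[3.0, 0.0, 2.0, 1.0]])
overlap = routing_overlap(fp, q, k=2)   # experts {0, 1} vs {0, 2}
```

A value of 1.0 means the quantized model routes every token to the same set of experts as the full-precision model, while lower values indicate expert-shift.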
Bitwidth analysis indicates that 8-bit precision for the shared backbone achieves the best trade-off between storage and accuracy; lower-bit settings yield sharp performance drops.
Plug-and-play evaluations confirm that GLAS and NGES can augment other PTQ methods, demonstrating their general utility beyond MoBiE.
Practical and Theoretical Implications
MoBiE advances post-training quantization methodology for MoE-based LLMs, enabling extreme model compression without retraining. The framework preserves expert specialization and stable gating, facilitating efficient inference in edge and multi-tenant environments.
Theoretically, joint decomposition exposes the latent structure of expert pools, suggesting directions for further factorization-based compression. Loss-aligned saliency bridges local reconstruction sensitivity and global task relevance. Null-space projection offers principled noise suppression for quantization in models where routing is tightly coupled to expert parameters.
Future Directions
Potential developments include:
- Extension of joint decomposition to hierarchical mixtures or dynamic expert pools.
- Adaptive calibration and saliency estimation for distribution-shifted deployment scenarios.
- Hardware-software co-design for binarized MoE inference (e.g., FPGA/ASIC acceleration, mixed-precision routing).
- Investigating combined binarization and pruning or ternary quantization for ultra-low resource constraints.
Conclusion
MoBiE establishes a rigorous PTQ binarization framework for MoE-based LLMs, combining structural decomposition, global saliency alignment, and domain-specific error control to deliver both accuracy and efficiency. Its superior empirical performance across models, datasets, and instruction-tuned settings highlights its value for practical deployment and theoretical understanding of MoE quantization (2604.06798).