PuzzleMoE: Training-Free MoE Compression
- PuzzleMoE is a training-free compression framework for Mixture-of-Experts models that reduces memory overhead by merging expert weights at a fine-grained level.
- It employs dual-masking based on magnitude similarity and saliency metrics to preserve specialized expert entries while achieving up to 50% compression and notable speedups.
- Its bit-packed BFloat16 encoding embeds mask and sign metadata for zero-overhead deployment on GPUs, enabling efficient large-scale model inference.
PuzzleMoE is a training-free compression framework for Mixture-of-Experts (MoE) models, designed to address the prohibitive memory overhead of storing large expert matrices while maintaining high inference accuracy and efficiency. Its key innovations are sparse expert merging with element-wise dual-masking and a bit-packed encoding scheme that eliminates the memory cost of metadata, enabling efficient deployment on GPUs. Empirical evaluations demonstrate that PuzzleMoE matches or exceeds the performance of prior MoE compression methods at high sparsity levels, achieving up to 50% compression and 1.28× speedup on common benchmarks (Zhao et al., 6 Nov 2025).
1. Motivation and Compression Challenge in MoE Models
MoE models scale transformer architectures by activating only a subset of experts per token, considerably reducing compute per inference step. However, storing all expert parameters remains a substantial bottleneck, as the parameter count grows linearly with the number of experts (e.g., Mixtral-8×7B has roughly 47B parameters). Prior compression methods—expert dropping (NAEE, STUN) and merging (HC-SMoE, D2, Sub-MoE)—address this, but suffer accuracy degradation at high compression ratios due to loss of specialization and insufficient granularity. Expert dropping tends to remove potentially critical experts outright, while merging typically operates at coarse granularity (entire experts or clusters), requiring expensive search or decomposition (e.g., SVDs).
PuzzleMoE is motivated by the observation that expert weights consist of both shared entries (safe to merge) and highly specialized entries (necessary for unique expert behaviors). Its objective is to compress at the fine-grained, entry-wise level, preserving specialization through dual-masking, and to deploy the result with zero overhead for auxiliary metadata via bit-packing.
2. Sparse Expert Merging: Algorithm and Dual-Mask Construction
Given a pair of expert weight matrices $W_A, W_B$, PuzzleMoE constructs a single merged expert $W_M$, together with binary reconstruction masks $M_A, M_B$ and sign-bit patterns $S_A, S_B$. At inference, each expert's weights are reconstructed element-wise as

$$\hat{W}_E = M_E \odot S_E \odot |W_M|, \qquad E \in \{A, B\},$$

where $\odot$ denotes the element-wise product: a masked-out entry is zero, and an active entry takes the merged magnitude with that expert's original sign.
Detection of Redundancy and Specialization:
- Magnitude similarity is computed per entry between $W_A$ and $W_B$. Entries whose magnitudes agree closely (similarity above the threshold $\tau$) are classified as shared and marked in both masks ($M_{A,ij} = M_{B,ij} = 1$).
- Saliency is evaluated via the Wanda metric, $|W_{ij}| \cdot \|X_j\|_2$, using a calibration batch of activations $X$. For the remaining specialized entries, the masks ($M_A$, $M_B$) preserve whichever expert's entry is more salient.
- The original sign bits are recorded per expert: $S_{A,ij} = \operatorname{sign}(W_{A,ij})$ and $S_{B,ij} = \operatorname{sign}(W_{B,ij})$.
Final Mask and Weight Construction:
The shared-entry mask is combined (element-wise OR) with the saliency-selected masks, so $M_A$ and $M_B$ each mark the shared entries together with the specialized entries retained for that expert. The merged expert $W_M$ stores a single magnitude per entry: shared entries are collapsed into one value, while each specialized entry keeps the magnitude of the expert it was assigned to; the per-expert signs are carried separately by $S_A$ and $S_B$.
The merging pipeline requires only a single forward pass to collect calibration activations, plus element-wise operations over each expert pair. Experts are grouped randomly; ablations show negligible difference compared to search-based pairing.
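To make the construction concrete, the NumPy sketch below mirrors these steps under stated assumptions: the exact similarity formula (here the ratio of the smaller to the larger magnitude), the averaging of shared magnitudes, and the names `merge_expert_pair`, `reconstruct`, and `tau` are illustrative choices rather than the paper's implementation; only the Wanda-style saliency and the shared/specialized split follow directly from the description above.

```python
import numpy as np

def merge_expert_pair(W_A, W_B, X, tau=0.4):
    """Sketch of PuzzleMoE-style pair-wise sparse merging (illustrative).

    W_A, W_B : (out_dim, in_dim) expert weight matrices
    X        : (batch, in_dim) calibration activations
    tau      : similarity threshold (the similarity form is an assumption)
    """
    mag_A, mag_B = np.abs(W_A), np.abs(W_B)

    # Assumed magnitude similarity: ratio of smaller to larger magnitude;
    # entries above tau are treated as shared (redundant) across the pair.
    sim = np.minimum(mag_A, mag_B) / np.maximum(np.maximum(mag_A, mag_B), 1e-12)
    shared = sim > tau

    # Wanda-style saliency: |W_ij| * ||X_j||_2 over the calibration batch.
    act_norm = np.linalg.norm(X, axis=0)             # (in_dim,)
    sal_A, sal_B = mag_A * act_norm, mag_B * act_norm

    # Specialized entries are kept for whichever expert is more salient.
    keep_A = ~shared & (sal_A >= sal_B)
    keep_B = ~shared & (sal_A < sal_B)

    # Final masks: shared entries belong to both experts.
    M_A, M_B = shared | keep_A, shared | keep_B

    # Per-expert sign bits; merged magnitudes (averaging the shared
    # entries is an assumption of this sketch).
    S_A, S_B = np.signbit(W_A), np.signbit(W_B)
    W_M = np.where(shared, 0.5 * (mag_A + mag_B),
                   np.where(keep_A, mag_A, mag_B))
    return W_M, (M_A, M_B), (S_A, S_B)

def reconstruct(W_M, M, S):
    # Inference-time reconstruction: W_hat = M * sign * |W_M|.
    return np.where(M, np.where(S, -W_M, W_M), 0.0)
```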
3. Bit-Packed Encoding for Metadata-Free Deployment
Naïvely, the masks ($M_A, M_B$) and sign bits ($S_A, S_B$) would introduce substantial memory overhead. PuzzleMoE avoids this by embedding the mask and sign bits in the exponent field of BFloat16, exploiting the empirical observation that expert-weight exponents concentrate in a narrow range, leaving the high-order exponent bits effectively unused.
Encoding Layout (16 bits, BFloat16):
- bit 15: sign of expert A ($S_A$)
- bit 14: sign of expert B ($S_B$)
- bit 13: mask bit for expert A ($M_A$)
- bit 12: mask bit for expert B ($M_B$)
- bits 11–7: 5-bit shifted exponent (the original 8-bit exponent with a fixed offset subtracted, restored at decode time)
- bits 6–0: mantissa
This permits storage of all necessary metadata per element, maintaining full compression with no extra matrix allocations.
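As a sketch of the corresponding encoder, the snippet below packs one merged magnitude and its metadata into the 16-bit layout above; the helper name `pack_entry`, the fixed offset of 112 (chosen to match the `+ (112 << 7)` in the decoder below), and the clamping of out-of-range exponents are assumptions of this sketch rather than details of the released implementation.

```python
import numpy as np

EXP_OFFSET = 112  # assumed to match the `+ (112 << 7)` re-bias in the decoder

def pack_entry(w_m, mask_a, mask_b, sign_a, sign_b):
    """Pack one merged magnitude plus per-expert metadata into 16 bits.

    w_m      : merged magnitude (non-negative float)
    mask_a/b : whether the entry is active for expert A / B
    sign_a/b : original signs of expert A / B (True = negative)
    """
    # Take the bfloat16 bit pattern of the magnitude (top 16 bits of fp32).
    fp32_bits = np.array(w_m, dtype=np.float32).view(np.uint32).item()
    bf16 = (fp32_bits >> 16) & 0xFFFF
    exp8 = (bf16 >> 7) & 0xFF        # 8-bit exponent field
    mant = bf16 & 0x7F               # 7-bit mantissa

    # Re-bias the exponent into 5 bits; clamping is an assumption here.
    exp5 = min(max(exp8 - EXP_OFFSET, 0), 31)

    return ((int(sign_a) << 15) | (int(sign_b) << 14) |
            (int(mask_a) << 13) | (int(mask_b) << 12) |
            (exp5 << 7) | mant)
```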
On-the-Fly Decoding Algorithm:
During inference, packed weights are decoded as:
```python
def decode_weight(packed, expert_pos):
    # expert_pos: 0 for expert A, 1 for expert B
    mask_bit = (packed >> (13 - expert_pos)) & 1
    if mask_bit == 0:
        return 0                                   # entry not active for this expert
    sign_bit = (packed >> (15 - expert_pos)) & 1
    exp_field = (packed & 0x0F80) + (112 << 7)     # restore full 8-bit exponent
    mant_field = packed & 0x007F
    reconstructed = (sign_bit << 15) | exp_field | mant_field
    return view_as_bfloat16(reconstructed)
```
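The listing relies on a `view_as_bfloat16` helper that is not shown. A minimal NumPy stand-in, together with a hand-packed round-trip example, could look as follows; since NumPy has no native bfloat16, the pattern is widened to float32, which represents the same value.

```python
import numpy as np

def view_as_bfloat16(bits):
    # Reinterpret a 16-bit pattern as bfloat16 by appending 16 zero
    # mantissa bits and viewing the result as float32 (same value).
    return np.array(int(bits) << 16, dtype=np.uint32).view(np.float32).item()

# Hand-packed example: active for expert A only, positive sign,
# shifted exponent 15 (decodes to biased exponent 127), mantissa 0x40.
packed = (1 << 13) | (15 << 7) | 0x40
print(decode_weight(packed, expert_pos=0))  # 1.5 for expert A
print(decode_weight(packed, expert_pos=1))  # 0: mask bit for expert B is clear
```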
4. GPU Inference and System Implementation
PuzzleMoE provides a custom GEMV CUDA kernel that reads packed elements directly, decodes mask and sign bits entirely in registers, reconstructs the bfloat16 weights, and applies them to the input activations. Decoding incurs only a few integer operations per element, and no full dense matrix is ever materialized in memory. This yields substantial memory savings: for instance, Mixtral-8×7B at 50% compression fits on a single A100-80GB GPU (compared to two GPUs for the uncompressed model). Inference speedups are observed on both Mixtral-8×7B (up to 1.28×) and Qwen3-MoE.
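For reference, the NumPy sketch below reproduces what the fused kernel computes at the array level: a GEMV that decodes mask, sign, exponent, and mantissa on the fly. The function name `packed_gemv_reference` and the vectorized structure are illustrative only and do not reflect the CUDA implementation.

```python
import numpy as np

def packed_gemv_reference(packed, x, expert_pos):
    """Reference computation of y = W_E @ x from packed 16-bit weights.

    packed     : (out_dim, in_dim) uint16 array in the PuzzleMoE layout
    x          : (in_dim,) input activations
    expert_pos : 0 for expert A, 1 for expert B
    """
    mask = (packed >> (13 - expert_pos)) & 1                       # per-entry mask bit
    sign = (packed >> (15 - expert_pos)) & 1                       # per-expert sign bit
    exp_field = (packed.astype(np.uint32) & 0x0F80) + (112 << 7)   # restore exponent
    mant_field = packed.astype(np.uint32) & 0x007F

    # Rebuild bfloat16 bit patterns and widen to float32 (same values).
    bf16_bits = (sign.astype(np.uint32) << 15) | exp_field | mant_field
    weights = (bf16_bits << 16).view(np.float32)

    # Masked entries contribute zero, matching the kernel's early exit.
    weights = np.where(mask.astype(bool), weights, np.float32(0.0))
    return weights @ x.astype(np.float32)
```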
5. Empirical Evaluation
PuzzleMoE was tested on Mixtral-8×7B, Deepseek-MoE-16B, Qwen1.5-MoE-2.7B, and Qwen3-MoE-30B. Benchmarks included WikiText-2 (perplexity), ARC-c/e, HellaSwag, PIQA, BoolQ, WinoGrande, MMLU, and GSM8K (8-shot). Calibration used 128 samples of C4, length 2048; similarity threshold fixed at $0.4$; 16 random seeds, reporting mean±std.
| Method | MMLU (zero-shot) | Avg. acc. | Speedup |
|---|---|---|---|
| Full (0%) | 67.9 | 74.1 | 1.00× |
| HC-SMoE (50%) | 49.0 | 63.8 | – |
| Wanda 2:4 | 62.0 | 68.7 | – |
| PuzzleMoE (50%) | 65.7±0.3 | 72.6±0.2 | 1.28× |
At 50% sparsity, PuzzleMoE achieves substantially higher MMLU accuracy than prior pruning and merging approaches (65.7 vs. 49.0 for HC-SMoE and 62.0 for Wanda 2:4). Perplexity remains unchanged after exponent remapping. Compression is also fast: Mixtral-8×7B is reduced from 80GB to 40GB in 2 minutes, versus 55 minutes for D2.
6. Limitations, Robustness, and Future Directions
- Merging more than two experts at once significantly degrades accuracy and demands additional free bits.
- PuzzleMoE is robust to calibration data choice (C4 vs Math), indicating task-agnostic behavior.
- The optimal similarity threshold $\tau$ falls within a range around $0.4$, which is the value fixed for all experiments.
- Activation-aware saliency yields measurably higher average accuracy than magnitude-only masking.
- Combined with 3-bit group quantization (AWQ), PuzzleMoE achieves further total compression at only a marginal additional accuracy loss.
- Future research directions include improving reasoning performance on tasks such as AIME25, where PuzzleMoE still trails the full model, and exploring native sparse-expert training that co-designs sparsity during pre-training.
PuzzleMoE constitutes an efficient compression solution for large-scale MoE models, delivering high compression with minimal performance tradeoff and rapid deployment on commodity GPUs. For further references and implementation details, the official repository is available at: https://github.com/Supercomputing-System-AI-Lab/PuzzleMoE (Zhao et al., 6 Nov 2025).