
PuzzleMoE: Training-Free MoE Compression

Updated 3 December 2025
  • PuzzleMoE is a training-free compression framework for Mixture-of-Experts models that reduces memory overhead by merging expert weights at a fine-grained level.
  • It employs dual-masking based on magnitude similarity and saliency metrics to preserve specialized expert entries while achieving up to 50% compression and notable speedups.
  • Its bit-packed BFloat16 encoding embeds mask and sign metadata for zero-overhead deployment on GPUs, enabling efficient large-scale model inference.

PuzzleMoE is a training-free compression framework for Mixture-of-Experts (MoE) models, designed to address the prohibitive memory overhead of storing large expert matrices while maintaining high inference accuracy and efficiency. Its key innovations are sparse expert merging with element-wise dual-masking and a bit-packed encoding scheme that eliminates the memory cost of metadata, enabling efficient deployment on GPUs. Empirical evaluations demonstrate that PuzzleMoE matches or exceeds the performance of prior MoE compression methods at high sparsity levels, achieving up to 50% compression and 1.28× speedup on common benchmarks (Zhao et al., 6 Nov 2025).

1. Motivation and Compression Challenge in MoE Models

MoE models scale transformer architectures by activating only a subset of experts per token, considerably reducing compute per inference step. However, the storage of all expert parameters remains a substantial bottleneck, as the parameter count is linear in the number of experts (e.g., Mixtral-8×7B with 45B weights). Prior compression methods—expert dropping (NAEE, STUN) and merging (HC-SMoE, D2, Sub-MoE)—address this but at high compression ratios suffer from accuracy degradation due to loss of specialization and insufficient granularity. Expert dropping tends to remove potentially critical experts outright, while merging typically operates at coarse granularity (entire experts/clusters), requiring expensive search or decomposition (e.g., SVDs).

PuzzleMoE is motivated by the observation that expert weights consist of both shared entries (safe to merge) and highly specialized entries (necessary for unique expert behaviors). Its objective is to compress at the fine-grained (entry-wise) level, preserving specialization with dual-masking—and to deploy the result with zero overhead for auxiliary metadata, thanks to bit-packing.

2. Sparse Expert Merging: Algorithm and Dual-Mask Construction

Given a pair of expert weight matrices $W_i, W_j \in \mathbb{R}^{d \times h}$, PuzzleMoE constructs a merged expert $W_m \in \mathbb{R}^{d \times h}$, associated with binary reconstruction masks $M_i, M_j \in \{0,1\}^{d \times h}$ and sign-bit patterns $S_i, S_j \in \{0,1\}^{d \times h}$. At inference, expert weights are reconstructed as:

$$\hat{W}_i = (-1)^{S_i} \circ M_i \circ W_m, \quad \hat{W}_j = (-1)^{S_j} \circ M_j \circ W_m$$

Detection of Redundancy and Specialization:

  • Magnitude similarity is computed per entry:

$$\Delta := \frac{\big| |W_i| - |W_j| \big|}{|W_i| + |W_j|} \in [0,1]^{d \times h}$$

Entries where $\Delta \leq \tau_{sim}$ are classified as shared (mask $M^{sim}$).

  • Saliency is evaluated via the Wanda metric using a calibration batch of activations $X_i, X_j$:

$$A_i = |W_i| \circ \|X_i\|_2, \quad A_j = |W_j| \circ \|X_j\|_2$$

Entry masks ($M^{ali}$, $M^{alj}$) preserve the more salient expert entries.

  • The original sign bits are recorded: $S_i = 1_{W_i < 0}$, $S_j = 1_{W_j < 0}$.
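
For example, entries $w_i = 0.5$ and $w_j = -0.4$ give $\Delta = |0.5 - 0.4| / (0.5 + 0.4) \approx 0.11 \leq \tau_{sim} = 0.4$, so the entry is marked shared: its merged magnitude is $(0.5 + 0.4)/2 = 0.45$, and the recorded sign bits $S_i = 0$, $S_j = 1$ restore the original signs ($+0.45$ for expert $i$, $-0.45$ for expert $j$) at reconstruction.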

Final Mask and Weight Construction:

Masks are combined:

$$M_i = M^{ali} \vee M^{sim}, \quad M_j = M^{alj} \vee M^{sim}$$

Merged weights are:

$$W_m = M^{sim} \circ \frac{|W_i| + |W_j|}{2} + (1 - M^{sim}) \circ \left( M^{ali} \circ |W_i| + M^{alj} \circ |W_j| \right)$$

The merging pipeline requires only a forward pass to collect activations and pair-wise operations of $O(dh)$. Grouping of experts is random; ablation shows negligible difference compared to search-based pairing.
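
A minimal PyTorch sketch of this pairwise merging is given below. It assumes pre-collected activation norms `x_norm_i`, `x_norm_j` broadcastable to the weight shape (from the calibration forward pass) and resolves non-shared entries by keeping whichever expert has the higher Wanda saliency; the function name and tie-breaking details are illustrative rather than taken from the released implementation.

import torch

def merge_expert_pair(W_i, W_j, x_norm_i, x_norm_j, tau_sim=0.4):
    # Dual-mask merging of one expert pair (illustrative sketch of Section 2).
    abs_i, abs_j = W_i.abs(), W_j.abs()
    # Shared entries: small relative magnitude difference (Delta <= tau_sim).
    delta = (abs_i - abs_j).abs() / (abs_i + abs_j + 1e-12)
    M_sim = delta <= tau_sim
    # Wanda-style saliency: |W| scaled by the L2 norm of the matching calibration inputs.
    A_i, A_j = abs_i * x_norm_i, abs_j * x_norm_j
    M_al_i = (A_i >= A_j) & ~M_sim      # keep expert i's entry where it is more salient
    M_al_j = (A_j > A_i) & ~M_sim       # otherwise keep expert j's entry
    # Sign bits record the original signs for reconstruction.
    S_i, S_j = W_i < 0, W_j < 0
    # Reconstruction masks and the merged, sign-free weight matrix.
    M_i, M_j = M_al_i | M_sim, M_al_j | M_sim
    W_m = M_sim * (abs_i + abs_j) / 2 + (~M_sim) * (M_al_i * abs_i + M_al_j * abs_j)
    return W_m, M_i, M_j, S_i, S_j

# Reconstruction at inference follows the formula above:
#   W_hat_i = M_i * torch.where(S_i, -W_m, W_m)
#   W_hat_j = M_j * torch.where(S_j, -W_m, W_m)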

3. Bit-Packed Encoding for Metadata-Free Deployment

Naïvely, masks ($M_i, M_j$) and sign bits ($S_i, S_j$) introduce substantial memory overhead. PuzzleMoE addresses this by embedding mask and sign bits in the exponent fields of BFloat16, exploiting underutilization discovered empirically (exponents concentrated in $[112, 128]$).

Encoding Layout (16 bits, BFloat16):

  • bit 15: sign of expert $i$ ($S_i$)
  • bit 14: sign of expert $j$ ($S_j$)
  • bit 13: mask bit for $i$ ($M_i$)
  • bit 12: mask bit for $j$ ($M_j$)
  • bits 11–7: 5-bit shifted exponent (original exponent shifted/rounded, bias removed)
  • bits 6–0: mantissa

This permits storage of all necessary metadata per element, maintaining full compression with no extra matrix allocations.
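
As an illustration of this layout, the sketch below packs a single merged entry (its BFloat16 magnitude plus the four metadata bits) into one 16-bit word; the `pack_entry` helper and the clamping of the shifted exponent to 5 bits are assumptions made for the example, not the paper's exact procedure.

import torch

def pack_entry(w_m: float, m_i: int, m_j: int, s_i: int, s_j: int) -> int:
    # Pack one non-negative merged weight plus mask/sign metadata into 16 bits (sketch).
    bits = torch.tensor([w_m], dtype=torch.bfloat16).view(torch.int16)[0].item() & 0xFFFF
    exp = (bits >> 7) & 0xFF                 # 8-bit biased BFloat16 exponent
    mant = bits & 0x007F                     # 7-bit mantissa
    exp5 = min(max(exp - 112, 0), 31)        # shift into the observed [112, 128] range (5 bits)
    return (s_i << 15) | (s_j << 14) | (m_i << 13) | (m_j << 12) | (exp5 << 7) | mant

The on-the-fly decoder described next reverses exactly this layout.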

On-the-Fly Decoding Algorithm:

During inference, packed weights are decoded on the fly; a runnable Python sketch, which supplies an illustrative PyTorch body for the `view_as_bfloat16` helper, is:

import torch

def view_as_bfloat16(bits: int) -> torch.Tensor:
    # Illustrative helper: reinterpret a raw 16-bit pattern as a bfloat16 scalar.
    signed = bits - 0x10000 if bits >= 0x8000 else bits
    return torch.tensor([signed], dtype=torch.int16).view(torch.bfloat16)[0]

def decode_weight(packed: int, expert_pos: int) -> torch.Tensor:
    # expert_pos = 0 selects expert i, expert_pos = 1 selects expert j.
    mask_bit = (packed >> (13 - expert_pos)) & 1    # bits 13/12: reconstruction masks
    if mask_bit == 0:
        return view_as_bfloat16(0)                  # masked-out entry decodes to zero
    sign_bit = (packed >> (15 - expert_pos)) & 1    # bits 15/14: original sign bits
    exp_field = (packed & 0x0F80) + (112 << 7)      # undo the 5-bit exponent shift (offset 112)
    mant_field = packed & 0x007F                    # 7-bit mantissa
    return view_as_bfloat16((sign_bit << 15) | exp_field | mant_field)
Positions (`expert_pos` = 0 for expert $i$, 1 for expert $j$) select the relevant mask and sign bits.
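
A quick round-trip with the illustrative `pack_entry` helper from above (hypothetical values) shows how the per-expert mask and sign bits behave:

# Hypothetical entry: magnitude 0.5, kept only by expert i, which originally stored -0.5.
packed = pack_entry(w_m=0.5, m_i=1, m_j=0, s_i=1, s_j=0)
print(decode_weight(packed, expert_pos=0))   # -0.5: sign and magnitude restored for expert i
print(decode_weight(packed, expert_pos=1))   # 0.0: the entry is masked out for expert j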

4. GPU Inference and System Implementation

PuzzleMoE provides a custom GEMV CUDA kernel that reads packed $W_m$ elements directly, decodes mask and sign bits entirely in registers, reconstructs bfloat16 weights, and applies them to input activations. Decoding incurs only a few integer operations per element, and no full dense matrix is constructed in memory. This results in substantial memory savings: for instance, Mixtral-8×7B at 50% compression fits on a single A100-80GB GPU (compared to two GPUs for the uncompressed model). Inference speedups are observed (e.g., $1.28\times$ for Mixtral-8×7B, $1.19\times$ for Qwen3-MoE).
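
The CUDA kernel is not reproduced here, but the following vectorized PyTorch sketch emulates its per-element decode over a whole packed matrix. For clarity it materializes the dense expert before the matrix-vector product, which the real kernel avoids by decoding in registers; tensor names are illustrative.

import torch

def decode_expert(packed: torch.Tensor, expert_pos: int) -> torch.Tensor:
    # packed: integer tensor of 16-bit words (values in [0, 65535]), one per weight entry.
    p = packed.to(torch.int32)
    mask = ((p >> (13 - expert_pos)) & 1).bool()
    sign = (p >> (15 - expert_pos)) & 1
    bits = (sign << 15) | ((p & 0x0F80) + (112 << 7)) | (p & 0x007F)
    bits = torch.where(bits >= 0x8000, bits - 0x10000, bits)   # map to the signed 16-bit range
    w = bits.to(torch.int16).view(torch.bfloat16)
    return torch.where(mask, w, torch.zeros_like(w))

def packed_gemv(packed: torch.Tensor, x: torch.Tensor, expert_pos: int) -> torch.Tensor:
    # y = W_hat @ x, with W_hat decoded on the fly from the packed representation.
    return decode_expert(packed, expert_pos).float() @ x.float()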

5. Empirical Evaluation

PuzzleMoE was tested on Mixtral-8×7B, Deepseek-MoE-16B, Qwen1.5-MoE-2.7B, and Qwen3-MoE-30B. Benchmarks included WikiText-2 (perplexity), ARC-c/e, HellaSwag, PIQA, BoolQ, WinoGrande, MMLU, and GSM8K (8-shot). Calibration used 128 samples of C4 with sequence length 2048; the similarity threshold $\tau_{sim}$ was fixed at $0.4$; results are averaged over 16 random seeds and reported as mean ± std.

Method             MMLU (zero-shot)    Avg. acc.    Speed
Full (0%)          67.9                74.1         1.00×
HC-SMoE (50%)      49.0                63.8         –
Wanda 2:4          62.0                68.7         –
PuzzleMoE (50%)    65.7 ± 0.3          72.6 ± 0.2   1.28×

At 50% sparsity, PuzzleMoE achieves up to $+16.7$ percentage points higher accuracy (MMLU) than prior pruning/merging approaches. Perplexity remains unchanged after exponent remapping. Compressing Mixtral-8×7B from roughly 80 GB to 40 GB takes about 2 minutes (vs. 55 minutes for D2).

6. Limitations, Robustness, and Future Directions

  • Merging more than two experts at once significantly degrades accuracy and demands additional free bits.
  • PuzzleMoE is robust to calibration data choice (C4 vs Math), indicating task-agnostic behavior.
  • The optimal $\tau_{sim}$ falls in $[0.3, 0.5]$; it is fixed at $0.4$ for all experiments.
  • Activation-aware saliency yields up to $+0.3$ percentage points in average accuracy over magnitude-only masks.
  • Combined with 3-bit group quantization (AWQ), PuzzleMoE achieves $\sim 4.8\times$ total compression at a marginal additional loss ($\sim 1.7\%$ accuracy reduction).
  • Future research directions include improving reasoning performance on tasks such as AIME25 (currently $\sim 10$ percentage points below the full model) and exploring native sparse-expert training that co-designs sparsity during pre-training.

PuzzleMoE constitutes an efficient compression solution for large-scale MoE models, delivering high compression with minimal performance tradeoff and rapid deployment on commodity GPUs. For further references and implementation details, the official repository is available at: https://github.com/Supercomputing-System-AI-Lab/PuzzleMoE (Zhao et al., 6 Nov 2025).
