
Bit-Packed Encoding for MoE Compression

Updated 7 December 2025
  • Bit-Packed Encoding is a method that embeds binary mask, sign, and weight data within standard floating-point representations to enable efficient sparse expert merging in MoE models.
  • It leverages the bfloat16 format by reallocating exponent bits to store metadata, achieving up to 50% memory reduction while preserving inference accuracy.
  • This approach reduces storage and bandwidth costs, allowing single-device deployment of large Mixture-of-Experts models as demonstrated by PuzzleMoE.

Bit-packed encoding in the context of sparse expert merging and Mixture-of-Experts (MoE) model compression refers to the integration of binary mask, sign, and weight information into a minimal and hardware-efficient format by exploiting underutilized bits within standard floating-point representations. This innovation is primarily motivated by the pressing need to reduce model memory footprint and unlock single-device deployment for modern MoE LLMs, whose parameter counts preclude efficient inference under naïve expert storage schemes. Bit-packed encoding enables models such as PuzzleMoE to achieve high compression ratios with minimal loss in accuracy, while maintaining compatibility with GPU arithmetic (Zhao et al., 6 Nov 2025).

1. Motivation and Background

The scalability of sparse Mixture-of-Experts (MoE) architectures—where dozens of experts are deployed per layer but only a sparse subset is activated per token—results in storage costs that scale linearly with the number of experts. Even when computation is kept sublinear, total GPU memory usage becomes prohibitive. Sparse expert merging techniques aim to collapse redundant and specialized knowledge from multiple experts. However, they introduce the challenge of efficiently storing the fine-grained binary mask metadata and sign bits required to reconstruct the original experts from the merged representation (Zhao et al., 6 Nov 2025).

Traditional approaches store this metadata in auxiliary buffers using standard integer or boolean types, introducing nontrivial storage overhead and bandwidth penalties. Bit-packed encoding instead embeds this information directly into the floating-point representation of the weights, storing masks and signs at virtually zero additional memory cost, with the decoding folded into the matrix-multiply kernels themselves.

2. Bit-Packed Encoding: Technical Mechanism

PuzzleMoE demonstrates a canonical bit-packed encoding, targeting the bfloat16 format widely used for MoE weights on modern accelerators. The scheme is as follows (Zhao et al., 6 Nov 2025):

  • Bfloat16 components (16 bits):
    • 1 sign bit
    • 8 exponent bits
    • 7 mantissa bits

Empirical profiling of expert weights reveals that their biased exponents typically occupy a narrow subrange, e.g., $[112, 128]$. By remapping the exponent field—fixing any exponent below 112 to 112 and then subtracting 112—only 5 bits are needed to encode the exponent ($0 \dots 31$), freeing up 3 high-order bits per weight entry.
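As a worked example (the particular value is illustrative): a weight with magnitude on the order of $2^{-7}$ has biased exponent $e = 120$ under the bfloat16 bias of 127; it is stored as $e - 112 = 8$, which fits comfortably in the 5-bit field, and adding 112 back at decode time restores the original exponent exactly.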

  • 3 freed bits per weight entry:
    • 1 mask bit (to indicate whether the given merge location is active for expert $i$ or $j$)
    • 1 sign bit (per expert)
    • 1 reserved/global control bit

Thus, each bfloat16 word for a merged expert encodes not just the magnitude, but also, without additional memory cost, the binary mask and sign metadata required to reconstruct both original experts (or, by extension, groupings of $G$ experts, $G > 2$).
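For concreteness, the following is a minimal Python sketch of the packing side, mirroring the bit layout implied by the decode pseudocode in Section 3 (per-expert sign bits in positions 15 and 14, per-expert mask bits in 13 and 12, the offset 5-bit exponent in bits 11–7). The function name pack_merged_weight and the truncation-based float32-to-bfloat16 conversion are assumptions for illustration, not the published implementation:

import struct

def pack_merged_weight(w_merged, mask, sign):
    # w_merged: merged weight value; mask = (m_i, m_j), sign = (s_i, s_j) bits.
    # Assumed layout: [s_i s_j m_i m_j | 5-bit (exponent - 112) | 7-bit mantissa].
    bits = struct.unpack("<I", struct.pack("<f", w_merged))[0] >> 16  # truncate to bf16 bits
    exp = (bits >> 7) & 0xFF                       # biased bfloat16 exponent
    exp5 = min(max(exp, 112), 112 + 31) - 112      # clamp into the packable subrange, offset by 112
    mant = bits & 0x007F
    return ((sign[0] << 15) | (sign[1] << 14) | (mask[0] << 13)
            | (mask[1] << 12) | (exp5 << 7) | mant)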

3. Bit-Packed Sparse Expert Merging Workflow

The bit-packed encoding in PuzzleMoE is implemented as follows (Zhao et al., 6 Nov 2025):

  1. Sparse Merging with Dual Masks: For each pair of experts $W_1, W_2 \in \mathbb{R}^{d \times h}$, compute a similarity mask $M^{\rm sim}$ (identifying shared entries via a percent-difference threshold) and per-expert saliency masks $M^{\rm sal_1}, M^{\rm sal_2}$ (identifying high-activation, expert-unique entries); a sketch of this mask construction follows the list.
  2. Merged Weight Matrix Construction: The resulting merged matrix $W_{\rm merged}$ encodes averages for shared weights and salient values for unshared ones, retaining the necessary quantization precision.
  3. Bit-Packing of Metadata: The mask and sign bits for each expert at each position are overlaid into the freed bits of the exponent field in each bfloat16 merged weight.
  4. On-the-Fly Decoding at Inference: A custom kernel extracts the relevant mask and sign bits from each word during matrix multiplication, zeroing pruned positions and restoring the correct sign and exponent in accordance with the target expert.
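The sketch below illustrates steps 1–2 in NumPy. The relative-difference rule and the magnitude-based saliency criterion (top_frac) are simplifying assumptions standing in for the paper's activation-based saliency measure; merge_pair and its thresholds are hypothetical names:

import numpy as np

def merge_pair(W1, W2, tau_sim=0.1, top_frac=0.1):
    # Similarity mask: entries whose relative difference is within tau_sim
    diff = np.abs(W1 - W2) / (np.maximum(np.abs(W1), np.abs(W2)) + 1e-12)
    M_sim = diff <= tau_sim
    # Saliency masks: keep the largest-magnitude expert-unique entries (illustrative rule)
    rest = ~M_sim
    t1 = np.quantile(np.abs(W1[rest]), 1 - top_frac) if rest.any() else np.inf
    t2 = np.quantile(np.abs(W2[rest]), 1 - top_frac) if rest.any() else np.inf
    M_sal1 = rest & (np.abs(W1) >= t1)
    M_sal2 = rest & (np.abs(W2) >= t2)
    # Merged matrix: average shared entries, keep salient unique values elsewhere
    W_merged = np.where(M_sim, 0.5 * (W1 + W2),
                        np.where(M_sal1, W1, np.where(M_sal2, W2, 0.0)))
    return W_merged, M_sim, M_sal1, M_sal2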

The decoding step is captured by the following pseudocode (using bitwise operations for extraction):

import struct

def reinterpret_as_bfloat16(bits16):
    # bfloat16 occupies the upper 16 bits of an IEEE-754 float32 bit pattern
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

def decode_weight(W_packed, expert_pos):
    mask_bit = (W_packed >> (13 - expert_pos)) & 1  # per-expert mask (bit 13 or 12)
    if mask_bit == 0: return 0.0                    # entry pruned for this expert
    sign_bit = (W_packed >> (15 - expert_pos)) & 1  # per-expert sign (bit 15 or 14)
    exp_field = (W_packed & 0x0F80) + (112 << 7)    # 5-bit exponent, re-add the offset of 112
    mant = W_packed & 0x007F                        # 7-bit mantissa
    W_val = (sign_bit << 15) | exp_field | mant     # reassembled bfloat16 bit pattern
    return reinterpret_as_bfloat16(W_val)
Custom GEMV/MatMul kernels apply this decoding inline during the matrix multiplication, avoiding extra memory passes and intermediate buffers.
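As a usage illustration only (the actual implementation is a fused GPU kernel), a reference matrix-vector product can call decode_weight from the block above on each packed entry; gemv_packed and the uint16 array W_packed are assumptions for this sketch:

import numpy as np

def gemv_packed(W_packed, x, expert_pos):
    # Unfused reference: decode each 16-bit word for the requested expert, then accumulate.
    d, h = W_packed.shape
    y = np.zeros(d, dtype=np.float32)
    for r in range(d):
        y[r] = sum(decode_weight(int(W_packed[r, c]), expert_pos) * float(x[c])
                   for c in range(h))
    return y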

4. Theoretical and Practical Properties

Bit-packed encoding provides explicit, analytically-bounded memory and error trade-offs:

  • Compression Ratio: For $G$ experts merged into one, only $E/G$ weight matrices must be stored (where $E$ is the total number of experts); mask/sign bits are embedded “for free.”
  • Memory Reduction: For $G = 2$ (pairwise merging), memory savings approach $50\%$; for general $G$, savings are $(G-1)/G$ (see the short calculation after this list).
  • Bandwidth: The inference kernel reads only the compact weight buffer; mask/sign information is decoded on the fly from the same packed words, with no auxiliary streams.
  • Approximation Error Bound: Reconstruction error for each expert's weights is bounded by the mask threshold parameter (e.g., for shared entries with percent difference up to $\tau_{\rm sim}$, error $\le \tau_{\rm sim}\,\|W_i\|_F$), with exact recovery for salient entries (Zhao et al., 6 Nov 2025).
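A quick back-of-the-envelope calculation of the memory reduction; the expert count and per-expert parameter count below are hypothetical layer sizes chosen for illustration:

def moe_expert_memory(num_experts, expert_params, bytes_per_param=2, group=2):
    # bfloat16 -> 2 bytes per parameter; masks/signs ride inside the packed words.
    dense = num_experts * expert_params * bytes_per_param
    packed = (num_experts // group) * expert_params * bytes_per_param
    return dense, packed, 1 - packed / dense

# Hypothetical layer: 8 experts, 100M parameters each, pairwise merging (G = 2)
dense, packed, saving = moe_expert_memory(8, 100_000_000)
print(f"{dense / 1e9:.1f} GB -> {packed / 1e9:.1f} GB ({saving:.0%} saved)")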

5. Comparative Performance and Empirical Results

Empirical studies on Mixtral-8×7B and related large LLMs demonstrate:

  • Accuracy: At 50% compression ratio, bit-packed PuzzleMoE degrades by ≤1.5% from the original on zero-shot accuracy benchmarks, outperforming hierarchical clustering (HC-SMoE) and structured pruning baselines by 3–17% (Zhao et al., 6 Nov 2025).
  • Speed: Bit-packed inference on a single GPU yields 1.28× speedup compared to baseline multi-GPU deployment (with no extra inference-time memory cost).
  • Ablations: Omitting similarity masks or reserving more bits for mask storage results in 2–3× greater perplexity increase, underscoring the importance of careful bit allocation.

6. Design Constraints and Limitations

Bit-packed encoding is constrained by the hardware word size (it targets fp16/bfloat16 arithmetic) and by the dynamic range of expert exponents. If exponents fall outside the subrange $[112, 128]$, or if more than three bits are needed for metadata, the encoding becomes less effective. Additionally, the number of experts that can be merged into a single matrix is upper-bounded by the number of bits available to encode their masks and signs; $G = 2$ is optimal for bfloat16, and generalizations to higher $G$ require new floating-point variants or alternative encodings.
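A simple pre-flight check can verify whether a tensor's exponents fit the packable subrange before applying the encoding. The sketch below assumes the weights are available as a float32 NumPy array and uses a truncation view of the bfloat16 exponent; exponents below 112 are clamped by the scheme, while anything above 112 + 31 cannot be represented in the 5-bit field:

import numpy as np

def exponents_fit(W, lo=112, width_bits=5):
    # Truncate float32 weights to bfloat16 bit patterns and read the biased exponent.
    bits = np.ascontiguousarray(W, dtype=np.float32).view(np.uint32) >> 16
    exp = (bits >> 7) & 0xFF
    hi = lo + (1 << width_bits) - 1      # largest exponent the 5-bit field can hold
    return bool(exp.max() <= hi), int(exp.min()), int(exp.max())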

7. Significance for Modular and Scalable LLM Deployment

Bit-packed encoding as pioneered by PuzzleMoE is instrumental in achieving efficient, training-free, accurate, and deployable compression of large MoE models (Zhao et al., 6 Nov 2025). It makes storage- and bandwidth-constrained inference practical by eliminating auxiliary storage for mask and sign metadata, while supporting element-wise reconstruction of merged experts with provable error bounds and no retraining. The approach reduces both static storage and dynamic inference costs without altering the semantics of the floating-point computation pipeline, and thus integrates seamlessly with existing GPU and accelerator-based LLM infrastructure.

References (1)
