Calibration-Free Asymmetric Matryoshka Quantization
- AMAT is a truncation-based quantization scheme that uses zero-point-aware truncation to derive both high-bit and low-bit representations from a single integer code.
- It eliminates costly per-layer calibration by applying uniform quantization and truncating integer codes and zero-points with the same bit-shift, preserving the quantization range across slices.
- AMAT facilitates dynamic mixed-precision inference in MoE pipelines, achieving energy and latency improvements while maintaining near-high-bit accuracy.
Calibration-Free Asymmetric Matryoshka Quantization (AMAT) is a truncation-based quantization scheme designed for memory-efficient, mixed-precision inference in large-scale Mixture-of-Experts (MoE) models. AMAT enables direct compatibility between high-bit and low-bit quantized slices through zero-point-aware truncation, facilitating dynamic precision selection and cache-friendly expert management without layer- or channel-wise calibration or memory duplication. Originally introduced in the SliceMoE framework, AMAT provides a practical solution for conditional expert deployment under stringent energy and latency budgets, maintaining near-high-bit accuracy even in low-bit regimes (Choi et al., 15 Dec 2025).
1. Motivation and Design Principles
Conventional approaches to mixed-precision quantization in neural networks typically require storing separate copies of weight tensors for each precision or impose complex, non-uniform calibration procedures—especially problematic for MoE models with large parameter sets. AMAT is motivated by the need to:
- Eliminate memory duplication by allowing low-bit and high-bit slices to be derived from the same integer code.
- Abandon costly per-layer or per-channel calibration and distribution fitting.
- Enable slice-wise bit caching for conditional computation in MoE routing.
Key design features include simple uniform quantization, truncation of the integer codes and zero-point by the same bit-shift, and recentering of ranges to mitigate value clipping.
2. Mathematical Formulation and Key Algorithms
AMAT operates on a floating-point weight tensor $W$, producing high-bit ($b_h$-bit) and low-bit ($b_\ell$-bit) quantized representations, with $b_\ell < b_h$ and truncation shift $s = b_h - b_\ell$.
- High-bit uniform quantization (asymmetric integer scheme):
- Scale: $S_h = (w_{\max} - w_{\min}) / (2^{b_h} - 1)$
- Zero-point: $z_h = \mathrm{round}(-w_{\min} / S_h)$
- Quantization: $Q_h = \mathrm{clamp}\!\left(\mathrm{round}(W / S_h) + z_h,\; 0,\; 2^{b_h} - 1\right)$
- Approximate dequantization: $\hat{W} \approx S_h (Q_h - z_h)$
- Naïve truncation failure: bit-shifting the integer codes alone ($Q_\ell = \lfloor Q_h / 2^{s} \rfloor$) while leaving the zero-point at $z_h$ mis-centers the truncated grid and leads to severe clipping.
- AMAT truncation: both the integer codes and the zero-point are shifted by the same amount:
$$Q_\ell = \left\lfloor Q_h / 2^{s} \right\rfloor, \qquad z_\ell = \left\lfloor z_h / 2^{s} \right\rfloor$$
This preserves the correct range for the truncated slice.
- Low-bit dequantization: $\hat{W}_\ell \approx S_h \, 2^{s} \, (Q_\ell - z_\ell)$. If both bit-slices are available, the original high-bit dequantization $\hat{W} \approx S_h (Q_h - z_h)$ is performed instead.
- Algorithmic implementation (the flattened listing reconstructed as runnable NumPy):

```python
import numpy as np

def AMAT_Quantize(W, b_high, b_low):
    # Single global interval [w_min, w_max]: no per-layer/per-channel calibration.
    w_min = float(W.min())
    w_max = float(W.max())
    S_h = (w_max - w_min) / (2 ** b_high - 1)            # high-bit scale
    z_h = int(round(-w_min / S_h))                       # high-bit zero-point
    Q_h = np.clip(np.round(W / S_h) + z_h,
                  0, 2 ** b_high - 1).astype(np.int32)   # high-bit codes
    shift = b_high - b_low
    Q_l = Q_h >> shift                                   # truncate codes ...
    z_l = z_h >> shift                                   # ... and zero-point
    return {"scale_high": S_h, "zp_high": z_h, "Q_high": Q_h,
            "zp_low": z_l, "Q_low": Q_l}
```
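The corresponding dequantization paths can be sketched as follows; this is a minimal illustration, and the function names `dequant_high` and `dequant_low` are ours rather than the paper's:

```python
import numpy as np

def dequant_high(Q_h, S_h, z_h):
    # Both slices cached: standard asymmetric dequantization, W ≈ S_h (Q_h - z_h).
    return S_h * (Q_h.astype(np.float32) - z_h)

def dequant_low(Q_l, S_h, z_h, shift):
    # Only the MSB slice cached: the zero-point is truncated by the same shift,
    # and the scale absorbs the factor 2**shift (S_l = S_h * 2**shift).
    z_l = z_h >> shift
    return (S_h * 2 ** shift) * (Q_l.astype(np.float32) - z_l)
```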
3. Calibration-Free Operation
AMAT dispenses with all per-layer and per-channel calibration, using a single $[w_{\min}, w_{\max}]$ interval for both the high- and low-precision quantizations. The arithmetic truncation of integer code and zero-point ensures the low-bit slice inherits a consistently centered zero-point, avoiding layer-specific tuning and distribution fitting. This calibration-free property distinguishes AMAT from previous non-uniform matryoshka schemes, which require per-layer calibration and distribution-specific thresholds.
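A toy numerical check makes the failure mode concrete (an illustrative sketch using NumPy and synthetic Gaussian weights, not code from the paper; it models the naïve path as reusing the untruncated zero-point $z_h$, and the large naïve error reflects the mis-centered grid behind the clipping failure described above):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.1, 1.0, 10_000).astype(np.float32)   # asymmetric weight sample

b_high, b_low = 8, 4
shift = b_high - b_low
S_h = (W.max() - W.min()) / (2 ** b_high - 1)
z_h = int(np.round(-W.min() / S_h))
Q_h = np.clip(np.round(W / S_h) + z_h, 0, 2 ** b_high - 1).astype(np.int32)
Q_l = Q_h >> shift
S_l = S_h * 2 ** shift                                # low-bit scale

naive = S_l * (Q_l - z_h)             # zero-point left untruncated
amat = S_l * (Q_l - (z_h >> shift))   # zero-point truncated with the codes

print("naive RMSE:", float(np.sqrt(np.mean((W - naive) ** 2))))
print("AMAT RMSE: ", float(np.sqrt(np.mean((W - amat) ** 2))))
```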
4. Matryoshka Bit-Slice Compatibility
AMAT provides exact nesting: the $b_\ell$-bit codes $Q_\ell$ correspond to the most significant bits of the $b_h$-bit codes $Q_h$, i.e., $Q_\ell = Q_h \gg (b_h - b_\ell)$. This nesting enables selective expert offloading and precision routing in MoE systems:
- When both the MSB (high-bit) and LSB slices are cached, the full high-bit representation is restored.
- When only the MSB cache is available, the low-bit path dequantizes with preserved fidelity.
- No additional memory regions are required; both bit-slices coexist within $Q_h$.
The term "Matryoshka" references this hierarchical nesting, analogous to Russian matryoshka dolls.
5. Integration and Implementation in MoE Pipelines
AMAT is natively compatible with expert offloading and cache management in SliceMoE:
- High-bit codes reside in non-volatile memory (Flash).
- Two DRAM caches operate: an MSB-cache (top bits, LRU-managed) and an LSB-cache (lower bits, lower priority).
- Dynamic Bit-Sliced Caching (DBSC) utilizes gating scores for expert selection:
- High-score experts receive both bit-slices for high precision.
- Lower-score experts use only the MSB (low precision).
- $Q_\ell$ and $z_\ell$ are derived in real time by integer shifts, avoiding recomputation or duplication.
- Inference logic dynamically selects between low-bit and high-bit GEMM operations according to cache availability.
This workflow requires no retraining or specialized hardware, relying solely on bit-shift arithmetic for mixed-precision selection.
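The cache-driven path selection can be sketched as follows (a hypothetical illustration: the names `select_expert_weights`, `msb_cache`, and `lsb_cache` are ours, and the paper's DBSC policy may differ in detail):

```python
import numpy as np

def select_expert_weights(expert_id, msb_cache, lsb_cache, S_h, z_h, shift):
    """Choose the dequantization path based on which bit-slices are cached."""
    msb = msb_cache.get(expert_id)   # b_low-bit codes, i.e. Q_h >> shift
    lsb = lsb_cache.get(expert_id)   # residual low bits of Q_h
    if msb is not None and lsb is not None:
        # Both slices cached: rebuild Q_h and dequantize at high precision.
        Q_h = (msb.astype(np.int32) << shift) | lsb
        return S_h * (Q_h - z_h)
    if msb is not None:
        # Only the MSB slice: low-bit path with the truncated zero-point.
        return (S_h * 2 ** shift) * (msb.astype(np.int32) - (z_h >> shift))
    return None  # cache miss: the high-bit codes must be fetched from Flash
```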
6. Quantitative Performance and Empirical Results
AMAT has been evaluated on the DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B models under strict DRAM-Flash miss-rate constraints. The following table reports per-token perplexity for several quantization schemes and matryoshka bit-width pairs; each cell shows high-bit / low-bit perplexity:
| Model | Quant | Scheme | MAT42 (4b,2b) | MAT63 (6b,3b) | MAT84 (8b,4b) |
|---|---|---|---|---|---|
| DeepSeek-V2-Lite | Sym | Base | 7.08 / 19.67 | 7.01 / 7.61 | 7.00 / 7.08 |
| DeepSeek-V2-Lite | Sym | Trunc | 7.08 / ∞ | 7.01 / ∞ | 7.00 / ∞ |
| DeepSeek-V2-Lite | Asym | Base | 7.06 / 9.46 | 7.01 / 7.29 | 7.00 / 7.06 |
| DeepSeek-V2-Lite | Asym | Trunc | 7.06 / ∞ | 7.01 / ∞ | 7.00 / ∞ |
| DeepSeek-V2-Lite | Asym | AMAT | 7.06 / 10.18 | 7.01 / 7.56 | 7.00 / 7.11 |
| Qwen1.5-MoE-A2.7B | Sym | Base | 8.18 / 19.75 | 7.97 / 9.12 | 7.97 / 8.18 |
| Qwen1.5-MoE-A2.7B | Sym | Trunc | 8.18 / ∞ | 7.97 / ∞ | 7.97 / ∞ |
| Qwen1.5-MoE-A2.7B | Asym | Base | 8.14 / 11.51 | 7.97 / 8.51 | 7.96 / 8.14 |
| Qwen1.5-MoE-A2.7B | Asym | Trunc | 8.14 / ∞ | 7.97 / ∞ | 7.96 / ∞ |
| Qwen1.5-MoE-A2.7B | Asym | AMAT | 8.14 / 10.85 | 7.97 / 8.61 | 7.96 / 8.09 |
Additional macro-level efficiency results on GSM8K (5-shot) tasks include:
- DeepSeek-V2-Lite decode: up to 2.37× lower energy and 1.81× speedup.
- Qwen1.5-MoE: up to 2.85× lower energy and 1.64× speedup.
AMAT preserves near-high-bit accuracy under competitive miss-rate budgets.
7. Context and Significance
Calibration-Free Asymmetric Matryoshka Quantization marks an advance in precision-efficient neural inference for resource-constrained, cache-centric architectures. Its explicit compatibility with slice-wise caching enables dynamic precision routing for MoE models, offering substantial energy and latency gains without compromising accuracy or requiring retraining. The joint truncation of integer codes and zero-points is essential to its fidelity and calibration-free operation, distinguishing it from earlier matryoshka and non-uniform quantization schemes. A plausible implication is the further scalability of conditional computation models on edge devices and accelerators, where mixed-precision and rapid cache adaptation are increasingly critical (Choi et al., 15 Dec 2025).