Calibration-Free Asymmetric Matryoshka Quantization
- AMAT is a truncation-based quantization scheme that uses zero-point-aware truncation to derive both high-bit and low-bit representations from a single integer code.
- It eliminates costly per-layer calibration by applying uniform quantization and truncating integer codes and zero-points with the same bit-shift, preserving the quantization range across slices.
- AMAT facilitates dynamic mixed-precision inference in MoE pipelines, achieving energy and latency improvements while maintaining near-high-bit accuracy.
Calibration-Free Asymmetric Matryoshka Quantization (AMAT) is a truncation-based quantization scheme designed for memory-efficient, mixed-precision inference in large-scale Mixture-of-Experts (MoE) models. AMAT enables direct compatibility between high-bit and low-bit quantized slices through zero-point-aware truncation, facilitating dynamic precision selection and cache-friendly expert management without layer- or channel-wise calibration or memory duplication. Originally introduced in the SliceMoE framework, AMAT provides a practical solution for conditional expert deployment under stringent energy and latency budgets, maintaining near-high-bit accuracy even in low-bit regimes (Choi et al., 15 Dec 2025).
1. Motivation and Design Principles
Conventional approaches to mixed-precision quantization in neural networks typically require storing separate copies of weight tensors for each precision or impose complex, non-uniform calibration procedures—especially problematic for MoE models with large parameter sets. AMAT is motivated by the need to:
- Eliminate memory duplication by allowing low-bit and high-bit slices to be derived from the same integer code.
- Abandon costly per-layer or per-channel calibration and distribution fitting.
- Enable slice-wise bit caching for conditional computation in MoE routing.
Key design features include simple uniform quantization, truncation of the integer codes and zero-point by the same bit-shift, and recentering of ranges to mitigate value clipping.
2. Mathematical Formulation and Key Algorithms
AMAT operates on a floating-point weight tensor $W$, producing high-bit ($b_h$-bit) and low-bit ($b_\ell$-bit) quantized representations, with $b_\ell < b_h$ and truncation shift $s = b_h - b_\ell$.
- High-bit uniform quantization (asymmetric integer scheme):
- Scale: $S_h = (w_{\max} - w_{\min}) / (2^{b_h} - 1)$
- Zero-point: $z_h = \mathrm{round}(-w_{\min} / S_h)$
- Quantization: $Q_h = \mathrm{clamp}\!\left(\mathrm{round}(W / S_h) + z_h,\; 0,\; 2^{b_h} - 1\right)$
- Approximate dequantization: $\hat{W} \approx S_h (Q_h - z_h)$
- Naïve truncation failure: bit-shifting the integer codes alone ($Q_\ell = \lfloor Q_h / 2^{s} \rfloor$) while leaving the zero-point at $z_h$ mis-centers the truncated grid and leads to severe clipping.
- AMAT truncation: both the integer codes and the zero-point are shifted by the same amount:
$$Q_\ell = \left\lfloor Q_h / 2^{s} \right\rfloor, \qquad z_\ell = \left\lfloor z_h / 2^{s} \right\rfloor$$
This preserves the correct range for the truncated slice.
- Low-bit dequantization: $\hat{W}_\ell \approx S_h \, 2^{s} \, (Q_\ell - z_\ell)$. If both bit-slices are available, the original high-bit dequantization $\hat{W} \approx S_h (Q_h - z_h)$ is performed instead.
- Algorithmic implementation (the flattened listing reconstructed as runnable NumPy):

```python
import numpy as np

def AMAT_Quantize(W, b_high, b_low):
    # Single global interval [w_min, w_max]: no per-layer/per-channel calibration.
    w_min = float(W.min())
    w_max = float(W.max())
    S_h = (w_max - w_min) / (2 ** b_high - 1)            # high-bit scale
    z_h = int(round(-w_min / S_h))                       # high-bit zero-point
    Q_h = np.clip(np.round(W / S_h) + z_h,
                  0, 2 ** b_high - 1).astype(np.int32)   # high-bit codes
    shift = b_high - b_low
    Q_l = Q_h >> shift                                   # truncate codes ...
    z_l = z_h >> shift                                   # ... and zero-point
    return {"scale_high": S_h, "zp_high": z_h, "Q_high": Q_h,
            "zp_low": z_l, "Q_low": Q_l}
```
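The corresponding dequantization paths can be sketched as follows; this is a minimal illustration, and the function names `dequant_high` and `dequant_low` are ours rather than the paper's:

```python
import numpy as np

def dequant_high(Q_h, S_h, z_h):
    # Both slices cached: standard asymmetric dequantization, W ≈ S_h (Q_h - z_h).
    return S_h * (Q_h.astype(np.float32) - z_h)

def dequant_low(Q_l, S_h, z_h, shift):
    # Only the MSB slice cached: the zero-point is truncated by the same shift,
    # and the scale absorbs the factor 2**shift (S_l = S_h * 2**shift).
    z_l = z_h >> shift
    return (S_h * 2 ** shift) * (Q_l.astype(np.float32) - z_l)
```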
3. Calibration-Free Operation
AMAT dispenses with all per-layer and per-channel calibration, using a single $[w_{\min}, w_{\max}]$ interval for both the high- and low-precision quantizations. The arithmetic truncation of integer code and zero-point ensures the low-bit slice inherits a consistently centered zero-point, avoiding layer-specific tuning and distribution fitting. This calibration-free property distinguishes AMAT from previous non-uniform matryoshka schemes, which require per-layer calibration and distribution-specific thresholds.
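A toy numerical check makes the failure mode concrete (an illustrative sketch using NumPy and synthetic Gaussian weights, not code from the paper; it models the naïve path as reusing the untruncated zero-point $z_h$, and the large naïve error reflects the mis-centered grid behind the clipping failure described above):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.1, 1.0, 10_000).astype(np.float32)   # asymmetric weight sample

b_high, b_low = 8, 4
shift = b_high - b_low
S_h = (W.max() - W.min()) / (2 ** b_high - 1)
z_h = int(np.round(-W.min() / S_h))
Q_h = np.clip(np.round(W / S_h) + z_h, 0, 2 ** b_high - 1).astype(np.int32)
Q_l = Q_h >> shift
S_l = S_h * 2 ** shift                                # low-bit scale

naive = S_l * (Q_l - z_h)             # zero-point left untruncated
amat = S_l * (Q_l - (z_h >> shift))   # zero-point truncated with the codes

print("naive RMSE:", float(np.sqrt(np.mean((W - naive) ** 2))))
print("AMAT RMSE: ", float(np.sqrt(np.mean((W - amat) ** 2))))
```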
4. Matryoshka Bit-Slice Compatibility
AMAT provides exact nesting: the $b_\ell$-bit codes $Q_\ell$ correspond to the most significant bits of the $b_h$-bit codes $Q_h$, i.e., $Q_\ell = Q_h \gg (b_h - b_\ell)$. This nesting enables selective expert offloading and precision routing in MoE systems:
- When both the MSB (high-bit) and LSB slices are cached, the full high-bit representation is restored.
- When only the MSB cache is available, the low-bit path dequantizes with preserved fidelity.
- No additional memory regions are required; both bit-slices coexist within $Q_h$.
The term "Matryoshka" references this hierarchical nesting, analogous to Russian matryoshka dolls.
5. Integration and Implementation in MoE Pipelines
AMAT is natively compatible with expert offloading and cache management in SliceMoE:
- High-bit codes reside in non-volatile memory (Flash).
- Two DRAM caches operate: an MSB-cache (top bits, LRU-managed) and an LSB-cache (lower bits, lower priority).
- Dynamic Bit-Sliced Caching (DBSC) utilizes gating scores for expert selection:
- High-score experts receive both bit-slices for high precision.
- Lower-score experts use only the MSB (low precision).
- $Q_\ell$ and $z_\ell$ are derived in real time by integer shifts, avoiding recomputation or duplication.
- Inference logic dynamically selects between low-bit and high-bit GEMM operations according to cache availability.
This workflow requires no retraining or specialized hardware, relying solely on bit-shift arithmetic for mixed-precision selection.
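The cache-driven path selection can be sketched as follows (a hypothetical illustration: the names `select_expert_weights`, `msb_cache`, and `lsb_cache` are ours, and the paper's DBSC policy may differ in detail):

```python
import numpy as np

def select_expert_weights(expert_id, msb_cache, lsb_cache, S_h, z_h, shift):
    """Choose the dequantization path based on which bit-slices are cached."""
    msb = msb_cache.get(expert_id)   # b_low-bit codes, i.e. Q_h >> shift
    lsb = lsb_cache.get(expert_id)   # residual low bits of Q_h
    if msb is not None and lsb is not None:
        # Both slices cached: rebuild Q_h and dequantize at high precision.
        Q_h = (msb.astype(np.int32) << shift) | lsb
        return S_h * (Q_h - z_h)
    if msb is not None:
        # Only the MSB slice: low-bit path with the truncated zero-point.
        return (S_h * 2 ** shift) * (msb.astype(np.int32) - (z_h >> shift))
    return None  # cache miss: the high-bit codes must be fetched from Flash
```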
6. Quantitative Performance and Empirical Results
AMAT has been evaluated on the DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B models under strict DRAM-Flash miss-rate constraints. The following table reports per-token perplexity for several quantization schemes and matryoshka bit-width pairs; each cell shows high-bit / low-bit perplexity:
| Model | Quant | Scheme | MAT42 (4b,2b) | MAT63 (6b,3b) | MAT84 (8b,4b) |
|---|---|---|---|---|---|
| DeepSeek-V2-Lite | Sym | Base | 7.08 / 19.67 | 7.01 / 7.61 | 7.00 / 7.08 |
| DeepSeek-V2-Lite | Sym | Trunc | 7.08 / ∞ | 7.01 / ∞ | 7.00 / ∞ |
| DeepSeek-V2-Lite | Asym | Base | 7.06 / 9.46 | 7.01 / 7.29 | 7.00 / 7.06 |
| DeepSeek-V2-Lite | Asym | Trunc | 7.06 / ∞ | 7.01 / ∞ | 7.00 / ∞ |
| DeepSeek-V2-Lite | Asym | AMAT | 7.06 / 10.18 | 7.01 / 7.56 | 7.00 / 7.11 |
| Qwen1.5-MoE-A2.7B | Sym | Base | 8.18 / 19.75 | 7.97 / 9.12 | 7.97 / 8.18 |
| Qwen1.5-MoE-A2.7B | Sym | Trunc | 8.18 / ∞ | 7.97 / ∞ | 7.97 / ∞ |
| Qwen1.5-MoE-A2.7B | Asym | Base | 8.14 / 11.51 | 7.97 / 8.51 | 7.96 / 8.14 |
| Qwen1.5-MoE-A2.7B | Asym | Trunc | 8.14 / ∞ | 7.97 / ∞ | 7.96 / ∞ |
| Qwen1.5-MoE-A2.7B | Asym | AMAT | 8.14 / 10.85 | 7.97 / 8.61 | 7.96 / 8.09 |
Additional macro-level efficiency results on GSM8K (5-shot) tasks include:
- DeepSeek-V2-Lite decode: up to 2.37× lower energy and 1.81× speedup.
- Qwen1.5-MoE: up to 2.85× lower energy and 1.64× speedup.
AMAT preserves near-high-bit accuracy under competitive miss-rate budgets.
7. Context and Significance
Calibration-Free Asymmetric Matryoshka Quantization marks an advance in precision-efficient neural inference for resource-constrained, cache-centric architectures. Its explicit compatibility with slice-wise caching enables dynamic precision routing for MoE models, offering substantial energy and latency gains without compromising accuracy or requiring retraining. The joint truncation of integer codes and zero-points is essential to its fidelity and calibration-free operation, distinguishing it from earlier matryoshka and non-uniform quantization schemes. A plausible implication is the further scalability of conditional computation models on edge devices and accelerators, where mixed-precision and rapid cache adaptation are increasingly critical (Choi et al., 15 Dec 2025).