
DynaExq: Adaptive MoE Quantization

Updated 26 November 2025
  • DynaExq is a runtime quantization framework that dynamically allocates bit-widths for MoE models, balancing high-precision needs of active experts with GPU memory constraints.
  • It employs an asynchronous, hotness-aware controller and dual-pool memory management to switch expert precision seamlessly, eliminating inference stalls and memory fragmentation.
  • Empirical evaluations on Qwen3 MoE models show that DynaExq nearly matches FP16 accuracy while significantly reducing HBM usage compared to static quantization.

DynaExq is a runtime quantization framework specifically developed for scalable deployment of large Mixture-of-Experts (MoE) LLMs under strict GPU memory (HBM) constraints. The method advances beyond traditional post-training static quantization by dynamically adjusting the precision (bit-width) of each expert during inference, based on real-time activation statistics. DynaExq enables high-accuracy serving of models such as Qwen3-30B and Qwen3-80B on commodity GPUs with limited memory, with substantially lower accuracy loss than static low-bit quantization, and without incurring forward path stalls or allocation fragmentation (Chu et al., 19 Nov 2025).

1. MoE Inference Bottleneck and Motivation

MoE models activate only a subset ($k \ll N$) of experts per token but require all $N$ experts' weights to be resident on the GPU for fast token-level routing. This results in severe HBM over-provisioning, especially as model scale increases (e.g., Qwen3-30B with 128 experts per layer consumes approximately 57 GB of HBM, even though only ~6 GB are in use per token). Static post-training quantization (PTQ) methods such as uniform INT4 or INT2 compress storage but either degrade the accuracy of frequently-activated ("hot") experts when aggressive quantization is applied, or squander memory keeping "cold" experts in high precision. DynaExq resolves this by recasting expert bit-width as a dynamically managed resource, promoting hot experts to high precision and demoting cold ones to low bit-width, with seamless transitions during live inference (Chu et al., 19 Nov 2025).
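To make the over-provisioning concrete, the minimal Python sketch below contrasts the expert memory that must stay resident with the memory actually touched per token, under assumed layer dimensions roughly in the Qwen3-30B regime (not the paper's exact configuration):

```python
# Back-of-envelope HBM estimate for an MoE model (dimensions are illustrative
# assumptions roughly in the Qwen3-30B regime, not figures taken from the paper).
def expert_bytes(hidden=2048, ffn=768, bytes_per_param=2):
    # One expert FFN holds gate, up, and down projections: ~3 * hidden * ffn params.
    return 3 * hidden * ffn * bytes_per_param

def moe_hbm_gb(num_layers=48, experts_per_layer=128, active_per_token=8):
    per_expert = expert_bytes()
    resident = num_layers * experts_per_layer * per_expert  # must stay in HBM for routing
    touched = num_layers * active_per_token * per_expert    # actually used for one token
    return resident / 1e9, touched / 1e9

resident_gb, touched_gb = moe_hbm_gb()
print(f"resident expert weights: {resident_gb:.1f} GB, touched per token: {touched_gb:.1f} GB")
```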

2. System Architecture and Key Components

DynaExq comprises three primary architectural modules:

(a) Hotness-Aware Precision Controller

This controller operates asynchronously on the CPU, sampling router outputs $g_i(x_t)$ for each expert $i$ at inference step $t$. Each expert maintains a hotness score as an exponential moving average (EMA):

$$S_i^{(t)} = \alpha\, S_i^{(t-1)} + (1 - \alpha)\, g_i(x_t)$$

Inactive experts decay as $S_j \leftarrow \alpha S_j$. At a fixed stride $T$, experts are sorted by $S_i$ to select the $n_{\mathrm{hot}}$ most active (scores $\geq \tau_h$) for high-precision allocation, subject to the layer-level HBM budget constraint:

$$n_{\mathrm{hot}} \cdot S^{\mathrm{hot}} + n_{\mathrm{cold}} \cdot S^{\mathrm{cold}} \leq M_{\mathrm{HBM}}$$

The controller triggers promotion or demotion of experts accordingly, always non-blocking and never stalling GPU inference kernels.
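A minimal sketch of this controller logic, assuming a NumPy-based EMA over router probabilities and a fixed promotion count n_hot; the class and method names are illustrative, not taken from the paper's code:

```python
import numpy as np

# Minimal sketch of a hotness-aware precision controller (illustrative only).
class HotnessController:
    def __init__(self, num_experts, alpha=0.9, n_hot=16, tau_h=0.01):
        self.S = np.zeros(num_experts)  # EMA hotness scores S_i
        self.alpha = alpha
        self.n_hot = n_hot              # derived offline from the HBM budget
        self.tau_h = tau_h

    def update(self, router_probs):
        """router_probs: g_i(x_t) for all experts at step t (0 for unrouted experts)."""
        active = router_probs > 0
        # EMA update for routed experts, pure decay for inactive ones.
        self.S[active] = self.alpha * self.S[active] + (1 - self.alpha) * router_probs[active]
        self.S[~active] *= self.alpha

    def select_hot(self):
        """Run every stride T: pick the n_hot highest-scoring experts above tau_h."""
        order = np.argsort(-self.S)
        return {int(i) for i in order[: self.n_hot] if self.S[i] >= self.tau_h}
```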

(b) Asynchronous Precision-Switching Pipeline

Expert weights reside in a hierarchy: SSD → DRAM cache → GPU HBM. Changes in bit-width occur through an end-to-end asynchronous pipeline:

  • Promotion (LOW → HIGH): SSD to DRAM (prefetch if needed), DRAM to HBM copy, registration of the new buffer, and reclamation of the former low-bit buffer.
  • Demotion (HIGH → LOW) is symmetric.

CUDA streams per tier overlap data movement, allowing the core MoE forward pass to proceed without stalling. Until the pipeline stabilizes, inference uses the last-committed version ("provenance-consistent" operation).
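The sketch below illustrates one way such a non-blocking promotion path with last-committed semantics can be structured in PyTorch, assuming a CUDA device, a pinned DRAM cache of high-precision weights, and a single side stream; it is a simplified illustration, not the paper's implementation:

```python
import threading
import torch

# Simplified promotion path with last-committed semantics (assumes a CUDA device
# and a pinned-memory DRAM cache of high-precision weights).
class ExpertStore:
    def __init__(self):
        self.committed = {}                     # expert_id -> (GPU weight tensor, bit-width)
        self.lock = threading.Lock()
        self.copy_stream = torch.cuda.Stream()  # side stream so copies overlap MoE compute

    def promote(self, expert_id, dram_weight_fp16):
        """dram_weight_fp16 should be a pinned CPU tensor so the H2D copy is truly async."""
        def _worker():
            with torch.cuda.stream(self.copy_stream):
                hbm_buf = dram_weight_fp16.to("cuda", non_blocking=True)
            self.copy_stream.synchronize()      # only the worker thread waits on the copy
            with self.lock:                     # atomic commit of the new high-precision buffer
                self.committed[expert_id] = (hbm_buf, 16)
        threading.Thread(target=_worker, daemon=True).start()

    def lookup(self, expert_id):
        # The forward pass always reads the last-committed version of an expert.
        with self.lock:
            return self.committed[expert_id]
```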

(c) Fragmentation-Free Dual-Pool Memory Management

HBM is split into fixed-size pools for high- and low-precision expert parameters, with each pool partitioned into blocks exactly sized for one expert's weights. All allocations and deallocations are atomic $O(1)$ bitmask operations, eliminating fragmentation and allocator jitter. Precision transitions atomically swap between pools, with a small transient buffer to handle bursts of in-flight promotions.
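A minimal sketch of fixed-block, bitmask-based pool allocation of the kind described above; the pool sizes and the BlockPool name are assumptions for illustration:

```python
# Fixed-block pool with O(1) bitmask allocation (illustrative; pool sizes are assumptions).
class BlockPool:
    def __init__(self, num_blocks):
        self.free_mask = (1 << num_blocks) - 1  # bit i set => block i is free

    def alloc(self):
        if self.free_mask == 0:
            raise MemoryError("pool exhausted")
        i = (self.free_mask & -self.free_mask).bit_length() - 1  # lowest free block
        self.free_mask &= ~(1 << i)
        return i                                # block index maps to a fixed HBM offset

    def free(self, i):
        self.free_mask |= 1 << i

# One pool per precision, each block sized for exactly one expert's weights.
# A precision switch allocates in the destination pool, copies, then frees the source block.
high_pool, low_pool = BlockPool(16), BlockPool(112)
```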

3. Bit-Width Assignment and Optimization Objective

The runtime controller solves for a bit-width allocation $b_i$ per expert satisfying:

$$b_i = \begin{cases} b_{\mathrm{high}} & \text{if } i \in \mathcal{H} \\ b_{\mathrm{low}} & \text{otherwise} \end{cases}$$

subject to $n_{\mathrm{hot}} \cdot S^{\mathrm{hot}} + n_{\mathrm{cold}} \cdot S^{\mathrm{cold}} \leq M_{\mathrm{HBM}}$, where $S^{\mathrm{hot}}$ and $S^{\mathrm{cold}}$ denote per-expert storage at high and low precision, respectively, and $n_{\mathrm{hot}} + n_{\mathrm{cold}} = N$. The controller's update schedule and threshold $\tau_h$ are chosen to satisfy this memory bound while minimizing oscillation in hotness coverage.
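Since every expert is either high- or low-precision, the budget constraint yields a closed-form bound on the number of promotable experts, $n_{\mathrm{hot}} \leq (M_{\mathrm{HBM}} - N \cdot S^{\mathrm{cold}}) / (S^{\mathrm{hot}} - S^{\mathrm{cold}})$. A small sketch with placeholder per-expert sizes (not measured values):

```python
# Closed-form bound on the number of high-precision experts under the layer HBM budget
# (per-expert sizes below are illustrative placeholders, not measured values).
def max_hot_experts(n_experts, s_hot_bytes, s_cold_bytes, m_hbm_bytes):
    # From n_hot * S_hot + (N - n_hot) * S_cold <= M_HBM:
    #   n_hot <= (M_HBM - N * S_cold) / (S_hot - S_cold)
    slack = m_hbm_bytes - n_experts * s_cold_bytes
    if slack < 0:
        raise ValueError("even an all-low-precision layer exceeds the budget")
    return min(n_experts, slack // (s_hot_bytes - s_cold_bytes))

# Example: 128 experts, 96 MiB per expert at FP16, 12 MiB at INT2, 4 GiB layer budget.
print(max_hot_experts(128, 96 << 20, 12 << 20, 4 << 30))  # -> 30
```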

4. Integration with MoE Inference Engines

DynaExq is implemented atop PyTorch and HuggingFace Transformers, leveraging AutoRound for offline weight calibration. Initialization comprises collecting hotness statistics during a warmup phase, solving the memory constraint, and carving out the HBM pools. Quantization and dequantization are overlapped with MoE kernels. Demoted weights persist in a dedicated pool, amortizing cost for experts reactivated soon after demotion.

During inference, CPU controller threads update hotness scores and schedule precision transitions, maintaining a CPU-side mapping of expert → (HBM address, bit-width). All precision switches execute fully asynchronously, maintaining deterministic HBM bounds.
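A compact sketch of such a CPU-side dispatch table; the ExpertEntry fields and the dispatch helper are illustrative assumptions about how the mapping might look, not the framework's actual API:

```python
from dataclasses import dataclass

# Compact sketch of a CPU-side dispatch table (field names are assumptions).
@dataclass
class ExpertEntry:
    hbm_block: int  # block index inside the high- or low-precision pool
    bits: int       # committed bit-width, e.g. 16, 4, or 2

expert_table = {e: ExpertEntry(hbm_block=e, bits=2) for e in range(128)}

def dispatch(expert_id):
    """Kernel launch reads the last-committed entry and selects the matching
    (de)quantized matmul path; controller threads mutate the table asynchronously."""
    entry = expert_table[expert_id]
    return entry.hbm_block, entry.bits
```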

5. Empirical Evaluation

Accuracy and Throughput

Representative results on Qwen3-MoE-30B (128 experts/layer) and Qwen3-MoE-80B (512 experts/layer), using 8 or 10 active experts per token:

| Model | Precision | HBM (GB) | Accuracy (%) | Δ from FP16 |
|-----------|------------------|----------|--------------|-------------|
| Qwen3-30B | FP16 | 57 | 65.96 | 0.00 |
| Qwen3-30B | Static INT4 | 17 | 64.72 | -1.24 |
| Qwen3-30B | DynaExq (A6000) | 17 | 65.15 | -0.81 |
| Qwen3-80B | Static INT4 | 41 | 77.74 | n/a |
| Qwen3-80B | Static INT2 | 21 | 72.65 | n/a |
| Qwen3-80B | DynaExq (A6000) | 21 | 76.68 | n/a |

DynaExq delivers accuracy nearly matching much larger FP16 baselines while operating within strict HBM budgets. Throughput is 85–95% of fully static INT2 quantized inference, with first-token and per-token latency increased by only 5–15% relative to the most aggressive static quantization.

Perplexity studies demonstrate that as the fraction of demoted experts grows, DynaExq degrades more gracefully than uniform low-bit quantization due to preferential demotion of rarely-routed ("cold") experts first.

Ablations

  • Blocking (synchronous) precision switches result in 20–30% latency spikes and allocation jitter; DynaExq's dual-pool design eliminates the > 1 GB of memory fragmentation observed in early prototypes.
  • Static quantization fails to adapt to the evolving set of hot and cold experts, rapidly degrading accuracy in dynamic workloads (see steep perplexity increases for INT4/INT2 baselines).

6. Insights and Limitations

DynaExq, by elevating expert bit-width to a first-class runtime resource, substantially closes the accuracy gap between static quantized and high-precision MoE inference under fixed HBM capacity. Its adaptive controller preferentially preserves precision for the most impactful ("hot") experts over long stretches, and the decoupled, non-blocking pipeline prevents inference stalls and resource leakage. This approach supports deployment of MoE LLMs with hundreds of billions of parameters on single 32–48 GB GPUs, marking a major advance for consumer-grade infrastructure (Chu et al., 19 Nov 2025).

Limitations include the requirement for workload-aware hotness estimation (which might lag during rapid context shifts), the need for careful offline calibration of quantization artifacts, and the engineering complexity of maintaining multiple asynchronous memory tiers and pools. Thresholds, EMA parameters, and chunking intervals require empirical tuning.

7. Relation to Broader Quantization and Efficient Inference Research

DynaExq exemplifies a new class of runtime-adaptive quantization frameworks distinct from prior uniform and post-training schemes. Rather than constraining all weights to a fixed bit-width, DynaExq exploits the inherently dynamic sparsity patterns induced by MoE routers—tracking expert utilization and dynamically allocating precision under a global resource envelope. This strategy is orthogonal to, and can be composed with, parallel advances in early exit (e.g., DYNAMAX (Nogales et al., 29 Apr 2025)), operator fusion, and tensor rematerialization for further inference efficiency optimization. The general approach of "hotness-aware resource management" is likely to generalize to other conditional computation models (e.g., LayerDrop, dynamic routing), suggesting broad relevance for scalable serving of large foundation models under hardware constraints (Chu et al., 19 Nov 2025).
