DynaExq: Adaptive MoE Quantization
- DynaExq is a runtime quantization framework that dynamically allocates bit-widths for MoE models, balancing high-precision needs of active experts with GPU memory constraints.
- It employs an asynchronous, hotness-aware controller and dual-pool memory management to switch expert precision seamlessly, eliminating inference stalls and memory fragmentation.
- Empirical evaluations on Qwen3 MoE models show that DynaExq nearly matches FP16 accuracy while using far less HBM than FP16, fitting within the same memory budgets as static low-bit quantization.
DynaExq is a runtime quantization framework specifically developed for scalable deployment of large Mixture-of-Experts (MoE) LLMs under strict GPU memory (HBM) constraints. The method advances beyond traditional post-training static quantization by dynamically adjusting the precision (bit-width) of each expert during inference, based on real-time activation statistics. DynaExq enables high-accuracy serving of models such as Qwen3-30B and Qwen3-80B on commodity GPUs with limited memory, with substantially lower accuracy loss than static low-bit quantization, and without incurring forward path stalls or allocation fragmentation (Chu et al., 19 Nov 2025).
1. MoE Inference Bottleneck and Motivation
MoE models activate only a small subset (the top-$k$) of experts per token, yet all experts' weights must be resident on the GPU to support fast token-level routing. This results in severe HBM over-provisioning, especially as model scale increases (e.g., Qwen3-30B with 128 experts/layer consumes approximately 57 GB of HBM even though only ~6 GB are actually used per token). Static post-training quantization (PTQ) methods such as uniform INT4 or INT2 compress storage but either degrade the accuracy of frequently activated ("hot") experts when quantization is aggressive, or squander memory keeping "cold" experts in high precision. DynaExq resolves this by recasting expert bit-width as a dynamically managed resource, promoting hot experts to high precision and demoting cold ones to low bit-width, with seamless transitions during live inference (Chu et al., 19 Nov 2025).
2. System Architecture and Key Components
DynaExq comprises three primary architectural modules:
(a) Hotness-Aware Precision Controller
This controller operates asynchronously on the CPU, sampling router outputs for each expert $e$ at inference step $t$. Each expert maintains a hotness score $h_e^{(t)}$ as an Exponential Moving Average (EMA):

$$h_e^{(t)} = \lambda\, a_e^{(t)} + (1-\lambda)\, h_e^{(t-1)},$$

where $a_e^{(t)}$ is the expert's routed activation count at step $t$ and $\lambda \in (0,1)$ is the smoothing factor. Inactive experts decay geometrically with factor $(1-\lambda)$. At a fixed stride of inference steps, experts are sorted by $h_e$ and the most active (those with scores above the promotion threshold) are selected for high-precision allocation, subject to the layer-level HBM budget constraint:

$$\sum_{e} s(b_e) \le B_{\text{HBM}},$$

where $b_e$ is the bit-width assigned to expert $e$ and $s(\cdot)$ its storage footprint.
The controller triggers promotion or demotion of experts accordingly, always non-blocking and never stalling GPU inference kernels.
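As a rough illustration of this bookkeeping (not the paper's implementation), the following Python sketch tracks EMA hotness and greedily keeps the hottest experts under a byte budget; names such as `ema_lambda`, `stride`, and `budget_bytes` are assumptions:

```python
import numpy as np

class HotnessController:
    """EMA hotness tracking plus budget-constrained selection (illustrative sketch)."""

    def __init__(self, num_experts, bytes_high, bytes_low, budget_bytes,
                 ema_lambda=0.1, stride=64):
        self.h = np.zeros(num_experts)       # per-expert hotness scores
        self.bytes_high = bytes_high         # per-expert storage at high precision
        self.bytes_low = bytes_low           # per-expert storage at low precision
        self.budget_bytes = budget_bytes     # layer-level HBM budget
        self.ema_lambda = ema_lambda         # EMA smoothing factor
        self.stride = stride                 # re-evaluate assignments every `stride` steps
        self.step = 0

    def observe(self, routed_counts):
        """Fold this step's router outputs (activation counts per expert) into the EMA."""
        self.h = self.ema_lambda * routed_counts + (1.0 - self.ema_lambda) * self.h
        self.step += 1
        return self.select_high_precision() if self.step % self.stride == 0 else None

    def select_high_precision(self):
        """Greedily keep the hottest experts in high precision under the HBM budget."""
        spent = self.bytes_low * len(self.h)      # baseline: every expert at low precision
        extra = self.bytes_high - self.bytes_low  # incremental cost of one promotion
        high = set()
        for e in np.argsort(-self.h):             # experts in descending hotness order
            if spent + extra > self.budget_bytes:
                break
            high.add(int(e))
            spent += extra
        return high                               # promote these experts; demote the rest
```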
(b) Asynchronous Precision-Switching Pipeline
Expert weights reside in a three-tier hierarchy: SSD → DRAM cache → GPU HBM. Changes in bit-width occur through an end-to-end asynchronous pipeline:
- Promotion (LOW → HIGH): SSD→DRAM prefetch (if needed), DRAM→HBM copy, registration of the new buffer, and reclamation of the former low-bit buffer.
- Demotion (HIGH → LOW) is symmetric.
Per-tier CUDA streams overlap data movement with computation, allowing the core MoE forward pass to proceed without stalling. Until a switch completes, inference uses the expert's last-committed weights ("provenance-consistent" operation).
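The paper describes this pipeline at the systems level; the sketch below only illustrates the general non-blocking copy pattern in PyTorch, with a hypothetical `hbm_pool` and stream naming rather than the paper's API:

```python
import torch

copy_stream = torch.cuda.Stream()   # dedicated side stream for weight transfers

def promote_expert(expert_id, dram_weights_fp16, hbm_pool):
    """Stage a high-precision copy of one expert's weights into HBM without blocking compute.

    dram_weights_fp16: pinned, flattened CPU tensor already prefetched from SSD into DRAM.
    hbm_pool: pre-carved GPU block pool (see the dual-pool sketch in the next subsection).
    Returns (new_buffer, cuda_event); the switch is committed only once the event has
    completed, so the forward pass keeps using the last-committed low-bit weights.
    """
    dst = hbm_pool.acquire_block(expert_id)                # fixed-size block, no cudaMalloc here
    with torch.cuda.stream(copy_stream):
        dst.copy_(dram_weights_fp16, non_blocking=True)    # H2D copy overlaps the MoE forward kernels
        done = torch.cuda.Event()
        done.record(copy_stream)
    return dst, done
```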
(c) Fragmentation-Free Dual-Pool Memory Management
HBM is split into fixed-size pools for high- and low-precision expert parameters, with each pool partitioned into blocks exactly sized for one expert's weights. All allocations/deallocations are atomic bitmask operations, eliminating fragmentation and allocator jitter. Precision transitions atomically swap between pools, with a small transient buffer to handle bursts of in-flight promotions.
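A simplified Python sketch of such a fixed-block pool, using an integer bitmask to track free blocks (the class and method names are illustrative; the actual system manages raw HBM buffers per precision pool):

```python
import threading
import torch

class ExpertPool:
    """Pool of equally sized HBM blocks, one per expert, tracked by a free-block bitmask."""

    def __init__(self, num_blocks, block_numel, dtype, device="cuda"):
        # One contiguous allocation up front; no per-expert cudaMalloc on the hot path.
        self.storage = torch.empty(num_blocks * block_numel, dtype=dtype, device=device)
        self.block_numel = block_numel
        self.free_mask = (1 << num_blocks) - 1     # bit i set  =>  block i is free
        self.owner = {}                            # expert_id -> block index
        self.lock = threading.Lock()

    def acquire_block(self, expert_id):
        with self.lock:                            # atomic claim of the lowest free block
            if self.free_mask == 0:
                raise MemoryError("precision pool exhausted")
            i = (self.free_mask & -self.free_mask).bit_length() - 1
            self.free_mask &= ~(1 << i)
            self.owner[expert_id] = i
        return self.storage[i * self.block_numel:(i + 1) * self.block_numel]

    def release_block(self, expert_id):
        with self.lock:                            # atomic release; no fragmentation possible
            i = self.owner.pop(expert_id)
            self.free_mask |= 1 << i
```

Because every block has exactly one expert's footprint, a precision transition reduces to releasing a block in one pool and acquiring one in the other, with no variable-size allocation involved.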
3. Bit-Width Assignment and Optimization Objective
The runtime controller solves for a bit-width assignment $b_e \in \{b_L, b_H\}$ per expert that maximizes hotness coverage,

$$\max_{\{b_e\}} \;\sum_{e} h_e \,\mathbb{1}[\,b_e = b_H\,],$$

subject to the memory constraint

$$\sum_{e} s(b_e) \le B_{\text{HBM}},$$

where $s_H$ and $s_L$ are the per-expert storage footprints at high and low precision, and $s(b_e) \in \{s_L, s_H\}$ accordingly. The controller's update stride and promotion threshold are chosen to satisfy this memory bound while minimizing oscillation in hotness coverage.
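For intuition, the constraint directly bounds how many experts can be promoted: holding all experts at low precision sets the baseline cost, and each promotion adds the difference in footprints. A toy calculation with purely hypothetical figures (not values from the paper):

```python
def max_high_precision_experts(num_experts, s_high, s_low, budget):
    """Largest number of experts that fit at high precision under the HBM budget.

    Cost of promoting n experts: n*s_high + (num_experts - n)*s_low
    => n_max = floor((budget - num_experts*s_low) / (s_high - s_low)).
    """
    spare = budget - num_experts * s_low
    if spare < 0:
        raise ValueError("budget cannot hold all experts even at low precision")
    return min(num_experts, int(spare // (s_high - s_low)))

# Hypothetical figures: 128 experts, 0.44 GB/expert at high precision,
# 0.11 GB/expert at low precision, 17 GB budget.
print(max_high_precision_experts(128, s_high=0.44, s_low=0.11, budget=17.0))  # -> 8
```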
4. Integration with MoE Inference Engines
DynaExq is implemented atop PyTorch and HuggingFace Transformers, leveraging AutoRound for offline weight calibration. Initialization involves collecting hotness statistics during a warmup pass, solving the memory-budget constraint, and carving out the HBM pools. Quantization and dequantization are overlapped with the MoE kernels. Demoted weights persist in a dedicated pool, amortizing the cost for experts reactivated soon after demotion.
During inference, CPU controller threads update hotness scores and schedule precision transitions, maintaining a CPU-side mapping of expert → (HBM address, bit-width). All precision switches execute fully asynchronously, preserving deterministic HBM bounds.
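A hedged sketch of how such a controller thread could tie the earlier sketches together; the `engine` methods, registry layout, and polling scheme are illustrative assumptions rather than DynaExq's actual interfaces:

```python
import time

def controller_loop(controller, engine, high_pool, low_pool, registry, stop_event,
                    poll_interval=0.01):
    """Background CPU thread: update hotness, schedule precision switches, and keep the
    expert -> (HBM buffer, bit-width) mapping consistent without blocking GPU kernels."""
    pending = {}                                          # expert_id -> (new buffer, CUDA event)
    while not stop_event.is_set():
        counts = engine.pop_router_counts()               # per-expert activations since last poll
        target_high = controller.observe(counts)          # non-None only every `stride` steps
        if target_high is not None:
            current_high = {e for e, (prec, _) in registry.items() if prec == "HIGH"}
            for e in target_high - current_high:          # promotions
                weights = engine.dram_cache.fetch_fp16(e) # SSD -> DRAM prefetch if needed
                pending[e] = promote_expert(e, weights, high_pool)
            for e in current_high - target_high:          # demotions (symmetric low-bit copy-in)
                engine.schedule_demotion(e, low_pool)
        for e, (buf, event) in list(pending.items()):     # commit finished promotions
            if event.query():                             # side-stream copy has completed
                registry[e] = ("HIGH", buf)               # registry swap seen by the forward pass
                low_pool.release_block(e)                 # reclaim the old low-bit block
                del pending[e]
        time.sleep(poll_interval)
```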
5. Empirical Evaluation
Accuracy and Throughput
Representative results on Qwen3-MoE-30B (128 experts/layer) and Qwen3-MoE-80B (512 experts/layer), using 8 or 10 active experts per token:
| Model | Precision | HBM (GB) | Accuracy (%) | Δ from FP16 |
|---|---|---|---|---|
| Qwen3-30B | FP16 | 57 | 65.96 | 0.00 |
| Qwen3-30B | Static INT4 | 17 | 64.72 | -1.24 |
| Qwen3-30B | DynaExq (A6000) | 17 | 65.15 | -0.81 |
| Qwen3-80B | Static INT4 | 41 | 77.74 | n/a |
| Qwen3-80B | Static INT2 | 21 | 72.65 | n/a |
| Qwen3-80B | DynaExq (A6000) | 21 | 76.68 | n/a |
DynaExq delivers accuracy close to that of the FP16 and higher-bit baselines, which require far more memory, while operating within strict HBM budgets. Throughput reaches 85–95% of fully static INT2 quantized inference, and first-token and per-token latencies increase by only 5–15% relative to the most aggressive static quantization.
Perplexity studies show that as the fraction of demoted experts grows, DynaExq degrades more gracefully than uniform low-bit quantization, because rarely routed ("cold") experts are demoted first.
Ablations
- Blocking (synchronous) precision switches cause 20–30% latency spikes and allocation jitter; DynaExq's dual-pool design eliminates the gigabyte-scale memory fragmentation observed in early prototypes.
- Static quantization fails to adapt to the evolving set of hot and cold experts, rapidly degrading accuracy in dynamic workloads (see steep perplexity increases for INT4/INT2 baselines).
6. Insights and Limitations
DynaExq, by elevating expert bit-width to a first-class runtime resource, substantially closes the accuracy gap between static quantized and high-precision MoE inference under fixed HBM capacity. Its adaptive controller preferentially preserves precision for the most impactful ("hot") experts over long stretches, and the decoupled, non-blocking pipeline prevents inference stalls and resource leakage. This approach supports deployment of MoE LLMs with hundreds of billions of parameters on single 32–48 GB GPUs, marking a major advance for consumer-grade infrastructure (Chu et al., 19 Nov 2025).
Limitations include the reliance on workload-aware hotness estimation (which may lag during rapid context shifts), the need for careful offline calibration of the quantized weights, and the engineering complexity of maintaining multiple asynchronous memory tiers and pools. Thresholds, EMA parameters, and update strides require empirical tuning.
7. Relation to Broader Quantization and Efficient Inference Research
DynaExq exemplifies a new class of runtime-adaptive quantization frameworks distinct from prior uniform and post-training schemes. Rather than constraining all weights to a fixed bit-width, DynaExq exploits the inherently dynamic sparsity patterns induced by MoE routers—tracking expert utilization and dynamically allocating precision under a global resource envelope. This strategy is orthogonal to, and can be composed with, parallel advances in early exit (e.g., DYNAMAX (Nogales et al., 29 Apr 2025)), operator fusion, and tensor rematerialization for further inference efficiency optimization. The general approach of "hotness-aware resource management" is likely to generalize to other conditional computation models (e.g., LayerDrop, dynamic routing), suggesting broad relevance for scalable serving of large foundation models under hardware constraints (Chu et al., 19 Nov 2025).