Activation-Weight Co-Quantization
- Activation-Weight Co-Quantization is the process of converting both weights and activations from full precision to low-bit representations, thereby reducing memory footprint and improving computational efficiency.
- It employs diverse techniques such as post-training quantization, quantization-aware training, mixed-precision assignment, and codebook-based strategies to mitigate joint quantization errors.
- Advanced methods integrate statistical error modeling, adaptive scaling, and hardware/software co-design to address activation outlier sensitivity and ensure minimal accuracy degradation.
Activation-Weight Co-Quantization refers to the simultaneous quantization of both weights and activations in artificial neural networks, mapping full-precision (typically floating-point) parameters and intermediate values to low-bitwidth representations (e.g., INT4, INT8). The primary motivation is to achieve significant reductions in memory footprint together with improvements in inference efficiency and hardware compatibility, especially for resource-constrained deployment scenarios. Activation-weight co-quantization encompasses a spectrum of methods, including post-training quantization (PTQ), quantization-aware training (QAT), mixed-precision assignment, block-based codebooks, output-approximation adjustment, and hardware/software co-design. Effective co-quantization must address the intricate statistical interplay and error propagation between weights and activations, which can produce complex degradation patterns when both are quantized aggressively.
1. Theoretical Framework and Problem Formulation
The mathematical basis for activation-weight co-quantization considers a neural network layer parameterized by a weight tensor $W$ and taking an activation input $X$. Let $Q_W(\cdot)$ and $Q_X(\cdot)$ denote the respective quantization operators (often uniform or codebook-based), defined by a scale $s$ and zero-point $z$. Classical uniform quantization for a $b$-bit representation proceeds as

$$Q(x) = s\left(\operatorname{clamp}\!\left(\left\lfloor \tfrac{x}{s}\right\rceil + z,\ 0,\ 2^{b}-1\right) - z\right).$$

The standard forward pass with both operands quantized is $\hat{Y} = Q_X(X)\,Q_W(W)$, with the quantizers simulated during training (QAT) or applied only at inference time (PTQ), depending on operator placement.
Crucially, optimal quantization scales and codebooks must be chosen accounting for the joint error induced in the layer output $Y = XW$ due to both sources of quantization noise. Linearizing the error, as in (Yao et al., 2023), reveals

$$\Delta Y \approx X\,\Delta W + \Delta X\,W + \Delta X\,\Delta W,$$

where $\Delta W = Q_W(W) - W$ and $\Delta X = Q_X(X) - X$. The dominant errors, $X\,\Delta W$ and $\Delta X\,W$, are coupled: suboptimal quantization of either operand can amplify total output error in a nontrivial fashion.
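To make the decomposition concrete, the following NumPy sketch (helper names are illustrative, not taken from any cited implementation) quantizes a random weight matrix and activation batch with a symmetric uniform quantizer and compares the individual error terms with the exact output discrepancy.

```python
import numpy as np

def uniform_quantize(x, bits=4):
    """Symmetric per-tensor uniform quantizer (round-to-nearest on a fixed grid)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 256))            # activation batch
W = rng.normal(size=(256, 128)) * 0.02    # weight matrix

dW = uniform_quantize(W) - W              # weight quantization noise
dX = uniform_quantize(X) - X              # activation quantization noise

# Exact decomposition: Q(X) Q(W) - X W = X dW + dX W + dX dW
weight_term, act_term, cross_term = X @ dW, dX @ W, dX @ dW
total = uniform_quantize(X) @ uniform_quantize(W) - X @ W

for name, term in [("X*dW", weight_term), ("dX*W", act_term),
                   ("dX*dW", cross_term), ("total", total)]:
    print(name, np.linalg.norm(term))
```

Because $Q_X(X)\,Q_W(W) - XW = X\Delta W + \Delta X W + \Delta X \Delta W$ holds exactly, the printed norms show how the two first-order terms dominate and how they combine into the total output error.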
Recent theory (e.g., (Long et al., 2020)) formalizes training with weights and activations constrained to discrete sets $\mathcal{Q}_W$ and $\mathcal{Q}_A$, leading to non-convex optimization. The QUANT algorithm (projected "coarse" gradient descent with STE) obtains guaranteed recurrence to global optima under mild proximity assumptions, a key theoretical insight for full-precision-to-quantized training regimes.
2. Quantization Algorithms and Co-Optimization Strategies
A range of practical methods address the challenges of balancing accuracy and efficiency during co-quantization, including:
Uniform and Non-Uniform Quantizers
Uniform quantization applies fixed intervals, often with symmetric or asymmetric scaling per-channel or per-tensor. Extensions include non-linear codebooks (e.g., Lloyd–Max or logarithmic quantizers), and adaptive codebooks derived per statistical cluster of weights or activations (Elangovan et al., 7 Feb 2025).
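As a minimal illustration of the two quantizer families (uniform per-channel versus a learned non-uniform codebook), the sketch below uses NumPy only; the function names and the quantile-based initialization are illustrative choices, not the exact procedures of the cited works.

```python
import numpy as np

def per_channel_symmetric(w, bits=4):
    """Uniform quantizer with one symmetric scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def lloyd_max_quantize(x, bits=4, iters=20):
    """Non-uniform codebook for a 1-D array x, fit by Lloyd-Max iterations
    (equivalently, k-means on scalar values)."""
    levels = np.quantile(x, np.linspace(0, 1, 2 ** bits))   # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return levels[idx]            # x mapped onto its nearest learned level
```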
Activation-Quantization-Aware Scaling
Activation–Weight co-quantization incurs a mismatch if operand scales are optimized independently. Activation-Quantization-Aware Scaling (AQAS) (Lee et al., 2023) introduces per-channel weight scaling that jointly minimizes the output MSE,

$$s^{*} = \arg\min_{s}\ \big\|\, X W - Q_X(X)\, Q_W(W; s) \,\big\|_2^2 ,$$

where $Q_W(\cdot\,; s)$ denotes weight quantization with scale $s$. A grid search over candidate scales selects the balance yielding the lowest joint quantization loss on calibration data.
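A simplified sketch of this idea appears below (per-tensor rather than per-channel, with illustrative function names): the weight scale is chosen by grid search against the output MSE computed with already-quantized activations, instead of against the weight reconstruction error alone.

```python
import numpy as np

def quantize_with_scale(x, scale, bits=4):
    """Uniform symmetric quantization with an explicit, externally chosen scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def aqas_weight_scale(X, W, w_bits=4, a_bits=8, n_grid=64):
    """Select the weight scale that minimizes output MSE with activations
    already quantized (a per-tensor sketch of the AQAS idea)."""
    a_scale = np.abs(X).max() / (2 ** (a_bits - 1) - 1)
    Xq = quantize_with_scale(X, a_scale, bits=a_bits)
    ref = X @ W

    best_scale, best_err = None, np.inf
    max_w = np.abs(W).max()
    for frac in np.linspace(0.2, 1.0, n_grid):     # candidate clipping fractions
        scale = frac * max_w / (2 ** (w_bits - 1) - 1)
        err = np.mean((ref - Xq @ quantize_with_scale(W, scale, bits=w_bits)) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```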
Output-Level Alignment and Closed-Form Corrections
LoaQ (Lin et al., 8 Sep 2025) exploits a closed-form adjustment compensating for activation quantization drift by modifying the quantized weights so that the layer output, rather than the weights themselves, is matched:

$$\widetilde{W} = \arg\min_{W'}\ \big\|\, X W - \hat{X} W' \,\big\|_F^2 ,$$

where $X$ is the full-precision and $\hat{X} = Q_X(X)$ the quantized activation. This procedure explicitly minimizes the output-level error $\|XW - \hat{X}\widetilde{W}\|$, preventing the error amplification typical in naive layer-wise PTQ.
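One way to realize such an output-level correction is an ordinary ridge-regularized least-squares solve, sketched below in NumPy; the damping term and function name are illustrative assumptions, not taken from the LoaQ paper.

```python
import numpy as np

def output_aligned_weights(X, X_hat, W, damp=1e-4):
    """Closed-form weight adjustment so that X_hat @ W_new approximates X @ W.

    Solves min_W' ||X W - X_hat W'||_F^2 with a small ridge term for numerical
    stability when the quantized activations X_hat are ill-conditioned.
    """
    d = X.shape[1]
    G = X_hat.T @ X_hat + damp * np.eye(d)
    return np.linalg.solve(G, X_hat.T @ (X @ W))
```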
Mixed Precision and Sensitivity-Guided Allocation
Fine-grained mixed-precision assignment (FGMP) (Hooper et al., 19 Apr 2025) leverages the Fisher information matrix to score each block of weights/activations by its impact on the loss, assigning high precision to only the most sensitive blocks and quantizing the rest to ultra-low precision (e.g., NVFP4). The per-block impact score is

$$S_B = \sum_{i \in B} F_{ii}\,(\Delta w_i)^2 ,$$

where $F_{ii}$ is the diagonal Fisher information and $\Delta w_i$ the quantization error of element $i$; blocks whose score exceeds a threshold retain higher precision, minimizing accuracy loss at a fixed memory/energy budget.
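The sketch below (illustrative names; squared gradients stand in for the diagonal Fisher information, and ordinary INT4/INT8 replace NVFP4/FP8) shows how such block-level sensitivity scores can drive a precision assignment.

```python
import numpy as np

def assign_block_precision(w, grad, block=16, bits_hi=8, bits_lo=4, keep_frac=0.1):
    """Fisher-guided mixed-precision assignment over contiguous blocks.

    Squared gradients approximate the diagonal Fisher information F_ii; each
    block is scored by sum_i F_ii * (dw_i)^2, where dw_i is the error it would
    incur at the low precision, and the top `keep_frac` blocks keep bits_hi.
    """
    def low_precision_error(blocks, bits):
        qmax = 2 ** (bits - 1) - 1
        s = np.abs(blocks).max(axis=1, keepdims=True) / qmax + 1e-12
        return np.clip(np.round(blocks / s), -qmax, qmax) * s - blocks

    w_blocks = w.reshape(-1, block)
    f_blocks = (grad ** 2).reshape(-1, block)        # diagonal Fisher proxy
    scores = (f_blocks * low_precision_error(w_blocks, bits_lo) ** 2).sum(axis=1)
    threshold = np.quantile(scores, 1.0 - keep_frac)
    return np.where(scores >= threshold, bits_hi, bits_lo)   # bitwidth per block
```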
Post-Training and QAT Pipelines
PTQ-based methods gather calibration data to estimate quantization parameters (scales, codebooks), possibly applying closed-form weight corrections, range clipping, and codebook clustering (Elangovan et al., 7 Feb 2025, Yang et al., 2023, Lee et al., 2023). QAT approaches (e.g., (Long et al., 2020, Choi et al., 2018, Ardakani et al., 2022)) embed quantization and STE in the forward/backward passes, learning scale and codebook parameters jointly with weights.
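For the QAT side, a minimal PyTorch-style fake-quantization function with a straight-through estimator might look as follows; it is a generic sketch rather than the exact formulation of any cited method.

```python
import torch

def fake_quant_ste(x, bits=4):
    """Fake quantization with a straight-through estimator.

    Forward pass: symmetric uniform quantization.  Backward pass: gradients
    flow through unchanged, via the x + (q - x).detach() identity.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max() / qmax + 1e-8
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (q - x).detach()

# Inside a QAT forward pass, both operands are fake-quantized before the matmul:
#   y = fake_quant_ste(h, bits=8) @ fake_quant_ste(layer.weight, bits=4).t()
```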
3. Specialized Techniques for Activation-Weight Co-Quantization
dINT Hybrid Format and Underflow Mitigation
4-bit quantization often suffers from underflow, as small-magnitude values round to zero. The dINT format (Lee et al., 2023) reserves two codes for “denormal half-steps,” dramatically reducing underflow error. A finite-state detector in hardware routes these codes and achieves over 2× area/power efficiency improvement in MAC units.
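A purely illustrative decoder for such a hybrid format is sketched below; the specific code-point assignment is an assumption for exposition, and the actual dINT layout may differ.

```python
def decode_dint4(code: int, scale: float) -> float:
    """Illustrative decoder for a dINT-style 4-bit hybrid format.

    Two code points are assumed to be reserved for 'denormal half-steps'
    (+/- 0.5 * scale), so small magnitudes no longer collapse to zero;
    all other codes decode as ordinary signed integers.  The reserved
    code points chosen here are an assumption for exposition only.
    """
    HALF_POS, HALF_NEG = 0b0111, 0b1111      # assumed reserved codes
    if code == HALF_POS:
        return 0.5 * scale
    if code == HALF_NEG:
        return -0.5 * scale
    value = code - 16 if code >= 8 else code  # 4-bit two's complement
    return value * scale
```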
Channel- or Block-Specific Clustering
Block-Clustered Quantization (BCQ) (Elangovan et al., 7 Feb 2025) forms contiguous blocks (e.g., 8 values), clusters them, and learns dedicated codebooks for each cluster. This raises the effective bitwidth slightly above the nominal 4 bits (≈4.5 bits once codebook-selection overhead is counted), but enables sub-1% loss in W4A4 LLM quantization without retraining.
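The sketch below captures the overall structure under simplifying assumptions (plain k-means over raw block vectors, and a per-cluster uniform scale standing in for the learned codebook); names and hyperparameters are illustrative.

```python
import numpy as np

def block_clustered_quantize(w, block=8, n_clusters=16, codebook_bits=4, iters=10):
    """Block-Clustered Quantization, simplified: split a tensor into contiguous
    blocks, cluster the block vectors, then quantize each cluster with its own
    (here: uniform) codebook.  Assumes w.size is divisible by `block`."""
    blocks = w.reshape(-1, block)
    rng = np.random.default_rng(0)

    # plain k-means over block vectors
    centers = blocks[rng.choice(len(blocks), n_clusters, replace=False)]
    for _ in range(iters):
        dist = ((blocks[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = blocks[assign == k].mean(axis=0)

    # per-cluster uniform codebook (one scale per cluster)
    qmax = 2 ** (codebook_bits - 1) - 1
    out = np.empty_like(blocks)
    for k in range(n_clusters):
        sel = assign == k
        if not np.any(sel):
            continue
        s = np.abs(blocks[sel]).max() / qmax + 1e-12
        out[sel] = np.clip(np.round(blocks[sel] / s), -qmax, qmax) * s
    return out.reshape(w.shape)
```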
Edge-Device Specializations
Agile-Quant (Shen et al., 2023) co-quantizes weights and activations, integrates token-pruning to reduce activation outliers, and implements SIMD-optimized INT4 kernels with TRIP matrix multiplication, yielding 1.9–2.6× speedup on commodity ARM CPUs/NPUs with minimal accuracy loss.
AdderNet and Non-multiplicative Networks
For AdderNets, which replace multiplications with ℓ1-norm distances, the commutative law of addition forces weights and activations to share a single scale factor (Nie et al., 2022). Quantization is enhanced by grouping, intra-/inter-group scaling, and range clamps to preserve the accuracy/efficiency trade-off.
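A per-tensor sketch of why the shared scale is needed: because the ℓ1 "adder" kernel subtracts activations and weights directly, both operands must live on the same quantization grid (illustrative NumPy code, not the grouped scheme of the paper).

```python
import numpy as np

def quantized_adder_layer(X, W, bits=8):
    """AdderNet-style output -sum_i |x_i - w_i| with one shared scale.

    Because |x - w| mixes the two operands directly, they must sit on the
    same quantization grid, so a single scale covers both X and W.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(X).max(), np.abs(W).max()) / qmax + 1e-12
    Xq = np.clip(np.round(X / scale), -qmax, qmax)
    Wq = np.clip(np.round(W / scale), -qmax, qmax)
    # integer-domain L1 distance between each input row and each weight row
    out = -np.abs(Xq[:, None, :] - Wq[None, :, :]).sum(axis=-1)
    return out * scale
```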
4. Sensitivity, Error Analysis, and Model-Specific Phenomena
Empirical analyses (e.g., (Yao et al., 2023, Nrusimha et al., 4 Apr 2024)) repeatedly establish that activation quantization, especially in LLMs, is more sensitive than weight quantization. Outlier channels—where single channels have extremely high magnitude—emerge naturally, especially in residual streams, making low-bit activation quantization particularly challenging. Methods such as QAT plus kurtosis regularization (Nrusimha et al., 4 Apr 2024) address outlier-driven accuracy drops by penalizing heavy-tailed activation distributions during training, preventing weight norm inflation during PTQ.
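As a sketch of this kind of regularizer (the target value and its placement in the loss are illustrative assumptions, not the paper's exact settings), a sample-kurtosis penalty on an activation tensor can be added to the training loss:

```python
import torch

def kurtosis_penalty(act: torch.Tensor, target: float = 1.8, eps: float = 1e-6):
    """Penalize heavy-tailed (outlier-prone) activation distributions by pulling
    the sample kurtosis toward a target (1.8, the kurtosis of a uniform
    distribution, is chosen here purely for illustration)."""
    a = act.flatten()
    centered = a - a.mean()
    var = centered.pow(2).mean() + eps
    kurt = centered.pow(4).mean() / var.pow(2)
    return (kurt - target) ** 2

# During QAT:  loss = task_loss + lam * kurtosis_penalty(hidden_states)
```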
First-order error expansion (Yao et al., 2023) models the output perturbation as

$$\Delta Y \approx X\,\Delta W + \Delta X\,W ,$$

and indicates that even small errors in either operand can produce large discrepancies when both are quantized in sequence, necessitating co-optimization.
5. Hardware Co-Design and Deployment Implications
Activation-weight co-quantization is tightly intertwined with hardware design:
- Integer MAC arithmetic: W4A8 and especially W4A4 quantization enable deployment on native INT4/INT8 tensor cores, yielding substantial bandwidth and area gains (Lee et al., 2023, Elangovan et al., 7 Feb 2025, Shen et al., 2023).
- Custom formats (NVFP4, dINT) support denormal and half-step codes to reduce error, with dedicated control logic to maintain throughput.
- Fine-grained mixed-precision accelerators (Hooper et al., 19 Apr 2025) select computation units dynamically per block, using sensitivity-metadata, with minimal run-time overhead.
- MobileQuant (Tan et al., 25 Aug 2024) enables integer-only on-device deployment, folding scaling parameters into weights and activations for NPU/DSP compatibility, achieving up to 50% energy savings over FP16 activation baselines.
6. Empirical Results, Limitations, and Future Directions
Extensive evaluations demonstrate that modern activation-weight co-quantization schemes can achieve sub-1% degradation in perplexity or top-1 accuracy across diverse settings:
| Method | Bitwidth | Model | Perplexity/Accuracy Drop |
|---|---|---|---|
| AQAS+dINT (Lee et al., 2023) | W4A8+dINT | LLaMA-7B | +0.04 PPL |
| BCQ (Elangovan et al., 7 Feb 2025) | W4A4 (≈4.5 b effective) | LLaMA-2 | <1% degradation |
| LoaQ (Lin et al., 8 Sep 2025) | W3A3 | LLaMA-2 | –0.9 PPL vs GPTQ |
| Agile-Quant (Shen et al., 2023) | W4A8 | LLaMA-7B | <0.5 PPL, 2–2.6× speedup |
| FGMP (Hooper et al., 19 Apr 2025) | NVFP4/FP8 | LLaMA-2 | <1% degradation, ≈14% energy savings |
| MobileQuant (Tan et al., 25 Aug 2024) | W4A8 | TinyLLaMA | –2.5 PPL, –50% power |
Common limitations include (a) increased calibration/optimization cost as quantization becomes more aggressive; (b) sensitivity to outliers and data representativeness; (c) marginal overheads in storage when codebook selectors are used; (d) possible instability when activation distributions are ill-conditioned (necessitating damping); and (e) performance bottlenecks specific to model architectures (e.g., diffusion UNets, AdderNets).
Future directions involve adaptive per-layer or per-block bitwidth search, combined hardware/software optimization, further investigation of non-uniform and learned codebook quantizers, and principled integration of co-quantization into model pretraining and architecture design.
7. Comparison to Weight-Only and Activation-Only Quantization
While weight-only quantization to 4–8 bits typically produces minimal accuracy loss across mainstream CNNs and transformers (Yao et al., 2023), activation-weight co-quantization to the same or lower bitwidths demands far more careful statistical and algorithmic treatment. Activation-only quantization at such low bitwidths can be catastrophic due to the prominence of activation outliers; conversely, weight-only quantization neglects cross-operand error propagation. Methods that treat both operands independently usually fail to reach the accuracy levels achieved by coordinated, output-level, or joint calibration methods (Lee et al., 2023, Lin et al., 8 Sep 2025).
In summary, activation-weight co-quantization is now recognized as a fundamental requirement for efficient, scalable deployment of large deep networks. Success in this domain requires both statistical insight into error propagation mechanisms and careful algorithmic/hardware co-design leveraging modern advances in PTQ, QAT, and sensitivity-aware precision assignment.