Activation-Weight Co-Quantization

Updated 16 November 2025
  • Activation-Weight Co-Quantization is the process of converting both weights and activations from full precision to low-bit representations, thereby reducing memory footprint and improving computational efficiency.
  • It employs diverse techniques such as post-training quantization, quantization-aware training, mixed-precision assignment, and codebook-based strategies to mitigate joint quantization errors.
  • Advanced methods integrate statistical error modeling, adaptive scaling, and hardware/software co-design to address activation outlier sensitivity and ensure minimal accuracy degradation.

Activation-Weight Co-Quantization refers to the simultaneous quantization of both weights and activations in artificial neural networks, mapping full-precision (typically floating-point) parameters and intermediate values to low-bitwidth representations (e.g., INT4, INT8). The primary motivation is to achieve significant reductions in memory footprint alongside improvements in inference efficiency and hardware compatibility, especially for resource-constrained deployment scenarios. Activation-weight co-quantization encompasses a spectrum of methods, including post-training quantization (PTQ), quantization-aware training (QAT), mixed-precision assignment, block-based codebooks, output-approximation adjustment, and hardware/software co-design. Effective co-quantization must address the intricate statistical interplay and error propagation between weights and activations, which can produce complex degradation patterns when both are quantized aggressively.

1. Theoretical Framework and Problem Formulation

The mathematical basis for activation-weight co-quantization considers a neural network layer parameterized by a weight tensor $W$ and taking an activation input $X$. Let $Q_w(\cdot)$ and $Q_a(\cdot)$ denote the respective quantization operators (often uniform or codebook-based), defined by a scale $s$ and zero-point $z$. Classical uniform quantization for a $b$-bit representation proceeds as

$$Q_b(x; s, z) = \left(\mathrm{clamp}\!\left(\mathrm{round}(x/s) + z,\; 0,\; 2^b - 1\right) - z\right) \cdot s$$

The standard forward pass with both operands quantized is $Y = Q_w(W) \cdot Q_a(X)$ or, for PTQ, $Y = Q_a(Q_w(W) \cdot X)$, depending on operator placement.
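
As a concrete illustration, here is a minimal NumPy sketch of this quantize-dequantize operator; the min/max calibration and the symmetric variant are common conventions assumed for this example rather than choices prescribed by any cited paper.

```python
import numpy as np

def uniform_quant_dequant(x, bits=8, symmetric=False):
    """Fake-quantize x to `bits` bits and map it back to floating point."""
    qmax = 2 ** bits - 1
    if symmetric:
        # Symmetric variant: zero-point pinned to the middle of the integer range.
        scale = np.abs(x).max() / (qmax / 2) + 1e-12
        zero_point = (qmax + 1) // 2
    else:
        # Asymmetric variant: fit the observed range [min, max].
        scale = (x.max() - x.min()) / qmax + 1e-12
        zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

# Toy W4A4 forward pass: quantize a weight matrix and an activation batch to 4 bits.
rng = np.random.default_rng(0)
W, X = rng.normal(size=(64, 128)), rng.normal(size=(128, 32))
Y_q = uniform_quant_dequant(W, bits=4, symmetric=True) @ uniform_quant_dequant(X, bits=4)
rel_err = np.linalg.norm(Y_q - W @ X) / np.linalg.norm(W @ X)
print(f"relative output error at W4A4: {rel_err:.3f}")
```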

Crucially, optimal quantization scales and codebooks must be chosen accounting for the joint error induced in $Y$ due to both sources of quantization noise. Linearizing the error, as in (Yao et al., 2023), reveals

$$Y = (W + \Delta W)(X + \Delta X) = WX + W\,\Delta X + \Delta W\, X + \Delta W\, \Delta X$$

The dominant errors, $W\,\Delta X$ and $\Delta W\, X$, are coupled: suboptimal quantization of either operand can amplify total output error in a nontrivial fashion.
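
A quick numerical check of this decomposition (illustrative random tensors and a toy 4-bit quantizer, not taken from any of the cited papers) shows that the cross terms dominate while the second-order term is comparatively small:

```python
import numpy as np

def fake_quant(t, bits=4):
    # Toy symmetric uniform quantizer used only for this illustration.
    s = np.abs(t).max() / (2 ** (bits - 1) - 1)
    return np.clip(np.round(t / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * s

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 128))
X = rng.normal(size=(128, 32))
dW = fake_quant(W) - W          # weight quantization noise
dX = fake_quant(X) - X          # activation quantization noise

# Frobenius norm of each term in Y = WX + W dX + dW X + dW dX
for name, term in {"W dX": W @ dX, "dW X": dW @ X, "dW dX": dW @ dX}.items():
    print(f"|{name}|_F = {np.linalg.norm(term):7.3f}")
print(f"|total error|_F = {np.linalg.norm((W + dW) @ (X + dX) - W @ X):7.3f}")
```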

Recent theory (e.g., (Long et al., 2020)) formalizes training with constrained discrete sets $S_w$ and $S_a$, leading to a non-convex optimization problem. The QUANT algorithm (projected "coarse" gradient descent with the STE) obtains guaranteed recurrence to global optima under mild proximity assumptions, a key theoretical insight for full-precision-to-quantized training regimes.

2. Quantization Algorithms and Co-Optimization Strategies

A range of practical methods address the challenges of balancing accuracy and efficiency during co-quantization, including:

Uniform and Non-Uniform Quantizers

Uniform quantization applies fixed quantization intervals, often with symmetric or asymmetric scaling applied per channel or per tensor. Extensions include non-uniform codebooks (e.g., Lloyd–Max or logarithmic quantizers) and adaptive codebooks derived per statistical cluster of weights or activations (Elangovan et al., 7 Feb 2025).
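
As an example of a non-uniform quantizer, the following sketch fits a Lloyd-Max-style codebook with plain 1-D k-means iterations; the quantile initialization and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def lloyd_max_codebook(values, bits=4, iters=30):
    """Fit a 2**bits-entry non-uniform codebook via Lloyd (1-D k-means) iterations."""
    v = values.ravel()
    codebook = np.quantile(v, np.linspace(0.0, 1.0, 2 ** bits))  # quantile initialization
    for _ in range(iters):
        assign = np.argmin(np.abs(v[:, None] - codebook[None, :]), axis=1)
        for k in range(codebook.size):
            if np.any(assign == k):
                codebook[k] = v[assign == k].mean()
    return np.sort(codebook)

def codebook_quantize(t, codebook):
    # Map every value to its nearest codebook entry.
    return codebook[np.argmin(np.abs(t[..., None] - codebook), axis=-1)]

rng = np.random.default_rng(2)
W = rng.normal(size=(256, 256))
cb = lloyd_max_codebook(W, bits=4)
mse_codebook = np.mean((codebook_quantize(W, cb) - W) ** 2)
print(f"4-bit Lloyd-Max codebook MSE: {mse_codebook:.5f}")
```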

Activation-Quantization-Aware Scaling

Activation–weight co-quantization incurs a mismatch if operand scales are optimized independently. Activation-Quantization-Aware Scaling (AQAS) (Lee et al., 2023) introduces per-channel scaling $s$ that jointly minimizes output MSE:

$$s^* = \arg\min_s \left\| Q_4\!\left(W \cdot \mathrm{diag}(s)\right) \cdot Q_8\!\left(\mathrm{diag}(s)^{-1} X\right) - WX \right\|_2^2$$

A grid search over candidate $s$ selects the balance yielding the lowest joint quantization loss on calibration data.
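
The search itself amounts to sweeping a small set of candidate scalings and keeping the one with the lowest output MSE on calibration data. The sketch below assumes a SmoothQuant-style parameterization of the per-channel scales, which is an illustrative choice and not necessarily the exact candidate set used in the AQAS paper:

```python
import numpy as np

def fake_quant(t, bits):
    # Symmetric uniform quantizer standing in for Q_4 / Q_8.
    s = np.abs(t).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.clip(np.round(t / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * s

def aqas_like_grid_search(W, X_calib, alphas=np.linspace(0.0, 1.0, 11)):
    """Pick per-channel scales s trading off 4-bit weight vs. 8-bit activation error,
    parameterized as s_c = max|X_c|^alpha / max|W_c|^(1 - alpha) over a small alpha grid."""
    ref = W @ X_calib
    w_absmax = np.abs(W).max(axis=0) + 1e-12        # per input channel of W
    x_absmax = np.abs(X_calib).max(axis=1) + 1e-12  # per channel of X
    best_s, best_err = None, np.inf
    for alpha in alphas:
        s = x_absmax ** alpha / w_absmax ** (1.0 - alpha)
        Y = fake_quant(W * s[None, :], 4) @ fake_quant(X_calib / s[:, None], 8)
        err = np.linalg.norm(Y - ref) ** 2
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

rng = np.random.default_rng(3)
W, X = rng.normal(size=(64, 128)), rng.normal(size=(128, 32))
X[5] *= 30.0                       # inject an outlier activation channel
s_star, joint_mse = aqas_like_grid_search(W, X)
print(f"best joint output MSE: {joint_mse:.3f}")
```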

Output-Level Alignment and Closed-Form Corrections

LoaQ (Lin et al., 8 Sep 2025) exploits a closed-form adjustment compensating for activation quantization drift by modifying the quantized weights as

$$Q^* = (I + H^{-1}C)\,W$$

where $H = \hat X^{\top} \hat X$ and $C = \hat X^{\top}(X - \hat X)$, with $X$ the full-precision and $\hat X$ the quantized activation. This procedure explicitly minimizes $\|\hat X Q - X W\|_F^2$, preventing the error amplification typical in naive layer-wise PTQ.
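
This correction translates directly into a few lines of NumPy. The sketch below follows the $\hat X Q \approx XW$ layout of the formula and adds a small damping term to $H$ for ill-conditioned calibration activations; the damping is an implementation-level safeguard, not part of the quoted closed form.

```python
import numpy as np

def loaq_like_adjustment(X, X_hat, W, damp=1e-4):
    """Closed-form weight correction minimizing ||X_hat @ Q - X @ W||_F^2.

    X     : full-precision calibration activations, shape (n_tokens, d_in)
    X_hat : quantized activations, same shape
    W     : full-precision weights, shape (d_in, d_out)
    """
    H = X_hat.T @ X_hat
    H += damp * np.trace(H) / H.shape[0] * np.eye(H.shape[0])   # damping for stability
    C = X_hat.T @ (X - X_hat)
    return W + np.linalg.solve(H, C @ W)   # equals (I + H^{-1} C) W without an explicit inverse

rng = np.random.default_rng(4)
X = rng.normal(size=(512, 64))
W = rng.normal(size=(64, 32))
X_hat = np.round(X * 7) / 7                # crude stand-in for activation quantization
Q = loaq_like_adjustment(X, X_hat, W)
print("output error before:", np.linalg.norm(X_hat @ W - X @ W))
print("output error after :", np.linalg.norm(X_hat @ Q - X @ W))
```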

Mixed Precision and Sensitivity-Guided Allocation

Fine-grained mixed-precision assignment (FGMP) (Hooper et al., 19 Apr 2025) leverages the Fisher information matrix to score each block of weights/activations by its impact on the loss, assigning high precision only to the most sensitive blocks and quantizing the rest to ultra-low precision (e.g., NVFP4). The per-block impact score is

$$I(v) = \sum_{i=1}^{N} g_i^2 \left(\Delta_{p_h \to p_l}(v_i)\right)^2$$

and blocks scoring above a threshold retain high precision, minimizing accuracy loss at a fixed memory/energy budget.
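
A simplified sketch of the scoring and allocation step follows; here the gradients are random stand-ins, NVFP4 is approximated by a generic per-block 4-bit quantizer, and the keep fraction is arbitrary, so this only illustrates the mechanics of the sensitivity metric.

```python
import numpy as np

def quantize_blocks(v, bits=4):
    # Per-block symmetric uniform quantization (scale taken from each block's absmax).
    s = np.abs(v).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1) + 1e-12
    return np.clip(np.round(v / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * s

def fgmp_like_assignment(values, grads, block=16, keep_frac=0.1):
    """Score blocks by sum_i g_i^2 * (high->low precision change)^2 and keep the
    top `keep_frac` most sensitive blocks in high precision."""
    v = values.reshape(-1, block)
    g = grads.reshape(-1, block)
    delta = quantize_blocks(v) - v                   # per-element precision-change error
    scores = np.sum(g ** 2 * delta ** 2, axis=1)     # per-block impact I(v)
    threshold = np.quantile(scores, 1.0 - keep_frac)
    return scores, scores >= threshold               # mask of blocks kept in high precision

rng = np.random.default_rng(5)
w = rng.normal(size=4096)
g = rng.normal(size=4096) * (1 + 10 * (rng.random(4096) < 0.02))  # a few sensitive regions
scores, keep_mask = fgmp_like_assignment(w, g)
print(f"{keep_mask.sum()} of {keep_mask.size} blocks retained in high precision")
```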

Post-Training and QAT Pipelines

PTQ-based methods gather calibration data to estimate quantization parameters (scales, codebooks), possibly applying closed-form weight corrections, range clipping, and codebook clustering (Elangovan et al., 7 Feb 2025, Yang et al., 2023, Lee et al., 2023). QAT approaches (e.g., (Long et al., 2020, Choi et al., 2018, Ardakani et al., 2022)) embed quantization and STE in the forward/backward passes, learning scale and codebook parameters jointly with weights.
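
The core QAT mechanism is compact: apply the quantizer in the forward pass and let gradients flow through the rounding as if it were the identity. A minimal PyTorch sketch of a learned-scale fake-quantizer and a W4A4 linear layer follows; this is a generic recipe, not the exact construction of any cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedFakeQuant(nn.Module):
    """Symmetric fake-quantizer with a learnable scale; rounding uses the STE."""
    def __init__(self, bits=4, init_scale=0.1):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1
        self.log_scale = nn.Parameter(torch.tensor(float(init_scale)).log())

    def forward(self, x):
        scale = self.log_scale.exp()
        x_s = x / scale
        # Straight-through estimator: round in the forward pass, identity in the backward pass.
        x_r = x_s + (torch.round(x_s) - x_s).detach()
        return torch.clamp(x_r, -self.qmax - 1, self.qmax) * scale

class QuantLinear(nn.Module):
    """Linear layer with both weights and activations fake-quantized (e.g., W4A4)."""
    def __init__(self, d_in, d_out, bits=4):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.wq = LearnedFakeQuant(bits)
        self.aq = LearnedFakeQuant(bits)

    def forward(self, x):
        return F.linear(self.aq(x), self.wq(self.linear.weight), self.linear.bias)

layer = QuantLinear(128, 64)
out = layer(torch.randn(8, 128))
out.sum().backward()   # gradients reach the weights and both quantizer scales via the STE
print(layer.linear.weight.grad is not None, layer.wq.log_scale.grad is not None)
```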

3. Specialized Techniques for Activation-Weight Co-Quantization

dINT Hybrid Format and Underflow Mitigation

4-bit quantization often suffers from underflow, as small-magnitude values round to zero. The dINT format (Lee et al., 2023) reserves two codes for “denormal half-steps,” dramatically reducing underflow error. A finite-state detector in hardware routes these codes and achieves over 2× area/power efficiency improvement in MAC units.
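
The underflow-mitigation idea can be illustrated with a toy encoder/decoder that reserves two of the sixteen 4-bit codes for ±s/2 half-step values; the specific code assignment below is hypothetical and is only meant to show why reserving half-step codes cuts the fraction of values flushed to zero.

```python
import numpy as np

QMAX = 7  # largest ordinary signed 4-bit magnitude; codes +7 and -8 are reserved below

def dint_like_encode(x, scale):
    codes = np.clip(np.round(x / scale), -(QMAX - 1), QMAX - 1).astype(np.int8)
    # Values a plain INT4 grid would flush to zero get a reserved half-step code instead.
    small = (codes == 0) & (np.abs(x) > 0.25 * scale)
    codes[small & (x > 0)] = QMAX          # reserved code for +scale/2
    codes[small & (x < 0)] = -QMAX - 1     # reserved code for -scale/2
    return codes

def dint_like_decode(codes, scale):
    values = codes.astype(np.float64) * scale
    values[codes == QMAX] = 0.5 * scale
    values[codes == -QMAX - 1] = -0.5 * scale
    return values

rng = np.random.default_rng(6)
x = rng.normal(scale=0.2, size=10_000)
s = np.abs(x).max() / QMAX
plain_int4 = np.clip(np.round(x / s), -8, 7) * s
dint_like = dint_like_decode(dint_like_encode(x, s), s)
print("fraction flushed to zero: plain INT4 =", np.mean(plain_int4 == 0),
      "| half-step codes =", np.mean(dint_like == 0))
```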

Channel- or Block-Specific Clustering

Block-Clustered Quantization (BCQ) (Elangovan et al., 7 Feb 2025) forms contiguous blocks (e.g., 8 values), clusters them, and learns dedicated codebooks for each cluster. This raises the effective bitwidth only slightly above the nominal rate (roughly 4.5 bits in the reported W4A4 setting), yet enables sub-1% loss in W4A4 LLM quantization without retraining.
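
A rough sketch of the block-clustering mechanics follows: contiguous blocks of 8 values are binned by their dynamic range (a stand-in for the paper's clustering) and each bin gets its own small Lloyd-fit codebook. Cluster counts, bit budgets, and the clustering criterion here are illustrative simplifications.

```python
import numpy as np

def fit_codebook(values, bits=4, iters=20):
    # 1-D Lloyd iterations (k-means) over the values assigned to one cluster.
    cb = np.quantile(values, np.linspace(0, 1, 2 ** bits))
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - cb[None, :]), axis=1)
        for k in range(cb.size):
            if np.any(assign == k):
                cb[k] = values[assign == k].mean()
    return cb

def bcq_like_quantize(W, block=8, n_clusters=16, bits=4):
    """Group contiguous blocks, bin them by absmax, and quantize each block with
    its bin's dedicated codebook."""
    blocks = W.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1)
    edges = np.quantile(absmax, np.linspace(0, 1, n_clusters + 1)[1:-1])
    cluster_id = np.digitize(absmax, edges)
    W_hat = np.empty_like(blocks)
    for c in range(n_clusters):
        members = cluster_id == c
        if not np.any(members):
            continue
        cb = fit_codebook(blocks[members].ravel(), bits=bits)
        W_hat[members] = cb[np.argmin(np.abs(blocks[members][..., None] - cb), axis=-1)]
    return W_hat.reshape(W.shape)

rng = np.random.default_rng(7)
W = rng.normal(size=(512, 512))
print("block-clustered 4-bit MSE:", np.mean((bcq_like_quantize(W) - W) ** 2))
```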

Edge-Device Specializations

Agile-Quant (Shen et al., 2023) co-quantizes weights and activations, integrates token pruning to reduce activation outliers, and implements SIMD-optimized INT4 kernels with TRIP matrix multiplication, yielding 1.9–2.6× speedup on commodity ARM CPUs/NPUs with minimal accuracy loss.

AdderNet and Non-multiplicative Networks

For AdderNets, which replace multiplications with $\ell_1$-norm-based additions, the commutative law of addition requires a shared scale factor for both operands (Nie et al., 2022). Quantization is enhanced by grouping, intra-/inter-group scaling, and range clamps to preserve accuracy and efficiency trade-offs.

4. Sensitivity, Error Analysis, and Model-Specific Phenomena

Empirical analyses (e.g., (Yao et al., 2023, Nrusimha et al., 4 Apr 2024)) repeatedly establish that activation quantization, especially in LLMs, is more sensitive than weight quantization. Outlier channels, in which individual channels take on extremely high magnitudes, emerge naturally, especially in residual streams, making low-bit activation quantization particularly challenging. Methods such as QAT with kurtosis regularization (Nrusimha et al., 4 Apr 2024) address outlier-driven accuracy drops by penalizing heavy-tailed activation distributions during training, preventing weight-norm inflation during PTQ.
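
The kurtosis penalty itself is essentially a one-liner added to the training loss; the sketch below uses the fourth standardized moment with a Gaussian target of 3, which is an illustrative choice of target rather than a value taken from the paper.

```python
import torch

def kurtosis_penalty(act: torch.Tensor, target: float = 3.0) -> torch.Tensor:
    """Penalty on the fourth standardized moment of an activation tensor.

    Heavy-tailed (outlier-prone) activations have large kurtosis; pulling the
    kurtosis toward the Gaussian value of 3 discourages the extreme channels
    that make low-bit activation quantization difficult.
    """
    a = act.float().flatten()
    mu, sigma = a.mean(), a.std() + 1e-6
    kurt = torch.mean(((a - mu) / sigma) ** 4)
    return (kurt - target) ** 2

# During QAT this would be added to the task loss, e.g.
#   loss = task_loss + lam * sum(kurtosis_penalty(h) for h in collected_activations)
x = torch.randn(4, 128)
x[:, 7] *= 25.0                     # an artificial outlier channel inflates kurtosis
print(kurtosis_penalty(x).item())
```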

First-order error expansion (Yao et al., 2023) models

$$\Delta Y \approx W\,\Delta A + \Delta W\, A$$

and indicates that even small errors in either operand can produce large discrepancies when both are quantized in sequence, necessitating co-optimization.

5. Hardware Co-Design and Deployment Implications

Activation-weight co-quantization is tightly intertwined with hardware design:

  • Integer MAC arithmetic: W4A8 and especially W4A4 quantization enables deployment on native INT4/INT8 tensor cores, yielding 2×–4× bandwidth and area gains (Lee et al., 2023, Elangovan et al., 7 Feb 2025, Shen et al., 2023).
  • Custom formats (NVFP4, dINT) support denormal and half-step codes to reduce error, with dedicated control logic to maintain throughput.
  • Fine-grained mixed-precision accelerators (Hooper et al., 19 Apr 2025) select computation units dynamically per block, using sensitivity-metadata, with minimal run-time overhead.
  • MobileQuant (Tan et al., 25 Aug 2024) enables integer-only on-device deployment, folding scaling parameters into weights and activations for NPU/DSP compatibility, achieving up to 50% energy savings over FP16 activation baselines.

6. Empirical Results, Limitations, and Future Directions

Extensive evaluations demonstrate that modern activation-weight co-quantization schemes can achieve sub-1% degradation in perplexity or top-1 accuracy across diverse settings:

| Method | Bitwidth | Model | Perplexity / Accuracy Impact |
|---|---|---|---|
| AQAS + dINT (Lee et al., 2023) | W4A8 + dINT | LLaMA-7B | +0.04 PPL |
| BCQ (Elangovan et al., 7 Feb 2025) | W4A4 (≈4.5 bits effective) | LLaMA-2 | <1% degradation |
| LoaQ (Lin et al., 8 Sep 2025) | W3A3 | LLaMA-2 | −0.9 PPL vs. GPTQ |
| Agile-Quant (Shen et al., 2023) | W4A8 | LLaMA-7B | <0.5 PPL, 2–2.6× speedup |
| FGMP (Hooper et al., 19 Apr 2025) | NVFP4/FP8 | LLaMA-2 | <1% degradation, 14% energy savings |
| MobileQuant (Tan et al., 25 Aug 2024) | W4A8 | TinyLLaMA | −2.5 PPL, −50% power |

Common limitations include (a) increased calibration/optimization cost as quantization becomes more aggressive; (b) sensitivity to outliers and data representativeness; (c) marginal overheads in storage when codebook selectors are used; (d) possible instability when activation distributions are ill-conditioned (necessitating damping); and (e) performance bottlenecks specific to model architectures (e.g., diffusion UNets, AdderNets).

Future directions involve adaptive per-layer or per-block bitwidth search, combined hardware/software optimization, further investigation of non-uniform and learned codebook quantizers, and principled integration of co-quantization into model pretraining and architecture design.

7. Comparison to Weight-Only and Activation-Only Quantization

While weight-only quantization to 4–8 bits typically produces minimal accuracy loss across mainstream CNNs and transformers (Yao et al., 2023), activation-weight co-quantization to the same or lower bitwidths demands far more careful statistical and algorithmic treatment. Activation-only quantization at such low bitwidths can be catastrophic due to the prominence of activation outliers; conversely, weight-only quantization neglects cross-operand error propagation. Methods that treat both operands independently usually fail to reach the accuracy levels achieved by coordinated, output-level, or joint calibration methods (Lee et al., 2023, Lin et al., 8 Sep 2025).

In summary, activation-weight co-quantization is now recognized as a fundamental requirement for efficient, scalable deployment of large deep networks. Success in this domain requires both statistical insight into error propagation mechanisms and careful algorithmic/hardware co-design leveraging modern advances in PTQ, QAT, and sensitivity-aware precision assignment.
