SignRoundV2: Low-Bit LLM Quantization

Updated 5 December 2025
  • SignRoundV2 is a post-training quantization framework that enables extremely low-bit (2–5 bits) quantization of large language models with nearly full-precision accuracy.
  • It introduces a gradient-informed DeltaLoss metric for layer sensitivity and a calibration-based scale search to optimize bit allocation and initialization.
  • SignRoundV2 demonstrates robust, production-grade performance—recovering 95–99% of full-precision accuracy—while reducing memory footprint and inference latency.

SignRoundV2 is a post-training quantization (PTQ) framework designed to enable extremely low-bit quantization (2–5 bits) of LLMs with minimal loss in accuracy relative to full-precision baselines. It introduces two principal innovations: a first-order, gradient-informed layerwise sensitivity metric (“DeltaLoss”) that directs bit allocation, and a lightweight, calibration-based search for optimal scale initialization prior to quantization. SignRoundV2 demonstrates production-grade performance—within ~1% of full-precision—in the 4–5 bit regime and delivers strong results even with 2-bit weight quantization, advancing the state of efficient LLM deployment (Cheng et al., 4 Dec 2025).

1. Mathematical Formulation

SignRoundV2 employs a symmetric, scale-driven quantizer for both weights and activations. For a full-precision tensor $x$, quantization bit-width $b$, and scale $s$, the quantize-dequantize operator is defined as:

$$\mathrm{qdq}(x; b, s) = s \cdot \mathrm{clip}\!\left( \left\lfloor \frac{x}{s} \right\rceil;\ n, m \right)$$

where $\lfloor \cdot \rceil$ is round-to-nearest; $\mathrm{clip}(\cdot; n, m)$ saturates to $[n, m]$ with $n = -2^{b-1}$, $m = 2^{b-1} - 1$; and $s = \frac{\max(x) - \min(x)}{2^{b-1}}$ (Equation 2).
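
A minimal PyTorch sketch of this operator and the range-based scale is shown below; the names and tensor shapes are illustrative, not the reference implementation:

```python
import torch

def default_scale(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Range-based scale, s = (max(x) - min(x)) / 2^(b-1)  (Eq. 2)."""
    return (x.max() - x.min()) / (2 ** (bits - 1))

def qdq(x: torch.Tensor, bits: int, scale: torch.Tensor) -> torch.Tensor:
    """Symmetric quantize-dequantize: round to the integer grid, clip, rescale."""
    n, m = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return scale * torch.clamp(torch.round(x / scale), n, m)

# Example: 4-bit quantization of a random weight tensor.
w = torch.randn(128, 128)
w_q = qdq(w, bits=4, scale=default_scale(w, bits=4))
print((w - w_q).abs().mean())   # magnitude of the quantization deviation Delta_Q(W)
```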

SignRoundV2 maintains compatibility with the trainable offsets ($v$, $\alpha$, $\beta$) introduced in SignRound V1 while focusing on initialization and bit allocation. For each weight tensor $W$, the quantization deviation is defined as:

$$\Delta_Q(W) = W_f - \mathrm{qdq}(W)$$

The sensitivity of each layer $\ell$ is estimated via a first-order Taylor expansion as:

$$\Delta L_\ell \approx \langle g_{aq},\, A_f - A_q \rangle + \langle g_{wq},\, W_f - W_q \rangle$$

where $g_{aq} = \partial L / \partial A_q$, $g_{wq} = \partial L / \partial W_q$; $A_f$, $A_q$ are the full-precision and quantized activations; and $W_f$, $W_q$ are the full-precision and quantized weights (Equation 3). In practice, activation errors dominate, leading to the DeltaLoss sensitivity metric:

$$S_\ell = \left\Vert\, \left| g_{aq} \circ (A_f - A_q) \right| \,\right\Vert_1$$

as in Equation 4.
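
The following is a minimal sketch of how this score could be computed, assuming the gradients $g_{aq}$ and the full-precision/quantized activations have already been captured per layer (for example via forward/backward hooks on a calibration batch); the helper names are illustrative rather than the repository's API:

```python
import torch

def delta_loss(g_aq: torch.Tensor, a_full: torch.Tensor, a_quant: torch.Tensor) -> float:
    """DeltaLoss of one layer (Eq. 4): L1 norm of the gradient-weighted activation error."""
    return (g_aq * (a_full - a_quant)).abs().sum().item()

def rank_layers(stats: dict) -> list:
    """Order layers from most to least quantization-sensitive.

    `stats` maps a layer name to (g_aq, A_f, A_q) tensors cached during a
    calibration forward/backward pass (hypothetical bookkeeping, not an API).
    """
    scores = {name: delta_loss(g, af, aq) for name, (g, af, aq) in stats.items()}
    return sorted(scores, key=scores.get, reverse=True)
```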

2. Algorithmic Components

2.1 Layer-wise Bit Allocation

SignRoundV2 formulates a 0–1 integer program to minimize aggregate layerwise DeltaLoss under an average bit-width constraint. Given $n$ layers, allowed bit-widths $B = \{b_1, \dots, b_K\}$, and a global average-bit target $T$, the optimization is:

$$\min \sum_{i=1}^{n} \sum_{b \in B} \Delta L_i(b) \cdot I_{i,b}$$

subject to

$$\sum_{b} I_{i,b} = 1 \quad \forall i; \qquad \sum_{i,b} b \cdot I_{i,b} \cdot P_i \leq T \cdot \sum_i P_i; \qquad I_{i,b} \in \{0, 1\}$$

where $P_i$ is the parameter count of layer $i$. This is solved efficiently with dynamic programming in $O(n \cdot |B| \cdot \text{budget})$ time; a sketch of this dynamic program follows the table below.

| Step | Description | Time Complexity |
| --- | --- | --- |
| DeltaLoss computation | Compute $\Delta L_i(b)$ for all $i, b$ | $\mathcal{O}(n \vert B \vert)$ |
| DP bit allocation | Optimize per-layer bits under the global constraint | $\mathcal{O}(n \vert B \vert \cdot \text{budget})$ |
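
Below is a compact, illustrative knapsack-style dynamic program for this assignment, assuming the per-layer DeltaLoss values $\Delta L_i(b)$ and parameter counts $P_i$ are already available; the budget bucketing `unit` is a sketch-level device to keep the DP table small and is not part of the published formulation:

```python
import math

def allocate_bits(losses, params, bit_choices, target_avg_bits, unit=1_000_000):
    """Pick one bit-width per layer, minimizing total DeltaLoss under an
    average-bit budget, via a knapsack-style dynamic program.

    losses[i][k] -- DeltaLoss of layer i quantized to bit_choices[k]
    params[i]    -- parameter count of layer i
    unit         -- granularity used to bucket the bit budget (sketch-level device)
    """
    n = len(params)
    cost = [[round(b * p / unit) for b in bit_choices] for p in params]
    budget = math.floor(target_avg_bits * sum(params) / unit)

    INF = float("inf")
    dp = [0.0] + [INF] * budget                  # dp[c]: best total loss at exact bit-cost c
    choice = [[-1] * (budget + 1) for _ in range(n)]

    for i in range(n):
        new_dp = [INF] * (budget + 1)
        for c in range(budget + 1):
            for k, w in enumerate(cost[i]):
                if c >= w and dp[c - w] + losses[i][k] < new_dp[c]:
                    new_dp[c] = dp[c - w] + losses[i][k]
                    choice[i][c] = k
        dp = new_dp

    c = min(range(budget + 1), key=lambda j: dp[j])
    assert dp[c] < INF, "no feasible assignment under this budget"

    bits = [0] * n                               # backtrack the chosen bit-widths
    for i in range(n - 1, -1, -1):
        k = choice[i][c]
        bits[i] = bit_choices[k]
        c -= cost[i][k]
    return bits

# e.g. allocate_bits(dl, per_layer_param_counts, [2, 4, 8], target_avg_bits=2.5)
```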

2.2 Scale Initialization

A calibration-driven, grid-based search finds an effective scale $s$ for each layer before any gradient-based tuning. The initialization objective is:

$$\mathrm{Loss}_{\mathrm{init}}(s) = \frac{1}{N} \sum_{i=1}^{N} \left\Vert \left( W_f[i] - \mathrm{qdq}(W_f[i]; b, s) \right) \circ \bar{A}^2 \right\Vert_2^2$$

(Equation 6), where $\bar{A}$ collects per-input-channel activation maxima from a small calibration set. Candidate scales $S$ are scanned, and the minimizer is chosen as $s_{\mathrm{init}}$. This is optionally refined by a trainable scalar $\alpha \in [0.5, 1.5]$ for fine-tuning.
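
A small sketch of this grid search is given below, reusing the `qdq` and `default_scale` helpers from Section 1; the candidate grid (multiplicative perturbations of the range-based scale) is an assumption made for illustration:

```python
import torch

def search_scale(w, bits, a_bar, candidates):
    """Grid search for the scale minimizing the activation-weighted weight
    reconstruction error of Eq. 6 (a_bar: per-input-channel activation maxima)."""
    best_s, best_loss = candidates[0], float("inf")
    for s in candidates:
        err = (w - qdq(w, bits, s)) * a_bar.pow(2)   # weight error weighted by activation energy
        loss = err.pow(2).mean().item()
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s

# Illustrative usage: scan multiplicative perturbations of the range-based scale.
w = torch.randn(256, 256)                            # (out_features, in_features)
a_bar = torch.rand(256)                              # activation maxima per input channel
grid = default_scale(w, bits=2) * torch.linspace(0.5, 1.5, steps=41)
s_init = search_scale(w, bits=2, a_bar=a_bar, candidates=grid)
```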

2.3 Tuning Procedure

Each transformer block undergoes 200 sign-gradient descent steps (or up to 500 in extended “Ours*” experiments) on a blockwise reconstruction MSE loss. The learning rate is $1/\text{steps}$, with batch size 8 and sequence length 2048, using mixed precision for improved computational throughput. To reduce outlier influence, the largest $k = 0.1\%$ of squared errors in each block are excluded.
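
A deliberately simplified, single-layer PyTorch sketch of one such tuning loop is shown below; the straight-through rounding trick and the single-layer setting are assumptions made to keep the example self-contained, whereas the actual framework tunes whole transformer blocks with additional machinery:

```python
import torch

def tune_linear(weight, calib_x, bits=4, steps=200, trim=0.001):
    """Sign-gradient tuning for one linear layer: learn per-weight rounding offsets
    v in [-0.5, 0.5] and a scale multiplier alpha in [0.5, 1.5] so the quantized
    layer reproduces full-precision outputs on calibration inputs."""
    s0 = (weight.max() - weight.min()) / (2 ** (bits - 1))   # range-based scale (Eq. 2)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    y_ref = calib_x @ weight.T                               # full-precision reference outputs

    v = torch.zeros_like(weight, requires_grad=True)         # trainable rounding offset
    alpha = torch.ones((), requires_grad=True)               # trainable scale multiplier
    lr = 1.0 / steps

    for _ in range(steps):
        s = alpha * s0
        q = weight / s + v
        # Straight-through estimator: rounded value forward, identity gradient backward.
        q = q + (torch.clamp(torch.round(q), lo, hi) - q).detach()
        y = calib_x @ (s * q).T

        err = ((y - y_ref) ** 2).flatten()
        k = max(1, int(trim * err.numel()))                  # exclude the largest 0.1% of errors
        loss = torch.sort(err).values[:-k].mean()

        loss.backward()
        with torch.no_grad():
            v -= lr * v.grad.sign()                          # sign-gradient descent step
            alpha -= lr * alpha.grad.sign()
            v.clamp_(-0.5, 0.5)
            alpha.clamp_(0.5, 1.5)
            v.grad = alpha.grad = None

    return v.detach(), alpha.detach()

# Illustrative usage on random data (real tuning uses calibration activations).
w = torch.randn(64, 64)
x = torch.randn(32, 64)
v, alpha = tune_linear(w, x, bits=2, steps=50)
```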

3. Evaluation: Models, Benchmarks, and Results

3.1 Models and Tasks

SignRoundV2 has been evaluated on LLaMA 2 (7B, 13B, 70B), LLaMA 3 (8B, 70B), and Qwen (2.5B, 8B, 32B) models. The benchmark suite includes ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, TruthfulQA, and WinoGrande.

3.2 Quantitative Performance

For 2-bit weights (W2A16):

| Method | LLaMA2-7B | LLaMA2-13B | LLaMA2-70B |
| --- | --- | --- | --- |
| GPTQ (W2A16) | 41.6% | 48.3% | 34.4% |
| AWQ (W2A16) | 34.7% | 36.0% | 35.5% |
| OmniQ (W2A16) | 47.0% | 53.6% | 54.9% |
| SignRound V1 | 54.5% | 60.7% | 67.7% |
| SignRound V2 | 57.9% | 61.9% | 68.4% |

At an average of 4–5 bits (MXFP4/8), SignRoundV2 achieves 95–99% recovery of full-precision accuracy, remaining within $\sim$1% of the full-precision baseline.

4. Ablation Studies and Comparative Analysis

4.1 Initialization

Initialization with scale pre-tuning yields gains of 5–10 percentage points in average accuracy over “without init” baselines: Qwen3-8B improves from 48–54% to 56–66%; LLaMA3.1-8B from 48–53% to 50–60%.

4.2 DeltaLoss-Only vs. Full Tuning

The DeltaLoss-only mode (no gradient-based SignRound tuning) already surpasses heuristic methods such as “head-8bit,” “tail-8bit,” and RTN. Full SignRoundV2 further improves accuracy by approximately 1–2 percentage points due to sign-gradient rounding.

4.3 Mixed vs. Uniform Precision

In pure 2-bit (W2A16) mode, uniform-precision SignRoundV2 nearly closes the gap with mixed-precision setups. For 4–5 bits, uniform allocation plus SignRoundV2 achieves $>95\%$ recovery; mixed precision yields only marginal gains.

5. Implementation and Practical Considerations

5.1 Pipeline and Resource Profile

SignRoundV2 is available as open source at https://github.com/intel/auto-round, offering routines for DeltaLoss computation, dynamic bit allocation, pre-tuning, and blockwise SignRound training. Default hyperparameters include 200 steps per block, batch size 8, and 128 calibration samples. The pipeline, for one LLM instance, typically proceeds as follows:

  1. Load full-precision (FP) model.
  2. Collect activation maxima $\bar{A}$ from 16–64 random calibration prompts.
  3. Compute DeltaLoss sensitivities ($\sim$5–10 min per 8B model).
  4. Solve for per-layer bit assignment.
  5. Pre-tune quantization scales by grid search on Eq. 6.
  6. Run per-block SignRound tuning ($\sim$2–3 h per 70B model).
  7. Export quantized model for inference.

5.2 Deployment and Performance

  • Weight memory: W2A16 mode yields an 8× reduction; MXFP4/8 mode achieves a 4–2× reduction (see the quick check after this list).
  • Inference speed: Quantized matmul kernels (ADSO, standard INT) provide near 2–4× latency improvements on GPU/CPU.
  • Resource constraints: 70B models fit in $\sim$40 GB VRAM (W2A16) with $\sim$10 GB peak overhead for DeltaLoss.
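
As a rough sanity check of the weight-memory figures (packed weights only; quantization metadata, activations, and the KV cache account for the larger VRAM numbers quoted above):

```python
# Packed weight sizes for a 70B-parameter model (rough figures, weights only).
params = 70e9
fp16_gb  = params * 16 / 8 / 1e9   # 140.0 GB baseline
w2_gb    = params * 2  / 8 / 1e9   #  17.5 GB -> 8x smaller (the W2A16 figure)
mxfp4_gb = params * 4  / 8 / 1e9   #  35.0 GB -> 4x smaller (FP8 would give 2x)
print(f"FP16 {fp16_gb:.1f} GB | W2 {w2_gb:.1f} GB | MXFP4 {mxfp4_gb:.1f} GB")
```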

A plausible implication is that SignRoundV2 enables practical PTQ of large-scale LLMs on commodity hardware by combining first-order sensitivity analysis with a computationally lightweight tuning pipeline.

6. Significance and Future Directions

SignRoundV2 establishes two major contributions for low-bit LLM quantization: (1) a scalable, gradient-informed sensitivity metric (DeltaLoss) that guides allocation, and (2) efficient pre-tuning that substantially improves scale initialization, both of which lead to robust quantization even in extremely low-bit regimes. Its methodology generalizes to a range of LLM architectures and could be further extended by investigation into alternative sensitivity metrics, broader calibration strategies, or adaptation to even lower resource targets (Cheng et al., 4 Dec 2025).
