SignRoundV2: Low-Bit LLM Quantization

Updated 5 December 2025
  • SignRoundV2 is a post-training quantization framework that enables extremely low-bit (2–5 bits) quantization of large language models with nearly full-precision accuracy.
  • It introduces a gradient-informed DeltaLoss metric for layer sensitivity and a calibration-based scale search to optimize bit allocation and initialization.
  • SignRoundV2 demonstrates robust, production-grade performance—recovering 95–99% of full-precision accuracy—while reducing memory footprint and inference latency.

SignRoundV2 is a post-training quantization (PTQ) framework designed to enable extremely low-bit quantization (2–5 bits) of LLMs with minimal loss in accuracy relative to full-precision baselines. It introduces two principal innovations: a first-order, gradient-informed layerwise sensitivity metric (“DeltaLoss”) that directs bit allocation, and a lightweight, calibration-based search for optimal scale initialization prior to quantization. SignRoundV2 demonstrates production-grade performance—within ~1% of full-precision—in the 4–5 bit regime and delivers strong results even with 2-bit weight quantization, advancing the state of efficient LLM deployment (Cheng et al., 4 Dec 2025).

1. Mathematical Formulation

SignRoundV2 employs a symmetric, scale-driven quantizer for both weights and activations. For a full-precision tensor $x$, quantization bit-width $b$, and scale $s$, the quantize-dequantize operator is defined as:

$$\mathrm{qdq}(x; b, s) = s \cdot \mathrm{clip}\!\left( \left\lfloor \frac{x}{s} \right\rceil;\ n, m \right)$$

where $\lfloor \cdot \rceil$ is round-to-nearest; $\mathrm{clip}(\cdot; n, m)$ saturates to $[n, m]$ with $n = -2^{b-1}$, $m = 2^{b-1} - 1$; and $s = \frac{\max(x) - \min(x)}{2^{b-1}}$ (Equation 2).
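
A minimal PyTorch sketch of this operator and the range-based scale is shown below; the names and tensor shapes are illustrative, not the reference implementation:

```python
import torch

def default_scale(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Range-based scale, s = (max(x) - min(x)) / 2^(b-1)  (Eq. 2)."""
    return (x.max() - x.min()) / (2 ** (bits - 1))

def qdq(x: torch.Tensor, bits: int, scale: torch.Tensor) -> torch.Tensor:
    """Symmetric quantize-dequantize: round to the integer grid, clip, rescale."""
    n, m = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return scale * torch.clamp(torch.round(x / scale), n, m)

# Example: 4-bit quantization of a random weight tensor.
w = torch.randn(128, 128)
w_q = qdq(w, bits=4, scale=default_scale(w, bits=4))
print((w - w_q).abs().mean())   # magnitude of the quantization deviation Delta_Q(W)
```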

SignRoundV2 maintains compatibility with the trainable offsets ($v$, $\alpha$, $\beta$) introduced in SignRound V1 while focusing on initialization and bit allocation. For each weight tensor $W$, the quantization deviation is defined as:

$$\Delta_Q(W) = W_f - \mathrm{qdq}(W)$$

The sensitivity of each layer $\ell$ is estimated via a first-order Taylor expansion as:

$$\Delta L_\ell \approx \langle g_{aq},\, A_f - A_q \rangle + \langle g_{wq},\, W_f - W_q \rangle$$

where $g_{aq} = \partial L / \partial A_q$, $g_{wq} = \partial L / \partial W_q$; $A_f$, $A_q$ are the full-precision and quantized activations; and $W_f$, $W_q$ are the full-precision and quantized weights (Equation 3). In practice, activation errors dominate, leading to the DeltaLoss sensitivity metric:

$$S_\ell = \left\Vert\, \left| g_{aq} \circ (A_f - A_q) \right| \,\right\Vert_1$$

as in Equation 4.
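
The following is a minimal sketch of how this score could be computed, assuming the gradients $g_{aq}$ and the full-precision/quantized activations have already been captured per layer (for example via forward/backward hooks on a calibration batch); the helper names are illustrative rather than the repository's API:

```python
import torch

def delta_loss(g_aq: torch.Tensor, a_full: torch.Tensor, a_quant: torch.Tensor) -> float:
    """DeltaLoss of one layer (Eq. 4): L1 norm of the gradient-weighted activation error."""
    return (g_aq * (a_full - a_quant)).abs().sum().item()

def rank_layers(stats: dict) -> list:
    """Order layers from most to least quantization-sensitive.

    `stats` maps a layer name to (g_aq, A_f, A_q) tensors cached during a
    calibration forward/backward pass (hypothetical bookkeeping, not an API).
    """
    scores = {name: delta_loss(g, af, aq) for name, (g, af, aq) in stats.items()}
    return sorted(scores, key=scores.get, reverse=True)
```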

2. Algorithmic Components

2.1 Layer-wise Bit Allocation

SignRoundV2 formulates a 0–1 integer program to minimize aggregate layerwise DeltaLoss under an average bit-width constraint. Given $n$ layers, allowed bit-widths $B = \{b_1, \dots, b_K\}$, and a global average-bit target $T$, the optimization is:

$$\min \sum_{i=1}^{n} \sum_{b \in B} \Delta L_i(b) \cdot I_{i,b}$$

subject to

$$\sum_{b} I_{i,b} = 1 \quad \forall i; \qquad \sum_{i,b} b \cdot I_{i,b} \cdot P_i \leq T \cdot \sum_i P_i; \qquad I_{i,b} \in \{0, 1\}$$

where $P_i$ is the parameter count of layer $i$. This is solved efficiently with dynamic programming in $O(n \cdot |B| \cdot \text{budget})$ time; a sketch of this dynamic program follows the table below.

| Step | Description | Time Complexity |
| --- | --- | --- |
| DeltaLoss computation | Compute $\Delta L_i(b)$ for all $i, b$ | $\mathcal{O}(n \vert B \vert)$ |
| DP bit allocation | Optimize per-layer bits under the global constraint | $\mathcal{O}(n \vert B \vert \cdot \text{budget})$ |
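
Below is a compact, illustrative knapsack-style dynamic program for this assignment, assuming the per-layer DeltaLoss values $\Delta L_i(b)$ and parameter counts $P_i$ are already available; the budget bucketing `unit` is a sketch-level device to keep the DP table small and is not part of the published formulation:

```python
import math

def allocate_bits(losses, params, bit_choices, target_avg_bits, unit=1_000_000):
    """Pick one bit-width per layer, minimizing total DeltaLoss under an
    average-bit budget, via a knapsack-style dynamic program.

    losses[i][k] -- DeltaLoss of layer i quantized to bit_choices[k]
    params[i]    -- parameter count of layer i
    unit         -- granularity used to bucket the bit budget (sketch-level device)
    """
    n = len(params)
    cost = [[round(b * p / unit) for b in bit_choices] for p in params]
    budget = math.floor(target_avg_bits * sum(params) / unit)

    INF = float("inf")
    dp = [0.0] + [INF] * budget                  # dp[c]: best total loss at exact bit-cost c
    choice = [[-1] * (budget + 1) for _ in range(n)]

    for i in range(n):
        new_dp = [INF] * (budget + 1)
        for c in range(budget + 1):
            for k, w in enumerate(cost[i]):
                if c >= w and dp[c - w] + losses[i][k] < new_dp[c]:
                    new_dp[c] = dp[c - w] + losses[i][k]
                    choice[i][c] = k
        dp = new_dp

    c = min(range(budget + 1), key=lambda j: dp[j])
    assert dp[c] < INF, "no feasible assignment under this budget"

    bits = [0] * n                               # backtrack the chosen bit-widths
    for i in range(n - 1, -1, -1):
        k = choice[i][c]
        bits[i] = bit_choices[k]
        c -= cost[i][k]
    return bits

# e.g. allocate_bits(dl, per_layer_param_counts, [2, 4, 8], target_avg_bits=2.5)
```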

2.2 Scale Initialization

A calibration-driven, grid-based search finds an effective scale $s$ for each layer before any gradient-based tuning. The initialization objective is:

$$\mathrm{Loss}_{\mathrm{init}}(s) = \frac{1}{N} \sum_{i=1}^{N} \left\Vert \left( W_f[i] - \mathrm{qdq}(W_f[i]; b, s) \right) \circ \bar{A}^2 \right\Vert_2^2$$

(Equation 6), where $\bar{A}$ collects per-input-channel activation maxima from a small calibration set. Candidate scales $S$ are scanned, and the minimizer is chosen as $s_{\mathrm{init}}$. This is optionally refined by a trainable scalar $\alpha \in [0.5, 1.5]$ for fine-tuning.
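
A small sketch of this grid search is given below, reusing the `qdq` and `default_scale` helpers from Section 1; the candidate grid (multiplicative perturbations of the range-based scale) is an assumption made for illustration:

```python
import torch

def search_scale(w, bits, a_bar, candidates):
    """Grid search for the scale minimizing the activation-weighted weight
    reconstruction error of Eq. 6 (a_bar: per-input-channel activation maxima)."""
    best_s, best_loss = candidates[0], float("inf")
    for s in candidates:
        err = (w - qdq(w, bits, s)) * a_bar.pow(2)   # weight error weighted by activation energy
        loss = err.pow(2).mean().item()
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s

# Illustrative usage: scan multiplicative perturbations of the range-based scale.
w = torch.randn(256, 256)                            # (out_features, in_features)
a_bar = torch.rand(256)                              # activation maxima per input channel
grid = default_scale(w, bits=2) * torch.linspace(0.5, 1.5, steps=41)
s_init = search_scale(w, bits=2, a_bar=a_bar, candidates=grid)
```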

2.3 Tuning Procedure

Each transformer block undergoes 200 sign-gradient descent steps (or up to 500 in extended “Ours*” experiments) on a blockwise reconstruction MSE loss. The learning rate is $1/\text{steps}$, with batch size 8 and sequence length 2048, using mixed precision for improved computational throughput. To reduce outlier influence, the largest $k = 0.1\%$ of squared errors in each block are excluded.
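
A deliberately simplified, single-layer PyTorch sketch of one such tuning loop is shown below; the straight-through rounding trick and the single-layer setting are assumptions made to keep the example self-contained, whereas the actual framework tunes whole transformer blocks with additional machinery:

```python
import torch

def tune_linear(weight, calib_x, bits=4, steps=200, trim=0.001):
    """Sign-gradient tuning for one linear layer: learn per-weight rounding offsets
    v in [-0.5, 0.5] and a scale multiplier alpha in [0.5, 1.5] so the quantized
    layer reproduces full-precision outputs on calibration inputs."""
    s0 = (weight.max() - weight.min()) / (2 ** (bits - 1))   # range-based scale (Eq. 2)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    y_ref = calib_x @ weight.T                               # full-precision reference outputs

    v = torch.zeros_like(weight, requires_grad=True)         # trainable rounding offset
    alpha = torch.ones((), requires_grad=True)               # trainable scale multiplier
    lr = 1.0 / steps

    for _ in range(steps):
        s = alpha * s0
        q = weight / s + v
        # Straight-through estimator: rounded value forward, identity gradient backward.
        q = q + (torch.clamp(torch.round(q), lo, hi) - q).detach()
        y = calib_x @ (s * q).T

        err = ((y - y_ref) ** 2).flatten()
        k = max(1, int(trim * err.numel()))                  # exclude the largest 0.1% of errors
        loss = torch.sort(err).values[:-k].mean()

        loss.backward()
        with torch.no_grad():
            v -= lr * v.grad.sign()                          # sign-gradient descent step
            alpha -= lr * alpha.grad.sign()
            v.clamp_(-0.5, 0.5)
            alpha.clamp_(0.5, 1.5)
            v.grad = alpha.grad = None

    return v.detach(), alpha.detach()

# Illustrative usage on random data (real tuning uses calibration activations).
w = torch.randn(64, 64)
x = torch.randn(32, 64)
v, alpha = tune_linear(w, x, bits=2, steps=50)
```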

3. Evaluation: Models, Benchmarks, and Results

3.1 Models and Tasks

SignRoundV2 has been evaluated on LLaMA 2 (7B, 13B, 70B), LLaMA 3 (8B, 70B), and Qwen (2.5B, 8B, 32B) models. The benchmark suite includes ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, TruthfulQA, and WinoGrande.

3.2 Quantitative Performance

For 2-bit weights (W2A16):

| Method | LLaMA2-7B | LLaMA2-13B | LLaMA2-70B |
| --- | --- | --- | --- |
| GPTQ (W2A16) | 41.6% | 48.3% | 34.4% |
| AWQ (W2A16) | 34.7% | 36.0% | 35.5% |
| OmniQ (W2A16) | 47.0% | 53.6% | 54.9% |
| SignRound V1 | 54.5% | 60.7% | 67.7% |
| SignRound V2 | 57.9% | 61.9% | 68.4% |

At an average of 4–5 bits (MXFP4/8), SignRoundV2 achieves 95–99% recovery of full-precision accuracy, remaining within $\sim$1% of the full-precision baseline.

4. Ablation Studies and Comparative Analysis

4.1 Initialization

Initialization with scale pre-tuning yields gains of 5–10 percentage points in average accuracy over “without init” baselines: Qwen3-8B improves from 48–54% to 56–66%; LLaMA3.1-8B from 48–53% to 50–60%.

4.2 DeltaLoss-Only vs. Full Tuning

The DeltaLoss-only mode (no gradient-based SignRound tuning) already surpasses heuristic methods such as “head-8bit,” “tail-8bit,” and RTN. Full SignRoundV2 further improves accuracy by approximately 1–2 percentage points due to sign-gradient rounding.

4.3 Mixed vs. Uniform Precision

In pure 2-bit (W2A16) mode, uniform-precision SignRoundV2 nearly closes the gap with mixed-precision setups. For 4–5 bits, uniform allocation plus SignRoundV2 achieves $>95\%$ recovery; mixed precision yields only marginal gains.

5. Implementation and Practical Considerations

5.1 Pipeline and Resource Profile

SignRoundV2 is available as open source at https://github.com/intel/auto-round, offering routines for DeltaLoss computation, dynamic bit allocation, pre-tuning, and blockwise SignRound training. Default hyperparameters include 200 steps per block, batch size 8, and 128 calibration samples. The pipeline, for one LLM instance, typically proceeds as follows:

  1. Load full-precision (FP) model.
  2. Collect activation maxima $\bar{A}$ from 16–64 random calibration prompts.
  3. Compute DeltaLoss sensitivities ($\sim$5–10 min per 8B model).
  4. Solve for per-layer bit assignment.
  5. Pre-tune quantization scales by grid search on Eq. 6.
  6. Run per-block SignRound tuning ($\sim$2–3 h per 70B model).
  7. Export quantized model for inference.

5.2 Deployment and Performance

  • Weight memory: W2A16 mode yields an 8× reduction; MXFP4/8 mode achieves a 4–2× reduction (see the quick check after this list).
  • Inference speed: Quantized matmul kernels (ADSO, standard INT) provide near 2–4× latency improvements on GPU/CPU.
  • Resource constraints: 70B models fit in $\sim$40 GB VRAM (W2A16) with $\sim$10 GB peak overhead for DeltaLoss.
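
As a rough sanity check of the weight-memory figures (packed weights only; quantization metadata, activations, and the KV cache account for the larger VRAM numbers quoted above):

```python
# Packed weight sizes for a 70B-parameter model (rough figures, weights only).
params = 70e9
fp16_gb  = params * 16 / 8 / 1e9   # 140.0 GB baseline
w2_gb    = params * 2  / 8 / 1e9   #  17.5 GB -> 8x smaller (the W2A16 figure)
mxfp4_gb = params * 4  / 8 / 1e9   #  35.0 GB -> 4x smaller (FP8 would give 2x)
print(f"FP16 {fp16_gb:.1f} GB | W2 {w2_gb:.1f} GB | MXFP4 {mxfp4_gb:.1f} GB")
```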

A plausible implication is that SignRoundV2 enables practical PTQ of large-scale LLMs on commodity hardware by combining first-order sensitivity analysis with a computationally lightweight tuning pipeline.

6. Significance and Future Directions

SignRoundV2 establishes two major contributions for low-bit LLM quantization: (1) a scalable, gradient-informed sensitivity metric (DeltaLoss) that guides allocation, and (2) efficient pre-tuning that substantially improves scale initialization, both of which lead to robust quantization even in extremely low-bit regimes. Its methodology generalizes to a range of LLM architectures and could be further extended by investigation into alternative sensitivity metrics, broader calibration strategies, or adaptation to even lower resource targets (Cheng et al., 4 Dec 2025).
