SignRoundV2: Low-Bit LLM Quantization
- SignRoundV2 is a post-training quantization framework that enables extremely low-bit (2–5 bits) quantization of large language models with nearly full-precision accuracy.
- It introduces a gradient-informed DeltaLoss metric for layer sensitivity and a calibration-based scale search to optimize bit allocation and initialization.
- SignRoundV2 demonstrates robust, production-grade performance—recovering 95–99% of full-precision accuracy—while reducing memory footprint and inference latency.
SignRoundV2 is a post-training quantization (PTQ) framework designed to enable extremely low-bit quantization (2–5 bits) of LLMs with minimal loss in accuracy relative to full-precision baselines. It introduces two principal innovations: a first-order, gradient-informed layerwise sensitivity metric (“DeltaLoss”) that directs bit allocation, and a lightweight, calibration-based search for optimal scale initialization prior to quantization. SignRoundV2 demonstrates production-grade performance—within ~1% of full-precision—in the 4–5 bit regime and delivers strong results even with 2-bit weight quantization, advancing the state of efficient LLM deployment (Cheng et al., 4 Dec 2025).
1. Mathematical Formulation
SignRoundV2 employs a symmetric, scale-driven quantizer for both weights and activations. For a full-precision tensor $W$, quantization bit-width $b$, and scale $s$, the quantize-dequantize operator is defined as:

$$\tilde{W} \;=\; s \cdot \mathrm{clip}\!\left(\left\lfloor \tfrac{W}{s} \right\rceil,\; n,\; m\right),$$

where $\lfloor\cdot\rceil$ is round-to-nearest; $\mathrm{clip}(\cdot, n, m)$ saturates to $[n, m]$ with $n = -2^{b-1}$, $m = 2^{b-1}-1$; and $s > 0$ (Equation 2).
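This operator fits in a few lines of PyTorch; the following is a minimal sketch (tensor shapes and the max-based scale initialization are illustrative, not prescribed by the paper):

```python
import torch

def quant_dequant(w: torch.Tensor, scale: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric quantize-dequantize (Eq. 2): round-to-nearest, clip to the signed range, rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return scale * torch.clamp(torch.round(w / scale), qmin, qmax)

# Example: 4-bit quantization with a per-output-channel max-based scale (illustrative init)
w = torch.randn(4096, 4096)
scale = w.abs().amax(dim=1, keepdim=True) / (2 ** 3 - 1)
w_q = quant_dequant(w, scale, bits=4)
print(f"mean abs deviation: {(w_q - w).abs().mean():.4f}")
```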
SignRoundV2 maintains compatibility with the trainable rounding offsets ($V$) introduced in SignRound V1 while focusing on initialization and bit allocation. For each weight tensor $W_\ell$, the quantization deviation is defined:

$$\Delta W_\ell \;=\; \tilde{W}_\ell - W_\ell.$$
The sensitivity of each layer $\ell$ is estimated via a first-order Taylor expansion of the task loss $\mathcal{L}$:

$$\Delta\mathcal{L} \;\approx\; \sum_\ell \left\langle \nabla_{X_\ell}\mathcal{L},\; \hat{X}_\ell - X_\ell \right\rangle \;+\; \left\langle \nabla_{W_\ell}\mathcal{L},\; \hat{W}_\ell - W_\ell \right\rangle,$$

where $X_\ell$, $\hat{X}_\ell$ are full-precision and quantized activations and $W_\ell$, $\hat{W}_\ell$ are full-precision and quantized weights (Equation 3). In practice, activation errors dominate, leading to the DeltaLoss sensitivity metric:

$$\mathrm{DeltaLoss}_\ell \;=\; \mathbb{E}\!\left[\,\left|\left\langle \nabla_{X_\ell}\mathcal{L},\; \hat{X}_\ell - X_\ell \right\rangle\right|\,\right],$$

as in Equation 4.
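For a single linear layer, this term can be approximated from a calibration batch by pairing the gradient of the loss with the output perturbation induced by weight quantization. The sketch below is schematic and assumes the gradient with respect to the layer output has already been captured (e.g., via a backward hook); helper names and the reduction are not taken from the paper's code:

```python
import torch
import torch.nn as nn

def delta_loss_for_layer(layer: nn.Linear, x: torch.Tensor, grad_out: torch.Tensor,
                         scale: torch.Tensor, bits: int) -> float:
    """First-order sensitivity of the task loss to quantizing this layer's weights.

    x:        calibration activations feeding the layer, shape (tokens, in_features)
    grad_out: gradient of the task loss w.r.t. the layer output, shape (tokens, out_features)
    """
    qmax = 2 ** (bits - 1) - 1
    w = layer.weight
    w_q = scale * torch.clamp(torch.round(w / scale), -qmax - 1, qmax)  # quant-dequant (Eq. 2)
    out_fp = x @ w.t()
    out_q = x @ w_q.t()
    # |<dL/dX_out, X_out_quant - X_out>| averaged over calibration tokens (Eq. 4)
    return (grad_out * (out_q - out_fp)).sum(dim=-1).abs().mean().item()
```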
2. Algorithmic Components
2.1 Layer-wise Bit Allocation
SignRoundV2 formulates a 0–1 integer program to minimize aggregate layerwise DeltaLoss under an average bit-width constraint. Given $L$ layers, allowed bit-widths $b_k \in \mathcal{B}$, and a global average-bit target $\bar{b}$, the optimization is:

$$\min_{x_{\ell k} \in \{0,1\}} \;\sum_{\ell=1}^{L} \sum_{k} x_{\ell k}\,\mathrm{DeltaLoss}_\ell(b_k)$$

subject to

$$\sum_{k} x_{\ell k} = 1 \;\;\forall \ell, \qquad \frac{\sum_{\ell, k} x_{\ell k}\, n_\ell\, b_k}{\sum_\ell n_\ell} \;\le\; \bar{b},$$

where $n_\ell$ is the parameter count in layer $\ell$ and $x_{\ell k}$ selects bit-width $b_k$ for that layer. The problem is solved efficiently with dynamic programming; a simplified sketch appears after the summary table below.
| Step | Description |
|---|---|
| DeltaLoss computation | Evaluate $\mathrm{DeltaLoss}_\ell(b_k)$ for all layers $\ell$ and candidate bit-widths $b_k$ |
| DP bit allocation | Optimize per-layer bits under the global average-bit constraint |
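Because each layer picks exactly one bit-width, the program reduces to a knapsack-style dynamic program over layers and a discretized bit budget. A simplified sketch follows (coarse bit units, helper names, and the feasible-budget assumption are simplifications; the paper's exact DP formulation may differ):

```python
def allocate_bits(delta_loss, n_params, bit_choices, avg_bits):
    """Knapsack-style DP: pick one bit-width per layer, minimizing total DeltaLoss
    under an average-bit budget.

    delta_loss[l][k] : sensitivity of layer l when quantized to bit_choices[k]
    n_params[l]      : parameter count of layer l
    """
    L = len(delta_loss)
    unit = min(n_params)                               # express budgets in coarse "bit units"
    cost = [[round(n_params[l] * b / unit) for b in bit_choices] for l in range(L)]
    budget = int(sum(n_params) * avg_bits / unit)

    INF = float("inf")
    dp = [0.0] + [INF] * budget                        # dp[c]: best loss using exactly c units
    choice = [[-1] * (budget + 1) for _ in range(L)]
    for l in range(L):
        new_dp = [INF] * (budget + 1)
        for c in range(budget + 1):
            for k, w in enumerate(cost[l]):
                if c >= w and dp[c - w] + delta_loss[l][k] < new_dp[c]:
                    new_dp[c] = dp[c - w] + delta_loss[l][k]
                    choice[l][c] = k
        dp = new_dp
    c = min(range(budget + 1), key=lambda cc: dp[cc])  # best reachable budget (assumes feasibility)
    bits = [0] * L
    for l in reversed(range(L)):
        k = choice[l][c]
        bits[l] = bit_choices[k]
        c -= cost[l][k]
    return bits

# Example: three layers, 2/4/8-bit choices, average budget of 4 bits
# bits = allocate_bits(delta_loss, n_params=[1e6, 4e6, 4e6], bit_choices=[2, 4, 8], avg_bits=4)
```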
2.2 Lightweight Pre-Tuning Scale Search
A calibration-driven, grid-based search finds an effective scale $s^{\star}$ for each layer before any gradient-based tuning. The initialization objective is:

$$s^{\star} \;=\; \arg\min_{s \in \mathcal{S}} \;\big\| \big(\tilde{W}(s) - W\big)\,\mathrm{diag}(a) \big\|_F^2$$

(Equation 6), where $a$ collects per-input-channel activation maxima from a small calibration set and $\tilde{W}(s)$ denotes the quantize-dequantized weights at scale $s$. Candidate scales on the grid $\mathcal{S}$ are scanned, and the minimizer is chosen as $s^{\star}$. This is optionally refined by a trainable scalar during fine-tuning.
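A hedged sketch of this pre-tuning search, assuming the Equation 6 objective weights the weight quantization error by per-input-channel activation maxima (`act_max`, the grid bounds, and the max-based base scale are illustrative):

```python
import torch

def search_scale(w: torch.Tensor, act_max: torch.Tensor, bits: int,
                 lo: float = 0.5, hi: float = 1.0, step: float = 0.01) -> torch.Tensor:
    """Scan shrink factors on a max-based scale; keep the one minimizing
    activation-weighted quantization error."""
    qmax = 2 ** (bits - 1) - 1
    base = w.abs().amax(dim=1, keepdim=True) / qmax           # per-output-channel max scale
    best_scale, best_err = base, float("inf")
    for r in torch.arange(lo, hi + 1e-9, step):
        s = base * r
        w_q = s * torch.clamp(torch.round(w / s), -qmax - 1, qmax)
        err = (((w_q - w) * act_max).pow(2)).sum().item()     # error weighted by activation maxima
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale

# act_max: per-input-channel maxima gathered from a few calibration batches
# layer_scale = search_scale(layer.weight, act_max, bits=4)
```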
2.3 Tuning Procedure
Each transformer block undergoes 200 sign-gradient descent steps (or up to 500 in extended “Ours*” experiments) on a blockwise reconstruction MSE loss, with batch size 8 and sequence length 2048, using mixed precision for improved computational throughput. To reduce outlier influence, a small fraction of the largest squared errors in each block is excluded from the loss.
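A schematic version of this blockwise tuning loop, with sign-gradient updates and optional trimming of the largest squared errors; the block interfaces, parameterization, and learning rate below are simplifications rather than the paper's exact setup:

```python
import torch

def tune_block(block_q, block_fp, calib_inputs, params, steps=200, lr=1e-3, trim_frac=0.0):
    """block_q:  quantized block whose trainable rounding offsets/scales are in `params`
       block_fp: frozen full-precision reference block (both assumed to return a tensor)
       lr:       illustrative value; not the paper's setting"""
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = block_fp(x)
        err = (block_q(x) - target).pow(2).flatten()
        if trim_frac > 0:                              # drop the largest squared errors (outliers)
            keep = int(err.numel() * (1 - trim_frac))
            err, _ = torch.topk(err, keep, largest=False)
        loss = err.mean()
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * torch.sign(g)                # sign-gradient descent step
```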
3. Evaluation: Models, Benchmarks, and Results
3.1 Models and Tasks
SignRoundV2 has been evaluated on LLaMA 2 (7B, 13B, 70B), LLaMA 3 (8B, 70B), and Qwen (2.5B, 8B, 32B) models. The benchmark suite includes ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, TruthfulQA, and WinoGrande.
3.2 Quantitative Performance
For 2-bit weights (W2A16):
| Method | LLaMA2-7B | LLaMA2-13B | LLaMA2-70B |
|---|---|---|---|
| GPTQ (W2A16) | 41.6% | 48.3% | 34.4% |
| AWQ (W2A16) | 34.7% | 36.0% | 35.5% |
| OmniQ (W2A16) | 47.0% | 53.6% | 54.9% |
| SignRound V1 | 54.5% | 60.7% | 67.7% |
| SignRound V2 | 57.9% | 61.9% | 68.4% |
At 4–5 bits average (MXFP4/8), SignRoundV2 achieves 95–99% recovery of full-precision accuracy, with little variance across models and benchmarks.
4. Ablation Studies and Comparative Analysis
4.1 Initialization
Initialization with scale pre-tuning yields gains of 5–10 percentage points in average accuracy over “without init” baselines: Qwen3-8B improves from 48–54% to 56–66%; LLaMA3.1-8B from 48–53% to 50–60%.
4.2 DeltaLoss-Only vs. Full Tuning
The DeltaLoss-only mode (no gradient-based SignRound tuning) already surpasses heuristic methods such as “head-8bit,” “tail-8bit,” and RTN. Full SignRoundV2 further improves accuracy by approximately 1–2 percentage points due to sign-gradient rounding.
4.3 Mixed vs. Uniform Precision
In pure 2-bit (W2A16) mode, uniform-precision SignRoundV2 nearly closes the gap with mixed-precision setups. At 4–5 bits, uniform allocation plus SignRoundV2 already achieves near-full recovery, so mixed precision yields only marginal further gains.
5. Implementation and Practical Considerations
5.1 Pipeline and Resource Profile
SignRoundV2 is available as open source at https://github.com/intel/auto-round, offering routines for DeltaLoss computation, dynamic bit allocation, pre-tuning, and blockwise SignRound training. Default hyperparameters include 200 steps per block, batch size 8, and 128 calibration samples. The pipeline, for one LLM instance, typically proceeds as follows:
- Load full-precision (FP) model.
- Collect activation maxima from 16–64 random calibration prompts.
- Compute DeltaLoss sensitivities (5–10 min per 8B model).
- Solve for per-layer bit assignment.
- Pre-tune quantization scales by grid search on Eq. 6.
- Run per-block SignRound tuning (2–3 h per 70B model).
- Export quantized model for inference.
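A minimal usage sketch built on the open-source auto-round package; argument names follow its README conventions and may differ across versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 200 tuning steps per block and 128 calibration samples mirror the defaults noted above
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=200, nsamples=128)
autoround.quantize()
autoround.save_quantized("./llama2-7b-w4", format="auto_round")
```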
5.2 Deployment and Performance
- Weight memory: W2A16 mode yields an 8× reduction relative to FP16; MXFP4/8 modes achieve 4× and 2× reductions, respectively (a back-of-the-envelope check follows this list).
- Inference speed: Quantized matmul kernels (ADSO, standard INT) provide roughly 2–4× latency improvements on GPU/CPU.
- Resource constraints: 70B models fit within a single-GPU memory budget in W2A16 mode, with modest additional peak memory required for DeltaLoss computation.
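A back-of-the-envelope check of the weight-memory claim (illustrative arithmetic, ignoring quantization scales and any layers kept at higher precision):

```python
params = 70e9                        # approximate weight count of a 70B model
fp16_gb = params * 16 / 8 / 1e9      # ~140 GB of weights at 16 bits
w2_gb = params * 2 / 8 / 1e9         # ~17.5 GB of weights at 2 bits
print(f"FP16: {fp16_gb:.0f} GB, W2A16: {w2_gb:.1f} GB, reduction: {fp16_gb / w2_gb:.0f}x")
```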
A plausible implication is that SignRoundV2 enables practical PTQ of large-scale LLMs on commodity hardware by combining first-order sensitivity analysis with a computationally lightweight tuning pipeline.
6. Significance and Future Directions
SignRoundV2 establishes two major contributions for low-bit LLM quantization: (1) a scalable, gradient-informed sensitivity metric (DeltaLoss) that guides allocation, and (2) efficient pre-tuning that substantially improves scale initialization, both of which lead to robust quantization even in extremely low-bit regimes. Its methodology generalizes to a range of LLM architectures and could be further extended by investigation into alternative sensitivity metrics, broader calibration strategies, or adaptation to even lower resource targets (Cheng et al., 4 Dec 2025).