BLC: Best Low-Rank Approximation under Clipping
- The paper introduces BLC as a novel method that integrates low-rank extraction with outlier clipping to minimize quantization error.
- It employs an alternating iterative approach that refines rank approximations and clipping thresholds, ensuring rapid convergence and minimal overhead.
- Empirical validations show significant improvements in perplexity and quantization fidelity for 2–4 bit post-training quantization of large language models.
Best Low-rank Approximation under Clipping (BLC) is a method central to the FLRQ (Flexible Low-Rank Quantization) framework, designed for efficient and accurate post-training quantization (PTQ) of LLMs. BLC focuses on minimizing quantization error by alternating scalable low-rank extraction with outlier clipping and quantization of the residual, thereby providing robust, near-monotone error decrease and high quantization quality at minimal computational overhead (Gul et al., 9 Jan 2026).
1. Formal Definition and Objective
BLC addresses the quantized low-rank approximation of a neural network weight matrix $W \in \mathbb{R}^{m \times n}$, targeting efficient storage and inference without costly fine-tuning. Given $W$, a desired bit-width $d$ for uniform quantization, and a calibration activation matrix $X \in \mathbb{R}^{n \times s}$ (with $s$ activation samples), the goal is to decompose $W$ as:
- A rank-$r$ matrix $W_r = U V^\top$ with $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$,
- A clipping threshold $\tau > 0$,
- A quantized residual $W_q = \mathrm{Quant}(\mathrm{Clip}(W - W_r, \tau), d)$,
such that the end-to-end error on the calibration batch, $E = \| W X - (W_r + W_q) X \|_2$, is minimized. In matrix norm notation, the objective is:
$$\min_{W_r,\,\tau}\; \bigl\| W X - \bigl( W_r + \mathrm{Quant}(\mathrm{Clip}(W - W_r, \tau), d) \bigr) X \bigr\|_2,$$
where the clipping operator $\mathrm{Clip}(\cdot, \tau)$ saturates each entry to the range $[-\tau, \tau]$, and quantization uses uniform $d$-bit codebooks (with possible group-wise scaling or zero-point schemes).
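The two operators in the objective are standard; as a minimal sketch (function names and the symmetric-codebook layout are illustrative, not the paper's exact implementation), clipped uniform quantization and the calibration error can be written as:

```python
import numpy as np

def quant(a, tau, d):
    # Symmetric uniform d-bit quantizer: clip to [-tau, tau], then round
    # onto a grid of 2**(d-1)-1 positive/negative levels plus zero.
    qmax = 2 ** (d - 1) - 1
    scale = tau / qmax
    return np.clip(np.round(a / scale), -qmax, qmax) * scale

def calib_error(W, W_r, W_q, X):
    # End-to-end calibration error || W X - (W_r + W_q) X ||_2.
    return np.linalg.norm(W @ X - (W_r + W_q) @ X, 2)
```

Entries inside $[-\tau, \tau]$ incur at most half a quantization step of error; entries outside saturate to $\pm\tau$.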
2. Optimization Motivation
Low-rank preconditioning with $W_r$ captures the majority of the variance in $W$, isolating heavy-tail outliers in the residual $W - W_r$. Directly quantizing $W - W_r$ can produce excessive rounding errors, especially from outlier values. By clipping to $[-\tau, \tau]$, the dynamic range for quantization is reduced, thus improving quantization fidelity for the bulk of residual entries at the expense of saturating a small fraction of large outliers.
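The tradeoff can be demonstrated numerically; this toy example (synthetic data, not from the paper) shows that on a heavy-tailed residual, clipping before quantization lowers mean-squared error even though the outliers saturate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Heavy-tailed residual: 10k near-Gaussian entries plus a few big outliers.
r = rng.standard_normal(10_000)
r[:5] = 25.0

def uquant(x, tau, d=3):
    # Symmetric uniform d-bit quantizer on the clipped range [-tau, tau].
    qmax = 2 ** (d - 1) - 1
    scale = tau / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Full dynamic range: step size is dictated by the outliers, so the
# Gaussian bulk rounds to zero and keeps all of its energy as error.
err_full = np.mean((r - uquant(r, np.abs(r).max())) ** 2)
# Clipped range: fine steps for the bulk, larger error on the few outliers.
err_clip = np.mean((r - uquant(r, 4.0)) ** 2)
print(err_clip < err_full)  # clipping wins when outliers are rare
```

The balance reverses when outliers are frequent or extreme, which is why BLC searches the threshold rather than fixing it.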
A joint, closed-form optimization over all variables is computationally intractable. BLC therefore uses an alternating iterative approach:
- Recompute the best rank-$r$ low-rank component of the current residual (via R1-Sketch flexible rank selection).
- Search for the optimal clipping threshold $\tau$ that, after quantizing the residual, minimizes calibration error.
The working assumption is that reproducing with small error on the calibration set is a sufficient proxy for maintaining accuracy post-quantization.
3. Iterative BLC Procedure
The central loop of BLC alternates between low-rank extraction and clipping/quantization threshold search. The following pseudocode outlines the key steps within the context of FLRQ:
```
W_r = initial_low_rank(W)            # via SVD or R1-FLR
W_q = Quant(Clip(W - W_r, τ0), d)
bestE = +∞
best_{W_r, W_q} = {W_r, W_q}
for epoch in range(epochs):
    E = || W X - (W_r + W_q) X ||_2
    if E < bestE:
        bestE = E
        best_{W_r, W_q} = {W_r, W_q}
    R = W - W_q
    U, V = R1-FLR(R)                 # flexible-rank low-rank extraction
    W_r = U V^T
    bestτ, bestWq = argmin_{τ ∈ p_clp_search} || W X - (W_r + Quant(Clip(W - W_r, τ), d)) X ||_2
    W_q = bestWq
return best_{W_r, W_q}
```
R1-FLR leverages the fast Rank-1 Sketch (with Gaussian random projection) to extract a layer- and data-dependent rank efficiently, allowing for outlier-aware low-rank representations. The clipping threshold search is performed over $10$ logarithmically spaced fractions of the residual's maximum absolute value.
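As a concrete, executable rendering of the loop above (a sketch only: truncated SVD stands in for R1-FLR, and the grid constants are illustrative rather than the paper's exact settings):

```python
import numpy as np

def quant_clip(a, tau, d):
    # Symmetric uniform d-bit quantizer on the clipped range [-tau, tau].
    qmax = 2 ** (d - 1) - 1
    scale = tau / qmax
    return np.clip(np.round(a / scale), -qmax, qmax) * scale

def low_rank(a, r):
    # Stand-in for R1-FLR: best rank-r approximation via truncated SVD.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

def blc(W, X, d=3, r=4, epochs=5, n_taus=10):
    err = lambda Wr, Wq: np.linalg.norm(W @ X - (Wr + Wq) @ X)  # Frobenius
    W_r = low_rank(W, r)
    W_q = quant_clip(W - W_r, np.abs(W - W_r).max(), d)
    best = (err(W_r, W_q), W_r, W_q)
    for _ in range(epochs):
        # (1) refit the low-rank part against the current quantized residual
        W_r = low_rank(W - W_q, r)
        # (2) grid-search tau over log-spaced fractions of the residual's max
        taus = np.abs(W - W_r).max() * np.logspace(-2, 0, n_taus)
        W_q = min((quant_clip(W - W_r, t, d) for t in taus),
                  key=lambda Wq: err(W_r, Wq))
        if err(W_r, W_q) < best[0]:
            best = (err(W_r, W_q), W_r, W_q)
    return best
```

Because the best $(W_r, W_q)$ pair seen so far is stored, the returned calibration error never exceeds that of any visited iterate, matching the non-increasing-objective property claimed below.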
4. Theoretical Properties
The theoretical properties of BLC arise from the error guarantees of R1-Sketch and the monotone error reduction obtained via alternating minimization:
- With $q$ steps of power iteration, the expected R1-Sketch error obeys a Halko-style bound of the form
$$\mathbb{E}\,\bigl\| R - u_1 u_1^\top R \bigr\|_2 \;\le\; C^{1/(2q+1)}\,\sigma_2(R),$$
with $C$ a modest dimension-dependent constant, guaranteeing top-singular-vector approximation within a small constant factor per power iteration (Halko et al., 2011).
- Each BLC epoch does not increase the calibration error $E$; storing the best solution ensures a non-increasing objective. Although the problem is non-convex, convergence within a few epochs is typical for 3–4 bit quantization, and within tens of epochs for 2 bits.
- Per-epoch complexity is dominated by the cost of R1-Sketch (implemented as GEMV BLAS-2 operations) and the threshold search (a small loop over preselected values). Empirically, R1-Sketch costs only $6$ GEMV operations and is 2–5× faster than truncated SVD (Gul et al., 9 Jan 2026).
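A rank-1 sketch built from GEMVs can be sketched as follows; this is a generic Gaussian-start power iteration (the paper's R1-Sketch may differ in details such as the stopping rule):

```python
import numpy as np

def r1_sketch(A, q=6, rng=None):
    # Rank-1 randomized sketch: Gaussian start vector, then q rounds of
    # power iteration; each round costs two GEMVs (A @ v and A.T @ u).
    if rng is None:
        rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(q):
        u = A @ v
        u /= np.linalg.norm(u)
        v = A.T @ u
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(A @ v)   # estimate of the top singular value
    return sigma, (A @ v) / sigma, v
```

Every operation is a matrix-vector product, which is why the per-epoch cost stays at the BLAS-2 level rather than the BLAS-3 cost of a full or truncated SVD.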
5. Practical Parameter Choices
Practical deployment of BLC involves several empirically validated settings:
- R1-Sketch with a small, fixed number of power iterations is sufficient to match truncated SVD accuracy at a fraction of the computational cost (see Table 13 in (Gul et al., 9 Jan 2026)).
- Quantization is performed using group size $128$, consistent with the AWQ default.
- Calibration data consists of $128$ sequences of $2048$ tokens from WikiText2.
- The initial low-rank component can be derived via a full SVD or a single R1-FLR pass.
- Clipping thresholds are searched over $10$ logarithmically spaced fractions of the absolute max of the current residual.
- The recommended number of BLC epochs is 1–2 for 4-bit quantization, 2–5 for 3 bits, and 10–30 for 2 bits (see Figure 1 and Table 15 in (Gul et al., 9 Jan 2026)).
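The group-size-$128$ setting above can be illustrated with a generic per-group symmetric quantizer (a sketch: the function name and layout are hypothetical, not the paper's implementation):

```python
import numpy as np

def groupwise_quant(w, d=4, group=128):
    # Per-group symmetric quantization: each run of `group` consecutive
    # weights gets its own scale from its max magnitude (AWQ-style g=128).
    qmax = 2 ** (d - 1) - 1
    g = w.reshape(-1, group)                              # assumes len(w) % group == 0
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax   # one scale per group
    q = np.clip(np.round(g / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)
```

Smaller groups track local dynamic range more tightly at the cost of storing more scales; $128$ is the common accuracy/overhead compromise.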
6. Empirical Validation and Comparisons
BLC achieves state-of-the-art accuracy and efficiency in quantized LLMs. Key experimental results on benchmark models and tasks include:
- On OPT-1.3B with W3A16 quantization, BLC improves perplexity (PPL) from 15.80 (without BLC) to 15.53 (with BLC), a delta of $-0.27$.
- For W2A16 on OPT-1.3B, PPL without BLC blows up due to overflow; BLC reduces it to 22.99.
- On six zero-shot tasks for LLaMA2-7B at 3 bits, FLRQ+BLC improves average accuracy from 53.7% to 54.4%.
- Ablation shows that removing BLC degrades 2-bit OPT-1.3B PPL from 22.99 to 29.32.
- Throughput and latency overhead of FLRQ+LoRA on W4A16 is only 4–6% relative to the baseline (Figure 2).
- Compared with fixed-rank LQER at rank 256, adaptive FLRQ achieves similar or better PPL with average rank ≈ 39 and extra storage ≈ 0.24 bits per parameter (Table 9).
BLC thus provides an alternating low-rank extraction and outlier quantization procedure with robust calibration loss reduction, fast convergence, and minimal overhead, underpinning accurate and efficient 2–4 bit quantization for large-scale models (Gul et al., 9 Jan 2026).