
BLC: Best Low-Rank Approximation under Clipping

Updated 16 January 2026
  • The paper introduces BLC as a novel method that integrates low-rank extraction with outlier clipping to minimize quantization error.
  • It employs an alternating iterative approach that refines rank approximations and clipping thresholds, ensuring rapid convergence and minimal overhead.
  • Empirical validations show significant improvements in perplexity and quantization fidelity for 2–4 bit post-training quantization of large language models.

Best Low-rank Approximation under Clipping (BLC) is a method central to the FLRQ (Flexible Low-Rank Quantization) framework, designed for efficient and accurate post-training quantization (PTQ) of LLMs. BLC focuses on minimizing quantization error by alternating scalable low-rank extraction with outlier clipping and quantization of the residual, thereby providing robust, near-monotone error decrease and high quantization quality at minimal computational overhead (Gul et al., 9 Jan 2026).

1. Formal Definition and Objective

BLC addresses the quantized low-rank approximation of a neural network weight matrix $W \in \mathbb{R}^{m \times n}$, targeting efficient storage and inference without costly fine-tuning. Given $W$, a desired bit-width $d$ for uniform quantization, and a calibration activation matrix $X \in \mathbb{R}^{n \times B}$ (with $B$ activation samples), the goal is to decompose $W$ into:

  • A rank-$r$ matrix $W_r = U V^T$ with $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$,
  • A clipping threshold $\tau \geq 0$,
  • A quantized residual $W_q = \mathrm{Quant}\left( \mathrm{Clip}(W - W_r, \tau), d \right)$,

such that the end-to-end error on the calibration batch,

E(U, V, \tau) := \| W X - [\, W_r + W_q \,] X \|_2,

is minimized. In matrix norm notation, the objective is:

\min_{U, V, \tau \geq 0} \left\| W - \left[ U V^T + \mathrm{Quant}\left( \mathrm{Clip}(W - U V^T, \tau), d \right) \right] \right\|_F^2,

where the clipping operator is $\mathrm{Clip}(A, \tau)_{ij} = \mathrm{sign}(A_{ij}) \cdot \min(|A_{ij}|, \tau)$, and quantization uses uniform $d$-bit codebooks (with possible group-wise scaling or zero-point schemes).
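The clipping and quantization operators can be sketched in NumPy. The symmetric per-tensor scheme below is a simplification for illustration; the paper also allows group-wise scaling and zero-point variants:

```python
import numpy as np

def clip(A, tau):
    # Element-wise clipping: sign(A_ij) * min(|A_ij|, tau)
    return np.sign(A) * np.minimum(np.abs(A), tau)

def quant(A, d, tau):
    # Symmetric uniform d-bit quantization of values in [-tau, tau];
    # a simplified stand-in for group-wise or zero-point codebooks.
    levels = 2 ** (d - 1) - 1            # e.g. 7 positive levels at d = 4
    scale = tau / levels                 # assumes tau > 0
    return np.clip(np.round(A / scale), -levels, levels) * scale
```

For already-clipped inputs, the worst-case rounding error per entry is $\text{scale}/2 = \tau / (2(2^{d-1} - 1))$, which is the quantity the threshold search trades off against the saturation of outliers.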

2. Optimization Motivation

Low-rank preconditioning with $W_r$ captures the majority of the variance in $W$, isolating heavy-tailed outliers in the residual $R = W - W_r$. Directly quantizing $R$ can produce excessive rounding error, especially from outlier values. Clipping $R$ to $\pm\tau$ reduces the dynamic range seen by the quantizer, improving quantization fidelity for the bulk of residual entries at the expense of saturating a small fraction of large outliers at $\pm\tau$.

A joint, closed-form optimization over all variables $(U, V, \tau)$ is computationally intractable. BLC therefore uses an alternating iterative approach:

  • Recompute the best rank-$r$ low-rank component of the current residual (via R1-Sketch flexible rank selection).
  • Search for the optimal clipping threshold $\tau$ that, after quantizing the residual, minimizes calibration error.

The working assumption is that reproducing $WX$ with small $\ell_2$ error on the calibration set is a sufficient proxy for maintaining accuracy post-quantization.
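Under this proxy, the quantity being minimized is simply the mismatch of the compressed layer's output on the calibration batch. A minimal sketch (using the spectral norm, matching the $\|\cdot\|_2$ in the definition above):

```python
import numpy as np

def calib_error(W, W_r, W_q, X):
    # || W X - (W_r + W_q) X ||_2 on calibration activations X
    return np.linalg.norm(W @ X - (W_r + W_q) @ X, ord=2)
```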

3. Iterative BLC Procedure

The central loop of BLC alternates between low-rank extraction and clipping/quantization threshold search. The following pseudocode outlines the key steps within the context of FLRQ:

W_r = initial_low_rank(W)       # via SVD or R1-FLR
W_q = Quant(Clip(W - W_r, τ0), d)
bestE = +∞
best_Wr, best_Wq = W_r, W_q

for epoch in range(epochs):
    E = || W X - (W_r + W_q) X ||_2
    if E < bestE:
        bestE = E
        best_Wr, best_Wq = W_r, W_q
    R = W - W_q
    U, V = R1-FLR(R)             # Flexible-rank low-rank extraction
    W_r = U V^T
    bestτ, bestWq = argmin_{τ ∈ τ_grid} || W X - (W_r + Quant(Clip(W - W_r, τ), d)) X ||_2
    W_q = bestWq

return best_Wr, best_Wq

R1-FLR leverages the fast Rank-1 Sketch (with Gaussian projection) to select a layer- and data-dependent rank efficiently, allowing for outlier-aware low-rank representations. The clipping-threshold search is performed over $10$ logarithmically spaced fractions of the residual's maximum absolute value.
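The threshold search in the loop above can be sketched as an exhaustive scan. The grid's lower bound (two decades below the residual's maximum, `lo_frac=1e-2`) is an assumption for illustration; the source only specifies 10 logarithmically spaced fractions:

```python
import numpy as np

def search_clip_threshold(W, W_r, X, d=4, num=10, lo_frac=1e-2):
    # Scan `num` log-spaced fractions of max|W - W_r| as clipping
    # thresholds; keep the one minimizing the calibration error.
    R = W - W_r
    taus = np.abs(R).max() * np.logspace(np.log10(lo_frac), 0.0, num)
    levels = 2 ** (d - 1) - 1
    best_tau, best_Wq, best_err = None, None, np.inf
    for tau in taus:
        Rc = np.sign(R) * np.minimum(np.abs(R), tau)          # Clip
        s = tau / levels
        Wq = np.clip(np.round(Rc / s), -levels, levels) * s   # Quant
        err = np.linalg.norm(W @ X - (W_r + Wq) @ X, ord=2)
        if err < best_err:
            best_tau, best_Wq, best_err = tau, Wq, err
    return best_tau, best_Wq
```

Because the grid is fixed and small, this inner search adds only a handful of clipped-quantize passes per epoch.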

4. Theoretical Properties

The theoretical properties of BLC arise from the error guarantees of R1-Sketch and the monotone error reduction obtained via alternating minimization:

  • With $r = 1$ and $it$ power iterations, the expected R1-Sketch error is bounded as

\mathbb{E}\,\| A - A_1 \| \leq \sigma_2 + \left[ 1 + 4\sqrt{2n} \right]^{1/(it+1)} \sigma_2,

guaranteeing top-singular-vector approximation within a small constant factor per power iteration (Halko et al., 2011).

  • Each BLC epoch does not increase the calibration error $E$; storing the best solution ensures a non-increasing objective. Although the problem is non-convex, convergence within $O(1)$ epochs is typical for 3–4 bit quantization, and $O(10\text{–}20)$ epochs for 2 bits.
  • Per-epoch complexity is dominated by the cost of R1-Sketch (i.e., $O(it \cdot n^2)$, implemented as GEMV BLAS-2 operations) and the threshold search (a small loop over preselected $\tau$ values). Empirically, R1-Sketch with $it = 2$ costs only $6$ GEMV operations and is 2–5$\times$ faster than truncated SVD (Gul et al., 9 Jan 2026).
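The GEMV-only structure behind these complexity claims can be illustrated with a plain randomized rank-1 power iteration in the spirit of Halko et al.; this is a generic sketch, not the paper's R1-FLR, which additionally adapts the rank:

```python
import numpy as np

def r1_sketch(A, it=2, rng=None):
    # Randomized rank-1 power iteration: only matrix-vector (BLAS-2)
    # products with A and A^T, never a full SVD.
    rng = np.random.default_rng(0) if rng is None else rng
    m, n = A.shape
    v = rng.standard_normal(n)          # Gaussian start vector
    v /= np.linalg.norm(v)
    for _ in range(it):
        u = A @ v
        u /= np.linalg.norm(u)
        v = A.T @ u
        v /= np.linalg.norm(v)
    u = A @ v
    s = np.linalg.norm(u)               # approximate top singular value
    return u / s, s, v                  # approximate (u1, sigma1, v1)
```

Subtracting `s * np.outer(u, v)` and repeating is one way a flexible rank could be grown one term at a time, which matches the per-term GEMV cost cited above.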

5. Practical Parameter Choices

Practical deployment of BLC involves several empirically validated settings:

  • R1-Sketch with $it = 2$ is sufficient to match truncated SVD accuracy at a fraction of the computational cost (see Table 13 in (Gul et al., 9 Jan 2026)).
  • Quantization is performed using group size $128$, consistent with the AWQ default.
  • Calibration data consists of $128$ sequences of $2048$ tokens from WikiText2.
  • The initial low-rank component $W_r$ can be derived via a full SVD or a single R1-FLR pass.
  • Clipping thresholds $\tau$ are searched over $10$ logarithmically spaced fractions of the absolute max of the current residual.
  • The recommended number of BLC epochs is 1–2 for 4-bit quantization, 2–5 for 3 bits, and 10–30 for 2 bits (see Figure 1 and Table 15 in (Gul et al., 9 Jan 2026)).

6. Empirical Validation and Comparisons

BLC achieves state-of-the-art accuracy and efficiency in quantized LLMs. Key experimental results on benchmark models and tasks include:

  • On OPT-1.3B with W3A16 quantization, BLC improves perplexity (PPL) from 15.80 (without BLC) to 15.53 (with BLC), a delta of $-0.27$.
  • For W2A16 on OPT-1.3B, PPL without BLC rises above $10^4$ due to overflow; BLC reduces this to 22.99.
  • On six zero-shot tasks for LLaMA2-7B at 3 bits, FLRQ+BLC improves average accuracy from 53.7% to 54.4%.
  • Ablation shows that removing BLC degrades 2-bit OPT-1.3B PPL from 22.99 to 29.32.
  • Throughput and latency overhead of FLRQ+LoRA on W4A16 is only 4–6% relative to the baseline (Figure 2).
  • Compared with fixed-rank LQER at rank 256, adaptive FLRQ achieves similar or better PPL with average rank ≈ 39 and extra storage ≈ 0.24 bits per parameter (Table 9).

BLC thus provides an alternating low-rank extraction and outlier quantization procedure with robust calibration loss reduction, fast convergence, and minimal overhead, underpinning accurate and efficient 2–4 bit quantization for large-scale models (Gul et al., 9 Jan 2026).
