
R1-FLR: Rank1-Sketch Flexible Rank Selection

Updated 16 January 2026
  • R1-FLR is a randomized algorithmic framework that leverages repeated rank-1 Gaussian sketches for fine-grained, adaptive rank selection in low-rank matrix approximation.
  • The method employs an iterative deflation process with dynamic stopping rules, balancing quantization error reduction against memory constraints for deep neural network weights.
  • Empirical results demonstrate state-of-the-art quantization performance with lower computational overhead compared to traditional SVD or RSVD methods.

Rank1-Sketch-based Flexible Rank Selection (R1-FLR) is a randomized algorithmic framework for adaptive, fine-grained rank determination and low-rank matrix approximation, distinguished by its use of repeated rank-1 Gaussian sketches and thresholding schemes that enable computational efficiency and layer-wise adaptability. R1-FLR is principally designed for large-scale applications—such as post-training quantization of deep neural network weights—where traditional low-rank approximations (e.g., SVD or randomized SVD) are suboptimal due to computational cost and inability to tailor rank to the heterogeneity of matrix structure across layers. The method facilitates on-the-fly extraction of singular vectors and dynamic stopping, balancing quantization error reduction against memory constraints and computational complexity (Gul et al., 9 Jan 2026).

1. Motivation and Problem Definition

The motivation for R1-FLR arises from the limitations of fixed-rank low-rank approximation techniques in large models, notably LLMs. While classical SVD-based approaches—such as truncated SVD and randomized SVD—can enable low-rank decompositions, they incur substantial computational overhead when extended to per-layer, data-dependent rank selection. These methods generally require either a globally fixed compromise rank or expensive per-layer sweeps to optimally reduce quantization error.

R1-FLR seeks to decouple rank selection from an a priori guess by enabling rapid, layerwise discovery of the minimal rank $r$ needed for a weight matrix $W \in \mathbb{R}^{m \times n}$ such that the rank-$r$ correction significantly improves quantization accuracy without violating model-size or memory constraints. Instead of blockwise or full-matrix sketching, R1-FLR iteratively applies a Gaussian-projected rank-1 sketch, leveraging repeated extraction of the leading singular-vector direction and providing explicit and immediate stopping rules (Gul et al., 9 Jan 2026).

2. Methodological Foundations: The R1-Sketch Mechanism

At each iteration, a standard R1-FLR extraction proceeds as follows:

  • Draw a Gaussian vector $S \in \mathbb{R}^{n \times 1}$ with entries $S_i \sim \mathcal{N}(0,1)$.
  • Form $P = (A A^\top)^{it} A S$, where $it$ denotes the number of power iterations and $A$ is the current residual.
  • Normalize: $Q = P / \|P\|$.
  • Compute the sketch $B = Q^\top A$ (a $1 \times n$ row vector).
  • Perform a rank-1 SVD of $B$: $U_B = 1$, $\Sigma_B = \|B\|$, $V_B = B / \|B\|$.
  • The sketched rank-1 approximation of $A$ is $A_L A_R$, with $A_L = Q \Sigma_B$ and $A_R = V_B$.
  • Update $A \leftarrow A - A_L A_R$ and repeat as needed (Gul et al., 9 Jan 2026).
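The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `r1_sketch`, the `rng` argument, and the choice to apply the power iterations as repeated products (rather than forming $A A^\top$ explicitly) are all illustrative.

```python
import numpy as np

def r1_sketch(A, it=2, rng=None):
    """One rank-1 Gaussian sketch of A.

    Returns (a_left, a_right) with shapes (m, 1) and (1, n) such that
    a_left @ a_right approximates the dominant rank-1 part of A.
    Assumes A is not the zero matrix (otherwise the norms below vanish).
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    S = rng.standard_normal((n, 1))      # Gaussian test vector S
    P = A @ S                            # P = A S, shape (m, 1)
    for _ in range(it):                  # power iterations: P <- (A A^T) P
        P = A @ (A.T @ P)
    Q = P / np.linalg.norm(P)            # normalized direction
    B = Q.T @ A                          # 1 x n sketch
    sigma = np.linalg.norm(B)            # Sigma_B = ||B||
    a_left = Q * sigma                   # A_L = Q Sigma_B
    a_right = B / sigma                  # A_R = V_B = B / ||B||
    return a_left, a_right

# Hypothetical usage: deflate the residual once.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
a_left, a_right = r1_sketch(A, it=2, rng=rng)
residual = A - a_left @ a_right
```

Since $A_L A_R = Q Q^\top A$, each step subtracts the projection of the residual onto a single direction, so the Frobenius norm of the residual cannot increase.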

This sequential, deflationary process enables fine-grained rank selection. Unlike CUR approaches that use blockwise selection and a fixed small sketch matrix throughout, R1-FLR can be interpreted as a limiting case where a fresh rank-1 Gaussian sketch is used per increment, maximizing adaptability (Pritchard et al., 26 Sep 2025).

3. Outlier-Aware Rank Extraction and Stopping Criteria

R1-FLR integrates an outlier-aware criterion to automate layerwise selection of the effective rank, directly tied to quantization noise and memory usage. After $r$ rank-1 increments:

  • Compute the approximation $W_r = \sum_{i=1}^r U_i V_i$.
  • Quantize the residual $R = W - W_r$ in $d$ bits, with scaling $s_r = (2^{d-1}-1)/\max_{ij} |R_{ij}|$.
  • The worst-case error $E_r = 1/(2 s_r)$ yields the "precision gain" metric $Q = (d + \log_2(w_0 / w_r)) / d$, where $w_0 = \max_{ij}|W_{ij}|$ and $w_r = \max_{ij}|R_{ij}|$.
  • The "memory cost" metric $K = 1 + (d_{fp}\, r\, (m+n)) / (d\, m\, n)$ compares the cost of storing $r$ rank-1 corrections (at $d_{fp}$ bits per entry) to $d$-bit quantization of $W$.
  • Terminate extraction when $Q \leq K$ (precision gain is not worth extra memory), $K > 1 + x$ (exceeds user-specified memory budget $x$), or the relative decrease in $\max_{ij} |R_{ij}|$ falls below the threshold $t_{\text{slope}}$ (Gul et al., 9 Jan 2026).

This procedure ensures that each increment is justified by a balance of quantization error reduction and storage cost, exploiting the layerwise variability of effective rank in practice.
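The two metrics and the three-way stopping test can be written as small helper functions. The sketch below is illustrative only: the function names are hypothetical, and the default `d_fp=16` (factors stored in 16-bit floating point) is an assumption rather than a value stated in the source.

```python
import math

def precision_gain(w0, wr, d):
    """Q = (d + log2(w0 / wr)) / d, with w0 = max|W| and wr = max|R|."""
    return (d + math.log2(w0 / wr)) / d

def memory_cost(r, m, n, d, d_fp=16):
    """K = 1 + d_fp * r * (m + n) / (d * m * n). d_fp=16 is an assumed default."""
    return 1.0 + (d_fp * r * (m + n)) / (d * m * n)

def should_stop(w0, wr, prev_wr, r, m, n, d, x=0.2, t_slope=1e-2, d_fp=16):
    """True if any of the three termination conditions holds."""
    q = precision_gain(w0, wr, d)
    k = memory_cost(r, m, n, d, d_fp)
    slope = (prev_wr - wr) / prev_wr     # relative decrease in max|R|
    return q <= k or k > 1.0 + x or slope < t_slope

# Hypothetical 4096 x 4096 layer, 4-bit quantization, 32 increments:
k32 = memory_cost(32, 4096, 4096, d=4)   # 1.0625
q_half = precision_gain(1.0, 0.5, d=4)   # 1.25: halving max|R| gains one bit
```

In this example, halving the residual's largest entry is worth one extra bit of precision ($Q = 1.25$), which exceeds the memory cost of 32 increments ($K = 1.0625$), so extraction would continue as long as the slope and budget tests also pass.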

4. Algorithmic Description and Complexity

The R1-FLR algorithm is summarized in the pseudocode below (verbatim from (Gul et al., 9 Jan 2026)):

R ← W
w0 ← max(|R|)
W_L, W_R ← empty lists
for i = 1 to min(m, n):
    S ← RandomNormal(n, 1)
    P ← (R R^T)^{it} R S
    Q ← P / ‖P‖
    B ← Q^T R
    σ ← ‖B‖
    u ← Q * σ
    v ← B / σ
    R ← R - u v
    wr ← max(|R|)
    Qgain ← (d + log2(w0/wr))/d
    Kcost ← 1 + (d_fp * i * (m + n)) / (d * m * n)
    slope ← (prev_wr - wr) / prev_wr
    prev_wr ← wr
    if Qgain ≤ Kcost or Kcost > 1 + x or slope < t_slope:
        break
    append W_L ← u, W_R ← v
return W_L, W_R

The total computational cost to extract $r$ increments is $O(r p m n)$, where $p = 2\,it + 2$ is the number of GEMV (matrix-vector product) calls per rank-1 extraction: one for $RS$, two per power iteration, and one for $Q^\top R$. For $r \ll \min(m,n)$, this is several times faster than SVD or RSVD. Working memory comprises $O(mn)$ for the matrix itself and $O(m + n)$ for workspace vectors (Gul et al., 9 Jan 2026).
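A runnable NumPy rendering of the loop might look like the sketch below. It is an illustration under several assumptions that are not taken from the source: the function name `r1_flr`, the `seed` argument, the default `d_fp=16`, the initialization of `prev_wr` to `w0` (the listing does not say how it starts), and the guard that stops once the residual is exactly zero.

```python
import numpy as np

def r1_flr(W, d=4, d_fp=16, it=2, x=0.2, t_slope=1e-2, seed=0):
    """Adaptive rank-1 extraction; returns (L, Rt) with W ~ L @ Rt + residual."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    R = np.array(W, dtype=np.float64)        # residual copy, deflated in place
    w0 = prev_wr = np.abs(R).max()           # assumption: prev_wr starts at w0
    W_L, W_R = [], []
    for i in range(1, min(m, n) + 1):
        if prev_wr == 0.0:                   # guard (assumption): nothing left
            break
        S = rng.standard_normal((n, 1))
        P = R @ S
        for _ in range(it):                  # 2 matrix-vector products each
            P = R @ (R.T @ P)
        Q = P / np.linalg.norm(P)
        B = Q.T @ R
        sigma = np.linalg.norm(B)
        u, v = Q * sigma, B / sigma
        R -= u @ v                           # deflate
        wr = np.abs(R).max()
        q_gain = np.inf if wr == 0.0 else (d + np.log2(w0 / wr)) / d
        k_cost = 1.0 + (d_fp * i * (m + n)) / (d * m * n)
        slope = (prev_wr - wr) / prev_wr
        prev_wr = wr
        if q_gain <= k_cost or k_cost > 1.0 + x or slope < t_slope:
            break                            # failing increment is discarded
        W_L.append(u)
        W_R.append(v)
    L = np.hstack(W_L) if W_L else np.zeros((m, 0))
    Rt = np.vstack(W_R) if W_R else np.zeros((0, n))
    return L, Rt

# Hypothetical usage: a strong rank-1 "outlier" component plus small noise.
rng = np.random.default_rng(1)
W = 5.0 * rng.standard_normal((64, 1)) @ rng.standard_normal((1, 48))
W += 0.01 * rng.standard_normal((64, 48))
L, Rt = r1_flr(W)
```

As in the listing, the increment that triggers a stopping condition is not appended, so every returned increment has passed all three tests. On a matrix as small as this example, the memory budget $x = 0.2$ already rules out a second increment.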

5. Theoretical Guarantees and Error Bounds

R1-FLR inherits the theoretical error bounds of rank-1 RSVD, with the following spectral-norm guarantee:

$$\mathbb{E}\,\bigl\|A - A_r\bigr\| \leq \sigma_{r+1} + \left[1 + 4\sqrt{2n/(r-1)}\right]^{1/(it+1)} \sigma_{r+1}$$

where $\sigma_{r+1}$ is the $(r+1)$-st singular value and $it$ is the number of power iterations, making R1-FLR nearly optimal in spectral norm for well-decaying spectra and robust to slow singular value decay rates (Gul et al., 9 Jan 2026).
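To get a feel for how the power iterations tighten this bound, the multiplier of $\sigma_{r+1}$ on the right-hand side can be evaluated directly. The helper below is a small sketch whose name and example values ($n = 4096$, $r = 33$) are illustrative assumptions; it only evaluates the expression as stated above, which requires $r \geq 2$.

```python
import math

def rsvd_bound_factor(n, r, it):
    """Return c such that the stated bound reads E||A - A_r|| <= c * sigma_{r+1}."""
    if r < 2:
        raise ValueError("the bound is only defined for r >= 2")
    inner = 1.0 + 4.0 * math.sqrt(2.0 * n / (r - 1))
    return 1.0 + inner ** (1.0 / (it + 1))

# Hypothetical example with n = 4096 and r = 33, so sqrt(2n / (r - 1)) = 16:
no_power = rsvd_bound_factor(4096, 33, it=0)   # 66.0
two_power = rsvd_bound_factor(4096, 33, it=2)  # about 5.02
```

For these values the multiplier falls from 66 with no power iterations to roughly 5 with $it = 2$.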

In the broader context of CUR-based approaches, the underlying rationale of "recycling" a tall sketch across iterations is theoretically justified. Extending the IterativeCUR framework to a pure rank-1 sketching regime (the R1-FLR limit) enables high-probability accuracy guarantees analogous to those derived via random projection inequalities (Pritchard et al., 26 Sep 2025).

6. Empirical Performance and Usage in Quantization

Empirical results in LLM post-training quantization demonstrate that R1-FLR, implemented in the FLRQ framework, achieves:

  • State-of-the-art quantization quality, with perplexity values matching or outperforming fixed-rank SVD-based methods at much lower average rank (e.g., average rank ≈ 40 vs. fixed rank 256 for 2-bit quantization) while adding only ≈ 0.3 bits per weight.
  • Algorithmic speed-ups: quantization time 30–50% lower than RSVD/SVD alternatives, with inference latency rising by only 4–6%.
  • Robust performance across varying group sizes and clipping regimes, and effectiveness even in the presence of layerwise variability and weight outliers (Gul et al., 9 Jan 2026).

Hyperparameters reported as effective include $it = 2$, $t_{\text{slope}} \approx 10^{-2}$, memory threshold $x = 0.2$, and activation group size 128 for clipping (Gul et al., 9 Jan 2026).

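| Precision | Model | Avg. rank | Extra bits per weight | PPL |
| --- | --- | --- | --- | --- |
| W4A16 | OPT-1.3B | 30.5 | 0.34 | 14.65 |
| W4A16 | LLaMA2-7B | 36.1 | 0.21 | 5.55 |
| W3A16 | OPT-1.3B | 28.8 | 0.33 | 15.53 |
| W3A16 | LLaMA2-7B | 35.8 | 0.21 | 5.88 |
| W2A16 | OPT-1.3B | 27.6 | 0.33 | 22.99 |
| W2A16 | LLaMA2-7B | 39.2 | 0.24 | 9.14 |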
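The relationship between rank and storage overhead can be checked with simple arithmetic. The sketch below assumes that the overhead of $r$ rank-1 corrections is $d_{fp}\, r\, (m+n) / (mn)$ bits per weight, which is the numerator of the memory cost metric $K$ spread over the $mn$ weights. Whether the extra-bits column above was computed exactly this way is an assumption, as are the function names, `d_fp=16`, and the 4096 × 4096 layer shape.

```python
import math

def extra_bits_per_weight(r, m, n, d_fp=16):
    """Overhead of r rank-1 corrections in bits per weight (assumed formula)."""
    return d_fp * r * (m + n) / (m * n)

def max_rank_for_budget(m, n, d, x=0.2, d_fp=16):
    """Largest r with K = 1 + d_fp * r * (m + n) / (d * m * n) <= 1 + x."""
    return math.floor(x * d * m * n / (d_fp * (m + n)))

# Hypothetical square 4096 x 4096 layer:
bits = extra_bits_per_weight(36, 4096, 4096)   # 0.28125
r_max = max_rank_for_budget(4096, 4096, d=4)   # 102
```

Under these assumptions, a rank of 36 on a square 4096-wide layer costs about 0.28 extra bits per weight, and the budget $x = 0.2$ at 4-bit precision caps the rank at 102 for that shape.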

7. Connections to Broader Rank-Adaptive Sketching Paradigms

R1-FLR exemplifies a general shift toward sketch-based adaptive algorithms for low-rank matrix approximation and numerical rank estimation. The methodology aligns with two-sided randomized sketching approaches for rank estimation, which use small random sketches to recover singular value structure and adapt rank thresholding in streaming or memory-limited regimes (Meier et al., 2021). It also represents the rank-1-sketch limit of blockwise recycled-sketch algorithms (e.g., IterativeCUR) that operate by incrementally updating both the approximation and its error proxy using only matrix–vector or matrix–small-matrix products and never requiring a full residual (Pritchard et al., 26 Sep 2025).

A plausible implication is that the R1-FLR strategy, while derived for quantization and LLM compression, forms a unifying framework for computationally adaptive low-rank approximation wherever per-matrix or per-layer variability in numerical rank is present or where memory and finite-precision quantization play a central role. The approach is amenable to further acceleration when paired with structured or fast-transform sketches, and its probabilistic error control suggests a broad range of robust, high-confidence applications in large-scale numerical linear algebra and machine learning.
