R1-FLR: Rank1-Sketch Flexible Rank Selection

Updated 16 January 2026

R1-FLR is a randomized algorithmic framework that leverages repeated rank-1 Gaussian sketches for fine-grained, adaptive rank selection in low-rank matrix approximation.
The method employs an iterative deflation process with dynamic stopping rules, balancing quantization error reduction against memory constraints for deep neural network weights.
Empirical results demonstrate state-of-the-art quantization performance with lower computational overhead compared to traditional SVD or RSVD methods.

Rank1-Sketch-based Flexible Rank Selection (R1-FLR) is a randomized algorithmic framework for adaptive, fine-grained rank determination and low-rank matrix approximation, distinguished by its use of repeated rank-1 Gaussian sketches and thresholding schemes that enable computational efficiency and layer-wise adaptability. R1-FLR is principally designed for large-scale applications—such as post-training quantization of deep neural network weights—where traditional low-rank approximations (e.g., SVD or randomized SVD) are suboptimal due to computational cost and inability to tailor rank to the heterogeneity of matrix structure across layers. The method facilitates on-the-fly extraction of singular vectors and dynamic stopping, balancing quantization error reduction against memory constraints and computational complexity (Gul et al., 9 Jan 2026).

1. Motivation and Problem Definition

The motivation for R1-FLR arises from the limitations of fixed-rank low-rank approximation techniques in large models, notably LLMs. While classical SVD-based approaches—such as truncated SVD and randomized SVD—can enable low-rank decompositions, they incur substantial computational overhead when extended to per-layer, data-dependent rank selection. These methods generally require either a globally fixed compromise rank or expensive per-layer sweeps to optimally reduce quantization error.

R1-FLR seeks to decouple rank selection from an a priori guess by enabling rapid, layerwise discovery of the minimal rank $r$ needed for a weight matrix $W \in \mathbb{R}^{m \times n}$ such that the rank-$r$ correction significantly improves quantization accuracy without violating model-size or memory constraints. Instead of blockwise or full-matrix sketching, R1-FLR iteratively applies a Gaussian-projected rank-1 sketch, repeatedly extracting the leading singular-vector direction and checking explicit stopping rules after every increment (Gul et al., 9 Jan 2026).

2. Methodological Foundations: The R1-Sketch Mechanism

At each iteration, a standard R1-FLR extraction proceeds as follows:

Draw a Gaussian vector $S \in \mathbb{R}^{n \times 1}$ with entries $S_i \sim \mathcal{N}(0,1)$.
Form $P = (A A^\top)^{it} A S$, where $it$ denotes the number of power iterations and $A$ is the current residual.
Normalize: $Q = P / \|P\|$.
Compute the sketch $B = Q^\top A$ (a $1 \times n$ row vector).
Perform a rank-1 SVD of $B$: $U_B = 1$, $\Sigma_B = \|B\|$, $V_B = B / \|B\|$.
The rank-1 approximation $Q Q^\top A$ is then given by the factors $A_L = Q \Sigma_B$, $A_R = V_B$; it approaches the best rank-1 approximation of $A$ as $Q$ aligns with the leading left singular vector, which the power iterations promote.
Update $A \leftarrow A - A_L A_R$ and repeat as needed (Gul et al., 9 Jan 2026).
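The extraction steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `r1_sketch` and its signature are hypothetical, and the renormalization between power iterations is an added numerical safeguard that leaves the direction of $Q$ unchanged.

```python
import numpy as np

def r1_sketch(A, it=2, rng=None):
    """One rank-1 Gaussian sketch of A (hypothetical helper).

    Returns (a_l, a_r) such that np.outer(a_l, a_r) equals Q Q^T A.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = rng.standard_normal(A.shape[1])   # S with i.i.d. N(0, 1) entries
    p = A @ s                             # A S
    for _ in range(it):                   # apply (A A^T) `it` times, two GEMVs each
        p /= np.linalg.norm(p)            # assumption: rescale to avoid overflow
        p = A @ (A.T @ p)
    q = p / np.linalg.norm(p)             # Q = P / ||P||
    b = q @ A                             # B = Q^T A, a length-n row vector
    sigma = np.linalg.norm(b)             # Sigma_B = ||B||
    return q * sigma, b / sigma           # A_L = Q Sigma_B, A_R = V_B

# Usage: an exactly rank-1 matrix is recovered up to rounding by a single sketch.
rng = np.random.default_rng(0)
A = np.outer(rng.standard_normal(6), rng.standard_normal(4))
a_l, a_r = r1_sketch(A, it=2, rng=rng)
residual = A - np.outer(a_l, a_r)         # the deflated matrix for the next step
```

The helper assumes a nonzero input matrix; a zero residual would make the normalizations undefined and has to be handled by the caller.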

This sequential, deflationary process enables fine-grained rank selection. Unlike CUR approaches that use blockwise selection and a fixed small sketch matrix throughout, R1-FLR can be interpreted as a limiting case where a fresh rank-1 Gaussian sketch is used per increment, maximizing adaptability (Pritchard et al., 26 Sep 2025).

3. Outlier-Aware Rank Extraction and Stopping Criteria

R1-FLR integrates an outlier-aware criterion to automate layerwise selection of the effective rank, directly tied to quantization noise and memory usage. After $r$ rank-1 increments:

Compute the approximation $W_r = \sum_{i=1}^r U_i V_i$.
Quantize the residual $R = W - W_r$ in $d$ bits, with scaling $s_r = (2^{d-1}-1)/w_r$, where $w_r = \max_{ij} |R_{ij}|$.
The worst-case rounding error is $E_r = 1/(2s_r)$, which is proportional to $w_r$, so each halving of $w_r$ relative to $w_0 = \max_{ij}|W_{ij}|$ is worth one bit of precision. This yields the "precision gain" metric $Q = (d + \log_2(w_0 / w_r)) / d$ (distinct from the sketch direction $Q$ of Section 2).
The "memory cost" metric $K = 1 + (d_{fp} r (m+n)) / (d m n)$ compares the cost of storing $r$ rank-1 corrections at $d_{fp}$ bits per entry to $d$-bit quantization of $W$.
Terminate extraction when $Q \leq K$ (precision gain is not worth extra memory), $K > 1 + x$ (exceeds user-specified memory budget $x$), or the relative decrease in $\max |R|$ falls below threshold $t_{\text{slope}}$ (Gul et al., 9 Jan 2026).

This procedure ensures that each increment is justified by a balance of quantization error reduction and storage cost, exploiting the layerwise variability of effective rank in practice.
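The two metrics and the three-way stopping test can be illustrated with a short Python sketch. The function names are hypothetical, and `d_fp=16` (factors stored in 16-bit floating point) is an assumption made only for this example; the defaults `x=0.2` and `t_slope=1e-2` are the values reported as effective in Section 6.

```python
import math

def stopping_metrics(w0, wr, r, m, n, d=4, d_fp=16):
    """Return (precision gain Q, memory cost K) after r rank-1 increments."""
    q_gain = (d + math.log2(w0 / wr)) / d               # Q = (d + log2(w0 / wr)) / d
    k_cost = 1.0 + (d_fp * r * (m + n)) / (d * m * n)   # K = 1 + d_fp r (m+n) / (d m n)
    return q_gain, k_cost

def should_stop(q_gain, k_cost, slope, x=0.2, t_slope=1e-2):
    """True if any of the three termination conditions holds."""
    return q_gain <= k_cost or k_cost > 1.0 + x or slope < t_slope

# Worked example: the residual maximum has been halved after 32 increments
# on a hypothetical 4096 x 4096 layer quantized to 4 bits.
q, k = stopping_metrics(w0=1.0, wr=0.5, r=32, m=4096, n=4096)
print(q, k)                            # 1.25 1.0625
print(should_stop(q, k, slope=0.05))   # False: the gain still exceeds the cost
```

In this example one bit of precision has been gained (Q = 5/4) for a 6.25% memory overhead (K = 1.0625), so extraction would continue.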

4. Algorithmic Description and Complexity

The R1-FLR algorithm is summarized in the pseudocode below (adapted from (Gul et al., 9 Jan 2026), with prev_wr initialized to w0 so that the first slope test is well defined):

R ← W
w0 ← max(|R|)
prev_wr ← w0
W_L, W_R ← empty lists
for i = 1 to min(m, n):
    S ← RandomNormal(n, 1)
    P ← (R R^T)^{it} ⋅ R ⋅ S
    Q ← P / ‖P‖
    B ← Q^T ⋅ R
    σ ← ‖B‖
    u ← Q * σ
    v ← B / σ
    R ← R - u v
    wr ← max(|R|)
    Qgain ← (d + log₂(w0/wr))/d
    Kcost ← 1 + (d_fp * i * (m + n)) / (d * m * n)
    slope ← (prev_wr - wr) / prev_wr
    prev_wr ← wr
    if Qgain ≤ Kcost or Kcost > 1 + x or slope < t_slope:
        break
    append W_L ← u, W_R ← v
return W_L, W_R
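
A runnable NumPy rendering of the pseudocode above might look as follows. It is a sketch under stated assumptions rather than the FLRQ implementation: the function name `r1_flr`, the default `d_fp=16`, the `seed` argument, the initialization of `prev_wr` to `w0`, and the guards for an all-zero input or an exactly zero residual are additions made for illustration.

```python
import numpy as np

def r1_flr(W, d=4, d_fp=16, it=2, x=0.2, t_slope=1e-2, seed=0):
    """Adaptive rank-1 extraction; returns lists of left (m,) and right (n,) factors."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    R = np.array(W, dtype=np.float64)          # working residual (a copy of W)
    w0 = prev_wr = np.abs(R).max()             # assumption: prev_wr starts at w0
    W_L, W_R = [], []
    if w0 == 0.0:                              # assumption: nothing to extract
        return W_L, W_R
    for i in range(1, min(m, n) + 1):
        s = rng.standard_normal(n)
        p = R @ s
        for _ in range(it):                    # power iterations (R R^T)^it
            p = R @ (R.T @ p)
        q = p / np.linalg.norm(p)
        b = q @ R
        sigma = np.linalg.norm(b)
        u, v = q * sigma, b / sigma
        R -= np.outer(u, v)                    # deflate the residual
        wr = np.abs(R).max()
        k_cost = 1.0 + (d_fp * i * (m + n)) / (d * m * n)
        if wr == 0.0:                          # assumption: exact fit, keep it if affordable
            if k_cost <= 1.0 + x:
                W_L.append(u)
                W_R.append(v)
            break
        q_gain = (d + np.log2(w0 / wr)) / d
        slope = (prev_wr - wr) / prev_wr
        prev_wr = wr
        if q_gain <= k_cost or k_cost > 1.0 + x or slope < t_slope:
            break                              # the current increment is discarded
        W_L.append(u)
        W_R.append(v)
    return W_L, W_R

# Usage on a synthetic rank-8 matrix; the memory budget caps the rank that is kept.
rng = np.random.default_rng(1)
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 256))
W_L, W_R = r1_flr(W)
print(len(W_L))
```

As in the pseudocode, the stopping test runs before the append, so the increment that triggers termination is not stored.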

The total computational cost to extract $r$ increments is $O(r p m n)$, where $p = 2\,it + 2$ is the number of matrix–vector products (GEMV) per rank-1 extraction: one for $RS$, two per power iteration, and one for $Q^\top R$. For $r \ll \min(m,n)$ this is far below the $O(mn\min(m,n))$ cost of a full SVD, and the source reports it to be faster than RSVD as well. Working memory comprises $O(m n)$ for the matrix itself and $O(m + n)$ for workspace vectors (Gul et al., 9 Jan 2026).
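
As a rough illustration of these operation counts, ignoring constant factors and assuming a hypothetical square layer with $m = n = 4096$, $r = 40$ and $it = 2$ (so $p = 6$):

$r\,p\,m\,n = 40 \cdot 6 \cdot 4096^2 \approx 4.0 \times 10^{9}$, whereas $m\,n\,\min(m,n) = 4096^3 \approx 6.9 \times 10^{10}$,

a ratio of $4096 / 240 \approx 17$. These are order-of-magnitude counts only, not measured timings.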

5. Theoretical Guarantees and Error Bounds

R1-FLR inherits the theoretical error bounds of rank-1 RSVD, with the following spectral-norm guarantee:

$\mathbb{E}\bigl\|A - A_r\bigr\| \leq \sigma_{r+1} + \left[1 + 4\sqrt{2n/(r-1)}\right]^{1/(it+1)} \sigma_{r+1}$

where $\sigma_{r+1}$ is the $(r+1)$-st singular value and $it$ is the number of power iterations, making R1-FLR nearly optimal in spectral norm for well-decaying spectra and robust to slow singular value decay rates (Gul et al., 9 Jan 2026).

In the broader context of CUR-based approaches, the underlying rationale of "recycling" a tall sketch across iterations is theoretically justified. Extending the IterativeCUR framework to a pure rank-1 sketching regime (the R1-FLR limit) enables high-probability accuracy guarantees analogous to those derived via random projection inequalities (Pritchard et al., 26 Sep 2025).

6. Empirical Performance and Usage in Quantization

Empirical results in LLM post-training quantization demonstrate that R1-FLR, implemented in the FLRQ framework, achieves:

State-of-the-art quantization quality, with perplexity values matching or outperforming fixed-rank SVD-based methods at much lower average rank (e.g., average rank ≈ 40 vs. fixed rank 256 for 2-bit quantization) while adding only ≈ 0.3 bits per weight.
Algorithmic speed-ups: quantization time 30–50% lower than RSVD/SVD alternatives, with inference latency rising by only 4–6%.
Robust performance across varying group sizes and clipping regimes, and effectiveness even in the presence of layerwise variability and weight outliers (Gul et al., 9 Jan 2026).

The hyperparameters reported as effective include $it=2$, $t_{\text{slope}} \approx 10^{-2}$, memory threshold $x=0.2$, and activation group size 128 for clipping (Gul et al., 9 Jan 2026).

Precision	Model	Avg. rank	Extra bits per weight	PPL
W4A16	OPT-1.3B	30.5	0.34	14.65
W4A16	LLaMA2-7B	36.1	0.21	5.55
W3A16	OPT-1.3B	28.8	0.33	15.53
W3A16	LLaMA2-7B	35.8	0.21	5.88
W2A16	OPT-1.3B	27.6	0.33	22.99
W2A16	LLaMA2-7B	39.2	0.24	9.14
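
To give a feel for the overhead figures, the storage of the rank-1 factors can be converted to bits per weight with the memory-cost metric of Section 3, since $d\,(K - 1) = d_{fp}\, r\, (m+n) / (m n)$. The sketch below assumes that the overhead is counted this way and that factors are stored at `d_fp=16` bits; both are assumptions, as are the function name and the layer shape, and the example is not a reproduction of the table values.

```python
def extra_bits_per_weight(r, m, n, d_fp=16):
    """Bits per weight needed to store r rank-1 factor pairs for an m x n matrix."""
    return d_fp * r * (m + n) / (m * n)

# Hypothetical 4096 x 4096 layer with 32 rank-1 corrections.
print(extra_bits_per_weight(32, 4096, 4096))   # 0.25
```

Under these assumptions the overhead grows linearly with the rank and shrinks for larger or more rectangular layers, so a model-wide average depends on the actual layer shapes and per-layer ranks.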

7. Connections to Broader Rank-Adaptive Sketching Paradigms

R1-FLR exemplifies a general shift toward sketch-based adaptive algorithms for low-rank matrix approximation and numerical rank estimation. The methodology aligns with two-sided randomized sketching approaches for rank estimation, which use small random sketches to recover singular value structure and adapt rank thresholding in streaming or memory-limited regimes (Meier et al., 2021). It also represents the rank-1-sketch limit of blockwise recycled-sketch algorithms (e.g., IterativeCUR) that operate by incrementally updating both the approximation and its error proxy using only matrix–vector or matrix–small-matrix products and never requiring a full residual (Pritchard et al., 26 Sep 2025).

A plausible implication is that the R1-FLR strategy, while derived for quantization and LLM compression, forms a unifying framework for computationally adaptive low-rank approximation wherever per-matrix or per-layer variability in numerical rank is present or where memory and finite-precision quantization play a central role. The approach is amenable to further acceleration when paired with structured or fast-transform sketches, and its probabilistic error control suggests a broad range of robust, high-confidence applications in large-scale numerical linear algebra and machine learning.