R1-FLR: Rank1-Sketch Flexible Rank Selection
- R1-FLR is a randomized algorithmic framework that leverages repeated rank-1 Gaussian sketches for fine-grained, adaptive rank selection in low-rank matrix approximation.
- The method employs an iterative deflation process with dynamic stopping rules, balancing quantization error reduction against memory constraints for deep neural network weights.
- Empirical results demonstrate state-of-the-art quantization performance with lower computational overhead compared to traditional SVD or RSVD methods.
Rank1-Sketch-based Flexible Rank Selection (R1-FLR) is a randomized algorithmic framework for adaptive, fine-grained rank determination and low-rank matrix approximation, distinguished by its use of repeated rank-1 Gaussian sketches and thresholding schemes that enable computational efficiency and layer-wise adaptability. R1-FLR is principally designed for large-scale applications—such as post-training quantization of deep neural network weights—where traditional low-rank approximations (e.g., SVD or randomized SVD) are suboptimal due to computational cost and inability to tailor rank to the heterogeneity of matrix structure across layers. The method facilitates on-the-fly extraction of singular vectors and dynamic stopping, balancing quantization error reduction against memory constraints and computational complexity (Gul et al., 9 Jan 2026).
1. Motivation and Problem Definition
The motivation for R1-FLR arises from the limitations of fixed-rank low-rank approximation techniques in large models, notably LLMs. While classical SVD-based approaches—such as truncated SVD and randomized SVD—can enable low-rank decompositions, they incur substantial computational overhead when extended to per-layer, data-dependent rank selection. These methods generally require either a globally fixed compromise rank or expensive per-layer sweeps to optimally reduce quantization error.
R1-FLR seeks to decouple rank selection from an a priori guess by enabling rapid, layerwise discovery of the minimal rank $r$ needed for a weight matrix $W$, such that the rank-$r$ correction significantly improves quantization accuracy without violating model-size or memory constraints. Instead of blockwise or full-matrix sketching, R1-FLR iteratively applies a Gaussian-projected rank-1 sketch, repeatedly extracting the leading singular-vector direction and providing explicit, immediate stopping rules (Gul et al., 9 Jan 2026).
2. Methodological Foundations: The R1-Sketch Mechanism
At each iteration, a standard R1-FLR extraction proceeds as follows:
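- Draw a Gaussian vector $S \in \mathbb{R}^{n \times 1}$ with entries $S_j \sim \mathcal{N}(0, 1)$.
- Form $P = (R R^\top)^{it} R S$, where $it$ denotes the number of power iterations and $R$ is the current residual (initially $R = W$).
- Normalize: $Q = P / \|P\|_2$.
- Compute the sketch $B = Q^\top R$ (a row vector).
- Perform a rank-1 SVD: $\sigma = \|B\|_2$, $u = Q \sigma$, $v = B / \sigma$.
- The best rank-1 approximation is given by the product $u v$, with $u$ appended to $W_L$ and $v$ appended to $W_R$.
- Update $R \leftarrow R - u v$ and repeat as needed (Gul et al., 9 Jan 2026).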
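The extraction steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `r1_sketch_step`, the default of two power iterations, and the renormalisation inside the power loop (added only for numerical stability; it does not change the direction of the sketch) are assumptions.

```python
import numpy as np

def r1_sketch_step(R, n_power_iters=2, rng=None):
    """One rank-1 Gaussian-sketch extraction from the residual R (m x n).

    Returns (u, v, R_next) with R_next = R - outer(u, v).
    """
    rng = np.random.default_rng() if rng is None else rng
    s = rng.standard_normal(R.shape[1])   # Gaussian test vector, length n
    p = R @ s                             # initial sketch R s, length m
    for _ in range(n_power_iters):        # p <- (R R^T) p using two GEMVs
        p = R @ (R.T @ p)
        p /= np.linalg.norm(p)            # assumption: rescale to avoid overflow
    q = p / np.linalg.norm(p)             # unit-norm left direction
    b = q @ R                             # row sketch q^T R, length n
    sigma = np.linalg.norm(b)             # estimate of the leading singular value
    u = sigma * q                         # scaled left factor
    v = b / sigma                         # unit-norm right factor
    return u, v, R - np.outer(u, v)

# Hypothetical usage on a random 64 x 48 matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
u, v, R1 = r1_sketch_step(W, rng=rng)
```

Because `outer(u, v)` equals $Q Q^\top R$, the update is an orthogonal projection of the residual away from the sketched direction, so the Frobenius norm of the residual cannot increase from one step to the next.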
This sequential, deflationary process enables fine-grained rank selection. Unlike CUR approaches that use blockwise selection and a fixed small sketch matrix throughout, R1-FLR can be interpreted as a limiting case where a fresh rank-1 Gaussian sketch is used per increment, maximizing adaptability (Pritchard et al., 26 Sep 2025).
3. Outlier-Aware Rank Extraction and Stopping Criteria
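R1-FLR integrates an outlier-aware criterion to automate layerwise selection of the effective rank, directly tied to quantization noise and memory usage. After $i$ rank-1 increments:
- Compute the approximation $W \approx W_L W_R$, with residual $R = W - W_L W_R$.
- Quantize the residual $R$ in $d$ bits, with scaling set by its largest magnitude $w_r = \max|R|$.
- The worst-case error yields the "precision gain" metric $Q_{\text{gain}} = \left(d + \log_2(w_0 / w_r)\right) / d$, where $w_0 = \max|W|$.
- The "memory cost" metric $K_{\text{cost}} = 1 + \dfrac{d_{fp}\, i\, (m + n)}{d\, m\, n}$ compares the cost of storing $i$ rank-1 corrections (at $d_{fp}$ bits per entry) to $d$-bit quantization of the $m \times n$ matrix $W$.
- Terminate extraction when $Q_{\text{gain}} \le K_{\text{cost}}$ (precision gain is not worth extra memory), $K_{\text{cost}} > 1 + x$ (exceeds user-specified memory budget $x$), or the relative decrease in $w_r$ falls below threshold $t_{\text{slope}}$ (Gul et al., 9 Jan 2026).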
This procedure ensures that each increment is justified by a balance of quantization error reduction and storage cost, exploiting the layerwise variability of effective rank in practice.
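The stopping test can be illustrated with a small worked example that follows the `Qgain`, `Kcost`, and `slope` formulas from the pseudocode in Section 4. The function names and the default values `x=0.1` and `t_slope=0.01` are hypothetical placeholders chosen for illustration, not values reported by the authors.

```python
import math

def stopping_metrics(w0, wr, prev_wr, i, m, n, d=4, d_fp=16):
    """Return (q_gain, k_cost, slope) after i rank-1 increments."""
    q_gain = (d + math.log2(w0 / wr)) / d              # effective-precision ratio
    k_cost = 1 + (d_fp * i * (m + n)) / (d * m * n)    # relative storage cost
    slope = (prev_wr - wr) / prev_wr                   # relative drop in max |R|
    return q_gain, k_cost, slope

def should_stop(q_gain, k_cost, slope, x=0.1, t_slope=0.01):
    """True when any of the three termination conditions fires."""
    return q_gain <= k_cost or k_cost > 1 + x or slope < t_slope

# Hypothetical state: 1024 x 1024 layer, 8 increments, residual max halved.
q_gain, k_cost, slope = stopping_metrics(
    w0=1.0, wr=0.5, prev_wr=0.6, i=8, m=1024, n=1024
)
stop = should_stop(q_gain, k_cost, slope)
# q_gain = 1.25, k_cost = 1.0625, slope ~ 0.167, so stop is False
```

In this example, halving the largest residual entry is worth one extra bit of effective precision (a 25% gain at 4 bits), while eight rank-1 factors cost only 6.25% extra storage, so extraction continues.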
4. Algorithmic Description and Complexity
The R1-FLR algorithm is summarized in the pseudocode below (verbatim from (Gul et al., 9 Jan 2026)):
```
R ← W
w0 ← max(|R|)
W_L, W_R ← empty lists
for i = 1 to min(m, n):
    S ← RandomNormal(n, 1)
    P ← (R R^T)^{it} ⋅ R ⋅ S
    Q ← P / ‖P‖
    B ← Q^T ⋅ R
    σ ← ‖B‖
    u ← Q * σ
    v ← B / σ
    R ← R - u v
    wr ← max(|R|)
    Qgain ← (d + log₂(w0/wr))/d
    Kcost ← 1 + (d_fp * i * (m + n)) / (d * m * n)
    slope ← (prev_wr - wr) / prev_wr
    prev_wr ← wr
    if Qgain ≤ Kcost or Kcost > 1 + x or slope < t_slope:
        break
    append W_L ← u, W_R ← v
return W_L, W_R
```
The total computational cost to extract $r$ increments is $O(r \cdot k \cdot mn)$, where $k$ is the number of GEMV (matrix–vector products) per rank-1 extraction, each costing $O(mn)$. This is several times faster than SVD or RSVD for $r \ll \min(m, n)$. Working memory comprises $O(mn)$ for the matrix itself and $O(m + n)$ for workspace vectors (Gul et al., 9 Jan 2026).
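For readers who want to experiment, the pseudocode can be transcribed into NumPy roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the function name `r1_flr`, the defaults for `d`, `d_fp`, `x`, `t_slope`, and `n_power_iters`, the initialisation `prev_wr = w0` (the pseudocode leaves `prev_wr` undefined on the first pass), and the renormalisation inside the power loop are all illustrative choices.

```python
import numpy as np

def r1_flr(W, d=4, d_fp=16, x=0.1, t_slope=0.01, n_power_iters=2, seed=0):
    """Adaptive rank-1 extraction; returns factors L (m x r) and Rt (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    R = np.array(W, dtype=np.float64)       # working copy of the residual
    w0 = np.abs(R).max()
    prev_wr = w0                            # assumption: start slope from w0
    W_L, W_R = [], []
    for i in range(1, min(m, n) + 1):
        s = rng.standard_normal(n)
        p = R @ s
        for _ in range(n_power_iters):      # (R R^T)^it applied via GEMVs
            p = R @ (R.T @ p)
            p /= np.linalg.norm(p)          # assumption: rescale for stability
        q = p / np.linalg.norm(p)
        b = q @ R
        sigma = np.linalg.norm(b)
        u, v = sigma * q, b / sigma
        R -= np.outer(u, v)                 # deflate the residual
        wr = np.abs(R).max()
        q_gain = (d + np.log2(w0 / wr)) / d
        k_cost = 1 + (d_fp * i * (m + n)) / (d * m * n)
        slope = (prev_wr - wr) / prev_wr
        prev_wr = wr
        if q_gain <= k_cost or k_cost > 1 + x or slope < t_slope:
            break                           # as in the pseudocode: last pair is dropped
        W_L.append(u)
        W_R.append(v)
    if not W_L:
        return np.zeros((m, 0)), np.zeros((0, n))
    return np.column_stack(W_L), np.vstack(W_R)

# Hypothetical test matrix: one strong rank-1 component plus small noise.
rng = np.random.default_rng(1)
a, b = rng.standard_normal(128), rng.standard_normal(96)
W = 5.0 * np.outer(a, b) + 0.1 * rng.standard_normal((128, 96))
L, Rt = r1_flr(W)
max_err = np.abs(W - L @ Rt).max()
```

Each pass uses `2 * n_power_iters + 2` matrix–vector products plus one rank-1 update, which is where the per-increment GEMV count in the complexity estimate comes from. The sketch does not guard against an exactly zero residual.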
5. Theoretical Guarantees and Error Bounds
R1-FLR inherits the theoretical error bounds of rank-1 RSVD. The spectral-norm error of the rank-$r$ approximation is bounded by a multiple of $\sigma_{r+1}$, where $\sigma_{r+1}$ is the $(r+1)$-st singular value, and the multiplicative factor tightens as the number of power iterations $it$ grows. This makes R1-FLR nearly optimal in spectral norm for well-decaying spectra and robust to slow singular value decay rates (Gul et al., 9 Jan 2026).
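The role of power iterations in the spectral-norm guarantee can be explored numerically. The sketch below builds a synthetic matrix with known singular values and measures the spectral error of a single rank-1 sketch for several iteration counts. The helper name `rank1_sketch_error`, the geometric decay rate 0.8, and the matrix size are assumptions made purely for illustration.

```python
import numpy as np

def rank1_sketch_error(A, n_power_iters, rng):
    """Spectral norm of A - q q^T A for one Gaussian rank-1 sketch."""
    s = rng.standard_normal(A.shape[1])
    p = A @ s
    for _ in range(n_power_iters):
        p = A @ (A.T @ p)
        p /= np.linalg.norm(p)              # rescale to avoid underflow
    q = p / np.linalg.norm(p)
    residual = A - np.outer(q, q @ A)
    return np.linalg.norm(residual, 2)      # largest singular value of the residual

rng = np.random.default_rng(0)
# Synthetic 80 x 60 matrix with singular values 0.8^0, 0.8^1, ...
U, _ = np.linalg.qr(rng.standard_normal((80, 60)))
V, _ = np.linalg.qr(rng.standard_normal((60, 60)))
sing = 0.8 ** np.arange(60)
A = (U * sing) @ V.T
sigma1, sigma2 = sing[0], sing[1]

errors = {it: rank1_sketch_error(A, it, rng) for it in (0, 1, 2, 4)}
for it, err in errors.items():
    print(f"it={it}: error / sigma_2 = {err / sigma2:.4f}")
```

By the Eckart–Young theorem no rank-1 approximation can achieve a spectral error below $\sigma_2$, so every printed ratio is at least 1. With more power iterations the ratio would typically be expected to move closer to 1, although a single random draw need not be monotone.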
In the broader context of CUR-based approaches, the underlying rationale of "recycling" a tall sketch across iterations is theoretically justified. Extending the IterativeCUR framework to a pure rank-1 sketching regime (the R1-FLR limit) enables high-probability accuracy guarantees analogous to those derived via random projection inequalities (Pritchard et al., 26 Sep 2025).
6. Empirical Performance and Usage in Quantization
Empirical results in LLM post-training quantization demonstrate that R1-FLR, implemented in the FLRQ framework, achieves:
- State-of-the-art quantization quality, with perplexity values matching or outperforming fixed-rank SVD-based methods at much lower average rank (e.g., average rank ≈ 40 vs. fixed rank 256 for 2-bit quantization) while adding only ≈ 0.3 bits per weight.
- Algorithmic speed-ups: quantization time 30–50% lower than RSVD/SVD alternatives, with inference latency rising by only 4–6%.
- Robust performance across varying group sizes and clipping regimes, and effectiveness even in the presence of layerwise variability and weight outliers (Gul et al., 9 Jan 2026).
A subset of empirical hyperparameters reported as effective are: , , memory threshold , and activation group size 128 for clipping (Gul et al., 9 Jan 2026).
| Precision | Model | Avg. rank | Extra bits per weight | Perplexity (PPL) |
|---|---|---|---|---|
| W4A16 | OPT-1.3B | 30.5 | 0.34 | 14.65 |
| W4A16 | LLaMA2-7B | 36.1 | 0.21 | 5.55 |
| W3A16 | OPT-1.3B | 28.8 | 0.33 | 15.53 |
| W3A16 | LLaMA2-7B | 35.8 | 0.21 | 5.88 |
| W2A16 | OPT-1.3B | 27.6 | 0.33 | 22.99 |
| W2A16 | LLaMA2-7B | 39.2 | 0.24 | 9.14 |
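The order of magnitude of the extra-bits figures can be sanity-checked from the memory-cost formula: storing `rank` pairs of factors at `d_fp` bits adds `d_fp * rank * (m + n) / (m * n)` bits per weight, which equals `d * (Kcost - 1)`. The example below assumes a single hypothetical 4096 x 4096 layer with 16-bit factors. Real models mix layer shapes and per-layer ranks, so this is a rough check and does not reproduce the table.

```python
def extra_bits_per_weight(rank, m, n, d_fp=16):
    """Bits per weight added by `rank` rank-1 corrections stored in d_fp bits."""
    return d_fp * rank * (m + n) / (m * n)

# Hypothetical layer: 4096 x 4096 weights, average rank 40, fp16 factors.
bits = extra_bits_per_weight(rank=40, m=4096, n=4096)
print(bits)  # 0.3125

# Cross-check against the memory-cost metric for 2-bit quantization (d = 2).
d = 2
k_cost = 1 + (16 * 40 * (4096 + 4096)) / (d * 4096 * 4096)
assert abs(d * (k_cost - 1) - bits) < 1e-12
```

Under these assumptions the overhead lands close to the roughly 0.3 bits per weight quoted above, whereas a fixed rank of 256 on the same layer would cost 2 extra bits per weight.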
7. Connections to Broader Rank-Adaptive Sketching Paradigms
R1-FLR exemplifies a general shift toward sketch-based adaptive algorithms for low-rank matrix approximation and numerical rank estimation. The methodology aligns with two-sided randomized sketching approaches for rank estimation, which use small random sketches to recover singular value structure and adapt rank thresholding in streaming or memory-limited regimes (Meier et al., 2021). It also represents the rank-1-sketch limit of blockwise recycled-sketch algorithms (e.g., IterativeCUR) that operate by incrementally updating both the approximation and its error proxy using only matrix–vector or matrix–small-matrix products and never requiring a full residual (Pritchard et al., 26 Sep 2025).
A plausible implication is that the R1-FLR strategy, while derived for quantization and LLM compression, offers a unifying framework for computationally adaptive low-rank approximation wherever numerical rank varies per matrix or per layer, or where memory and finite-precision quantization play a central role. The approach is amenable to further acceleration when paired with structured or fast-transform sketches, and its probabilistic error control suggests a broad range of robust, high-confidence applications in large-scale numerical linear algebra and machine learning.