
LSR: Linearized Subspace Refinement

Updated 30 March 2026
  • LSR is a universal, architecture-agnostic framework that refines neural network predictions using a low-dimensional subspace derived from the Jacobian.
  • It constructs a reduced least-squares problem via randomized range finding and SVD, enabling one-shot or iterative accuracy improvements without altering the underlying architecture.
  • Empirical results demonstrate order-of-magnitude error reductions in supervised, operator, and physics-informed neural learning compared to conventional gradient methods.

Linearized Subspace Refinement (LSR) is a universal, architecture-agnostic framework designed for refining neural network predictions beyond the limits typically achieved by gradient-based optimization. LSR leverages the linearized residual model induced by the Jacobian at a fixed trained parameter state and solves a reduced least-squares problem within a data-driven low-rank subspace. This approach provides a tractable and numerically stable mechanism for substantial post-training or in-training accuracy improvement across supervised learning, operator learning, and physics-informed neural operator fine-tuning, without altering model architectures or loss formulations (Cao et al., 20 Jan 2026).

1. Problem Formulation and Core Methodology

LSR operates on a generic neural network predictor $q(\theta, x)\in\mathbb{R}^d$ with a parameter vector $\theta\in\mathbb{R}^m$ and a residual vector $f(\theta)\in\mathbb{R}^n$, where the training objective is to minimize the squared norm $L(\theta) = \frac{1}{2}\|f(\theta)\|_2^2$. Typical choices for $f(\theta)$ encompass residuals for supervised learning, operator learning, or physics-informed learning.

At a pretrained state θ0\theta_0, the first-order Taylor expansion provides

f(θ0+δ)f(θ0)+G0δf(\theta_0+\delta) \approx f(\theta_0) + G_0\,\delta

with $G_0 = \left.\frac{\partial f}{\partial \theta}\right|_{\theta_0}$. Direct solution of the full least-squares problem,

$$\delta^* = \arg\min_{\delta\in\mathbb{R}^m} \|f(\theta_0) + G_0\,\delta\|_2^2,$$

is intractable for large $m$. LSR addresses this by restricting $\delta$ to a low-dimensional subspace: $\delta = V\,y$, where $V\in\mathbb{R}^{m\times r}$ with $V^TV=I_r$ and $r\ll m$. The reduced problem is

$$y^* = \arg\min_{y\in\mathbb{R}^r} \|f(\theta_0) + G_0 V y\|_2^2,$$

with the refined predictor given by

$$q_\text{LSR}(x) = q(\theta_0, x) + J_0(x)\,\delta^*,$$

where $J_0(x) = \left.\frac{\partial q}{\partial\theta}\right|_{\theta_0, x}$.
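The reduced solve and the lifted correction above can be sketched in a few lines of NumPy. All quantities below are illustrative stand-ins: a random $G_0$ and $f_0$, and a random orthonormal $V$ in place of the SVD-derived subspace described in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the quantities in the text (illustrative sizes only):
n, m, r = 200, 50, 5                              # residuals, parameters, rank
G0 = rng.standard_normal((n, m))                  # residual Jacobian at theta_0
f0 = rng.standard_normal(n)                       # residual vector f(theta_0)
V, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthonormal subspace basis

# Reduced least-squares problem: y* = argmin_y ||f0 + G0 V y||^2
A = G0 @ V                                        # n x r reduced system
y_star, *_ = np.linalg.lstsq(A, -f0, rcond=None)
delta_star = V @ y_star                           # lifted correction in parameter space

# The refined residual norm never exceeds the original one (y = 0 is feasible)
assert np.linalg.norm(f0 + G0 @ delta_star) <= np.linalg.norm(f0)
```

Because $y=0$ is always feasible, the subspace correction can only reduce the linearized residual, whatever basis $V$ is chosen.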

2. Subspace Construction and Linear Residual Modeling

To construct the subspace $V$, LSR employs a randomized range-finding strategy based on the network-output Jacobian $J_0 = \frac{\partial q}{\partial\theta}$. The process involves:

  • Drawing a Gaussian random matrix $\Omega\in\mathbb{R}^{m\times(r+p)}$, with $p$ a small oversampling parameter,
  • Computing $Y = J_0\Omega$ using Jacobian-vector products,
  • QR-factorizing $Y = QR$,
  • Computing an SVD of the reduced matrix $B = Q^TJ_0$ to extract the dominant right singular vectors,
  • Taking the first $r$ right singular vectors as the columns of $V$.

Restricting to this subspace yields a reduced least-squares problem in $r$ dimensions, efficiently solvable via direct methods (thin QR or normal equations). For typical $r\sim 10^2\text{–}10^3$, this procedure enables order-of-magnitude improvements in empirical accuracy.
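The five bullet points above follow the standard randomized range finder. A minimal NumPy sketch, using an explicit low-rank matrix as a stand-in for the matrix-free Jacobian (in practice $Y = J_0\Omega$ would be formed by $r+p$ Jacobian-vector products):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 300, 80                 # outputs x parameters (toy sizes)
# Low-rank stand-in for the network-output Jacobian J_0:
J0 = rng.standard_normal((n, 12)) @ rng.standard_normal((12, m))
r, p = 8, 4                    # target rank and oversampling

# Randomized range finding followed by a small SVD:
Omega = rng.standard_normal((m, r + p))   # Gaussian test matrix
Y = J0 @ Omega                            # in practice: r+p Jacobian-vector products
Q, _ = np.linalg.qr(Y)                    # orthonormal range approximation
B = Q.T @ J0                              # (r+p) x m reduced matrix
_, _, Vt = np.linalg.svd(B, full_matrices=False)
V = Vt[:r].T                              # m x r: dominant right singular vectors

assert np.allclose(V.T @ V, np.eye(r), atol=1e-8)   # orthonormal columns
```

Only the small matrices $Y$, $Q$, and $B$ are ever formed explicitly, which is what makes the construction tractable for large parameter counts $m$.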

The underlying assumption is that most actionable directions for residual minimization lie in a tractable, low-rank local subspace of parameter perturbations, unlocking accuracy that standard gradient-based training cannot reach due to ill-conditioning.

3. One-Shot and Iterative LSR Algorithms

The LSR methodology encompasses two core algorithmic modes:

  • One-Shot LSR: Performed as a post-processing step after standard training has converged. The algorithm consists of subspace identification, construction of a reduced system (using $G_0V$), and solution via direct linear algebra to deliver a refined linear predictor.
  • Iterative LSR: Designed for composite or operator-constrained objectives, such as those arising in PDE-constrained learning. The procedure alternates between one-shot LSR subspace corrections and supervised nonlinear alignment using standard optimizers (e.g., L-BFGS or Adam) to minimize alignment losses. This approach is particularly effective for physics-informed learning where residual minimization is combined with boundary and physical constraints.

High-level pseudocode for both one-shot and iterative variants explicitly specifies the sequence of subspace construction (randomized SVD, QR), system assembly (batching, Jacobian-vector products), and least-squares solution, along with recommended practical choices for the rank $r$ and oversampling parameter $p$ (Cao et al., 20 Jan 2026).
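The alternating structure of the iterative variant can be sketched on a toy nonlinear least-squares problem. Everything here is a stand-in: the residual and Jacobian are synthetic, plain gradient steps replace L-BFGS/Adam for the alignment phase, and a random orthonormal basis replaces the Jacobian-derived subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 40, 120, 6
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

def residual(theta):
    """Mildly nonlinear toy residual f(theta)."""
    return A @ theta + 0.1 * np.sin(theta).sum() * np.ones(n) - b

def jacobian(theta):
    """Analytic Jacobian of the toy residual."""
    return A + 0.1 * np.outer(np.ones(n), np.cos(theta))

theta = np.zeros(m)
for _ in range(5):
    # (1) One-shot LSR correction within a (here: random) orthonormal subspace
    f0, G0 = residual(theta), jacobian(theta)
    V, _ = np.linalg.qr(rng.standard_normal((m, r)))
    y, *_ = np.linalg.lstsq(G0 @ V, -f0, rcond=None)
    theta = theta + V @ y
    # (2) Nonlinear "alignment": a few plain gradient steps on 0.5*||f||^2
    for _ in range(20):
        theta -= 1e-3 * jacobian(theta).T @ residual(theta)

assert np.linalg.norm(residual(theta)) < np.linalg.norm(residual(np.zeros(m)))
```

The loop body mirrors the described alternation: a direct subspace solve on the current linearization, followed by first-order steps that re-align the nonlinear model.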

4. Numerical Conditioning and Trade-Offs Versus Gradient Training

Within convex quadratic regimes, the convergence rate of gradient or quasi-Newton methods is governed by the condition number of the Gauss–Newton Hessian $H = G_0^T G_0$. Ill-conditioning (large $\kappa(H)$) provokes slow or stalled convergence, frequently causing early plateaus in loss minimization.
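This effect is easy to reproduce on a synthetic ill-conditioned linear least-squares problem, where first-order iteration stalls far above the accuracy a direct solve reaches (all sizes and matrices below are illustrative):

```python
import numpy as np

# Ill-conditioned least-squares toy: gradient descent stalls while a
# direct solve reaches (near) machine precision.
n = 50
s = np.logspace(0, -6, n)            # singular values spanning 6 orders
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = U * s                            # columns scaled: kappa(G^T G) ~ 1e12
f0 = rng.standard_normal(n)

# Gradient descent on L = 0.5*||f0 + G d||^2, step = 1 / largest eigenvalue
d = np.zeros(n)
lr = 1.0 / s.max() ** 2
for _ in range(1000):
    d -= lr * G.T @ (f0 + G @ d)
gd_loss = 0.5 * np.linalg.norm(f0 + G @ d) ** 2

# Direct least-squares solve of the same linear system
d_star, *_ = np.linalg.lstsq(G, -f0, rcond=None)
direct_loss = 0.5 * np.linalg.norm(f0 + G @ d_star) ** 2

assert direct_loss < 1e-6 * gd_loss  # direct solve is vastly more accurate
```

Directions with small singular values contract at rate $(1 - s_i^2/s_{\max}^2)$ per step, so after 1000 iterations the gradient method has barely touched them, while the direct solve resolves all directions at once.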

Empirical results indicate that for supervised function fitting with modern neural networks, standard optimizers such as Adam can stall at MSE levels orders of magnitude above the minimum attainable by a direct reduced-subspace solve. One-shot LSR drives the loss to machine precision in these cases, while even iterative solvers applied to the full linearized system plateau due to ill-conditioning. For operator learning and PINN fine-tuning, increasing $r$ improves the LSR loss up to a threshold, beyond which numerical errors and subspace ill-conditioning dominate.

The practical selection of the subspace rank $r$ is guided by monotonic improvement in residual loss: the rank is increased until no further improvement is observed.
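A sketch of this monitoring strategy on a synthetic rank-deficient Jacobian (the rank increment of 10 and the 0.999 improvement threshold are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 400, 120, 30
# Synthetic Jacobian with numerical rank k: beyond rank k there is
# nothing left for the subspace correction to exploit.
J = (rng.standard_normal((n, k)) * np.logspace(0, -4, k)) @ rng.standard_normal((k, m))
f0 = rng.standard_normal(n)
_, _, Vt = np.linalg.svd(J, full_matrices=False)

def lsr_loss(r):
    """Residual norm after an LSR correction restricted to rank r."""
    V = Vt[:r].T
    y, *_ = np.linalg.lstsq(J @ V, -f0, rcond=None)
    return np.linalg.norm(f0 + J @ V @ y)

# Grow the rank while the refinement loss keeps improving noticeably.
r, best = 10, lsr_loss(10)
while r < m:
    trial = lsr_loss(r + 10)
    if trial > 0.999 * best:       # no meaningful improvement: stop
        break
    r, best = r + 10, trial

assert r == k   # growth stops once the useful rank is exhausted
```

Because the candidate subspaces are nested, the loss is non-increasing in $r$; the loop simply detects where the gains flatten out.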

5. Empirical Performance Across Applications

Experimental results in multiple regimes demonstrate the broad efficacy of LSR:

  • Supervised Function Approximation: On a 2D sine target, Adam converges to test MSE $\sim10^{-6}$, whereas one-shot LSR (with $r=1000$) reduces this to $\sim10^{-12}$. From random initialization, LSR still surpasses optimizer plateaus.
  • Operator Learning (1D Burgers equation): Across DeepONet and MultiONet architectures, median test-error reduction factors post-LSR range from $12\times$ to $236\times$, depending on nonlinearity choices.
  • Physics-Informed Fine-Tuning: For 300 test instances each, error reduction ranges from $8\times$ (advection) to $240\times$ (linear ODE).
  • Iterative LSR in PDE Solving: Combining LSR and nonlinear alignment accelerates convergence by $>10\times$ compared to standard PINN or TSONN approaches, with alternating steps effectively neutralizing both high- and low-frequency errors.
  • Classification (MNIST): On a $110$k-parameter CNN with random initialization, LSR reduces test error from $90\%$ to $4\%$ with a single application at $r\approx1000$.

A summary table of physics-informed fine-tuning results is below (averaging over 300 instances):

| Equation | Baseline Test Error | Post-LSR Test Error | Error Reduction |
|---|---|---|---|
| Linear ODE | $3.7\times10^{-3}$ | $1.5\times10^{-5}$ | $240\times$ |
| Reaction–Diffusion | $5.6\times10^{-3}$ | $5.7\times10^{-4}$ | $10\times$ |
| Burgers | $1.2\times10^{-2}$ | $3.9\times10^{-4}$ | $30\times$ |
| Advection | $2.2\times10^{-2}$ | $3.1\times10^{-3}$ | $8\times$ |

6. Computational Aspects and Integration Guidelines

  • Complexity: Subspace construction requires $O((r+p)\cdot\mathrm{Cost(JVP)})$ for $Y=J_0\Omega$ and $O(m(r+p)^2)$ for the small QR+SVD. The reduced-system solve costs $O(nr^2+r^3)$, dominated by assembling $G_0V$ for large $n$.
  • Memory: Storing $V$ uses $O(mr)$ memory; batched LSR can reduce peak usage by iterating over the data.
  • Rank Selection: Monitor the refinement loss as a function of $r$ and select the highest rank before numerical issues arise; typical practical values are $r = 100\text{–}2000$.
  • Pipeline Integration: One-shot LSR can be applied post-convergence of standard neural network training, while Iterative LSR is compatible with operator-constrained and PDE-driven training by alternating with supervised alignment steps.
  • Implementation: Automatic differentiation frameworks (e.g., PyTorch, TensorFlow) are leveraged for Jacobian-vector products; standard linear algebra (LAPACK, NumPy) is used for small QR/SVD solves.
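The batching point above can be sketched as follows: the normal equations of $\|f_0 + G_0 V y\|_2^2$ are accumulated over row batches of $G_0 V$, so the full $n \times r$ reduced matrix never has to be materialized at once (random stand-in matrices; the batch size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, r, batch = 10_000, 64, 8, 1_000
G0 = rng.standard_normal((n, m))          # stand-in for the residual Jacobian
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
f0 = rng.standard_normal(n)

# Accumulate the normal equations (G0 V)^T (G0 V) y = (G0 V)^T (-f0)
# batch by batch, keeping only one batch of rows in memory at a time.
AtA = np.zeros((r, r))
Atb = np.zeros(r)
for start in range(0, n, batch):
    A_b = G0[start:start + batch] @ V     # one batch of rows of G0 V
    AtA += A_b.T @ A_b
    Atb += A_b.T @ (-f0[start:start + batch])
y_batched = np.linalg.solve(AtA, Atb)

# Matches the all-at-once reduced least-squares solution
y_full, *_ = np.linalg.lstsq(G0 @ V, -f0, rcond=None)
assert np.allclose(y_batched, y_full, atol=1e-6)
```

Normal equations square the condition number of the reduced system, so for larger $r$ the thin-QR route mentioned in Section 2 is the numerically safer batch-free alternative.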

7. Theoretical Limitations and Research Directions

LSR yields a nonzero correction only when the pretrained parameters are not already a stationary point of the residual objective (i.e., $G_0^T f_0 \ne 0$). In practice, exact stationarity is rarely attained, so LSR typically delivers tangible gains.

Increasing the subspace rank $r$ induces severe ill-conditioning beyond a problem-specific threshold. LSR fundamentally operates only on the local linearized residual, without altering the global nonlinear characteristics of the underlying network. Naïvely using the LSR correction $\delta^*$ as a full Gauss–Newton parameter update breaks down outside the linear regime; LSR is defined only as a mechanism for linear predictor refinement at a fixed $\theta_0$.

Additional limitations include susceptibility to overfitting in low-sample or noisy regimes (necessitating cross-validation or regularization), and its current applicability is restricted to least-squares or similar objectives.

Open research directions include dynamic/adaptive rank selection, incorporation of damping or $\ell_2$ regularization in the subspace solve (e.g., Levenberg–Marquardt), integration with learned preconditioners, characterization of deep network Jacobian spectra, and extension to general classification losses beyond least-squares.
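One of these directions, damping the subspace solve, can be sketched directly: a Levenberg–Marquardt-style $\ell_2$ penalty on the reduced problem is equivalent to a stacked least-squares system (the matrix and damping value below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 100, 20
A = rng.standard_normal((n, r)) * np.logspace(0, -6, r)  # ill-conditioned G0 V
f0 = rng.standard_normal(n)

def damped_solve(lmbda):
    """min_y ||f0 + A y||^2 + lmbda*||y||^2 via the equivalent stacked system
    [A; sqrt(lmbda) I] y = [-f0; 0] (Levenberg-Marquardt-style damping)."""
    A_aug = np.vstack([A, np.sqrt(lmbda) * np.eye(r)])
    b_aug = np.concatenate([-f0, np.zeros(r)])
    y, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return y

# Damping shrinks the correction along poorly resolved directions
y_damped = damped_solve(1e-4)
y_plain = damped_solve(0.0)
assert np.linalg.norm(y_damped) < np.linalg.norm(y_plain)
```

The stacked formulation avoids forming $A^TA + \lambda I$ explicitly and so does not square the condition number of the damped system.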

Summary

Linearized Subspace Refinement addresses the numerical ill-conditioning that limits the effectiveness of gradient-based neural network training. By combining randomized subspace construction with direct reduced least-squares solvers, LSR reliably extracts the accuracy attainable within a convex local linearization, providing measurable and frequently order-of-magnitude reductions in function and operator error while preserving compatibility with standard architectures and training pipelines (Cao et al., 20 Jan 2026).
