Matrix-Free Optimization Framework

Updated 7 May 2026

A matrix-free optimization framework is a paradigm that avoids global matrix assembly by using fast operator oracles, improving memory efficiency and scalability.
It combines evaluation oracles, operator graphs, and fused kernels with iterative solvers to run high-throughput computations in imaging, PDEs, and machine learning.
Reported gains include speedups of up to 150× and memory reductions of up to 90%, at the cost of specialized preconditioning and algorithmic trade-offs.

A matrix-free optimization framework is an architectural and computational paradigm in which all key linear-algebraic operations—matrix–vector products, preconditioning, projection, and solvers for optimality conditions—are implemented without materializing large dense or sparse matrices. Instead, matrix–vector products and other operator actions are defined by fast evaluation oracles, code graphs, or kernel routines, dramatically reducing memory consumption and enabling high-throughput computations on modern parallel or distributed architectures. This approach is motivated by domain instances—imaging, PDE-constrained optimization, nonlinear least squares, mixed-integer programming, and machine learning—where operator structure, problem size, or hardware limitations render explicit matrix assembly infeasible or undesirable.

1. Core Principles and Motivation

Matrix-free frameworks address the prohibitive cost, in both memory and computational throughput, of explicitly assembling and storing large linear operators or system Jacobians. The primary objectives are:

Avoid global matrix storage: Only local or blockwise information is ever stored; global matrices such as Jacobians, Hessians, normal equations, or Kronecker products are never built.
Operator-oriented interface: All computations delegate to forward–adjoint operator oracles, graph compositions of linear operators, or highly-fused GPU kernels.
High arithmetic intensity: By fusing local computations (residual, Jacobian–vector products, preconditioning), the ratio of floating-point operations to memory access is increased, maximizing hardware efficiency.
Scalability: Enables optimization over millions or billions of parameters/residuals (e.g., imaging, PDE control, graph optimization, topology design).

The approach is particularly essential in the following domains:

Large-scale nonlinear least squares in graphics/imaging (DeVito et al., 2016)
PDE-constrained optimization on spectral/FEM grids (Marin et al., 2018)
Large-scale mixed-integer programming (Mexi et al., 2023)
Matrix-structured online/adaptive optimization (Kovalev, 2 Apr 2026, Cutkosky et al., 2019)
Topology optimization with 3D finite elements (Yang et al., 20 Apr 2026)
High-dimensional convex optimization with structured transforms (Diamond et al., 2015)
Second-order methods with low-rank Hessian approximations (Frantar et al., 2021)

2. Unified Matrix-Free Operator Workflow

All matrix-free optimization frameworks share methodological elements at the operator and algorithmic level:

2.1 Operator Definitions:

Fast operator representations replace stored matrices. Examples include:
- Forward–adjoint oracles for matrix–vector multiplication (MATVEC, ADJMATVEC) (Diamond et al., 2015, Marin et al., 2018).
- DAG-based operator graphs preserving fast transforms (FFT, convolution, Kronecker, or elementwise mapping) (Diamond et al., 2015, Marin et al., 2018).
- Local kernel blocks in graphics or topology settings (DeVito et al., 2016, Yang et al., 20 Apr 2026).
- Rank-one decompositions (e.g., empirical Fisher in deep learning) (Frantar et al., 2021).
In all cases, the global matrix is never explicitly realized.
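As an illustration of a forward–adjoint oracle pair, the following minimal sketch implements circular convolution through the FFT and checks the adjoint with a dot-product test. The function name `make_circular_conv`, the kernel values, and the signal length are assumptions chosen for the example; they do not come from any of the cited frameworks.

```python
import numpy as np

def make_circular_conv(kernel, n):
    """Return (matvec, adjmatvec) closures for circular convolution on length-n signals."""
    k_hat = np.fft.fft(kernel, n)  # the kernel is zero-padded to length n

    def matvec(x):
        # Forward action: multiply spectra, never build the n-by-n circulant matrix.
        return np.real(np.fft.ifft(k_hat * np.fft.fft(x)))

    def adjmatvec(y):
        # Adjoint action: circular correlation, i.e. the conjugate spectrum.
        return np.real(np.fft.ifft(np.conj(k_hat) * np.fft.fft(y)))

    return matvec, adjmatvec

rng = np.random.default_rng(0)
n = 64
kernel = np.array([0.25, 0.5, 0.25])  # hypothetical smoothing kernel
matvec, adjmatvec = make_circular_conv(kernel, n)

# Dot-product test: <A x, y> should equal <x, A^T y> up to rounding.
x, y = rng.standard_normal(n), rng.standard_normal(n)
adjoint_gap = abs(np.dot(matvec(x), y) - np.dot(x, adjmatvec(y)))

# Applying the operator to a unit impulse returns the (padded) kernel.
impulse = np.zeros(n)
impulse[0] = 1.0
impulse_response = matvec(impulse)
```

The dot-product test is a cheap way to catch a mismatched adjoint, which would otherwise show up only as stalled or divergent solver iterations.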

2.2 Matrix-Free Iterative Solvers:

All Krylov-type, first-order, or projection methods delegate to repeated operator products:
- PCG and GMRES require only applications of $A$ (plus the preconditioner); least-squares variants such as LSQR or CGNR additionally apply $A^T$.
- Interior-point or ADMM steps solve Newton systems or primal–dual updates using only operator products (Fountoulakis et al., 2012, Adil et al., 2022).
- Projection-free or preconditioned SGD updates bypass projections by operator-splitting or FTRL-based regularizers (Kovalev, 2 Apr 2026).
Preconditioners are also constructed in a matrix-free manner: diagonals via fast trace estimation, block-Jacobi from operator probes, or analytic expressions in block or banded contexts (DeVito et al., 2016, Yang et al., 20 Apr 2026, Miniguano-Trujillo et al., 2024).
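The solver pattern described above can be sketched as a preconditioned conjugate gradient loop that only ever calls a `matvec` closure, with a Jacobi preconditioner recovered by probing the operator. This is a minimal sketch under stated assumptions: the operator is a hypothetical tridiagonal stencil, so two probe vectors (even and odd indices) recover its diagonal exactly. The names `apply_A`, `probe_diagonal`, and `pcg` are illustrative only.

```python
import numpy as np

def apply_A(x, shift):
    """Matrix-free action of tridiag(-1, 2 + shift_i, -1) on x."""
    y = (2.0 + shift) * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

def probe_diagonal(matvec, n):
    """Recover the diagonal of a tridiagonal operator from two probe vectors."""
    diag = np.zeros(n)
    for color in (0, 1):
        v = np.zeros(n)
        v[color::2] = 1.0                      # neighbours of a probed index are zero
        diag[color::2] = matvec(v)[color::2]
    return diag

def pcg(matvec, b, inv_diag, tol=1e-10, max_iter=500):
    """Jacobi-preconditioned CG; the operator is touched only through matvec."""
    x = np.zeros_like(b)
    r = b.copy()                               # residual for the zero initial guess
    z = inv_diag * r
    p = z.copy()
    rz = r @ z
    for it in range(max_iter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it + 1
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

n = 200
shift = np.linspace(0.1, 50.0, n)              # hypothetical variable coefficient
matvec = lambda v: apply_A(v, shift)
diag = probe_diagonal(matvec, n)
b = np.ones(n)
x, iterations = pcg(matvec, b, 1.0 / diag)
residual_norm = np.linalg.norm(b - matvec(x))
```

For operators with wider stencils, more probe colors would be needed, and stochastic estimators become the usual fallback when no coloring is known.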

2.3 Explicit Avoidance of Matrix Assembly:

No global matrix is stored, whether dense or in a sparse format such as CSR, COO, or ELL.
Kernel fusions (e.g., gather-GEMM-scatter on GPU) avoid even intermediate buffer writes (Yang et al., 20 Apr 2026).
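The gather-multiply-scatter pattern can be shown in a few lines of NumPy. This is only a sketch of the data flow, assuming a 1D mesh of two-node elements with a hypothetical element matrix `Ke`; it is not the fused CUDA kernel of the cited work, and NumPy does materialize the intermediate per-element arrays that a fused kernel would keep in registers.

```python
import numpy as np

def elementwise_apply(x, conn, Ke):
    """Compute y = K x without assembling K: gather, small dense multiply, scatter-add."""
    xe = x[conn]                # gather element DOFs, shape (n_elem, dofs_per_elem)
    ye = xe @ Ke.T              # apply the same element matrix to every element
    y = np.zeros_like(x)
    np.add.at(y, conn, ye)      # unbuffered scatter-add, correct for shared nodes
    return y

n_nodes = 6
# Element e connects node e and node e + 1.
conn = np.stack([np.arange(n_nodes - 1), np.arange(1, n_nodes)], axis=1)
Ke = np.array([[1.0, -1.0], [-1.0, 1.0]])   # hypothetical 1D stiffness block
x = np.arange(n_nodes, dtype=float) ** 2
y = elementwise_apply(x, conn, Ke)
```

`np.add.at` is used instead of `y[conn] += ye` because plain fancy-index assignment would drop contributions when a node index appears more than once.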

3. Algorithmic Realizations in Key Application Domains

3.1 Nonlinear Least Squares for Visual Computation

Opt (DeVito et al., 2016) generates matrix-free Gauss–Newton (GN) and Levenberg–Marquardt (LM) GPU solvers for imaging, MVS, and pose graph problems. The user declares block residuals; the system then symbolically differentiates them and emits GPU kernels for Jacobian–vector and adjoint operations without ever forming $J$ or $J^T J$. GN/LM systems are solved by CG or PCG, where each iteration evaluates only matrix-free products over local blocks. This is crucial when $N, M > 10^6$. Memory and wall-time overhead is reduced by orders of magnitude relative to classical assembled sparse solvers.
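A matrix-free Gauss–Newton iteration of this kind can be sketched in NumPy as follows. The residual (a pointwise data term plus a weighted smoothness term), the weight `lam`, and the data `b` are hypothetical, and the Jacobian products are hand-written here rather than generated by a compiler as in Opt. The point is only that the normal equations $J^T J\,\delta = -J^T r$ are solved by CG using `jvp` and `vjp` closures, never $J$ or $J^T J$ themselves.

```python
import numpy as np

lam = 0.1                                         # hypothetical smoothness weight
b = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.8])      # hypothetical data

def residual(x):
    return np.concatenate([x**2 - b, lam * (x[1:] - x[:-1])])

def jvp(x, v):
    """J(x) v, written out per residual block."""
    return np.concatenate([2.0 * x * v, lam * (v[1:] - v[:-1])])

def vjp(x, u):
    """J(x)^T u, the adjoint of jvp."""
    n = x.size
    out = 2.0 * x * u[:n]
    s = lam * u[n:]
    out[1:] += s
    out[:-1] -= s
    return out

def cg(matvec, rhs, tol=1e-12, max_iter=50):
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol:
            break
        Ap = matvec(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def gauss_newton(x0, iters=10):
    x = x0.copy()
    for _ in range(iters):
        g = vjp(x, residual(x))                           # gradient J^T r
        step = cg(lambda v: vjp(x, jvp(x, v)), -g)        # solve J^T J step = -g
        x = x + step
    return x

x_opt = gauss_newton(np.ones_like(b))
grad_norm = np.linalg.norm(vjp(x_opt, residual(x_opt)))
```

A Levenberg–Marquardt variant would add a damping term to the closure passed to `cg`; that extension is omitted here.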

3.2 PDE-Constrained Optimization and Spectral Elements

In PETSc/TAO (Marin et al., 2018), matrix-free "shell" objects abstract all forward/adjoint actions, exploiting block/tensor product structure. High-order spectral element operators are only materialized locally, and time/space complexity stays $O(N)$ instead of $O(N^2)$ or worse. The approach enables scalable PDE optimization over $>2$ billion DOFs on supercomputers, with 2–3× speedups and 90% reduction in RAM consumption.
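The tensor-product structure mentioned above is what makes local application cheap. As a minimal sketch (plain NumPy, not PETSc code, with hypothetical factor matrices), a Kronecker-product operator can be applied through two small matrix products, using the identity $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^T)$ for row-major flattening:

```python
import numpy as np

def kron_apply(A, B, x):
    """Compute (A kron B) x without forming the Kronecker product."""
    X = x.reshape(A.shape[1], B.shape[1])   # row-major unflatten of the input
    return (A @ X @ B.T).ravel()            # two small products instead of one large one

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))             # hypothetical 1D factors
B = rng.standard_normal((2, 5))
x = rng.standard_normal(A.shape[1] * B.shape[1])

y_free = kron_apply(A, B, x)
y_dense = np.kron(A, B) @ x                 # assembled reference, for checking only
```

For two square $n \times n$ factors, the factored application costs $O(n^3)$ operations, compared with $O(n^4)$ for a product with the assembled $n^2 \times n^2$ matrix.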

3.3 Convex and Conic Optimization with Fast Transforms

Matrix-free modeling (FAO–DAG) (Diamond et al., 2015, Adil et al., 2022) treats convolution, FFT, and Kronecker constraints as evaluation graphs. Cone solvers (ADMM or IPM) use only forward/adjoint routines; for conic problems, factorized operators ($A=UV^T$) with diagonal structure allow every step to run in $O(\mathrm{nnz})$ time, where $\mathrm{nnz}$ is the number of nonzeros, sometimes with 100× or more speedup on GPU.
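The idea of an operator graph can be sketched with a small class whose nodes hold only forward and adjoint callables and whose composition rules build the adjoint automatically. The class `LinOp` and the helpers `scale` and `diff` are hypothetical names for this illustration; they are not the API of the cited FAO–DAG frameworks.

```python
import numpy as np

class LinOp:
    """A node in an operator graph: stores forward/adjoint callables, never a matrix."""
    def __init__(self, forward, adjoint):
        self.forward, self.adjoint = forward, adjoint

    def __matmul__(self, other):
        # Composition self after other; the adjoint reverses the order.
        return LinOp(lambda x: self.forward(other.forward(x)),
                     lambda y: other.adjoint(self.adjoint(y)))

    def __add__(self, other):
        # Sum of two operators with matching input and output sizes.
        return LinOp(lambda x: self.forward(x) + other.forward(x),
                     lambda y: self.adjoint(y) + other.adjoint(y))

def scale(w):
    """Elementwise scaling by w, which is self-adjoint."""
    return LinOp(lambda x: w * x, lambda y: w * y)

def diff():
    """First-difference operator mapping length n to length n - 1."""
    def fwd(x):
        return x[1:] - x[:-1]
    def adj(y):
        out = np.zeros(y.size + 1)
        out[1:] += y
        out[:-1] -= y
        return out
    return LinOp(fwd, adj)

rng = np.random.default_rng(2)
n = 10
w = rng.standard_normal(n)
op = diff() @ (scale(w) + scale(np.ones(n)))   # small graph: difference of a summed scaling

x, y = rng.standard_normal(n), rng.standard_normal(n - 1)
adjoint_gap = abs(np.dot(op.forward(x), y) - np.dot(x, op.adjoint(y)))
```

Because every node carries its own adjoint, a solver can request forward or adjoint products from the root of the graph without any intermediate matrix being formed.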

3.4 Mixed-Integer Programming

Matrix-free first-order heuristics, such as Scylla (Mexi et al., 2023), rely exclusively on mat–vecs with $A$ and $A^T$ in their PDHG loop for solving LP relaxations. This avoids all matrix factorization costs and allows rapid warm starts and large-scale feasibility searches, even on very large, sparse problems.
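A generic PDHG loop for an equality-form LP, $\min c^T x$ subject to $Ax = b$, $x \ge 0$, illustrates why only products with $A$ and $A^T$ are needed. This is a textbook-style sketch on a hypothetical two-variable LP, not Scylla's implementation; the step sizes are assumed to satisfy $\tau\sigma\|A\|^2 < 1$.

```python
import numpy as np

def pdhg_lp(matvec, adjmatvec, b, c, tau, sigma, iters=1000):
    """Primal-dual hybrid gradient for min c^T x s.t. A x = b, x >= 0."""
    x = np.zeros_like(c)
    y = np.zeros_like(b)
    for _ in range(iters):
        # Primal step: gradient step on the Lagrangian, then projection onto x >= 0.
        x_new = np.maximum(0.0, x - tau * (c - adjmatvec(y)))
        # Dual step: ascent using the extrapolated primal point 2 * x_new - x.
        y = y + sigma * (b - matvec(2.0 * x_new - x))
        x = x_new
    return x, y

# Hypothetical LP: minimize x1 + 2 x2 subject to x1 + x2 = 1, x >= 0.
# A = [1 1] is represented only by its two actions.
matvec = lambda v: np.array([v[0] + v[1]])
adjmatvec = lambda u: np.array([u[0], u[0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])

# ||A||^2 = 2, so tau * sigma * ||A||^2 = 0.5 < 1.
x, y = pdhg_lp(matvec, adjmatvec, b, c, tau=0.5, sigma=0.5)
objective = float(c @ x)
```

Each iteration costs one forward and one adjoint product, so the loop inherits whatever structure or sparsity the two closures exploit.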

3.5 Matrix-Structured Online and Adaptive Optimization

Projection-free preconditioned SGD and FTRL variants (e.g., Leon, RecursiveOptimizer) (Kovalev, 2 Apr 2026, Cutkosky et al., 2019) replace explicit projections and full-matrix preconditioners with blockwise or diagonalized operator updates, achieving performance that interpolates between diagonal and full-matrix AdaGrad, yet with time and space cost on the order of online gradient descent and no full-matrix storage or inversion requirements.

3.6 Second-Order and Newton-type Methods

Matrix-free Newton/Krylov methods apply the chain rule and sparse/block factorizations without assembling the Jacobian (Naumann, 2023). For structure-exploiting cases (banded/tridiagonal/block), per-step complexity is reduced relative to the assembled Newton step, and graph-based heuristics are used to manage decomposition complexity. In empirical Fisher/second-order SGD (M-FAC) (Frantar et al., 2021), Sherman–Morrison–Woodbury recursions exploit the rank-one structure to make inverse-Hessian–vector products efficient and matrix-free.
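The Sherman–Morrison recursion for a damped empirical Fisher matrix, $F = \lambda I + \frac{1}{m}\sum_i g_i g_i^T$, can be sketched as below. This is a simplified illustration in the spirit of the rank-one recursion described above, not the M-FAC implementation; the function name, the damping value, and the random gradients are assumptions. Only the $m \times d$ gradient array and an equally sized work array are stored, never the $d \times d$ matrix.

```python
import numpy as np

def build_inverse_fisher(G, damping):
    """Return v -> F^{-1} v for F = damping * I + (1/m) * sum_i g_i g_i^T."""
    m = G.shape[0]
    Q = np.zeros_like(G)        # row i holds q_i = F_{i-1}^{-1} g_i
    denom = np.zeros(m)         # denom_i = m + g_i^T q_i
    for i in range(m):
        q = G[i] / damping      # F_0^{-1} g_i
        if i > 0:
            # Subtract the rank-one corrections from all earlier gradients.
            coeff = (Q[:i] @ G[i]) / denom[:i]
            q = q - coeff @ Q[:i]
        Q[i] = q
        denom[i] = m + G[i] @ q

    def inv_matvec(v):
        # F^{-1} v = v / damping - sum_i q_i (q_i . v) / denom_i
        return v / damping - ((Q @ v) / denom) @ Q

    return inv_matvec

rng = np.random.default_rng(3)
m, d = 5, 8
G = rng.standard_normal((m, d))   # hypothetical per-sample gradients
damping = 0.1
inv_matvec = build_inverse_fisher(G, damping)

v = rng.standard_normal(d)
result = inv_matvec(v)
# Dense reference, formed here only to check the recursion.
F = damping * np.eye(d) + G.T @ G / m
reference = np.linalg.solve(F, v)
```

The setup costs $O(m^2 d)$ operations and each inverse product costs $O(md)$, which is attractive when the number of stored gradients $m$ is much smaller than the parameter count $d$.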

4. Software, DSLs, and Compilation Pipelines

Domain-Specific Languages (DSLs): Opt (DeVito et al., 2016) provides a C-like DSL for user-declared residuals, from which the compiler generates symbolic derivatives and fusion kernels for matrix-free linear algebra, automatically scheduling and fusing kernels for maximal occupancy and minimal memory movement.
Operator Graph Modeling: Matrix-free modeling frameworks (Diamond et al., 2015) build DAGs of FAOs; canonicalization, splitting, and graph optimization steps preserve decomposability and minimize memory usage.
Highly Tuned Kernels: For SIMP topology optimization (Yang et al., 20 Apr 2026), the gather-GEMM-scatter is fused into a single CUDA kernel that never allocates intermediate global-memory buffers, maximizing occupancy and minimizing DRAM traffic.
Integration with Automatic Differentiation: Newton methods (Naumann, 2023) leverage program-tape DAGs and reverse-mode AD principles for both Jacobian-vector and adjoint computations in a matrix-free manner.
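How automatic differentiation yields Jacobian–vector products without a Jacobian can be illustrated with a tiny forward-mode sketch based on dual numbers. Note the swap: this shows forward mode for $Jv$ only, whereas adjoint products $J^T u$ require reverse mode, which is not shown. The class `Dual` and the helpers `dsin` and `jvp` are hypothetical and support only addition, multiplication, and sine.

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule applied to the tangent part.
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def dsin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def jvp(f, x, v):
    """Return J(x) v by pushing the tangent v through f; J is never formed."""
    outputs = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return [Dual.lift(o).tan for o in outputs]

def f(x):
    # Hypothetical residual function with two inputs and two outputs.
    return [x[0] * x[1] + dsin(x[0]), 3.0 * x[1]]

# At x = (0, 2) the Jacobian is [[x1 + cos(x0), x0], [0, 3]] = [[3, 0], [0, 3]].
jv = jvp(f, [0.0, 2.0], [1.0, 1.0])   # [3.0, 3.0]
```

One evaluation of `f` on dual inputs gives one Jacobian–vector product, which is exactly the oracle a Krylov solver needs.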

5. Numerical Performance, Memory Complexity, and Hardware Utilization

Matrix-free frameworks routinely demonstrate speedups of 3–150× and memory reductions of up to 90% relative to assembled-matrix approaches across a diverse set of workloads.

Summary Table of Key Benchmarks and Results:

Application	Speedup / Memory vs. Assembled	Hardware	Method Reference
Multi-view stereo LM	5× faster, 6× less memory	V100 GPU	(DeVito et al., 2016)
3D SIMP Topology (Cantilever)	4.6–7.3× faster, 3.2–4.9× less energy	RTX 4090 GPU	(Yang et al., 20 Apr 2026)
PDE-constrained optimization	2–3× faster, 90% less RAM	Theta (KNL cluster)	(Marin et al., 2018)
Conic Optimization (SOCP)	10–150× faster (GPU)	Multi-GPU, CPU	(Adil et al., 2022)
Mixed-Integer LP (hard root-LP)	4× faster on hard instances	CPU	(Mexi et al., 2023)
Nonlocal denoising (PCG+NFFT)	Cholesky infeasible at large problem sizes, PCG remains viable	Laptop	(Miniguano-Trujillo et al., 2024)
Online learning/preconditioning	Gradient-descent-level time/space, full-matrix quality	CPU/GPU	(Cutkosky et al., 2019, Kovalev, 2 Apr 2026)

6. Limitations, Conditioning, and Preconditioning Strategies

While matrix-free methods eliminate assembly and allow highly parallel computations, certain trade-offs and domain-specific limitations arise:

Recomputation Overhead: CG/PCG in matrix-free frameworks recomputes local operator products in every iteration, whereas assembled solvers may cache and reuse.
Preconditioning Quality: Only diagonal or block-Jacobi preconditioners are practical for "pure" matrix-free methods. Complex global preconditioners and direct solvers would require access to assembled structure.
Numerical Precision: Mixed-precision arithmetic (e.g., BF16 in GPU kernels) is limited by matrix conditioning; when the condition number is too large relative to the low-precision unit roundoff, iterative refinement fails to converge (Yang et al., 20 Apr 2026).
Scope of Applicability: Problems that lack exploitable sparsity, operator structure, or fast oracles may not see benefits from matrix-free techniques.
NP-Completeness in Decomposition: Scheduling the optimal matrix-free Newton step in arbitrary DAGs is NP-complete; heuristics are needed (Naumann, 2023).

7. Future Directions and Open Research Problems

Matrix-free frameworks are the subject of ongoing research in several directions:

Geometric Multigrid Preconditioning: Embedding matrix-free operators as smoothers within multilevel cycles to reduce Krylov iterations in ill-conditioned regimes (Yang et al., 20 Apr 2026).
Distributed and Multi-GPU Scalability: Extending fused-kernel or operator-graph approaches to distributed memory and multi-device environments with low-latency communication (Yang et al., 20 Apr 2026).
Generic Operator Graph Optimization: Further automating the optimization and fusion of operator DAGs for rapidly evolving hardware and large-scale systems (Diamond et al., 2015, Marin et al., 2018).
Projection-Free Adaptive Methods: Refining FTRL-gradient-accumulation and blockwise preconditioning for non-convex, non-smooth, or highly-structured machine learning objectives (Kovalev, 2 Apr 2026).
Extensions to Mixed-Integer/Quadratic Programming: Leveraging matrix-free first-order solvers as primal heuristics inside commercial MIP/MIQP environments (Mexi et al., 2023).
Adaptive Mixed-Precision Strategies: Tuning operator arithmetic dynamically to avoid stagnation in highly ill-conditioned settings (Yang et al., 20 Apr 2026).

The matrix-free optimization framework constitutes a synthesis of operator-oriented programming, automatic differentiation, compilation, and iterative algorithms, offering hardware-efficient and scalable solutions for a broad range of large-scale optimization problems.