
Sketching as a Regularizer

Updated 11 November 2025
  • Sketching as a regularizer is a technique that uses randomized projections, such as JL transforms and CountSketch, to compress high-dimensional data while nearly preserving its geometric structure.
  • It serves as an explicit surrogate for conventional regularizers in applications such as continual learning, deep representation learning, and inverse imaging, achieving competitive performance with lower memory and computation.
  • Empirical and theoretical results show that sketched regularization preserves the error bounds and convergence rates of the unsketched formulations, offering a scalable and adaptive approach to complex optimization problems.

Sketching as a regularizer refers to the practice of employing randomized linear transformations—known as sketches—to compress information, reduce computational cost, and/or induce explicit or implicit inductive biases in optimization and learning problems. This approach harnesses sketching’s classical ability to preserve geometric structure under random projections (e.g., Johnson–Lindenstrauss lemma) in order to regularize high-dimensional statistical, neural, and inverse problems. Recent developments have demonstrated that sketching can function as an explicit surrogate for standard regularizers, as a mechanism for memory-efficient replay in continual learning, as an inductive bias for structure in vision representations, and as an implicit multiscale prior in stochastic optimization for imaging applications.

1. Mathematical Foundations of Sketching as a Regularizer

Randomized sketching typically involves projecting high-dimensional vectors or matrices onto a low-dimensional subspace using a random matrix $S \in \mathbb{R}^{k \times n}$, with $k \ll n$. Sketching matrices are constructed to satisfy $\ell^2$-subspace embedding properties: for any $x$ in a subspace,

$$(1-\varepsilon)\|Ax\|_2^2 \leq \|S A x\|_2^2 \leq (1+\varepsilon)\|Ax\|_2^2,$$

with high probability.

Key instantiations include:

  • CountSketch matrices for row-space compression and efficient matrix product approximation.
  • Johnson–Lindenstrauss (JL) transforms, such as Gaussian or Fast JL matrices (FJLT), to preserve pairwise distances.
  • Multiresolution operators in imaging, which are block-averaging/downsampling-upscaling compositions.

Sketching thus induces a new, lower-dimensional geometry in which quadratic forms, distances, (co-)variance, and loss landscapes are only slightly distorted—this underpins its application as a regularizer across different domains.
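
As a concrete illustration of the embedding property above, the following minimal Python example compares a dense Gaussian JL projection with a CountSketch on a random vector. The dimensions, seed, and specific constructions are illustrative assumptions rather than taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 400                     # ambient and sketch dimensions (illustrative)
x = rng.normal(size=n)

# Gaussian JL sketch: i.i.d. N(0, 1/k) entries, so E||Sx||_2^2 = ||x||_2^2.
S_gauss = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))

# CountSketch: each coordinate is hashed to one of k buckets with a random sign.
buckets = rng.integers(0, k, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
Sx_count = np.zeros(k)
np.add.at(Sx_count, buckets, signs * x)

norm_sq = np.linalg.norm(x) ** 2
print("Gaussian JL ratio:", np.linalg.norm(S_gauss @ x) ** 2 / norm_sq)  # close to 1
print("CountSketch ratio:", np.linalg.norm(Sx_count) ** 2 / norm_sq)     # close to 1
```

Both ratios concentrate around 1 as $k$ grows, which is exactly the small distortion that the regularization arguments in the following sections exploit.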

2. Structural Regularization in Lifelong Learning

Structural regularization (SR) methods in continual learning penalize deviation of current neural network parameters $\theta$ from parameters $\theta^*$ learned on previous tasks, modulated by a (typically quadratic) importance matrix $F$:

$$R(\theta) = \frac{1}{2}(\theta - \theta^*)^\top F (\theta - \theta^*).$$

In "Lifelong Learning with Sketched Structural Regularization" (Li et al., 2021), a sketched SR approach compresses $F$ via CountSketch, such that $F \approx \tilde F = \frac{1}{n} \tilde{W}^\top \tilde{W}$, where $\tilde W = S W$ (with $S$ the sketching matrix and $W$ the Jacobian matrix of per-sample gradients).

Algorithmic aspects:

  • Building $\tilde W$ costs $O(nm)$ time and $O(tm)$ memory for $t \ll m$ "buckets" (a minimal numerical example follows this list).
  • The sketched penalty preserves off-diagonal information ignored by diagonal-SR, but at drastically reduced storage versus the full $O(m^2)$ matrix.
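
The following minimal example, assuming a synthetic per-sample gradient matrix $W$ and illustrative sizes, builds $\tilde W = SW$ with a generic CountSketch and compares the sketched penalty against the exact one; it is a toy reconstruction of the recipe above, not the reference implementation of Li et al. (2021).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t = 2_000, 500, 200                 # samples, parameters, sketch buckets (illustrative)

W = rng.normal(size=(n, m))               # rows = per-sample gradients at theta*
theta_star = rng.normal(size=m)
theta = theta_star + 0.1 * rng.normal(size=m)

# CountSketch applied to the rows of W: hash each sample into one of t buckets with a sign.
buckets = rng.integers(0, t, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
W_tilde = np.zeros((t, m))
np.add.at(W_tilde, buckets, signs[:, None] * W)   # \tilde W = S W: O(nm) time, O(tm) memory

d = theta - theta_star
R_full = 0.5 * d @ (W.T @ W / n) @ d              # exact penalty with F = W^T W / n
R_sketch = 0.5 * np.linalg.norm(W_tilde @ d) ** 2 / n
print(R_full, R_sketch)                           # close when t exceeds the stable-rank threshold
```

Only the $t \times m$ matrix $\tilde W$ needs to be stored, instead of the $m \times m$ matrix $F$, while the two printed penalties agree up to the $(1 \pm \varepsilon)$ factor quoted in the guarantees below.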

Theoretical guarantees:

Let $r$ be the stable rank of $W$. For $t = O(r^2/\varepsilon^2)$, with high probability,

$$(1-\varepsilon)\, R(\theta) \leq R_{\rm sketch}(\theta) \leq (1+\varepsilon)\, R(\theta)$$

uniformly over $\theta$, and $\|\tilde F - F\|_2 \leq \varepsilon \|F\|_2$.

Empirical results:

Sketched SR outperforms diagonal-only SR in standard lifelong learning benchmarks (Permuted-MNIST, CIFAR-100). For instance, Sketch-EWC achieves 93.6% on CIFAR-100 distribution-shift (vs. 90.8% for diagonal EWC), matching full-matrix EWC accuracy at a small fraction of computational cost (Li et al., 2021).

3. Sketch-Based Regularization in High-Dimensional Optimization

Sketching has been extended to high-dimensional regularized least-squares with arbitrary convex or nonconvex penalties. The Sketching for Regularized Optimization (SRO) framework (Yang et al., 2023) analyzes the original problem

$$x^* = \arg\min_x \; \frac{1}{2}\|A x - b\|_2^2 + R(x)$$

and its sketched version,

$$\hat x = \arg\min_x \; \frac{1}{2}\|S(Ax - b)\|_2^2 + R(x),$$

where $S$ is a JL-type sketch matrix.
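
As a minimal, hedged instance of the SRO pair above, the snippet below specializes to a ridge penalty $R(x) = \frac{\lambda}{2}\|x\|_2^2$, for which both problems have closed-form solutions; the Gaussian sketch, sizes, and $\lambda$ are illustrative assumptions, and the framework itself covers general convex and nonconvex $R$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 5_000, 200, 600            # samples, features, sketch size (illustrative)
lam = 1.0                            # ridge weight (illustrative choice of R)

A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.5 * rng.normal(size=n)

# Gaussian subspace-embedding sketch applied to the rows of the least-squares residual.
S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
SA, Sb = S @ A, S @ b

def ridge(X, y, lam):
    # Closed-form minimizer of 0.5*||Xz - y||^2 + 0.5*lam*||z||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

x_star = ridge(A, b, lam)            # full-data solution
x_hat = ridge(SA, Sb, lam)           # sketched (SRO-style) solution
rel_err = np.linalg.norm(A @ (x_hat - x_star)) / np.linalg.norm(A @ x_star)
print(rel_err)                       # bounded by eps/(1 - eps) for a (1 +/- eps) embedding
```

The printed ratio is the quantity bounded by $\varepsilon/(1-\varepsilon)$ in the result quoted below; shrinking $m$ loosens the embedding, which acts as additional regularization.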

Theoretical results:

  • For convex $R$, with $S$ a $(1\pm\varepsilon)$ subspace embedding,

$$\|A(\hat x - x^*)\|_2 \leq \frac{\varepsilon}{1-\varepsilon}\|A x^*\|_2.$$

  • For sparse estimators (convex and also nonconvex, e.g., SCAD/MCP), iterative SRO achieves optimal minimax rates,

$$\|\tilde x^{(N)}-\bar x\|_2 = O\left(\sqrt{\frac{s \log d}{n}}\right)$$

for sketch size $m \sim O(r \log d)$.

Interpretation:

Sketching reduces the effective problem dimension and noise, akin to explicit penalization (e.g., ridge/Lasso). The amount of regularization is controlled by the sketch dimension $m$: small $m$ yields stronger shrinkage.

Empirical validation:

SRO and iterative SRO yield prediction error and statistical rates on par with full-data solvers in Lasso, Ridge, and subspace clustering tasks, but with reduced computational load (Yang et al., 2023).

4. Sketching as Inductive Bias in Deep Representation Learning

Sketching can serve as an architectural or loss-level inductive bias. In Learning by Sketching (LBS) (Lee et al., 2023), images are embedded into a compact set of parametric Bézier strokes using a Transformer decoder, with end-to-end training governed by CLIP-based geometric and semantic perceptual losses, as well as guidance and embedding losses.

Architectural elements:

  • Depth-constrained explicit representations: e.g., $n \approx 20$ strokes per image (a minimal parameterization example follows this list).
  • Losses imposed on both geometric (low-level) and semantic (high-level) CLIP embeddings of the stroke rendering.
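
To make the capacity bottleneck concrete, the snippet below evaluates a budget of 20 cubic Bézier strokes from their control points; the stroke degree, sample count, and random control points are illustrative assumptions, and the actual LBS pipeline predicts the control points with a Transformer decoder and renders the strokes for the CLIP-based losses described above.

```python
import numpy as np

def bezier_points(ctrl, n_pts=32):
    """Evaluate a cubic Bezier stroke from 4 control points of shape (4, 2)."""
    t = np.linspace(0.0, 1.0, n_pts)[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])   # (n_pts, 2) curve points

rng = np.random.default_rng(3)
n_strokes = 20                                          # capacity bottleneck: ~20 strokes/image
strokes = rng.uniform(0, 1, size=(n_strokes, 4, 2))     # stand-in for decoder outputs
polylines = np.stack([bezier_points(s) for s in strokes])
print(polylines.shape)                                  # (20, 32, 2): the whole representation
```

Under these assumptions, the entire image is summarized by $20 \times 4 \times 2 = 160$ numbers, which is the kind of bottleneck that forces the representation to privilege global shape over pixel-level detail.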

Theoretical analysis:

  • Affine-equivariance: The stroke generation and rasterization pipeline is provably equivariant under $\mathrm{Aff}(2)$ transformations, ensuring the learned representation is geometry-aware.
  • Bottleneck of limited strokes imposes a capacity constraint that prioritizes global shape over fine, potentially noisy, details.

Implications and empirical evidence:

  • LBS yields greater spatial reasoning and generalization capacity than standard contrastive or self-supervised methods in tasks such as rotMNIST, CLEVR attribute probing, and domain transfer. For instance, LBS achieves 81.8% on leftmost-color CLEVR attribute linear probe, compared to 76.7% (CE) and 70.1% (E(2)-CNN) (Lee et al., 2023).
  • Sketching regularizes the model to focus on essential, interpretable structure rather than pixel-level variation.

5. Sketch-Based Regularization for Continual and In Situ Learning

Memory-efficient continual learning is enabled by sketch-based replay. In neural compressors for in situ scientific simulation (Simpson et al., 4 Nov 2025), only low-dimensional random projections (“sketched snapshots”) of previous data are retained in memory, and the replay loss is computed in the sketch space.

Theoretical underpinning:

  • The Johnson–Lindenstrauss lemma and its manifold variant guarantee that the sketched loss $\|S(U) - S f_\theta\|^2$ tightly controls the full-space error $\|U - f_\theta\|^2$ up to $(1 \pm \varepsilon)$ multiplicative slack.

Algorithmic protocol:

  • For each new data snapshot, store the full data in a small buffer, store multiple sketched projections ($k \ll n$) in a larger buffer, and minimize the combined loss $L_{\text{insitu}} = \frac{1}{b_f} \sum L_{\text{full}} + \lambda \frac{1}{b_s} \sum L_{\text{sketch}}$ (see the example below).
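
A minimal version of this buffering scheme, assuming a synthetic snapshot stream, a dense Gaussian projection as a stand-in for the FJLT, and a placeholder reconstruction in place of the trained neural compressor $f_\theta$, is given below.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 4_096, 256                    # snapshot dimension and sketch dimension (illustrative)
lam = 1.0                            # replay weight lambda (illustrative)
b_f, b_s = 2, 16                     # small full buffer, larger sketched buffer

S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))   # stand-in for an FJLT-style sketch

def snapshot(t):
    # Synthetic stand-in for a simulation snapshot at time index t.
    return np.sin(t + np.arange(n) / n)

def f_theta(t):
    # Placeholder for the neural compressor's reconstruction of snapshot index t
    # (slightly biased so both loss terms are nonzero in this toy example).
    return 0.99 * np.sin(t + np.arange(n) / n)

full_buffer = [(t, snapshot(t)) for t in range(b_f)]                   # recent snapshots, kept whole
sketch_buffer = [(t, S @ snapshot(t)) for t in range(b_f, b_f + b_s)]  # older: only S u is kept

L_full = np.mean([np.sum((u - f_theta(t)) ** 2) for t, u in full_buffer])
L_sketch = np.mean([np.sum((su - S @ f_theta(t)) ** 2) for t, su in sketch_buffer])
L_insitu = L_full + lam * L_sketch   # combined loss; each replayed snapshot costs only k floats
print(L_insitu)
```

By the JL argument above, the sketched term controls the full-space replay error up to $(1 \pm \varepsilon)$ multiplicative slack, while each replayed snapshot occupies only $k$ stored numbers.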

Empirical performance:

  • On large-scale 2D/3D simulation tasks (Ignition, Neuron, Channel-flow), in situ training with FJLT-based sketching achieves relative Frobenius error (RFE) $\leq 2.64\%$, nearly matching offline baselines, whereas omitting sketching leads to catastrophic forgetting (RFE $\gg 50\%$) (Simpson et al., 4 Nov 2025).

Benefits:

  • One-pass, architecture-agnostic, and mesh-agnostic, with minimal memory increase and theoretical control of regularization strength via $k$. The method is especially effective for preventing forgetting in long-sequence training.

6. Implicit Multiscale Regularization via Sketching in Inverse Imaging

In stochastic multiresolution sketching for image reconstruction (ImaSk) (Perelli et al., 13 Dec 2024), sketching acts as an implicit multiscale regularizer by interleaving updates at different spatial resolutions within each iteration.

Algorithmic innovation:

  • Introduce random "sketch" operators $S_i = T_i^\top T_i$, where $T_i$ down-samples by $2^i$ and $T_i^\top$ up-samples by nearest neighbor (see the example after this list).
  • The update uses a SAGA-type variance-reduction framework, sampling the resolution at each step.
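
A minimal version of one such operator, assuming $T_i$ performs block averaging over $2^i \times 2^i$ patches and $T_i^\top$ performs nearest-neighbor replication (the exact scaling used in ImaSk may differ), is shown below.

```python
import numpy as np

def multires_sketch(x, i):
    """Apply S_i = T_i^T T_i: block-average by 2^i, then nearest-neighbor upsample back."""
    f = 2 ** i
    h, w = x.shape
    coarse = x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))   # T_i: downsample by 2^i
    return np.repeat(np.repeat(coarse, f, axis=0), f, axis=1)    # T_i^T: upsample to (h, w)

rng = np.random.default_rng(5)
x = rng.normal(size=(256, 256))
for i in range(3):                        # r = 3 resolutions: factors 1, 2, 4
    xi = multires_sketch(x, i)
    print(i, np.linalg.norm(x - xi))      # coarser sketches discard more high-frequency detail
```

Sampling $i$ at random inside a SAGA-style loop mixes coarse, strongly smoothing updates with fine, detail-preserving ones, which produces the implicit multiscale regularization described next.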

Theoretical result:

  • Linear convergence rate in strongly convex settings: $\mathbb{E}\|x^K - x^*\|_2^2 \leq \theta^K C$, with contraction rate $\theta < 1$ improving with an increased number of sketch resolutions $r$.

Effect on regularization:

  • Coarse-resolution updates emphasize smoothing and denoising by suppressing high-frequency details, while fine resolutions refine boundaries. Multiscale stochasticity implicitly regularizes against overfitting high-frequency noise.

Computational and empirical impact:

  • On clinical CT, ImaSk with four resolutions reduces per-iteration computation by approximately a factor of two and reaches convergence $2$–$3\times$ faster than single-resolution SAGA, with final reconstructions matching full-resolution solvers in PSNR and error (Perelli et al., 13 Dec 2024).

7. Synthesis: Interpretations, Advantages, and Limitations

The use of sketching as a regularizer unifies explicit surrogate penalties, implicit memory- and information-bottlenecks, and multiscale or geometric priors. Key mechanisms include:

  • Explicit compression, yielding control over noise and overfitting analogous to $\ell_2$-type shrinkage, but tunable via the sketch size.
  • Inductive biases for geometry, compactness, and equivariance, robust to domain shifts and transformations.
  • Memory- and computation-efficient replay and optimization strategies for high-dimensional, streaming, or online scenarios.

Chief advantages are theoretical transparency (explicit error bounds), structural flexibility for diverse problem settings, and strong empirical efficacy in both statistical and deep learning regimes. Principal limitations include the stochastic nature of the error guarantees, necessity of careful sketch dimension choice, and occasional reliance on mild curvature or manifold assumptions for JL-type results. A plausible implication is the potential for further gains via adaptive, data-driven, or deterministic sketches tailored to domain-specific structure or learned low-rank representations.

Sketching as a regularizer thus constitutes a fundamental toolkit for scalable, geometry-aware, and provably robust statistical learning, optimization, and continual adaptation across a broad spectrum of applications.
