Randomized Sparse Updates

Updated 2 January 2026
  • Randomized sparse updates are techniques that use random sampling and sparsification operators to selectively update parameters, reducing memory and arithmetic costs while ensuring unbiasedness.
  • They are applied in sparse recovery, neural network training, GNN acceleration, and high-dimensional PDE solutions by leveraging efficient sampling and parallel computation.
  • The methods offer theoretical guarantees such as linear convergence rates and error bounds, and they extend to federated learning and distributed settings through adaptive sampling.

Randomized sparse updates refer to algorithmic strategies in large-scale optimization, numerical linear algebra, deep learning, and scientific computing that replace dense, computationally intensive parameter or coordinate updates with randomly sampled, sparsity-promoting alternatives. These methods leverage random sampling, subspace selection, or masking to aggressively limit the number of coordinates, entries, or parameters updated per iteration—often dramatically reducing both memory and arithmetic costs—while maintaining statistical or optimization guarantees such as unbiasedness, variance control, and rapid convergence whenever feasible. This paradigm has produced scalable solvers for sparse recovery, communication-efficient neural training, federated learning, GNN acceleration, high-dimensional PDEs, and scientific simulations. Randomized sparse updates represent an intersection of randomized numerical linear algebra, compressed sensing, stochastic optimization, and scalable machine learning.

1. Algorithmic Principles and Sampling Mechanisms

Randomized sparse update schemes are characterized by their design of (i) sparsification operators, i.e., randomized mappings $S_r: \mathbb{R}^N \to \mathbb{R}^N$ that restrict the output to at most $r$ nonzero entries; (ii) support selection heuristics, data-driven or magnitude-based for maximal efficiency; and (iii) unbiasedness constraints ensuring that $\mathbb{E}[S_r(v)] = v$ for any input $v$.

Approaches include:

  • Pivotal and splitting-based sampling: Preserves the largest-magnitude coordinates and randomly rounds the remainder to achieve unbiasedness, as in the sparsified Richardson iteration (Weare et al., 2023).
  • Adaptive sampling for matrix operations: In graph neural network (GNN) training, columns/rows are sampled proportional to their induced error or gradient norm, subject to global resource budgets (Liu et al., 2022).
  • Bernoulli masking in neural training: Parameters are switched on/off per iteration with independent Bernoulli trials, typically with $P(\text{active})=p$ for each coordinate (Berman et al., 2023).
  • Mini-batched or block-coordinate updates: Multiple coordinates or parameters are sampled independently or as blocks for parallelism and statistical averaging (Tondji et al., 2022).

These mechanisms control statistical error, computational resource allocation, and memory footprint.
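
As an illustration of these ingredients, the following is a minimal NumPy sketch of an unbiased sparsification operator: the largest-magnitude entries are kept exactly, each remaining entry is kept independently with probability proportional to its magnitude, and survivors are rescaled by $1/p$ so that the output equals the input in expectation. The independent-Bernoulli rounding is a simplification of true pivotal/splitting sampling, and the function name, budget split, and test values are illustrative choices.

```python
import numpy as np

def unbiased_sparsify(v, r, n_exact=None, rng=None):
    """Unbiased sparsification of v to roughly r nonzero entries.

    The n_exact largest-magnitude entries are kept deterministically; each
    remaining entry is kept with probability proportional to its magnitude
    (capped at 1) and rescaled by 1/p, so E[output] = v coordinate-wise.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    n_exact = r // 2 if n_exact is None else n_exact

    out = np.zeros_like(v)
    order = np.argsort(-np.abs(v))
    exact_idx, rest_idx = order[:n_exact], order[n_exact:]
    out[exact_idx] = v[exact_idx]                          # kept exactly

    rest, budget = v[rest_idx], max(r - n_exact, 0)
    total = np.abs(rest).sum()
    if total > 0 and budget > 0:
        p = np.minimum(1.0, budget * np.abs(rest) / total)  # inclusion probabilities
        keep = rng.random(rest.size) < p
        out[rest_idx[keep]] = rest[keep] / p[keep]          # 1/p rescaling => unbiased
    return out

# Monte Carlo sanity check: the average over many draws approaches v.
rng = np.random.default_rng(0)
v = rng.standard_normal(1000)
draws = [unbiased_sparsify(v, r=50, rng=rng) for _ in range(2000)]
print(np.max(np.abs(np.mean(draws, axis=0) - v)))   # shrinks as the number of draws grows
```

Keeping the top entries deterministically removes the largest contributors from the random rounding, mirroring the splitting idea behind the sparsified Richardson iteration cited above.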

2. Randomized Sparse Kaczmarz and Bregman Methods

Randomized sparse updates originated with the sparse randomized Kaczmarz (SRK) method and its numerous generalizations, designed for large-scale linear systems under sparsity constraints:

  • Weighted Kaczmarz updates: At iteration $j$, weights $w^{(j)}$ shrink off-support components of $A$ to bias the projection towards the estimated support $S_j$. Each update modifies only a small, adaptively estimated support, dramatically accelerating convergence for sparse solutions and outperforming both standard RK and $\ell_1$ minimization in many regimes (Mansour et al., 2013, Schöpfer et al., 2016, Aggarwal et al., 2014, Yuan et al., 2021).
  • Bregman-projection view: The update is a Bregman projection onto affine constraints, with the Bregman distance induced by $f(x)=\lambda\|x\|_1+\tfrac{1}{2}\|x\|_2^2$. This formulation enables theoretical analysis for split feasibility and allows extensions to generalized regularizers and block-coordinate updates (Schöpfer et al., 2016, Schöpfer et al., 2022).
  • Parallel and mini-batch extensions: Mini-batched block versions further accelerate convergence and exploit parallel resources. In the RSKA method (Tondji et al., 2022), batch-averaged Kaczmarz corrections are combined with over-relaxation, yielding improved contraction factors in theory and order-of-magnitude wall-clock speedups in practice.

Theoretical guarantees include linear convergence rates in expectation under strong convexity and mild geometric conditions on the constraint hyperplanes, with convergence factors explicitly characterized by data dimensions, batch sizes, and sampling probabilities.
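
To make the update concrete, here is a minimal sketch of the randomized sparse Kaczmarz iteration in its Bregman/soft-thresholding form, with rows sampled proportionally to their squared norms; the step count, regularization weight $\lambda$, and the toy problem are illustrative choices, not a definitive implementation of any single cited variant.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrinkage operator S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def randomized_sparse_kaczmarz(A, b, lam=1.0, iters=20000, rng=None):
    """Sparse Kaczmarz via Bregman projections w.r.t. f(x) = lam*||x||_1 + 0.5*||x||_2^2.

    Each step picks one row (probability proportional to ||a_i||^2), updates
    the dual variable z along that row, and recovers the sparse primal
    iterate by soft thresholding, so only one row of A is touched per step.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    row_norms2 = np.einsum('ij,ij->i', A, A)
    probs = row_norms2 / row_norms2.sum()
    z = np.zeros(n)      # dual variable
    x = np.zeros(n)      # sparse primal iterate
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        residual = b[i] - A[i] @ x
        z += (residual / row_norms2[i]) * A[i]
        x = soft_threshold(z, lam)
    return x

# Toy consistent system with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 500))
idx = rng.choice(500, size=10, replace=False)
x_true = np.zeros(500)
x_true[idx] = rng.standard_normal(10)
b = A @ x_true
x_hat = randomized_sparse_kaczmarz(A, b, lam=1.0, iters=30000, rng=rng)
print("residual:", np.linalg.norm(A @ x_hat - b), "nnz:", np.count_nonzero(x_hat))
```

The mini-batched RSKA variant discussed above replaces the single-row correction with a batch-averaged one combined with over-relaxation.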

3. Randomized Sparse Updates in Deep Learning and GNNs

Randomized sparse updates have found widespread impact in deep learning and graph learning:

  • Hash-based randomized updates: Random projections and locality-sensitive hashing select a small active set of neurons per forward and backward pass, reducing floating-point operations from $O(ND)$ to $O(kD)$, where $k \ll N$. This enables sparse gradients naturally suited for lock-free, highly parallel hardware execution, with minimal accuracy loss (less than 1% degradation at 5% active units) demonstrated on large benchmarks (Spring et al., 2016).
  • Layerwise and epoch-wise resource allocation in GNNs: The RSC framework for GNNs uses top-$k$ sampling on sparse adjacency columns, subject to global FLOP budgets. The per-layer allocation (the number $k_\ell$ sampled in layer $\ell$) is chosen via a greedy procedure to minimize cumulative error, while epoch-wise caching exploits the temporal stability of top-$k$ sets to avoid repeated sampling overhead (Liu et al., 2022). Switching back to exact computation for the final 20% of epochs recovers any accuracy lost to the approximation; a simplified sketch of this style of top-$k$ sampling appears after this list.
  • Federated and communication-efficient paradigms: In FL, Bernoulli-sampled random masks communicate sparse parameter updates stochastically, reducing uplink bandwidth below 1 bit per parameter, while maintaining or improving accuracy relative to quantized gradient baselines (Isik et al., 2022).
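
As referenced in the list above, here is a simplified sketch of budgeted top-$k$ contributor selection for approximating a matrix product: each rank-one term of $AH = \sum_j A_{:,j} H_{j,:}$ is scored by the product of its factor norms and only the $k$ largest are kept. The scoring rule, the dense toy matrices, and the fixed global budget are illustrative simplifications of the layerwise, FLOP-budgeted allocation in RSC.

```python
import numpy as np

def topk_approx_matmul(A, H, k):
    """Approximate A @ H by keeping only the k largest rank-one terms.

    Writing A @ H = sum_j outer(A[:, j], H[j, :]), terms are scored by
    ||A[:, j]|| * ||H[j, :]|| and only the top-k contributors are kept,
    cutting the work roughly in proportion to k / A.shape[1].
    """
    scores = np.linalg.norm(A, axis=0) * np.linalg.norm(H, axis=1)
    top = np.argpartition(-scores, k - 1)[:k]     # unordered top-k indices
    return A[:, top] @ H[top, :]

# Toy comparison against the exact product, with heavy-tailed column
# importances so that a few contributors dominate.
rng = np.random.default_rng(0)
weights = 1.0 / np.arange(1, 1025)
A = rng.standard_normal((256, 1024)) * (rng.random((256, 1024)) < 0.05) * weights
H = rng.standard_normal((1024, 64))
exact = A @ H
for k in (64, 256, 512):
    rel = np.linalg.norm(topk_approx_matmul(A, H, k) - exact) / np.linalg.norm(exact)
    print(k, round(rel, 3))    # relative error shrinks as the budget k grows
```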

Summary Table: Key Randomized Sparse Update Applications

| Domain | Key Method(s) | Notable Features |
|---|---|---|
| Linear systems | SRK, RSKA, SSKM, GERK | Support-guessing, weighted projections |
| Deep learning | Hash-based AS, Bernoulli masks | LSH-accelerated passes, communication sparsity |
| GNNs | RSC | Resource-constrained SpMM, layerwise budget |
| Neural PDEs | RSNG (random-mask Galerkin) | Subnetwork updates, regularization |
| FL/Distributed | Rotating stochastic masks | Unbiased sparse averaging, entropy coding |

4. Theoretical Guarantees: Unbiasedness, Convergence, and Error Bounds

Central to randomized sparse update methods are formal guarantees on unbiasedness and convergence:

  • Unbiasedness: For randomized sparse estimators $S_r(v)$, $\mathbb{E}[S_r(v)] = v$ holds by construction for pivotal and Bernoulli sampling, ensuring that statistical estimates or gradients remain correct in expectation (Weare et al., 2023, Berman et al., 2023, Liu et al., 2022).
  • Convergence rates: For convex and strongly convex objectives (linear systems, basis pursuit, Galerkin flows), expected linear convergence is established with explicit rate constants depending on data geometry, batch size, regularization, and sparsity level (Schöpfer et al., 2016, Tondji et al., 2022, Yuan et al., 2021, Liu et al., 2022). In particular, error bounds often decompose into bias (deterministic contraction) and variance (controlled by $1/\sqrt{r}$ or faster decay in the sparsity budget $r$); a small numerical check of this behavior follows this list.
  • Error-vs-sample-size tradeoffs: For example, in approximating SpMM on graphs, top-$k$ sampling yields error that decays as $O(1/\sqrt{k})$ (or better), and empirical performance matches theoretical predictions as long as top contributors dominate (Liu et al., 2022). In high-dimensional settings, bias scales with the exponential contraction of the fixed-point map, while variance is dimension-independent under optimal sparsification (Weare et al., 2023).
  • Robustness to noise and stochasticity: For inconsistent systems or impulsive noise, block and dual-variable variants retain linear expected convergence to within a noise-dependent neighborhood, with $O(\|\text{noise}\|^2)$ contributions (Schöpfer et al., 2022, Tondji et al., 2022).
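
As noted in the convergence-rate item above, here is a small self-contained check of the unbiasedness and variance-decay properties, using a plain Bernoulli-with-rescaling sparsifier (a simplified relative of the operator sketched in Section 1); the dimensions, budgets, and number of Monte Carlo draws are arbitrary illustrative choices.

```python
import numpy as np

def bernoulli_sparsify(v, r, rng):
    """Keep entry i with probability p_i = min(1, r*|v_i| / ||v||_1) and
    rescale survivors by 1/p_i, so that E[output] = v by construction."""
    p = np.minimum(1.0, r * np.abs(v) / np.abs(v).sum())
    keep = rng.random(v.size) < p
    out = np.zeros_like(v)
    out[keep] = v[keep] / p[keep]
    return out

rng = np.random.default_rng(1)
v = rng.standard_normal(2000)
for r in (50, 200, 800):
    draws = np.stack([bernoulli_sparsify(v, r, rng) for _ in range(500)])
    bias = np.linalg.norm(draws.mean(axis=0) - v)           # Monte Carlo bias estimate
    rmse = np.sqrt(((draws - v) ** 2).sum(axis=1).mean())   # typical per-draw error
    print(f"r={r}: bias~{bias:.1f}, per-draw error={rmse:.1f}")
# The bias estimate shrinks with more draws (the operator is exactly unbiased),
# while the per-draw error decays roughly like 1/sqrt(r) as the budget grows.
```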

5. Implementation Strategies and Practical Performance

Implementing randomized sparse update schemes depends critically on efficient support selection, batching, and hardware-parallel capabilities:

  • Efficient sampling: Top-$k$ or magnitude-based selection is often executed via partial sorting and is efficiently implementable on GPU architectures. For Bernoulli masks, random number generation is entirely parallel (Liu et al., 2022, Berman et al., 2023).
  • Caching and epoch-wise strategies: For sparse GNN adjacency matrices, caching sampled supports across several epochs leverages the empirical observation that active sets exhibit high temporal consistency (over 90% overlap for $T \approx 10$ consecutive steps), saving up to 50% of the sampling overhead with negligible impact on test accuracy (Liu et al., 2022); a simplified caching sketch follows this list.
  • Parallel execution: Mini-batch and hashing-based methods update many coordinates in parallel at roughly the cost of updating one. On 56-core multicore setups, 5% active neurons yield up to a 31× reduction in training time per epoch (Spring et al., 2016). Sparse update patterns naturally admit Hogwild!-style asynchronous stochastic optimization.
  • Empirical speedup and accuracy profiles: Randomized sparse update methods deliver operation-level speedups of up to 11.6× for single sparse operations and 1.3–1.6× end-to-end speedups in GNN training at a 10% resource budget, with test accuracy loss under 0.3% (Liu et al., 2022). In neural Galerkin schemes for PDEs, random sparse updating is up to 100× faster, or 100× more accurate at fixed runtime, versus dense-update baselines (Berman et al., 2023).
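
Here is a simplified sketch of the partial-sort-plus-caching pattern referenced above: top-$k$ support selection via np.argpartition, with the support recomputed only every few calls and reused in between. The refresh period, drift model, and overlap diagnostic are illustrative assumptions rather than the exact policy of the cited work.

```python
import numpy as np

def topk_support(scores, k):
    """Indices of the k largest-magnitude scores via partial sorting."""
    return np.argpartition(-np.abs(scores), k - 1)[:k]

class CachedTopK:
    """Epoch-wise caching of a top-k support.

    The support is recomputed only every `refresh_every` calls and reused
    in between, exploiting the temporal stability of the active set.
    """
    def __init__(self, k, refresh_every=10):
        self.k, self.refresh_every = k, refresh_every
        self.support, self.calls = None, 0

    def select(self, scores):
        if self.support is None or self.calls % self.refresh_every == 0:
            self.support = topk_support(scores, self.k)   # cache miss: re-select
        self.calls += 1
        return self.support

# How well does a stale support track the exact top-k under slow drift?
rng = np.random.default_rng(2)
scores = np.abs(rng.standard_normal(10000)) / np.arange(1, 10001)
selector = CachedTopK(k=500, refresh_every=10)
for step in range(30):
    scores *= np.exp(0.02 * rng.standard_normal(scores.size))   # slow drift
    cached = selector.select(scores)
    overlap = len(set(cached) & set(topk_support(scores, 500))) / 500
    if step % 10 == 9:     # report at the stalest point of each cache window
        print(f"step {step}: overlap with exact top-k = {overlap:.2f}")
```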

6. Extensions, Limitations, and Open Directions

Recent work highlights ongoing extensions and some limitations:

  • Block/generalized proximal updates: Randomized sparse updates generalize to primal-dual splitting, block-coordinate stochastic mirror descent, and distributed settings, using variance-reduced under-relaxation and unbiased randomization of complex primitive operators (Condat et al., 2022).
  • Nonuniform and adaptive sampling: Potential exists to further optimize sampling schedules (e.g., importance sampling based on row/feature norms), adapt batch sizes, and exploit momentum or variance reduction (Tondji et al., 2022).
  • Handling model and data heterogeneity: In federated learning and neural PDE scenarios, randomization and sparsity serve the dual role of reducing computation and providing statistical regularization, but require careful tuning to prevent underfitting in highly heterogeneous settings (Berman et al., 2023, Isik et al., 2022).
  • Robustness and convergence: Although theoretical guarantees are well developed in convex settings, extensions to nonconvex objectives, rich data geometries, or weak/partial support estimation remain open, particularly in deep learning and over-parameterized models.

Randomized sparse updates have thus become a foundational principle for scalable computation in modern numerical and machine learning pipelines, enabling resource-aware, parallelizable, and theoretically controlled algorithms for a broad range of high-dimensional problems.
