
Sparse Maximal Update Parameterization (SμPar)

Updated 7 February 2026
  • Sparse Maximal Update Parameterization (SμPar) is a framework that defines invariant scaling laws to maintain consistent activations, gradients, and updates in sparse and distributed neural network training.
  • It employs precise scaling of weight initialization and layerwise learning rates to counteract vanishing signal issues in extreme sparsity regimes.
  • Empirical results show that SμPar reduces communication, memory, and computational overhead while yielding improved loss and perplexity in large-scale models.

Sparse Maximal Update Parameterization (SμPar) encompasses two distinct but closely related frameworks for optimizing large-scale neural networks under constraints of sparsity and distributed computation. SμPar unifies parameterization strategies for static-sparse architectures and compute/communication-efficient partitioned training, with the central goal of maintaining stable training dynamics and achieving high performance even under extreme sparsity or distributed-memory limitations (Dey et al., 2024, Filippova et al., 26 Sep 2025).

1. Definitions and Generalized-μ Desiderata

SμPar is rooted in the Maximal Update Parameterization (μP) paradigm, which dictates initialization and update scaling to preserve stable signal propagation as a function of network width. SμPar generalizes these invariance properties to both width and arbitrary sparsity. For a linear layer or subblock at depth $l$, let the input $X^{(l)} \in \mathbb{R}^{B \times d^{l-1}}$, weights $W^{(l)} \in \mathbb{R}^{d^{l-1} \times d^{l}}$, and sparsity mask $M^{(l)} \in \{0,1\}^{d^{l-1} \times d^{l}}$ of density $\rho$ (sparsity $s = 1 - \rho$) yield the output

$$X^{(l+1)} = X^{(l)} \left(W^{(l)} \odot M^{(l)}\right).$$

For SμPar, the Frobenius norms

$$\|X^{(l)}\|_F, \quad \|\nabla X^{(l)}\|_F, \quad \|\Delta X^{(l)}\|_F$$

are kept invariant under changes in width ($m_d$) and density ($m_\rho$), where $m_d = d^l/d^l_{\mathrm{base}}$ and $m_\rho = \rho/\rho_{\mathrm{base}}$. This “generalized-μ desideratum” ensures that activations, gradients, and parameter updates are independent of architecture scale and sparsity (Dey et al., 2024).

2. Mathematical Formulation and Parameter Scaling

SμPar prescribes the following scaling for hidden linear projections at layer $l$:

  • Weight initialization: $W^{(l)}_{ij} \sim \mathcal{N}\!\left(0,\; \sigma_{\mathrm{base}}^2 / (m_{d^{l-1}} m_{\rho})\right)$.
  • Layerwise learning rate: $\eta^{(l)} = \eta_{\mathrm{base}} / (m_{d^{l-1}} m_{\rho})$.

Embedding and layer-norm parameters remain dense and follow the μP prescription. This scaling counters the vanishing-signal effect caused by increased sparsity, maintaining well-propagated signals even near the extreme sparsity regime. For AdamW or SGD, the per-layer learning rates carry the same scaling, yielding sparsity-invariant optimization steps (Dey et al., 2024).

A practical implementation measures the width multiplier $m_d$ and density multiplier $m_\rho$ per layer and adjusts both $\sigma$ and $\eta$ as above. For unstructured static sparsity with a random mask $M^{(l)}$, this is the unique scaling under which all layers keep propagating activations, gradients, and updates at a constant scale (Dey et al., 2024).
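As a concrete illustration, the two scaling rules can be sketched as follows. This is a minimal sketch, not the reference implementation; the function and argument names are illustrative, and the base constants are the values reported in Section 6.

```python
import numpy as np

def supar_layer(fan_in, fan_out, density, fan_in_base, density_base,
                sigma_base=0.087, eta_base=2**-6, rng=None):
    """Initialize one static-sparse linear layer under SμPar scaling.

    Both the initialization variance and the layerwise learning rate
    are divided by m_d * m_rho, where m_d = fan_in / fan_in_base and
    m_rho = density / density_base.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    m_d = fan_in / fan_in_base
    m_rho = density / density_base
    # Weight init: N(0, sigma_base^2 / (m_d * m_rho))
    std = sigma_base / np.sqrt(m_d * m_rho)
    W = rng.normal(0.0, std, size=(fan_in, fan_out))
    # Static random unstructured mask of the requested density
    M = (rng.random((fan_in, fan_out)) < density).astype(W.dtype)
    # Layerwise learning rate: eta_base / (m_d * m_rho)
    eta = eta_base / (m_d * m_rho)
    return W * M, M, eta
```

For example, widening the base fan-in 4x while dropping the density to 1/8 gives $m_d m_\rho = 0.5$, so the layer's learning rate doubles relative to the base.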

3. Subset Selection and Distributed Training

In a distributed setting, SμPar enables partial parameter updates to minimize communication and memory cost. Consider a model parameter vector $\theta \in \mathbb{R}^d$ partitioned among $K$ nodes, each node $k$ training on a subset $S_k \subset \{1,\dotsc,d\}$. The support operator $P_{S_k}$ retains only the coordinates in $S_k$. During $H$ local steps, each node performs gradient updates restricted to $S_k$ while the remaining weights are frozen. After $H$ steps, each node emits a sparse delta update (nonzero only on $S_k$), which is all-reduced and normalized across nodes using the count vector $m_i = |\{k : i \in S_k\}|$.

Subset selection is controlled via a “slice count” hyperparameter $N$ (with $N$ dividing $K$):

  • MLP-only slicing partitions each MLP matrix into $N$ contiguous blocks, activating only the $n$-th slice ($n = k \bmod N$) per node, while keeping embeddings, attention, and norms fully present in every $S_k$.
  • MLP + head slicing additionally partitions attention heads and assigns head groups to nodes.
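Under the stated assumptions (contiguous column blocks, round-robin assignment by $n = k \bmod N$), the MLP-only slice assignment can be sketched as a small helper; the name `mlp_slice` is hypothetical:

```python
def mlp_slice(k, N, mlp_cols):
    """Return the column range of an MLP matrix assigned to node k.

    With N slices, node k trains only slice n = k mod N; embeddings,
    attention, and norms are assumed fully replicated on every node.
    Assumes N divides mlp_cols for simplicity.
    """
    n = k % N
    block = mlp_cols // N
    return range(n * block, (n + 1) * block)
```

Nodes $k$ and $k + N$ therefore train the same slice, which is what makes $N$ (rather than $K$) the knob controlling the trainable fraction $f$.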

This design allows tradeoffs between $f$, the per-node trainable parameter fraction, and communication/computational savings. For example, $N = 4$ leads to $f \approx 0.52$ for Transformers with 1.3B parameters, meaning each node trains only about 52% of the parameters (Filippova et al., 26 Sep 2025).

4. Communication, Memory, and Compute Efficiency

SμPar substantially reduces communication cost by transmitting only $fM$ bytes (where $M = d\,b_g$ and $b_g$ is the bytes per parameter) once every $H$ inner steps. The amortized per-step communication cost drops by a factor of $f/H$ relative to full-gradient DDP:

$$T_{\mathrm{comm}}^{\mathrm{S\mu Par}} = T_{\mathrm{comm}}^{\mathrm{full}} \times \frac{f}{H}.$$

For $K = 32$, $H = 100$, $f = 0.25$, bandwidth use per step is reduced to a factor of $0.25/100 = 0.0025$ of full synchronization.
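The amortized-cost arithmetic can be made explicit with a tiny helper (illustrative name, not from the paper):

```python
def amortized_comm_bytes(d, bytes_per_param, f, H):
    """Bytes communicated per training step, amortized over H local
    steps: a sparse delta of f*M bytes is sent once every H steps,
    where M = d * b_g is the size of one full parameter sync.
    """
    full_sync = d * bytes_per_param
    return f * full_sync / H
```

With $f = 0.25$ and $H = 100$, each step costs 0.25% of what a per-step full-gradient sync would, matching the factor in the text.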

Peak memory usage also improves, since frozen parameters require no gradient buffers or optimizer states: $M_{\mathrm{S\mu Par}} = W + f(G + S) + O$ versus $M_{\mathrm{DDP}} = W + G + S + O$, yielding a reduction of up to 47% in reported experiments (from 19.4 GB down to 10.4 GB for a 1.3B-parameter model) (Filippova et al., 26 Sep 2025).
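The memory formula is easy to evaluate for a hypothetical component breakdown; the GB figures below are illustrative placeholders, not the paper's actual measurements.

```python
def peak_memory(W, G, S, O, f=1.0):
    """Peak memory under partial-parameter training.

    W: weights, G: gradient buffers, S: optimizer states,
    O: activations and other overhead, f: trainable fraction.
    f = 1 recovers the full-gradient DDP formula W + G + S + O.
    """
    return W + f * (G + S) + O
```

For example, with illustrative sizes W=4, G=4, S=8, O=2 (GB) and $f = 0.25$, peak memory falls from 18 GB to 9 GB, a 50% reduction of the same order as the reported 47%.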

The backward pass computes parameter-gradient matmuls only for active weights, reducing backward and optimizer-update FLOPs in proportion to $f$. Across all $H$ local steps, total training FLOPs are reduced by approximately 15% compared to prior low-communication methods at equivalent perplexity (Filippova et al., 26 Sep 2025).

5. Theoretical Properties and Training Dynamics

SμPar is equivalent to a distributed block-coordinate descent algorithm in the partitioned distributed setting. Rotating access to the $N$ slices ensures all parameters participate in the gradient trajectory, and unbiasedness is preserved by normalizing update deltas by the overlap count vector $m$. Given smoothness of the loss, convergence rates approach those of full-gradient SGD, up to a $1/N$ factor (Filippova et al., 26 Sep 2025).

In the static sparsity regime, SμPar uniquely satisfies the extended μ-desiderata: invariance of activation, gradient, and update norms with both width and sparsity. Empirically, SμPar avoids vanishing signal propagation even at density $2^{-7}$ (sparsity $> 99\%$), unlike standard μP and SP. This leads to increased compute-efficiency and training stability across a range of sparsity levels (Dey et al., 2024).
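The vanishing-signal effect is easy to verify numerically. In the sketch below (illustrative, assuming a random unstructured mask and a base density of 1), a standard fan-in-only initialization lets the output norm of a sparse layer collapse by roughly $\sqrt{\rho}$, while additionally dividing the init variance by $m_\rho$ keeps it approximately constant:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rho = 2048, 2**-7                 # width and density (sparsity > 99%)
X = rng.normal(size=(8, d))

# Standard init: variance 1/d, ignoring sparsity
W_std = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
# SμPar-style: variance additionally divided by m_rho = rho
W_sup = rng.normal(0.0, 1.0 / np.sqrt(d * rho), size=(d, d))
M = rng.random((d, d)) < rho          # static random unstructured mask

# Ratio of output to input Frobenius norm for each parameterization
r_std = np.linalg.norm(X @ (W_std * M)) / np.linalg.norm(X)
r_sup = np.linalg.norm(X @ (W_sup * M)) / np.linalg.norm(X)
# r_std shrinks toward sqrt(rho) ~ 0.09; r_sup stays near 1
```

This mirrors, at toy scale, the invariance property the desiderata formalize.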

6. Hyperparameter Transfer and Practical Implementation

Because SμPar establishes exact scaling laws, optimal hyperparameters for learning rate and initialization variance can be tuned once on a small dense μP proxy model and then transferred, without modification, to large sparse models across a wide spectrum of widths and densities. For all tested sparsity regimes, SμPar achieves optimal training curves with fixed base learning rate ($\approx 2^{-6}$) and initialization variance ($\approx 0.087^2$), directly transferable to large-scale sparse Transformers (Dey et al., 2024).

Minimal implementations require only:

  • Per-layer initialization: $W_{ij} \sim \mathcal{N}\!\left(0,\; \sigma^2_{\mathrm{base}}/(m_d m_\rho)\right)$
  • Per-layer learning rate: $\eta = \eta_{\mathrm{base}}/(m_d m_\rho)$
  • Standard AdamW/SGD optimization
  • Static mask definition for unstructured sparsity
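Putting the bullet points together, the per-layer learning rates map naturally onto optimizer parameter groups. This is a framework-agnostic sketch: the `layers` structure and its fields are hypothetical, and the resulting list has the shape accepted by common optimizers' param-group APIs.

```python
def supar_param_groups(layers, eta_base=2**-6):
    """Build one optimizer parameter group per layer, scaling the
    base learning rate by 1 / (m_d * m_rho) as SμPar prescribes.

    `layers` is a list of dicts with keys: "params", "m_d", "m_rho".
    Dense embedding/norm layers pass m_rho = 1 (the plain μP rule).
    """
    return [
        {"params": layer["params"],
         "lr": eta_base / (layer["m_d"] * layer["m_rho"])}
        for layer in layers
    ]
```

A standard AdamW or SGD optimizer constructed from these groups then applies the sparsity-invariant steps of Section 2 with no further changes.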

Implementation and configuration examples are available in open-source repositories, covering initialization, training loops, and optimizer setup (Dey et al., 2024).

7. Empirical Results and Extensions

SμPar demonstrates favorable scaling and convergence characteristics. In large-scale GPT-like Transformers with static random unstructured sparsity up to 99.2%, SμPar yields up to 11.9% relative loss improvement over standard parameterization, and 8.2% improvement over μP in compute-optimal “Chinchilla” settings. For 1.3B-parameter models distributed across 32 nodes, SμPar achieves similar perplexity to streaming DiLoCo under equal token and bandwidth budgets, while reducing peak memory consumption up to 47% and training FLOPs by ~15% (Dey et al., 2024, Filippova et al., 26 Sep 2025).

Potential extensions include dynamic re-assignment of parameter slices for improved freshness, quantization or compression of the sparse deltas to further reduce communication, asynchronous or overlapped communication strategies, and alternative slice geometries such as spectral or low-rank partitions.

SμPar provides a unified, theory-driven parameterization and update framework for both static-sparse and distributed partitioned training regimes, enabling efficient large-model training under resource constraints and establishing robust transfer rules for hyperparameters and initialization. This has broad implications for the tractable scaling of neural LLMs on commodity hardware and the reliable deployment of highly sparse deep learning architectures (Dey et al., 2024, Filippova et al., 26 Sep 2025).
