
Split Weight Decay in Deep Learning

Updated 14 June 2026
  • Split Weight Decay is a regularization method that decomposes weight updates into radial and tangential components to better tailor decay.
  • It utilizes methods such as AdamO, AlphaDecay, and SPD to apply selective decay based on parameter geometry, module structure, and fine-tuning dynamics.
  • Empirical results show that split decay leads to smoother optimization, improved accuracy on benchmarks like CIFAR-100, and enhanced out-of-distribution performance.

Split weight decay encompasses a family of regularization techniques in deep learning in which the traditional isotropic (uniform) decay penalty is decomposed, modulated, or selectively applied based on parameter geometry, module structure, or step-by-step dynamics. Unlike standard approaches that penalize all weights equally in $\ell_2$ norm, split weight decay exploits architectural symmetries and empirical evidence that meaningful regularization should often act along specific subspaces (e.g., the radial direction of the weights), be tailored module-wise, or be switched on adaptively. This paradigm underlies advances such as orthogonally decoupled optimizers, layerwise spectral-adaptive schemes, and geometry-aware decay for fine-tuning, with reported gains in generalization and hyperparameter robustness, and sometimes sparser or more robust solutions, in both vision models and LLMs.

1. Geometric Motivation: Decomposition of Weight Dynamics

Recent analysis shows that standard decoupled weight decay (such as AdamW) induces a "Radial Tug-of-War" in adaptive optimization. For a parameter vector $w \in \mathbb{R}^d$ and gradient $g = \nabla L(w)$, gradient steps often increase the norm $\|w\|_2$ (expanding model capacity), while decay attempts to shrink $\|w\|_2$. This interaction injects high-variance radial oscillations into the adaptive optimizer's statistics, particularly corrupting second-moment estimates and thus impairing feature learning in the tangential (orthogonal) directions (Chen et al., 4 Feb 2026).

To resolve this, the weight-update vector is decomposed into radial (parallel to $w$) and tangential (orthogonal) components:

$$u := \frac{w}{\|w\|_2},\quad g_r := \langle g, u \rangle u,\quad g_t := g - g_r$$

The update can be further cast as separate projections:

$$\phi_r^{(w)}(z) := \frac{\langle z, w \rangle}{\langle w, w \rangle} w,\quad \phi_{\phi}^{(w)}(z) := z - \phi_r^{(w)}(z)$$

where $g_r = \phi_r^{(w)}(g)$ and $g_t = \phi_{\phi}^{(w)}(g)$.
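The radial and tangential projections can be checked numerically. The following is a minimal NumPy sketch; the function name `split_radial_tangential` and the small `eps` guard against a zero-norm weight are illustrative assumptions, not part of the cited work.

```python
import numpy as np

def split_radial_tangential(w, z, eps=1e-12):
    """Split z into a part parallel to w (radial) and a part orthogonal to w (tangential)."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    # Projection coefficient <z, w> / <w, w>; eps is an assumed guard for w = 0.
    coeff = np.dot(z, w) / (np.dot(w, w) + eps)
    radial = coeff * w
    tangential = z - radial
    return radial, tangential

w = np.array([3.0, 4.0])
g = np.array([1.0, 2.0])
g_r, g_t = split_radial_tangential(w, g)
```

By construction the two parts sum back to the gradient, and the tangential part has zero inner product with `w` up to floating-point error.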

2. Algorithmic Instantiations: Radial-only and Module-wise Decay

Radial-only decay (AdamO):

The full optimizer maintains distinct moment buffers for the decomposed subspaces and applies decay exclusively along the radial direction. At each step, the AdamO algorithm updates as follows (Chen et al., 4 Feb 2026):

  • Radial direction: Simple SGD update with its own learning rate (adaptive to local curvature), no adaptive moment.
  • Tangential direction: Adam-style preconditioning (first and second moments) with a fixed learning rate.
  • Decay: Applied strictly in the radial component. The tangential direction is untouched by decay.

For biases, LayerNorm, or small dimensionalities, a default Adam-style update is used. For scale-invariant layers (e.g., BatchNorm, LayerNorm), the radial step and decay are omitted.
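The radial-only scheme described above can be sketched for a single weight vector. This is a simplified NumPy illustration rather than the authors' implementation: the class name `RadialDecaySketch`, the hyperparameter names and defaults, the re-projection of the preconditioned step, and the choice to scale decay by the radial learning rate are all assumptions, and the curvature-adaptive radial learning rate is replaced by a constant.

```python
import numpy as np

class RadialDecaySketch:
    """Toy optimizer: Adam-style tangential step, plain SGD radial step with decay."""

    def __init__(self, dim, lr_tan=1e-3, lr_rad=1e-2, weight_decay=1e-2,
                 betas=(0.9, 0.999), eps=1e-8):
        self.lr_tan, self.lr_rad = lr_tan, lr_rad   # assumed constant rates
        self.weight_decay = weight_decay
        self.b1, self.b2 = betas
        self.eps = eps
        self.m = np.zeros(dim)   # first moment of the tangential gradient
        self.v = np.zeros(dim)   # second moment of the tangential gradient
        self.t = 0

    def step(self, w, g):
        self.t += 1
        u = w / (np.linalg.norm(w) + 1e-12)   # unit radial direction
        g_r = np.dot(g, u) * u                # radial gradient
        g_t = g - g_r                         # tangential gradient

        # Adam moments track the tangential gradient only.
        self.m = self.b1 * self.m + (1 - self.b1) * g_t
        self.v = self.b2 * self.v + (1 - self.b2) * g_t ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        tan_step = m_hat / (np.sqrt(v_hat) + self.eps)
        # Elementwise scaling can reintroduce a radial part, so project again (assumption).
        tan_step = tan_step - np.dot(tan_step, u) * u

        # Radial part: SGD on g_r plus decay; the decay term is parallel to w.
        rad_step = g_r + self.weight_decay * w
        return w - self.lr_tan * tan_step - self.lr_rad * rad_step

opt = RadialDecaySketch(dim=2)
w = np.array([3.0, 4.0])
g = np.array([4.0, -3.0])   # orthogonal to w, so purely tangential
w_next = opt.step(w, g)
```

With a purely tangential gradient, the only change in the norm direction comes from the decay term, which is the separation the method aims for.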

Module-wise decay (AlphaDecay):

Instead of uniform decay, AlphaDecay estimates the spectral "heavy-tailedness" of each module's weight correlation matrix via the empirical spectral density and the Hill estimator of the power-law tail index $\alpha$ (He et al., 17 Jun 2025). Modules with smaller $\alpha$ (heavier tails, more principal directions) receive weaker decay, while lighter-tailed modules get stronger decay, with each module's rate obtained by interpolation. This schedule is updated at a fixed interval of steps and implemented via per-module parameter groups in the optimizer.

Selective per-layer decay (SPD):

During fine-tuning of foundation models, split decay may be applied only to layers whose update directions are inconsistent with previous progress (Tian et al., 2024). For each layer:

  • Compute a criterion that compares the current update direction with the layer's progress so far.
  • If the criterion flags the layer as "over-exploring", decay is applied proportional to the fractional overshoot radius; otherwise, no shrinkage.
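The two steps above might look roughly as follows for one layer. This is a hedged sketch and not the formulation from Tian et al.: the function name `selective_decay_step`, the sign test used as the criterion, the definition of the overshoot ratio, and the `strength` parameter are all assumptions chosen for illustration. The proposed update is passed in separately so that it can come from any optimizer.

```python
import numpy as np

def selective_decay_step(w, w0, g, update, strength=1.0):
    """Apply `update`, then pull back toward w0 only if the layer is flagged.

    w      : current weights of the layer
    w0     : pre-trained weights of the layer
    g      : current loss gradient
    update : step proposed by the base optimizer (e.g. with momentum)
    """
    w_new = w + update
    # Assumed criterion: the descent direction -g disagrees with progress w - w0.
    alignment = np.dot(-g, w - w0)
    if alignment >= 0:
        return w_new, 0.0   # consistent with progress: no shrinkage

    r_old = np.linalg.norm(w - w0)
    r_new = np.linalg.norm(w_new - w0)
    # Assumed "fractional overshoot": growth in radius relative to the new radius.
    ratio = max(0.0, r_new - r_old) / (r_new + 1e-12)
    return w_new - strength * ratio * (w_new - w0), ratio

w0 = np.zeros(2)
w = np.array([1.0, 0.0])
outward = np.array([0.5, 0.0])   # e.g. momentum still pushing away from w0

# Gradient now points outward, so descent points back toward w0: flagged.
w_flagged, ratio = selective_decay_step(w, w0, np.array([1.0, 0.0]), outward)
# Gradient agrees with the progress so far: left alone.
w_free, _ = selective_decay_step(w, w0, np.array([-1.0, 0.0]), outward)
```

In the flagged case the extra radius gained by the step is removed, while the unflagged case keeps the optimizer's step unchanged.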

3. Theoretical and Practical Implications

Separation of directions:

AdamO and related splits address the inherent conflict between magnitude and directional regularization. By confining decay to the radial subspace and reserving tangential dynamics for feature learning (via moment-preconditioned adjustment), split decay stabilizes norm oscillations, leading to smoother loss and improved effective capacity utilization (Chen et al., 4 Feb 2026).

Spectral adaptation:

Module-wise split decay leverages heavy-tailed self-regularization theory: networks empirically display highly variable principal spectra across different attention and MLP blocks. Uniform decay can inadvertently suppress useful high-variance modes, while AlphaDecay sets each module's regularization strength from its empirically measured spectral exponent (He et al., 17 Jun 2025).

Fine-tuning robustness:

Selective decay (SPD) restricts shrinkage to layers that depart from their pre-trained trajectory, enhancing retention of in-domain and out-of-distribution model behavior during fine-tuning (Tian et al., 2024). This splits regularization power across parameter groups automatically, balancing flexibility with strong constraint where needed.

Algorithmic realization:

Split decay methods can be implemented using projection operators, grouped parameter schedules, or selective masking within a standard optimizer framework compatible with PyTorch or TensorFlow (Chen et al., 4 Feb 2026, He et al., 17 Jun 2025, Tian et al., 2024).

4. Empirical Results and Benchmark Performance

AdamO (radial-only):

On CIFAR-100 (ResNet-18, BatchNorm, 300 epochs), AdamO achieves 79.74 ± 0.09% accuracy, outperforming AdamW (74.75 ± 0.15%) by ≈5 points. Removing any split component (projection, dimension-aware rule, or curvature adaptation) significantly degrades performance, collapsing results to the AdamW regime if no split decay is used. AdamO exhibits smoother optimization trajectories and greater hyperparameter robustness (Chen et al., 4 Feb 2026).

AlphaDecay (module-wise):

For LLaMA-family LLMs (60M–1B parameters), perplexity improves over uniform decay: 3.0% lower on LLaMA-60M, with consistent but smaller gains at larger scale (0.8% at 1B) (He et al., 17 Jun 2025).

SPD (selective, fine-tuning):

On DomainNet and ImageNet variants, SPD consistently reduces parameter deviation by 3–5× vs. AdamW and boosts out-of-distribution (OOD) accuracy by 5–10 points. In PEFT settings, SPD improves commonsense QA scores for LLaMA-7B/13B by 1–2 points (Tian et al., 2024).

| Optimizer/Method | CIFAR-100 Acc (%) | OOD Acc (%) (DomainNet) | Perplexity (LLaMA-60M) |
| --- | --- | --- | --- |
| Adam | 74.48 ± 0.12 | | 32.56 |
| AdamW | 74.75 ± 0.15 | 39.3 | 32.56 |
| AdamO (full) | 79.74 ± 0.09 | | |
| SPD (DomainNet) | | 45.9 | |
| AlphaDecay (LLMs) | | | 31.58 |

5. Algorithmic Details and Pseudocode

AdamO Split Weight Decay Core Loop:

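Each step splits the gradient into radial and tangential parts, applies a plain SGD step with decay along the radial part, and applies an Adam-style preconditioned step along the tangential part.

AlphaDecay Module-wise Update (high-level):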

At a fixed interval of steps, for each module:

  • Compute eigenvalues of the module's weight correlation matrix.
  • Estimate power-law exponent $\alpha$ with Hill estimator.
  • Interpolate decay rate and assign to parameter group in optimizer.
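The three steps above can be sketched in NumPy as follows. The helper names `hill_alpha` and `assign_decay`, the use of $W^\top W$ as the correlation matrix, the choice of the top half of the spectrum as the tail (`k_frac=0.5`), the linear interpolation, and the scaling bounds `s_min` and `s_max` are assumptions for illustration; the values used in the cited work are not reproduced here.

```python
import numpy as np

def hill_alpha(weight, k_frac=0.5):
    """Hill estimate of the power-law tail index of the eigenvalues of W^T W.

    Assumes at least two distinct positive eigenvalues.
    """
    eigs = np.linalg.eigvalsh(weight.T @ weight)   # ascending order
    eigs = eigs[eigs > 1e-12]                      # drop numerically zero modes
    n = len(eigs)
    k = min(max(1, int(n * k_frac)), n - 1)        # tail size (assumed fraction)
    tail = eigs[n - k:]                            # k largest eigenvalues
    ref = eigs[n - k - 1]                          # reference just below the tail
    return 1.0 + k / np.sum(np.log(tail / ref))

def assign_decay(alphas, base_decay, s_min=0.5, s_max=1.5):
    """Map each module's alpha linearly to a decay rate.

    The smallest alpha (heaviest tail) gets s_min * base_decay,
    the largest alpha gets s_max * base_decay.
    """
    names = list(alphas)
    a = np.array([alphas[name] for name in names], dtype=float)
    span = a.max() - a.min()
    frac = (a - a.min()) / span if span > 0 else np.full_like(a, 0.5)
    scale = s_min + frac * (s_max - s_min)
    return {name: float(base_decay * s) for name, s in zip(names, scale)}

rng = np.random.default_rng(0)
modules = {
    "attn": rng.standard_normal((64, 32)),
    "mlp": rng.standard_t(df=2, size=(64, 32)),
}
alphas = {name: hill_alpha(W) for name, W in modules.items()}
decays = assign_decay(alphas, base_decay=0.1)
```

In PyTorch, rates like these would typically be written into the `weight_decay` entry of each module's parameter group whenever the schedule is refreshed.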

SPD Update (per layer):

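For each layer, the optimizer step is taken as usual. If the layer's update direction is inconsistent with its previous progress, the weights are then shrunk in proportion to the fractional overshoot radius; otherwise they are left unchanged.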

6. Comparison to Standard and Decoupled Weight Decay

Traditional $\ell_2$-regularization, linked to "weight decay," applies a penalty proportional to $\|w\|_2^2$, either as an extra gradient term or as a multiplicative shrinkage after the update. Loshchilov and Hutter (Loshchilov et al., 2017) showed that for adaptive optimizers (e.g., Adam), the $\ell_2$ penalty and explicit decay are not equivalent: only fully decoupled decay (AdamW) keeps the regularization separate from the adaptive gradient scaling.
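The inequivalence is easy to see on a single Adam step. The sketch below is a toy NumPy illustration with assumed values for the learning rate, decay coefficient `lam`, weights, and gradient; `adam_step` is a hypothetical helper written for this example, not a library call.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One bias-corrected Adam step; returns new weights and moment buffers."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, 10.0])    # one small and one large weight
g = np.array([0.1, 0.1])     # same loss gradient on both
lam, lr = 0.1, 0.1
zeros = np.zeros_like(w)

# L2 penalty: lam * w is added to the gradient and then rescaled by Adam.
w_l2, _, _ = adam_step(w, g + lam * w, zeros, zeros, t=1, lr=lr)

# Decoupled decay: Adam sees only the loss gradient, shrinkage is applied separately.
w_adam, _, _ = adam_step(w, g, zeros, zeros, t=1, lr=lr)
w_decoupled = w_adam - lr * lam * w
```

On this first step Adam normalizes each coordinate to roughly unit size, so the penalty term leaves both weights with the same step of about `lr`, whereas decoupled decay shrinks the large weight ten times more than the small one.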

Split weight decay generalizes this separation by further restricting the directions or groups to which decay is applied, providing more nuanced control over optimization geometry. Empirically, isotropic (non-split) decay can suppress beneficial capacity expansion or over-regularize anisotropic modules, outcomes that geometric or selective split decay (AdamO, AlphaDecay, SPD) is designed to avoid (Chen et al., 4 Feb 2026, He et al., 17 Jun 2025, Tian et al., 2024).

7. Limitations, Practical Guidelines, and Outlook

Split weight decay requires geometric projections, module-wise eigenanalysis, or parameter grouping, which adds a small but nonzero computational overhead. For module-wise approaches, measuring spectral densities periodically can increase cost, but sparse update intervals (e.g., every 500 steps) keep this manageable in practice (He et al., 17 Jun 2025). For finer-grained architectures or decentralization across large-scale pretraining, further research may optimize trade-offs between regularization selectivity and compute efficiency.

Hyperparameter guidelines:

  • For AdamO, the main hyperparameters are set by grid search (Chen et al., 4 Feb 2026).
  • For AlphaDecay, per-module decay scaling factors are kept within a bounded range (He et al., 17 Jun 2025).
  • For SPD, the projection strength is tuned within a typical range, and a single default value is reported to be robust (Tian et al., 2024).

Split weight decay continues to evolve as neural architectures grow, drawing on geometric, spectral, and empirical insights. It provides a principled approach to directional, modular, and adaptive regularization, with quantifiable improvements in generalization and robustness over isotropic schemes.
