Gradient Norm Increase Phenomenon
- Gradient Norm Increase Phenomenon is defined as the non-monotonic, sometimes spiking, behavior of gradient magnitudes in deep networks that can lead to unstable training.
- It manifests in various regimes such as layer-wise vanishing/exploding gradients, sharp instability at large learning rates, and heterogeneous per-layer and late-training behavior across CNNs and LLMs.
- Normalization techniques like Backward Gradient Normalization, Gradient Autoscaled Normalization, and adaptive weight decay offer practical solutions to control gradient norms and boost performance.
The gradient norm increase phenomenon describes various regimes in deep learning optimization where the magnitude of the backpropagated or instantaneous gradient exhibits non-monotonic growth, spikes, or sustained elevation, rather than the expected decay with depth, training time, or parameter norm. This complex behavior manifests in several contexts: (i) layer-wise vanishing/exploding gradients in deep networks, (ii) transient increases in gradient standard deviation during early or late training, and (iii) instability near critical learning-rate thresholds. These behaviors undermine stable optimization, frustrate convergence, and limit generalization. Multiple recent studies propose normalization methods or corrected update rules that guarantee bounded or well-controlled gradient norms, often with demonstrated empirical benefits.
1. Foundational Mechanisms and Instability
In a standard $L$-layer feed-forward network, the backpropagated gradient at layer $\ell$ follows
$$\delta_\ell = \big(W_{\ell+1}^{\top}\,\delta_{\ell+1}\big) \odot f'(z_\ell),$$
with spectral norms of $W_k$ and $\operatorname{diag}\!\big(f'(z_k)\big)$ possibly greater or smaller than $1$. The resulting norm propagation,
$$\|\delta_\ell\| \;\le\; \|\delta_L\| \prod_{k=\ell+1}^{L} \|W_k\|\,\big\|\operatorname{diag}\!\big(f'(z_k)\big)\big\|,$$
may decay (vanishing gradients) or blow up (exploding gradients) exponentially with depth difference. Vanishing gradients inhibit learning in early layers; exploding gradients induce instability and numerical overflow. This instability is pronounced in deep multilayer perceptrons, especially those without skip connections or normalization layers (Cabana et al., 2021).
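The exponential dependence on depth can be seen directly in a small numerical experiment. The sketch below (an illustrative setup, not taken from Cabana et al., 2021) backpropagates a random top-layer gradient through a plain ReLU MLP and prints the gradient norm at the first and last layers for sub-critical, roughly critical, and super-critical weight scales:

```python
# Minimal numerical sketch (illustrative, not from the cited papers):
# backpropagate a random top-layer gradient through a plain ReLU MLP and
# observe how ||delta_l|| scales with depth for different weight scales.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256

def backprop_norms(scale):
    """Per-layer gradient norms when W ~ N(0, scale^2/width) with ReLU activations."""
    h = rng.standard_normal(width)
    Ws, zs = [], []
    for _ in range(depth):                                 # forward pass
        W = rng.standard_normal((width, width)) * scale / np.sqrt(width)
        z = W @ h
        Ws.append(W); zs.append(z)
        h = np.maximum(z, 0.0)                             # ReLU
    delta = rng.standard_normal(width)                     # arbitrary loss gradient at the top
    norms = []
    for W, z in zip(reversed(Ws), reversed(zs)):           # backward pass
        delta = (W.T @ delta) * (z > 0)                    # chain rule through ReLU
        norms.append(np.linalg.norm(delta))
    return norms[::-1]                                     # norms[0] = earliest layer

# scale < sqrt(2): vanishing; scale ~ sqrt(2) (He init): roughly preserved; scale > sqrt(2): exploding
for scale in (1.0, np.sqrt(2.0), 2.0):
    n = backprop_norms(scale)
    print(f"scale={scale:.3f}: ||delta_1|| = {n[0]:.2e}, ||delta_L|| = {n[-1]:.2e}")
```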
2. Sharp Gradient Norm Transition at Large Learning Rates
Gradient descent with fixed step size $\eta$ on a differentiable loss can be analyzed via its trajectory gradient norm $\|\nabla L(\theta_t)\|$. Classical stable convergence (for $\eta < 2/\beta$, with $\beta$ the gradient Lipschitz constant) ensures $\|\nabla L(\theta_t)\| \to 0$. However, for nonconvex losses and sufficiently large $\eta$, there is a sharp bifurcation: almost no initial points converge, and $\|\nabla L(\theta_t)\|$ remains bounded away from zero, often spiking sharply. For the linear neural network with squared loss, the instability threshold is
$$\eta^{*} = \frac{2}{\lambda_{\max}},$$
where $\lambda_{\max}$ is the largest Hessian eigenvalue at any minimum. Empirical results show that increasing $\eta$ beyond $\eta^{*}$ causes the gradient norm to rise abruptly and maintain high-amplitude oscillations. In deep non-linear networks, similar unstable regimes are observed, consistent with a rigorous dynamical-systems proof (Crăciun et al., 20 Feb 2024).
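A toy experiment makes the bifurcation concrete. The sketch below (an assumed example in the spirit of this analysis, not the exact construction from Crăciun et al., 20 Feb 2024) runs gradient descent on the two-parameter linear network $f(x) = abx$ with squared loss; the minima satisfy $ab = 1$, the flattest of them has largest Hessian eigenvalue $a^2 + b^2 = 2$, so no minimum remains stable once $\eta$ exceeds roughly $1$:

```python
# Toy sketch: GD on L(a, b) = 0.5*(a*b - 1)^2, tracking the final gradient norm
# for step sizes below and above the predicted instability threshold (~1 here).
import numpy as np

def run_gd(eta, steps=5000, a0=1.5, b0=0.5):
    """Run gradient descent and return the final gradient norm (inf if diverged)."""
    a, b = a0, b0
    for _ in range(steps):
        r = a * b - 1.0
        ga, gb = r * b, r * a                  # dL/da, dL/db
        a, b = a - eta * ga, b - eta * gb
        if not (np.isfinite(a) and np.isfinite(b)):
            return np.inf                      # diverged
    r = a * b - 1.0
    return float(np.hypot(r * b, r * a))

# At any minimum a*b = 1 the nonzero Hessian eigenvalue is a^2 + b^2 >= 2,
# so the flattest minimum gives an instability threshold of about 2/2 = 1.
for eta in (0.3, 0.6, 1.2):
    print(f"eta={eta:>4}: final ||grad|| = {run_gd(eta):.3e}")
```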
3. Empirical Characterization in CNNs and LLMs
Empirical analyses in deep convolutional architectures (ResNet-20/56, VGG-16-BN) reveal that per-layer gradient standard deviations can transiently increase by factors of 2–5×, especially in deeper or middle layers during early training. Meanwhile, the global aggregate gradient std decays monotonically. This layered heterogeneity stems from the mixture of skip connections, batch normalization, and non-uniform curvature. In LLM pretraining, a dramatic increase in the global gradient norm is observed in the final phase of training, coinciding with a sharply decaying learning rate and weight norm (Yun, 3 Sep 2025, Defazio, 2 Jun 2025). The sequence of phases includes burn-in, steady-state, and "tail blow-up," in which the global gradient norm diverges.
| Setting | Gradient Norm Profile | Consequence |
|---|---|---|
| Deep MLP, no norm | Exponential decay (vanishing) | No update in early layers |
| Deep MLP, BGN | Flat, constant norm | Stable updates all layers |
| CNNs, training | Per-layer transient spikes | Possible instability, performance loss |
| LLM, late phase | Rapid final blow-up | Higher loss, instability |
4. Theoretical Explanations for Gradient Norm Increase
In LLM training and normalized vision networks, the interaction of weight decay ($\lambda$), normalization layers, and learning rate schedules produces a specific steady-state ratio
$$\frac{\|\nabla L(w_t)\|}{\|w_t\|} \approx \sqrt{\frac{2\lambda}{\eta_t}},$$
where $\eta_t$ is the per-step learning rate. When $\eta_t \to 0$ in the tail, this ratio diverges, causing the optimizer to overshoot and the gradient norm to spike. A similar mechanism is seen in deep nets as the learning rate exceeds the stability edge $2/\lambda_{\max}$: fixed points become repellers, preventing convergence to a zero-gradient solution for almost any initialization (Defazio, 2 Jun 2025, Crăciun et al., 20 Feb 2024). This phenomenon has been formalized using stability analysis at Hessian minima, local invertibility arguments, and the stable manifold theorem.
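A brief heuristic derivation of this steady state (a sketch under the standard assumption of a scale-invariant layer, where the gradient is orthogonal to the weight vector, trained with decoupled weight decay $\lambda$ and step size $\eta_t$):
$$\|w_{t+1}\|^2 = \big\|(1-\eta_t\lambda)\,w_t - \eta_t \nabla L(w_t)\big\|^2 \;\approx\; (1 - 2\eta_t\lambda)\,\|w_t\|^2 + \eta_t^2\,\|\nabla L(w_t)\|^2 .$$
Imposing $\|w_{t+1}\| = \|w_t\|$ at equilibrium balances the shrinkage and growth terms,
$$2\eta_t\lambda\,\|w_t\|^2 \approx \eta_t^2\,\|\nabla L(w_t)\|^2 \quad\Longrightarrow\quad \frac{\|\nabla L(w_t)\|}{\|w_t\|} \approx \sqrt{\frac{2\lambda}{\eta_t}},$$
which makes explicit why a decaying $\eta_t$ with fixed $\lambda$ forces the relative gradient norm upward.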
5. Normalization and Update Strategies to Control Gradient Norms
Multiple normalization strategies have been introduced to prevent gradient norm increase and restore stable optimization:
- Backward Gradient Normalization (BGN): Identity-forward nodes placed before every nonlinearity rescale incoming gradients to a constant norm $\kappa$ during backpropagation:
$$\tilde{\delta}_\ell = \kappa \,\frac{\delta_\ell}{\|\delta_\ell\|_2}.$$
This guarantees $\|\tilde{\delta}_\ell\|_2 = \kappa$ at every layer, preventing both vanishing and explosion (Cabana et al., 2021).
- Gradient Autoscaled Normalization (GAN): Global zero-centering and adaptive rescaling of all eligible gradient tensors using the autoscale multiplier
$$a_t = \left(\frac{4}{\lvert\log s_t\rvert + \epsilon}\right)^{p_t},$$
with $s_t$ the standard deviation of the concatenated global gradient, applied to the centered global gradient, preserving theoretical convergence guarantees and eliminating uncontrolled amplification (Yun, 3 Sep 2025).
- Weight-Decay Correction (AdamC/SGDC): Proportional scaling of the weight-decay parameter at each step,
$$\lambda_t = \lambda \,\frac{\eta_t}{\eta_{\max}},$$
ensures that the steady-state ratio $\sqrt{2\lambda_t/\eta_t}$ stays fixed, removing the "tail blow-up" (Defazio, 2 Jun 2025).
Pseudocode sketches for BGN (Cabana et al., 2021) and GAN (Yun, 3 Sep 2025) are as follows:
```
for K in reversed(range(num_layers)):              # backward pass
    g = (W[K].T @ delta_next) * f_prime(z[K])      # raw backpropagated gradient
    n = norm(g, 2)
    delta_current = kappa * (g / n)                # constant-norm downstream gradients, updates
    delta_next = delta_current
```
```
for t in range(T):
    G = {l: grad(layer_l) for l in layers}             # per-layer gradient tensors
    ST = {l for l in G if dim(G[l]) > 1}               # eligible (multi-dimensional) tensors
    g_t = concat([vec(G[l]) for l in ST])              # flattened global gradient
    s_t = std(g_t)                                     # global gradient standard deviation
    a_t = (4 / (abs(log(s_t)) + eps)) ** p_t           # autoscale multiplier
    for l in ST:
        G_tilde_l = G[l] - mean(G[l])                  # zero-center each eligible tensor
        G_hat_l = a_t * G_tilde_l                      # rescaled gradient used in the update
```
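A minimal sketch of the weight-decay correction is given below, assuming a PyTorch-style AdamW training loop; the scaling rule $\lambda_t = \lambda\,\eta_t/\eta_{\max}$ follows the description above rather than any official AdamC release, and `model`, `loader`, and `loss_fn` are assumed to exist:

```python
# Sketch of corrected weight decay on top of AdamW (assumed setup, illustrative only).
import torch

def apply_corrected_weight_decay(optimizer, base_wd, max_lr):
    """Rescale decoupled weight decay each step so sqrt(2*lambda_t/lr_t) stays constant."""
    for group in optimizer.param_groups:
        group["weight_decay"] = base_wd * group["lr"] / max_lr

# Usage inside a training loop:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
# for x, y in loader:
#     apply_corrected_weight_decay(optimizer, base_wd=0.1, max_lr=1e-3)
#     loss_fn(model(x), y).backward()
#     optimizer.step(); optimizer.zero_grad(); scheduler.step()
```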
6. Empirical Validation and Comparative Results
Empirical studies validate that normalization eliminates undesirable gradient norm behavior and improves performance. On MNIST with deep MLPs, BGN stabilization yields flat gradient norms across all layers and greatly improved accuracy, especially in deep ReLU networks. GAN achieves test accuracy equal to or better than AdamW, GradNorm, GradCentralization, and Z-score normalization on CIFAR-100 benchmarks using ResNet and VGG-16-BN architectures.
| Network | AdamW | GradNorm | GradCentralization | Z-Score Norm | GAN |
|---|---|---|---|---|---|
| ResNet-20 | 59.32% | 59.58% | 60.72% | 59.53% | 61.34% |
| ResNet-56 | 70.01% | 70.50% | 69.73% | 68.58% | 71.29% |
| VGG-16-BN | 74.54% | 74.18% | 73.82% | 73.91% | 74.54% |
For long-duration LLM training, AdamC eliminates the final-stage gradient spike and reduces loss by 0.15 nats/token (Defazio, 2 Jun 2025), with similar improvements in ImageNet ResNet-50 (SGDC provides a 0.3% accuracy gain).
7. Limitations, Open Problems, and Practitioner Guidelines
The generalizability of these normalization strategies is an active area of research. Limitations include computational overhead for full-tensor norms in large convolutional feature maps, lack of validation on sequence models and extreme-scale networks, and unclear interaction with other regularizers. The optimal normalization strength ($\kappa$), placement frequency of normalization nodes, and batch-size requirements remain open. For practitioners, monitoring per-layer and global gradient statistics during training is recommended, as sketched below. When transient spikes (2–5×) in gradient std are observed, global normalization (GAN) is favored over naive per-layer schemes. Corrected weight decay, especially in models with normalization layers and scheduled learning rates, should be adopted to prevent final-phase gradient blow-up. The suggested fixes introduce few or no new hyperparameters and are compatible with Adam/SGD variants (Cabana et al., 2021, Yun, 3 Sep 2025, Defazio, 2 Jun 2025).
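A minimal monitoring sketch under these guidelines (assuming a PyTorch model; the helper name is illustrative):

```python
# Log per-layer and global gradient statistics after each backward pass.
import torch

@torch.no_grad()
def gradient_stats(model: torch.nn.Module):
    """Return ({name: (grad_norm, grad_std)}, global_norm, global_std)."""
    per_layer, flat = {}, []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        per_layer[name] = (g.norm().item(), g.std().item())
        flat.append(g.flatten())
    if not flat:
        return per_layer, 0.0, 0.0
    g_all = torch.cat(flat)
    return per_layer, g_all.norm().item(), g_all.std().item()

# Call after loss.backward(); a sustained 2-5x rise in a layer's gradient std,
# or a late-training jump in the global norm, flags the regimes discussed above.
```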
A plausible implication is that effective gradient norm control, whether via BGN, GAN, or adaptive weight decay, is indispensable for stable optimization, especially in architectures with extreme depth, normalization, or nontrivial learning rate schedules. Understanding and normalizing gradient norm increase bridges empirical training behavior with theoretical optimization guarantees, advancing both deep learning practice and theory.