
Gradient-Preserving Activation Scaling (GPAS)

Updated 1 July 2025
  • Gradient-Preserving Activation Scaling (GPAS) is a set of techniques that modify deep networks to preserve or enhance the magnitude and informativeness of gradients during training.
  • GPAS techniques use methods like stop-gradient operators, adaptive activation functions, or gradient scaling to decouple activation scaling from backward gradient flow.
  • GPAS techniques improve training stability and performance in deep networks like LLMs, PINNs, and quantized models by mitigating issues like vanishing gradients.

Gradient-Preserving Activation Scaling (GPAS) is a family of techniques and architectural modifications designed to address gradient pathologies and activation variance issues in deep neural networks. GPAS methodologies seek to preserve the magnitude and informativeness of gradients throughout the network, especially in very deep architectures where vanishing or exploding gradients commonly arise. Implementations span runtime activation scaling, adaptive activation shaping, gradient modulation during training, and a variety of preconditioning and normalization schemes. This entry presents the principal formulations, theoretical guarantees, architectural configurations, empirical impacts, and limitations of GPAS, reflecting both recent and foundational research.

1. Definition and Theoretical Foundations

Gradient-Preserving Activation Scaling refers to any modification of the neural network architecture or optimization process that purposefully scales activations—or the pathways through which they propagate—in a manner that preserves (or enhances) the magnitude and integrity of gradients during backpropagation. The primary objective is to mitigate classic problems such as the vanishing/exploding gradient and to enable deeper or more expressive architectures to effectively learn.

The formal utility of GPAS is most transparent in architectures where forward scaling naively leads to backward gradient attenuation. To address this, GPAS methods introduce adjustments—either to the scaling mechanism, the activation function, or via explicit stop-gradient constructs—that preserve or control the backward gradient separately from the forward activation scale. Some approaches extend this principle to act on optimization preconditioning, gradient norm normalization, or through the design of adaptive activation functions and gradient transformation operators.

2. Architectural and Algorithmic Techniques

2.1. Stop-Gradient Forward Scaling

A widely adopted GPAS implementation involves forward-scaling activations while explicitly preventing this scaling from affecting gradients. In Pre-LayerNorm (Pre-LN) Transformers, for example, GPAS is realized as:

$$x_{l+1} \leftarrow x_{l+1} - \mathrm{SiLU}(\alpha_l) \cdot \mathrm{sg}(x_{l+1}),$$

where $\mathrm{SiLU}(\alpha_l) = \alpha_l \cdot \sigma(\alpha_l)$ is a smooth gating function, $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator (whose gradient is zero in the backward pass), and $\alpha_l$ is a learnable gate. Activations are thus downscaled in the forward pass while gradients in the backward pass remain unaffected, decoupling the control of activation variance from the maintenance of gradient magnitude (2506.22049).
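
The construction can be illustrated with a short PyTorch sketch (illustrative only; module and parameter names such as GPASGate are not taken from 2506.22049):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPASGate(nn.Module):
    """Downscale activations in the forward pass while leaving the
    backward gradient path untouched via a stop-gradient (detach)."""

    def __init__(self):
        super().__init__()
        # One learnable gate per layer; alpha = 0 makes the layer an
        # identity at initialization because SiLU(0) = 0.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward: x is scaled down by (1 - SiLU(alpha)) element-wise.
        # Backward: the detached term carries no gradient w.r.t. x, so
        # dL/dx passes through with unchanged magnitude.
        return x - F.silu(self.alpha) * x.detach()

# Quick check of the decoupling:
gate = GPASGate()
x = torch.randn(4, 16, requires_grad=True)
gate(x).sum().backward()
# x.grad is all ones: the forward downscaling did not attenuate the gradient.
```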

2.2. Locally Adaptive Activation Functions

Another category of GPAS mechanisms employs learnable scaling parameters within activation functions (layer-wise or neuron-wise). For example, Locally Adaptive Activation Functions (LAAF) introduce layer- or neuron-specific slopes $a_k$ (or $a_{k,i}$):

$$\tilde{z}_k = \sigma\left(n\, a_k\, \mathcal{L}_k(z_{k-1})\right),$$

with $n$ a fixed scaling factor, $\sigma$ an activation function, and $\mathcal{L}_k$ the affine (linear) map of layer $k$. The slopes are optimized jointly with the network parameters, and a slope recovery term augments the loss, accelerating convergence and improving gradient flow (1909.12228).
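
A minimal sketch of a layer-wise LAAF block in PyTorch, assuming a tanh nonlinearity and a scale factor n = 10; the exact form of the slope recovery regularizer below is indicative rather than a verbatim reproduction of 1909.12228:

```python
import torch
import torch.nn as nn

class LAAFLayer(nn.Module):
    """Affine map followed by a layer-wise adaptive activation:
    z = tanh(n * a * (W x + b)), with a trainable slope a."""

    def __init__(self, in_dim: int, out_dim: int, n: float = 10.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.n = n
        self.a = nn.Parameter(torch.tensor(1.0 / n))  # n * a = 1 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.n * self.a * self.linear(x))

def slope_recovery(layers) -> torch.Tensor:
    # Regularizer added to the loss; minimizing it pushes the average
    # exp(a_k) up, i.e. it rewards larger slopes and faster gradient flow.
    a = torch.stack([layer.a for layer in layers])
    return 1.0 / torch.mean(torch.exp(a))

# Usage: total_loss = data_loss + lam * slope_recovery(laaf_layers)
```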

2.3. Gradient Activation Functions and Gradient Scaling

Some GPAS approaches focus on processing gradients themselves. The Gradient Activation Function (GAF) framework operates by applying a non-linear, monotonic function to each scalar gradient:

$$\acute{g}(g) = \alpha \tanh(\beta g),$$

with the aim of amplifying small gradients and restricting large gradients, thereby reducing ill-conditioning and enhancing convergence rates (2107.04228).
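
In practice this amounts to transforming every parameter gradient after backpropagation and before the optimizer step. A hedged PyTorch sketch (the helper name and the specific α, β values are illustrative, not taken from 2107.04228):

```python
import torch

@torch.no_grad()
def apply_gradient_activation(model: torch.nn.Module,
                              alpha: float = 0.1,
                              beta: float = 10.0) -> None:
    """Replace each parameter gradient g with alpha * tanh(beta * g).
    Small gradients are amplified roughly by alpha * beta, while large
    gradients saturate near +/- alpha."""
    for p in model.parameters():
        if p.grad is not None:
            p.grad.copy_(alpha * torch.tanh(beta * p.grad))

# Training loop usage:
#   loss.backward(); apply_gradient_activation(model); optimizer.step()
```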

Other methods introduce online adaptive scaling of the gradient preconditioner $P_k$ at each training step:

$$x^{k+1} = x^k - P_k \nabla f(x^k),$$

with $P_k$ learned by minimizing criteria such as function value ratios or gradient norm ratios, providing guarantees that approach those of optimally preconditioned gradient descent (2411.01803).
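
The update rule itself is ordinary preconditioned gradient descent; the sketch below uses a fixed diagonal P for concreteness, whereas the online approach of 2411.01803 would adapt P_k at every step:

```python
import numpy as np

def preconditioned_gd(grad_f, x0, P, steps: int = 100):
    """Iterate x_{k+1} = x_k - P @ grad_f(x_k) with a fixed preconditioner P."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - P @ grad_f(x)
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x with condition number 100.
A = np.diag([1.0, 100.0])
grad_f = lambda x: A @ x
# A diagonal preconditioner approximating A^{-1} equalizes per-coordinate
# steps, which is the kind of P_k an online scheme would aim to learn here.
P = 0.9 * np.diag([1.0, 0.01])
x_min = preconditioned_gd(grad_f, x0=[1.0, 1.0], P=P)  # converges to ~[0, 0]
```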

2.4. Quantization and Discretization with Scaling

In scenarios where networks are quantized, element-wise gradient scaling (EWGS) is applied during backpropagation. The propagated gradient is modified as:

$$g_{x_n} = g_{x_q}\left(1 + \delta\, \mathrm{sign}(g_{x_q})\,(x_n - x_q)\right),$$

where $\delta$ is a Hessian-informed scaling factor and $(x_n - x_q)$ is the quantization error; this recovers gradient information that would otherwise be lost to discretization (2104.00903).
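
A sketch of EWGS as a custom autograd function, assuming a simple round-to-nearest quantizer and a fixed δ (in 2104.00903, δ is derived from Hessian information):

```python
import torch

class EWGSQuantize(torch.autograd.Function):
    """Round in the forward pass; scale gradients element-wise in the
    backward pass: g_xn = g_xq * (1 + delta * sign(g_xq) * (x_n - x_q))."""

    @staticmethod
    def forward(ctx, x_n: torch.Tensor, delta: float):
        x_q = torch.round(x_n)             # placeholder discretization scheme
        ctx.save_for_backward(x_n - x_q)   # quantization error
        ctx.delta = delta
        return x_q

    @staticmethod
    def backward(ctx, g_xq: torch.Tensor):
        (err,) = ctx.saved_tensors
        g_xn = g_xq * (1.0 + ctx.delta * torch.sign(g_xq) * err)
        return g_xn, None                  # no gradient w.r.t. delta here

# Usage: y = EWGSQuantize.apply(x, 0.1)   # reduces to the STE when delta = 0
```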

2.5. Activation Shaping for Gradient Preservation

Adaptive activation functions (e.g., adaptive Gumbel, smooth ReLU, self-scalable tanh) embed trainable shape or scaling parameters (e.g., $\alpha$, $\beta$) in their definitions, allowing the network to jointly learn an optimal nonlinearity and slope per neuron or layer and thereby balance expressivity against gradient flow automatically (1901.09849, 2204.12589).
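
As one concrete, purely illustrative instance, a per-neuron adaptive tanh with trainable scale α and slope β:

```python
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """y = alpha * tanh(beta * x) with per-feature trainable alpha and beta.
    At alpha = beta = 1 this reduces to a plain tanh; training lets each
    neuron trade expressivity against gradient flow."""

    def __init__(self, num_features: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * torch.tanh(self.beta * x)
```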

3. Application Domains

GPAS is particularly salient for the following contexts:

  • LLMs: Pre-LN Transformer architectures experience exponential activation variance growth; GPAS stabilizes activations while maintaining strong gradient flow, improving pretraining and downstream performance (average accuracy gains of 1–10% across various LLMs) (2506.22049).
  • Physics-Informed Neural Networks (PINNs): Self-scalable or locally adaptive activations enable stable computation of higher-order derivatives and robust fitting when target output scales are large or unnormalized (2204.12589, 1909.12228).
  • Network Quantization: Gradients are preserved across discretization boundaries via element-wise scaling, improving the convergence and accuracy of quantized and binary neural networks (2104.00903, 2002.06517).
  • Differential Privacy: Non-monotonous adaptive scaling of per-sample gradients maintains small but critical updates, resulting in improved utility-privacy tradeoff (2411.03059).

4. Theoretical Guarantees and Analysis

Several works provide strong theoretical support for GPAS-style methodologies:

  • Convergence Guarantees: Online learning approaches to scaling yield $O(\kappa^\star \log(1/\varepsilon))$ complexity, where $\kappa^\star$ is the best condition number achievable by any allowed preconditioner, and, for quadratic functions, even superlinear convergence (2411.01803).
  • Gradient Flow Bounds: For Pre-LN Transformers, GPAS yields gradient bounds that grow only mildly with depth, in contrast to the exponential decay observed without it (2506.22049).
  • Avoidance of Spurious Minima: Slope recovery and adaptive activation scaling prevent convergence to non-optimal critical points, as global minima are the only limit points under reasonable initialization and learning rates (1909.12228).
  • Stability under Quantization and Discontinuity: EWGS and similar strategies guarantee better correlation between true and estimated gradients even under extreme quantization (2104.00903, 2002.06517).

5. Empirical Effects and Performance

Extensive experiments highlight GPAS effectiveness:

  • Transformers/LLMs: GPAS delivers consistent pretraining perplexity reductions (e.g., 0.35 to 1.75 points) and downstream accuracy improvements (up to a 9.71% relative improvement over baselines for 1B-parameter models). Even on architectures with sophisticated normalization (Sandwich-LN, DeepNorm), GPAS yields additive gains.
  • Deep Networks: In deep multilayer perceptrons and CNNs, locally adaptive activations with slope recovery accelerate convergence and improve error metrics for both regression and classification tasks (1909.12228).
  • PINNs: Self-scalable tanh enables accurate recovery of both solution fields and physical parameters in forward and inverse problems (2204.12589).
  • Quantized/Private Models: EWGS and adaptive per-sample scaling outperform fixed clipping or normalization, especially in late training where small gradients dominate but are susceptible to suppression or noise (2104.00903, 2411.03059).

6. Implementation Considerations and Limitations

  • Parameter Overhead: Neuron-wise adaptation and per-layer gates introduce extra parameters; practical designs trade off granularity and scalability.
  • Hyperparameter Sensitivity: Optimization may require careful tuning of scaling learning rates or regularization, particularly for per-unit or per-sample adaptive schemes.
  • Architectural Constraints: Some normalization or preconditioning schemes (e.g., in Pre-LN Transformers) may require careful site placement (after residual connections) and correct application of stop-gradient to avoid nullifying benefits (2506.22049).
  • Numerical Stability: Extreme values of the scaling parameters can induce instability, requiring parameterizations that guarantee positivity and boundedness (see the sketch following this list).
  • Compatibility: Integration with existing architectures may demand updating the gradient flow logic, especially for custom backpropagation (e.g., in JAX, TensorFlow, PyTorch).
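
Regarding the numerical-stability point above, one common remedy is to parameterize the scale through a bounded squashing function so that extreme raw values cannot produce extreme scales; a hypothetical sketch:

```python
import torch
import torch.nn as nn

class BoundedScale(nn.Module):
    """Learnable scale constrained to (0, max_scale) by passing an
    unconstrained raw parameter through a sigmoid."""

    def __init__(self, max_scale: float = 2.0):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(1))  # scale starts at max_scale / 2
        self.max_scale = max_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.max_scale * torch.sigmoid(self.raw)  # always in (0, max_scale)
        return scale * x
```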

7. Summary Table: GPAS across Domains

| GPAS Variant | Application Domain | Key Benefit |
| --- | --- | --- |
| Stop-gradient scaling (GPAS) | Transformers/LLMs | Controls variance, maintains gradients |
| Locally adaptive activation (LAAF) | PINNs, regression/classification | Implicit preconditioning, fast training |
| Self-scalable tanh (Stan) | PINNs | Robust gradient flow, scale adaptation |
| EWGS / non-monotone scaling | Quantization, differential privacy | Preserves informative gradients, utility |

8. Conclusion

Gradient-Preserving Activation Scaling encompasses a diverse set of theoretically justified tools and architectural primitives that enable deep, expressive neural networks to train effectively by safeguarding the passage of gradients. Through a principled decoupling of forward activation scaling and backward gradient propagation, typically using stop-gradient operators or adaptive scaling parameters, GPAS techniques have demonstrated consistent improvements in convergence speed, stability, and final task performance across major neural network architectures, especially in LLMs and physics-based learning systems. For future large-scale and high-fidelity applications, GPAS formulations continue to represent a core ingredient for deep network design and robust optimization.