Activation Functions in Deep Learning

Updated 14 December 2025

Activation Functions are nonlinear components in deep neural networks that enable modeling of complex, nonlinearly separable feature spaces.
They range from classical (Sigmoid, Tanh, ReLU) to adaptive/trainable variants (PReLU, Swish, GELU) that enhance gradient flow and convergence.
Emerging methods including meta-learned, rational, and biologically-inspired activations offer significant gains in accuracy and robustness.

Activation functions (AFs) are the fundamental nonlinear components in deep neural network architectures, responsible for modulating the output of each neural unit, enabling representation of highly complex, nonlinearly separable feature spaces. The rigorous selection and design of activation functions substantially impacts model expressivity, optimization dynamics, convergence, and ultimate generalization performance. Over recent decades, activation function research has advanced from static, hand-crafted non-linearities to adaptive, trainable, biologically inspired, and hybrid forms that optimize gradient flow and learning stability. Here we provide a comprehensive, technical review of the leading activation functions, their mathematical foundations, empirical behavior on competitive benchmarks, modern adaptive innovations, and practical guidelines for their deployment.

1. Canonical Activation Functions: Definitions, Properties, and Dynamics

Classical activation functions include the logistic sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and their piecewise linear and self-gated smooth variants. Their formal definitions, derivatives, and principal properties are summarized below (Szandała, 2020, Hammad, 14 Jul 2024):

Sigmoid (Logistic): $\sigma(x) = \frac{1}{1 + e^{-x}}$ ; smooth, monotonic, bounded $(0,1)$ , but suffers severe vanishing gradients for $|x| \gg 2$ and non-zero centering impairs convergence in deep stacks.
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ ; zero-centered, bounded $(-1,1)$ , monotonic; improved gradient flow near $x=0$ but retention of vanishing gradient issues at $|x| \gg 2$ .
ReLU: $\mathrm{ReLU}(x) = \max(0,x)$ ; piecewise linear, unbounded above, zero for $x<0$ ("dying" problem), extremely efficient, alleviates vanishing gradient for $x>0$ .
Leaky ReLU: $\mathrm{LReLU}(x) = \begin{cases} \alpha x, & x \le 0 \ x, & x > 0 \end{cases}$ ; introduces a small negative slope $\alpha$ to mitigate the dead neuron issue.
Swish: $\mathrm{Swish}(x) = x \cdot \sigma(x)$ ; smooth, non-monotonic, unbounded, retains negative gradients for $x \ll 0$ , effective in very deep architectures.

Key empirical observations indicate that ReLU and Leaky ReLU remain competitive defaults for vision tasks, Swish variants and gated units offer improved performance where vanishing gradients persist, and saturated S-shaped functions (sigmoid, tanh) are discouraged except for output layers or gateway gates in RNN architectures (Szandała, 2020, Dubey et al., 2021, Hammad, 14 Jul 2024).

2. Advanced, Adaptive, and Trainable Activation Functions

Recent progress has led to adaptive activation architectures that directly learn activation-form parameters during training, such as parametric ReLU (PReLU), adaptive piecewise linear units (APL), Padé and orthogonal Padé families, and metalearned functional compositions:

PReLU: $f(x) = \max(\alpha x,x)$ , with learnable $\alpha$ per channel or layer.
APL (Adaptive Piecewise Linear) (Agostinelli et al., 2014): $h_i(x) = \max(0,x) + \sum_{s=1}^S a_i^s \max(0,-x + b_i^s)$ , enabling per-unit nonlinear flexibility via $a_i^s$ , $b_i^s$ .
Orthogonal-Padé (HP-1/HP-2) (Biswas et al., 2021): Rational expansions over Hermite polynomials, $G(x) = \sum_{i=0}^k c_i H_i(x) / (1 + \sum_{j=1}^l |d_j| |H_j(x)|)$ , universally approximating, smooth, parameter-efficient; empirically yield $1.8$– $5.1\%$ absolute accuracy improvements over ReLU on CIFAR and ImageNet benchmarks.
ErfReLU (Rajanand et al., 2023): Combines ReLU for $x \ge 0$ and $a \cdot \mathrm{erf}(x)$ for $x<0$ , with $a$ trainable, providing robust gradient flow for negative activations and superior convergence.
EIS Family (Biswas et al., 2020): $f_{\alpha,\beta,\gamma,\delta,\theta}(x) = x[\ln(1+e^x)]^\alpha / [\sqrt{\beta+\gamma x^2} + \delta e^{-\theta x}]$ , unifying Softplus, ISRU, Swish and more, with hyperparameters tuned for each task.

Significant empirical evidence supports adaptive and trainable activations consistently outperforming static forms; for instance, APL units provided state-of-the-art error rates (CIFAR-10 $7.51\%$ ), and HP-1 achieved up to $5\%$ gains on challenging architectures (Agostinelli et al., 2014, Biswas et al., 2021).

3. Smooth Non-Monotonic and Rational Activation Functions

Smooth non-monotonic functions, including Mish, GELU, Sqish, saturated non-monotonic gates, and rational polynomial families (PAU, EIS), are designed to improve gradient flow, feature representation, and robustness:

Mish: $x \cdot \tanh(\ln(1+e^x))$ ; improved generalization and gradient stability, effective in CV/NLP (Kumar, 2022, Dubey et al., 2021).
GELU: $x \cdot \Phi(x)$ , $\Phi(x)$ = Gaussian CDF; standard in transformers and state-of-the-art models (Nguyen et al., 2021, Hammad, 14 Jul 2024).
Sqish (Biswas et al., 2023): $ax + \frac{(1-a)x}{\sqrt{1+\beta \exp(-2\gamma(1-a)x)}}$ , three trainable parameters; yields $1$– $8\%$ accuracy improvements and strong adversarial robustness.
Saturated variants (SGELU, SSiLU, SMish) (Chen et al., 2023): Combine ReLU's identity with GELU/SiLU/Mish negative branches, systematically outperform default baselines (SGELU up to $4.3\%$ over ReLU).
Padé Activation Units (PAU) (Dubey et al., 2021): Low-order rational polynomial forms learned per layer, compact yet highly expressive.

Adaptive rational and non-monotonic activations offer a compelling balance between computational cost, trainability, and generalization, particularly in ultra-deep, noisy, or adversarial settings.

4. Metalearning and Evolutionary Discovery of Activation Functions

Automated search and coevolution frameworks have uncovered high-performing, unconventional activation forms beyond manual engineering (Bingham et al., 2020, Lapid et al., 2022):

Tree-based genetic search (Bingham et al., 2020): Expression-trees over a large primitive set (unary, binary, trigonometric, error-function, piecewise) evolved with mutation and crossover, discovering non-trivial formulas (e.g., $f(x) = (arctan(x)) \cdot \min(x,0)$ ) with statistically significant performance superiority over ReLU/Swish on CIFAR-10/100.
Layer-specific coevolution (Lapid et al., 2022): Cartesian genetic programming for input, hidden, output layers separately, yielding composite and specialized AFs adapted to task and feature distribution; demonstrated $1$– $7\%$ top-1 accuracy gains relative to ReLU/Leaky-ReLU.

Evolutionary and meta-learned approaches represent a new research frontier, leveraging flexible search spaces to tune activation nonlinearity per dataset, architecture, and layer.

5. Hybrid, Symmetric, and Biologically-Inspired Activation Functions

Domain-specific requirements and biological realism have motivated symmetric and hybrid activation forms that align with input signal characteristics:

Parametric Leaky Tanh (PLTanh) (Mastromichalakis, 2023): Parametrically blends tanh and leaky-ReLU, $f(x;\alpha) = \max(\tanh(x), \alpha|x|)$ , ensuring nonzero negative gradients and zero-centering, with empirically validated accuracy and convergence gains.
fNIRS!specific symmetric activations (Adeli et al., 15 Jul 2025): Tanh or absolute-value functions outperform ReLU in low-SNR domains; evenly symmetric or odd symmetric forms efficiently preserve bidirectional signal energy.
BRU (Bionodal Root Unit) (Bhumbra, 2018): Biophysically-motivated, root-law compression and exponential tails, reflecting neuronal input–output curves; improved speed and generalization even with no explicit regularization.
Hybrid functions S3/S4 (Kavun, 29 Jul 2025): S3 is a hard switch between sigmoid (for $x<0$ ) and softsign ( $x>0$ ); S4 employs a smooth sigmoid-controlled blend, eliminating dead neuron and vanishing gradient regimes, achieving superior accuracy and robust gradient flow in deep stacks.

Symmetry, smoothness, and biologically plausible compression can be critical in precision scientific signal domains, recurrent networks, and architectures processing zero-centered data.

6. Task-Specific Selection, Empirical Comparison, and Practical Guidelines

Extensive empirical investigation across classification, regression, detection, segmentation, adversarial robustness, and transfer learning tasks underlines that no activation is universally optimal; selection should account for architecture depth, input statistics, computational budget, and intended robustness properties (Dubey et al., 2021, Gustineli, 2022, Hammad, 14 Jul 2024):

Activation	Vision Accuracy	Deep NLP/Speech	Convergence	Gradient Issues
ReLU	Baseline, robust	Limited	Fast	Dying units
Leaky ReLU	Baseline/wider	Moderate	Fast	Residual vanishing
ELU/SELU	Marginal gains	Good (SELU:NLP)	Very fast	Needs proper init
Swish/GELU	Top CV/NLP/SOTA	Best (Transform)	Moderate	Smooth, non-zero
Mish	Best (SOTA)	Good	Fast	Non-monotonic, heavy
HP-1/HP-2	+1.5–5% gain	Unknown	Steep	Extra compute
Sqish/SGELU	+1–8% gain	Improved robust	Moderate	Tunable, smooth

Key recommendations:

Start with ReLU/Leaky for efficiency.
Adopt Swish, Mish, or GELU for ultra-deep stacks, transformers, or marginal accuracy gains; expect higher computational cost.
For noisy, symmetric signals (fNIRS, scientific data): prefer Tanh, Abs, or symmetric/modified absolute functions.
Leverage adaptive/trainable (PReLU, PAU, HP, ErfReLU) for maximal accuracy, especially on small datasets or when prior activation optimality is uncertain.
Experiment with evolutionary/meta-learned or hybrid forms if primary gradient pathologies persist or hardware constraints align.

7. Future Directions, Open Problems, and Design Considerations

Ongoing research addresses theoretical analysis of non-monotonic and rational AFs in deep learning optimization, expanding biological realism and symmetry considerations, integrating activation function search into broader neural architecture optimization (NAS), and hardware specialization for efficient AF evaluation (Gustineli, 2022, Hammad, 14 Jul 2024, Bingham et al., 2020):

Metalearned AFs: Automated search for activation templates per task, per layer, or per domain remains an open challenge; evolving both topology and AF jointly may drive further performance gains (Bingham et al., 2020, Lapid et al., 2022).
Hardware/Inference Efficiency: Trade-offs between smoothness, robustness, and evaluation cost (APTx, rational/Padé/saturated gates) are central for real-time, edge, and large-scale model deployment (Kumar, 2022).
Robustness and Generalization: Adversarial and scientific tasks suggest symmetric, bounded, non-monotonic, and trainable AFs often confer greater resistance and generalization (Biswas et al., 2023, Chen et al., 2023, Adeli et al., 15 Jul 2025).
Theory of Loss Landscapes: Understanding the geometric and statistical effects of AF curvature, non-monotonicity, and symmetry on optimization is critical for principled activation design (Gustineli, 2022).

In summary, the contemporary field of activation function research for deep learning encompasses a rigorous, multidimensional taxonomy from static rectifiers to meta-learned polynomial compositions. Advanced AFs yield substantial gains in accuracy, convergence, and training stability across diverse tasks, and ongoing research in adaptive, hybrid, and biologically-motivated forms is reshaping deep network optimization (Szandała, 2020, Dubey et al., 2021, Biswas et al., 2021, Rajanand et al., 2023, Hammad, 14 Jul 2024).