
Parametric Swish Activation

Updated 13 January 2026
  • Parametric Swish activations are neural network non-linearities with trainable parameters that adjust slope, bias, and gating to flexibly interpolate between linear and non-linear behaviors.
  • They improve gradient flow and convergence by dynamically adapting their response based on input statistics and network depth, surpassing fixed functions like ReLU.
  • Empirical benchmarks show up to 1.5% accuracy gains on datasets like CIFAR and ImageNet when using variants such as E-Swish, SSwish, ASH, and PFTS.

Parametric Swish activation functions constitute a class of non-linearities in neural networks designed to enhance flexibility, gradient flow, and representation power compared to fixed functions such as ReLU or the vanilla Swish. These functions introduce explicit, trainable or hyperparameterized degrees of freedom—most commonly via slope/scaling, bias shift, input-dependent thresholding, or context adaptation—that allow each layer or unit to interpolate between linearity, non-monotonic gating, and sharp rectification. The family includes several variants used across vision, NLP, and generative models, with substantial empirical evidence showing reliable, often state-of-the-art, performance improvements in large-scale benchmarks.

1. Mathematical Foundations and Canonical Parametric Forms

The canonical Swish activation is defined as

$$\mathrm{Swish}(x; \beta) = x \cdot \sigma(\beta x)$$

with $\sigma(z) = 1/(1 + e^{-z})$ the logistic sigmoid, and $\beta \geq 0$ a shape parameter. The non-parametric version sets $\beta = 1$ ("SiLU"), while the parametric (trainable or hyperparameter) variant allows $\beta$ to vary globally, per-layer, or per-channel. As $\beta \to \infty$, Swish approaches ReLU; $\beta \to 0$ yields a scaled linear identity. The derivative has a closed form:

$$\frac{d}{dx}\,\mathrm{Swish}(x; \beta) = \sigma(\beta x) + \beta x \,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr)$$

yielding smooth, non-monotonic, and gradient-preserving transitions across $x$ (Ramachandran et al., 2017).
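
As a concrete illustration of the trainable-$\beta$ form, here is a minimal PyTorch sketch; the module name, per-channel parameter shape, and initialization are illustrative choices rather than code from the cited papers.

```python
import torch
import torch.nn as nn

class ParametricSwish(nn.Module):
    """Swish(x; beta) = x * sigmoid(beta * x) with a trainable beta.

    Minimal sketch: one beta per channel (dim 1 of the input tensor),
    initialized to 1.0 so training starts from the SiLU special case.
    Large beta pushes the gate toward ReLU; beta near 0 gives roughly x/2.
    """
    def __init__(self, num_channels: int, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast the per-channel beta over all remaining dimensions.
        shape = [1, -1] + [1] * (x.dim() - 2)
        return x * torch.sigmoid(self.beta.view(*shape) * x)
```

With $\beta$ fixed at 1 this reduces to `torch.nn.SiLU`; a global or per-layer $\beta$ follows the same pattern with a different parameter shape.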

The E-swish extension introduces an output slope parameter:

$$\mathrm{E\text{-}swish}_\beta(x) = \beta x \cdot \sigma(x)$$

where $\beta > 0$ amplifies (or contracts) the overall gain. This scaling can be interpreted as a "depth knob," controlling gradient magnitude and activation nonlinearity based on architecture depth (Alcaide, 2018).
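
Because $\beta$ is a plain hyperparameter here, E-swish needs no learnable state; a one-line sketch follows (the default $\beta = 1.5$ reflects the range reported later in this article and is otherwise an arbitrary choice).

```python
import torch

def e_swish(x: torch.Tensor, beta: float = 1.5) -> torch.Tensor:
    """E-swish sketch: beta * x * sigmoid(x) with a fixed gain beta."""
    return beta * x * torch.sigmoid(x)
```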

Further parametric forms generalize Swish to:

  • Learnable vertical shifts: $x \cdot \sigma(\beta x) - \gamma$, with $\gamma \in \mathbb{R}$ providing re-centering and explicit bias (as in SG-Blend's $\mathrm{SSwish}_{\beta, \gamma}$; see the sketch after this list) (Sarkar et al., 29 May 2025).
  • Context-adaptive gates: $x \cdot \sigma(a x + b)$, with $a$ and $b$ modulating the sharpness and the position of the gating boundary (as in Adaptive SwisH, ASH) (Lee et al., 2022).
  • Piecewise or hybrid regimes where Swish is augmented via additive biases or composed with Tanh, as in the Swish-T family (Seo et al., 2024).
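
The shifted form in the first bullet can be sketched as a small PyTorch module; the scalar (rather than per-channel) parameters and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SSwish(nn.Module):
    """Shifted Swish sketch: f(x) = x * sigmoid(beta * x) - gamma.

    beta controls the sharpness of the gate and gamma re-centers the
    output. With beta = 1 and gamma = 0 this reduces to plain Swish/SiLU.
    """
    def __init__(self, beta_init: float = 1.0, gamma_init: float = 0.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))
        self.gamma = nn.Parameter(torch.tensor(gamma_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x) - self.gamma
```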

2. Parametric Swish Variants and Theoretical Properties

Table: Representative Parametric Swish Activations

| Name | Formula | Principal Parameter(s) |
|---|---|---|
| Swish | $x \cdot \sigma(\beta x)$ | $\beta$ (trainable/fixed) |
| E-swish | $\beta x \cdot \sigma(x)$ | $\beta$ (hyperparameter) |
| SSwish | $x \cdot \sigma(\beta x) - \gamma$ | $\beta$, $\gamma$ (learnable) |
| Swish-T$_C$ | $\sigma(\beta x)\left(x + \frac{2\alpha}{\beta}\right) - \frac{\alpha}{\beta}$ | $\beta$ (learned), $\alpha$ (fixed) |
| ASH | $x \cdot \sigma(a x + b)$, with $a$, $b$ adapted to input statistics | $a$, $b$ (contextual, learned) |
| PFTS | $x \cdot \sigma(x) + t$ for $x \ge 0$, else $t$ | $t$ (learnable, piecewise) |

All variants retain the essential properties of Swish: $C^\infty$ smoothness, non-monotonicity (negative "dip" near $x = 0$), gradient non-sparsity, and unbounded positive support. Additional degrees of freedom enable:

  • Tunable sharpness and symmetry ($\beta$, $\gamma$)
  • Controlled mean activation (bias shift, via $t$ or $\gamma$)
  • Adaptive sparsity and top-$k$ filtering (ASH)
  • Context-aware or learnable gating regions (via $a$, $b$)
  • Flexible negative-region behavior (via Tanh bias in Swish-T or flat constant in PFTS)

3. Parametric Swish in Empirical Benchmarking

Parametric Swish activations have demonstrated consistent increases in accuracy, faster convergence, and improved loss landscape properties over both ReLU and fixed Swish, especially in convolutional and transformer architectures. Key findings include:

  • CIFAR-10/100, ImageNet: Swish with trainable or appropriately chosen $\beta$ exceeds ReLU by +0.7 to +1.4% top-1 on ResNet, MobileNet, and Inception/ResNet models (Ramachandran et al., 2017).
  • E-swish: Largest accuracy boosts on Wide ResNet are obtained for $\beta = 1.25$–$1.75$, with larger $\beta$ needed for shallow nets, and smaller $\beta$ for deeper or batch-normalized networks (Alcaide, 2018).
  • SSwish/SG-Blend: Learning both $\beta$ and $\gamma$ outperforms both Swish and GELU, with SG-Blend (blending SSwish and GELU using a learned $\alpha$) achieving state-of-the-art results on CIFAR-10, BERT, and WMT14 BLEU (Sarkar et al., 29 May 2025).
  • ASH: Adaptive, context-dependent thresholding provides 0.5–1% higher accuracy than Swish or GELU on vision and detection benchmarks, with accelerated convergence (Lee et al., 2022).
  • Swish-T$_C$: The non-parametric variant matches or marginally surpasses Swish-T with learned $\beta$, indicating that in some architectures a fixed, high $\beta$ (e.g., $\beta = 6$) suffices for optimal performance (Seo et al., 2024).
  • PFTS: On SVHN, performance gains of up to +71.8% over ReLU in deep fully-connected architectures, and the highest mean rank among parametric activations (Chieng et al., 2020).
  • Zorro family: The Sloped-Zorro variant ($m$ tuned to 0.7) numerically approximates Swish but further improves accuracy on vision tasks (e.g., +5.3% on CIFAR-10 over classic Swish) and enables explicit central slope control to mitigate vanishing gradients (Roodschild et al., 2024).

4. Parameterization, Learning Schemes, and Regularization

Different variants instantiate parameter learning distinctly:

  • Fixed vs. learned $\beta$: Swish can use a global fixed $\beta$, but per-channel (or per-layer) learnable $\beta$ consistently improves accuracy and adaptation (Ramachandran et al., 2017).
  • Symmetry and centering ($\gamma$, Tanh bias): Adding a learnable vertical offset (as in SSwish) or a Tanh bias (Swish-T) enables the activation's mean to be controlled, aiding convergence and negative-region expressivity (Seo et al., 2024, Sarkar et al., 29 May 2025).
  • Adaptive gating (ASH): Per-feature/mini-batch statistics ($\mu_X$, $\sigma_X$) and a learned percentile gate ($z_k$) are updated alongside network weights, providing context-sensitive activation sparsity or density (Lee et al., 2022).
  • Piecewise regimes (PFTS): The trainable threshold $t$ is optimized via standard back-propagation, with careful initialization ($t_0 \approx -0.2$) leading to stable mean-zero activations and reduced bias shift (sketched below) (Chieng et al., 2020).
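
The PFTS rule admits a short sketch with a single learnable threshold; the module layout is illustrative, with only the piecewise formula from the table above and the $t_0 \approx -0.2$ initialization taken from the text.

```python
import torch
import torch.nn as nn

class PFTS(nn.Module):
    """Sketch of the piecewise rule: x * sigmoid(x) + t for x >= 0, else t."""
    def __init__(self, t_init: float = -0.2):
        super().__init__()
        # Single learnable threshold, initialized near -0.2 as described above.
        self.t = nn.Parameter(torch.tensor(t_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The threshold receives gradients from both branches; the negative
        # branch is flat in x by construction.
        return torch.where(x >= 0, x * torch.sigmoid(x) + self.t, self.t.expand_as(x))
```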

All parameters may be optimized using standard SGD/Adam, with optional clamping or softplus constraints for stability, and regularizers (e.g., weight decay, penalties on $\|\alpha - 0.5\|$) as needed to avoid degenerate solutions (Sarkar et al., 29 May 2025).
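
One way such constraints can be wired in is sketched below; the softplus reparameterization of $\beta$ and the penalty weight are illustrative assumptions, not prescriptions from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedSwish(nn.Module):
    """Swish whose beta is kept strictly positive via softplus.

    The optimizer updates an unconstrained raw parameter; the effective
    beta = softplus(raw_beta) > 0, so no explicit clamping is needed.
    """
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        # Inverse softplus so the effective beta starts exactly at beta_init.
        raw = torch.log(torch.expm1(torch.tensor(beta_init)))
        self.raw_beta = nn.Parameter(raw)

    @property
    def beta(self) -> torch.Tensor:
        return F.softplus(self.raw_beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

def blend_penalty(alpha: torch.Tensor, weight: float = 1e-4) -> torch.Tensor:
    """Illustrative regularizer on ||alpha - 0.5|| for a learned blending
    coefficient; added to the task loss during training."""
    return weight * (alpha - 0.5).abs().sum()
```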

5. Theoretical Insights and Functional Flexibility

The introduction of parametric degrees of freedom serves to:

  • Smoothly interpolate between identity, ReLU, and saturating or non-monotonic regimes.
  • Adapt nonlinearity and gradient scale to network depth (avoiding vanishing or exploding gradients as in E-swish and Zorro).
  • Restore information flow or gradient signals in negative activation regimes (Swish-T, PFTS, SSwish).
  • Implement biologically inspired or sampling-based gating (ASH), linking to the behavioral variability of real neuron populations.

These properties facilitate globally smoother loss landscapes, fewer plateaus, data- and architecture-adaptive nonlinearity, and—in deep or wide networks—greater resilience against training instabilities (Ramachandran et al., 2017, Alcaide, 2018, Lee et al., 2022, Roodschild et al., 2024).

6. Comparative Experimental Evaluation and Selection Guidelines

Empirical work recommends the following:

  • Swish/E-swish: For shallow to moderate-depth architectures, $\beta \approx 1.25$–$1.5$; for deep networks (>30 layers), $\beta \approx 1.0$–$1.25$ (Alcaide, 2018).
  • SG-Blend/SSwish: Use as a drop-in replacement for GELU or Swish in both vision and NLP, initializing $\alpha = 0.5$, $\beta = 1.0$, and $\gamma = 0$ (sketched after this list) (Sarkar et al., 29 May 2025).
  • Swish-T$_A$ (non-parametric): Suitable in low-resource or mobile inference settings.
  • ASH: Trainable thresholding robustly outperforms fixed (non-learned) variants, and a steepness $\alpha$ in $[5, 20]$ works reliably (Lee et al., 2022).
  • Zorro: Start with $m = 0.7$, $a_1 = 1.3$, $a_2 = 0$, $b = 1.8$ for Swish-like behavior, tuning $m$ upward to accelerate convergence if necessary (Roodschild et al., 2024).
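
A sketch of the SSwish/GELU blend described above, using the recommended initializations ($\alpha = 0.5$, $\beta = 1.0$, $\gamma = 0$); the cited paper's exact parameterization (for example, how $\alpha$ is constrained) may differ, so treat this as an illustrative drop-in module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGBlend(nn.Module):
    """Blend sketch: alpha * SSwish(x) + (1 - alpha) * GELU(x),
    with SSwish(x) = x * sigmoid(beta * x) - gamma and learned alpha, beta, gamma."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # mixing coefficient
        self.beta = nn.Parameter(torch.tensor(1.0))   # gate sharpness
        self.gamma = nn.Parameter(torch.tensor(0.0))  # vertical shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sswish = x * torch.sigmoid(self.beta * x) - self.gamma
        return self.alpha * sswish + (1.0 - self.alpha) * F.gelu(x)
```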

Table: Empirical Accuracy Improvements (Sample Results)

| Task/Arch | Baseline | Parametric Swish Variant | Accuracy Δ (%) | Source |
|---|---|---|---|---|
| CIFAR-10, WRN-10-2 | ReLU | E-swish ($\beta = 1.5$) | +1.5 | (Alcaide, 2018) |
| CIFAR-10, ResNet-18 | Swish | SG-Blend | +0.36 | (Sarkar et al., 29 May 2025) |
| SVHN, DNN-7 | ReLU | PFTS | +71.83 | (Chieng et al., 2020) |
| CIFAR-10, ConvNet | Swish | Zorro-Swish | +5.3 | (Roodschild et al., 2024) |

7. Implementation and Practical Integration

Integration is straightforward in modern DL frameworks, with minimal code modifications (a generic drop-in sketch follows this list):

  • Swish/E-swish: One-liner replacements are possible; for E-swish, `activation(x) = beta * x * sigmoid(x)` or `activation(x) = beta * tf.nn.swish(x)` in TensorFlow (Alcaide, 2018).
  • SG-Blend/SSwish: Minimal PyTorch modules, registering parameters as nn.Parameters and updating with network weights, clamping as needed (Sarkar et al., 29 May 2025).
  • ASH, PFTS, Swish-T: Custom modules/layers, with in-line calculation of activation/statistics and simple per-layer parameter bookkeeping (Lee et al., 2022, Chieng et al., 2020, Seo et al., 2024).
  • Zorro: Explicit parameterization of the central slope/bounds allows both fixed and learnable regimes, adapting to dataset and performance requirements (Roodschild et al., 2024).
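
As a generic integration pattern (not code from any of the cited works), the sketch below swaps every ReLU in an existing model for a simple learnable-$\beta$ Swish; the helper names are hypothetical.

```python
import torch
import torch.nn as nn

class LearnableSwish(nn.Module):
    """Minimal learnable-beta Swish used only to illustrate module swapping."""
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

def swap_relu_for_swish(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.ReLU submodule with LearnableSwish."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, LearnableSwish())
        else:
            swap_relu_for_swish(child)
    return model

# Example: convert a small ReLU MLP in place.
mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
mlp = swap_relu_for_swish(mlp)
```

The new activation parameters are picked up automatically by any optimizer constructed from `mlp.parameters()`.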

Standard optimization hyperparameters suffice; batch normalization and weight initialization practices remain unchanged except as noted for Swish (retaining batchnorm scale) (Ramachandran et al., 2017).


Parametric Swish activation functions, by introducing trainable or context-adaptive parameters, provide an effective mechanism for learning data- and architecture-specific nonlinearities. Empirical and theoretical advantages include smoother optimization, improved accuracy and convergence, and increased functional flexibility, establishing these variants as robust, practical alternatives to both ReLU and fixed Swish/GELU baselines across a spectrum of deep learning applications (Ramachandran et al., 2017, Alcaide, 2018, Lee et al., 2022, Seo et al., 2024, Chieng et al., 2020, Roodschild et al., 2024, Sarkar et al., 29 May 2025).
