
Parametric Swish Activation

Updated 13 January 2026
  • Parametric Swish activations are neural network non-linearities with trainable parameters that adjust slope, bias, and gating to flexibly interpolate between linear and non-linear behaviors.
  • They improve gradient flow and convergence by dynamically adapting their response based on input statistics and network depth, surpassing fixed functions like ReLU.
  • Empirical benchmarks show up to 1.5% accuracy gains on datasets like CIFAR and ImageNet when using variants such as E-Swish, SSwish, ASH, and PFTS.

Parametric Swish activation functions constitute a class of non-linearities in neural networks designed to enhance flexibility, gradient flow, and representation power compared to fixed functions such as ReLU or the vanilla Swish. These functions introduce explicit, trainable or hyperparameterized degrees of freedom—most commonly via slope/scaling, bias shift, input-dependent thresholding, or context adaptation—that allow each layer or unit to interpolate between linearity, non-monotonic gating, and sharp rectification. The family includes several variants used across vision, NLP, and generative models, with substantial empirical evidence showing reliable, often state-of-the-art, performance improvements in large-scale benchmarks.

1. Mathematical Foundations and Canonical Parametric Forms

The canonical Swish activation is defined as

$$\mathrm{Swish}(x; \beta) = x \cdot \sigma(\beta x)$$

with $\sigma(z) = 1/(1 + e^{-z})$ the logistic sigmoid, and $\beta \geq 0$ a shape parameter. The non-parametric version sets $\beta = 1$ ("SiLU"), while the parametric (trainable or hyperparameter) variant allows $\beta$ to vary globally, per-layer, or per-channel. As $\beta \to \infty$, Swish approaches ReLU; $\beta \to 0$ yields a scaled linear identity. The derivative has a closed form:

$$\frac{d}{dx}\,\mathrm{Swish}(x; \beta) = \sigma(\beta x) + \beta x \,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr)$$

yielding smooth, non-monotonic, and gradient-preserving transitions across $x$ (Ramachandran et al., 2017).
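
As a concrete illustration of the trainable-$\beta$ form, here is a minimal PyTorch sketch; the module name, per-channel parameter shape, and initialization are illustrative choices rather than code from the cited papers.

```python
import torch
import torch.nn as nn

class ParametricSwish(nn.Module):
    """Swish(x; beta) = x * sigmoid(beta * x) with a trainable beta.

    Minimal sketch: one beta per channel (dim 1 of the input tensor),
    initialized to 1.0 so training starts from the SiLU special case.
    Large beta pushes the gate toward ReLU; beta near 0 gives roughly x/2.
    """
    def __init__(self, num_channels: int, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast the per-channel beta over all remaining dimensions.
        shape = [1, -1] + [1] * (x.dim() - 2)
        return x * torch.sigmoid(self.beta.view(*shape) * x)
```

With $\beta$ fixed at 1 this reduces to `torch.nn.SiLU`; a global or per-layer $\beta$ follows the same pattern with a different parameter shape.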

The E-swish extension introduces an output slope parameter:

$$\mathrm{E\text{-}swish}_\beta(x) = \beta x \cdot \sigma(x)$$

where $\beta > 0$ amplifies (or contracts) the overall gain. This scaling can be interpreted as a "depth knob," controlling gradient magnitude and activation nonlinearity based on architecture depth (Alcaide, 2018).
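
Because $\beta$ is a plain hyperparameter here, E-swish needs no learnable state; a one-line sketch follows (the default $\beta = 1.5$ reflects the range reported later in this article and is otherwise an arbitrary choice).

```python
import torch

def e_swish(x: torch.Tensor, beta: float = 1.5) -> torch.Tensor:
    """E-swish sketch: beta * x * sigmoid(x) with a fixed gain beta."""
    return beta * x * torch.sigmoid(x)
```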

Further parametric forms generalize Swish to:

  • Learnable vertical shifts: $x \cdot \sigma(\beta x) - \gamma$, with $\gamma \in \mathbb{R}$ providing re-centering and explicit bias (as in SG-Blend's $\mathrm{SSwish}_{\beta, \gamma}$; see the sketch after this list) (Sarkar et al., 29 May 2025).
  • Context-adaptive gates: $x \cdot \sigma(a x + b)$, with $a$ and $b$ modulating the sharpness and the position of the gating boundary (as in Adaptive SwisH, ASH) (Lee et al., 2022).
  • Piecewise or hybrid regimes where Swish is augmented via additive biases or composed with Tanh, as in the Swish-T family (Seo et al., 2024).
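
The shifted form in the first bullet can be sketched as a small PyTorch module; the scalar (rather than per-channel) parameters and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SSwish(nn.Module):
    """Shifted Swish sketch: f(x) = x * sigmoid(beta * x) - gamma.

    beta controls the sharpness of the gate and gamma re-centers the
    output. With beta = 1 and gamma = 0 this reduces to plain Swish/SiLU.
    """
    def __init__(self, beta_init: float = 1.0, gamma_init: float = 0.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))
        self.gamma = nn.Parameter(torch.tensor(gamma_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x) - self.gamma
```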

2. Parametric Swish Variants and Theoretical Properties

Table: Representative Parametric Swish Activations

| Name | Formula | Principal Parameter(s) |
|---|---|---|
| Swish | $x \cdot \sigma(\beta x)$ | $\beta$ (trainable/fixed) |
| E-swish | $\beta x \cdot \sigma(x)$ | $\beta$ (hyperparameter) |
| SSwish | $x \cdot \sigma(\beta x) - \gamma$ | $\beta$, $\gamma$ (learnable) |
| Swish-T$_C$ | $\sigma(\beta x)\left(x + \frac{2\alpha}{\beta}\right) - \frac{\alpha}{\beta}$ | $\beta$ (learned), $\alpha$ (fixed) |
| ASH | $x \cdot \sigma(a x + b)$, with $a$, $b$ adapted to input statistics | $a$, $b$ (contextual, learned) |
| PFTS | $x \cdot \sigma(x) + t$ for $x \ge 0$, else $t$ | $t$ (learnable, piecewise) |

All variants retain the essential properties of Swish: $C^\infty$ smoothness, non-monotonicity (negative "dip" near $x = 0$), gradient non-sparsity, and unbounded positive support. Additional degrees of freedom enable:

  • Tunable sharpness and symmetry ($\beta$, $\gamma$)
  • Controlled mean activation (bias shift, via $t$ or $\gamma$)
  • Adaptive sparsity and top-$k$ filtering (ASH)
  • Context-aware or learnable gating regions (via $a$, $b$)
  • Flexible negative-region behavior (via Tanh bias in Swish-T or flat constant in PFTS)

3. Parametric Swish in Empirical Benchmarking

Parametric Swish activations have demonstrated consistent increases in accuracy, faster convergence, and improved loss landscape properties over both ReLU and fixed Swish, especially in convolutional and transformer architectures. Key findings include:

  • CIFAR-10/100, ImageNet: Swish with trainable or appropriately chosen $\beta$ exceeds ReLU by +0.7 to +1.4% top-1 on ResNet, MobileNet, and Inception/ResNet models (Ramachandran et al., 2017).
  • E-swish: Largest accuracy boosts on Wide ResNet are obtained for $\beta = 1.25$–$1.75$, with larger $\beta$ needed for shallow nets, and smaller $\beta$ for deeper or batch-normalized networks (Alcaide, 2018).
  • SSwish/SG-Blend: Learning both $\beta$ and $\gamma$ outperforms both Swish and GELU, with SG-Blend (blending SSwish and GELU using a learned $\alpha$) achieving state-of-the-art results on CIFAR-10, BERT, and WMT14 BLEU (Sarkar et al., 29 May 2025).
  • ASH: Adaptive, context-dependent thresholding provides 0.5–1% higher accuracy than Swish or GELU on vision and detection benchmarks, with accelerated convergence (Lee et al., 2022).
  • Swish-T$_C$: The non-parametric variant matches or marginally surpasses Swish-T with learned $\beta$, indicating that in some architectures a fixed, high $\beta$ (e.g., $\beta = 6$) suffices for optimal performance (Seo et al., 2024).
  • PFTS: On SVHN, performance gains of up to +71.8% over ReLU in deep fully-connected architectures, and the highest mean rank among parametric activations (Chieng et al., 2020).
  • Zorro family: The Sloped-Zorro variant ($m$ tuned to 0.7) numerically approximates Swish but further improves accuracy on vision tasks (e.g., +5.3% on CIFAR-10 over classic Swish) and enables explicit central slope control to mitigate vanishing gradients (Roodschild et al., 2024).

4. Parameterization, Learning Schemes, and Regularization

Different variants instantiate parameter learning distinctly:

  • Fixed vs. learned $\beta$: Swish can use a global fixed $\beta$, but per-channel (or per-layer) learnable $\beta$ consistently improves accuracy and adaptation (Ramachandran et al., 2017).
  • Symmetry and centering ($\gamma$, Tanh bias): Adding a learnable vertical offset (as in SSwish) or a Tanh bias (Swish-T) enables the activation's mean to be controlled, aiding convergence and negative-region expressivity (Seo et al., 2024, Sarkar et al., 29 May 2025).
  • Adaptive gating (ASH): Per-feature/mini-batch statistics ($\mu_X$, $\sigma_X$) and a learned percentile gate ($z_k$) are updated alongside network weights, providing context-sensitive activation sparsity or density (Lee et al., 2022).
  • Piecewise regimes (PFTS): The trainable threshold $t$ is optimized via standard back-propagation, with careful initialization ($t_0 \approx -0.2$) leading to stable mean-zero activations and reduced bias shift (sketched below) (Chieng et al., 2020).
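
The PFTS rule admits a short sketch with a single learnable threshold; the module layout is illustrative, with only the piecewise formula from the table above and the $t_0 \approx -0.2$ initialization taken from the text.

```python
import torch
import torch.nn as nn

class PFTS(nn.Module):
    """Sketch of the piecewise rule: x * sigmoid(x) + t for x >= 0, else t."""
    def __init__(self, t_init: float = -0.2):
        super().__init__()
        # Single learnable threshold, initialized near -0.2 as described above.
        self.t = nn.Parameter(torch.tensor(t_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The threshold receives gradients from both branches; the negative
        # branch is flat in x by construction.
        return torch.where(x >= 0, x * torch.sigmoid(x) + self.t, self.t.expand_as(x))
```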

All parameters may be optimized using standard SGD/Adam, with optional clamping or softplus constraints for stability, and regularizers (e.g., weight decay, penalties on $\|\alpha - 0.5\|$) as needed to avoid degenerate solutions (Sarkar et al., 29 May 2025).
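
One way such constraints can be wired in is sketched below; the softplus reparameterization of $\beta$ and the penalty weight are illustrative assumptions, not prescriptions from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedSwish(nn.Module):
    """Swish whose beta is kept strictly positive via softplus.

    The optimizer updates an unconstrained raw parameter; the effective
    beta = softplus(raw_beta) > 0, so no explicit clamping is needed.
    """
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        # Inverse softplus so the effective beta starts exactly at beta_init.
        raw = torch.log(torch.expm1(torch.tensor(beta_init)))
        self.raw_beta = nn.Parameter(raw)

    @property
    def beta(self) -> torch.Tensor:
        return F.softplus(self.raw_beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

def blend_penalty(alpha: torch.Tensor, weight: float = 1e-4) -> torch.Tensor:
    """Illustrative regularizer on ||alpha - 0.5|| for a learned blending
    coefficient; added to the task loss during training."""
    return weight * (alpha - 0.5).abs().sum()
```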

5. Theoretical Insights and Functional Flexibility

The introduction of parametric degrees of freedom serves to:

  • Smoothly interpolate between identity, ReLU, and saturating or non-monotonic regimes.
  • Adapt nonlinearity and gradient scale to network depth (avoiding vanishing or exploding gradients as in E-swish and Zorro).
  • Restore information flow or gradient signals in negative activation regimes (Swish-T, PFTS, SSwish).
  • Implement biologically inspired or sampling-based gating (ASH), linking to the behavioral variability of real neuron populations.

These properties facilitate globally smoother loss landscapes, fewer plateaus, data- and architecture-adaptive nonlinearity, and—in deep or wide networks—greater resilience against training instabilities (Ramachandran et al., 2017, Alcaide, 2018, Lee et al., 2022, Roodschild et al., 2024).

6. Comparative Experimental Evaluation and Selection Guidelines

Empirical work recommends the following:

  • Swish/E-swish: For shallow to moderate-depth architectures, $\beta \approx 1.25$–$1.5$; for deep networks (>30 layers), $\beta \approx 1.0$–$1.25$ (Alcaide, 2018).
  • SG-Blend/SSwish: Use as a drop-in replacement for GELU or Swish in both vision and NLP, initializing $\alpha = 0.5$, $\beta = 1.0$, and $\gamma = 0$ (sketched after this list) (Sarkar et al., 29 May 2025).
  • Swish-T$_A$ (non-parametric): Suitable in low-resource or mobile inference settings.
  • ASH: Trainable thresholding robustly outperforms fixed (non-learned) variants, and a steepness $\alpha$ in $[5, 20]$ works reliably (Lee et al., 2022).
  • Zorro: Start with $m = 0.7$, $a_1 = 1.3$, $a_2 = 0$, $b = 1.8$ for Swish-like behavior, tuning $m$ upward to accelerate convergence if necessary (Roodschild et al., 2024).
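
A sketch of the SSwish/GELU blend described above, using the recommended initializations ($\alpha = 0.5$, $\beta = 1.0$, $\gamma = 0$); the cited paper's exact parameterization (for example, how $\alpha$ is constrained) may differ, so treat this as an illustrative drop-in module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGBlend(nn.Module):
    """Blend sketch: alpha * SSwish(x) + (1 - alpha) * GELU(x),
    with SSwish(x) = x * sigmoid(beta * x) - gamma and learned alpha, beta, gamma."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # mixing coefficient
        self.beta = nn.Parameter(torch.tensor(1.0))   # gate sharpness
        self.gamma = nn.Parameter(torch.tensor(0.0))  # vertical shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sswish = x * torch.sigmoid(self.beta * x) - self.gamma
        return self.alpha * sswish + (1.0 - self.alpha) * F.gelu(x)
```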

Table: Empirical Accuracy Improvements (Sample Results)

| Task/Arch | Baseline | Parametric Swish Variant | Accuracy Δ (%) | Source |
|---|---|---|---|---|
| CIFAR-10, WRN-10-2 | ReLU | E-swish ($\beta = 1.5$) | +1.5 | (Alcaide, 2018) |
| CIFAR-10, ResNet-18 | Swish | SG-Blend | +0.36 | (Sarkar et al., 29 May 2025) |
| SVHN, DNN-7 | ReLU | PFTS | +71.83 | (Chieng et al., 2020) |
| CIFAR-10, ConvNet | Swish | Zorro-Swish | +5.3 | (Roodschild et al., 2024) |

7. Implementation and Practical Integration

Integration is straightforward in modern DL frameworks, with minimal code modifications (a generic drop-in sketch follows this list):

  • Swish/E-swish: One-liner replacements are possible; for E-swish, `activation(x) = beta * x * sigmoid(x)` or `activation(x) = beta * tf.nn.swish(x)` in TensorFlow (Alcaide, 2018).
  • SG-Blend/SSwish: Minimal PyTorch modules, registering parameters as nn.Parameters and updating with network weights, clamping as needed (Sarkar et al., 29 May 2025).
  • ASH, PFTS, Swish-T: Custom modules/layers, with in-line calculation of activation/statistics and simple per-layer parameter bookkeeping (Lee et al., 2022, Chieng et al., 2020, Seo et al., 2024).
  • Zorro: Explicit parameterization of the central slope/bounds allows both fixed and learnable regimes, adapting to dataset and performance requirements (Roodschild et al., 2024).
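
As a generic integration pattern (not code from any of the cited works), the sketch below swaps every ReLU in an existing model for a simple learnable-$\beta$ Swish; the helper names are hypothetical.

```python
import torch
import torch.nn as nn

class LearnableSwish(nn.Module):
    """Minimal learnable-beta Swish used only to illustrate module swapping."""
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

def swap_relu_for_swish(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.ReLU submodule with LearnableSwish."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, LearnableSwish())
        else:
            swap_relu_for_swish(child)
    return model

# Example: convert a small ReLU MLP in place.
mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
mlp = swap_relu_for_swish(mlp)
```

The new activation parameters are picked up automatically by any optimizer constructed from `mlp.parameters()`.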

Standard optimization hyperparameters suffice; batch normalization and weight initialization practices remain unchanged except as noted for Swish (retaining batchnorm scale) (Ramachandran et al., 2017).


Parametric Swish activation functions, by introducing trainable or context-adaptive parameters, provide an effective mechanism for learning data- and architecture-specific nonlinearities. Empirical and theoretical advantages include smoother optimization, improved accuracy and convergence, and increased functional flexibility, establishing these variants as robust, practical alternatives to both ReLU and fixed Swish/GELU baselines across a spectrum of deep learning applications (Ramachandran et al., 2017, Alcaide, 2018, Lee et al., 2022, Seo et al., 2024, Chieng et al., 2020, Roodschild et al., 2024, Sarkar et al., 29 May 2025).
