Adaptive Swish (ASH) Activation

Updated 25 June 2026

Adaptive Swish (ASH) is an activation function that uses adaptive, context-aware thresholds based on per-layer statistics to regulate neuron activation.
It extends the traditional Swish function by incorporating learnable, data-dependent thresholds to achieve high sparsity and mitigate interference in continual learning.
Empirical evaluations show that ASH and its hard variant offer improved accuracy and convergence in vision and continual learning tasks by focusing gradient flow on informative activations.

Adaptive Swish (ASH) is an activation function designed to impose adaptive, data-dependent sparsity in neural network layers by modulating activations based on feature-map statistics such as mean and standard deviation. Originating as a generalization of the Swish activation, ASH introduces learnable, context-aware thresholds that differentiate it from traditional pointwise nonlinearities. Its formulation is motivated both by biological observations of variable neuronal thresholds and by the need to mitigate interference in continual learning scenarios through sparse representations. ASH and its variants have demonstrated effectiveness across a range of tasks, including class-incremental learning and high-dimensional visual classification, providing empirical improvements over established activation functions and regularization methods (Keskinen, 2024; Lee et al., 2022).

1. Mathematical Formulation and Mechanism

ASH extends the Swish activation, $\mathrm{Swish}(x) = x \cdot \sigma(x)$, with an adaptive, per-layer (or per-channel) threshold. The core formula, as defined in Lee et al. (2022), is: $\mathrm{ASH}(x) = x \cdot \sigma\left(\alpha \cdot (x - \mu_X - z_k \sigma_X)\right)$, where $x$ is the pre-activation, $\sigma(\cdot)$ is the logistic sigmoid, $\mu_X$ and $\sigma_X$ are the mean and standard deviation of the activation vector (per layer or feature map), $\alpha > 0$ controls the steepness or hardness of the gating, and $z_k$ sets the threshold as a Z-score (number of standard deviations above the mean). This construction ensures that only units with activations exceeding a dynamic, data-driven threshold contribute non-trivially to the layer's output.
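
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the default values of `alpha` and `z_k`, and the use of the population standard deviation are all assumptions made for the example.

```python
import math
from statistics import fmean, pstdev


def sigmoid(u):
    """Numerically stable logistic sigmoid."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def ash(xs, alpha=10.0, z_k=1.0):
    """ASH(x) = x * sigmoid(alpha * (x - mu_X - z_k * sigma_X)) for one activation vector."""
    mu = fmean(xs)
    sd = pstdev(xs)  # population std; the choice of estimator is an assumption
    threshold = mu + z_k * sd
    return [x * sigmoid(alpha * (x - threshold)) for x in xs]


# Hypothetical pre-activations for one layer.
pre_activations = [-1.0, -0.5, 0.0, 0.5, 3.0]
out = ash(pre_activations)
```

With these assumed values the threshold sits at roughly 1.79, so only the largest entry (3.0) passes almost unchanged while the other four are gated to nearly zero.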

Variants include a smooth gate (sigmoid or tanh-based) and a hard gate, as detailed later. The per-layer percentile thresholding is justified by the approximate Gaussianity of pre-activations found in convolutional and feedforward layers, allowing the threshold $z_k$ to be interpreted as selecting the top $k\%$ of activations (Lee et al., 2022).
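
Under the Gaussian approximation, the link between $z_k$ and the top $k\%$ of activations is just the inverse of the standard normal CDF. The sketch below uses Python's standard-library `NormalDist`; the helper names are hypothetical and the mapping holds only to the extent that the pre-activations are actually close to Gaussian.

```python
from statistics import NormalDist


def z_for_top_percent(k_percent):
    """Z-score whose upper tail under a standard normal holds k_percent of the mass."""
    return NormalDist().inv_cdf(1.0 - k_percent / 100.0)


def expected_active_fraction(z_k):
    """Fraction of a standard normal lying above z_k."""
    return 1.0 - NormalDist().cdf(z_k)


z_top_2_5 = z_for_top_percent(2.5)
frac_at_z2 = expected_active_fraction(2.0)
```

For example, keeping the top 2.5% corresponds to $z_k \approx 1.96$, and $z_k = 2$ leaves about 2.3% of units above the threshold.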

2. Hard Adaptive Swish (Hard ASH) Variant

Hard ASH modifies the smooth gating to a discretized, nearly piecewise-constant regime. The gating function is replaced with a hard sigmoid, and the activations themselves are clipped within a maximum bound: $\mathrm{HardASH}(x) = \mathrm{clip}(x) \cdot \mathrm{hardsigmoid}\left(\alpha \cdot (x - \mu_X - z_k \sigma_X)\right)$, where $\mathrm{hardsigmoid}$ is a piecewise-linear approximation of the logistic sigmoid that saturates at 0 and 1, and $\mathrm{clip}$ restricts $x$ to a bounded range. This regime effectively divides neurons into three groups: those totally suppressed (output and gradient zero), those passing maximal output (again, gradient zero), and a narrow transition whose location and width are controlled by $z_k$ and $\alpha$, respectively. This further sparsifies both activations and gradients, sharply limiting interference during sequential task learning (Keskinen, 2024).
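
A rough sketch of a hard-gated, clipped variant is shown below. The exact hard sigmoid and clipping bounds used in the paper are not reproduced here: the piecewise-linear gate, the clipping range `[0, x_max]`, and the defaults `alpha=4.0`, `z_k=1.0`, `x_max=2.0` are all assumptions chosen for illustration.

```python
from statistics import fmean, pstdev


def hard_sigmoid(u):
    """Piecewise-linear gate: 0 for u <= -1, 1 for u >= 1, linear in between (assumed form)."""
    return min(1.0, max(0.0, 0.5 * (u + 1.0)))


def hard_ash(xs, alpha=4.0, z_k=1.0, x_max=2.0):
    """Clipped activation times a hard gate centred on the adaptive threshold."""
    mu = fmean(xs)
    sd = pstdev(xs)
    threshold = mu + z_k * sd
    out = []
    for x in xs:
        clipped = min(max(x, 0.0), x_max)  # assumed clipping range [0, x_max]
        out.append(clipped * hard_sigmoid(alpha * (x - threshold)))
    return out


# Hypothetical layer: mostly silent units, one moderate, one strongly active.
hard_out = hard_ash([0.0, 0.0, 0.0, 0.0, 1.0, 5.0])
```

Here the threshold is about 2.83, so every unit except the last is fully suppressed and the last saturates at `x_max`. With this assumed gate, the transition band has width `2 / alpha` around the threshold, which is where nonzero gradients would be concentrated.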

3. Context-Adaptive Thresholding and Trainability

The central feature of ASH is its context- and batch-adaptive gating. Unlike ReLU or Swish, which use fixed thresholds or fixed gating dynamics, ASH computes its thresholds on the fly per forward pass using mini-batch statistics. The sparsity control parameter $z_k$ (optionally trainable per layer or channel) sets the desired activation density, while $\alpha$ modulates gating steepness. The combination yields trainable, context-adaptive sparsification: only the most informative subset of units is activated in response to current task or input statistics.
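
To make the context dependence concrete, the following sketch evaluates the gate for the same pre-activation value under two different sets of surrounding activations. The helper name `ash_gate`, the two example contexts, and the parameter defaults are hypothetical.

```python
import math
from statistics import fmean, pstdev


def sigmoid(u):
    """Numerically stable logistic sigmoid."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def ash_gate(x, context, alpha=10.0, z_k=1.0):
    """Gate value for x given the statistics of the activations around it."""
    threshold = fmean(context) + z_k * pstdev(context)
    return sigmoid(alpha * (x - threshold))


quiet_context = [0.0, 0.1, -0.1, 0.0, 1.0]  # x = 1.0 stands out
busy_context = [2.0, 3.0, 4.0, 3.0, 1.0]    # x = 1.0 is the weakest unit

gate_quiet = ash_gate(1.0, quiet_context)
gate_busy = ash_gate(1.0, busy_context)
```

The same value of 1.0 is passed almost fully in the quiet context and suppressed almost entirely in the busy one, which is the behavior a fixed-threshold activation such as ReLU cannot express.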

Empirical ablations show that trainable $z_k$ outperforms fixed percentile sparsification. A hard threshold gating function is intractable for gradient-based learning, so ASH employs smooth approximations (sigmoid or tanh), preserving differentiability for optimization (Lee et al., 2022).

4. Empirical Evaluation and Comparative Results

ASH has been tested in both vision and continual learning settings:

Vision benchmarks: On ImageNet (ResNet-164, WRN-28-10, DenseNet-100-12 backbones), ASH attains top-1 accuracies of 78.6% ± 0.07 versus Swish's 77.5% ± 0.07 and ReLU's 76.4% ± 0.1. For CIFAR-10, ASH achieves 96.1% ± 0.05 (Swish: 95.6%, ReLU: 94.7%). Across other tasks (detection, segmentation, image generation), ASH consistently yields higher accuracy, faster convergence, and improved mean average precision or mIoU scores (Lee et al., 2022).
Continual learning (Split-MNIST): With a one-hidden-layer MLP (1,000 neurons, weight normalization), Hard ASH + Adagrad achieves 78.3% ± 1.4 after five sequential tasks (one epoch per task), notably outperforming ReLU + Adam (49.2% ± 7.9), Top-K subtraction, and Elastic Weight Consolidation (EWC, 500 epochs: 61%). ASH and Hard ASH exhibit resilience across optimizers, with SGD + Hard ASH (52.9%) far surpassing SGD + ReLU (19.8%). Increasing $\alpha$ or $z_k$ boosts retention for early tasks without degrading learning on later tasks (Keskinen, 2024).

A representative table of comparative Split-MNIST results:

Activation + Optimizer	Final Test Acc. (5 runs, %)	Training Regime
Hard ASH + Adagrad	78.3 ± 1.4	5 epochs (1 per task)
ASH + Adagrad	76.4 ± 1.4	5 epochs (1 per task)
Top-K Subtract + Adagrad	76.0 ± 1.6	5 epochs (1 per task)
ReLU + Adam	49.2 ± 7.9	5 epochs (1 per task)
EWC	61	500 epochs
SDMLP	69	500 epochs
SDMLP + EWC	83	500 epochs

5. Implementation Details and Computational Considerations

ASH requires computation of per-layer (or per-feature-map) mean and standard deviation, introducing minimal additional overhead relative to popular normalization layers. In standard vision models, ASH is "plug-and-play," replacing ReLU or Swish with no other architectural changes. For MLP continual learning, the canonical setup uses Kaiming-normalized weights and weight normalization on the first layer, with Adagrad as the primary optimizer; the specific learning rate and initial accumulator value are reported in the source paper (Keskinen, 2024; Lee et al., 2022).

ASH achieves high sparsity (typically 97–99% of units suppressed per layer) in $O(n)$ per-layer time, sidestepping explicit Top-K sorting ($O(n \log n)$). Backpropagation through the adaptive gating is handled via standard autodiff, with explicit formulas for gradient flow with respect to both $x$ and $z_k$. Sample pseudo-code for forward/backward computation is provided in (Lee et al., 2022).
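
As an illustration of why the threshold parameter is trainable, the sketch below derives the gradient of the summed ASH output with respect to $z_k$ directly from the formula $x \cdot \sigma(\alpha (x - \mu_X - z_k \sigma_X))$ and checks it against a finite difference. It does not reproduce the backward formulas from the paper; the function names and test values are assumptions. Since $\mu_X$ and $\sigma_X$ do not depend on $z_k$, each unit contributes $-\alpha \, \sigma_X \, x \, s (1 - s)$, where $s$ is that unit's gate value.

```python
import math
from statistics import fmean, pstdev


def sigmoid(u):
    """Numerically stable logistic sigmoid."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def ash_sum(xs, alpha, z_k):
    """Sum of ASH outputs over one activation vector (a stand-in scalar objective)."""
    mu, sd = fmean(xs), pstdev(xs)  # one pass each over xs, no sorting needed
    return sum(x * sigmoid(alpha * (x - mu - z_k * sd)) for x in xs)


def grad_z_k(xs, alpha, z_k):
    """Analytic derivative of ash_sum with respect to z_k."""
    mu, sd = fmean(xs), pstdev(xs)
    total = 0.0
    for x in xs:
        s = sigmoid(alpha * (x - mu - z_k * sd))
        total += x * s * (1.0 - s) * (-alpha * sd)
    return total


xs = [-1.0, -0.5, 0.0, 0.5, 3.0]
alpha, z_k, eps = 2.0, 1.0, 1e-6
numeric = (ash_sum(xs, alpha, z_k + eps) - ash_sum(xs, alpha, z_k - eps)) / (2 * eps)
analytic = grad_z_k(xs, alpha, z_k)
```

The two values agree closely, and the gradient is negative for this input: raising $z_k$ raises the threshold and reduces the total output, which is the signal an optimizer would use to adjust sparsity.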

6. Theoretical Properties and Intuitions

ASH leverages the approximate Gaussianity of deep-layer pre-activations to justify its adaptive Z-thresholding (Propositions 1 and 2 in Lee et al., 2022). The gating function is infinitely differentiable (for finite $\alpha$), strictly increasing for positive $\alpha$, and bounded between $0$ and $1$. Hard ASH introduces larger regions of zero gradient, promoting stability by constraining most unit updates to the narrow transition band around the adaptive threshold.

In continual learning, ASH and Hard ASH directly mitigate the stability–plasticity dilemma. High sparsity reduces overlap between units active during different tasks, thus lowering interference (stability), while active units for new tasks receive full gradient signals (plasticity). This focused gradient flow prevents widespread drift of parameters tied to inactive units, curbing catastrophic forgetting.

7. Practical Benefits, Limitations, and Open Issues

Empirical studies highlight several advantages of ASH:

Context- and batch-aware gating adapts to input statistics per forward pass.
Trainable thresholding enables each neuron or channel to regulate its own effective sparsity.
Rapid convergence and competitive accuracy on diverse benchmarks, including classification, detection, and segmentation.

Documented limitations include slight computational and memory overhead for per-map statistics, absence of formal generalization bounds for arbitrary gate parameters in the generalized Swish form, and untested application to NLP/transformer models. The dynamics of $z_k$ evolution across deeper architectures remain an open research avenue. A plausible implication is that further theoretical analysis of adaptive thresholding could yield insights applicable to broader classes of neural architectures (Lee et al., 2022).

References

"Hard ASH: Sparsity and the right optimizer make a continual learner" (Keskinen, 2024)
"Stochastic Adaptive Activation Function" (Lee et al., 2022)
