Learnable Sigmoid Function for Mask Estimation

Updated 15 June 2026

The paper introduces the Rectified Power Sigmoid Unit (RPSU), a learnable sigmoid-type activation whose nonlinearity is parameterized and trained, proposed here as a replacement for fixed functions in mask estimation.
It provides an analytic formulation with tractable derivatives, so it integrates into standard backpropagation pipelines and neural architectures.
Empirical tests on image classification show faster convergence and higher accuracy than established activations such as ReLU and Swish; benefits for audio mask estimation are inferred rather than directly reported.

A learnable sigmoid function for mask estimation is a neural transfer function whose nonlinearity is not fixed but is parameterized and trained jointly with the linear weights of the network. The Rectified Power Sigmoid Unit (RPSU), introduced by Atto et al. (2021), offers a general framework for such learnable nonlinearities, integrating characteristics of the sigmoid, rectified linear unit (ReLU), Swish/SiLU, and other activation forms within a parameterized and analytically tractable class. RPSU provides a general, learnable alternative to conventional sigmoids in tasks such as time–frequency mask estimation, letting networks adaptively select effective nonlinear squash-and-rectify shapes to optimize task performance (Atto et al., 2021).

1. Analytic Formulation of the Rectified Power Sigmoid Unit

The RPSU is defined in terms of the following parameters: a rectification threshold $\lambda$, a horizontal shift $\mu$, a positive scale $\sigma$, a positive shape exponent $\beta$, and a convex mixing parameter $\alpha \in [0,1]$. Let $x \in \mathbb{R}$ be the pre-activation input. The two extremal building block functions are:

$f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$, where $(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ and $\mathrm{sgn}(u) = +1$ if $u \geq 0$, $-1$ otherwise.
$\mu$ 1, where $\mu$ 2.
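
As a minimal sketch, the first building block $f_{\lambda, \sigma, \mu, \beta}$ can be written directly from its definition above. The function name `rpsu_block`, the scalar (non-vectorized) form, and the default parameter values are assumptions made for illustration; they are not taken from Atto et al. Only the first building block is covered.

```python
import math


def rpsu_block(x, lam=0.0, mu=0.0, sigma=1.0, beta=1.0):
    """First building block f_{lambda, sigma, mu, beta}(x) for a scalar x.

    Default parameter values are illustrative only (an assumption).
    """
    if sigma <= 0.0 or beta <= 0.0:
        raise ValueError("sigma and beta must be positive")
    rectified = max(x - lam, 0.0)              # (x - lambda)^+
    d = x - mu
    sign = 1.0 if d >= 0.0 else -1.0           # sgn(u) = +1 if u >= 0, -1 otherwise
    u = sign * (abs(d) / sigma) ** beta        # signed power-law argument
    # Numerically stable evaluation of 1 / (1 + exp(-u)).
    if u >= 0.0:
        gate = 1.0 / (1.0 + math.exp(-u))
    else:
        e = math.exp(u)
        gate = e / (1.0 + e)
    return rectified * gate


# Evaluate the block on a few inputs with the illustrative defaults.
for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, rpsu_block(x))
```

With these illustrative defaults the block is zero for inputs at or below the threshold and equals $x$ times the logistic sigmoid of $x$ above it.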

The composite RPSU activation is the affine blend:

$\mu$ 3

For $\mu$ 4, $\mu$ 5; for $\mu$ 6, let $\mu$ 7, $\mu$ 8, $\mu$ 9, and $\sigma$ 0.

Various choices of the parameters recover known activation functions as special cases:

Regime	Parameter Settings	Limit Function
ReLU	$\sigma$ 1, $\sigma$ 2, $\sigma$ 3	$\sigma$ 4
Swish/SiLU	$\sigma$ 5, $\sigma$ 6, $\sigma$ 7, $\sigma$ 8, $\sigma$ 9	$\beta$ 0
SSBS	$\beta$ 1, $\beta$ 2	Shrinkage (SSBS)

RPSU thus allows for a smooth interpolation between a broad family of rectifying and squashing functions.

2. Differentiation and Backpropagation Dynamics

The derivatives of $\beta$ 3 with respect to all its parameters are analytically tractable for $\beta$ 4. Denote $\beta$ 5, $\beta$ 6, and $\beta$ 7 as above.

Key partial derivatives used in backpropagation include:

$\beta$ 8, where $\beta$ 9 and $\alpha \in [0,1]$ 0.
$\alpha \in [0,1]$ 1.
$\alpha \in [0,1]$ 2.
The derivatives with respect to $\alpha \in [0,1]$ 3 and $\alpha \in [0,1]$ 4 enter through $\alpha \in [0,1]$ 5 and follow analogous differentiation rules, enabling seamless parameter updates during training.
$\alpha \in [0,1]$ 6 is typically held fixed (often at $\alpha \in [0,1]$ 7) due to sensitivity and computational expense.

This full analytic differentiability allows RPSU parameters to be optimized by standard SGD or Adam without altering established backpropagation pipelines (Atto et al., 2021).
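
To illustrate the tractability claim, the sketch below writes out partial derivatives of the first building block $f_{\lambda, \sigma, \mu, \beta}$ and checks them against central finite differences. The derivative expressions are obtained here by applying the chain rule to the definition of $f$ in Section 1; they are not copied from Atto et al., they cover only that building block, and they hold away from the kinks at $x = \lambda$ and $x = \mu$. All function names and test values are hypothetical.

```python
import math


def _sigmoid(u):
    # Numerically stable logistic function.
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def f_and_grads(x, lam, mu, sigma, beta):
    """Return f(x) and its partials w.r.t. x, lam, mu, sigma, beta.

    Valid for x != lam and x != mu (the function has kinks there).
    """
    d = x - mu
    r = abs(d) / sigma
    sign = 1.0 if d >= 0.0 else -1.0
    u = sign * r ** beta                      # signed power-law argument
    s = _sigmoid(u)
    ds = s * (1.0 - s)                        # derivative of the logistic
    rect = max(x - lam, 0.0)                  # (x - lambda)^+
    active = 1.0 if x > lam else 0.0          # indicator of the rectifier
    du_dx = (beta / sigma) * r ** (beta - 1.0)
    grads = {
        "x": active * s + rect * ds * du_dx,
        "lam": -active * s,
        "mu": -rect * ds * du_dx,             # du/dmu = -du/dx
        "sigma": -rect * ds * (beta / sigma) * u,
        "beta": rect * ds * u * math.log(r),  # du/dbeta = u * ln(r)
    }
    return rect * s, grads


def finite_diff(name, x, params, eps=1e-6):
    """Central finite difference of f w.r.t. one argument."""
    if name == "x":
        hi = f_and_grads(x + eps, **params)[0]
        lo = f_and_grads(x - eps, **params)[0]
        return (hi - lo) / (2.0 * eps)
    p_hi = dict(params)
    p_lo = dict(params)
    p_hi[name] += eps
    p_lo[name] -= eps
    hi = f_and_grads(x, **p_hi)[0]
    lo = f_and_grads(x, **p_lo)[0]
    return (hi - lo) / (2.0 * eps)


params = {"lam": 0.2, "mu": -0.3, "sigma": 1.5, "beta": 2.0}  # arbitrary test point
x = 1.1
_, grads = f_and_grads(x, **params)
max_err = max(abs(grads[k] - finite_diff(k, x, params)) for k in grads)
print("largest analytic vs. numeric gap:", max_err)
```

A small gap between the analytic and numeric values is what makes it reasonable to hand such expressions to a standard optimizer; in an autodiff framework the same gradients would be produced automatically.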

3. Empirical Performance and Optimization

Empirical evidence from Atto et al. shows substantial performance gains when a fixed nonlinearity is replaced with a parameterized RPSU activation:

In a shallow MNIST-type CNN (1 conv layer + RPSU + FC), two epochs of training yielded 95.6% accuracy for RPSU, compared with 91.2% (ReLU), 94.5% (Swish/Mish), and 79% (PSWISH).
For deep texture classification networks (“GFBF-texture” dataset, 5 conv layers + 1 RPSU + ReLUs), the best validation accuracy after 100 epochs for CNN-2-RePSU was 74.3%, compared to 58.9% (ReLU), 66.3% (MISH), 69.6% (PSWISH), and 60.7% (plain Swish).

Guidelines for effective use include initializing $\alpha \in [0,1]$ 8 (threshold) near $\alpha \in [0,1]$ 9, $x \in \mathbb{R}$ 0 (shift) to $x \in \mathbb{R}$ 1, $x \in \mathbb{R}$ 2 (scale) to $x \in \mathbb{R}$ 3, $x \in \mathbb{R}$ 4 (shape) to $x \in \mathbb{R}$ 5 (and often freezing or learning very slowly), and $x \in \mathbb{R}$ 6 (mix) at $x \in \mathbb{R}$ 7 or $x \in \mathbb{R}$ 8. Only one RPSU layer is typically needed for large gains; further layers may retain standard ReLU to preserve computational efficiency. The optimizer learning rate for RPSU parameters can often match that of convolutional weights (for Adam), or be attenuated by a factor of 0.1 (for SGD) to prevent overshooting (Atto et al., 2021).
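
The learning-rate guidance can be sketched as a plain SGD step with two parameter groups. This is a hypothetical, framework-free illustration: the function `sgd_step`, the parameter names, and the toy gradient values are assumptions, and only the 0.1 attenuation factor for SGD comes from the text above.

```python
# Names treated as RPSU shape parameters (hypothetical naming convention).
RPSU_NAMES = {"lam", "mu", "sigma", "beta", "alpha"}


def sgd_step(params, grads, base_lr, rpsu_scale=0.1):
    """One SGD update; RPSU parameters use base_lr * rpsu_scale.

    rpsu_scale=0.1 mirrors the attenuation suggested for SGD;
    rpsu_scale=1.0 would match the suggestion for Adam-style setups,
    where RPSU parameters share the rate of the convolutional weights.
    """
    updated = {}
    for name, value in params.items():
        lr = base_lr * rpsu_scale if name in RPSU_NAMES else base_lr
        updated[name] = value - lr * grads[name]
    return updated


# Toy example: one linear weight and one RPSU scale, equal gradients.
params = {"w": 1.0, "sigma": 1.0}
grads = {"w": 0.5, "sigma": 0.5}
new_params = sgd_step(params, grads, base_lr=0.1)
print(new_params)
```

With equal gradients, the RPSU parameter moves ten times less than the linear weight, which is the intended damping against overshooting.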

4. Integration into Mask Estimation Architectures

In mask-estimation networks, which are common in audio, speech enhancement, and time–frequency (T-F) tasks, a fixed sigmoid is typically applied to the output logits to constrain outputs to $(0, 1)$. RPSU allows direct replacement of this sigmoid, with trainable parameters enabling the network to learn the most effective mask-shaping nonlinearity.

Recommended practice involves initializing RPSU parameters $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 0, $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 1, $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 2, $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 3, $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 4, which places RPSU in the SiLU regime: $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 5, naturally encoding a flexible squashing. Parameter layouts can involve global, per-channel, or per-frequency settings, balancing parameter cost and expressivity. The output $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 6 typically remains in $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 7, and task loss functions further constrain predictions to appropriate mask ranges. Only one RPSU per output channel or filter is used in published MNIST experiments (Atto et al., 2021).
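
A per-frequency parameter layout can be sketched as follows, using only the first building block $f_{\lambda, \sigma, \mu, \beta}$ from Section 1 as the mask nonlinearity. Everything here is an assumption made for illustration: the function names, the toy logits and magnitudes, the parameter values, and in particular the optional clip to 1, which is added because this building block is unbounded above and is not described in the source.

```python
import math


def rpsu_block(x, lam, mu, sigma, beta):
    """First building block f_{lambda, sigma, mu, beta}(x) for a scalar x."""
    d = x - mu
    sign = 1.0 if d >= 0.0 else -1.0
    u = sign * (abs(d) / sigma) ** beta
    gate = 1.0 / (1.0 + math.exp(-u)) if u >= 0.0 else math.exp(u) / (1.0 + math.exp(u))
    return max(x - lam, 0.0) * gate


def estimate_mask(logits, per_freq_params, clip_to_unit=True):
    """Apply one parameter set per frequency bin to a [frames][bins] grid of logits.

    clip_to_unit is an illustrative safeguard (assumption), not part of RPSU.
    """
    mask = []
    for frame in logits:
        row = []
        for k, x in enumerate(frame):
            m = rpsu_block(x, **per_freq_params[k])
            row.append(min(m, 1.0) if clip_to_unit else m)
        mask.append(row)
    return mask


# Toy data: 2 frames x 3 frequency bins (hypothetical values).
logits = [[-1.0, 0.5, 3.0], [2.0, -0.2, 0.0]]
magnitudes = [[0.3, 1.2, 0.8], [0.5, 0.9, 0.1]]
per_freq_params = [
    {"lam": 0.0, "mu": 0.0, "sigma": 1.0, "beta": 1.0},
    {"lam": -0.5, "mu": 0.0, "sigma": 0.5, "beta": 1.0},
    {"lam": 0.0, "mu": 1.0, "sigma": 2.0, "beta": 2.0},
]

mask = estimate_mask(logits, per_freq_params)
# Element-wise masking of the magnitude spectrogram.
enhanced = [[m * a for m, a in zip(mrow, arow)] for mrow, arow in zip(mask, magnitudes)]
print(mask)
print(enhanced)
```

A global layout would pass the same dictionary for every bin, and a per-channel layout would index the parameter list by channel instead of by frequency; the trade-off is parameter count against expressivity, as noted above.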

The expected benefits in this context are:

Jointly learning both feature mixing and mask-shaping nonlinearity.
Smooth interpolation between soft-sigmoid and hard-rectified behaviors (controlled by $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 8).
Early stabilization in a near-sigmoid regime, with adaptability towards sharper or more binary mask behaviors as learning progresses.

5. Quantitative Insights and Broader Implications

Reported experiments outside direct mask-estimation tasks suggest that replacing a fixed nonlinearity with a learnable RPSU may accelerate convergence, sharpen and stabilize loss reduction, and lead to higher final accuracy. Relative to ReLU, the accuracies quoted in Section 3 correspond to a gain of about 4.4 percentage points after 2 epochs in the shallow vision network (95.6% versus 91.2%) and about 15.4 points after 100 epochs in deep texture classification (74.3% versus 58.9%). While direct results on audio masking were not detailed in the original reference, these gains suggest potential benefits for applications that currently rely on standard sigmoid mask outputs, such as voice-activity detection and time–frequency masking (Atto et al., 2021).

A plausible implication is that task-specific, learnable nonlinear squash-and-rectify mappings allow mask estimation networks to automatically shift towards optimal “soft” or “hard” outputs without architectural modification.

6. Practical Considerations and Implementation

RPSU is implemented by element-wise application following any convolutional or linear layer, with parameter updates joint with standard neural network optimization. Best practices call for batch normalization prior to RPSU for distributional stability. Representative pseudo-code and update rules are given explicitly by Atto et al., detailing both forward and backward computations, with attention to maintaining parameter constraints (e.g., $f_{\lambda, \sigma, \mu, \beta}(x) := (x - \lambda)^+ \cdot \frac{1}{1 + \exp\{-\mathrm{sgn}(x - \mu)(|x - \mu|/\sigma)^\beta\}}$ 9, $(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 0 via clamping).
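
Constraint maintenance by clamping can be sketched as a projection applied after each optimizer step. The constraint set used here (positive $\sigma$, positive $\beta$, $\alpha \in [0,1]$) follows the parameter definitions in Section 1; the function name `project_rpsu_params` and the small floor `eps` are assumptions for illustration, not values from Atto et al.

```python
def project_rpsu_params(params, eps=1e-4):
    """Clamp RPSU parameters back into their admissible ranges.

    eps is a hypothetical positive floor that keeps sigma and beta
    strictly positive; lam and mu are unconstrained and left untouched.
    """
    projected = dict(params)
    projected["sigma"] = max(params["sigma"], eps)           # sigma > 0
    projected["beta"] = max(params["beta"], eps)             # beta > 0
    projected["alpha"] = min(max(params["alpha"], 0.0), 1.0)  # alpha in [0, 1]
    return projected


# Example: an update pushed sigma negative and alpha above 1.
after_step = {"lam": 0.1, "mu": -0.2, "sigma": -0.3, "beta": 1.2, "alpha": 1.4}
print(project_rpsu_params(after_step))
```

An alternative design, not described in the source, is to reparameterize (for example, learning the logarithm of $\sigma$), which avoids clamping at the cost of changing the gradient scale.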

Initialization guidelines and empirical recommendations are summarized as:

Parameter	Typical Initialization	Notes
$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 1	$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 2	ReLU-like regime
$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 3	$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 4	Shift center
$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 5	$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 6	Sigmoid part unit slope
$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 7	$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 8 (fixed or slow)	Power-law shape
$(x - \lambda)^+ = (x - \lambda)\cdot \mathbf{1}_{x \geq \lambda}$ 9	$\mathrm{sgn}(u) = +1$ 0 or $\mathrm{sgn}(u) = +1$ 1	Mix between shrink/stretch

With these choices, RPSU layers can be deployed in established networks (e.g., U-Net, mask CNNs) with negligible architectural disruption, offering end-to-end trainability across both linear and nonlinear operator spaces (Atto et al., 2021).

References

Atto et al. (2021). Parametric Rectified Power Sigmoid Units: Learning Nonlinear Neural Transfer Analytical Forms.
