Multiple Parametric ELU (MPELU)

Updated 1 April 2026

Multiple Parametric ELU (MPELU) is a family of activation functions that extends ReLU, PReLU, and ELU by introducing learnable parameters for flexible adaptation of negative activations.
It adapts the negative regime via per-channel parameters α and β, improving convergence rate, classification accuracy, and compatibility with weight initialization and normalization techniques.
Empirical results on CIFAR datasets validate MPELU's effectiveness in achieving faster convergence and higher accuracy in both shallow and very deep residual architectures.

Multiple Parametric Exponential Linear Units (MPELU) are a generalized family of activation functions for deep neural networks that unify and extend the behaviors of the rectified linear unit (ReLU), parametric ReLU (PReLU), and exponential linear unit (ELU). MPELU introduces two learnable parameters, $\alpha$ and $\beta$, either per channel or shared across channels, allowing flexible adaptation of both the negative saturation level and the curvature. This improves classification accuracy and convergence rate and, when paired with a matching weight initialization scheme, enables robust training of very deep architectures. MPELU was introduced to address limitations of existing activations in both expressive range and compatibility with initialization and normalization techniques, particularly in deep residual networks (Li et al., 2016).

1. Mathematical Formulation and Special Cases

The MPELU activation for input $y_i$ is defined as

$f(y_i) = \begin{cases} y_i, & y_i > 0, \\ \alpha_c\bigl(e^{\beta_c y_i} - 1\bigr), & y_i \leq 0, \end{cases}$

where $\alpha_c > 0$ and $\beta_c > 0$ are learnable parameters per channel or layer.

These parameters uniquely define:

$\alpha_c$ : scaling for the negative (saturating) regime
$\beta_c$ : curvature of the exponential branch
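
As a minimal sketch of the definition above, the following plain-Python functions evaluate MPELU for scalars and for per-channel lists. The names `mpelu` and `mpelu_channels` and the list-of-lists layout are illustrative assumptions, not part of the original work.

```python
import math

def mpelu(y, alpha, beta):
    """MPELU for a scalar input y with parameters alpha and beta (beta > 0)."""
    if y > 0:
        return y                              # identity branch for positive inputs
    # Exponential branch: expm1(x) = e^x - 1; the output saturates at -alpha.
    return alpha * math.expm1(beta * y)

def mpelu_channels(inputs, alphas, betas):
    """Apply MPELU to per-channel value lists, one (alpha, beta) pair per channel."""
    return [[mpelu(y, a, b) for y in channel]
            for channel, a, b in zip(inputs, alphas, betas)]

# Two channels with different parameters (values chosen only for illustration).
outputs = mpelu_channels([[-1.0, 2.0], [-1.0, 2.0]], alphas=[1.0, 0.5], betas=[1.0, 2.0])
```

Positive inputs pass through unchanged in both channels, while the negative input is mapped differently per channel because each channel carries its own $\alpha_c$ and $\beta_c$.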

The formulation recovers prominent special cases as follows:

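ReLU: $\alpha_c = 0$ (the limiting case $\alpha_c \rightarrow 0$)
PReLU: For $\beta_c \rightarrow 0$ , $e^{\beta_c y_i} - 1 \approx \beta_c y_i$, so $f(y_i) \approx \alpha_c \beta_c y_i$ for $y_i \leq 0$, $f(y_i) = y_i$ for $y_i > 0$
ELU: $\beta_c = 1$, $\alpha_c$ held fixed (e.g., $\alpha_c = 1$)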
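
The special cases can be checked numerically. The sketch below compares MPELU against reference ReLU, PReLU, and ELU functions on a few sample points; it assumes ELU with scale 1 and a PReLU slope of 0.25, reached by taking a small $\beta_c$ with $\alpha_c = 0.25 / \beta_c$. All function names and sample values are illustrative.

```python
import math

def mpelu(y, alpha, beta):
    return y if y > 0 else alpha * math.expm1(beta * y)

def relu(y):
    return max(y, 0.0)

def prelu(y, slope):
    return y if y > 0 else slope * y

def elu(y, alpha=1.0):
    return y if y > 0 else alpha * math.expm1(y)

samples = [-2.0, -0.5, 0.0, 0.5, 2.0]

# alpha = 0 removes the negative branch entirely: ReLU.
relu_gap = max(abs(mpelu(y, 0.0, 1.0) - relu(y)) for y in samples)

# alpha = beta = 1 is the ELU with unit scale.
elu_gap = max(abs(mpelu(y, 1.0, 1.0) - elu(y)) for y in samples)

# Small beta makes the negative branch nearly linear with slope alpha * beta.
beta = 1e-3
prelu_gap = max(abs(mpelu(y, 0.25 / beta, beta) - prelu(y, 0.25)) for y in samples)
```

The ReLU and ELU cases match exactly, while the PReLU case is only approximate: the gap shrinks as $\beta_c$ decreases, since the first-order expansion of the exponential becomes more accurate.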

This parametric generalization covers a strictly larger class of nonlinearities, interpolating continuously between linear and nonlinear (exponential) negative regimes (Li et al., 2016).

2. Forward and Backward Computation

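The partial derivatives needed for backpropagation are as follows. For $y_i \leq 0$, note that $\alpha_c e^{\beta_c y_i} = f(y_i) + \alpha_c$:

$\frac{\partial f(y_i)}{\partial y_i} = \begin{cases} 1, & y_i > 0, \\ \beta_c\bigl(f(y_i) + \alpha_c\bigr), & y_i \leq 0, \end{cases}$

$\frac{\partial f(y_i)}{\partial \alpha_c} = \begin{cases} 0, & y_i > 0, \\ e^{\beta_c y_i} - 1, & y_i \leq 0, \end{cases}$

$\frac{\partial f(y_i)}{\partial \beta_c} = \begin{cases} 0, & y_i > 0, \\ y_i\bigl(f(y_i) + \alpha_c\bigr), & y_i \leq 0. \end{cases}$

For a scalar loss $\mathcal{L}$, gradients for the shared parameters are

$\frac{\partial \mathcal{L}}{\partial \alpha_c} = \sum_i \frac{\partial \mathcal{L}}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial \alpha_c}, \qquad \frac{\partial \mathcal{L}}{\partial \beta_c} = \sum_i \frac{\partial \mathcal{L}}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial \beta_c},$

where the sum runs over all positions $i$ that share $\alpha_c$ and $\beta_c$.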
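
A minimal sketch of the backward pass, assuming the derivatives follow directly from differentiating the definition of $f$: on the negative branch, $\partial f / \partial y_i = \alpha_c \beta_c e^{\beta_c y_i}$, $\partial f / \partial \alpha_c = e^{\beta_c y_i} - 1$, and $\partial f / \partial \beta_c = \alpha_c y_i e^{\beta_c y_i}$; all three are $1$, $0$, $0$ on the positive branch. The helper names and the flat-list layout are hypothetical.

```python
import math

def mpelu(y, alpha, beta):
    return y if y > 0 else alpha * math.expm1(beta * y)

def mpelu_grads(y, alpha, beta):
    """Return (df/dy, df/dalpha, df/dbeta) at a scalar input y."""
    if y > 0:
        return 1.0, 0.0, 0.0
    e = math.exp(beta * y)            # alpha * e equals f(y) + alpha
    return alpha * beta * e, e - 1.0, alpha * y * e

def param_grads(ys, upstream, alpha, beta):
    """Sum dL/dalpha and dL/dbeta over all positions that share (alpha, beta).

    `upstream` holds dL/df(y_i) for each position, as supplied by the next layer.
    """
    d_alpha = d_beta = 0.0
    for y, g in zip(ys, upstream):
        _, da, db = mpelu_grads(y, alpha, beta)
        d_alpha += g * da
        d_beta += g * db
    return d_alpha, d_beta
```

Comparing `mpelu_grads` against central finite differences of `mpelu` is a quick way to validate the formulas before wiring them into a framework.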

3. Weight Initialization for Exponential Units

MPELU requires variance-preserving initialization tailored to its nonlinear dynamics. For a convolutional layer $\mathbf{y}_l = \mathbf{W}_l \mathbf{x}_l + \mathbf{b}_l$ with a $k_l \times k_l$ kernel and $c_l$ input channels, where $\mathbf{x}_l = f(\mathbf{y}_{l-1})$ and the weights are zero-mean: $\mathrm{Var}[y_l] = k_l^2 c_l \, \mathrm{Var}[w_l] \, E[x_l^2].$ Under the approximation $e^{\beta y} - 1 \approx \beta y$ for $y \leq 0$ and assuming balanced branching ( $\tfrac{1}{2}$ positive, $\tfrac{1}{2}$ negative), the second moment is

$E[x_l^2] \approx \tfrac{1}{2}\bigl(1 + \alpha^2 \beta^2\bigr) \, \mathrm{Var}[y_{l-1}],$

where $\alpha$ and $\beta$ are the parameters of the MPELU feeding layer $l$.

Enforcing $\tfrac{1}{2} k_l^2 c_l \bigl(1 + \alpha^2 \beta^2\bigr) \mathrm{Var}[w_l] = 1$ for stable propagation yields the initializer: $\mathrm{Var}[w_l] = \frac{2}{k_l^2 c_l \bigl(1 + \alpha^2 \beta^2\bigr)}.$ This generalizes the He/MSRA scheme (for $\alpha = 0$), ELU initialization ( $\alpha = \beta = 1$ ), and PReLU initialization (negative slope $\alpha \beta$ ), extending variance control to a broader class of nonlinearities (Li et al., 2016).
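
The initializer can be sketched as below, assuming it takes the form $\mathrm{Var}[w] = 2 / \bigl(k^2 c \, (1 + \alpha^2 \beta^2)\bigr)$ with zero-mean Gaussian weights, where $k$ is the kernel size, $c$ the number of input channels, and $\alpha$, $\beta$ the initial parameters of the preceding MPELU. The function names and the flat-list filter layout are hypothetical.

```python
import math
import random

def mpelu_init_std(kernel_size, in_channels, alpha, beta):
    """Std of zero-mean Gaussian weights: sqrt(2 / (k^2 * c * (1 + alpha^2 * beta^2)))."""
    fan_in = kernel_size * kernel_size * in_channels
    return math.sqrt(2.0 / (fan_in * (1.0 + (alpha * beta) ** 2)))

def sample_filter(kernel_size, in_channels, alpha, beta, seed=0):
    """Draw one k x k x c filter as a flat list (hypothetical helper)."""
    rng = random.Random(seed)
    std = mpelu_init_std(kernel_size, in_channels, alpha, beta)
    return [rng.gauss(0.0, std) for _ in range(kernel_size * kernel_size * in_channels)]

# alpha = 0 reduces to the He/MSRA value sqrt(2 / fan_in).
he_std = mpelu_init_std(3, 16, alpha=0.0, beta=1.0)
# alpha = beta = 1 halves the variance relative to He/MSRA.
elu_std = mpelu_init_std(3, 16, alpha=1.0, beta=1.0)
```

Larger values of $\alpha \beta$ let more signal through the negative branch, so the weight variance is reduced to compensate.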

4. Implementation in Deep Residual Architectures

MPELU was evaluated in both standard and bottleneck residual network (ResNet) architectures on CIFAR-10/100. The main configurations include:

A. Non-bottleneck MPELU ResNet:

Follows the canonical ResNet stacking: initial $3 \times 3$ conv, three stages ( $n$ blocks each) with channel sizes $\{16, 32, 64\}$.
Each block: BatchNorm → MPELU → Conv, repeated; projection via stride 2 at stage transitions.
Used for depths 20, 32, 44, 56, 110.

B. Bottleneck “nopre” MPELU ResNet:

Employs $1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1$ bottleneck; omits activation after residual addition (“nopre-activation”).
BatchNorm → MPELU applied only after the very first convolution and after the last addition.
Applied to very deep variants (e.g., 164- and 1001-layer ResNets).

This design reduces the overhead of internal activation layers while leveraging the adaptability of MPELU at the main input and output points.

Standard training hyperparameters: batch size 128, SGD with weight decay and momentum (values as reported in Li et al., 2016), learning rate decay, and standard CIFAR-10/100 data augmentation.
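
The depths quoted for both configurations are consistent with the usual CIFAR ResNet layer count. The sketch below assumes that convention (an assumption, not stated in the text above): three stages of $n$ blocks, two conv layers per non-bottleneck block or three per bottleneck block, plus the first conv and the final classifier, giving depth $6n + 2$ or $9n + 2$. The function name is hypothetical.

```python
def blocks_per_stage(depth, bottleneck=False):
    """Return n, the number of residual blocks in each of the three stages.

    Assumes depth = 3 * n * layers_per_block + 2 (first conv + final classifier).
    """
    layers_per_block = 3 if bottleneck else 2
    n, remainder = divmod(depth - 2, 3 * layers_per_block)
    if n < 1 or remainder != 0:
        raise ValueError(f"depth {depth} does not fit the assumed layer count")
    return n

# Depths mentioned for the non-bottleneck and bottleneck variants.
plain = {d: blocks_per_stage(d) for d in (20, 32, 44, 56, 110)}
deep = {d: blocks_per_stage(d, bottleneck=True) for d in (164, 1001)}
```

Under this convention the 110-layer network has 18 blocks per stage, and the 1001-layer bottleneck network has 111.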

5. Empirical Results Across Benchmarks

Ablation and benchmarking experiments demonstrate the advantages of MPELU in training efficiency and final accuracy. Selected results, all from Li et al. (2016):

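Network	Dataset	Activation / baseline	Test Error (%)
NIN (9-layer)	CIFAR-10	ReLU	10.41/8.81
NIN (9-layer)	CIFAR-10	PReLU	9.19/7.49
NIN (9-layer)	CIFAR-10	ELU	9.63/7.83
NIN (9-layer)	CIFAR-10	MPELU	9.19/7.52
ResNet-110 (non-bottleneck)	CIFAR-10	MPELU	5.47
ResNet-110 (non-bottleneck)	CIFAR-10	ResNet-110	6.61
ResNet-164 (bottleneck)	CIFAR-10	MPELU	4.43
ResNet-164 (bottleneck)	CIFAR-10	Pre-ResNet	5.46
ResNet-1001 (bottleneck)	CIFAR-10	MPELU	3.57
ResNet-1001 (bottleneck)	CIFAR-100	MPELU	18.81
ResNet-1001 (bottleneck)	CIFAR-100	Pre-ResNet	24.33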

Key observed trends:

Faster convergence (e.g., in 9-layer NIN, MPELU converges to 15% error in 9k iterations vs. 25k for ReLU)
Consistent improvement over ELU/PReLU in both validation accuracy and convergence across both shallow and very deep models
Stable training of very deep nets (>1000 layers) using MPELU-specific initialization, where generic (e.g., Gaussian) initialization fails

6. Mechanisms Underpinning MPELU’s Effectiveness

The parametric structure of MPELU bridges the functional continuum between linear (ReLU/PReLU) and highly nonlinear (ELU) negative regimes. This enlarges the learnable function class, which improves model adaptation. Specifically:

Adaptation Flexibility: Separate learning of negative saturation ( $\alpha_c$ ) and curvature ( $\beta_c$ ) allows per-channel adaptation for data-specific negative activations, fostering faster convergence and higher accuracy.
BatchNorm Compatibility: MPELU can be algebraically decomposed as a PReLU followed by a generalized ELU, which is particularly synergistic with BatchNorm; this structure circumvents degradation observed in BatchNorm+ELU combinations.
Identity Approximation in Residual Learning: The ability to tune $\alpha_c$ and $\beta_c$ means MPELU can closely approximate the identity in the negative regime (small $\beta_c$ with $\alpha_c \beta_c \approx 1$ ), an important property in facilitating the learning of residual mappings near zero without vanishing gradients.
Stable Signal Propagation: The weight initialization formula preserves variance even under highly non-convex nonlinearities, ensuring robust forward and backward signal transmission in very deep architectures previously inaccessible to ELU-style units with standard initializers.
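
Two of the properties above can be illustrated numerically. The sketch assumes one concrete reading of the decomposition, in which the PReLU stage uses slope $\beta_c$ and the ELU stage uses scale $\alpha_c$; this composition reproduces MPELU for any $\beta_c > 0$, though the paper's exact formulation may differ. It also checks the identity approximation for small $\beta_c$ with $\alpha_c = 1 / \beta_c$. All names are illustrative.

```python
import math

def mpelu(y, alpha, beta):
    return y if y > 0 else alpha * math.expm1(beta * y)

def prelu(y, slope):
    return y if y > 0 else slope * y

def elu(x, alpha):
    return x if x > 0 else alpha * math.expm1(x)

def mpelu_composed(y, alpha, beta):
    """ELU with scale alpha applied to PReLU with slope beta (assumed decomposition)."""
    return elu(prelu(y, beta), alpha)

samples = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]

# The composition matches the direct definition on every sample.
decomposition_gap = max(
    abs(mpelu(y, 0.7, 1.5) - mpelu_composed(y, 0.7, 1.5)) for y in samples
)

# Small beta with alpha = 1 / beta makes the negative branch close to y itself.
beta = 1e-3
identity_gap = max(abs(mpelu(y, 1.0 / beta, beta) - y) for y in samples)
```

The composition works because a positive slope keeps negative inputs negative, so the ELU stage always sees the scaled value $\beta_c y_i$ on its exponential branch.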

In sum, MPELU is a unifying activation function parameterizing and extending several established nonlinear units, compatible with modern normalization and initialization strategies, that achieves improved accuracy and convergence in deep convolutional and residual architectures (Li et al., 2016).

References (1)

Improving Deep Neural Network with Multiple Parametric Exponential Linear Units (2016)
