Multiple Parametric ELU (MPELU)
- Multiple Parametric ELU (MPELU) is a family of activation functions that extends ReLU, PReLU, and ELU by introducing learnable parameters for flexible adaptation of negative activations.
- It adapts the negative regime via per-channel parameters α and β, improving convergence rate, classification accuracy, and compatibility with weight initialization and normalization techniques.
- Empirical results on CIFAR datasets validate MPELU's effectiveness in achieving faster convergence and higher accuracy in both shallow and very deep residual architectures.
Multiple Parametric Exponential Linear Units (MPELU) are a generalized family of activation functions for deep neural networks that unify and extend the behaviors of the rectified linear unit (ReLU), parametric ReLU (PReLU), and exponential linear unit (ELU). MPELU introduces two learnable parameters, $\alpha$ and $\beta$, either per channel or shared across channels, allowing flexible adaptation of both the negative saturation and the curvature. This improves classification accuracy and convergence rate, and it facilitates robust training of very deep architectures when paired with a matching weight initialization scheme. MPELU was introduced to address limitations of existing activations in both expressive range and compatibility with initialization and normalization techniques, particularly in deep residual networks (Li et al., 2016).
1. Mathematical Formulation and Special Cases
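The MPELU activation for input $y$ is defined as

$$
f(y) =
\begin{cases}
y, & y > 0 \\
\alpha \left(e^{\beta y} - 1\right), & y \le 0
\end{cases}
$$

where $\alpha$ and $\beta$ are learnable parameters per channel or layer.
These parameters control:
- $\alpha$: scaling for the negative (saturating) regime
- $\beta$: curvature of the exponential branch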
The formulation recovers prominent special cases as follows:
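- ReLU: $\alpha = 0$
- PReLU: For small $\beta$, $e^{\beta y} - 1 \approx \beta y$, so $f(y) = y$ for $y > 0$, $f(y) \approx \alpha\beta\, y$ for $y \le 0$
- ELU: $\beta = 1$ with $\alpha$ held fixed (e.g., $\alpha = 1$)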
This parametric generalization covers a strictly larger class of nonlinearities, interpolating continuously between linear and nonlinear (exponential) negative regimes (Li et al., 2016).
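The special cases can be checked numerically. The following is a minimal scalar sketch in plain Python, not the authors' implementation; the helper names `relu`, `prelu`, and `elu` are illustrative, and the MPELU form $f(y) = y$ for $y > 0$, $f(y) = \alpha(e^{\beta y} - 1)$ for $y \le 0$ is assumed.

```python
import math


def mpelu(y: float, alpha: float, beta: float) -> float:
    """MPELU: identity for y > 0, alpha * (exp(beta * y) - 1) otherwise."""
    if y > 0:
        return y
    # expm1 computes exp(x) - 1 accurately for small |x|.
    return alpha * math.expm1(beta * y)


def relu(y: float) -> float:
    return max(y, 0.0)


def prelu(y: float, slope: float) -> float:
    return y if y > 0 else slope * y


def elu(y: float, alpha: float = 1.0) -> float:
    return y if y > 0 else alpha * math.expm1(y)


# alpha = 0 removes the negative branch entirely (ReLU).
relu_like = mpelu(-2.0, alpha=0.0, beta=1.0)

# beta = 1 with a fixed alpha reproduces ELU.
elu_like = mpelu(-1.5, alpha=1.0, beta=1.0)

# Small beta makes the negative branch nearly linear with slope alpha * beta.
prelu_like = mpelu(-1.0, alpha=250.0, beta=1e-3)  # slope ~ 0.25
```

With a small $\beta$ the match to PReLU is approximate rather than exact, since it relies on the first-order expansion of the exponential.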
2. Forward and Backward Computation
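The partial derivatives needed for backpropagation are as follows. For $y_i \le 0$, write $f(y_i) = \alpha\left(e^{\beta y_i} - 1\right)$:

$$
\frac{\partial f(y_i)}{\partial \alpha} =
\begin{cases}
0, & y_i > 0 \\
e^{\beta y_i} - 1, & y_i \le 0
\end{cases}
$$

$$
\frac{\partial f(y_i)}{\partial \beta} =
\begin{cases}
0, & y_i > 0 \\
\alpha\, y_i\, e^{\beta y_i} = y_i \left(f(y_i) + \alpha\right), & y_i \le 0
\end{cases}
$$

$$
\frac{\partial f(y_i)}{\partial y_i} =
\begin{cases}
1, & y_i > 0 \\
\alpha \beta\, e^{\beta y_i} = \beta \left(f(y_i) + \alpha\right), & y_i \le 0
\end{cases}
$$

For a scalar loss $L$, gradients for the shared parameters are obtained by summing over all positions $i$ that share $\alpha$ and $\beta$:

$$
\frac{\partial L}{\partial \alpha} = \sum_i \frac{\partial L}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial \alpha},
\qquad
\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial \beta}
$$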
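As a sanity check on the backward pass, the sketch below compares analytic derivatives with central finite differences. It assumes the MPELU form $f(y) = \alpha(e^{\beta y} - 1)$ for $y \le 0$ and $f(y) = y$ otherwise; the function names (`mpelu_grads`, `shared_param_grads`, `finite_diff`) are illustrative and do not come from the paper.

```python
import math


def mpelu(y, alpha, beta):
    return y if y > 0 else alpha * math.expm1(beta * y)


def mpelu_grads(y, alpha, beta):
    """Return (df/dy, df/dalpha, df/dbeta) for a single scalar input."""
    if y > 0:
        return 1.0, 0.0, 0.0
    e = math.exp(beta * y)
    return alpha * beta * e, e - 1.0, alpha * y * e


def shared_param_grads(ys, upstream, alpha, beta):
    """Accumulate dL/dalpha and dL/dbeta over all positions sharing the parameters.

    `upstream` holds dL/df(y_i) for each input y_i.
    """
    g_alpha = 0.0
    g_beta = 0.0
    for y, g in zip(ys, upstream):
        _, d_alpha, d_beta = mpelu_grads(y, alpha, beta)
        g_alpha += g * d_alpha
        g_beta += g * d_beta
    return g_alpha, g_beta


def finite_diff(fn, x, eps=1e-6):
    """Central finite-difference estimate of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


y0, a0, b0 = -0.7, 1.3, 0.8
dy, da, db = mpelu_grads(y0, a0, b0)
num_dy = finite_diff(lambda t: mpelu(t, a0, b0), y0)
num_da = finite_diff(lambda t: mpelu(y0, t, b0), a0)
num_db = finite_diff(lambda t: mpelu(y0, a0, t), b0)
```

Positive inputs contribute nothing to the parameter gradients, so only the non-positive part of a feature map drives the updates of $\alpha$ and $\beta$.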
3. Weight Initialization for Exponential Units
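MPELU requires variance-preserving initialization tailored to its nonlinear dynamics. For a convolutional layer $l$ with a $k_l \times k_l$ kernel and $c_l$ input channels, with zero-mean weights $w_l$ and inputs $x_l = f(y_{l-1})$:

$$
\mathrm{Var}[y_l] = k_l^2\, c_l\, \mathrm{Var}[w_l]\, E[x_l^2]
$$

Under the first-order approximation $e^{\beta y} - 1 \approx \beta y$ for $y \le 0$ and assuming balanced branching (half positive, half negative), the second moment is

$$
E[x_l^2] \approx \tfrac{1}{2}\left(1 + \alpha^2 \beta^2\right) \mathrm{Var}[y_{l-1}]
$$

Enforcing $\mathrm{Var}[y_l] = \mathrm{Var}[y_{l-1}]$ for stable propagation yields the initializer:

$$
\tfrac{1}{2}\, k_l^2\, c_l \left(1 + \alpha^2 \beta^2\right) \mathrm{Var}[w_l] = 1
\quad\Longleftrightarrow\quad
\mathrm{Var}[w_l] = \frac{2}{k_l^2\, c_l \left(1 + \alpha^2 \beta^2\right)}
$$

This generalizes the He/MSRA scheme (for $\alpha = 0$), ELU initialization ($\beta = 1$ with fixed $\alpha$), and PReLU initialization (negative slope equal to $\alpha\beta$), extending variance control to a broader class of nonlinearities (Li et al., 2016).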
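To make the initializer concrete, the sketch below computes the weight standard deviation under the assumption that the variance condition takes the form $\mathrm{Var}[w] = 2 / \left(k^2 c\,(1 + \alpha^2\beta^2)\right)$, with $k$ the kernel size and $c$ the number of input channels. The function name `mpelu_init_std` is hypothetical.

```python
import math


def mpelu_init_std(kernel_size: int, in_channels: int,
                   alpha: float, beta: float) -> float:
    """Std of a zero-mean Gaussian initializer under the assumed MPELU rule."""
    fan_in = kernel_size ** 2 * in_channels
    return math.sqrt(2.0 / (fan_in * (1.0 + alpha ** 2 * beta ** 2)))


# alpha = 0 falls back to the He/MSRA value sqrt(2 / fan_in).
he_std = mpelu_init_std(3, 16, alpha=0.0, beta=1.0)

# alpha = beta = 1 halves the variance relative to He/MSRA.
elu_like_std = mpelu_init_std(3, 16, alpha=1.0, beta=1.0)
```

A larger product $\alpha\beta$ passes more of the negative signal through the activation, so the weights must be drawn with a smaller variance to keep layer outputs at a constant scale.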
4. Implementation in Deep Residual Architectures
MPELU was evaluated in both standard and bottleneck residual network (ResNet) architectures on CIFAR-10/100. The main configurations include:
A. Non-bottleneck MPELU ResNet:
- Follows the canonical ResNet stacking: initial $3 \times 3$ conv, three stages ($n$ blocks each) with channel sizes 16, 32, 64.
- Each block: BatchNorm → MPELU → Conv, repeated; projection via stride 2 at stage transitions.
- Used for depths 20, 32, 44, 56, 110.
B. Bottleneck “nopre” MPELU ResNet:
- Employs $1 \times 1$, $3 \times 3$, $1 \times 1$ bottleneck blocks; omits activation after residual addition (“nopre-activation”).
- BatchNorm → MPELU applied only after the very first convolution and after the last addition.
- Applied to very deep variants (e.g., 164- and 1001-layer ResNets).
This design reduces the overhead of internal activation layers while leveraging the adaptability of MPELU at the main input and output points.
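The depths listed above are consistent with the usual CIFAR ResNet convention, in which a network has three stages of $n$ blocks plus an initial convolution and a final classifier. The sketch below assumes that convention (depth $= 6n + 2$ for two-layer blocks, $9n + 2$ for three-layer bottleneck blocks); it is not taken from the MPELU paper, and `blocks_per_stage` is a hypothetical helper.

```python
def blocks_per_stage(depth: int, bottleneck: bool) -> int:
    """Blocks per stage for a three-stage CIFAR-style ResNet.

    Assumes depth = 3 * layers_per_block * n + 2, where the extra two layers
    are the initial convolution and the final classifier.
    """
    layers_per_block = 3 if bottleneck else 2
    n, remainder = divmod(depth - 2, 3 * layers_per_block)
    if n < 1 or remainder != 0:
        raise ValueError(f"depth {depth} does not fit the assumed layout")
    return n


non_bottleneck = {d: blocks_per_stage(d, bottleneck=False)
                  for d in (20, 32, 44, 56, 110)}
bottleneck = {d: blocks_per_stage(d, bottleneck=True) for d in (164, 1001)}
```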
Standard training hyperparameters: batch size 128, weight decay and momentum as specified in (Li et al., 2016), learning rate decay, and standard CIFAR-10/100 data augmentation.
5. Empirical Results Across Benchmarks
Ablation and benchmarking experiments show the advantages of MPELU in training efficiency and final accuracy. Selected results (all from Li et al., 2016):
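| Network | Dataset | Activation / Baseline | Test Error (%) |
|---|---|---|---|
| NIN (9-layer) | CIFAR-10 | ReLU | 10.41/8.81 |
| NIN (9-layer) | CIFAR-10 | PReLU | 9.19/7.49 |
| NIN (9-layer) | CIFAR-10 | ELU | 9.63/7.83 |
| NIN (9-layer) | CIFAR-10 | MPELU | 9.19/7.52 |
| ResNet-110 (non-bottl.) | CIFAR-10 | MPELU | 5.47 |
| ResNet-110 (non-bottl.) | CIFAR-10 | ResNet-110 (baseline) | 6.61 |
| ResNet-164 (bottleneck) | CIFAR-10 | MPELU | 4.43 |
| ResNet-164 (bottleneck) | CIFAR-10 | Pre-ResNet (baseline) | 5.46 |
| ResNet-1001 (bottleneck) | CIFAR-10 | MPELU | 3.57 |
| ResNet-1001 (bottleneck) | CIFAR-100 | MPELU | 18.81 |
| ResNet-1001 (bottleneck) | CIFAR-100 | Pre-ResNet (baseline) | 24.33 |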
Key observed trends:
- Faster convergence (e.g., in 9-layer NIN, MPELU converges to 15% error in 9k iterations vs. 25k for ReLU)
- Consistent improvement over ELU/PReLU in both validation accuracy and convergence across both shallow and very deep models
- Stable training of very deep nets (>1000 layers) using MPELU-specific initialization, where generic (e.g., Gaussian) initialization fails
6. Mechanisms Underpinning MPELU’s Effectiveness
The parametric structure of MPELU bridges the functional continuum between linear (ReLU/PReLU) and highly nonlinear (ELU) negative regimes. This enlarges the learnable function class, which improves model adaptation. Specifically:
- Adaptation Flexibility: Separate learning of negative saturation ($\alpha$) and curvature ($\beta$) allows per-channel adaptation for data-specific negative activations, fostering faster convergence and higher accuracy.
- BatchNorm Compatibility: MPELU can be algebraically decomposed as a PReLU followed by a generalized ELU, which is particularly synergistic with BatchNorm; this structure circumvents degradation observed in BatchNorm+ELU combinations.
- Identity Approximation in Residual Learning: The ability to tune $\alpha$ and $\beta$ means MPELU can closely approximate the identity in the negative regime, an important property in facilitating the learning of residual mappings near zero without vanishing gradients.
- Stable Signal Propagation: The weight initialization formula preserves variance even under highly non-convex nonlinearities, ensuring robust forward and backward signal transmission in very deep architectures previously inaccessible to ELU-style units with standard initializers.
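Two of the points above, the PReLU-then-ELU decomposition and the near-identity behavior, can be illustrated with a short scalar sketch. It assumes $\beta > 0$ (so the PReLU stage does not flip the sign of negative inputs) and the MPELU form $f(y) = \alpha(e^{\beta y} - 1)$ for $y \le 0$; the helper names are illustrative only.

```python
import math


def prelu(y, slope):
    return y if y > 0 else slope * y


def elu(y, alpha):
    return y if y > 0 else alpha * math.expm1(y)


def mpelu(y, alpha, beta):
    return y if y > 0 else alpha * math.expm1(beta * y)


def mpelu_composed(y, alpha, beta):
    """MPELU written as ELU(alpha) applied to PReLU(slope=beta); needs beta > 0."""
    return elu(prelu(y, beta), alpha)


def negative_slope_at_zero(alpha, beta):
    """Derivative of the negative branch as y -> 0 from below."""
    return alpha * beta


samples = [-3.0, -0.5, 0.0, 0.2, 4.0]
direct = [mpelu(y, 1.7, 0.6) for y in samples]
composed = [mpelu_composed(y, 1.7, 0.6) for y in samples]

# With alpha * beta = 1 the function is close to the identity near zero.
near_identity_gap = abs(mpelu(-0.01, 1.0, 1.0) - (-0.01))
```

When $\alpha\beta$ is close to 1, small negative inputs pass through almost unchanged, which is the sense in which MPELU can approximate the identity near zero.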
In sum, MPELU is a unifying activation function parameterizing and extending several established nonlinear units, compatible with modern normalization and initialization strategies, that achieves improved accuracy and convergence in deep convolutional and residual architectures (Li et al., 2016).