Continuously Differentiable ELU (CELU)
- CELU is a continuously differentiable activation function that ensures smooth gradients and enhances network optimization.
- It provides bounded derivatives, interpolation between ReLU and identity, and scale-similarity for effective hyperparameter tuning.
- Its theoretical guarantees, including efficient ReLU network approximation without added overhead, offer robust performance in deep architectures.
The Continuously Differentiable Exponential Linear Unit (CELU) is a parametric activation function for neural networks, introduced to improve upon the standard Exponential Linear Unit (ELU) by enforcing continuous differentiability (class $C^1$) with respect to its input for all positive values of its shape parameter $\alpha$. This property eliminates the discontinuity in the derivative at the origin that ELU exhibits for $\alpha \neq 1$, facilitating more robust optimization and parameter tuning. CELU retains several advantageous features such as a bounded derivative, the ability to interpolate between $\mathrm{ReLU}$ and the identity function, and scale-similarity with respect to $\alpha$, offering improved stability and interpretability in deep network architectures (Barron, 2017, Zhang et al., 2023).
1. Mathematical Definition and Differentiability
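CELU is defined for $x \in \mathbb{R}$ and $\alpha > 0$ as follows:
$$\mathrm{CELU}(x, \alpha) = \begin{cases} x, & x \ge 0 \\ \alpha\left(\exp\left(\frac{x}{\alpha}\right) - 1\right), & x < 0 \end{cases}$$
When $\alpha = 1$, CELU coincides with the original ELU. The first derivative with respect to $x$ (denoted $\frac{d}{dx}\mathrm{CELU}(x, \alpha)$) is
$$\frac{d}{dx}\mathrm{CELU}(x, \alpha) = \begin{cases} 1, & x \ge 0 \\ \exp\left(\frac{x}{\alpha}\right), & x < 0 \end{cases}$$
Both left and right derivatives at $x = 0$ equal $1$, confirming $C^1$ continuity on $\mathbb{R}$. Each branch is smooth, and CELU is continuously differentiable for all $\alpha > 0$ (Barron, 2017, Zhang et al., 2023).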
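
The piecewise definition and its derivative can be sketched in a few lines of plain Python. This is a minimal scalar illustration rather than a reference implementation; the function names `celu` and `celu_grad` are chosen here for convenience and do not come from the cited papers.

```python
import math


def celu(x: float, alpha: float = 1.0) -> float:
    """CELU(x, alpha): x for x >= 0, alpha * (exp(x / alpha) - 1) otherwise."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # expm1(t) computes exp(t) - 1 accurately for small |t|.
    return x if x >= 0 else alpha * math.expm1(x / alpha)


def celu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of CELU with respect to x."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 if x >= 0 else math.exp(x / alpha)


# One-sided difference quotients at the origin should both be close to 1
# for every alpha, which is the C^1 property described above.
h = 1e-6
for alpha in (0.25, 1.0, 4.0):
    left = (celu(0.0, alpha) - celu(-h, alpha)) / h
    right = (celu(h, alpha) - celu(0.0, alpha)) / h
    print(f"alpha={alpha}: left slope={left:.6f}, right slope={right:.6f}")
```

Using `math.expm1` instead of `math.exp(...) - 1` is a numerical choice made for this sketch; it avoids cancellation when `x / alpha` is close to zero.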
2. Theoretical Properties
CELU exhibits several theoretically significant behaviors:
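- Bounded Derivative: For $x < 0$, $0 < \exp(x/\alpha) < 1$; for $x \ge 0$, the derivative is $1$. Thus, $0 < \frac{d}{dx}\mathrm{CELU}(x, \alpha) \le 1$ for all $x$, precluding exploding gradients on the negative domain.
- Interpolation of Nonlinearity: As $\alpha \to 0^+$, CELU converges pointwise to $\mathrm{ReLU}$, i.e., $\max(0, x)$. As $\alpha \to \infty$, CELU approaches the identity function for all $x$.
- Scale-Similarity: For any $c > 0$, $\mathrm{CELU}(x, \alpha) = \frac{1}{c}\,\mathrm{CELU}(cx, c\alpha)$, an exact property simplifying parameter scaling and weight initialization.
- Special Cases: CELU contains both $\mathrm{ReLU}$ and the linear transfer function as limiting cases.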
These features make CELU flexible for controlling the degree of nonlinearity and ensuring stable gradients.
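
The properties listed above can be checked numerically. The following is a small sketch under stated assumptions: the sample points, the value `c = 5.0`, and the specific small and large values of `alpha` are arbitrary choices for illustration.

```python
import math


def celu(x, alpha):
    return x if x >= 0 else alpha * math.expm1(x / alpha)


def relu(x):
    return max(0.0, x)


xs = [-3.0, -1.0, -0.1, 0.0, 0.5, 2.0]

# Scale-similarity: CELU(x, alpha) equals CELU(c * x, c * alpha) / c.
alpha, c = 0.7, 5.0
scale_gap = max(abs(celu(x, alpha) - celu(c * x, c * alpha) / c) for x in xs)

# Small alpha: the negative branch lies in (-alpha, 0), so CELU stays
# within alpha of ReLU everywhere.
relu_gap = max(abs(celu(x, 1e-3) - relu(x)) for x in xs)

# Large alpha: alpha * expm1(x / alpha) is roughly x + x**2 / (2 * alpha),
# so CELU approaches the identity on any fixed range of x.
identity_gap = max(abs(celu(x, 1e6) - x) for x in xs)

print(scale_gap, relu_gap, identity_gap)
```

The first gap is zero up to floating-point rounding, while the other two shrink as `alpha` is pushed further toward its respective limit.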
3. Comparison to ELU and Related Activations
The original ELU is given by:
$$\mathrm{ELU}(x, \alpha) = \begin{cases} x, & x \ge 0 \\ \alpha\left(\exp(x) - 1\right), & x < 0 \end{cases}$$
Its derivative is $1$ for $x \ge 0$, whereas for $x < 0$ it is $\alpha \exp(x)$, which tends to $\alpha$ as $x \to 0^-$, yielding a discontinuity at the origin for $\alpha \neq 1$. CELU removes this discontinuity by adopting $\exp(x/\alpha)$ in the negative branch, ensuring the derivative at the origin matches in both directions.
Unlike standard ELU, CELU's negative-side gradient is always in $(0, 1)$ regardless of $\alpha$, while ELU's negative slope near the origin grows without bound as $\alpha$ increases. The continuous first derivative of CELU improves analytical tractability for optimizers sensitive to the smoothness of the gradient (Barron, 2017, Zhang et al., 2023).
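
To make the difference concrete, the sketch below evaluates both derivatives just to the left and right of the origin. The helper `derivative_jump` and the offset `eps` are illustrative assumptions, not part of either paper.

```python
import math


def elu_grad(x, alpha):
    # ELU's negative branch is alpha * (exp(x) - 1), so its slope is alpha * exp(x).
    return 1.0 if x >= 0 else alpha * math.exp(x)


def celu_grad(x, alpha):
    # CELU's negative branch is alpha * (exp(x / alpha) - 1), slope exp(x / alpha).
    return 1.0 if x >= 0 else math.exp(x / alpha)


def derivative_jump(grad, alpha, eps=1e-9):
    """Absolute difference between the slopes just right and just left of 0."""
    return abs(grad(eps, alpha) - grad(-eps, alpha))


for alpha in (0.5, 1.0, 3.0):
    print(
        f"alpha={alpha}: "
        f"ELU jump={derivative_jump(elu_grad, alpha):.3f}, "
        f"CELU jump={derivative_jump(celu_grad, alpha):.3f}"
    )
```

For ELU the jump is close to $|1 - \alpha|$, so it vanishes only at $\alpha = 1$; for CELU it is negligible for every $\alpha$ tried.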
4. Expressive Power and Approximation in Deep Networks
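CELU is included in the class of activation functions for which any $\mathrm{ReLU}$ network of width $N$ and depth $L$ can be approximated arbitrarily closely (on compact domains) by a CELU-activated network of the same width and depth. For every $\varepsilon > 0$ and compact domain $\Omega$, given a $\mathrm{ReLU}$ network $\phi_{\mathrm{ReLU}}$ (width $N$, depth $L$), there exists a $\mathrm{CELU}$ network $\phi_{\mathrm{CELU}}$ of identical architecture such that
$$\sup_{x \in \Omega} \left\| \phi_{\mathrm{CELU}}(x) - \phi_{\mathrm{ReLU}}(x) \right\| < \varepsilon$$
The constructive proof involves replacing each ReLU unit $\mathrm{ReLU}(x)$ with $\frac{1}{K}\,\mathrm{CELU}(Kx, \alpha)$, where the scale $K$ is chosen large enough ($K \to \infty$) to control the approximation error, which is at most $\alpha / K$ per unit. By scale-similarity this substitute equals $\mathrm{CELU}(x, \alpha / K)$, and the factors $K$ and $1/K$ can be absorbed into the adjacent affine layers, so the architecture is unchanged. This direct correspondence gives width-depth scaling factors of $(1, 1)$, i.e., no overhead, as opposed to other activation classes, which may require larger width or depth. The continuity of CELU's first derivative is instrumental in achieving this result (Zhang et al., 2023).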
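
The substitution argument can be illustrated on a toy network. This is only a sketch of the idea: the one-hidden-layer network, its weights `W1`, `B1`, `W2`, the evaluation grid, and the choice $\alpha = 1$ are hypothetical and are not taken from Zhang et al. (2023). The scale `K` is folded into the weights so that the CELU network has exactly the same shape as the ReLU one.

```python
import math


def relu(x):
    return max(0.0, x)


def celu(x, alpha=1.0):
    return x if x >= 0 else alpha * math.expm1(x / alpha)


# Hypothetical toy network: 1 input, 3 hidden units, 1 output.
W1 = [1.0, -2.0, 0.5]
B1 = [0.0, 1.0, -0.25]
W2 = [1.5, -1.0, 2.0]


def relu_net(x):
    return sum(w2 * relu(w1 * x + b1) for w1, b1, w2 in zip(W1, B1, W2))


def celu_net(x, K):
    # Same architecture: input weights and biases scaled by K,
    # output weights scaled by 1 / K, activation swapped for CELU (alpha = 1).
    return sum(
        (w2 / K) * celu(K * w1 * x + K * b1)
        for w1, b1, w2 in zip(W1, B1, W2)
    )


def sup_error(K, lo=-2.0, hi=2.0, n=401):
    """Largest absolute gap between the two networks on a uniform grid."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return max(abs(relu_net(x) - celu_net(x, K)) for x in grid)


for K in (10.0, 100.0, 1000.0):
    # Each hidden unit is off by at most 1 / K, so the output is off by
    # at most sum(|w2|) / K = 4.5 / K for these weights.
    print(K, sup_error(K), 4.5 / K)
```

Because each hidden unit deviates from ReLU by at most $1/K$ here, the output error is bounded by $\sum_i |w_{2,i}| / K$, which shrinks as $K$ grows without adding any units or layers.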
5. Practical Considerations and Hyperparameter Selection
The parameter $\alpha$ modulates the activation's shape:
- Small $\alpha$ ($\alpha \ll 1$): Activation closely resembles a standard ReLU, with a sharper hinge at the origin and negative outputs confined to $(-\alpha, 0)$.
- Moderate $\alpha$ ($\alpha \approx 1$): Default setting; maintains balanced nonlinearity, keeps mean activations closer to zero, and avoids large gradients.
- Large $\alpha$ ($\alpha \gg 1$): Approximates a linear function, minimizing nonlinearity and the attenuation of gradients for negative inputs.
In practice, $\alpha$ can be fixed or learned as a per-layer parameter. Because the negative-side gradient is bounded by $1$, large $\alpha$ values do not introduce gradient instability, in contrast with ELU, where a large $\alpha$ scales the negative-side gradient and may cause exploding gradients. Initializing $\alpha = 1$ is standard, and scale-similarity aids in harmonizing $\alpha$ values across layers (Barron, 2017).
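
If $\alpha$ is learned, a gradient with respect to $\alpha$ is needed. Differentiating the negative branch $\alpha(\exp(x/\alpha) - 1)$ gives $\exp(x/\alpha) - 1 - \frac{x}{\alpha}\exp(x/\alpha)$, and the linear branch contributes zero. The sketch below is derived directly from the definition rather than from the cited papers; the names `celu_grad_alpha` and `finite_difference` are illustrative, and the finite-difference check is included only as a sanity test.

```python
import math


def celu(x, alpha):
    return x if x >= 0 else alpha * math.expm1(x / alpha)


def celu_grad_alpha(x, alpha):
    """Partial derivative of CELU with respect to alpha."""
    if x >= 0:
        return 0.0  # the linear branch does not depend on alpha
    t = x / alpha
    return math.expm1(t) - t * math.exp(t)


def finite_difference(x, alpha, h=1e-6):
    """Central difference in alpha, used to cross-check the formula."""
    return (celu(x, alpha + h) - celu(x, alpha - h)) / (2 * h)


for x, alpha in [(-1.0, 1.0), (-0.5, 2.0), (-3.0, 0.5), (1.0, 1.0)]:
    print(x, alpha, celu_grad_alpha(x, alpha), finite_difference(x, alpha))
```

A training setup would also need to keep $\alpha$ positive, for example by optimizing an unconstrained variable and mapping it through a positive function; that choice is left open here.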
6. Empirical and Theoretical Implications
The introduction of CELU targets activation smoothness and gradient stability without empirical disadvantages relative to ELU. CELU inherits ELU's favorable properties (accelerated convergence, improved generalization on benchmarks such as CIFAR-100) and extends them with formal guarantees for smooth optimization landscapes and robust parameter tuning. No new large-scale empirical benchmarks were introduced at inception, but theoretical benefits in stability and ease of analysis are emphasized.
The $C^1$ continuity is advantageous for employing higher-order optimizers and facilitates convergence proofs. The scale-similarity and interpolation between $\mathrm{ReLU}$ and the identity map provide flexibility and interpretability in architectural design and hyperparameterization (Barron, 2017, Zhang et al., 2023).
7. Summary Table: CELU Key Properties
| Property | CELU | ELU |
|---|---|---|
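| $C^1$ continuity | Yes, for all $\alpha > 0$ | Only if $\alpha = 1$ |
| Bounded derivative | $0 < \frac{d}{dx}\mathrm{CELU}(x, \alpha) \le 1$ | No (unbounded for large $\alpha$) |
| Special cases | $\mathrm{ReLU}$ ($\alpha \to 0^+$), linear ($\alpha \to \infty$) | $\mathrm{ReLU}$ ($\alpha \to 0$) |
| Scale-similarity | Yes: $\mathrm{CELU}(x, \alpha) = \frac{1}{c}\,\mathrm{CELU}(cx, c\alpha)$ | No |
| Depth/width overhead for ReLU approximation | None (scaling $(1, 1)$) | Not established |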
CELU thus provides a flexible, theoretically favored, and robust activation for modern deep learning, with explicit guarantees in function approximation and gradient management (Barron, 2017, Zhang et al., 2023).