
Continuously Differentiable ELU (CELU)

Updated 1 April 2026
  • CELU is a continuously differentiable activation function that ensures smooth gradients and enhances network optimization.
  • It provides bounded derivatives, interpolation between ReLU and identity, and scale-similarity for effective hyperparameter tuning.
  • Its theoretical guarantees, including efficient ReLU network approximation without added overhead, offer robust performance in deep architectures.

The Continuously Differentiable Exponential Linear Unit (CELU) is a parametric activation function for neural networks, introduced to improve upon the standard Exponential Linear Unit (ELU) by enforcing continuous differentiability (class $C^1$) with respect to its input for all positive values of its shape parameter $\alpha$. This property eliminates the discontinuity in the derivative at the origin that ELU exhibits for $\alpha \neq 1$, facilitating more robust optimization and parameter tuning. CELU retains several advantageous features such as a bounded derivative, the ability to interpolate between $\mathrm{ReLU}$ and the identity function, and scale-similarity with respect to $\alpha$, offering improved stability and interpretability in deep network architectures (Barron, 2017, Zhang et al., 2023).

1. Mathematical Definition and Differentiability

CELU is defined for $x \in \mathbb{R}$ and $\alpha > 0$ as follows:

$$\mathrm{CELU}(x,\alpha) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha\left(\exp\left(\frac{x}{\alpha}\right) - 1\right), & \text{if } x < 0 \end{cases}$$
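The piecewise definition translates directly into array code. The following is a minimal NumPy sketch, not a reference implementation; the function name `celu` and the choice to clamp the exponent argument are assumptions made here for numerical safety.

```python
import numpy as np

def celu(x, alpha=1.0):
    """CELU(x, alpha): x for x >= 0, alpha * (exp(x / alpha) - 1) for x < 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    x = np.asarray(x, dtype=float)
    # Clamp the exponent argument at 0 so the unused branch cannot overflow
    # for large positive x; expm1 keeps precision when x / alpha is near 0.
    negative_branch = alpha * np.expm1(np.minimum(x, 0.0) / alpha)
    return np.where(x >= 0, x, negative_branch)
```

For very negative inputs the output saturates at $-\alpha$, which is why the negative branch is bounded below.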

When $\alpha = 1$, CELU coincides with the original ELU. The first derivative with respect to $x$ (denoted $\frac{d}{dx}\mathrm{CELU}(x,\alpha)$) is

$$\frac{d}{dx}\mathrm{CELU}(x,\alpha) = \begin{cases} 1, & \text{if } x \geq 0 \\ \exp\left(\frac{x}{\alpha}\right), & \text{if } x < 0 \end{cases}$$

Both left and right derivatives at $x = 0$ equal $1$, confirming $C^1$ continuity on $\mathbb{R}$. Each branch is smooth, and CELU is continuously differentiable for all $\alpha > 0$ (Barron, 2017, Zhang et al., 2023).
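The agreement of the one-sided derivatives at the origin can be checked numerically. The sketch below compares one-sided finite differences around $x = 0$ for several values of $\alpha$; the helper names `celu`, `celu_grad`, and `one_sided_slopes`, and the step size `h`, are illustrative assumptions.

```python
import numpy as np

def celu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0) / alpha))

def celu_grad(x, alpha=1.0):
    """Analytic derivative: 1 for x >= 0, exp(x / alpha) for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, np.exp(np.minimum(x, 0.0) / alpha))

def one_sided_slopes(alpha, h=1e-6):
    """Backward and forward finite-difference slopes at the origin."""
    left = (celu(0.0, alpha) - celu(-h, alpha)) / h
    right = (celu(h, alpha) - celu(0.0, alpha)) / h
    return float(left), float(right)

for alpha in (0.1, 1.0, 10.0):
    left, right = one_sided_slopes(alpha)
    print(f"alpha={alpha}: left slope={left:.6f}, right slope={right:.6f}")
```

Both slopes should be close to 1 for every positive `alpha`, up to finite-difference error.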

2. Theoretical Properties

CELU exhibits several theoretically significant behaviors:

  • Bounded Derivative: For $x \geq 0$, $\frac{d}{dx}\mathrm{CELU}(x,\alpha) = 1$; for $x < 0$, the derivative is $\exp(x/\alpha) \in (0, 1)$. Thus, $0 < \frac{d}{dx}\mathrm{CELU}(x,\alpha) \leq 1$ for all $x$, precluding exploding gradients on the negative domain.
  • Interpolation of Nonlinearity: As $\alpha \to 0^+$, CELU converges pointwise to $\mathrm{ReLU}$, i.e., $\lim_{\alpha \to 0^+} \mathrm{CELU}(x,\alpha) = \max(0, x)$. As $\alpha \to \infty$, CELU approaches the identity function for all $x \in \mathbb{R}$.
  • Scale-Similarity: For any $c > 0$, $\mathrm{CELU}(x,\alpha) = \frac{1}{c}\mathrm{CELU}(cx, c\alpha)$, an exact property simplifying parameter scaling and weight initialization.
  • Special Cases: CELU contains both $\mathrm{ReLU}$ and the linear transfer function as limiting cases.

These features make CELU flexible for controlling the degree of nonlinearity and ensuring stable gradients.
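These properties are easy to probe numerically. The sketch below evaluates, on a grid, the scale-similarity identity $\mathrm{CELU}(x,\alpha) = \frac{1}{c}\mathrm{CELU}(cx, c\alpha)$ and the two limiting regimes; the grid, the constant `c`, and the specific small and large values of `alpha` are arbitrary choices made for illustration.

```python
import numpy as np

def celu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0) / alpha))

x = np.linspace(-5.0, 5.0, 1001)

# Scale-similarity: the two sides should agree up to floating-point rounding.
c, alpha = 3.0, 0.7
scale_gap = np.max(np.abs(celu(x, alpha) - celu(c * x, c * alpha) / c))

# Small alpha: CELU is within alpha of ReLU, since the negative branch
# takes values in (-alpha, 0).
relu_gap = np.max(np.abs(celu(x, 1e-3) - np.maximum(x, 0.0)))

# Large alpha: CELU is close to the identity on a bounded interval.
identity_gap = np.max(np.abs(celu(x, 1e3) - x))

print(scale_gap, relu_gap, identity_gap)
```

The ReLU gap shrinks linearly with `alpha`, while the identity gap depends on the ratio of the input range to `alpha`, so convergence to the identity is pointwise rather than uniform over all of $\mathbb{R}$.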

3. Comparison with ELU

The original ELU is given by:

$$\mathrm{ELU}(x,\alpha) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha\left(\exp(x) - 1\right), & \text{if } x < 0 \end{cases}$$

Its left-hand derivative at $x = 0$ is $\alpha$ (taken from the branch $x < 0$), whereas for $x \geq 0$ it is $1$, yielding a discontinuity for $\alpha \neq 1$. CELU removes this discontinuity by adopting $\exp(x/\alpha)$ in the negative branch, ensuring the derivative at the origin matches in both directions.
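Written out, the one-sided limits of the derivatives at the origin make the difference explicit. This short derivation follows from differentiating the negative branches $\alpha(e^{x} - 1)$ and $\alpha(e^{x/\alpha} - 1)$ respectively:

```latex
\begin{aligned}
\lim_{x \to 0^-} \frac{d}{dx}\,\mathrm{ELU}(x,\alpha)  &= \lim_{x \to 0^-} \alpha e^{x} = \alpha, \\
\lim_{x \to 0^-} \frac{d}{dx}\,\mathrm{CELU}(x,\alpha) &= \lim_{x \to 0^-} e^{x/\alpha} = 1, \\
\lim_{x \to 0^+} \frac{d}{dx}\,\mathrm{ELU}(x,\alpha)  &= \lim_{x \to 0^+} \frac{d}{dx}\,\mathrm{CELU}(x,\alpha) = 1 .
\end{aligned}
```

The ELU derivative therefore jumps by $|\alpha - 1|$ at the origin, while the CELU derivative has no jump for any $\alpha > 0$.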

Unlike standard ELU, CELU's negative-side gradient is always in $(0, 1)$ regardless of $\alpha$, while ELU's negative slope can become arbitrarily large as $\alpha$ increases. The continuous first derivative of CELU improves analytical tractability for optimizers sensitive to higher-order smoothness (Barron, 2017, Zhang et al., 2023).
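The contrast in negative-side slopes can be illustrated by evaluating both analytic derivatives on a grid of negative inputs. This is a sketch under the standard ELU form $\alpha(e^{x} - 1)$ for $x < 0$; the function names and grid are illustrative assumptions.

```python
import numpy as np

def elu_grad(x, alpha):
    """Derivative of ELU: 1 for x >= 0, alpha * exp(x) for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

def celu_grad(x, alpha):
    """Derivative of CELU: 1 for x >= 0, exp(x / alpha) for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, np.exp(np.minimum(x, 0.0) / alpha))

x_neg = np.linspace(-10.0, -1e-6, 2001)  # strictly negative inputs
for alpha in (0.5, 1.0, 10.0, 100.0):
    print(alpha, elu_grad(x_neg, alpha).max(), celu_grad(x_neg, alpha).max())
```

The largest ELU slope on the negative side grows in proportion to `alpha`, whereas the CELU slope never exceeds 1.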

4. Expressive Power and Approximation in Deep Networks

CELU is included in the class of activation functions for which any $\mathrm{ReLU}$ network of width $N$ and depth $L$ can be approximated arbitrarily closely (on compact domains) by a CELU-activated network of the same width and depth. For every $\varepsilon > 0$ and compact domain $\Omega \subset \mathbb{R}^d$, given a $\mathrm{ReLU}$ network $\phi$ (width $N$, depth $L$), there exists a $\mathrm{CELU}$ network $\tilde{\phi}$ of identical architecture such that

$$\sup_{x \in \Omega} \left| \phi(x) - \tilde{\phi}(x) \right| < \varepsilon$$

The constructive proof involves replacing each ReLU unit $\mathrm{ReLU}(z)$ with $\frac{1}{K}\mathrm{CELU}(Kz, \alpha)$, where $K$ is chosen large enough ($K \to \infty$) to control the approximation error (at most $\alpha/K$ per unit). This direct correspondence enables width-depth scaling factors $(1, 1)$, i.e., no overhead, as opposed to other activation classes which may incur $(3, 2)$ overhead. The continuity of CELU's first derivative is instrumental in achieving this result (Zhang et al., 2023).
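A small experiment can make the same-architecture approximation concrete. The sketch below assumes the scaled substitution $\mathrm{ReLU}(z) \approx \frac{1}{K}\mathrm{CELU}(Kz, \alpha)$, which is one natural construction and not a quotation of the cited proof. The network sizes, random weights, and sample set are hypothetical, and the sampled maximum only estimates the supremum over the compact set.

```python
import numpy as np

def celu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0) / alpha))

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, weights, biases, activation):
    """Fully connected network; the activation is applied to hidden layers only."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(h @ W + b)
    return h @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
sizes = [2, 16, 16, 1]  # hypothetical architecture for illustration
weights = [rng.normal(size=(m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [0.1 * rng.normal(size=n) for n in sizes[1:]]

x = rng.uniform(-1.0, 1.0, size=(4096, 2))  # samples from the compact set [-1, 1]^2
target = mlp(x, weights, biases, relu)

def max_gap(K, alpha=1.0):
    """Largest sampled deviation between the ReLU net and its CELU counterpart."""
    scaled_celu = lambda z: celu(K * z, alpha) / K
    return float(np.max(np.abs(target - mlp(x, weights, biases, scaled_celu))))

gaps = [max_gap(K) for K in (1.0, 10.0, 100.0, 1000.0)]
print(gaps)
```

Both networks share the same weights, width, and depth; only the activation changes. Because each substituted unit deviates from ReLU by less than `alpha / K` and both activations are 1-Lipschitz, the overall gap is expected to shrink roughly in proportion to `1 / K`.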

5. Practical Considerations and Hyperparameter Selection

The parameter $\alpha$ modulates the activation's shape:

  • Small $\alpha$ ($\alpha \ll 1$): Activation closely resembles standard ReLU, with a sharper hinge at the origin.
  • Moderate $\alpha$ ($\alpha \approx 1$): Default setting; maintains balanced nonlinearity, zero-mean activations, and avoids large gradients.
  • Large $\alpha$ ($\alpha \gg 1$): Approximates a linear function, minimizing nonlinearity and the extent of gradient clipping.

In practice, $\alpha$ can be fixed or learned as a per-layer parameter. Due to bounded negative gradients, large $\alpha$ values do not introduce gradient instability, contrasting with ELU where large $\alpha$ may cause exploding gradients. Initializing $\alpha = 1$ is standard, and scale-similarity aids in harmonizing $\alpha$ values across layers (Barron, 2017).

6. Empirical and Theoretical Implications

The introduction of CELU targets activation smoothness and gradient stability without empirical disadvantages relative to ELU. CELU inherits ELU's favorable properties (accelerated convergence, improved generalization on benchmarks such as CIFAR-100) and extends them with formal guarantees for smooth optimization landscapes and robust parameter tuning. No new large-scale empirical benchmarks were introduced at inception, but theoretical benefits in stability and ease of analysis are emphasized.

The $C^1$ continuity is advantageous for employing higher-order optimizers and facilitates convergence proofs. The scale-similarity and interpolation between $\mathrm{ReLU}$ and the identity map provide flexibility and interpretability in architectural design and hyperparameterization (Barron, 2017, Zhang et al., 2023).

7. Summary Table: CELU Key Properties

| Property | CELU | ELU |
|---|---|---|
| $C^1$ continuity | Yes, for all $\alpha > 0$ | Only if $\alpha = 1$ |
| Bounded derivative | $0 < \frac{d}{dx}\mathrm{CELU}(x,\alpha) \leq 1$ | No (unbounded for large $\alpha$) |
| Special cases | $\mathrm{ReLU}$ ($\alpha \to 0^+$), linear ($\alpha \to \infty$) | $\mathrm{ReLU}$ ($\alpha \to 0^+$) |
| Scale-similarity | Yes: $\mathrm{CELU}(x,\alpha) = \frac{1}{c}\mathrm{CELU}(cx, c\alpha)$ | No |
| Depth/width overhead for ReLU approximation | None (scaling $(1, 1)$) | Not established |

CELU thus provides a flexible, theoretically favored, and robust activation for modern deep learning, with explicit guarantees in function approximation and gradient management (Barron, 2017, Zhang et al., 2023).
