- The paper demonstrates that ELU consistently outperforms other activation functions by achieving faster convergence and higher accuracy in deep neural networks.
- The study systematically evaluates ReLU, Leaky ReLU, ELU, and SELU on MNIST, addressing challenges like vanishing gradients and neuron saturation.
- The research highlights the importance of proper weight initialization, notably Glorot methods, in enhancing performance in architectures with up to seven hidden layers.
Comparison of Non-linear Activation Functions for Deep Neural Networks on MNIST Classification Task
The paper conducted a comparative analysis of several non-linear activation functions, specifically ReLU and its variants, for deep neural networks on the MNIST classification task. The choice of activation function is a critical element of neural network design, since it shapes training dynamics and final performance. The examination aimed to better understand each function's impact on classification accuracy and convergence, and to address the vanishing-gradient and neuron-saturation problems common with traditional sigmoid activations.
Activation Functions Evaluated
In this paper, the activation functions ReLU, Leaky ReLU (LReLU), Exponential Linear Unit (ELU), and Scaled Exponential Linear Unit (SELU) were assessed. Each function attempts to mitigate certain limitations of sigmoid:
- ReLU: Defined as ReLU(x) = max(0, x), it offers simplicity and computational efficiency and mitigates vanishing gradients, since its gradient is non-zero for positive inputs. However, it is susceptible to the "dying ReLU" problem, in which neurons whose inputs remain negative output zero and stop receiving gradient updates.
- Leaky ReLU: To address the dying-ReLU issue, LReLU introduces a small slope α for negative inputs, LReLU(x) = αx for x ≤ 0 and LReLU(x) = x for x > 0, so that inactive units still receive small weight updates while the near-sparse behavior of ReLU is preserved.
- ELU: This function centers activations closer to zero mean without requiring batch normalization, and is defined as ELU(x) = α(exp(x) − 1) for x ≤ 0 and ELU(x) = x for x > 0.
- SELU: Extends ELU with self-normalizing properties suited to deep networks: SELU(x) = λx for x > 0 and SELU(x) = λα(exp(x) − 1) for x ≤ 0, with fixed constants λ ≈ 1.0507 and α ≈ 1.6733 chosen so that activations tend toward zero mean and unit variance (a minimal implementation sketch of all four functions follows this list).
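To make the definitions above concrete, the following is a minimal NumPy sketch of the four activations. The slope and scale constants (α for LReLU and ELU, λ and α for SELU) are the commonly used defaults, not necessarily the exact hyperparameters used in the paper.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negative inputs keeps gradients non-zero.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth negative saturation at -alpha pushes mean activations toward zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    # Scaled ELU: the fixed constants lam and alpha give the
    # self-normalizing property (activations tend to zero mean, unit variance).
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

if __name__ == "__main__":
    z = np.linspace(-3.0, 3.0, 7)
    for name, fn in [("ReLU", relu), ("LReLU", leaky_relu), ("ELU", elu), ("SELU", selu)]:
        print(name, np.round(fn(z), 3))
```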
Experimental Setup
Experiments were conducted on MNIST, a benchmark dataset for digit classification comprising 70,000 handwritten digit images (60,000 for training and 10,000 for testing). The evaluation compared network architectures that varied in the number of hidden layers and in the weight initialization technique. Activation functions were benchmarked under different learning rates (η), and the paper emphasized balancing training accuracy against error rates while avoiding overfitting.
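The sketch below illustrates this kind of comparison using tf.keras; the depth, layer width, learning rate, and epoch count are illustrative choices rather than the paper's exact configuration.

```python
import tensorflow as tf

# Load and scale MNIST (60,000 training and 10,000 test images).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_mlp(activation, hidden_layers=5, units=128, lr=0.01):
    """Fully connected classifier; Dense layers default to Glorot uniform init."""
    layers = [tf.keras.Input(shape=(28, 28)), tf.keras.layers.Flatten()]
    for _ in range(hidden_layers):
        layers.append(tf.keras.layers.Dense(units, activation=activation))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Benchmark each activation under the same architecture and learning rate.
activations = {"relu": "relu", "leaky_relu": tf.nn.leaky_relu,
               "elu": "elu", "selu": "selu"}
for name, act in activations.items():
    model = build_mlp(act)
    history = model.fit(x_train, y_train, epochs=5, batch_size=128,
                        validation_data=(x_test, y_test), verbose=0)
    print(name, "best val accuracy:",
          round(max(history.history["val_accuracy"]), 4))
```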
Key results indicate:
- Network Performance: Deeper networks exhibited improved accuracy up to a saturation point; gains were observed up to seven hidden layers, after which accuracy plateaued.
- Activation Function Efficacy: ELU consistently performed best, converging faster and exhibiting smaller bias shift than the other functions.
- Weight Initialization: Glorot initialization proved significant, particularly in deep architectures, reinforcing the findings of Glorot and Bengio on the benefits of properly scaled initial weights with non-linear activation functions (see the initialization sketch after this list).
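For reference, the following is a short NumPy sketch of Glorot (Xavier) uniform initialization as described by Glorot and Bengio; the layer sizes are illustrative.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=0):
    """Draw a (fan_in, fan_out) weight matrix from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)), which keeps activation and
    gradient variance roughly constant across layers."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: first hidden layer of an MNIST MLP (784 inputs -> 128 units).
W = glorot_uniform(784, 128)
print(W.shape, round(W.std(), 4))  # std is near sqrt(2 / (784 + 128)) ≈ 0.0468
```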
Implications and Speculation
The paper's findings reinforce the importance of activation function choice in neural network design, particularly for deeper architectures where gradient-related issues are paramount. The results suggest that activation functions such as ELU, which inherently address gradient saturation and bias shifts, are advantageous. Additionally, the consideration of weight initialization complements activation function selection to optimize network training dynamics.
In terms of theoretical implications, this paper opens avenues for further exploration of activation functions that incorporate normalization properties, potentially reducing the need for additional techniques like batch normalization. Future work may also explore further variants of these functions or hybrid designs that combine the strengths of several of them.
Practically, these insights are pivotal for researchers and practitioners aiming to refine deep learning models across varied applications, particularly where efficiency and accuracy are critical. Experimentation with activation functions and initialization methods tailored to specific datasets can yield meaningful performance advances.
Conclusion
The analysis presented in this paper contributes valuable data on how activation functions influence neural network performance. While ReLU variants like ELU and SELU offer compelling improvements, the choice of activation function should be aligned with network architecture and training objectives. As neural network designs continue to evolve, ongoing research will likely build on these findings, enhancing our understanding and capability in the field of AI and machine learning.