- The paper demonstrates that the randomized leaky ReLU significantly reduces overfitting compared to standard ReLU.
- It employs a comprehensive experimental framework using CIFAR-10, CIFAR-100, and NDSB datasets to assess activation function performance.
- The results challenge the common belief that sparsity is crucial, motivating the design of more effective non-saturated activation functions.
Empirical Evaluation of Rectified Activations in Convolutional Network
The paper investigates the empirical performance of rectified activation functions in convolutional neural networks (CNNs): the standard rectified linear unit (ReLU), the leaky rectified linear unit (Leaky ReLU), the parametric rectified linear unit (PReLU), and a newly introduced randomized leaky rectified linear unit (RReLU). These activation functions are evaluated on standard image classification tasks using the CIFAR-10, CIFAR-100, and National Data Science Bowl (NDSB) datasets. The primary objective is to determine whether non-saturated activation functions consistently improve CNN performance and to challenge the common belief that sparsity is the key to the success of ReLU-based models.
Introduction
Convolutional neural networks have achieved substantial success in various computer vision tasks, driven in part by the use of non-saturated activation functions such as ReLU. Unlike sigmoid or tanh, ReLU is a piecewise linear function that prunes the negative part to zero, producing sparse activations that are commonly believed to be a critical factor in its superior performance. The paper aims to answer two main questions: (1) Is sparsity the most important factor for good performance? (2) Can better non-saturated activation functions be designed to outperform ReLU?
Rectified Activation Functions
Standard Rectified Linear Unit (ReLU)
ReLU is defined as

$$y_i = \max(0, x_i)$$
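As a minimal sketch (not code from the paper), the element-wise rule can be written directly in NumPy:

```python
import numpy as np

def relu(x):
    # Clamp negative inputs to zero; positive inputs pass through unchanged.
    return np.maximum(0.0, x)
```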
Leaky Rectified Linear Unit (Leaky ReLU)
Leaky ReLU introduces a small, fixed slope for negative input values:

$$y_i = \begin{cases} x_i & \text{if } x_i \ge 0 \\ a x_i & \text{if } x_i < 0 \end{cases}$$
where a is a fixed parameter typically set to a small value such as 0.01.
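A corresponding sketch under the same convention, with the slope a treated as a fixed multiplier (e.g. 0.01):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # Positive inputs pass through; negative inputs are scaled by the fixed slope a.
    return np.where(x >= 0, x, a * x)
```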
Parametric Rectified Linear Unit (PReLU)
PReLU allows the slope $a_i$ of the negative part to be a learnable parameter:

$$y_i = \begin{cases} x_i & \text{if } x_i \ge 0 \\ a_i x_i & \text{if } x_i < 0 \end{cases}$$
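A forward-pass-only sketch, assuming one learnable slope per channel in a channel-first layout (an assumption made here for illustration; in training the slope vector would be updated by backpropagation along with the network weights):

```python
import numpy as np

def prelu(x, a):
    # x: activations of shape (batch, channels, ...); a: one slope per channel.
    # Forward pass only -- during training, a is a parameter learned by backpropagation.
    a = np.asarray(a).reshape((1, -1) + (1,) * (x.ndim - 2))  # broadcast over spatial dims
    return np.where(x >= 0, x, a * x)
```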
Randomized Leaky Rectified Linear Unit (RReLU)
RReLU randomizes the slope of the negative part during training, drawing it from a uniform distribution:

$$y_i = \begin{cases} x_i & \text{if } x_i \ge 0 \\ a_i x_i & \text{if } x_i < 0 \end{cases}, \qquad a_i \sim U(l, u)$$
At test time, $a_i$ is fixed to the mean of the training distribution, $(l + u)/2$, so the output is deterministic.
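A sketch of the train/test behaviour under the multiplicative-slope convention used above; the default bounds here are illustrative rather than the paper's exact hyperparameters:

```python
import numpy as np

def rrelu(x, l=1/8, u=1/3, training=True, rng=None):
    # During training, each negative activation gets its own random slope a ~ U(l, u);
    # at test time the deterministic average slope (l + u) / 2 is used instead.
    if training:
        rng = np.random.default_rng() if rng is None else rng
        a = rng.uniform(l, u, size=x.shape)
    else:
        a = (l + u) / 2.0
    return np.where(x >= 0, x, a * x)
```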
Experiment Settings
CIFAR-10 and CIFAR-100
The CIFAR-10 dataset consists of 60,000 32x32 RGB images in 10 classes, and CIFAR-100 contains the same number of images in 100 classes. For these datasets, a Network in Network (NIN) architecture was used. RReLU, thanks to its randomization, reduced overfitting and achieved higher test accuracy than standard ReLU and the other deterministic variants.
National Data Science Bowl (NDSB)
The NDSB dataset comprises 30,336 labeled grayscale images across 121 classes. The paper used a deeper, more complex CNN architecture for this task, and RReLU again reduced overfitting most effectively, as evidenced by lower test error rates.
Results and Discussion
CIFAR-10 and CIFAR-100
- Leaky ReLU with a larger negative slope (the paper's a = 5.5 setting, i.e., a slope of 1/5.5) consistently outperformed standard ReLU.
- PReLU exhibited the lowest training error but higher test error, indicating overfitting.
- RReLU showed significant improvements in test accuracies, highlighting the advantage of its randomized negative slopes.
National Data Science Bowl (NDSB)
- RReLU continued to outperform the other activation functions, achieving a lower log-loss on the validation set.
- The effectiveness of RReLU in combating overfitting was more pronounced due to the smaller training set relative to network complexity.
Conclusion
The empirical findings suggest that ReLU is not the optimal activation function across all scenarios. Leaky ReLU, PReLU, and RReLU consistently outperformed standard ReLU across the tested CNN architectures and datasets. In particular, RReLU showed significant advantages in reducing overfitting, making it a robust choice for tasks with limited training data. Future work should delve into the theoretical underpinnings of these empirical observations and test the efficacy of these activation functions on larger datasets.
By systematically evaluating multiple activation functions, the paper challenges prevalent assumptions about the role of sparsity in neural network performance, opening up avenues for more research in designing superior non-saturated activation functions.