Empirical Evaluation of Rectified Activations in Convolutional Network (1505.00853v2)

Published 5 May 2015 in cs.LG, cs.CV, and stat.ML

Abstract: In this paper we investigate the performance of different types of rectified activation functions in convolutional neural network: standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear units (RReLU). We evaluate these activation function on standard image classification task. Our experiments suggest that incorporating a non-zero slope for negative part in rectified activation units could consistently improve the results. Thus our findings are negative on the common belief that sparsity is the key of good performance in ReLU. Moreover, on small scale dataset, using deterministic negative slope or learning it are both prone to overfitting. They are not as effective as using their randomized counterpart. By using RReLU, we achieved 75.68% accuracy on CIFAR-100 test set without multiple test or ensemble.

Authors (4)
  1. Bing Xu (66 papers)
  2. Naiyan Wang (65 papers)
  3. Tianqi Chen (77 papers)
  4. Mu Li (95 papers)
Citations (2,808)

Summary

  • The paper demonstrates that the randomized leaky ReLU significantly reduces overfitting compared to standard ReLU.
  • It employs a comprehensive experimental framework using CIFAR-10, CIFAR-100, and NDSB datasets to assess activation function performance.
  • The results challenge the common belief that sparsity is crucial, promoting the design of more effective non-saturated activation functions.

Empirical Evaluation of Rectified Activations in Convolutional Network

The paper investigates the empirical performance of different rectified activation functions in convolutional neural networks (CNNs): the standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU), and a newly introduced randomized leaky rectified linear unit (RReLU). These activation functions are evaluated on standard image classification tasks, specifically the CIFAR-10, CIFAR-100, and National Data Science Bowl (NDSB) datasets. The primary objective is to determine whether non-saturated activation functions with a non-zero negative slope consistently improve CNN performance, and to challenge the common belief that sparsity is the key to good performance in ReLU-based models.

Introduction

Convolutional neural networks have achieved substantial success in various computer vision tasks, driven in part by the use of non-saturated activation functions such as ReLU. Unlike sigmoid or tanh, ReLU is a piecewise linear function that prunes the negative part to zero, producing sparse activations that are commonly believed to be a critical factor in its superior performance. The paper aims to answer two main questions: (1) Is sparsity the most important factor for good performance? (2) Can better non-saturated activation functions be designed to outperform ReLU?

Rectified Activation Functions

Standard Rectified Linear Unit (ReLU)

ReLU is defined as:

$$y_i = \max(0, x_i)$$

Leaky Rectified Linear Unit (Leaky ReLU)

Leaky ReLU introduces a small fixed slope for negative input values:

$$y_i = \begin{cases} x_i & \text{if } x_i \geq 0 \\ a x_i & \text{if } x_i < 0 \end{cases}$$

where $a$ is a fixed parameter, typically set to a small value such as 0.01.

Parametric Rectified Linear Unit (PReLU)

PReLU allows the slope $a_i$ of the negative part to be a learnable parameter:

$$y_i = \begin{cases} x_i & \text{if } x_i \geq 0 \\ a_i x_i & \text{if } x_i < 0 \end{cases}$$

Randomized Leaky Rectified Linear Unit (RReLU)

RReLU randomizes the negative slope during training: each $a_i$ is drawn from a uniform distribution $U(l, u)$:

$$y_i = \begin{cases} x_i & \text{if } x_i \geq 0 \\ a_i x_i & \text{if } x_i < 0 \end{cases}, \qquad a_i \sim U(l, u)$$

For testing, $a_i$ is fixed to the average of the training distribution, $\frac{l + u}{2}$.
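
The following is a minimal NumPy sketch of the four activations as defined above, written with the negative part as a multiplicative slope $a x_i$. The function names, signatures, and default slope range are illustrative assumptions rather than the paper's code; the default RReLU range of $[1/8, 1/3]$ corresponds to the $U(3, 8)$ range the paper reports in its reciprocal-slope parameterization.

```python
import numpy as np

def relu(x):
    # Standard ReLU: zero out the negative part.
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # Leaky ReLU: fixed small slope a on the negative part.
    return np.where(x >= 0, x, a * x)

def prelu(x, a):
    # PReLU: same form as Leaky ReLU, but a is a learnable parameter
    # (typically one per channel); its gradient would be handled by the
    # training framework, which this sketch omits.
    return np.where(x >= 0, x, a * x)

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=None):
    # RReLU: during training the negative slope is sampled from
    # U(lower, upper) for each activation; at test time the fixed
    # average slope (lower + upper) / 2 is used instead.
    if training:
        rng = np.random.default_rng() if rng is None else rng
        a = rng.uniform(lower, upper, size=x.shape)
    else:
        a = (lower + upper) / 2.0
    return np.where(x >= 0, x, a * x)
```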

Experiment Settings

CIFAR-10 and CIFAR-100

The CIFAR-10 dataset consists of 60,000 32x32 RGB images in 10 classes, and CIFAR-100 likewise contains 60,000 images but across 100 classes. For both datasets a Network in Network (NIN) architecture was used. RReLU's randomization reduced overfitting and produced test accuracies superior to standard ReLU and the other leaky variants (see the sketch below).
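
As a rough illustration of the change being compared (not the paper's exact NIN configuration or hyperparameters), swapping ReLU for RReLU in a modern framework is a one-line change per activation. The sketch below uses PyTorch's `nn.RReLU`, whose default slope range of $[1/8, 1/3]$ matches the range discussed above; the layer sizes here are assumptions for illustration only.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # A small NIN-style block: a 3x3 convolution followed by a 1x1
    # "mlpconv" layer. nn.ReLU() would be the baseline activation;
    # nn.RReLU samples the negative slope from U(lower, upper) during
    # training and uses the fixed mean slope in eval mode.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.RReLU(lower=1/8, upper=1/3),
        nn.Conv2d(out_ch, out_ch, kernel_size=1),
        nn.RReLU(lower=1/8, upper=1/3),
    )

model = nn.Sequential(
    conv_block(3, 96),
    nn.MaxPool2d(kernel_size=3, stride=2),
    conv_block(96, 192),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(192, 10),  # 10 output classes for CIFAR-10
)
```

Calling `model.train()` or `model.eval()` switches RReLU between its randomized and averaged behavior, mirroring the train/test distinction described above.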

National Data Science Bowl (NDSB)

The NDSB dataset comprises 30,336 labeled grayscale images across 121 classes. The paper used a more complex CNN architecture for this task, and RReLU again demonstrated superior performance in reducing overfitting, as evidenced by lower test error rates.

Results and Discussion

CIFAR-10 and CIFAR-100

  • Leaky ReLU with a larger negative slope consistently outperformed the standard ReLU ($a = 5.5$ in the paper's $y_i = x_i / a$ parameterization of the negative part, i.e., a slope of roughly 0.18).
  • PReLU exhibited the lowest training error but higher test error, indicating overfitting.
  • RReLU showed significant improvements in test accuracies, highlighting the advantage of its randomized negative slopes.

National Data Science Bowl (NDSB)

  • RReLU continued to outperform the other activation functions, achieving a lower log-loss on the validation set.
  • The effectiveness of RReLU in combating overfitting was more pronounced due to the smaller training set relative to network complexity.

Conclusion

The empirical findings suggest that ReLU is not the optimal activation function across all scenarios. Both Leaky ReLU and its variants (PReLU and RReLU) consistently demonstrated better performance in various CNN architectures and datasets. Specifically, RReLU showed significant advantages in reducing overfitting, making it a robust choice for tasks involving limited training data. Future work should delve into the theoretical underpinnings of these empirical observations and test the efficacy of these activation functions on larger datasets.

By systematically evaluating multiple activation functions, the paper challenges prevalent assumptions about the role of sparsity in neural network performance, opening up avenues for more research in designing superior non-saturated activation functions.
