
Gaussian Error Linear Units (GELUs) (1606.08415v5)

Published 27 Jun 2016 in cs.LG

Abstract: We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered computer vision, natural language processing, and speech tasks.

Gaussian Error Linear Units (GELUs)

The paper introduces the Gaussian Error Linear Unit (GELU) as a novel and high-performing activation function for neural networks. This activation function is formulated as $\text{GELU}(x) = x\Phi(x)$, where $\Phi(x)$ represents the cumulative distribution function of the standard Gaussian distribution. The GELU activation function has been empirically evaluated against commonly used activation functions such as ReLU and ELU across various tasks, including computer vision, natural language processing, and speech recognition. The results indicate performance improvements for GELU on all these tasks.

Introduction

The historical context of activation functions reveals an evolutionary path from binary threshold units to sigmoid activations and then to non-smooth Rectified Linear Units (ReLUs), which have proven effective despite lacking a strong statistical foundation. ReLUs act by gating inputs based on their sign ($x\mathbf{1}_{x>0}$), leading to efficient convergence during training. The Exponential Linear Unit (ELU) builds on ReLUs by allowing negative output values, which can speed up training. The nonlinearity remains a critical architectural choice: without it, a deep network reduces to a linear classifier.

Stochastic regularization techniques such as dropout have been used to improve generalization by introducing pseudo-ensembles, yet these operate independently of the chosen activation function. The GELU, however, integrates stochastic regularization into its formulation, providing a more probabilistic view of neuron output and enhancing the model's adaptability and performance.

GELU Formulation

GELU is motivated by combining aspects of dropout, zoneout, and ReLU. Unlike ReLU's deterministic gating based on sign, GELU weights its inputs by their value. Specifically, it stochastically multiplies the input $x$ by a mask $m \sim \text{Bernoulli}(\Phi(x))$, with $\Phi(x)$ denoting the CDF of the standard normal distribution. This approach maintains input dependency while introducing non-determinism reminiscent of adaptive dropout mechanisms. Taking the expectation of this stochastic transformation, $\mathbb{E}[m x] = \Phi(x) \cdot x$, yields the GELU:

$$\text{GELU}(x) = x\Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right],$$

which can be approximated for computational efficiency by:

$$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715x^{3}\right)\right]\right).$$
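
For concreteness, the following sketch (not code from the paper, standard library only) evaluates the exact erf-based GELU alongside the tanh approximation above at a few points:

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF expressed via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh-based approximation: 0.5*x*(1 + tanh(sqrt(2/pi) * (x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

if __name__ == "__main__":
    for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
        exact, approx = gelu_exact(x), gelu_tanh(x)
        print(f"x={x:+.1f}  exact={exact:+.6f}  tanh={approx:+.6f}  |diff|={abs(exact - approx):.2e}")
```

The approximation error stays small across typical activation ranges, which is why the tanh form is often preferred when the exact erf is expensive to compute.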

GELU Experiments

The paper evaluates the performance of GELU against ReLU and ELU on a variety of tasks:

  • MNIST Classification: Using a fully connected neural network trained with and without dropout, GELU showed lower median training log loss compared to ReLU and ELU (a minimal model sketch follows this list).
  • MNIST Autoencoding: In a deep autoencoder setup, GELU significantly outperformed other activation functions at different learning rates.
  • Twitter POS Tagging: The median test set error for GELU was marginally better than ReLU and ELU, demonstrating its generalization capabilities on smaller NLP datasets.
  • TIMIT Frame Classification: GELU achieved the lowest median test error for phone recognition tasks, indicating its robustness in speech recognition.
  • CIFAR-10/100 Classification: GELU outperformed the other activations in both shallow and deep convolutional networks, achieving lower error rates on CIFAR-10 and CIFAR-100 datasets.
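
To illustrate how the GELU acts as a drop-in activation in such networks, below is a minimal sketch of a fully connected classifier with dropout in the spirit of the MNIST classification setup; the framework (PyTorch), layer widths, and dropout rate are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GELUClassifier(nn.Module):
    """Small fully connected classifier using GELU activations and dropout.

    Hypothetical architecture for illustration; not the paper's exact setup.
    """

    def __init__(self, in_dim: int = 784, hidden: int = 128,
                 n_classes: int = 10, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten 28x28 images into 784-dimensional vectors before the MLP.
        return self.net(x.flatten(start_dim=1))

if __name__ == "__main__":
    model = GELUClassifier()
    dummy = torch.randn(32, 1, 28, 28)   # batch of MNIST-sized random inputs
    print(model(dummy).shape)            # torch.Size([32, 10])
```

Swapping nn.GELU() for nn.ReLU() or nn.ELU() reproduces the kind of activation comparison the experiments describe.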

Discussion

GELU shares fundamental properties with ReLU and ELU, such as asymptotic behavior, and can be viewed as a smoothed ReLU. Constructed as $x\Phi(x)$, GELU weights the input by the Gaussian CDF rather than gating it by sign as ReLU does. This probabilistic weighting makes GELU non-convex and non-monotonic, unlike ReLU and ELU, and gives it curvature at every point, which may help it model more complex functions.

Practical recommendations include using momentum-based optimizers and efficient approximations of $\Phi(x)$ during implementation. The related Sigmoid Linear Unit (SiLU), $x\sigma(x)$, also performs well but does not match the GELU in the reported experiments.
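
A minimal sketch of these related sigmoid-based forms (standard library only, not code from the paper): the SiLU $x\sigma(x)$ and the sigmoid-based GELU approximation $x\sigma(1.702x)$ suggested in the paper, compared against the exact erf-based GELU.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    """Sigmoid Linear Unit: x * sigma(x)."""
    return x * sigmoid(x)

def gelu_sigmoid_approx(x: float) -> float:
    """Sigmoid-based GELU approximation: x * sigma(1.702 * x)."""
    return x * sigmoid(1.702 * x)

def gelu_exact(x: float) -> float:
    """Exact GELU via the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  GELU={gelu_exact(x):+.4f}  "
              f"sigmoid approx={gelu_sigmoid_approx(x):+.4f}  SiLU={silu(x):+.4f}")
```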

Conclusion

The paper concludes that GELU consistently outperforms both ELU and ReLU across various tasks and datasets, confirming its viability as an alternative activation function for deep learning architectures. Its unique properties and empirical success make it a compelling choice for neural network practitioners.

Acknowledgment

The research acknowledges NVIDIA Corporation for providing the TITAN X GPUs used in the experiments.

Authors
  1. Dan Hendrycks
  2. Kevin Gimpel
Citations: 4,365