Searching for Activation Functions (1710.05941v2)

Published 16 Oct 2017 in cs.NE, cs.CV, and cs.LG

Abstract: The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.

Citations (579)

Summary

  • The paper introduces a novel automated search method combining reinforcement learning and exhaustive search to identify activation functions that can replace ReLU.
  • It validates the discovered Swish function, which boosts accuracy by up to 0.9% on ImageNet and performs robustly across diverse datasets and architectures.
  • The findings suggest that exploring non-monotonic activation functions can lead to more efficient and adaptable deep network designs.

Overview of "Searching for Activation Functions"

The paper "Searching for Activation Functions" by Ramachandran, Zoph, and Le explores the automatic discovery of novel activation functions to enhance deep neural networks' performance. Activation functions play a critical role in the training dynamics and overall performance of deep models, with the Rectified Linear Unit (ReLU) being the most prevalent. Despite numerous alternatives being proposed, replacing ReLU has been challenging due to inconsistent results across different applications.

Methodology

The authors employ a combination of exhaustive and reinforcement learning-based search strategies to identify new scalar activation functions that can directly replace ReLU without altering network architectures. The search space, designed to balance size against expressivity, is inspired by the space used in prior work on optimizer search. Candidate functions are built from repeated "core units" that compose unary and binary functions, which keeps the search tractable while still covering a broad family of candidates.

When the search space is small enough, it is searched exhaustively; for larger spaces, a reinforcement learning model with an RNN controller predicts the components of the activation function sequentially. Candidate activation functions are evaluated by training child networks on tasks such as CIFAR-10 classification, and the resulting validation performance is fed back to update the search.
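As a rough illustration of the restricted, exhaustively searched case, the sketch below enumerates candidate activations built from a simplified core unit of the form binary(unary_1(x), unary_2(x)). The operation lists are a small hypothetical subset chosen for brevity, not the paper's full search space.

```python
import math
import itertools

# Simplified core unit: candidate activations of the form
# binary(unary_1(x), unary_2(x)), applied to a single scalar input.
# These operation lists are an illustrative subset only.
UNARY = {
    "identity": lambda x: x,
    "neg": lambda x: -x,
    "abs": abs,
    "square": lambda x: x * x,
    "tanh": math.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}

BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": max,
    "min": min,
}

def enumerate_core_units():
    """Yield (name, callable) pairs for every unary/binary combination."""
    for (u1_name, u1), (u2_name, u2) in itertools.product(UNARY.items(), repeat=2):
        for b_name, b in BINARY.items():
            name = f"{b_name}({u1_name}(x), {u2_name}(x))"
            yield name, (lambda x, u1=u1, u2=u2, b=b: b(u1(x), u2(x)))

# Example: mul(identity(x), sigmoid(x)) is Swish with beta = 1.
for name, fn in enumerate_core_units():
    if name == "mul(identity(x), sigmoid(x))":
        print(name, fn(1.0))  # prints roughly 0.731
```

In the full method, each enumerated (or controller-generated) candidate would be plugged into a child network and scored on a validation task rather than merely evaluated pointwise as here.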

Swish Activation Function

Among the discovered functions, Swish, defined as $f(x) = x \cdot \text{sigmoid}(\beta x)$, emerged as a strong performer. The paper details its properties: unbounded above, bounded below, smooth, and notably non-monotonic. Swish's non-monotonic "bump" plays a crucial role, allowing networks to maintain high information flow even with many preactivations falling in this region. The adaptability of Swish is further enhanced by allowing $\beta$ to be either a constant or trainable, with practical implementation requiring minimal modification to existing code.
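To make the "minimal modification" point concrete, here is a small PyTorch sketch of a drop-in Swish module with an optionally trainable $\beta$. It is an assumed implementation written for illustration, not the authors' original code.

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation: f(x) = x * sigmoid(beta * x).

    beta can be a fixed constant or a trainable scalar, matching the
    two variants discussed in the paper.
    """

    def __init__(self, beta: float = 1.0, trainable: bool = False):
        super().__init__()
        if trainable:
            # Learn beta jointly with the rest of the network.
            self.beta = nn.Parameter(torch.tensor(beta))
        else:
            # Keep beta fixed but move it with the module (CPU/GPU, dtype).
            self.register_buffer("beta", torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

# Usage: swap a ReLU for Swish in an existing model, e.g.
# block = nn.Sequential(nn.Linear(128, 128), Swish(trainable=True))
```

Because Swish is a scalar function applied elementwise, replacing ReLU requires no other architectural changes, which is what makes it practical as a drop-in substitute.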

Empirical Results

Extensive experimentation across diverse datasets and models validates Swish's effectiveness. On challenging benchmarks such as ImageNet, switching from ReLU to Swish in architectures like Mobile NASNet-A yields notable accuracy improvements (e.g., a 0.9% increase in top-1 classification accuracy). These improvements highlight Swish's robustness across different network configurations and tasks, making it a reliable alternative to ReLU.

The paper additionally benchmarks Swish against several baseline activation functions, including Leaky ReLU, Parametric ReLU, Softplus, ELU, SELU, and GELU. Across datasets such as CIFAR-10 and CIFAR-100, and in machine translation experiments with the Transformer architecture, Swish consistently performs on par with or better than these alternatives.

Theoretical and Practical Implications

Swish challenges the conventional reliance on ReLU by demonstrating that non-monotonic functions can handle gradient flow effectively. This finding invites further exploration into non-monotonic activation functions and their benefits to deep network architectures. The consistent improvement across tasks also suggests potential for Swish in broader applications and adaptation within architectures optimized specifically for its characteristics.

Conclusion

This research not only underscores the potential of meta-learning in automated discovery of neural network components but also establishes Swish as a robust activation function capable of enhancing model performance across a variety of contexts. As the deep learning field evolves, such advancements might catalyze further experimentation and adoption of non-traditional activation functions, moving beyond the constraints of ReLU and enabling more efficient, adaptable architectures.
