On the Impact of the Activation Function on Deep Neural Networks Training (1902.06853v2)

Published 19 Feb 2019 in stat.ML, cs.AI, and cs.LG

Abstract: The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully as recently demonstrated by Schoenholz et al. (2017) who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the `Edge of Chaos' can lead to good performance. While the work by Schoenholz et al. (2017) discusses trainability issues, we focus here on training acceleration and overall performance. We give a comprehensive theoretical analysis of the Edge of Chaos and show that we can indeed tune the initialization parameters and the activation function in order to accelerate the training and improve the performance.

Citations (181)

Summary

  • The paper demonstrates that activation functions critically shape training dynamics by employing Gaussian process approximations to analyze optimal Edge of Chaos conditions.
  • It systematically categorizes activations, showing that smoother functions like Tanh and ELU maintain stable signal propagation across deep layers.
  • The study offers practical strategies for initializing network parameters on the Edge of Chaos, enhancing convergence and overall performance in deep learning models.

Impact of Activation Functions on Deep Neural Networks: A Theoretical Analysis

This paper explores the intricate effects of activation functions on the trainability and performance of deep neural networks (DNNs), with a specific emphasis on the initialization conditions that foster efficient training. The authors focus their theoretical analysis on the "Edge of Chaos" (EOC), a regime where networks display optimal trainability. This work builds on existing literature by providing a thorough examination of the EOC and revealing the crucial role of activation functions in deep learning.

Theoretical Contributions

  1. Gaussian Process Approximation: The authors employ Gaussian process approximations to model the behavior of neural networks with infinitely wide layers. This approach enables the derivation of theoretical insights into how network parameters and activation functions impact the variance and correlation propagation through the layers at initialization (a numerical sketch of these recursions follows this list).
  2. Edge of Chaos Analysis: This research extends previous knowledge by analyzing the EOC conditions for a broad class of activation functions. The EOC is characterized by a delicate balance between order and chaos, where gradient propagation and information flow are both optimized.
  3. Convergence and Stability: The paper elucidates the conditions under which network variances and correlations converge and remain stable across layers. Through rigorous mathematical formulations, the analysis highlights how activation functions influence the convergence rates—particularly showing that smoother activations ensure more effective signal propagation.
  4. Activation Function Classifications: The paper systematically classifies activation functions into categories such as ReLU-like and Tanh-like, and demonstrates how these categorizations affect EOC conditions. Notably, smoother activations like ELU and Tanh facilitate superior information propagation and are therefore more suitable for very deep networks compared to ReLU-like functions.
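
As a concrete illustration of the first two points, the following is a minimal numerical sketch assuming the standard infinite-width recursions for the variance and correlation of pre-activations; it is not the authors' code, and the function names (variance_map, correlation_map, chi) and the Monte Carlo evaluation of the Gaussian integrals are illustrative choices. The script checks the well-known ReLU EOC point (sigma_b, sigma_w) = (0, sqrt(2)).

```python
import numpy as np

# Shared Monte Carlo samples for the Gaussian integrals (quadrature would also work).
rng = np.random.default_rng(0)
Z1, Z2 = rng.standard_normal((2, 500_000))

def variance_map(q, phi, sigma_w, sigma_b):
    """One step of the length recursion  q^l = sigma_b^2 + sigma_w^2 * E[phi(sqrt(q^{l-1}) Z)^2]."""
    return sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(q) * Z1) ** 2)

def correlation_map(c, q, phi, sigma_w, sigma_b):
    """One step of the correlation recursion for two inputs whose variances sit at the fixed point q."""
    u1 = np.sqrt(q) * Z1
    u2 = np.sqrt(q) * (c * Z1 + np.sqrt(1.0 - c**2) * Z2)
    return (sigma_b**2 + sigma_w**2 * np.mean(phi(u1) * phi(u2))) / q

def chi(q, dphi, sigma_w):
    """EOC diagnostic chi = sigma_w^2 * E[phi'(sqrt(q) Z)^2]; the EOC is the set of (sigma_b, sigma_w) with chi = 1."""
    return sigma_w**2 * np.mean(dphi(np.sqrt(q) * Z1) ** 2)

# Check the well-known ReLU EOC point (sigma_b, sigma_w) = (0, sqrt(2)).
phi = lambda x: np.maximum(x, 0.0)
dphi = lambda x: (x > 0).astype(float)
q = 1.0
for _ in range(20):                       # iterate the variance map
    q = variance_map(q, phi, np.sqrt(2.0), 0.0)
print(q, chi(q, dphi, np.sqrt(2.0)))      # both stay close to 1 (up to Monte Carlo error): Edge of Chaos
print(correlation_map(0.9, q, phi, np.sqrt(2.0), 0.0))  # correlations drift slowly toward 1 with depth
```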

Practical Implications and Future Directions

  • Hyperparameter Initialization:

The findings suggest practical initialization strategies for DNNs. The optimal choice involves selecting initialization parameters on the EOC curve, which facilitates efficient training and enhances performance by preventing gradient vanishing or exploding.
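
Concretely, initializing on the EOC amounts to drawing weights as W ~ N(0, sigma_w^2 / fan_in) and biases as b ~ N(0, sigma_b^2), with (sigma_b, sigma_w) chosen on the EOC curve of the activation in use. The PyTorch sketch below is a hedged illustration rather than the paper's implementation; init_on_eoc is a hypothetical helper, and its defaults correspond to the ReLU EOC point (0, sqrt(2)), which coincides with He initialization, while other activations require their own EOC values.

```python
import math
import torch.nn as nn

def init_on_eoc(model: nn.Module, sigma_w: float = math.sqrt(2.0), sigma_b: float = 0.0) -> None:
    """Initialize every Linear layer with W ~ N(0, sigma_w^2 / fan_in) and b ~ N(0, sigma_b^2)."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            fan_in = m.in_features
            nn.init.normal_(m.weight, mean=0.0, std=sigma_w / math.sqrt(fan_in))
            if m.bias is not None:
                if sigma_b == 0.0:
                    nn.init.zeros_(m.bias)
                else:
                    nn.init.normal_(m.bias, mean=0.0, std=sigma_b)

# Usage: a deep ReLU MLP initialized on its Edge of Chaos.
depth, width = 50, 256
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.ReLU()]
model = nn.Sequential(*layers)
init_on_eoc(model)
```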

  • Activation Function Design:

Through this theoretical lens, the paper provides significant evidence for designing and selecting activation functions that are smooth and bounded, as these functions offer better signal propagation, which is crucial for training deeper architectures effectively.
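
For smooth activations such as Tanh, the EOC curve has no simple closed form, so one typically solves for it numerically. The standalone sketch below, with illustrative names not taken from the paper, bisects on sigma_w for a given sigma_b until the EOC condition sigma_w^2 * E[phi'(sqrt(q*) Z)^2] = 1 holds at the variance fixed point q*, using Monte Carlo estimates of the Gaussian integrals.

```python
import numpy as np

Z = np.random.default_rng(1).standard_normal(100_000)   # Monte Carlo samples for the Gaussian integrals

def fixed_point_variance(sigma_w, sigma_b, n_iter=100):
    """Iterate q -> sigma_b^2 + sigma_w^2 * E[tanh(sqrt(q) Z)^2] to its fixed point q*."""
    q = 1.0
    for _ in range(n_iter):
        q = sigma_b**2 + sigma_w**2 * np.mean(np.tanh(np.sqrt(q) * Z) ** 2)
    return q

def chi(sigma_w, sigma_b):
    """EOC diagnostic sigma_w^2 * E[tanh'(sqrt(q*) Z)^2], with tanh'(x) = 1 - tanh(x)^2."""
    q = fixed_point_variance(sigma_w, sigma_b)
    dphi = 1.0 - np.tanh(np.sqrt(q) * Z) ** 2
    return sigma_w**2 * np.mean(dphi ** 2)

def eoc_sigma_w(sigma_b, lo=0.5, hi=3.0, tol=1e-3):
    """Bisect on sigma_w until chi = 1 (the Edge of Chaos condition) for the given sigma_b."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if chi(mid, sigma_b) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

print(eoc_sigma_w(sigma_b=0.2))   # sigma_w on the Tanh EOC curve for this bias scale
```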

  • Bayesian Neural Networks:

The conclusions drawn about initialization on the EOC are also relevant for Bayesian neural networks, suggesting configurations that maintain non-degenerate priors in the induced function space.

Numerical Validation and Practical Insights

Numerical experiments conducted as part of this research reaffirm the theoretical assertions. The experiments demonstrate that deep networks initialized on the EOC, with appropriate activation functions, are not only trainable but also yield better empirical performance than networks initialized in the ordered or chaotic phases. Additionally, combining the EOC with batch normalization, although less effective than EOC initialization alone, is shown to enhance the stability of very deep networks.

The paper opens up promising directions for optimizing deep learning models, highlighting the need for further exploration into activation functions that provide both expressiveness and stability. Future work could extend these insights into more complex architectures, including convolutional and recurrent neural networks, and investigate adaptive initialization schemes that account for dynamic network behavior during training.

In summary, this extensive theoretical and empirical paper provides foundational knowledge for improving the training dynamics of deep neural networks, with direct implications in enhancing the capabilities of AI across diverse applications.