- The paper identifies two critical failure modes in deep ReLU networks and proposes initialization and architectural strategies to mitigate them.
- The authors show that initializing weights from a symmetric distribution with variance 2/fan-in prevents exploding or vanishing activations in fully connected and convolutional networks, while residual networks additionally require appropriately scaled residual blocks.
- Empirical validation confirms that following these guidelines enables deeper networks to train more efficiently, offering actionable insights for overcoming early training obstacles.
How to Start Training: The Effect of Initialization and Architecture
This paper by Boris Hanin and David Rolnick rigorously addresses critical aspects of neural network training, focusing specifically on initialization and architectural choices. The work identifies two prevalent failure modes during the early training of deep ReLU networks and provides theoretical and empirical insights on how to mitigate these issues across multiple architectures, including fully connected networks, convolutional networks, and residual networks.
Failure Modes in Deep Learning
The paper delineates two primary failure modes:
- FM1 (Exploding or Vanishing Mean Activation Length): The mean length scale of the activations in the final layer grows or decays exponentially with depth.
- FM2 (Exponential Growth of Activation Length Variance): The empirical variance of activation lengths across layers grows exponentially with depth, so an individual network can behave erratically even when the mean is well behaved. The paper shows that FM1 is governed by the initialization, whereas FM2 is governed by the architecture, in particular the hidden-layer widths. Both quantities are easy to probe numerically, as in the sketch after this list.
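To make these failure modes concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) that pushes a fixed input through many independently initialized fully connected ReLU networks and reports the mean and variance of the squared output length. The depth, width, candidate variances, and the `output_length_sq` helper are illustrative assumptions; with weight variance 2/fan-in the mean stays roughly constant, while smaller or larger variances make it collapse or blow up (FM1), and the spread across samples gives a rough feel for FM2-style variability.

```python
import numpy as np

def output_length_sq(depth, width, var_times_fan_in, x, rng):
    """Mean squared activation at the output of one randomly initialized ReLU net.

    Weights are drawn i.i.d. from N(0, var_times_fan_in / fan_in); biases are zero.
    """
    h = x
    for _ in range(depth):
        fan_in = h.shape[0]
        W = rng.normal(0.0, np.sqrt(var_times_fan_in / fan_in), size=(width, fan_in))
        h = np.maximum(W @ h, 0.0)  # ReLU
    return float(h @ h) / h.shape[0]

rng = np.random.default_rng(0)
x = rng.normal(size=100)                  # one fixed input, reused for every sampled net
for var_times_fan_in in (1.0, 2.0, 4.0):  # weight variance expressed as a multiple of 1/fan-in
    samples = [output_length_sq(depth=50, width=100,
                                var_times_fan_in=var_times_fan_in, x=x, rng=rng)
               for _ in range(200)]
    print(f"variance {var_times_fan_in}/fan-in: "
          f"mean {np.mean(samples):.3e}, variance {np.var(samples):.3e}")
```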
Key Results and Contributions
- FM1 Avoidance: The authors demonstrate that FM1 can be circumvented by weight initialization alone. For fully connected and convolutional networks, weights should be drawn from a symmetric distribution with variance 2/fan-in. For residual networks, the key additional step is to scale the residual modules appropriately (see the initialization sketch after this list).
- FM2 Avoidance: For fully connected networks, FM2 is governed by the sum of the reciprocals of the hidden-layer widths: keeping this sum constant (or at least bounded) as depth grows, for instance by increasing widths with depth, keeps the variance of activation lengths under control. This constraint is relaxed in residual networks, where FM2 is not a concern once FM1 is avoided. A small sketch of the width-sum quantity follows this list.
- Empirical Validation: Empirical studies affirm the theoretical predictions. Networks initialized according to the paper's guidelines begin training more effectively, especially as network depth increases. The empirical results also highlight the inadequacies of several popular initializations that do not comply with the proposed variance requirements.
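As a concrete illustration of the FM1 prescription, the sketch below uses PyTorch, assuming only standard `torch.nn` modules: fully connected layers are initialized with variance 2/fan-in (what PyTorch exposes as Kaiming/He normal initialization with `mode='fan_in'` and `nonlinearity='relu'`), and a toy residual block down-scales its residual branch by a fixed constant before the skip addition. The layer sizes, zero biases, and the fixed `scale=0.1` are illustrative choices, not values prescribed by the paper, which states a more general condition on how the per-block scales may be chosen.

```python
import torch
import torch.nn as nn

def init_two_over_fan_in(module):
    """Draw Linear weights from N(0, 2 / fan_in) and zero the biases."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, mode='fan_in', nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

class ScaledResidualBlock(nn.Module):
    """A toy residual block whose residual branch is multiplied by a small constant."""

    def __init__(self, width, scale=0.1):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                  nn.Linear(width, width))
        self.scale = scale

    def forward(self, x):
        # Scaling the branch keeps activation lengths from compounding across blocks.
        return x + self.scale * self.body(x)

net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    ScaledResidualBlock(256),
    ScaledResidualBlock(256),
    nn.Linear(256, 10),
)
net.apply(init_two_over_fan_in)  # recursively initializes every Linear submodule
```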
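To see what the fully connected FM2 condition looks like numerically, this tiny sketch (my own illustration, with made-up widths) computes the sum of reciprocal hidden-layer widths for a few architectures; at constant width the sum grows linearly with depth, whereas letting widths grow with depth keeps it much smaller.

```python
def reciprocal_width_sum(widths):
    """Sum of 1/n_j over hidden-layer widths n_j, the quantity the paper ties to FM2."""
    return sum(1.0 / n for n in widths)

print(reciprocal_width_sum([128] * 10))                       # depth 10, constant width 128
print(reciprocal_width_sum([128] * 100))                      # depth 100, same width: 10x larger
print(reciprocal_width_sum([16 * d for d in range(1, 101)]))  # depth 100, widths growing with depth
```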
Implications and Future Directions
The theoretical framework and results have practical ramifications for designing and initializing deep networks. By addressing FM1 and FM2, the paper contributes to building networks that are not only deeper but also able to begin training quickly and reliably. The distinctions drawn between fully connected, convolutional, and residual networks help explain why certain architectures, such as ResNets, are empirically robust.
Looking forward, further work might analyze activation functions beyond ReLU and other network configurations, such as recurrent networks. These findings could also inform initialization practices for novel architectures and more complex tasks.
Conclusion
This paper contributes a comprehensive analysis of how initialization and architecture influence the ease of training deep neural networks. The rigorous identification of the two failure modes and of their remedies highlights crucial considerations for the deep learning community. Through theoretical guarantees and empirical validation, Hanin and Rolnick provide a structured approach to overcoming early training obstacles, enriching the toolkit available for training robust, deep neural networks.