
ReZero is All You Need: Fast Convergence at Large Depth (2003.04887v2)

Published 10 Mar 2020 in cs.LG, cs.CL, and stat.ML

Abstract: Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster on enwiki8.

Citations (253)

Summary

  • The paper demonstrates that initializing residual connections with a near-zero scaling factor (ReZero) enables fast convergence in deep networks.
  • The paper reveals that standard LayerNorm and self-attention mechanisms can cause vanishing singular values, which ReZero effectively mitigates through architectural tweaks.
  • The paper validates ReZero empirically, showing improved convergence on CIFAR-10 ResNets and stable gradient behavior in Transformers ranging from 12 to 128 layers.

Overview of "ReZero is All You Need: Fast Convergence at Large Depth"

The paper "ReZero is All You Need: Fast Convergence at Large Depth" explores enhancing deep network training efficiency by addressing critical aspects of signal propagation in Transformers. The work thoroughly examines vanishing singular values, convergence speed, and residual gate methodologies, providing empirical insights and architectural innovations.

Vanishing Singular Values in Transformers

The authors focus on the complications posed by LayerNorm and self-attention in preserving signal integrity across Transformer layers. The work highlights that LayerNorm can introduce vanishing singular values in the input-output Jacobian due to its method of normalizing inputs. Similarly, the self-attention mechanism introduces a degree of signal loss, inherently linked to the softmax function projecting embedding vectors onto limited dimensions. The paper suggests that conventional setups for self-attention and LayerNorm cannot by themselves maintain dynamical isometry, a condition crucial for unhindered signal flow as depth increases.
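
To make the vanishing-singular-value point concrete, the short numerical check below (our illustration, not code from the paper) computes the singular values of the input-output Jacobian of a PyTorch LayerNorm module; the near-zero singular values correspond to the directions removed by mean subtraction and variance normalization.

```python
# Numerical check: LayerNorm's input-output Jacobian has (near-)zero singular values.
# Illustrative sketch only; the paper's argument is analytic.
import torch
from torch.autograd.functional import jacobian

d = 64
layer_norm = torch.nn.LayerNorm(d)
x = torch.randn(d)

# Jacobian of the LayerNorm output with respect to its input: a (d, d) matrix.
J = jacobian(lambda inp: layer_norm(inp), x)
singular_values = torch.linalg.svdvals(J)

print(f"min singular value: {singular_values.min().item():.2e}")
print(f"max singular value: {singular_values.max().item():.2e}")
# The smallest singular values are numerically zero: normalization projects out
# the mean and scale directions, so the layer alone cannot preserve dynamical isometry.
```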

Convergence Speed

The paper also examines hyperparameter settings that influence convergence speed, especially when scaling Transformer models to greater depth. Experiments with model variants spanning 12 to 128 layers indicate the impact of batch size and learning rate scaling on performance. Using the LAMB optimizer without a learning rate schedule, the authors report notable training efficiency on V100 GPUs while keeping the initialization scheme consistent across depths.
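
As a rough illustration of that setup, the sketch below attaches the LAMB optimizer at a constant learning rate to a placeholder Transformer encoder, with no scheduler. It assumes the third-party torch_optimizer package for LAMB, and the model size, learning rate, and loss are placeholders rather than the paper's values.

```python
# Constant-learning-rate LAMB training step (no LR schedule), as a rough sketch.
# Assumes the third-party `torch_optimizer` package; hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch_optimizer

# Stand-in for the 12- to 128-layer Transformer variants discussed above.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12,
)

optimizer = torch_optimizer.Lamb(model.parameters(), lr=1e-3)  # fixed LR, no scheduler

def train_step(batch: torch.Tensor) -> float:
    """One optimization step on a (batch, seq_len, d_model) tensor."""
    optimizer.zero_grad()
    out = model(batch)
    loss = out.pow(2).mean()  # placeholder objective for illustration only
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(8, 128, 512)))
```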

Residual Gates and ReZero

A significant contribution of this paper is the discussion of residual gate configurations. The paper presents a detailed analysis of techniques ranging from Highway Networks and ResNets to FixUp and SkipInit. The ReZero architecture, introduced as a minimalist approach, modifies residual connections by scaling the signal contribution of deep layers. By initializing the additional scalar multiplier α at zero, ReZero ensures that the network begins as the identity map, encouraging stable gradient behavior. Comparative assessments reveal the effectiveness of ReZero against established methods, demonstrating improved convergence and maintained signal propagation.
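
Concretely, ReZero replaces the usual residual update with x_{i+1} = x_i + α_i F(x_i), where each α_i is a learnable scalar initialized to zero. The minimal PyTorch sketch below wraps an arbitrary sublayer this way; it is a simplified rendering of the idea, not the authors' released implementation, and the widths and depth are arbitrary.

```python
# Minimal ReZero-style residual wrapper: x_{i+1} = x_i + alpha_i * F(x_i),
# with alpha_i a learnable scalar initialized to zero so each block starts as the identity.
# Simplified sketch, not the authors' reference implementation.
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn
        self.alpha = nn.Parameter(torch.zeros(1))  # zero-initialized residual gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.fn(x)

# A deep fully connected stack that is exactly the identity map at initialization,
# so gradients flow unimpeded regardless of depth.
width, depth = 256, 64
net = nn.Sequential(
    *[ReZeroBlock(nn.Sequential(nn.Linear(width, width), nn.ReLU())) for _ in range(depth)]
)

x = torch.randn(8, width)
assert torch.allclose(net(x), x)  # identity at initialization
```

In the Transformer setting, the paper applies the same gate to the self-attention and feed-forward sublayers in place of the standard LayerNorm residual, which is what allows training networks of 120 or more layers.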

Empirical Results on CIFAR-10

The research includes results from image recognition on the CIFAR-10 dataset. Employing a variety of learning rate schedules, including step-down and superconvergence schedules, the authors show that ReZero reaches high test accuracy within competitive training times. These experiments underscore ReZero's potential for integration into conventional optimization routines without substantial hyperparameter adjustments.
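
For reference, both schedule families mentioned above correspond to standard PyTorch schedulers; the snippet below shows a generic construction of each, with milestones, learning rates, and step counts chosen as placeholders rather than the paper's settings.

```python
# Generic examples of the two schedule families mentioned above; in a real run
# exactly one scheduler would be attached to the optimizer. All numbers are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for a ReZero ResNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Step-down schedule: multiply the learning rate by 0.1 at fixed epoch milestones.
step_down = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

# Superconvergence-style schedule: a one-cycle policy over the full training run.
one_cycle = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.4, total_steps=10_000)
```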

Implications and Future Directions

The insights offered in this paper about signal dynamics and convergence optimization have practical implications in developing more efficient deep Transformer architectures. The ReZero framework, in particular, provides a compelling blueprint for mitigating training challenges associated with depth. Future exploration could involve adapting ReZero within different model structures or applying it to solve gradient-related issues in other architectures.

Overall, this work delineates an important stride in understanding and enhancing deep learning models' training efficiency, bridging theoretical foundations with empirical validations. The proposed ReZero mechanism offers a streamlined yet effective approach with promise for broader AI applications.
