DeepNet: Scaling Transformers to 1,000 Layers (2203.00555v1)

Published 1 Mar 2022 in cs.CL and cs.LG

Abstract: In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanying with theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.

DeepNet: Scaling Transformers to 1,000 Layers

The paper "DeepNet: Scaling Transformers to 1,000 Layers" by Hongyu Wang et al. makes a significant contribution to stabilizing and scaling deep Transformer architectures. The authors introduce a novel normalization function, DeepNorm, designed to mitigate the instability issues that limit Transformer depth. Their approach enables the training of extremely deep networks with up to 1,000 layers, surpassing previously established limits by an order of magnitude.

Theoretical Foundation and Implementation

The paper identifies exploding model updates as a primary cause of instability in deep Transformer models. To address this, the authors propose DeepNorm, which modifies the residual connections within the Transformer architecture and pairs them with a theoretically derived initialization strategy. DeepNorm combines the advantages of Pre-LN (stable training) and Post-LN (good performance) by bounding the magnitude of model updates by a constant.
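
In the paper's notation, where G_l denotes the l-th sub-layer (self-attention or feed-forward network) with parameters θ_l and LN is layer normalization, the change amounts to placing a constant α on the residual branch:

```latex
% Standard Post-LN residual connection:
\[ x_{l+1} = \mathrm{LN}\bigl(x_l + G_l(x_l, \theta_l)\bigr) \]

% DeepNorm residual connection, with a depth-dependent constant \alpha > 1,
% paired with an initialization that scales selected sub-layer weights by \beta < 1:
\[ x_{l+1} = \mathrm{LN}\bigl(\alpha\, x_l + G_l(x_l, \theta_l)\bigr) \]
```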

The implementation of DeepNorm is straightforward, requiring only a few lines of code changes, which makes it practical for wide adoption. Specifically, it introduces scaling constants (α for the residual connection and β for initialization) whose values depend on whether the architecture is encoder-only, decoder-only, or encoder-decoder, thereby stabilizing training across these configurations.
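
For concreteness, here is a minimal PyTorch-style sketch of the DeepNorm residual wrapper and the depth-dependent constants; the names DeepNormResidual, deepnorm_constants, and deepnorm_init are illustrative rather than the authors' released implementation, while the α/β values follow those reported in the paper.

```python
# Minimal PyTorch-style sketch of DeepNorm (illustrative names, not the
# authors' released code). The alpha/beta values follow those reported
# in the paper for each architecture.
import torch
import torch.nn as nn


def deepnorm_constants(arch: str, N: int = 0, M: int = 0):
    """Return the (alpha, beta) constants for a given architecture.

    encoder-only:    N encoder layers
    decoder-only:    M decoder layers
    encoder-decoder: N encoder / M decoder layers; returns a dict with
                     separate constants for the encoder and decoder stacks.
    """
    if arch == "encoder-only":
        return (2 * N) ** 0.25, (8 * N) ** -0.25
    if arch == "decoder-only":
        return (2 * M) ** 0.25, (8 * M) ** -0.25
    if arch == "encoder-decoder":
        return {
            "encoder": (0.81 * (N ** 4 * M) ** (1 / 16),
                        0.87 * (N ** 4 * M) ** (-1 / 16)),
            "decoder": ((3 * M) ** 0.25, (12 * M) ** -0.25),
        }
    raise ValueError(f"unknown architecture: {arch}")


class DeepNormResidual(nn.Module):
    """Wraps a sub-layer G (attention or FFN) with the DeepNorm connection:
    x_{l+1} = LayerNorm(alpha * x + G(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = alpha
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))


def deepnorm_init(linear: nn.Linear, beta: float) -> None:
    """Xavier-normal initialization scaled by beta. Per the paper, this is
    applied to the feed-forward layers and the attention value/output
    projections; query/key projections keep a gain of 1."""
    nn.init.xavier_normal_(linear.weight, gain=beta)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```

As a usage example, deepnorm_constants("encoder-only", N=24) gives α = 48^(1/4) ≈ 2.63 and β = 192^(-1/4) ≈ 0.27 for a 24-layer encoder.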

Empirical Results

DeepNet demonstrates strong empirical results, particularly in neural machine translation (NMT). Experiments on the IWSLT-14 De-En and WMT-17 En-De datasets show that the model remains stable and achieves high BLEU scores as its depth increases from 10 to 100 layers. Notably, DeepNet outperforms existing deep Transformer variants, such as DLCL, ReZero, and NormFormer, especially when scaled to 100 or more layers.

Furthermore, in massively multilingual NMT, DeepNet scales up to 1,000 layers in an encoder-decoder configuration on the OPUS-100 dataset, achieving a substantial average BLEU improvement over baseline models. This demonstrates the model's scalability and effectiveness on large-scale, multilingual datasets.

Practical and Theoretical Implications

The practical implications of this research are substantial, providing a viable path to building deep Transformer-based models without the instability issues that typically limit their depth. The successful training of 1,000-layer models marks a significant milestone for those developing Transformers for high-capacity applications such as multilingual translation, and potentially for other domains such as image recognition and language modeling.

The theoretical insights into bounding model updates contribute to the understanding of deep learning optimization, particularly for Transformer architectures. Future research may further elucidate the effects of various initialization schemes and normalization factors on even deeper networks or more complex architectures.

Conclusion

This research represents a significant advancement in Transformer scaling, with DeepNorm offering both theoretical and practical benefits by stabilizing extremely deep Transformer models. The findings and methods proposed are poised to influence future developments in Transformer research and deployment across high-capacity AI applications. Continued exploration of initialization, normalization, and scaling strategies will likely spur further innovations building on this foundational work.

Authors (6)
  1. Hongyu Wang (104 papers)
  2. Shuming Ma (83 papers)
  3. Li Dong (154 papers)
  4. Shaohan Huang (79 papers)
  5. Dongdong Zhang (79 papers)
  6. Furu Wei (291 papers)
Citations (133)