Efficient Transformer Training via Learned Linear Growth Operators
Introduction
In recent years, the scaling of transformer models has been a significant driver of progress in the field of deep learning. However, this scaling comes at a steep computational cost, particularly because new, larger models are frequently trained from scratch even though they are often just scaled versions of smaller, existing models. This paper introduces a novel approach to leverage previously trained smaller models to initialize and thus accelerate the training of larger models. The proposed method, termed Learned Linear Growth Operator (LiGO), employs a data-driven mechanism to learn a linear mapping that transforms the parameters of a smaller pretrained model into an effective initialization for a larger model.
Methodology
Linear Growth Operator
LiGO operationalizes the idea that a larger model's parameters can be effectively initialized through a linear transformation of a smaller, pretrained model's parameters. Because learning a dense mapping between the full parameter spaces of the small and large models is impractical, the paper factorizes this linear map into sparse width- and depth-expansion operators. These operators are further structured with a Kronecker factorization, which shares parameters across layers and neurons, embeds architectural knowledge in the transformation, and keeps the number of learnable mapping parameters manageable.
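The width-expansion half of this idea can be illustrated with a minimal sketch. The PyTorch code below grows a single weight matrix by multiplying it on both sides with learned expansion matrices; the class name WidthGrowthOperator, the identity-style initialization, and the shapes are illustrative assumptions rather than the paper's exact implementation, and the Kronecker-factorized depth expansion is omitted.

```python
import torch
import torch.nn as nn

class WidthGrowthOperator(nn.Module):
    """Learned linear width expansion for a single weight matrix.

    Sketch under the assumption that a small weight W_small of shape
    (d_out_small, d_in_small) is grown to (d_out_large, d_in_large) as
    W_large = B @ W_small @ A^T, where A and B are learned expansion
    matrices. Names and initialization are illustrative, not the
    paper's exact implementation.
    """

    def __init__(self, d_in_small, d_in_large, d_out_small, d_out_large):
        super().__init__()
        # Expansion matrices, initialized so the small weight is copied
        # into the top-left block of the large weight (identity-like start).
        self.A = nn.Parameter(torch.eye(d_in_large, d_in_small))
        self.B = nn.Parameter(torch.eye(d_out_large, d_out_small))

    def forward(self, w_small):
        # w_small: (d_out_small, d_in_small) -> (d_out_large, d_in_large)
        return self.B @ w_small @ self.A.t()


# Usage: the grown weight initializes the larger layer; the expansion
# matrices themselves are trained briefly on the target training data.
w_small = torch.randn(256, 256)            # frozen pretrained weight
grow = WidthGrowthOperator(256, 512, 256, 512)
w_large_init = grow(w_small)               # (512, 512) initialization
```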
Application to Transformers
The paper details how LiGO transforms each component of a transformer: embedding layers, attention projections, and feedforward networks. Parameter tying strategies ensure that these components integrate seamlessly: the same width expansion is applied to every weight that shares the embedding (hidden) dimension, and the transformation is designed to respect the multi-headed structure of the attention layers. Token embeddings are widened first, and the subsequent transformations throughout the model are kept aligned with this change.
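As a rough illustration of this tying scheme, the sketch below applies one shared hidden-dimension expansion matrix to the embedding table and to every attention and feedforward weight in a block, so that all interfaces stay dimensionally consistent. The dictionary keys, the per-matrix rules, and the reuse of a single expansion for the feedforward inner dimension are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def grow_block_widths(small, A):
    """Widen one transformer block with a shared hidden-dim expansion A.

    `small` is a dict of pretrained weights from the smaller model and
    `A` has shape (d_large, d_small). Every weight that touches the
    hidden dimension is transformed with the same A (parameter tying),
    so the grown block stays dimensionally consistent. Key names and
    per-matrix rules are illustrative, not the paper's implementation.
    """
    large = {}
    # Token embeddings: (vocab, d_small) -> (vocab, d_large)
    large["embed"] = small["embed"] @ A.t()
    # Attention projections: both input and output sides touch the
    # hidden dimension, so A is applied on both sides.
    for name in ("q_proj", "k_proj", "v_proj", "out_proj"):
        large[name] = A @ small[name] @ A.t()
    # Feedforward weights: the inner dimension gets its own expansion
    # in the paper; this sketch reuses A for simplicity.
    large["ffn_in"] = A @ small["ffn_in"] @ A.t()
    large["ffn_out"] = A @ small["ffn_out"] @ A.t()
    return large


# Example: grow hidden width 256 -> 512, starting from an identity-like
# expansion so the small model is embedded in the large one.
A = torch.eye(512, 256)
small = {
    "embed": torch.randn(1000, 256),
    **{k: torch.randn(256, 256) for k in ("q_proj", "k_proj", "v_proj", "out_proj")},
    "ffn_in": torch.randn(256, 256),
    "ffn_out": torch.randn(256, 256),
}
large = grow_block_widths(small, A)
assert large["embed"].shape == (1000, 512)
```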
Experimentation and Results
Extensive experiments were conducted on a range of models including BERT, RoBERTa, GPT-2, and vision transformers such as DeiT and CaiT. Across the board, LiGO demonstrated notable efficiency gains, saving up to 50% of the computational cost of training from scratch while maintaining or surpassing baseline performance on downstream tasks. These savings were observed not only in total training compute but also in real-world metrics such as GPU wall-clock time.
Comparison with Existing Methods
LiGO was compared against several existing methods that improve training efficiency through model growth, including StackBERT, MSLT, and bert2BERT. Notably, LiGO outperformed these methods, underscoring its effectiveness as a model growth and initialization strategy. Furthermore, LiGO is robust across domains (language and vision tasks), model architectures, and optimization settings.
Implications and Future Directions
The results presented in this paper illustrate the potential of leveraging pretrained models for more efficient training of larger models. By adopting a structured and learned approach to parameter initialization, LiGO addresses the computational redundancy inherent in the current practice of training scaled-up models from scratch. This research has practical implications for ongoing efforts to scale transformer models and represents a step forward in the pursuit of more computationally efficient deep learning methodologies.
Looking ahead, several avenues for further research emerge. An immediate question is the applicability of LiGO to the very largest models currently in use, such as those in the GPT-3 family. Additionally, integrating LiGO with other efficient training strategies, such as layer and token dropping or staged training, could yield further efficiencies. Finally, the potential of LiGO to facilitate more dynamic scaling processes, where models are continuously grown and adapted to new tasks or data, offers an exciting future direction for exploration.
In summary, the Learned Linear Growth Operator (LiGO) presents a significant advance in the efficient training of scaled-up transformer models. By enabling the direct transfer of learned parameters from smaller to larger models, LiGO offers a promising route to mitigating the computational costs associated with ongoing model scaling efforts.