Weight subcloning: direct initialization of transformers using larger pretrained ones (2312.09299v1)

Published 14 Dec 2023 in cs.LG, cs.CL, and cs.CV

Abstract: Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach called weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which gains significant improvements in training speed compared to random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and LLMs designed for next token prediction.

Authors (8)
  1. Mohammad Samragh (15 papers)
  2. Mehrdad Farajtabar (56 papers)
  3. Sachin Mehta (48 papers)
  4. Raviteja Vemulapalli (29 papers)
  5. Fartash Faghri (32 papers)
  6. Devang Naik (26 papers)
  7. Oncel Tuzel (62 papers)
  8. Mohammad Rastegari (57 papers)

Summary

Weight Subcloning: Direct Initialization of Transformers Using Larger Pretrained Ones

This paper introduces a technique termed "weight subcloning" that expedites the training of scaled-down transformer models by initializing their weights from larger pretrained models. The method addresses the situation in which no pretrained model of the desired size is available for a target task. Using weight subcloning, the paper reports significant improvements in training speed, specifically a 4x acceleration in convergence for both vision transformers in image classification and LLMs in next-token prediction.

The core idea of weight subcloning involves two key steps: first, it applies neuron importance ranking to identify and retain the most influential neurons, thereby reducing the embedding dimension per layer. Second, it removes transformer blocks so that the depth matches the target architecture. Because the same set of important neuron indices is kept across all layers, the method maintains the integrity of the network while substantially reducing the training cost of the smaller model.
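
The two steps can be pictured with a short PyTorch sketch. This is only an illustrative sketch: the importance score (L2 norm of each embedding channel's weights), the uniform choice of which blocks to keep, and the helper names rank_neurons, subclone_linear, and keep_block_indices are assumptions introduced here, not the paper's exact criteria.

```python
import torch
import torch.nn as nn

def rank_neurons(weight: torch.Tensor, k: int) -> torch.Tensor:
    # Score each embedding channel by the L2 norm of its column in a
    # pretrained weight matrix (shape: out_features x in_features),
    # then keep the indices of the k highest-scoring channels, sorted.
    scores = weight.norm(dim=0)
    return torch.topk(scores, k).indices.sort().values

def subclone_linear(parent: nn.Linear, in_idx=None, out_idx=None) -> nn.Linear:
    # Build a smaller linear layer whose weights are copied ("subcloned")
    # from the parent at the selected input/output channel indices.
    w = parent.weight.data
    b = parent.bias.data if parent.bias is not None else None
    if out_idx is not None:
        w = w[out_idx]
        b = b[out_idx] if b is not None else None
    if in_idx is not None:
        w = w[:, in_idx]
    child = nn.Linear(w.shape[1], w.shape[0], bias=b is not None)
    child.weight.data.copy_(w)
    if b is not None:
        child.bias.data.copy_(b)
    return child

def keep_block_indices(parent_depth: int, child_depth: int) -> list:
    # Choose which parent blocks survive when the depth is reduced;
    # uniform spacing is assumed here for simplicity.
    step = parent_depth / child_depth
    return [int(i * step) for i in range(child_depth)]

# Example: shrink a 1024-dim projection to 768 dims and 24 blocks to 12.
parent_proj = nn.Linear(1024, 1024)
keep = rank_neurons(parent_proj.weight.data, 768)
child_proj = subclone_linear(parent_proj, in_idx=keep, out_idx=keep)
print(child_proj.weight.shape, keep_block_indices(24, 12))
```

In practice the same kept indices would be applied consistently to every layer that reads from or writes to the residual stream, so that the subcloned weights remain mutually compatible.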

The numerical results support the effectiveness of the approach, showing that a destination network initialized via weight subcloning can reach comparable or better accuracy in substantially fewer training epochs than a randomly initialized counterpart. For example, when training a vision transformer on ImageNet, weight subcloning reaches 70% accuracy in merely 10 epochs, whereas random initialization requires 40 epochs to reach the same level.

In theoretical terms, the paper highlights the additive residual property of transformers, noting that individual blocks change the hidden representation only slightly. This property allows for a straightforward transfer of knowledge between different architectures in the same transformer family. The paper situates weight subcloning within broader research on model compression techniques, delineating differences from approaches such as knowledge distillation, weight sharing, and pruning. Notably, weight subcloning avoids some of the convergence challenges associated with these methods by directly transferring weights, without requiring additional training iterations for parameter mapping.
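
A schematic statement of this additive residual property, with notation introduced here for illustration (x_l is the hidden state entering block l and f_l its residual update), is:

```latex
x_{l+1} = x_l + f_l(x_l)
\quad\Longrightarrow\quad
x_L = x_0 + \sum_{l=0}^{L-1} f_l(x_l),
\qquad \lVert f_l(x_l) \rVert \ll \lVert x_l \rVert \ \text{in practice.}
```

Under this view, removing a block deletes only one small additive term from the sum, which is consistent with the paper's observation that individual blocks perturb the hidden representation only slightly.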

The implications of this research pertain broadly to the development and deployment of efficient transformer architectures. In practical terms, weight subcloning opens possibilities for faster deployment of custom transformer models across various applications, especially in environments with constrained computational resources. The technique's adeptness at maintaining model performance while enhancing training efficiency may also spur future studies into more advanced forms of model initialization and architectural modification.

The paper concludes with the recognition that current findings are specific to scenarios where the parent and destination models share a similar structural framework, and it suggests that further exploration of more extensive architectural changes remains an exciting avenue for future work. This could hold the potential for extending weight subcloning to an even broader class of models, thereby further influencing scalable and adaptive AI system design.