Weight subcloning: direct initialization of transformers using larger pretrained ones (2312.09299v1)

Published 14 Dec 2023 in cs.LG, cs.CL, and cs.CV

Abstract: Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach called weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which gains significant improvements in training speed compared to random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and LLMs designed for next token prediction.

Authors (8)
  1. Mohammad Samragh (15 papers)
  2. Mehrdad Farajtabar (56 papers)
  3. Sachin Mehta (48 papers)
  4. Raviteja Vemulapalli (29 papers)
  5. Fartash Faghri (32 papers)
  6. Devang Naik (26 papers)
  7. Oncel Tuzel (62 papers)
  8. Mohammad Rastegari (57 papers)

Summary

Weight Subcloning: Direct Initialization of Transformers Using Larger Pretrained Ones

This paper introduces a technique termed "weight subcloning" that expedites the training of scaled-down transformer models by initializing their weights from larger pretrained models. The method addresses the situation in which no pretrained model of the desired size is available for a target task. Using weight subcloning, the paper reports significant improvements in training speed, specifically a 4x acceleration in convergence for both vision transformers in image classification and LLMs in next-token prediction.

The core idea of weight subcloning involves two key steps: first, it applies neuron importance ranking to identify and retain the most influential neurons, thereby reducing the embedding dimension per layer. Second, it removes transformer blocks so that the depth matches the target architecture. Because the same set of important neuron indices is kept across all layers, the method maintains the integrity of the network while substantially reducing the training cost of the smaller model.
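
The two steps can be pictured with a short PyTorch sketch. This is only an illustrative sketch: the importance score (L2 norm of each embedding channel's weights), the uniform choice of which blocks to keep, and the helper names rank_neurons, subclone_linear, and keep_block_indices are assumptions introduced here, not the paper's exact criteria.

```python
import torch
import torch.nn as nn

def rank_neurons(weight: torch.Tensor, k: int) -> torch.Tensor:
    # Score each embedding channel by the L2 norm of its column in a
    # pretrained weight matrix (shape: out_features x in_features),
    # then keep the indices of the k highest-scoring channels, sorted.
    scores = weight.norm(dim=0)
    return torch.topk(scores, k).indices.sort().values

def subclone_linear(parent: nn.Linear, in_idx=None, out_idx=None) -> nn.Linear:
    # Build a smaller linear layer whose weights are copied ("subcloned")
    # from the parent at the selected input/output channel indices.
    w = parent.weight.data
    b = parent.bias.data if parent.bias is not None else None
    if out_idx is not None:
        w = w[out_idx]
        b = b[out_idx] if b is not None else None
    if in_idx is not None:
        w = w[:, in_idx]
    child = nn.Linear(w.shape[1], w.shape[0], bias=b is not None)
    child.weight.data.copy_(w)
    if b is not None:
        child.bias.data.copy_(b)
    return child

def keep_block_indices(parent_depth: int, child_depth: int) -> list:
    # Choose which parent blocks survive when the depth is reduced;
    # uniform spacing is assumed here for simplicity.
    step = parent_depth / child_depth
    return [int(i * step) for i in range(child_depth)]

# Example: shrink a 1024-dim projection to 768 dims and 24 blocks to 12.
parent_proj = nn.Linear(1024, 1024)
keep = rank_neurons(parent_proj.weight.data, 768)
child_proj = subclone_linear(parent_proj, in_idx=keep, out_idx=keep)
print(child_proj.weight.shape, keep_block_indices(24, 12))
```

In practice the same kept indices would be applied consistently to every layer that reads from or writes to the residual stream, so that the subcloned weights remain mutually compatible.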

The numerical results support the effectiveness of the approach, showing that a destination network initialized via weight subcloning can reach comparable or better accuracy in substantially fewer training epochs than a randomly initialized counterpart. For example, when training a vision transformer on ImageNet, weight subcloning reaches 70% accuracy in merely 10 epochs, whereas random initialization requires 40 epochs to reach the same level.

In theoretical terms, the paper highlights the additive residual property of transformers, noting that individual blocks change the hidden representation only slightly. This property allows for a straightforward transfer of knowledge between different architectures in the same transformer family. The paper situates weight subcloning within broader research on model compression techniques, delineating differences from approaches such as knowledge distillation, weight sharing, and pruning. Notably, weight subcloning avoids some of the convergence challenges associated with these methods by directly transferring weights, without requiring additional training iterations for parameter mapping.
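
A schematic statement of this additive residual property, with notation introduced here for illustration (x_l is the hidden state entering block l and f_l its residual update), is:

```latex
x_{l+1} = x_l + f_l(x_l)
\quad\Longrightarrow\quad
x_L = x_0 + \sum_{l=0}^{L-1} f_l(x_l),
\qquad \lVert f_l(x_l) \rVert \ll \lVert x_l \rVert \ \text{in practice.}
```

Under this view, removing a block deletes only one small additive term from the sum, which is consistent with the paper's observation that individual blocks perturb the hidden representation only slightly.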

The implications of this research pertain broadly to the development and deployment of efficient transformer architectures. In practical terms, weight subcloning opens possibilities for faster deployment of custom transformer models across various applications, especially in environments with constrained computational resources. The technique's adeptness at maintaining model performance while enhancing training efficiency may also spur future studies into more advanced forms of model initialization and architectural modification.

The paper concludes with the recognition that current findings are specific to scenarios where the parent and destination models share a similar structural framework, and it suggests that further exploration of more extensive architectural changes remains an exciting avenue for future work. This could hold the potential for extending weight subcloning to an even broader class of models, thereby further influencing scalable and adaptive AI system design.