Weight subcloning: direct initialization of transformers using larger pretrained ones (2312.09299v1)
Abstract: Training large transformer models from scratch for a target task requires large amounts of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with the weights of a pretrained model of the same size and specification, which accelerates convergence and training. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning applies an operation to the pretrained model to obtain an equivalently initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model; then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which trains significantly faster than one initialized randomly. For instance, we achieve 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
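Below is a minimal, illustrative PyTorch sketch of the two steps summarized in the abstract: ranking embedding dimensions by importance to shrink the width, and selecting a subset of blocks to shrink the depth. The `ToyBlock`/`ToyTransformer` modules, the L1 weight-magnitude importance score, and the uniform block selection are assumptions made for this example only; the paper defines its own neuron-importance ranking and operates on full attention blocks of real pretrained models.

```python
# Minimal weight-subcloning sketch (illustrative, not the authors' code).
# Assumptions: a toy pre-norm block with only an MLP (attention omitted),
# importance scored by L1 weight magnitude, uniform block removal.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(self.norm(x))))

class ToyTransformer(nn.Module):
    def __init__(self, vocab: int, d_model: int, d_ff: int, n_layers: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(ToyBlock(d_model, d_ff) for _ in range(n_layers))

    def forward(self, ids):
        x = self.embed(ids)
        for blk in self.blocks:
            x = blk(x)
        return x

def embedding_importance(model: ToyTransformer) -> torch.Tensor:
    # Score each embedding dimension by accumulated L1 weight magnitude across blocks.
    # (Placeholder criterion; the paper defines its own neuron-importance ranking.)
    score = model.embed.weight.abs().sum(dim=0)
    for blk in model.blocks:
        score = score + blk.fc1.weight.abs().sum(dim=0)  # columns reading from d_model
        score = score + blk.fc2.weight.abs().sum(dim=1)  # rows writing to d_model
    return score

@torch.no_grad()
def subclone(parent: ToyTransformer, d_small: int, n_layers_small: int) -> ToyTransformer:
    # Step 1: keep the top-d_small embedding dimensions (same indices in every layer,
    # so the residual stream stays consistent).
    keep = embedding_importance(parent).topk(d_small).indices.sort().values
    # Step 2: pick a uniformly spaced subset of parent blocks to match the target depth.
    layer_ids = torch.linspace(0, len(parent.blocks) - 1, n_layers_small).round().long()

    d_ff = parent.blocks[0].fc1.out_features
    child = ToyTransformer(parent.embed.num_embeddings, d_small, d_ff, n_layers_small)
    child.embed.weight.copy_(parent.embed.weight[:, keep])
    for c_blk, p_id in zip(child.blocks, layer_ids.tolist()):
        p_blk = parent.blocks[p_id]
        c_blk.norm.weight.copy_(p_blk.norm.weight[keep])
        c_blk.norm.bias.copy_(p_blk.norm.bias[keep])
        c_blk.fc1.weight.copy_(p_blk.fc1.weight[:, keep])
        c_blk.fc1.bias.copy_(p_blk.fc1.bias)
        c_blk.fc2.weight.copy_(p_blk.fc2.weight[keep, :])
        c_blk.fc2.bias.copy_(p_blk.fc2.bias[keep])
    return child

# Example: subclone a 12-layer, 768-dim parent into a 6-layer, 384-dim child.
parent = ToyTransformer(vocab=1000, d_model=768, d_ff=3072, n_layers=12)
child = subclone(parent, d_small=384, n_layers_small=6)
```

Note that the same set of kept indices is applied to every layer's input and output projections so that the sliced residual connections remain dimensionally consistent; the child is then trained normally from this initialization.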
- Mohammad Samragh (15 papers)
- Mehrdad Farajtabar (56 papers)
- Sachin Mehta (48 papers)
- Raviteja Vemulapalli (29 papers)
- Fartash Faghri (32 papers)
- Devang Naik (26 papers)
- Oncel Tuzel (62 papers)
- Mohammad Rastegari (57 papers)