Preparing Lessons for Progressive Training on Language Models (2401.09192v3)
Abstract: The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prep**a**res lessons for ex**p**anding **o**perations by **l**earning high-**l**ayer functi**o**nality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.
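To make the mechanism more concrete, here is a minimal, hypothetical Python/PyTorch sketch of the ideas the abstract describes: sampling a training depth from a distribution that prioritizes low values (a stand-in for LVPS), running a weight-shared layer stack at that depth, and later expanding the depth by inserting layers derived from their trained neighbours as a simple proxy for the interpolation step. The names `sample_depth`, `GrowableEncoder`, and `expand_by_interpolation`, as well as the exact sampling distribution, are illustrative assumptions, not the authors' implementation.

```python
import copy
import random
from typing import Optional

import torch
import torch.nn as nn


def sample_depth(max_depth: int, decay: float = 0.7) -> int:
    """Sample a depth in [1, max_depth], giving lower depths higher probability.

    A geometric-style decay is an assumption here; it only illustrates the
    'low-value-prioritized' idea from the abstract.
    """
    weights = [decay ** d for d in range(max_depth)]
    return random.choices(range(1, max_depth + 1), weights=weights, k=1)[0]


class GrowableEncoder(nn.Module):
    """A Transformer encoder stack that can run at a sampled depth and grow."""

    def __init__(self, num_layers: int, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor, depth: Optional[int] = None) -> torch.Tensor:
        # Run only the first `depth` layers; low layers are trained most often
        # because shallow depths are sampled with higher probability.
        depth = len(self.layers) if depth is None else depth
        for layer in self.layers[:depth]:
            x = layer(x)
        return x

    @torch.no_grad()
    def expand_by_interpolation(self) -> None:
        """Double the depth; each inserted layer starts as a copy of its
        neighbour (a simple proxy for interpolating between adjacent layers)."""
        new_layers = []
        for layer in self.layers:
            new_layers.append(layer)
            new_layers.append(copy.deepcopy(layer))
        self.layers = nn.ModuleList(new_layers)


# Usage: train at randomly sampled, shallow-biased depths, then grow the model.
model = GrowableEncoder(num_layers=4)
x = torch.randn(2, 16, 256)
for step in range(3):
    d = sample_depth(len(model.layers))
    out = model(x, depth=d)
model.expand_by_interpolation()  # 4 -> 8 layers before continuing at full depth
```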