A General and Efficient Training for Transformer via Token Expansion (2404.00672v1)
Abstract: The remarkable performance of Vision Transformers (ViTs) typically comes at an extremely large training cost. Existing methods attempt to accelerate ViT training, but they typically sacrifice method universality and suffer accuracy drops. Moreover, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose Token Expansion (ToE), a novel token growth scheme that achieves consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of the original transformers, preventing the loss of crucial learnable information during training. ToE can not only be seamlessly integrated into the training and fine-tuning of transformers (e.g., DeiT and LV-ViT), but is also effective within efficient training frameworks (e.g., EfficientTrain), without altering the original training hyper-parameters or architecture and without introducing additional training strategies. Extensive experiments demonstrate that ToE accelerates ViT training by about 1.3x in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion .
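The "initialization-expansion-merging" pipeline described in the abstract can be pictured with a minimal sketch. Assuming PyTorch and a (B, N, C) token tensor, the function below illustrates one plausible reading of such a step: seed a small kept set, greedily expand it with the tokens most dissimilar to what is already kept, and merge every remaining token into its nearest kept token so the intermediate feature distribution is approximately preserved. The function name `expand_and_merge_tokens`, the greedy cosine-similarity selection, and the mean-based merging are illustrative assumptions, not the exact algorithm of the released code.

```python
import torch
import torch.nn.functional as F


def expand_and_merge_tokens(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Reduce a ViT token sequence while roughly preserving its feature
    distribution, in the spirit of an "initialization-expansion-merging" step.

    x: token features of shape (B, N, C); keep_ratio: fraction of tokens kept.
    Returns a tensor of shape (B, n_keep, C).
    """
    B, N, C = x.shape
    n_keep = max(1, int(N * keep_ratio))

    # Pairwise cosine similarity between tokens.
    feats = F.normalize(x, dim=-1)                              # (B, N, C)
    sim = feats @ feats.transpose(1, 2)                         # (B, N, N)

    # Initialization: seed the kept set with the first token (e.g. [CLS]).
    keep_idx = [torch.zeros(B, dtype=torch.long, device=x.device)]

    # Expansion: greedily add the token least similar to the kept set,
    # so the selection covers diverse regions of the feature distribution.
    for _ in range(n_keep - 1):
        sel = torch.stack(keep_idx, dim=1)                      # (B, k)
        sim_to_sel = sim.gather(
            2, sel.unsqueeze(1).expand(B, N, sel.shape[1])
        ).amax(dim=2)                                           # (B, N)
        sim_to_sel.scatter_(1, sel, float("inf"))               # never re-pick
        keep_idx.append(sim_to_sel.argmin(dim=1))

    sel = torch.stack(keep_idx, dim=1)                          # (B, n_keep)

    # Merging: assign every token to its most similar kept token and average
    # each group, so unselected tokens are folded in rather than discarded.
    assign = sim.gather(2, sel.unsqueeze(1).expand(B, N, n_keep)).argmax(dim=2)
    merged = torch.zeros(B, n_keep, C, device=x.device, dtype=x.dtype)
    counts = torch.zeros(B, n_keep, device=x.device, dtype=x.dtype)
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(B, N, C), x)
    counts.scatter_add_(1, assign, torch.ones(B, N, device=x.device, dtype=x.dtype))
    return merged / counts.clamp(min=1.0).unsqueeze(-1)


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 192)          # e.g. DeiT-Tiny-sized sequence
    for keep_ratio in (0.25, 0.5, 1.0):
        out = expand_and_merge_tokens(tokens, keep_ratio)
        print(keep_ratio, tuple(out.shape))
```

In a staged schedule, keep_ratio would start small and grow toward 1.0 as training progresses, which is how a token-growth scheme cuts early-stage compute without changing the model architecture or training hyper-parameters.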
- Attention is all you need. NeurIPS, 30, 2017.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, 2019.
- Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
- Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357. PMLR, 2021.
- All tokens matter: Token labeling for training better vision transformers. NeurIPS, 34:18590–18602, 2021.
- End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. NeurIPS, 34:12077–12090, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Chasing sparsity in vision transformers: An end-to-end exploration. NeurIPS, 34:19974–19988, 2021.
- Token merging: Your ViT but faster. In ICLR, 2023.
- Network expansion for practical training acceleration. In CVPR, pages 20269–20279, 2023.
- Efficient training of BERT by progressively stacking. In ICML, pages 2337–2346. PMLR, 2019.
- Global vision transformer pruning with Hessian-aware saliency. In CVPR, pages 18547–18557, 2023.
- Width & depth pruning for vision transformers. In AAAI, volume 36, pages 3143–3151, 2022.
- Block pruning for faster transformers. In EMNLP, pages 10619–10629, 2021.
- Structured pruning learns compact and accurate models. In ACL, pages 1513–1528, 2022.
- DynamicViT: Efficient vision transformers with dynamic token sparsification. NeurIPS, 34:13937–13949, 2021.
- AdaViT: Adaptive vision transformers for efficient image recognition. In CVPR, pages 12309–12318, 2022.
- Adaptive token sampling for efficient vision transformers. In ECCV, pages 396–414. Springer, 2022.
- SPViT: Enabling faster vision transformers via latency-aware soft token pruning. In ECCV, pages 620–640. Springer, 2022.
- A-ViT: Adaptive tokens for efficient vision transformer. In CVPR, pages 10809–10818, 2022.
- Q-DETR: An efficient low-bit quantized detection transformer. In CVPR, pages 3842–3851, 2023.
- Q-ViT: Accurate and fully quantized low-bit vision transformer. NeurIPS, 35:34451–34463, 2022.
- BiViT: Extremely compressed binary vision transformers. In ICCV, pages 5651–5663, 2023.
- BinaryViT: Pushing binary vision transformers towards convolutional models. In CVPR, pages 4664–4673, 2023.
- bert2BERT: Towards reusable pretrained language models. In ACL, pages 2134–2148, 2022.
- Growing efficient deep networks by structured continuous sparsification. In ICLR, 2021.
- AutoGrow: Automatic layer growing in deep convolutional networks. In KDD, pages 833–841, 2020.
- EfficientTrain: Exploring generalized curriculum learning for training visual backbones. In ICCV, pages 5852–5864, 2023.
- Automated progressive learning for efficient training of vision transformers. In CVPR, pages 12486–12496, 2022.
- Budgeted training for vision transformer. In ICLR, 2022.
- Deduplicating training data makes language models better. In ACL, pages 8424–8445, 2022.
- EfficientNetV2: Smaller models and faster training. In ICML, pages 10096–10106. PMLR, 2021.
- Accelerating vision transformer training via a patch sampling schedule. arXiv preprint arXiv:2208.09520, 2022.
- On efficient training of large-scale deep learning models: A literature review. arXiv preprint arXiv:2304.03589, 2023.
- Efficient on-device training via gradient filtering. In CVPR, pages 3811–3820, 2023.
- Accelerating CNN training by pruning activation gradients. In ECCV, pages 322–338. Springer, 2020.
- FracTrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. NeurIPS, 33:12127–12139, 2020.
- E2-Train: Training state-of-the-art CNNs with over 80% energy savings. NeurIPS, 32, 2019.
- Budgeted training: Rethinking deep neural network training under resource constraints. In ICLR, 2020.
- AutoAssist: A framework to accelerate training of deep neural networks. NeurIPS, 32, 2019.
- What’s the backward-forward flop ratio for neural networks? https://epochai.org/blog/backward-forward-FLOP-ratio, 2021. Accessed: 2023-9-28.
- Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
- ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
- Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
- PyTorch: An imperative style, high-performance deep learning library. NeurIPS, 32, 2019.
- Do vision transformers see like convolutional neural networks? NeurIPS, 34:12116–12128, 2021.
- You only look at one sequence: Rethinking transformer in vision through object detection. NeurIPS, 34:26183–26197, 2021.
- Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
- Wenxuan Huang
- Yunhang Shen
- Jiao Xie
- Baochang Zhang
- Gaoqi He
- Ke Li
- Xing Sun
- Shaohui Lin