Transformer as Linear Expansion of Learngene (2312.05614v2)
Abstract: We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we study the relationship between a layer's position and its corresponding weight values, and find that a linear function approximates this relationship well. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene and train it with soft distillation. Subsequently, we can produce and initialize Transformers of varying depths by linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch, while reducing training cost by around 2x. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). When producing models of varying depths to match different resource constraints, TLEG achieves comparable results while reducing the parameters stored for initialization by around 19x and pre-training costs by around 5x, compared to the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG offers better flexibility and competitive performance while reducing the parameters stored for initialization by around 2.9x, compared to the pre-training approach.
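The core idea, linearly expanding a shared learngene into layers of a target depth, can be illustrated with a minimal sketch. Assume (this form and all names below are illustrative, not taken from the paper's released code) that each layer's weights are produced as theta_l = theta_base + (l / (L - 1)) * theta_delta, where the pair (theta_base, theta_delta) is the shared learngene and L is the target depth.

```python
# Hedged sketch of linear learngene expansion, assuming the form
# theta_l = theta_base + (l / (L - 1)) * theta_delta for layer index l in [0, L-1].
# `expand_learngene`, `base_layer`, and `delta_layer` are illustrative names.
import copy
import torch
import torch.nn as nn

def expand_learngene(base_layer: nn.Module, delta_layer: nn.Module, depth: int) -> nn.ModuleList:
    """Produce `depth` Transformer layers by linearly expanding a learngene pair."""
    layers = []
    for l in range(depth):
        coeff = l / (depth - 1) if depth > 1 else 0.0  # position-dependent coefficient
        layer = copy.deepcopy(base_layer)
        with torch.no_grad():
            for p, p_base, p_delta in zip(
                layer.parameters(), base_layer.parameters(), delta_layer.parameters()
            ):
                # Initialize this layer's weights as a linear function of its position.
                p.copy_(p_base + coeff * p_delta)
        layers.append(layer)
    return nn.ModuleList(layers)

# Example: initialize a 12-layer encoder from a single shared learngene pair.
encoder_layer = nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
delta_layer = copy.deepcopy(encoder_layer)  # second learngene component (same shapes)
blocks = expand_learngene(encoder_layer, delta_layer, depth=12)
```

Because only the two learngene components need to be stored, models of any depth can be instantiated from the same small parameter set, which is where the reported storage savings come from.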
Authors: Shiyu Xia, Miaosen Zhang, Xu Yang, Ruiming Chen, Haokun Chen, Xin Geng