Transformer as Linear Expansion of Learngene (2312.05614v2)

Published 9 Dec 2023 in cs.AI and cs.LG

Abstract: We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we delve into the relationship between a layer's position and its corresponding weight values, and find that a linear function appropriately approximates this relationship. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene, which we then train with soft distillation. Subsequently, we produce and initialize Transformers of varying depths by linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch while reducing training cost by around 2x. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). When producing models of varying depths to meet different resource constraints, TLEG achieves comparable results while requiring around 19x fewer stored parameters to initialize these models and around 5x less pre-training cost, compared to the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG offers better flexibility and competitive performance while requiring around 2.9x fewer stored parameters for initialization, compared to the pre-training approach.

Authors (6)
  1. Shiyu Xia (9 papers)
  2. Miaosen Zhang (7 papers)
  3. Xu Yang (222 papers)
  4. Ruiming Chen (3 papers)
  5. Haokun Chen (26 papers)
  6. Xin Geng (90 papers)
Citations (4)

Summary

Introduction

Parameter initialization in deep neural networks, and in Vision Transformers in particular, remains vital for robust model performance: how a network is initialized can significantly affect the final quality of the trained model. Extensive pre-training on large-scale data yields strong results across downstream tasks, but the process is costly and inflexible, especially when a separate set of parameters must be trained and stored for each target model or task. The paper introduces the concept of a learngene to address these challenges.

Learngene and TLEG

Drawing inspiration from genetics, the paradigm revolves around a compact model unit called a learngene, which plays a role analogous to a gene in an organism: it retains the most generalizable parts of a model while minimizing the resources needed to store and reuse them. The key contribution is Transformer as Linear Expansion of learnGene (TLEG), an elastic model-production technique that linearly expands the learngene parameters to build Transformers of diverse depths. The expansion rule is motivated by an empirical observation: in well-trained Transformers, parameter values vary approximately linearly with layer position, as in the sketch below.
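A minimal sketch of that expansion rule (not the authors' code): it assumes each learngene weight comes in two parts, a position-independent tensor theta_alpha and a slope tensor theta_beta, and that layer l of an L-layer model receives theta_alpha + l/(L-1) * theta_beta. The exact coefficient scheme in TLEG may differ; the point is that one small set of stored parameters can generate weight stacks of any depth.

```python
import torch

def linearly_expand(theta_alpha, theta_beta, num_layers):
    """Return one weight tensor per layer from a two-part learngene.

    Illustrative rule (an assumption, not the paper's exact formula):
    layer l (0-indexed) gets theta_alpha + (l / (num_layers - 1)) * theta_beta,
    so weights vary linearly with layer position.
    """
    if num_layers == 1:
        return [theta_alpha.clone()]
    return [theta_alpha + (l / (num_layers - 1)) * theta_beta
            for l in range(num_layers)]

# Expand one projection matrix into 6-layer and 12-layer stacks
# from the same stored pair of learngene tensors.
d_model = 192
theta_alpha = torch.randn(d_model, d_model)  # position-independent part
theta_beta = torch.randn(d_model, d_model)   # position-dependent (slope) part
weights_6 = linearly_expand(theta_alpha, theta_beta, num_layers=6)
weights_12 = linearly_expand(theta_alpha, theta_beta, num_layers=12)
print(len(weights_6), len(weights_12))  # 6 12
```

Because only theta_alpha and theta_beta need to be stored, the same pair can initialize descendants of any depth, which is the source of the storage savings reported below.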

Methodology

TLEG proceeds in two stages. In the first stage, an auxiliary Transformer, itself a linear expansion of the learngene, is trained with soft distillation from a larger ancestry model, so that only the compact learngene parameters are updated. In the second stage, the trained learngene is linearly expanded to initialize descendant models of varied depths, which are then fine-tuned on the target task. The result is a family of models spanning different computational budgets, from lightweight IoT devices to high-resource data centers; a sketch of the training step follows.
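Below is a hedged sketch of the stage-one training step, assuming a standard soft-distillation recipe (temperature-softened KL divergence mixed with cross-entropy on ground-truth labels); aux_model, teacher, and the optimizer built over the learngene tensors are placeholder names rather than the paper's API, and TLEG's exact loss weighting may differ. Stage two is summarized in the trailing comment.

```python
import torch
import torch.nn.functional as F

def distillation_step(aux_model, teacher, images, labels, optimizer,
                      tau=1.0, alpha=0.5):
    """One stage-1 update. The optimizer is assumed to hold only the
    learngene tensors; the auxiliary model is merely their linear expansion,
    so every gradient step updates the compact learngene."""
    optimizer.zero_grad()
    student_logits = aux_model(images)
    with torch.no_grad():                      # ancestry (teacher) model is frozen
        teacher_logits = teacher(images)
    # Soft distillation: KL between temperature-softened distributions,
    # mixed with ordinary cross-entropy on the ground-truth labels.
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(student_logits, labels)
    loss = alpha * kd + (1 - alpha) * ce
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (sketch): choose a depth that fits the target device, expand the
# trained learngene into a descendant model using the linear rule from the
# previous snippet, then fine-tune that descendant on the downstream task.
```

In stage one the optimizer would be constructed over the learngene tensors only (e.g., torch.optim.AdamW applied to just those parameters), which is what keeps the stored parameter set small relative to saving a full pre-trained model for every depth.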

Results and Impact

TLEG's efficiency gains are notable. The approach matches or exceeds the performance of models trained from scratch while roughly halving training cost, and when producing models of varying depths it stores about 19x fewer parameters for initialization (and spends about 5x less on pre-training) than the pre-training-and-fine-tuning approach. When transferring a single, fixed set of parameters to initialize models of different scales, TLEG offers greater flexibility and competitive performance while using roughly 2.9x fewer stored parameters than pre-training.

Conclusion

By leveraging the learngene concept, the TLEG framework offers a flexible, efficient, and cost-effective way to produce models that fit different scales of computational resources. In practice, it could change how Vision Transformers are initialized and deployed in real-world scenarios, making such systems more adaptable and accessible across a range of applications.