Transformer as Linear Expansion of Learngene (2312.05614v2)

Published 9 Dec 2023 in cs.AI and cs.LG

Abstract: We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we delve into the relationship between a layer's position and its corresponding weight values, and find that a linear function appropriately approximates this relationship. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene, which we then train with soft distillation. Subsequently, we produce and initialize Transformers of varying depths by linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch while reducing training cost by around 2x. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). When producing models of varying depths to meet different resource constraints, TLEG achieves comparable results while requiring around 19x fewer stored parameters to initialize these models and around 5x less pre-training cost, compared to the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG offers better flexibility and competitive performance while requiring around 2.9x fewer stored parameters for initialization, compared to the pre-training approach.

Authors (6)
  1. Shiyu Xia (9 papers)
  2. Miaosen Zhang (7 papers)
  3. Xu Yang (222 papers)
  4. Ruiming Chen (3 papers)
  5. Haokun Chen (26 papers)
  6. Xin Geng (90 papers)
Citations (4)

Summary

Introduction

Parameter initialization in deep neural networks, and in Vision Transformers in particular, remains vital for robust model performance: how a network is initialized can significantly affect the final quality of the trained model. Extensive pre-training on large-scale data yields strong results across downstream tasks, but the process is costly and inflexible, especially when a separate set of parameters must be trained and stored for each target model or task. The paper introduces the concept of a learngene to address these challenges.

Learngene and TLEG

Drawing inspiration from genetics, the paradigm revolves around a compact model unit called a learngene, which plays a role analogous to a gene in an organism: it retains the most generalizable parts of a model while minimizing the resources needed to store and reuse them. The key contribution is Transformer as Linear Expansion of learnGene (TLEG), an elastic model-production technique that linearly expands the learngene parameters to build Transformers of diverse depths. The expansion rule is motivated by an empirical observation: in well-trained Transformers, parameter values vary approximately linearly with layer position, as in the sketch below.
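A minimal sketch of that expansion rule (not the authors' code): it assumes each learngene weight comes in two parts, a position-independent tensor theta_alpha and a slope tensor theta_beta, and that layer l of an L-layer model receives theta_alpha + l/(L-1) * theta_beta. The exact coefficient scheme in TLEG may differ; the point is that one small set of stored parameters can generate weight stacks of any depth.

```python
import torch

def linearly_expand(theta_alpha, theta_beta, num_layers):
    """Return one weight tensor per layer from a two-part learngene.

    Illustrative rule (an assumption, not the paper's exact formula):
    layer l (0-indexed) gets theta_alpha + (l / (num_layers - 1)) * theta_beta,
    so weights vary linearly with layer position.
    """
    if num_layers == 1:
        return [theta_alpha.clone()]
    return [theta_alpha + (l / (num_layers - 1)) * theta_beta
            for l in range(num_layers)]

# Expand one projection matrix into 6-layer and 12-layer stacks
# from the same stored pair of learngene tensors.
d_model = 192
theta_alpha = torch.randn(d_model, d_model)  # position-independent part
theta_beta = torch.randn(d_model, d_model)   # position-dependent (slope) part
weights_6 = linearly_expand(theta_alpha, theta_beta, num_layers=6)
weights_12 = linearly_expand(theta_alpha, theta_beta, num_layers=12)
print(len(weights_6), len(weights_12))  # 6 12
```

Because only theta_alpha and theta_beta need to be stored, the same pair can initialize descendants of any depth, which is the source of the storage savings reported below.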

Methodology

TLEG proceeds in two stages. In the first stage, an auxiliary Transformer, itself a linear expansion of the learngene, is trained with soft distillation from a larger ancestry model, so that only the compact learngene parameters are updated. In the second stage, the trained learngene is linearly expanded to initialize descendant models of varied depths, which are then fine-tuned on the target task. The result is a family of models spanning different computational budgets, from lightweight IoT devices to high-resource data centers; a sketch of the training step follows.
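Below is a hedged sketch of the stage-one training step, assuming a standard soft-distillation recipe (temperature-softened KL divergence mixed with cross-entropy on ground-truth labels); aux_model, teacher, and the optimizer built over the learngene tensors are placeholder names rather than the paper's API, and TLEG's exact loss weighting may differ. Stage two is summarized in the trailing comment.

```python
import torch
import torch.nn.functional as F

def distillation_step(aux_model, teacher, images, labels, optimizer,
                      tau=1.0, alpha=0.5):
    """One stage-1 update. The optimizer is assumed to hold only the
    learngene tensors; the auxiliary model is merely their linear expansion,
    so every gradient step updates the compact learngene."""
    optimizer.zero_grad()
    student_logits = aux_model(images)
    with torch.no_grad():                      # ancestry (teacher) model is frozen
        teacher_logits = teacher(images)
    # Soft distillation: KL between temperature-softened distributions,
    # mixed with ordinary cross-entropy on the ground-truth labels.
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(student_logits, labels)
    loss = alpha * kd + (1 - alpha) * ce
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (sketch): choose a depth that fits the target device, expand the
# trained learngene into a descendant model using the linear rule from the
# previous snippet, then fine-tune that descendant on the downstream task.
```

In stage one the optimizer would be constructed over the learngene tensors only (e.g., torch.optim.AdamW applied to just those parameters), which is what keeps the stored parameter set small relative to saving a full pre-trained model for every depth.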

Results and Impact

TLEG's efficiency gains are notable. The approach matches or exceeds the performance of models trained from scratch while roughly halving training cost, and when producing models of varying depths it stores about 19x fewer parameters for initialization (and spends about 5x less on pre-training) than the pre-training-and-fine-tuning approach. When transferring a single, fixed set of parameters to initialize models of different scales, TLEG offers greater flexibility and competitive performance while using roughly 2.9x fewer stored parameters than pre-training.

Conclusion

By leveraging the learngene concept, the TLEG framework offers a flexible, efficient, and cost-effective way to produce models that fit different scales of computational resources. In practice, it could change how Vision Transformers are initialized and deployed in real-world scenarios, making such systems more adaptable and accessible across a range of applications.