DTL: Disentangled Transfer Learning for Visual Recognition (2312.07856v2)
Abstract: As pre-trained models rapidly grow larger, the cost of fine-tuning them on downstream tasks grows as well. To fine-tune these models economically, parameter-efficient transfer learning (PETL) has been proposed, which tunes only a tiny subset of trainable parameters while still learning high-quality representations. However, current PETL methods face a dilemma: reducing the number of trainable parameters does not effectively reduce the GPU memory footprint during training. If full fine-tuning runs out of GPU memory, PETL is likely to fail as well. This happens because the trainable parameters in these methods are generally entangled with the backbone, so many intermediate activations must be kept in GPU memory for gradient propagation. To alleviate this problem, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). By progressively extracting task-specific information with a few low-rank linear mappings and adding this information back to the backbone at appropriate places, CSN effectively realizes knowledge transfer across various downstream tasks. Extensive experiments validate the effectiveness of our method: it not only substantially reduces GPU memory usage and trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy, achieving new state-of-the-art results on several standard benchmarks. The code is available at https://github.com/heekhero/DTL.
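The side-network idea in the abstract can be sketched as follows. This is a minimal PyTorch illustration of a disentangled, low-rank side network, not the authors' released implementation (see the linked repository for that); the module names (`LowRankMapping`, `CompactSideNetwork`, `forward_dtl`), the rank, the activation, and the choice to add the accumulated side feature only to the backbone's final output are assumptions made for clarity.

```python
# Minimal sketch (assumptions noted above), not the authors' exact method.
import torch
import torch.nn as nn


class LowRankMapping(nn.Module):
    """d -> r -> d linear mapping with r << d, initialized to output zero."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start without perturbing the backbone feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(x)))


class CompactSideNetwork(nn.Module):
    """One low-rank mapping per backbone block; only these parameters are trained."""

    def __init__(self, num_blocks: int, dim: int, rank: int = 8):
        super().__init__()
        self.mappings = nn.ModuleList(
            [LowRankMapping(dim, rank) for _ in range(num_blocks)]
        )

    def forward(self, block_outputs):
        side = 0.0
        for mapping, feat in zip(self.mappings, block_outputs):
            side = side + mapping(feat)  # progressively accumulate task-specific information
        return side


def forward_dtl(backbone_blocks, csn, x):
    """Run the frozen backbone without gradients; train only the compact side network."""
    block_outputs = []
    with torch.no_grad():  # backbone is frozen: its intermediate states need no gradients
        for block in backbone_blocks:
            x = block(x)
            block_outputs.append(x)
    side = csn(block_outputs)  # trainable path, disentangled from the backbone graph
    return x + side            # add the task-specific information back to the backbone feature
```

Because the frozen backbone runs under `torch.no_grad()`, no backbone activations are retained for backpropagation; only the much smaller CSN graph is, which is where the memory savings described in the abstract come from.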
Authors: Minghao Fu, Ke Zhu, Jianxin Wu