DePT: Decoupled Prompt Tuning (2309.07439v2)

Published 14 Sep 2023 in cs.CV

Abstract: This work breaks through the Base-New Tradeoff (BNT) dilemma in prompt tuning, i.e., the better the tuned model generalizes to the base (or target) task, the worse it generalizes to new tasks, and vice versa. Specifically, through an in-depth analysis of the learned features of the base and new tasks, we observe that the BNT stems from a channel bias issue, i.e., the vast majority of feature channels are occupied by base-specific knowledge, resulting in the collapse of task-shared knowledge important to new tasks. To address this, we propose the Decoupled Prompt Tuning (DePT) framework, which decouples base-specific knowledge from feature channels into an isolated feature space during prompt tuning, so as to maximally preserve task-shared knowledge in the original feature space for achieving better zero-shot generalization on new tasks. Importantly, our DePT is orthogonal to existing prompt tuning methods, hence it can improve all of them. Extensive experiments on 11 datasets show the strong flexibility and effectiveness of DePT. Our code and pretrained models are available at https://github.com/Koorye/DePT.


Summary

  • The paper identifies channel bias as the root cause of the Base-New Tradeoff (BNT): base-specific knowledge comes to dominate most feature channels, crowding out the task-shared knowledge needed to generalize to new tasks.
  • It introduces the Decoupled Prompt Tuning (DePT) framework featuring a dual-head design to segregate base-specific and shared knowledge.
  • Empirical results across 11 datasets demonstrate consistent gains, with improvements of 1.31%-3.17% on base tasks and 0.71%-2.23% on new tasks.

Decoupled Prompt Tuning: A Comprehensive Analysis

The paper "DePT: Decoupled Prompt Tuning" presents an innovative framework designed to overcome the Base-New Tradeoff (BNT) dilemma in prompt tuning within the context of vision-language pre-trained models (VLPMs). The BNT dilemma highlights a significant challenge in prompt tuning, where improved generalization to base tasks often results in reduced performance on new tasks, and vice versa. Through an extensive analysis, this paper identifies the underlying cause of the BNT as a channel bias issue. The authors propose a novel Decoupled Prompt Tuning (DePT) framework that aims to resolve this problem by decoupling base-specific knowledge and comprehensive task-shared knowledge in a manner that preserves the zero-shot generalization capabilities of VLPMs.

Key Contributions

  1. Channel Bias Identification: The paper shows that the BNT dilemma can be traced to a channel bias issue: base-specific knowledge comes to dominate most feature channels during prompt tuning, causing the collapse of the task-shared knowledge needed for new tasks. This offers a new lens on why tuned models degrade on unseen tasks.
  2. Decoupled Prompt Tuning (DePT) Framework: DePT introduces a Channel Adjusted Transfer (CAT) head during prompt tuning, which isolates base-specific knowledge in a separate feature space so that the original feature space retains the task-shared knowledge critical for generalization to new tasks. By using dual heads, one for base-specific and one for task-shared knowledge, DePT preserves zero-shot generalization (see the sketch after this list).
  3. Orthogonality to Existing Methods: The framework is orthogonal to existing prompt tuning approaches and enhances them with minimal computational overhead, so it can be integrated into current methods without extensive changes.
  4. Empirical Validation Across Datasets: Extensive experiments on 11 diverse datasets show that DePT consistently improves existing prompt tuning methods, achieving gains on base tasks without sacrificing performance on new tasks, and vice versa.
  5. Numerical Results: DePT delivers consistent absolute gains of 1.31% to 3.17% on base tasks and 0.71% to 2.23% on new tasks, averaged across the datasets, confirming its robustness and efficacy in addressing the BNT problem.
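
To make the dual-head design concrete, here is a minimal sketch of how such a scheme could be wired around frozen CLIP-style features. It is a hedged illustration, not the authors' implementation: the class name DualHeadTuner, the equal loss weighting, and the linear form of the transfer layer are assumptions, and the paper's CAT head may differ in detail (a full implementation would also train the prompts themselves).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadTuner(nn.Module):
    """Illustrative dual-head setup in the spirit of DePT (not the official code).

    - Zero-shot head: scaled cosine similarity between image features and
      prompt-derived class text features; meant to keep task-shared knowledge
      in the original feature space.
    - CAT-style head: a linear map into an isolated feature space plus a
      base-class classifier; meant to absorb base-specific knowledge.
    """

    def __init__(self, feat_dim, num_base_classes, logit_scale=100.0):
        super().__init__()
        self.transfer = nn.Linear(feat_dim, feat_dim)        # isolated space
        self.base_classifier = nn.Linear(feat_dim, num_base_classes)
        self.logit_scale = logit_scale                       # CLIP-like scale

    def forward(self, image_feats, text_feats):
        # Task-shared logits: cosine similarity in the original space.
        img = F.normalize(image_feats, dim=-1)
        txt = F.normalize(text_feats, dim=-1)
        zs_logits = self.logit_scale * img @ txt.t()
        # Base-specific logits computed in the isolated feature space.
        cat_logits = self.base_classifier(self.transfer(image_feats))
        return zs_logits, cat_logits

# Joint training on a base task (random tensors stand in for frozen
# backbone features and prompt-derived text features).
model = DualHeadTuner(feat_dim=512, num_base_classes=10)
image_feats = torch.randn(8, 512)   # a batch of image features
text_feats = torch.randn(10, 512)   # one text feature per base class
labels = torch.randint(0, 10, (8,))

zs_logits, cat_logits = model(image_feats, text_feats)
loss = F.cross_entropy(zs_logits, labels) + F.cross_entropy(cat_logits, labels)
loss.backward()
```

At inference, new tasks would use only the zero-shot logits, which live in the original feature space, while base tasks can additionally exploit the isolated base-specific head; this split is what lets DePT-style training avoid trading one for the other.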

Implications and Future Directions

The research presented in this paper has significant practical and theoretical implications. By tackling the BNT dilemma, DePT enhances the adaptability of vision-language models to diverse and unseen datasets, expanding their applicability across a variety of real-world scenarios. Moreover, the idea of decoupling feature spaces may stimulate further research in prompt tuning and vision-language pretraining, leading to more sophisticated and efficient tuning methods.

Future work may focus on extending the DePT framework to architectures beyond VLPMs, such as natural language processing models and multi-modal systems that incorporate inputs beyond vision and language. Additionally, combining DePT with other parameter-efficient learning paradigms, such as adapter tuning, could unlock further advances in model adaptation.

In conclusion, the Decoupled Prompt Tuning framework represents a valuable advancement in addressing one of the core challenges in adapting pre-trained models to new tasks, providing a pathway for enhanced flexibility and generalization in a rapidly evolving AI landscape.
