PromptKD: Unsupervised Prompt Distillation for Vision-Language Models (2403.02781v5)

Published 5 Mar 2024 in cs.CV

Abstract: Prompt learning has emerged as a valuable technique in enhancing vision-language models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly focuses on designing various learning forms of prompts, neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. Specifically, our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we leverage the unique decoupled-modality characteristics of CLIP by pre-computing and storing the text features as class vectors only once through the teacher text encoder. In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits. Further, we align the logits of both the teacher and student models via KL divergence, encouraging the student image encoder to generate similar probability distributions to the teacher through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data, enabling the algorithm to leverage a vast amount of unlabeled images within the domain. Finally, the well-trained student image encoders and pre-stored text features (class vectors) are utilized for inference. To our best knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical pre-storing mechanism of text features as shared class vectors between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.

Essay on "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"

The paper "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models" presents a methodology for enhancing vision-language models (VLMs) such as CLIP through prompt distillation in an unsupervised framework. The approach transfers knowledge from a larger teacher model to a lightweight student model via prompt-driven imitation on unlabeled domain images.

Overview of the Methodology

The proposed method, PromptKD, is a two-stage framework. In the first stage, a large CLIP teacher model is pre-trained with few-shot domain labels, adapting it to the target domain. After pre-training, the text features produced by the teacher's text encoder are computed once and stored as class vectors. These cached features are then shared between the teacher and student image encoders, which keeps the second stage efficient since the text encoder is no longer needed during distillation.
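To make the caching step concrete, here is a minimal sketch under assumed CLIP-style interfaces; teacher_text_encoder, tokenize, classnames, and the prompt template are illustrative placeholders, not the authors' code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_class_vectors(teacher_text_encoder, tokenize, classnames,
                             template="a photo of a {}."):
    """Run the teacher text encoder once and cache the normalized text
    features as shared class vectors (one vector per class)."""
    prompts = [template.format(name) for name in classnames]
    tokens = tokenize(prompts)                 # (num_classes, seq_len)
    text_feats = teacher_text_encoder(tokens)  # (num_classes, dim)
    return F.normalize(text_feats, dim=-1)     # cached, reused by teacher and student
```

Because CLIP's text and image modalities are decoupled, this pass is needed only once; afterwards the text encoder plays no further role in training or inference.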

The second stage performs the prompt distillation itself. Both the teacher and student image encoders use the shared class vectors to compute logits, and the student's logits are aligned with the teacher's via a KL-divergence loss. Through its learnable prompts, the student image encoder is thus encouraged to produce probability distributions that match the teacher's. Because this objective requires no labels, the framework can exploit large volumes of unlabeled domain images.
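A rough sketch of this objective, assuming L2-normalized image features and a standard temperature-scaled KL distillation loss (the temperature handling and function names are assumptions for illustration, not taken from the paper):

```python
import torch.nn.functional as F

def prompt_distillation_loss(student_feats, teacher_feats, class_vectors, T=1.0):
    """KL(teacher || student) over class logits computed against the shared,
    pre-stored class vectors; no ground-truth labels are involved."""
    s_logits = student_feats @ class_vectors.t() / T   # (batch, num_classes)
    t_logits = teacher_feats @ class_vectors.t() / T
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean") * (T ** 2)
```

At inference time, only the trained student image encoder and the cached class vectors are used, so the teacher can be discarded.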

Results and Implications

Extensive experiments on 11 diverse datasets show that PromptKD achieves state-of-the-art performance, outperforming contemporaneous methods on base-to-novel generalization tasks. Specifically, it delivers average improvements of 2.70% on base classes and 4.63% on novel classes over previous best results. The framework leverages the architecture and pre-training benefits of VLMs such as CLIP, focusing on learnable soft prompts that better adapt the model to domain-specific knowledge.

PromptKD has significant implications for the future development of AI, particularly in the vision-language domain. By operating without labeled data, the framework reduces the constraints posed by dataset limitations and improves the adaptability and scalability of VLMs. This is especially valuable in scenarios where obtaining labeled data is difficult or costly.

Speculation on Future Developments

The adoption of distillation frameworks that utilize prompt-based techniques presents several avenues for advancement. Future research might explore intricate combinations of textual and visual modalities in prompting mechanisms, particularly considering the decoupled-modality characteristics that CLIP exploits. Additionally, further exploration into optimizing projector designs or distillation hyperparameters—specifically relating to different datasets and tasks—could lead to performance gains.

Another potential direction is exploring more sophisticated models or alternative architectures beyond the ViT-B/16 and ViT-L/14 backbones considered here. Investigating the impact of such architectures on distilled representations might yield valuable insights into prompt learning capabilities across varying model scales.

Conclusion

PromptKD introduces a novel unsupervised prompt distillation framework that addresses key limitations in VLMs, emphasizing efficiency and performance improvements. The framework's innovative approach of utilizing shared class vectors and prompt-driven knowledge transfer has demonstrated tangible improvements across numerous tasks. As the field progresses, the concepts and methodologies presented in this paper could catalyze further developments in the field of vision-language processing and beyond.

Authors (7)
  1. Zheng Li (326 papers)
  2. Xiang Li (1002 papers)
  3. Xinyi Fu (12 papers)
  4. Weiqiang Wang (171 papers)
  5. Jian Yang (503 papers)
  6. Shuo Chen (127 papers)
  7. Xin Zhang (904 papers)
Citations (14)