What Makes Good Few-shot Examples for Vision-Language Models? (2405.13532v1)

Published 22 May 2024 in cs.CV

Abstract: Despite the notable advancements achieved by leveraging pre-trained vision-language (VL) models through few-shot tuning for downstream tasks, our detailed empirical study highlights a significant dependence of few-shot learning outcomes on the careful selection of training examples - a facet that has been previously overlooked in research. In this study, we delve into devising more effective strategies for the meticulous selection of few-shot training examples, as opposed to relying on random sampling, to enhance the potential of existing few-shot prompt learning methodologies. To achieve this, we assess the effectiveness of various Active Learning (AL) techniques for instance selection, such as Entropy and Margin of Confidence, within the context of few-shot training. Furthermore, we introduce two innovative selection methods - Representativeness (REPRE) and Gaussian Monte Carlo (Montecarlo) - designed to proactively pinpoint informative examples for labeling in relation to pre-trained VL models. Our findings demonstrate that both REPRE and Montecarlo significantly surpass both random selection and AL-based strategies in few-shot training scenarios. The research also underscores that these instance selection methods are model-agnostic, offering a versatile enhancement to a wide array of few-shot training methodologies.
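
The abstract does not include pseudocode, so the following is a minimal, illustrative sketch rather than the authors' implementation. It assumes precomputed, L2-normalized CLIP image embeddings for the unlabeled pool (`image_feats`) and class-prompt text embeddings (`text_feats`), both hypothetical names. The Entropy and Margin of Confidence scores follow their standard active-learning definitions; the Representativeness scoring shown (similarity to a per-pseudo-class centroid) is one plausible reading of REPRE, and the Gaussian Monte Carlo method is omitted because the abstract gives too little detail to sketch it responsibly.

```python
# Illustrative sketch only (not the authors' released code): scoring an
# unlabeled pool for few-shot example selection with precomputed features.
import numpy as np

def zero_shot_probs(image_feats: np.ndarray, text_feats: np.ndarray,
                    temperature: float = 100.0) -> np.ndarray:
    """Softmax over image-text cosine similarities (CLIP-style zero-shot).

    Assumes both feature matrices are L2-normalized: image_feats is (N, D),
    text_feats is (C, D) with one row per class prompt.
    """
    logits = temperature * image_feats @ text_feats.T   # (N, C)
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy; higher means the VL model is more uncertain."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Margin of confidence: top-1 minus top-2 probability; lower = more uncertain."""
    sorted_p = np.sort(probs, axis=1)
    return sorted_p[:, -1] - sorted_p[:, -2]

def representativeness_scores(image_feats: np.ndarray,
                              pseudo_labels: np.ndarray) -> np.ndarray:
    """One plausible REPRE-style score: cosine similarity of each image to its
    pseudo-class centroid, so prototypical examples rank highest."""
    scores = np.empty(len(image_feats))
    for c in np.unique(pseudo_labels):
        idx = np.where(pseudo_labels == c)[0]
        centroid = image_feats[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid) + 1e-12
        scores[idx] = image_feats[idx] @ centroid
    return scores

def select_few_shot(image_feats: np.ndarray, text_feats: np.ndarray,
                    shots_per_class: int = 4) -> np.ndarray:
    """Pick the most representative pool images per pseudo-class for labeling."""
    probs = zero_shot_probs(image_feats, text_feats)
    pseudo = probs.argmax(axis=1)
    repre = representativeness_scores(image_feats, pseudo)
    chosen = []
    for c in np.unique(pseudo):
        idx = np.where(pseudo == c)[0]
        chosen.extend(idx[np.argsort(-repre[idx])[:shots_per_class]].tolist())
    return np.asarray(chosen)
```

In this reading, entropy and margin favor the examples the zero-shot model is least sure about (classic active learning), while the representativeness score instead favors prototypical examples near each class centroid; the two families can therefore select very different few-shot sets, which is consistent with the abstract's finding that the selection strategy materially affects few-shot tuning outcomes.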

Authors (6)
  1. Zhaojun Guo (2 papers)
  2. Jinghui Lu (28 papers)
  3. Xuejing Liu (14 papers)
  4. Rui Zhao (241 papers)
  5. Fei Tan (25 papers)
  6. Zhenxing Qian (54 papers)
