
Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning (2307.03073v3)

Published 6 Jul 2023 in cs.CV and cs.RO

Abstract: We propose a novel framework for few-shot learning that leverages large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP, which uses both image prototypes and text prototypes for few-shot classification. Specifically, Proto-CLIP jointly adapts the image- and text-encoder embeddings from CLIP using few-shot examples, and the embeddings from the two encoders are used to compute the respective class prototypes for classification. During adaptation, we align the image and text prototypes of corresponding classes; this alignment benefits few-shot classification because the two types of prototypes reinforce each other's contributions. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method on benchmark few-shot learning datasets as well as in the real world for robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP
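The prototype construction and joint image/text scoring described in the abstract can be sketched as follows. This is a minimal, illustrative version, not the authors' implementation: random NumPy arrays stand in for pre-computed CLIP embeddings, and the function names, the mixing weight `alpha`, and the cosine-based alignment term are assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere, as is standard for CLIP features."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_prototypes(embeddings, labels, num_classes):
    """Image prototype per class: mean of the normalized few-shot support
    embeddings (the usual prototypical-network construction)."""
    emb = l2_normalize(embeddings)
    protos = np.stack([emb[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)

def classify(queries, image_protos, text_protos, alpha=0.5):
    """Score each query against both prototype sets by cosine similarity and
    combine the two scores (alpha is a hypothetical mixing weight)."""
    q = l2_normalize(queries)
    sim = alpha * (q @ image_protos.T) + (1 - alpha) * (q @ text_protos.T)
    return sim.argmax(axis=-1)

def alignment_loss(image_protos, text_protos):
    """One simple way to encourage matching-class prototypes to agree:
    minimize 1 - cosine similarity of corresponding pairs."""
    return 1.0 - (image_protos * text_protos).sum(axis=-1).mean()

# Toy example: 3 classes, 4-shot, 8-dim features standing in for CLIP outputs.
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 4, 8
labels = np.repeat(np.arange(num_classes), shots)
img_emb = rng.normal(size=(num_classes * shots, dim))   # support image features
txt_emb = rng.normal(size=(num_classes, dim))           # one text feature per class

img_protos = class_prototypes(img_emb, labels, num_classes)
txt_protos = l2_normalize(txt_emb)

preds = classify(rng.normal(size=(5, dim)), img_protos, txt_protos)
print(preds.shape)   # one predicted class index per query
```

In the fine-tuned variant described by the paper, an alignment objective of this general flavor would be minimized while adapting the embeddings, so that the image and text prototypes of each class make consistent contributions at test time.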

Authors (5)
  1. Jishnu Jaykumar P
  2. Kamalesh Palanisamy
  3. Yu-Wei Chao
  4. Xinya Du
  5. Yu Xiang
Citations (9)