Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions (2401.02460v2)

Published 4 Jan 2024 in cs.CV

Abstract: The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side, we develop methods to train VLMs with this "bag-level" image-text supervision. We find that simply using these attributes at test time does not improve performance, but our training strategy, for example, on the iNaturalist dataset, leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways, we generate descriptions that capture visual appearance, habitat, and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We release the benchmark, consisting of 14 datasets at https://github.com/cvl-umass/AdaptCLIPZS, which will contribute to future research in zero-shot recognition.
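The abstract refers to using LLM-generated descriptions as text prompts at test time. As a point of reference, the sketch below shows a generic CLIP-style zero-shot classifier in which each class prompt is augmented with such descriptions and the per-class text embeddings are averaged. It assumes OpenAI's clip package, a local query.jpg, and two hypothetical bird classes with made-up descriptions; it illustrates only the test-time prompt-ensembling baseline, not the paper's bag-level training procedure.

```python
# Hedged sketch: CLIP zero-shot classification with description-augmented
# prompts. Class names and descriptions below are hypothetical examples;
# in practice they would be produced by prompting an LLM.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_descriptions = {
    "painted bunting": [
        "a small songbird with a blue head, red underparts, and a green back",
        "a bird of shrubby habitats in the southern United States",
    ],
    "indigo bunting": [
        "a small songbird that is uniformly deep blue in breeding plumage",
        "a bird of brushy forest edges across eastern North America",
    ],
}

with torch.no_grad():
    # One text embedding per class: encode every description-augmented
    # prompt, normalize, then average over the class's prompts.
    class_embs = []
    for name, descs in class_descriptions.items():
        prompts = [f"a photo of a {name}, {d}" for d in descs]
        tokens = clip.tokenize(prompts).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))
    text_features = torch.stack(class_embs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Classify a query image by cosine similarity to the class embeddings.
    image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_descriptions, probs.squeeze(0).tolist())))
```

As the abstract notes, this kind of test-time prompt augmentation alone does not improve accuracy; the reported gains come from fine-tuning the VLM with the descriptions as bag-level supervision.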

Technical Insights into CVPR Author Rebuttal Guidelines

This paper serves as a detailed instructional document for authors preparing a rebuttal to reviews of their submissions to conferences such as the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). The guidelines provide a standardized framework that helps authors address reviewer feedback while adhering to the constraints imposed by the conference committee. The paper outlines the structural, content, and formatting requirements of a well-regarded rebuttal.

Key Aspects of the Rebuttal Process

The principal aim of the rebuttal is to give authors an opportunity to address factual errors in the reviews or to provide additional information explicitly requested by reviewers. The rebuttal is not an avenue for introducing new findings or substantially altering the content of the original submission unless specifically solicited by the reviewers; this rule is critical to maintaining the integrity and consistency of the review process.

A noteworthy procedural point, stemming from a 2018 PAMI-TC motion, is that reviewers are asked to refrain from requesting significant additional experiments during the rebuttal period. The emphasis therefore remains on clarifying existing points rather than restructuring or expanding the research.

Formatting and Presentation Standards

The document stipulates specific formatting criteria to ensure uniformity and readability. The rebuttal must not exceed one page and must preserve anonymity by omitting any author-identifying information or external links. The standardized two-column layout and defined margins must be strictly followed, ensuring that all submissions receive equal space for their responses.

Figures and equations are also expected to be clear in both digital and printed form. This matters because reviewers may print documents for evaluation, so all graphical elements must be legible without digital magnification.

Implications and Forward-Looking Insights

The enforcement of these rebuttal guidelines underscores CVPR's commitment to a fair and transparent peer-review process. By setting these structural standards, the conference ensures that authors focus on the methodological robustness and scientific merit of their responses rather than on aesthetic embellishment or content alteration.

Looking forward, as the field of AI and computer vision continues to expand, these guidelines may evolve to accommodate new forms of scientific expression, such as interactive or multi-modal submissions. The primary principle of enhancing communication efficacy, however, will likely remain paramount. This paper offers critical insights into maintaining the quality and consistency of scientific discourse within high-stakes academic settings, setting a precedent for future conferences and their review protocols.

Authors (3)
  1. Oindrila Saha (13 papers)
  2. Grant Van Horn (23 papers)
  3. Subhransu Maji (78 papers)
Citations (11)