What does a platypus look like? Generating customized prompts for zero-shot image classification (2209.03320v3)

Published 7 Sep 2022 in cs.CV and cs.LG

Abstract: Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") which are completed with each of the category names. This work introduces a simple method to generate higher accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with LLMs to create Customized Prompts via LLMs (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that contain important discriminating characteristics of the image categories. This allows the model to place greater importance on these distinguishing characteristics when making predictions. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot. Code available at https://github.com/sarahpratt/CuPL.

Authors (4)
  1. Sarah Pratt (8 papers)
  2. Ian Covert (18 papers)
  3. Rosanne Liu (25 papers)
  4. Ali Farhadi (138 papers)
Citations (168)

Summary

  • The paper introduces a novel approach for automatically generating customized text prompts to improve zero-shot image classification.
  • It leverages language models to create tailored prompts that capture distinctive visual features for clearer category differentiation.
  • Experiments show improved accuracy across a range of zero-shot classification benchmarks, including a gain of over one percentage point on ImageNet, with no additional training.

Overview of CuPL: Customized Prompts via Language Models

The paper proposes Customized Prompts via LLMs (CuPL), a method that couples an open-vocabulary image classifier (CLIP) with an LLM to generate the classifier's text prompts automatically. Standard zero-shot pipelines rely on hand-written templates such as "a photo of a {}" completed with each category name; CuPL instead asks the LLM questions like the one in the paper's title, "What does a platypus look like?", and uses the generated descriptions as class prompts, with no task-specific prompt engineering and no additional training.

How CuPL Works

CuPL is a two-stage, fully zero-shot pipeline. In the first stage, an LLM (GPT-3 in the paper) is queried with a handful of generic question templates, such as "Describe what a {} looks like", completed with each category name. Sampling several completions per question yields many descriptive sentences per class, each mentioning discriminating visual characteristics: for a platypus, a duck-like bill or webbed feet.
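
The sketch below illustrates this first stage. It is a minimal, hypothetical example rather than the paper's implementation: the paper queried GPT-3 through the completions API, while this version uses the modern openai chat client as a stand-in, and the model name, question templates, and sampling settings are all illustrative.

```python
# Stage 1 (sketch): ask an LLM to describe each category.
# Assumes `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Generic question templates completed with each category name;
# illustrative, not the paper's exact set.
QUESTION_TEMPLATES = [
    "Describe what a {} looks like.",
    "What are the identifying characteristics of a {}?",
]

def generate_class_prompts(class_name: str, n_per_question: int = 5) -> list[str]:
    """Sample several descriptive sentences for one category name."""
    prompts = []
    for template in QUESTION_TEMPLATES:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for the GPT-3 model used in the paper
            messages=[{"role": "user", "content": template.format(class_name)}],
            n=n_per_question,     # several samples per question for diversity
            temperature=0.9,
            max_tokens=60,
        )
        prompts += [choice.message.content.strip() for choice in response.choices]
    return prompts

# e.g. sentences mentioning a duck-like bill, webbed feet, dense brown fur...
platypus_prompts = generate_class_prompts("platypus")
```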

In the second stage, the generated sentences replace hand-written templates in the standard open-vocabulary classification recipe. Each sentence is embedded with the text encoder of the open-vocabulary model (CLIP in the paper), the normalized embeddings for a class are averaged into a single classifier weight, and an image is assigned to the class whose weight is most similar to its image embedding. The paper reports that this improves accuracy on a range of zero-shot benchmarks, including a gain of over one percentage point on ImageNet, while requiring far fewer hand-constructed sentences and no explicit knowledge of the task domain.
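
A sketch of this second stage appears below, assuming OpenAI's open-source clip package (the paper's experiments are built on CLIP); the ViT-B/32 checkpoint, the helper names, and the `prompts_per_class` dict (as produced by the stage-1 helper above) are assumptions for illustration.

```python
# Stage 2 (sketch): build a zero-shot classifier from generated prompts.
# Assumes `pip install torch pillow` and
# `pip install git+https://github.com/openai/CLIP.git`.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # checkpoint is illustrative

def build_classifier(prompts_per_class: dict[str, list[str]]) -> torch.Tensor:
    """Average each class's normalized prompt embeddings into one weight vector."""
    weights = []
    with torch.no_grad():
        for prompts in prompts_per_class.values():
            tokens = clip.tokenize(prompts, truncate=True).to(device)
            emb = model.encode_text(tokens).float()
            emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each prompt
            mean = emb.mean(dim=0)
            weights.append(mean / mean.norm())          # renormalize the average
    return torch.stack(weights)                         # (num_classes, embed_dim)

def classify(image_path: str, weights: torch.Tensor, class_names: list[str]) -> str:
    """Assign the image to the class with the most similar averaged prompt embedding."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        img = model.encode_image(image).float()
        img = img / img.norm(dim=-1, keepdim=True)
    scores = img @ weights.T                            # cosine similarities
    return class_names[scores.argmax().item()]
```

Averaging many generated sentences per class makes each classifier weight less sensitive to any single noisy generation, which is one reason for sampling multiple descriptions per category rather than relying on a single prompt.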

Implications and Future Considerations

Because CuPL requires no labeled examples, no additional training, and no task-specific prompt engineering, it provides a simple but stronger baseline for zero-shot classification with open-vocabulary models. The generated prompts are also human-readable, so they offer some visibility into which visual characteristics the model is relying on for each category.

The quality of the prompts is bounded by the knowledge encoded in the LLM: generated descriptions can be generic or inaccurate for rare or ambiguous categories, and the method inherits any factual errors the LLM produces. Even so, the generality of the recipe, a fixed set of question templates applied to arbitrary category names, makes it straightforward to apply to new classification tasks out of the box.

GitHub: https://github.com/sarahpratt/CuPL