Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition (2304.04704v2)

Published 10 Apr 2023 in cs.CV, cs.AI, and cs.CL

Abstract: This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.

Authors (8)
  1. Shuhuai Ren (30 papers)
  2. Aston Zhang (48 papers)
  3. Yi Zhu (233 papers)
  4. Shuai Zhang (319 papers)
  5. Shuai Zheng (67 papers)
  6. Mu Li (95 papers)
  7. Alex Smola (46 papers)
  8. Xu Sun (194 papers)
Citations (27)

Summary

An Expert Overview of POMP for Open-Vocabulary Visual Recognition

The paper presents POMP, a prompt pre-training method for vision-language models aimed at open-vocabulary visual recognition. Its primary objective is to make zero-shot recognition scale: the method keeps computational and memory demands low enough to pre-train a prompt over the full ImageNet-21K vocabulary of more than 20,000 classes.
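To make the core idea concrete, here is a minimal sketch of the CoOp-style soft prompt that POMP pre-trains: a set of learnable context vectors prepended to each class name's token embeddings before a frozen CLIP text encoder. The class name, dimensions, and initialization below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable context vectors shared across all class names.

    The soft prompt replaces a hand-written template such as
    "a photo of a {class}" with M trainable embeddings that are
    prepended to every class-name token sequence before it enters
    the frozen CLIP text encoder.
    """

    def __init__(self, num_context: int = 16, embed_dim: int = 512):
        super().__init__()
        # One shared prompt for all 20k+ classes, initialized with small noise.
        self.context = nn.Parameter(0.02 * torch.randn(num_context, embed_dim))

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, seq_len, embed_dim)
        n = class_token_embeds.size(0)
        ctx = self.context.unsqueeze(0).expand(n, -1, -1)
        # Real implementations keep CLIP's start-of-sequence token first;
        # this sketch simply prepends the context.
        return torch.cat([ctx, class_token_embeds], dim=1)
```

Only these context vectors are trained; both CLIP encoders stay frozen, which is what keeps prompt pre-training cheap relative to full fine-tuning.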

Key Contributions

  1. Prompt Pre-training (POMP): The authors propose a prompt pre-training framework that scales to large vocabularies such as ImageNet-21K. By pre-training a single universal soft prompt, POMP strengthens a vision-language model's ability to generalize to novel visual categories without task-specific fine-tuning. This universality is made tractable by a class sampling mechanism that reduces GPU memory requirements from roughly 300 GB to under 16 GB.
  2. Local Contrast and Local Correction: To bound computational overhead, POMP employs a class sampling strategy termed 'local contrast': each training iteration contrasts images against only a sampled subset of the full class set, so the cost of the contrastive loss scales with the sample size rather than the full vocabulary. A complementary 'local correction' term then compensates for the bias this sampling introduces, preserving the prompt's generalization ability. A simplified sketch of both steps follows this list.
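The sketch below illustrates both mechanisms under stated assumptions: `num_sampled`, `temperature`, and `margin` are illustrative values, and the margin-style correction (in the spirit of the CosFace/ArcFace margins the paper cites) is one plausible form of local correction rather than the authors' exact term.

```python
import torch
import torch.nn.functional as F

def local_contrast_loss(
    image_feats: torch.Tensor,   # (B, D) L2-normalized image embeddings
    class_embeds: torch.Tensor,  # (C, D) L2-normalized class text embeddings
    labels: torch.Tensor,        # (B,) ground-truth indices into all C classes
    num_sampled: int = 1000,     # K << C classes kept per iteration (illustrative)
    temperature: float = 0.01,
    margin: float = 0.1,         # strength of the correction (illustrative)
) -> torch.Tensor:
    num_classes = class_embeds.size(0)

    # Local contrast: sample K classes, always keeping the batch's positives,
    # so the softmax runs over K columns instead of all C (~20k) classes.
    pos = labels.unique()
    weights = torch.ones(num_classes, device=labels.device)
    weights[pos] = 0.0  # exclude positives from the negative pool
    neg = torch.multinomial(weights, num_sampled - pos.numel(), replacement=False)
    sampled = torch.cat([pos, neg])  # (K,)

    # Re-index ground-truth labels into the sampled subset.
    remap = torch.full((num_classes,), -1, dtype=torch.long, device=labels.device)
    remap[sampled] = torch.arange(sampled.numel(), device=labels.device)
    local_labels = remap[labels]

    # Memory now scales with K, not with the full vocabulary C.
    logits = image_feats @ class_embeds[sampled].t() / temperature  # (B, K)

    # Local correction: subtract a margin from the positive logit so the
    # easier K-way task better approximates contrast against all C classes.
    one_hot = F.one_hot(local_labels, num_classes=sampled.numel()).to(logits.dtype)
    logits = logits - one_hot * (margin / temperature)

    return F.cross_entropy(logits, local_labels)
```

Because only K class embeddings are materialized per step, the contrastive loss's activation memory shrinks roughly in proportion to K/C, which is the kind of saving behind the 300 GB to 16 GB reduction described above.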

Empirical Results

Empirical performance evaluations reveal that POMP outpaces existing state-of-the-art (SOTA) models across diverse visual recognition tasks:

  • Image Classification: On the ImageNet-21K test set, POMP attains a leading accuracy of 25.3%. Transferring the pre-trained prompt to ten downstream classification datasets yields the highest average accuracy of 67.0%, a +3.1% gain over CoOp, confirming its generalization across domains.
  • Semantic Segmentation and Object Detection: On open-vocabulary COCO Stuff and Pascal VOC segmentation, POMP reaches harmonic-mean IoU (hIoU) scores of 39.1 and 84.4, respectively; the Pascal VOC result is a +6.9 hIoU improvement over ZSSeg. POMP likewise increases AP on object detection benchmarks, reflecting its ability to recognize diverse, unseen object categories. A sketch of this plug-and-play inference follows the list.
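The following hedged sketch shows what "plugging in" the pre-trained prompt means at inference time: encode the target task's class names through the frozen text encoder with the learned context, then classify by cosine similarity. `text_encoder` is a placeholder for a frozen CLIP text encoder, and `SoftPrompt` refers to the earlier sketch.

```python
import torch

@torch.no_grad()
def zero_shot_classify(image_feats, text_encoder, soft_prompt, class_token_embeds):
    """Classify images against an arbitrary, possibly unseen, set of classes.

    image_feats:        (B, D) L2-normalized CLIP image features
    text_encoder:       frozen CLIP text encoder (placeholder callable)
    soft_prompt:        the pre-trained SoftPrompt module sketched earlier
    class_token_embeds: (C, L, D) token embeddings of the target class names
    """
    prompted = soft_prompt(class_token_embeds)   # prepend the learned context
    class_feats = text_encoder(prompted)         # (C, D) text embeddings
    class_feats = class_feats / class_feats.norm(dim=-1, keepdim=True)
    logits = image_feats @ class_feats.t()       # cosine similarities
    return logits.argmax(dim=-1)                 # predicted class indices
```

No gradient step touches the new task: swapping in a different set of class names is all that is required, which is why the same prompt transfers to classification, segmentation, and detection pipelines.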

Practical and Theoretical Implications

The proposed POMP method has significant implications for expanding the capabilities of visual recognition systems:

  • Scalability and Efficiency: POMP's design mitigates the prohibitive computational requirements traditionally associated with large-scale datasets and class sets, rendering it practical for deployment in diverse real-world applications where zero-shot capabilities are pivotal.
  • Generalization Across Tasks: The adaptability of the pre-trained prompt to various vision tasks without specialized fine-tuning underscores a significant step towards creating more versatile and robust AI systems, facilitating broader adoption across dynamic environments.

Speculations on Future Developments

Future work might explore the following directions:

  • Theoretical Robustness Analysis: A rigorous analysis of the estimation error incurred by approximating the full contrastive loss with class subsampling would strengthen the theoretical foundations of the POMP framework.
  • Semantic Utilization via Hierarchies: The ImageNet-21K vocabulary is organized around WordNet synsets, so techniques that exploit hyponym and hypernym relationships could further refine the representational quality of the soft prompt (see the WordNet sketch after this list).
  • Interpretability of Soft Prompts: The optimized continuous vectors of a soft prompt are difficult to interpret; addressing this could pave the way for more transparent AI systems and foster trust in AI-driven decision-making.
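As a concrete illustration of the hierarchy mentioned in the second point: ImageNet-21K classes correspond to WordNet synsets, so hypernym chains can be read directly with NLTK. A minimal sketch (requires `nltk` and its WordNet data):

```python
from nltk.corpus import wordnet as wn  # first run: nltk.download("wordnet")

# Walk the hypernym chain of an ImageNet-style concept up to the root.
synset = wn.synset("golden_retriever.n.01")
chain = []
while synset is not None:
    chain.append(synset.name())
    hypernyms = synset.hypernyms()
    synset = hypernyms[0] if hypernyms else None

print(" -> ".join(chain))
# e.g. golden_retriever.n.01 -> retriever.n.01 -> sporting_dog.n.01 -> ... -> entity.n.01
```

Such chains could, for example, supply coarse-to-fine supervision or regularize prompts for semantically related classes.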

In conclusion, POMP considerably advances the capabilities of vision-language models, laying a foundation for future research in open-vocabulary recognition and contributing to the pursuit of more scalable, efficient, and generalizable AI systems.