Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition (2304.04704v2)
Abstract: This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt transfers strongly and can be plugged directly into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
Authors:
- Shuhuai Ren
- Aston Zhang
- Yi Zhu
- Shuai Zhang
- Shuai Zheng
- Mu Li
- Alex Smola
- Xu Sun