PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts (2308.01313v3)
Abstract: Vision-language models like CLIP are widely used in zero-shot image classification due to their ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better performance is still an open question. This paper draws inspiration from the human visual perception process: when classifying an object, humans first infer contextual attributes (e.g., background and orientation), which help separate the foreground object from the background, and then classify the object based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method, PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. Our code is available at https://github.com/umd-huang-lab/perceptionCLIP
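The two-step procedure can be illustrated with a minimal sketch using OpenAI's `clip` package. The label set, the single "background" attribute, and the prompt templates below are illustrative placeholders, not the authors' exact vocabulary or implementation:

```python
# Sketch of a two-step zero-shot classification in the spirit of PerceptionCLIP:
# (1) infer a contextual attribute from the image, (2) classify conditioned on it.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog"]               # hypothetical label set
backgrounds = ["indoors", "outdoors"]  # hypothetical contextual attribute values

@torch.no_grad()
def text_scores(image_features, prompts):
    """Cosine similarity between one image embedding and a list of text prompts."""
    text = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    return image_features @ text_features.T

@torch.no_grad()
def two_step_classify(pil_image):
    image = preprocess(pil_image).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Step 1: infer the contextual attribute (here, background) from the image.
    bg_logits = text_scores(image_features, [f"a photo, {b}" for b in backgrounds])
    inferred_bg = backgrounds[bg_logits.argmax().item()]

    # Step 2: classify the object, conditioning the prompt on the inferred attribute.
    cls_logits = text_scores(
        image_features, [f"a photo of a {c}, {inferred_bg}" for c in classes]
    )
    return classes[cls_logits.argmax().item()]
```

Because the attribute is appended to every class prompt in step 2, the class comparison is made with the background "explained away," which is the intuition behind the reported reduction in reliance on spurious features.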
Authors: Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang