
Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment (2308.12960v3)

Published 24 Aug 2023 in cs.CV

Abstract: Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLM-based methods are restricted by the assumption of partial source supervision or ideal vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address this challenge, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR process includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with LLMs to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-learn the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method offers substantial improvements over existing VLM-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A.
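The Cluster-then-Vote portion of the CVPR loop is compact enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released implementation: `image_feats` and `vocab_feats` are assumed to be L2-normalized CLIP embeddings, plain k-means stands in for the paper's iterative clustering, and top-1 nearest-name voting stands in for its exact voting scheme.

```python
# Minimal sketch of the Cluster-Vote step described in the abstract.
# Assumptions (not from the paper): image_feats [N, D] and vocab_feats
# [V, D] are L2-normalized CLIP embeddings; k-means approximates the
# iterative clustering; top-1 voting approximates the voting scheme.
import numpy as np
from sklearn.cluster import KMeans

def cluster_vote(image_feats, vocab_feats, n_clusters, top_k=3):
    """Group unlabeled images, then let each cluster vote for the
    vocabulary entries its members are most similar to."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(image_feats)
    sims = image_feats @ vocab_feats.T       # cosine similarity (unit-norm feats)
    nearest = sims.argmax(axis=1)            # each image's closest class name
    candidates = {}
    for c in range(n_clusters):
        votes = np.bincount(nearest[labels == c], minlength=vocab_feats.shape[0])
        candidates[c] = votes.argsort()[::-1][:top_k]  # top-k class candidates
    return labels, candidates
```

The returned per-cluster candidate lists are exactly what the subsequent Prompt step would disambiguate with LLM-generated discriminative prompts before the Realign step assigns each cluster a final vocabulary entry.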

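The final self-learning stage can likewise be sketched. The snippet below assumes an EMA-updated teacher, a common instantiation of teacher-student learning; the paper's exact update rule, augmentations, and loss (which also includes the structural alignment term) may differ, and all names here are illustrative.

```python
# Hypothetical sketch of teacher-student self-training against the
# vocabulary; the encoders are assumed to return normalized features.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential-moving-average update of the teacher encoder."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def self_train_step(student, teacher, images_weak, images_strong, vocab_feats):
    # Teacher pseudo-aligns weakly augmented views to the vocabulary...
    with torch.no_grad():
        pseudo = (teacher(images_weak) @ vocab_feats.T).argmax(dim=-1)
    # ...and the student learns to match them on strongly augmented views.
    logits = student(images_strong) @ vocab_feats.T
    return F.cross_entropy(logits, pseudo)
```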
Authors (7)
  1. Sheng Zhang
  2. Muzammal Naseer
  3. Guangyi Chen
  4. Zhiqiang Shen
  5. Salman Khan
  6. Kun Zhang
  7. Fahad Khan