Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning (2404.07713v2)
Abstract: Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT) and, lacking the guidance of semantic information, fail to learn matched visual-semantic correspondences for representing semantic-related visual features, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT enforces two properties throughout the network: i) explicitly discovering semantic-related visual representations, and ii) discarding semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning, which improves visual-semantic correspondences via semantic enhancement and explicitly discovers the semantic-related visual tokens with semantic-guided token attention. Then, we fuse the visual tokens with low visual-semantic correspondence to discard semantic-unrelated visual information, further enhancing the visual representation. These two operations are integrated into successive encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
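The abstract describes two token-level operations: scoring visual tokens by their correspondence to a semantic (attribute) embedding to keep the semantic-related ones, and fusing the low-correspondence tokens so semantic-unrelated information is compressed rather than propagated. Below is a minimal PyTorch sketch of one plausible form of these operations; it is not the authors' released implementation, and the function name, cosine-similarity scoring rule, keep ratio, and tensor shapes are all illustrative assumptions.

```python
# Illustrative sketch (not the official ZSLViT code) of semantic-guided
# token selection plus fusion of the discarded tokens.
import torch
import torch.nn.functional as F


def semantic_guided_token_selection(tokens, semantic, keep_ratio=0.7):
    """tokens: (B, N, D) visual patch tokens; semantic: (B, D) attribute embedding.

    Returns the kept (semantic-related) tokens plus one fused token that
    summarizes the discarded (semantic-unrelated) tokens.
    """
    B, N, D = tokens.shape
    # Score each token by cosine similarity to the semantic embedding
    # (an assumed stand-in for the paper's semantic-guided token attention).
    scores = F.cosine_similarity(tokens, semantic.unsqueeze(1), dim=-1)  # (B, N)

    # Keep the top-k most semantic-related tokens.
    k = max(1, int(N * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                              # (B, k)
    keep_mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    keep_mask.scatter_(1, top_idx, True)
    kept = tokens[keep_mask].view(B, k, D)

    # Fuse the discarded tokens into a single token, weighted by their
    # softmaxed scores, so residual information is compressed, not dropped.
    drop_scores = scores.masked_fill(keep_mask, float("-inf"))
    weights = drop_scores.softmax(dim=1).unsqueeze(-1)                   # (B, N, 1)
    fused = (weights * tokens).sum(dim=1, keepdim=True)                  # (B, 1, D)

    return torch.cat([kept, fused], dim=1)                               # (B, k+1, D)


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)   # e.g., ViT-B/16 patch tokens
    s = torch.randn(2, 768)        # attribute vector projected to token width
    print(semantic_guided_token_selection(x, s).shape)  # torch.Size([2, 138, 768])
```

Applying this selection-and-fusion step in successive encoder layers, as the abstract describes, would progressively shrink the token set toward the semantic-related tokens while a single fused token carries the remainder forward.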
- Shiming Chen
- Wenjin Hou
- Salman Khan
- Fahad Shahbaz Khan