
Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning (2404.07713v2)

Published 11 Apr 2024 in cs.CV and cs.LG

Abstract: Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT); lacking the guidance of semantic information, such backbones fail to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT enforces two properties throughout the network: i) it explicitly discovers semantic-related visual representations, and ii) it discards semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning, which improves visual-semantic correspondences via semantic enhancement and explicitly discovers semantic-related visual tokens with semantic-guided token attention. We then fuse the visual tokens with low visual-semantic correspondence to discard semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
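
To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of the two per-encoder operations it describes: scoring visual tokens against an embedded semantic (attribute) vector, keeping the high-correspondence tokens, and fusing the low-correspondence ones into a single token. The function name, tensor shapes, the cosine-similarity scoring rule, and the keep_ratio parameter are all illustrative assumptions rather than the authors' implementation; consult the linked repository for the actual method.

```python
# Hypothetical sketch of ZSLViT's two per-encoder operations, based only on the
# abstract: (i) semantic-guided scoring/selection of semantic-related visual
# tokens, and (ii) fusion of low-correspondence tokens into one token.
# Names, shapes, and the scoring rule are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def semantic_guided_token_step(tokens, semantic, keep_ratio=0.7):
    """tokens: (B, N, D) visual tokens; semantic: (B, D) embedded attribute vector.

    Returns tokens where the lowest-correspondence fraction (1 - keep_ratio)
    has been fused into a single token, shrinking N while retaining information.
    """
    B, N, D = tokens.shape
    # Score each token's visual-semantic correspondence; cosine similarity is a
    # stand-in for the paper's semantic-guided token attention.
    scores = F.cosine_similarity(tokens, semantic.unsqueeze(1), dim=-1)  # (B, N)
    n_keep = max(1, int(N * keep_ratio))
    idx = scores.argsort(dim=1, descending=True)                         # (B, N)
    keep_idx, drop_idx = idx[:, :n_keep], idx[:, n_keep:]
    take = lambda i: tokens.gather(1, i.unsqueeze(-1).expand(-1, -1, D))
    kept, dropped = take(keep_idx), take(drop_idx)
    # Fuse the semantic-unrelated tokens into one token, weighted by their
    # scores, so residual information survives while the token count shrinks.
    w = F.softmax(scores.gather(1, drop_idx), dim=1).unsqueeze(-1)       # (B, N-n_keep, 1)
    fused = (w * dropped).sum(dim=1, keepdim=True)                       # (B, 1, D)
    return torch.cat([kept, fused], dim=1)                               # (B, n_keep+1, D)
```

In a full model, a step like this would be applied inside several transformer encoder layers, so the token set is pruned and refined progressively with depth, matching the "progressive" learning of semantic-related visual representations the abstract claims.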

Authors (4)
  1. Shiming Chen (29 papers)
  2. Wenjin Hou (10 papers)
  3. Salman Khan (244 papers)
  4. Fahad Shahbaz Khan (225 papers)