Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning (2404.07713v2)
Abstract: Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT) and, lacking the guidance of semantic information, fail to learn matched visual-semantic correspondences for representing semantic-related visual features, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT enforces two properties throughout the network: i) explicitly discovering semantic-related visual representations, and ii) discarding semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning, which improves visual-semantic correspondences via semantic enhancement and explicitly discovers the semantic-related visual tokens with semantic-guided token attention. Then, we fuse the visual tokens with low visual-semantic correspondence to discard semantic-unrelated visual information, further enhancing the visual representation. These two operations are integrated into successive encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
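The abstract describes two token-level operations: scoring visual tokens by their correspondence to a semantic (attribute) embedding to keep the semantic-related ones, and fusing the low-correspondence tokens so semantic-unrelated information is compressed rather than propagated. Below is a minimal PyTorch sketch of one plausible form of these operations; it is not the authors' released implementation, and the function name, cosine-similarity scoring rule, keep ratio, and tensor shapes are all illustrative assumptions.

```python
# Illustrative sketch (not the official ZSLViT code) of semantic-guided
# token selection plus fusion of the discarded tokens.
import torch
import torch.nn.functional as F


def semantic_guided_token_selection(tokens, semantic, keep_ratio=0.7):
    """tokens: (B, N, D) visual patch tokens; semantic: (B, D) attribute embedding.

    Returns the kept (semantic-related) tokens plus one fused token that
    summarizes the discarded (semantic-unrelated) tokens.
    """
    B, N, D = tokens.shape
    # Score each token by cosine similarity to the semantic embedding
    # (an assumed stand-in for the paper's semantic-guided token attention).
    scores = F.cosine_similarity(tokens, semantic.unsqueeze(1), dim=-1)  # (B, N)

    # Keep the top-k most semantic-related tokens.
    k = max(1, int(N * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                              # (B, k)
    keep_mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    keep_mask.scatter_(1, top_idx, True)
    kept = tokens[keep_mask].view(B, k, D)

    # Fuse the discarded tokens into a single token, weighted by their
    # softmaxed scores, so residual information is compressed, not dropped.
    drop_scores = scores.masked_fill(keep_mask, float("-inf"))
    weights = drop_scores.softmax(dim=1).unsqueeze(-1)                   # (B, N, 1)
    fused = (weights * tokens).sum(dim=1, keepdim=True)                  # (B, 1, D)

    return torch.cat([kept, fused], dim=1)                               # (B, k+1, D)


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)   # e.g., ViT-B/16 patch tokens
    s = torch.randn(2, 768)        # attribute vector projected to token width
    print(semantic_guided_token_selection(x, s).shape)  # torch.Size([2, 138, 768])
```

Applying this selection-and-fusion step in successive encoder layers, as the abstract describes, would progressively shrink the token set toward the semantic-related tokens while a single fused token carries the remainder forward.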
- Shiming Chen
- Wenjin Hou
- Salman Khan
- Fahad Shahbaz Khan