Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning (2407.15613v2)
Abstract: Recent work shows that encyclopedia documents serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with the corresponding images to transfer knowledge. However, they overlook that the semantic information in a document and in an image is not equivalent, which results in suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module that generates multi-view semantic embeddings from the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among embeddings, we propose a local-to-semantic variance loss to capture distinct local details and a multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view level and the word-to-patch level. As a result, our method consistently outperforms state-of-the-art methods under two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial associations.
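To make the idea of enforcing orthogonality among multi-view embeddings concrete, here is a minimal sketch of one common form of such a penalty: the squared Frobenius norm of the difference between the Gram matrix of L2-normalized embeddings and the identity. This is an illustrative assumption, not the loss defined in the paper; the function names `l2_normalize` and `orthogonality_penalty` and the toy vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def orthogonality_penalty(embeddings):
    """Squared Frobenius norm of (E E^T - I) over L2-normalized rows.

    Assumed form of a diversity penalty: it is zero when all view
    embeddings are mutually orthogonal and grows as they overlap.
    """
    unit = [l2_normalize(e) for e in embeddings]
    penalty = 0.0
    for i, row_i in enumerate(unit):
        for j, row_j in enumerate(unit):
            dot = sum(a * b for a, b in zip(row_i, row_j))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty

# Hypothetical toy "views": two orthogonal embeddings, then two identical ones.
orthogonal_views = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
redundant_views = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(orthogonality_penalty(orthogonal_views))
print(orthogonality_penalty(redundant_views))
```

In this sketch, redundant embeddings receive a larger penalty than orthogonal ones, so minimizing it pushes the views toward carrying distinct information. The paper's actual loss may differ in normalization and weighting.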
- Xiangyan Qu
- Jing Yu
- Keke Gai
- Jiamin Zhuang
- Yuanmin Tang
- Gang Xiong
- Gaopeng Gou
- Qi Wu