
Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning (2407.15613v2)

Published 22 Jul 2024 in cs.CV

Abstract: Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose a local-to-semantic variance loss to capture distinct local details and a multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns an interpretable partial association.
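
The abstract mentions a diversity loss that enforces orthogonality among the multi-view embeddings but does not give its formula. As a rough illustration only, the sketch below uses one common formulation (mean squared off-diagonal cosine similarity); the function name `orthogonality_penalty` and the exact form are assumptions, not the authors' definition.

```python
import numpy as np

def orthogonality_penalty(views: np.ndarray) -> float:
    """Mean squared cosine similarity between distinct view embeddings.

    views: array of shape (K, D), one row per semantic view (K >= 2).
    Hypothetical stand-in for a diversity loss; not the paper's formula.
    """
    k = views.shape[0]
    # L2-normalise each view so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(views, axis=1, keepdims=True)
    unit = views / np.clip(norms, 1e-12, None)
    gram = unit @ unit.T
    # Keep only the off-diagonal entries (similarities between different views).
    off_diag = gram[~np.eye(k, dtype=bool)]
    return float(np.mean(off_diag ** 2))
```

The penalty is zero when all views are mutually orthogonal and grows as views become redundant, which is the behaviour the abstract attributes to the diversity loss.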

Authors (8)
  1. Xiangyan Qu (5 papers)
  2. Jing Yu (99 papers)
  3. Keke Gai (21 papers)
  4. Jiamin Zhuang (7 papers)
  5. Yuanmin Tang (7 papers)
  6. Gang Xiong (37 papers)
  7. Gaopeng Gou (15 papers)
  8. Qi Wu (323 papers)