Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification (2401.01181v1)
Abstract: Identifying labels that did not appear during training, known as multi-label zero-shot learning, is a non-trivial task in computer vision. To this end, recent studies have explored the multi-modal knowledge of vision-language pre-training (VLP) models via knowledge distillation, enabling unseen labels to be recognized in an open-vocabulary manner. However, experimental evidence shows that knowledge distillation is suboptimal and yields only limited gains in unseen-label prediction. In this paper, a novel query-based knowledge sharing paradigm is proposed to exploit the multi-modal knowledge of a pretrained VLP model for open-vocabulary multi-label classification. Specifically, a set of learnable, label-agnostic query tokens is trained to extract critical vision knowledge from the input image; these tokens are then shared across all labels, allowing each label to select tokens of interest as visual clues for recognition. In addition, we propose an effective prompt pool for robust label embedding, and reformulate standard ranking learning as classification so that the magnitude of feature vectors can contribute to matching; both changes significantly benefit label recognition. Experimental results show that our framework outperforms state-of-the-art methods on the zero-shot task by 5.9% and 4.5% mAP on NUS-WIDE and Open Images, respectively.
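The core mechanism described in the abstract can be sketched as follows. This is a hypothetical, minimal PyTorch illustration (not the authors' released code): learnable label-agnostic query tokens cross-attend over image patch features from a frozen VLP encoder, the resulting tokens are shared across all labels, each label embedding attends over them to pick out visual clues, and scores are produced as unnormalized dot products so that a classification-style (BCE) loss can exploit feature magnitudes. All module and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryKnowledgeSharing(nn.Module):
    """Hypothetical sketch of query-based knowledge sharing:
    label-agnostic query tokens extract vision knowledge from image
    patches, then every label selects tokens of interest via attention."""

    def __init__(self, dim=512, num_queries=16, num_heads=8):
        super().__init__()
        # Learnable, label-agnostic query tokens shared by all labels.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention: queries attend over image patch tokens.
        self.extract = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Label-side attention: each label embedding pools the shared tokens.
        self.select = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens, label_embeds):
        # patch_tokens: (B, N, dim) from a frozen VLP image encoder
        # label_embeds: (L, dim) from the VLP text encoder (e.g. prompt pool)
        B = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)          # (B, Q, dim)
        shared, _ = self.extract(q, patch_tokens, patch_tokens)  # (B, Q, dim)
        lab = label_embeds.unsqueeze(0).expand(B, -1, -1)        # (B, L, dim)
        clues, _ = self.select(lab, shared, shared)              # (B, L, dim)
        # Unnormalized dot product keeps feature magnitude, unlike the
        # cosine similarity typically used with ranking losses.
        return (clues * lab).sum(-1)                             # (B, L) logits

model = QueryKnowledgeSharing()
patches = torch.randn(2, 49, 512)   # e.g. a 7x7 patch grid, batch of 2
labels = torch.randn(20, 512)       # 20 label embeddings
logits = model(patches, labels)
# Ranking recast as classification: per-label binary cross-entropy.
loss = F.binary_cross_entropy_with_logits(logits, torch.zeros(2, 20))
print(logits.shape)
```

Because the query tokens are label-agnostic, the expensive image-side attention runs once per image regardless of vocabulary size; unseen labels only require new text embeddings.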
- CDUL: CLIP-driven unsupervised learning for multi-label image classification. arXiv preprint arXiv:2307.16634, 2023.
- VLMo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
- Semantic diversity learning for zero-shot multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 640–650, 2021.
- NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 1–9, 2009.
- Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022.
- Transductive multi-label zero-shot learning. arXiv preprint arXiv:1503.07790, 2015.
- Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143, 2021.
- Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
- Generative multi-label zero-shot learning. arXiv preprint arXiv:2101.11606, 2021.
- Open-vocabulary multi-label classification via multi-modal knowledge transfer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 808–816, 2023.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- A shared multi-attention framework for multi-label zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8776–8786, 2020.
- Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7031, 2022.
- The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
- Multi-label zero-shot learning with structured knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1576–1585, 2018.
- Zero-shot image tagging by hierarchical semantic embedding. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 879–882, 2015.
- VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
- Query2Label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834, 2021.
- ML²P-encoder: On exploration of channel-class correlation for multi-label zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23859–23868, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Open-vocabulary one-stage detection with hierarchical visual-language knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14074–14083, 2022.
- Discriminative region-based multi-label zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8731–8740, 2021.
- GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Deep0Tag: Deep multiple instance learning for zero-shot image tagging. IEEE Transactions on Multimedia, 22(1):242–255, 2019.
- Multiple instance visual-semantic embedding. In BMVC, 2017.
- DualCoOp: Fast adaptation to multi-label recognition with limited annotations. Advances in Neural Information Processing Systems, 35:30569–30582, 2022.
- Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021.
- Fast zero-shot image tagging. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5985–5994. IEEE, 2016.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
- Xuelin Zhu
- Jian Liu
- Dongqi Tang
- Jiawei Ge
- Weijia Liu
- Bo Liu
- Jiuxin Cao