Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer
The paper "Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer" presents a novel approach to multi-label classification that extends the capabilities of traditional frameworks by incorporating open vocabulary and multi-modal knowledge transfer techniques. This method demonstrates significant advancements over existing models, particularly in the context of predicting unseen labels, which is a common challenge in real-world applications.
Framework and Methodology
The proposed framework, termed Multi-Modal Knowledge Transfer (MKT), leverages a vision-and-language pre-training (VLP) model to enhance multi-label classification. The novelty of MKT lies in exploiting multi-modal knowledge learned from image-text pairs, rather than relying on single-modal language embeddings such as GloVe to represent labels. This is achieved by combining a vision encoder with a VLP model, specifically CLIP, so that image embeddings and label (text) embeddings are aligned in a shared space.
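To make the core idea concrete, the following is a minimal sketch of open-vocabulary multi-label scoring using CLIP's public API (the openai/CLIP package). It is illustrative only: the paper trains its own ViT image encoder and aligns it to CLIP, whereas here CLIP's own image encoder stands in for that backbone, and the label list, image path, and score threshold are placeholder assumptions.

```python
# Minimal sketch: scoring one image against an open label vocabulary with CLIP.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Open vocabulary: arbitrary label names, including ones never seen in training.
labels = ["dog", "beach", "surfboard", "sunset"]                     # illustrative labels
prompts = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path

with torch.no_grad():
    img_emb = model.encode_image(image)    # (1, d) image embedding
    txt_emb = model.encode_text(prompts)   # (num_labels, d) label embeddings

# Cosine similarity between the image embedding and every label embedding.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ txt_emb.T).squeeze(0)  # one score per label

# Multi-label prediction: keep every label above a (tunable) threshold.
predicted = [c for c, s in zip(labels, scores.tolist()) if s > 0.25]
print(predicted)
```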
- Vision Transformer: The backbone of the framework is a Vision Transformer (ViT) that extracts the visual features used for classification. A two-stream module captures both global and local features, which helps the model recognize the multiple objects and concepts that can co-occur in a single image.
- Knowledge Distillation: This step keeps the embeddings produced by the vision backbone consistent with those of the VLP model. By distilling knowledge from the CLIP image encoder, MKT keeps its image embeddings aligned with the CLIP-derived label embeddings, which improves recognition of unseen labels.
- Prompt Tuning: To refine the label embeddings further, the framework tunes the learnable context embeddings of the text prompts fed to the CLIP text encoder, yielding label embeddings better suited to the vision task. A schematic sketch combining these three components follows this list.
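The sketch below puts the three components side by side in PyTorch: a two-stream head that scores labels from both the CLS token and the patch tokens, an L1 distillation loss toward the frozen CLIP image embedding, and learnable prompt context vectors prepended to the label tokens. All class names, dimensions, pooling rules, and loss weights are illustrative assumptions, not the paper's exact configuration.

```python
# Schematic sketch of the three components described above (assumed, simplified design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamHead(nn.Module):
    """Scores each label from a global (CLS token) stream and a local (patch) stream."""
    def __init__(self, backbone_dim: int, embed_dim: int):
        super().__init__()
        self.global_proj = nn.Linear(backbone_dim, embed_dim)
        self.local_proj = nn.Linear(backbone_dim, embed_dim)

    def forward(self, cls_token, patch_tokens, label_emb):
        # cls_token: (B, backbone_dim), patch_tokens: (B, N, backbone_dim),
        # label_emb: (C, embed_dim) label embeddings from the text encoder.
        g = F.normalize(self.global_proj(cls_token), dim=-1)    # (B, D) global image embedding
        l = F.normalize(self.local_proj(patch_tokens), dim=-1)  # (B, N, D) per-patch embeddings
        label_emb = F.normalize(label_emb, dim=-1)

        global_scores = g @ label_emb.T                         # (B, C)
        local_scores = (l @ label_emb.T).amax(dim=1)            # (B, C), max over patches
        return (global_scores + local_scores) / 2, g


def distillation_loss(student_img_emb, clip_img_emb):
    """Pulls the backbone's global image embedding toward the frozen CLIP image
    encoder's embedding, so it stays compatible with CLIP's text space."""
    return F.l1_loss(student_img_emb, F.normalize(clip_img_emb, dim=-1))


class PromptedLabelEmbeddings(nn.Module):
    """Learnable context vectors prepended to each label's token embeddings;
    `text_encoder` is any callable mapping the prompted tokens to one embedding per label."""
    def __init__(self, n_ctx: int, token_dim: int):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, token_dim) * 0.02)

    def forward(self, label_token_emb, text_encoder):
        # label_token_emb: (C, L, token_dim) token embeddings of the label names.
        ctx = self.ctx.unsqueeze(0).expand(label_token_emb.size(0), -1, -1)
        return text_encoder(torch.cat([ctx, label_token_emb], dim=1))  # (C, embed_dim)
```

A training step would then combine a multi-label classification loss on seen labels (the paper uses a ranking-style loss; a plain BCE-with-logits loss is a common stand-in) with `distillation_loss` weighted by a small coefficient.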
Experimental Results
The performance of MKT was evaluated on two large-scale datasets, NUS-WIDE and Open Images. The results show that MKT outperforms state-of-the-art models on both zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) tasks. For instance, on NUS-WIDE, MKT achieved a mean Average Precision (mAP) of 37.6% in the ZSL setting, clearly ahead of earlier frameworks such as LESA and BiAM. This indicates an enhanced capacity to generalize from seen to unseen labels, a notable advance for multi-label classification.
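For reference, the mAP reported above is the mean, over label classes, of each label's average precision. The sketch below computes it with scikit-learn on a tiny, made-up score matrix; the numbers are purely illustrative.

```python
# Minimal sketch: mean Average Precision (mAP) over labels for multi-label evaluation.
import numpy as np
from sklearn.metrics import average_precision_score

# y_true: (num_images, num_labels) binary ground truth; y_score: predicted scores.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.4, 0.2]])

# Average precision per label (column), then the mean over labels.
ap_per_label = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(y_true.shape[1])]
print(f"mAP = {np.mean(ap_per_label):.3f}")
```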
Implications and Future Directions
The introduction of MKT presents notable implications for practical applications in fields such as scene understanding, surveillance systems, and autonomous driving, where recognizing a vast array of unseen labels in real-time is crucial. The method bridges the gap between vision and language, leveraging the strengths of both domains to enrich the classification process.
Looking forward, the integration of MKT into broader AI systems could lead to more robust and contextually aware models. Future research may focus on further refining multi-modal knowledge transfer mechanisms or exploring alternative architectures to enhance computational efficiency. Additionally, extending this framework to other vision tasks, such as object detection and segmentation, could provide insights into its versatility and applicability across various domains within artificial intelligence.