- The paper presents a novel method that uses Transformer decoders to query labels via cross-attention for efficient multi-label predictions.
- It integrates label embeddings with a vision backbone to adaptively aggregate local discriminative features for each class.
- Experiments on datasets like MS-COCO and PASCAL VOC demonstrate state-of-the-art performance with high mean Average Precision scores.
"Query2Label: A Simple Transformer Way to Multi-Label Classification" introduces a Transformer-based approach to multi-label classification. The work addresses the inherent difficulty of predicting multiple labels for each image, a common requirement in computer vision applications.
Key Contributions and Methodology
- Transformer Decoders for Querying Labels:
- The crux of the method is using Transformer decoders to directly query the existence of each class label. This contrasts with traditional multi-label classification techniques, which often rely on multi-head architectures or ensemble methods.
- Label embeddings serve as queries to the Transformer’s cross-attention mechanism. This allows the model to adaptively extract local discriminative features relevant to each specific label.
- Effective Use of Cross-Attention:
- The built-in cross-attention module in the Transformer decoder plays a critical role. Each label embedding acts as a query to probe and pool information from a feature map created by a vision backbone (like ResNet or Vision Transformer).
- This method ensures that the features pertinent to each class are effectively aggregated, enhancing the model’s classification accuracy.
- Vision Backbone Integration:
- The approach integrates seamlessly with existing vision backbones, providing a flexible and standardized way to leverage pre-trained models for feature extraction. This aspect emphasizes the simplicity and generalizability of the proposed framework.
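The pipeline described above can be sketched in plain Python. This is a minimal, single-head illustration under simplifying assumptions: the function names (`flatten_feature_map`, `cross_attention_pool`, `per_class_logits`) are illustrative, and the real model uses multi-head attention, learned key/value projections, and stacked decoder layers rather than this bare dot-product form.

```python
import math

def flatten_feature_map(fmap):
    """Reshape a backbone feature map [C][H][W] into H*W spatial vectors of length C."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][h][w] for c in range(C)] for h in range(H) for w in range(W)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention_pool(label_queries, spatial_feats):
    """Each label embedding attends over the spatial features and pools a per-class vector."""
    d = len(spatial_feats[0])
    pooled = []
    for q in label_queries:
        # Scaled dot-product scores between this label query and every spatial location.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in spatial_feats]
        weights = softmax(scores)
        # Weighted sum of spatial features: the class-specific pooled representation.
        pooled.append([sum(w * f[j] for w, f in zip(weights, spatial_feats))
                       for j in range(d)])
    return pooled

def per_class_logits(pooled, class_weights, class_biases):
    """Project each class's pooled feature to a single binary logit."""
    return [sum(p * w for p, w in zip(vec, cw)) + b
            for vec, cw, b in zip(pooled, class_weights, class_biases)]
```

In this sketch, a label query that closely matches one spatial feature pools almost exclusively from that location, which is the adaptive, class-specific feature extraction the paper describes.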
Performance and Evaluation
The paper demonstrates the effectiveness of the proposed framework through comprehensive experiments on several benchmark datasets:
- MS-COCO: Achieved a remarkable 91.3% mean Average Precision (mAP), setting a new benchmark for multi-label classification on this dataset.
- PASCAL VOC: Similarly impressive results were obtained, outperforming previous methods.
- NUS-WIDE and Visual Genome: The method showed consistent performance gains across these varied datasets, underscoring its robustness and adaptability.
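The mAP metric reported on these benchmarks averages a per-class Average Precision over all classes. A minimal sketch of that computation (the function names are illustrative; benchmark evaluations may differ in tie-breaking and interpolation details):

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at each rank where a positive occurs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, prec_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / hits if hits else 0.0

def mean_average_precision(all_scores, all_labels):
    """mAP: average the per-class APs (one score/label list per class, over all images)."""
    aps = [average_precision(s, l) for s, l in zip(all_scores, all_labels)]
    return sum(aps) / len(aps)
```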
Advantages Over Prior Approaches
The authors highlight several advantages compared to existing multi-label classification methods:
- Simplicity and Efficiency:
- The proposed framework is straightforward, relying on standard Transformer and vision backbone architectures without needing intricate modifications.
- This simplicity translates to ease of implementation and potentially more efficient training and inference.
- Superior Performance:
- Across all tested datasets, the Query2Label approach consistently outperformed previous state-of-the-art models, demonstrating its efficacy in real-world scenarios.
- The adaptive extraction of local discriminative features is particularly noteworthy, as it suits the multi-object nature of most images in these datasets.
Implications
The architecture and methodology suggested by "Query2Label" establish a strong baseline for future multi-label classification tasks. Leveraging Transformers in this manner opens up new avenues for research and potential enhancements in the field of computer vision.
The authors expect their approach, with its compact structure and straightforward implementation, to serve as a strong reference point and starting point for subsequent work in multi-label classification. The availability of the code on GitHub further supports practical adoption and continued research in this area.
In summary, this paper advances multi-label classification by employing Transformer decoders to query labels directly, combining strong benchmark performance with ease of adoption, and setting a new baseline for future research and applications.