- The paper presents ML-Decoder, an attention-based classification head that removes the transformer-decoder's self-attention layer and adds a group-decoding scheme for improved scalability and efficiency.
- The paper demonstrates state-of-the-art performance with 91.4% mAP on MS-COCO multi-label classification and 80.7% top-1 accuracy on ImageNet with a ResNet50 backbone, showcasing its effectiveness across various tasks.
- ML-Decoder enables zero-shot learning by using fixed, word-embedding-based queries for semantic extension, allowing the model to generalize to unseen classes efficiently.
An Academic Overview of "ML-Decoder: Scalable and Versatile Classification Head"
The paper "ML-Decoder: Scalable and Versatile Classification Head" presents a novel approach to enhancing the classification capabilities of neural networks through an attention-based classification head termed ML-Decoder. This work provides substantial advancements in handling multi-label classification tasks, addressing both computational efficiency and generalization across unseen classes.
Core Contributions
The authors introduce ML-Decoder as a classification head designed to overcome the limitations of traditional approaches like global average pooling (GAP) and existing attention-based heads such as transformer-decoders. The ML-Decoder architecture modifies the conventional transformer-decoder by removing the self-attention layer, reducing computational complexity from quadratic to linear in the number of queries. In addition, the group-decoding scheme, a key component of this architecture, improves scalability by employing a fixed number of queries that are expanded to the required number of classes through a group fully-connected layer. This makes ML-Decoder adaptable to datasets containing thousands of classes without incurring significant computational overhead.
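To make the decoding scheme concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of an ML-Decoder-style head: learnable group queries attend to the backbone's spatial tokens via cross-attention only (no self-attention over queries), and a group fully-connected layer expands the fixed number of group queries into per-class logits. All module and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLDecoderSketch(nn.Module):
    """Illustrative ML-Decoder-style head: cross-attention only (the
    self-attention step of a standard transformer-decoder is omitted),
    followed by a group fully-connected layer that expands a fixed
    number of group queries into num_classes logits."""

    def __init__(self, num_classes, num_groups=100, embed_dim=768, num_heads=8):
        super().__init__()
        # Fixed number of learnable group queries, independent of num_classes.
        self.group_queries = nn.Parameter(torch.randn(num_groups, embed_dim) * 0.02)
        # Queries attend to the backbone's spatial tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        # Each group query is responsible for a slice of the classes.
        self.classes_per_group = -(-num_classes // num_groups)  # ceil division
        self.group_fc = nn.Parameter(
            torch.randn(num_groups, embed_dim, self.classes_per_group) * 0.02
        )
        self.num_classes = num_classes

    def forward(self, tokens):
        # tokens: (B, H*W, embed_dim) spatial features from the backbone
        b = tokens.size(0)
        q = self.group_queries.unsqueeze(0).expand(b, -1, -1)   # (B, G, D)
        attn_out, _ = self.cross_attn(q, tokens, tokens)        # no query self-attention
        x = self.norm1(q + attn_out)
        x = self.norm2(x + self.ffn(x))
        # Group FC: (B, G, D) x (G, D, F) -> (B, G, F) -> flatten, trim to num_classes
        logits = torch.einsum('bgd,gdf->bgf', x, self.group_fc)
        return logits.reshape(b, -1)[:, :self.num_classes]
```

Because the number of group queries stays fixed while the group fully-connected layer absorbs the growth in class count, the cost of the attention step does not scale with the number of classes.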
The versatility of ML-Decoder is underscored by its applicability as a drop-in replacement for existing classification heads across various tasks, from single-label and multi-label classification to zero-shot learning (ZSL). Its attention mechanism exploits spatial features more effectively than GAP, offering improved accuracy while maintaining efficiency. For multi-label zero-shot learning, the model leverages fixed NLP-based queries derived from word embeddings, allowing semantic extension to unseen classes.
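As a rough illustration of the drop-in idea, the snippet below (assuming the MLDecoderSketch class from the previous sketch) replaces the GAP-plus-linear head of a torchvision ResNet50 with an attention-based head that consumes the backbone's spatial feature map directly; the class count and shapes are example values, not figures from the paper.

```python
import torch
import torchvision

# Keep the ResNet50 convolutional trunk; drop its GAP and linear head.
backbone = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

# Attention-based head operating on spatial tokens (80 classes, e.g. MS-COCO).
head = MLDecoderSketch(num_classes=80, num_groups=80, embed_dim=2048, num_heads=8)

images = torch.randn(2, 3, 224, 224)
feature_map = backbone(images)                      # (2, 2048, 7, 7)
tokens = feature_map.flatten(2).transpose(1, 2)     # (2, 49, 2048) spatial tokens
logits = head(tokens)                               # (2, 80) multi-label logits
print(logits.shape)
```

For the zero-shot setting, the learnable group queries would instead be frozen word-embedding vectors of the class names, so an unseen class can be queried simply by supplying its embedding rather than retraining the head.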
Experimental Results
ML-Decoder provides state-of-the-art performance on several benchmarks. It achieves 91.4% mAP on MS-COCO multi-label classification, a significant improvement over previous methods. In the challenging zero-shot setting, ML-Decoder reaches 31.1% ZSL mAP on the NUS-WIDE dataset, showcasing its generalization ability. Moreover, on single-label ImageNet classification with a vanilla ResNet50 backbone, it attains 80.7% top-1 accuracy, a new top score for that backbone, demonstrating the efficacy of the proposed architecture even in traditional classification settings.
Implications and Future Prospects
The architectural innovations in ML-Decoder highlight a shift towards more scalable and efficient models for classification tasks. The removal of self-attention layers and the introduction of group-decoding can be viewed as a paradigm for designing more computationally efficient transformers that retain or improve task-specific performance. In practical terms, ML-Decoder's ability to generalize to unseen classes through word-embedding queries and query augmentation broadens its applicability in dynamic environments where new categories are regularly encountered.
Given these promising results, future research could explore the application of ML-Decoder in domains beyond image classification. There is potential for extending this attention mechanism to other computer vision challenges, such as object detection and video recognition, as well as to natural language processing tasks that require scalable and adaptable classification strategies. The introduction of ML-Decoder encourages further investigation into reducing the computational burden of transformer architectures while simultaneously expanding their capability to learn from and adapt to more complex and diverse inputs.
In conclusion, while ML-Decoder is an incremental rather than revolutionary change, its targeted architectural improvements position it as a practical solution to the growing demands of classification over large, diverse datasets. Its integration within the broader array of machine learning and computer vision applications provides a template for how future architectures might balance strong performance with manageable computational resources.