- The paper presents MCTformer, a transformer-based model that leverages multiple class tokens to create class-specific localization maps using weak supervision.
- It refines localization with patch-level affinities extracted from transformer attention and enhances results by integrating class activation mapping.
- Experiments on PASCAL VOC and MS COCO yield state-of-the-art mIoU scores, demonstrating its effectiveness for weakly supervised semantic segmentation.
Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
The paper "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation" presents an innovative transformer-based framework for Weakly Supervised Semantic Segmentation (WSSS), designated as MCTformer. The framework aims to generate high-quality class-specific object localization maps using only weak supervision, such as image-level labels, thereby reducing the dependency on pixel-level annotated data, which is often labor-intensive and costly to acquire.
Key Contributions
- Multi-class Token Transformer (MCTformer): The central contribution is MCTformer, which uses multiple class tokens and learns the interactions between these class tokens and the patch tokens. The class-to-patch attention of each class token yields a class-discriminative object localization map, unlike standard vision transformers, which use a single class token and therefore produce only class-agnostic attention maps (a minimal sketch of this readout appears after this list).
- Patch-level Affinity for Refinement: In addition, MCTformer extracts a patch-level pairwise affinity from the patch-to-patch transformer attention. This affinity is used to refine the generated localization maps, improving their accuracy and completeness (also shown in the first sketch below).
- Integration with Class Activation Mapping (CAM): The framework complements the CAM method. By applying CAM to the patch tokens within MCTformer, the authors demonstrate significant performance improvements over existing WSSS methods (see the second sketch after this list).
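The readout behind the first two points can be summarized in a few tensor operations: slice the class-to-patch attention to obtain one map per class, then use the patch-to-patch attention as an affinity to propagate and smooth those maps. The following is a minimal PyTorch sketch under assumed shapes (20 classes, a 14 x 14 patch grid, attention fused over the last three layers); the variable names, layer-fusion choice, and normalization are illustrative assumptions, not the paper's released implementation.

```python
import torch

C, N = 20, 196            # assumed: number of classes and patch tokens (14 x 14 grid)
H = W = 14                # spatial size of the patch grid
L, HEADS = 12, 6          # assumed transformer depth and number of attention heads

# Suppose the full attention matrices from every layer have been collected,
# each of shape (HEADS, C + N, C + N): rows/columns cover [class tokens | patch tokens].
attn_per_layer = [torch.rand(HEADS, C + N, C + N).softmax(dim=-1) for _ in range(L)]

# Fuse the last few layers and average over heads: (C + N, C + N).
fused = torch.stack(attn_per_layer[-3:]).mean(dim=(0, 1))

# Class-to-patch attention: each of the C class tokens gives one localization map.
cls_to_patch = fused[:C, C:]                      # (C, N)
maps = cls_to_patch.reshape(C, H, W)
maps = (maps - maps.amin(dim=(1, 2), keepdim=True)) / (
    maps.amax(dim=(1, 2), keepdim=True) - maps.amin(dim=(1, 2), keepdim=True) + 1e-6
)

# Patch-to-patch attention acts as a pairwise affinity that propagates and
# smooths each class map over the patch grid.
affinity = fused[C:, C:]                          # (N, N)
affinity = affinity / affinity.sum(dim=-1, keepdim=True)
refined = (maps.reshape(C, N) @ affinity).reshape(C, H, W)

print(maps.shape, refined.shape)                  # torch.Size([20, 14, 14]) twice
```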
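The third point, applying CAM to the patch tokens, amounts to reshaping the output patch embeddings into a 2D feature map, classifying it with a 1x1 convolution, and pooling for the image-level classification loss. Again a hedged sketch: the embedding size, the pooling, and the element-wise fusion with the class-token attention maps are assumptions chosen to illustrate the idea rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

C, H, W, D = 20, 14, 14, 384                     # classes, patch grid, assumed embedding dim

patch_tokens = torch.rand(1, H * W, D)           # final-layer patch embeddings
conv_head = nn.Conv2d(D, C, kernel_size=1)       # 1x1 conv classification head

# Reshape tokens to a 2D feature map and produce per-class activation maps.
feat_2d = patch_tokens.transpose(1, 2).reshape(1, D, H, W)
patch_cam = conv_head(feat_2d)                   # (1, C, H, W) class activation maps

# Image-level prediction for the classification loss: global average pooling.
logits = patch_cam.mean(dim=(2, 3))              # (1, C)

# Fuse with the class-to-patch attention maps from the previous sketch
# (here an element-wise product); the patch affinity can then refine the result as before.
cls_attn_maps = torch.rand(C, H, W)              # stand-in for `maps` from the first sketch
fused_maps = torch.relu(patch_cam[0]) * cls_attn_maps
print(logits.shape, fused_maps.shape)
```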
Experimental Validation
The paper presents extensive experiments on the PASCAL VOC and MS COCO benchmarks to validate MCTformer. The proposed method sets new state-of-the-art results among weakly supervised methods, reaching a mean Intersection-over-Union (mIoU) of 71.6% on PASCAL VOC and 42.0% on MS COCO. These results substantiate the efficacy of the proposed method and highlight its potential as a powerful tool for semantic segmentation.
Implications and Future Directions
The development of MCTformer offers valuable insights into the use of transformers for dense prediction tasks like semantic segmentation. Its ability to generate class-specific localization maps using minimal supervision could lead to more efficient semi-supervised and unsupervised segmentation techniques. Future work could explore extending this framework to real-time applications, optimizing the model for computational efficiency, or incorporating additional contextual information to further refine localization accuracy. Additionally, expanding this approach to other domains, such as video segmentation or 3D point cloud analysis, could reveal further applications and benefits of the multi-token transformer architecture.
In summary, MCTformer represents a significant advancement in utilizing transformer models for semantic segmentation, providing a scalable approach to tackle the complexities associated with weakly supervised learning scenarios.