Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation (2004.04581v1)

Published 9 Apr 2020 in cs.CV

Abstract: Image-level weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years. Most advanced solutions exploit class activation maps (CAMs). However, CAMs can hardly serve as object masks due to the gap between full and weak supervision. In this paper, we propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap. Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation, whose pixel-level labels undergo the same spatial transformations as the input images during data augmentation. However, this constraint is lost on CAMs trained with image-level supervision. Therefore, we propose consistency regularization on CAMs predicted from differently transformed images to provide self-supervision for network learning. Moreover, we propose a pixel correlation module (PCM), which exploits contextual appearance information and refines the prediction for each pixel using its similar neighbors, leading to further improvement in CAM consistency. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate that our method outperforms state-of-the-art methods using the same level of supervision. The code is released online.

Authors (5)
  1. Yude Wang (4 papers)
  2. Jie Zhang (847 papers)
  3. Meina Kan (15 papers)
  4. Shiguang Shan (136 papers)
  5. Xilin Chen (119 papers)
Citations (570)

Summary

An Overview of the Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation

The paper presents a novel approach to weakly supervised semantic segmentation using a self-supervised equivariant attention mechanism (SEAM), designed to address the inherent limitations of Class Activation Maps (CAMs) in this domain. The research aims to improve the consistency and accuracy of CAMs, which typically suffer from under- and over-activation issues due to the constraints of image-level supervision.

Key Contributions

  1. Equivariant Regularization: The paper introduces equivariant regularization to improve CAM quality. This self-supervised consistency loss requires CAMs predicted from differently transformed versions of an image to agree once the same spatial transformation is applied to the maps, thereby narrowing the supervision gap between fully and weakly supervised semantic segmentation (a sketch of this loss follows the list).
  2. Pixel Correlation Module (PCM): This module leverages contextual appearance information to refine each pixel's prediction through an affinity attention map over visually similar neighbors. By integrating low-level feature similarities, PCM sharpens CAMs and aligns them better with object boundaries (see the second sketch below).
  3. Siamese Network Architecture: The method employs a siamese network coupled with an equivariant cross regularization (ECR) loss to integrate PCM and the equivariance self-supervision effectively, keeping CAM predictions consistent across different spatial transformations.
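
To make the first contribution concrete, below is a minimal PyTorch-style sketch of the equivariance consistency idea: the CAM predicted on a rescaled image should match the rescaled CAM of the original image. The helper `cam_fn`, the choice of rescaling as the transformation, and the L1 distance are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn.functional as F

def equivariant_consistency_loss(cam_fn, images, scale=0.5):
    """Self-supervised consistency between CAMs of an image and its rescaled copy.

    `cam_fn` is assumed to map a batch of images (B, 3, H, W) to per-class
    activation maps (B, K, H, W); it stands in for the CAM branch of the network.
    """
    # CAM predicted on the original images, then rescaled to the transformed size
    cam_orig = cam_fn(images)
    cam_orig_scaled = F.interpolate(cam_orig, scale_factor=scale,
                                    mode='bilinear', align_corners=False)

    # CAM predicted directly on the rescaled images
    images_scaled = F.interpolate(images, scale_factor=scale,
                                  mode='bilinear', align_corners=False)
    cam_trans = cam_fn(images_scaled)

    # L1 consistency: the two maps should agree if the network is equivariant
    return (cam_orig_scaled - cam_trans).abs().mean()
```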
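
The second sketch gives a rough picture of a pixel correlation module, assuming that pixel affinities come from L2-normalized feature embeddings and that the refined CAM is an affinity-weighted average of the original scores; the 1x1 projection and embedding width are placeholders, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelCorrelationModule(nn.Module):
    """Refines a CAM so each pixel's score becomes an affinity-weighted average
    of the scores of visually similar pixels (a simplified stand-in for PCM)."""

    def __init__(self, in_channels, embed_channels=64):
        super().__init__()
        # 1x1 projection of backbone features into an embedding space (assumed)
        self.proj = nn.Conv2d(in_channels, embed_channels, kernel_size=1)

    def forward(self, feats, cam):
        b, _, h, w = cam.shape

        # Embed the low-level features and resize them to the CAM resolution
        emb = F.interpolate(self.proj(feats), size=(h, w),
                            mode='bilinear', align_corners=False)
        emb = F.normalize(emb.flatten(2), dim=1)                      # (B, E, H*W)

        # Non-negative pairwise pixel affinity, row-normalized
        affinity = torch.relu(torch.bmm(emb.transpose(1, 2), emb))    # (B, HW, HW)
        affinity = affinity / (affinity.sum(dim=-1, keepdim=True) + 1e-5)

        # Propagate CAM scores between similar pixels
        refined = torch.bmm(cam.flatten(2), affinity.transpose(1, 2)) # (B, K, HW)
        return refined.view(b, -1, h, w)
```

In the siamese setup of the third contribution, the refined CAMs produced by a module like this on one branch can then be regularized against the raw CAMs of the other, transformed branch, which is roughly the role the ECR loss plays in the paper.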

Results

The authors conduct extensive experiments on the PASCAL VOC 2012 dataset, demonstrating that SEAM surpasses state-of-the-art methods trained with the same level of supervision. The proposed method achieves a mean intersection over union (mIoU) of 64.5% on the validation set and 65.7% on the test set without relying on additional saliency detection inputs. These results highlight the efficacy of SEAM in producing well-aligned, consistent CAMs and in improving semantic segmentation under weak supervision.

Implications and Future Directions

The implications of this work are significant for the development of more effective weakly supervised machine learning models. By narrowing the gap between fully and weakly supervised models, SEAM has the potential to reduce the need for expensive pixel-level annotations, making semantic segmentation more accessible and scalable.

For future research, the exploration of additional transformations and the refinement of PCM could further enhance model accuracy and consistency. Additionally, extending this approach to other domains of computer vision could reveal broader applications and adaptability across different tasks.

In conclusion, the SEAM framework offers substantial advancements in weakly supervised semantic segmentation by leveraging self-supervision and contextual refinement, setting a benchmark for future explorations in reducing dependency on detailed annotations while improving performance.