- The paper presents a dual co-attention framework that captures common and contrastive semantics to generate richer object representations.
- It refines class activation maps by integrating semantic similarities and disparities, leading to more accurate pseudo-labels.
- The approach achieves state-of-the-art results on benchmarks while reducing the dependency on costly pixel-level annotations.
Overview of "Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation"
The authors of "Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation" address the challenge of learning semantic segmentation using only image-level supervision. Traditionally, fully supervised approaches require extensive pixel-wise annotations, which are costly to obtain. This work contributes to the field by exploiting the relationships between images, referred to as cross-image semantics, to enhance weakly supervised semantic segmentation (WSSS).
Existing WSSS techniques typically rely on class activation maps (CAMs) derived from classification networks. However, these maps tend to highlight only the most discriminative object parts, failing to capture complete object extents. The proposed method distinguishes itself by modeling cross-image semantic interactions, which enriches the learned object representations. It does so through two co-attention mechanisms within the classifier that capture semantic similarities and disparities between image pairs.
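As background, the CAM computation that these methods build on can be sketched as follows. This is a minimal, generic illustration (function names and shapes are my own), assuming features from the last convolutional layer and the weights of an image-level classifier:

```python
import torch
import torch.nn.functional as F

def class_activation_maps(features: torch.Tensor, fc_weight: torch.Tensor) -> torch.Tensor:
    """Compute class activation maps (CAMs).

    features:  (B, C, H, W) feature maps from the backbone's last conv layer
    fc_weight: (K, C) weights of the image-level classifier for K classes
    returns:   (B, K, H, W) per-class activation maps normalized to [0, 1]
    """
    # Project each spatial feature vector onto every class weight vector,
    # yielding one coarse localization map per class.
    cams = torch.einsum('bchw,kc->bkhw', features, fc_weight)
    cams = F.relu(cams)
    # Normalize each map by its per-image, per-class maximum.
    maxv = cams.flatten(2).max(dim=2).values.clamp(min=1e-5)
    return cams / maxv[..., None, None]
```

Thresholding these maps gives the coarse object seeds whose incompleteness motivates the cross-image approach described next.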
Methodology
The paper introduces a co-attention-based classification framework designed to exploit cross-image semantic relations. The approach is twofold:
- Co-Attention for Common Semantics: The first co-attention mechanism identifies semantic features shared by a pair of images, helping the classifier ground class semantics in the co-attentive regions of both. This shared understanding expands the classifier's grasp of complete object patterns, enabling more precise localization.
- Contrastive Co-Attention for Exclusive Semantics: Complementing the first mechanism, contrastive co-attention focuses on the semantics the two images do not share. This sharpens the classifier's ability to differentiate object classes and improves the generalization of the learned features across images.
These mechanisms enable the framework to provide more complete segmentation by utilizing paired image learning. The method subsequently generates object localization maps that serve as pseudo-labels to guide the training of a segmentation model.
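The paired interaction above can be illustrated with a minimal co-attention sketch. This is not the paper's exact formulation: the bilinear projection `proj`, the softmax normalization, and the simple subtraction used for the contrastive branch are illustrative assumptions standing in for the learned components described in the paper.

```python
import torch
import torch.nn.functional as F

def co_attention(feat_a: torch.Tensor, feat_b: torch.Tensor, proj: torch.Tensor):
    """Sketch of co-attention between the feature maps of an image pair.

    feat_a, feat_b: (C, H, W) backbone features of the two images
    proj:           (C, C) learnable bilinear projection (assumed form)
    returns:        (common, contrastive) features for image A, each (C, H, W)
    """
    C, H, W = feat_a.shape
    a = feat_a.reshape(C, H * W)  # (C, N) flattened spatial locations
    b = feat_b.reshape(C, H * W)
    # Affinity between every pair of locations across the two images.
    affinity = a.T @ proj @ b     # (N, N)
    # For each location in A, attend to B's locations: the attended
    # features represent the semantics A shares with B (common branch).
    attn = F.softmax(affinity, dim=1)
    common = (attn @ b.T).T.reshape(C, H, W)
    # A contrastive branch keeps what remains after removing the shared
    # part, emphasizing semantics exclusive to image A (illustrative).
    contrastive = F.relu(feat_a - common)
    return common, contrastive
```

Feeding both branches through the shared classifier is what supplies the extra supervisory signal: the common branch must still predict the classes the pair shares, while the contrastive branch must predict only the unshared ones.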
Results
The paper reports state-of-the-art performance on several benchmarks, demonstrating the effectiveness of combining intra-image information with inter-image semantic relations. The authors also highlight the framework's versatility, showing strong results across different weak supervision settings, including clean image-level labels, simple single-label data, and noisy web-derived data.
Implications and Future Directions
The efficacy of this approach has significant implications for the development of weakly supervised techniques in scenarios where extensive labeled data may not be feasible. By leveraging both common and distinctive semantic features across image pairs, the framework better exploits the supervisory signals inherent in weak supervision settings. This approach could spur further research into more efficient data utilization practices, particularly in domains where labeled data is scarce or expensive to collect.
Future research could extend this work by integrating additional sources of weak supervision, such as temporal data from video sequences, or exploring new architectures that further enhance the learning of cross-image semantics. There is also potential to investigate how such approaches can be adapted and optimized for real-time applications where computational efficiency becomes critical.
In conclusion, "Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation" provides a substantial contribution to the field of semantic segmentation, particularly under weak supervision constraints, and opens avenues for more data-efficient learning paradigms.