Joint Learning of Saliency Detection and Weakly Supervised Semantic Segmentation
The paper "Joint Learning of Saliency Detection and Weakly Supervised Semantic Segmentation" addresses the intrinsic link between saliency detection (SD) and weakly supervised semantic segmentation (WSSS). Authored by Yu Zeng, Yunzhi Zhuge, Huchuan Lu, and Lihe Zhang from the Dalian University of Technology, China, this research explores a unified framework for both tasks using a single neural network architecture dubbed the Saliency and Segmentation Network (SSNet).
The proposed framework is trained entirely with weak supervision, combining image-level category labels with class-agnostic pixel-level saliency labels. Unlike traditional WSSS approaches that rely on pre-trained saliency models as a separate pre-processing step, SSNet integrates saliency detection directly into the segmentation process, allowing the two tasks to inform and improve each other through shared learning.
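To make this supervision setup concrete, the sketch below shows one plausible way the two signals could be combined into a training objective: an image-level classification term obtained by globally pooling the per-class segmentation maps, plus a pixel-level term comparing the network's predicted saliency map against the class-agnostic saliency annotation. The tensor names, the pooling choice, and the unweighted sum of the two terms are illustrative assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def weakly_supervised_loss(seg_logits, sal_pred, image_labels, sal_gt):
    """Hypothetical combination of the two weak supervision signals.

    seg_logits:   (B, C, H, W) per-class segmentation scores
    sal_pred:     (B, 1, H, W) predicted saliency map in [0, 1]
    image_labels: (B, C) multi-hot image-level category labels
    sal_gt:       (B, 1, H, W) class-agnostic pixel-level saliency labels
    """
    # Image-level supervision: pool each class map to a single score and
    # compare against the multi-hot image labels.
    class_scores = F.adaptive_avg_pool2d(seg_logits, 1).flatten(1)  # (B, C)
    cls_loss = F.binary_cross_entropy_with_logits(class_scores, image_labels)

    # Pixel-level supervision: the saliency target is class-agnostic, so it
    # never requires per-category annotation.
    sal_loss = F.binary_cross_entropy(sal_pred, sal_gt)

    return cls_loss + sal_loss
```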
SSNet comprises two key components: a segmentation network (SN) and a saliency aggregation module (SAM). The SN generates per-category segmentation masks for an input image. SAM then predicts the saliency of each category and aggregates the segmentation masks into a single class-agnostic saliency map, as sketched below. Training the network end-to-end on both tasks, rather than sequentially, exploits the synergy between SD and WSSS: because the saliency map is composed from the segmentation masks, the pixel-level saliency supervision propagates spatial information back to the segmentation branch and sharpens the spatial precision of the WSSS output.
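The aggregation step can be illustrated with a small PyTorch sketch. This is a minimal approximation under stated assumptions, not the paper's exact design: per-category saliency weights are predicted here from a pooled image feature by a single linear layer (`weight_head`, a hypothetical name), and the saliency map is formed as a weighted sum of the softmax segmentation masks.

```python
import torch
import torch.nn as nn

class SaliencyAggregationSketch(nn.Module):
    """Minimal sketch of saliency aggregation: predict a saliency weight per
    category and blend the segmentation masks into one class-agnostic
    saliency map. Layer sizes and the weighting scheme are assumptions."""

    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        # One saliency weight per category, predicted from image features.
        self.weight_head = nn.Linear(feat_dim, num_classes)

    def forward(self, seg_logits, image_feat):
        # seg_logits: (B, C, H, W) scores from the segmentation network (SN)
        # image_feat: (B, feat_dim) pooled image representation
        masks = torch.softmax(seg_logits, dim=1)               # per-class masks
        weights = torch.sigmoid(self.weight_head(image_feat))  # (B, C) in [0, 1]
        # Weighted sum of the class masks gives one saliency map per image.
        sal_map = (masks * weights[:, :, None, None]).sum(dim=1, keepdim=True)
        return sal_map.clamp(max=1.0)
```

Because the saliency map is built from the segmentation masks, gradients from the pixel-level saliency loss flow back into the segmentation network, which is the mechanism that lets the two tasks improve each other during joint training.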
Experimental results presented in the paper demonstrate that the proposed method achieves competitive performance on standard benchmarks. On PASCAL VOC 2012, SSNet attains higher mIoU than existing WSSS methods, including methods trained with stronger supervision such as bounding boxes and scribbles, and on several saliency benchmark datasets it performs competitively with fully supervised saliency models.
The authors claim that SSNet reduces training cost while outperforming previous methods on both the SD and WSSS tasks. They attribute this to the shared architecture: learning the two tasks jointly conserves computational resources and, by explicitly modeling the relationship between saliency and semantic segmentation, strengthens the segmentation network itself.
This research opens avenues for further exploration into multi-task learning in the field of computer vision. By showing that a unified model optimized end-to-end can train effectively using weak supervision, the authors demonstrate the potential for reducing annotation costs and improving model performance via cross-task synergies. Future work could focus on exploring this joint learning paradigm across other domains, improving network efficiency, and expanding applicability to more semantic classes and real-world scenarios.
Overall, this paper significantly contributes to the ongoing development of more efficient, scalable, and cost-effective computer vision models by highlighting the advantages of multi-task learning using a single network architecture.