- The paper introduces UC-Net, a novel CVAE-based framework that models human annotation uncertainty to produce multiple saliency maps per RGB-D input.
- It integrates a DepthCorrectionNet to refine noisy depth data and a latent-variable module (LatentNet) to capture diverse labeling possibilities efficiently.
- The method achieves superior performance against 18 state-of-the-art algorithms on six challenging datasets, demonstrating robust handling of ambiguous scenarios.
Overview of UC-Net: Uncertainty Inspired RGB-D Saliency Detection
This paper introduces UC-Net, a novel framework that employs uncertainty modeling for RGB-D saliency detection via a conditional variational autoencoder (CVAE). The authors address a limitation of existing methods, which treat saliency detection as a point estimation problem and produce a single saliency map per input image. Instead, UC-Net models human annotation uncertainty by generating multiple saliency maps for each RGB-D input, which are then combined into an accurate consensus-driven saliency prediction. This probabilistic approach aims to capture the inherent subjectivity of human visual perception better than deterministic models can.
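The core mechanism behind "multiple maps per input" is sampling a latent code from a learned Gaussian and decoding each sample into a different saliency map. A minimal pure-Python sketch of that idea follows; the Gaussian parameters and `toy_decoder` are illustrative stand-ins, not the paper's learned networks:

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def toy_decoder(z):
    """Stand-in for the saliency decoder: squashes each latent dim to (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

rng = random.Random(0)
mu, sigma = [0.2, -0.5, 1.0], [0.3, 0.3, 0.3]

# Each draw of z yields a different "saliency map" for the same input,
# which is how a CVAE models annotation uncertainty.
maps = [toy_decoder(sample_latent(mu, sigma, rng)) for _ in range(5)]
```

In UC-Net the decoder is conditioned on the RGB-D image as well as on z, so the samples are plausible variations around the same scene rather than arbitrary outputs.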
Methodology
UC-Net uses a CVAE-based architecture to sample from a distribution over plausible saliency maps rather than producing a single deterministic output. Several components are integral to the framework:
- LatentNet: Comprised of a PriorNet and PosteriorNet, this module maps RGB-D inputs and ground-truth annotations into a latent space, capturing diverse labeling possibilities.
- DepthCorrectionNet: This auxiliary network refines raw depth data, tackling noise issues by ensuring depth features align with RGB image cues, using a depth correction strategy guided by semantic-level losses.
- SaliencyNet and PredictionNet: These modules share an encoder-decoder backbone; PredictionNet fuses the stochastic latent features with the deterministic saliency features from SaliencyNet to produce the final saliency predictions.
- Saliency Consensus: At test time, multiple stochastic saliency predictions are combined into a final map via a consensus mechanism that emulates majority voting among human annotators.
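The consensus step above can be sketched as a per-pixel majority vote over thresholded stochastic predictions. This is a simplified stand-in (threshold and voting rule are assumptions, and the paper's actual consensus module may differ in detail):

```python
def saliency_consensus(pred_maps, threshold=0.5):
    """Per-pixel majority vote over multiple stochastic saliency maps.

    pred_maps: list of equally sized 2D lists with values in [0, 1].
    Returns a binary consensus map: a pixel is salient only if more
    than half of the sampled predictions mark it salient.
    """
    n = len(pred_maps)
    rows, cols = len(pred_maps[0]), len(pred_maps[0][0])
    consensus = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            votes = sum(1 for m in pred_maps if m[r][c] >= threshold)
            consensus[r][c] = 1 if votes * 2 > n else 0
    return consensus

# Three 2x2 stochastic predictions for the same input image.
preds = [
    [[0.9, 0.2], [0.6, 0.1]],
    [[0.8, 0.4], [0.3, 0.2]],
    [[0.7, 0.6], [0.9, 0.1]],
]
final = saliency_consensus(preds)  # → [[1, 0], [1, 0]]
```

The strict-majority rule (`votes * 2 > n`) mirrors how a disputed pixel is labeled only when most annotators agree it is salient.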
Results and Evaluation
UC-Net's efficacy was benchmarked across six challenging datasets, where it outperformed 18 competing algorithms, both handcrafted and deep learning-based, on metrics such as S-measure, F-measure, E-measure, and mean absolute error (MAE). Most notably, UC-Net excelled on complex RGB-D images, indicating superior robustness in uncertain scenarios. The architecture showed notable improvements on datasets with high depth variability, leveraging its DepthCorrectionNet effectively.
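Of the metrics listed, MAE is the simplest to state: the mean absolute difference between the predicted saliency map and the ground-truth mask, both with values in [0, 1]. A minimal sketch on toy 2x2 maps:

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground-truth mask."""
    total, count = 0.0, 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            total += abs(p - g)
            count += 1
    return total / count

pred = [[0.9, 0.1], [0.8, 0.2]]
gt   = [[1.0, 0.0], [1.0, 0.0]]
score = mae(pred, gt)  # ≈ 0.15; lower is better
```

Unlike the S-, F-, and E-measures, MAE needs no thresholding of the prediction, which is why it is commonly reported alongside them.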
Implications and Future Work
UC-Net signifies a step towards incorporating probabilistic reasoning into RGB-D saliency detection, challenging standard deterministic pipelines. By acknowledging and modeling uncertainty, UC-Net offers a compelling pathway for applications in areas where saliency predictions are highly subjective or ambiguous. Furthermore, UC-Net can potentially inspire methodological expansions into other domains such as video object segmentation and co-saliency detection.
Future research could extend the UC-Net framework beyond RGB-D sensing, incorporating richer data modalities or integrating with broader contextual understanding models. Leveraging a multimodal VAE (MVAE) could further improve the robustness of saliency detection across varied environments. Moreover, datasets with labels from multiple annotators would let the probabilistic labeling mechanism be trained and evaluated against genuinely diverse ground truth, broadening UC-Net's practical applicability to real-world scenarios.